Gemma 4 31B GGUF — Quantized by BatiAI


Optimized GGUF quantizations of google/gemma-4-31B-it for on-device AI on Mac. Built and verified by BatiAI for BatiFlow — free, unlimited, on-device AI automation for Mac.

Quick Start

# 48GB+ Mac — Best speed and quality (recommended)
ollama pull batiai/gemma4-31b:iq4

# 48GB+ Mac — Smaller, still fast
ollama pull batiai/gemma4-31b:iq3

# 48GB+ Mac — Standard 4-bit
ollama pull batiai/gemma4-31b:q4

# 128GB Mac — Highest quality (but slow due to bandwidth)
ollama pull batiai/gemma4-31b:q6
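
Once a tag is pulled, you can chat with it directly. The prompt below is just an illustration; any prompt works:

# Chat with the recommended quant
ollama run batiai/gemma4-31b:iq4 "Write a haiku about unified memory."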

Available Quantizations

| Quant | Type | Size | M4 Pro 48GB | M4 Max 128GB | Recommended For |
|---|---|---|---|---|---|
| IQ4_XS | imatrix 4-bit | 16GB | 13.5 t/s | 22.8 t/s | 48GB+ Mac, best |
| IQ3_M | imatrix 3-bit | 13GB | 12.2 t/s | 20.7 t/s | 48GB+ Mac |
| Q4_K_M | K-quant 4-bit | 17GB | ⚠️ tight | 19.1 t/s | 64GB+ Mac |
| Q6_K | K-quant 6-bit | 23GB | ❌ won't fit | 6.6 t/s | 128GB Mac |

Benchmarks on Real Hardware

M4 Pro 48GB (MacBook Pro, consumer)

| Metric | IQ3_M | IQ4_XS |
|---|---|---|
| Token generation | 12.2 t/s | 13.5 t/s |
| VRAM | ~24 GB | 26.1 GB |
| System memory free | — | 37% |
| Cold start | — | ~40s |
| Simple response | ~1.7s | ~1.5s |
| Coding task | ~39s | ~28s |
| Reasoning (thinking) | ~24s | ~13s |

M4 Max 128GB (MacBook Pro)

| Metric | IQ3_M | IQ4_XS | Q4_K_M | Q6_K |
|---|---|---|---|---|
| Token generation | 20.7 t/s | 22.8 t/s | 19.1 t/s | 6.6 t/s |
| VRAM | 39 GB | 41 GB | 43 GB | 49 GB |
| Korean output | ✅ | ✅ | ✅ | ✅ |
| Tool call JSON | ✅ | ✅ | ✅ | ✅ |

Note: IQ4 is faster than IQ3 on Apple Silicon despite larger file size. 4-bit aligns cleanly with SIMD and has simpler dequantization than 3-bit's packed lookup tables.
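
You can reproduce the token-generation numbers on your own machine: Ollama's --verbose flag prints a timing summary after each response (the prompt here is illustrative):

ollama run --verbose batiai/gemma4-31b:iq4 "Explain quicksort in three sentences."
# the summary includes a line like:  eval rate: 13.5 tokens/s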

RAM Requirements

| Your Mac RAM | IQ3 (13GB) | IQ4 (16GB) | Q4 (17GB) | Q6 (23GB) |
|---|---|---|---|---|
| 16GB | ❌ won't fit | ❌ won't fit | ❌ won't fit | ❌ won't fit |
| 32GB | ❌ swap | ❌ swap | ❌ swap | ❌ won't fit |
| 48GB | ✅ Fits | ✅ Fits | ⚠️ Tight | ❌ won't fit |
| 64GB | ✅ Fast | ✅ Fast | ✅ Fast | ⚠️ Tight |
| 128GB | ✅ 20.7 t/s | ✅ 22.8 t/s | ✅ 19.1 t/s | ✅ 6.6 t/s |
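
Two quick checks before pulling a large tag (output formats vary by macOS and Ollama version):

# Total unified memory in bytes (divide by 2^30 for GB)
sysctl -n hw.memsize

# After loading, see the resident size of the running model
ollama ps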

31B Dense vs 26B MoE — Real Hardware Comparison

Measured on the same M4 Pro 48GB Mac:

| Metric | 31B IQ4 (Dense) | 26B IQ4 (MoE) |
|---|---|---|
| Speed | 13.5 t/s | 58–63 t/s (4x faster) |
| VRAM | 26.1 GB | 15.1 GB |
| System memory free | 37% | 58% |
| Cold start | 40 seconds | 1.7 seconds (23x faster) |
| Simple response | 1.5s | 0.4s |
| Coding task | 28.5s | 6.8s |
| Reasoning | 13.4s | 4.1s |

For most 48GB Mac users: batiai/gemma4-26b:iq4 is the clear winner.

The 26B MoE activates only 3.8B parameters per token, while the 31B Dense model activates all 30.7B. Combined with imatrix quantization, 26B IQ4 is roughly 4x faster with a cleaner memory profile.
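
A rough sketch of where the speed gap comes from, using the parameter counts above (per generated token, a dense model must stream all of its weights, while an MoE streams only the routed experts):

# Approximate weights streamed per generated token
31B Dense: all ~30.7B params read every token
26B MoE:   only ~3.8B routed params read every token (~8x fewer)

The measured gap is ~4x rather than ~8x, likely because attention layers, embeddings, and KV-cache traffic do not shrink with expert sparsity.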

Use 31B only when:

  • You have 64GB+ RAM for comfortable headroom
  • The specific task benefits from dense model reasoning quality
  • Speed is not a primary concern

Why Q6_K is Slow

31B Dense Q6_K is bandwidth-bound on Apple Silicon — even with 128GB RAM, the model can only be read from memory at ~800 GB/s, limiting token generation to ~6 t/s. Use IQ4_XS or Q4_K_M for practical speed.
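
For scale, a back-of-envelope sketch (each generated token streams the full weight file once):

# Pure bandwidth ceiling for a dense model: bandwidth / bytes per token
800 GB/s ÷ 23 GB per token ≈ 35 t/s theoretical maximum

Q6_K's heavier 6-bit dequantization then pushes real throughput far below that ceiling, to the measured 6.6 t/s, consistent with the dequantization note above.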

Why BatiAI?

Unlike third-party re-quantizations (e.g., unsloth), BatiAI models are:

| | BatiAI | Third-party |
|---|---|---|
| Source | Quantized directly from official Google weights | Re-quantized from other GGUFs |
| Compatibility | ✅ Verified on Ollama 0.20+ | ❌ Known issues (#15235) |
| Tested on | Real MacBook Pro M4 Max (128GB) | Untested on consumer hardware |
| Tool Calling | ✅ Verified | Often untested |
| Korean | ✅ Validated | Not tested |
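
You can verify tool calling yourself against a running Ollama instance via its /api/chat endpoint; the function schema below is illustrative, not part of the model:

curl http://localhost:11434/api/chat -d '{
  "model": "batiai/gemma4-31b:iq4",
  "messages": [{"role": "user", "content": "What is the weather in Seoul?"}],
  "stream": false,
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }]
}'
# a correct response contains message.tool_calls with valid JSON arguments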

About BatiFlow

BatiFlow is a macOS-native AI desktop automation app — just 5MB, built with Swift.

  • Free & Unlimited — On-device AI via Ollama
  • 100% Private — All data stays on your Mac
  • 57 built-in tools — calendar, notes, reminders, files, email, browser, messaging

Technical Details

  • Original Model: google/gemma-4-31B-it
  • Architecture: Dense (30.7B params, all active)
  • Modalities: Text (primary). Vision pending Ollama fix (#15352)
  • Context Window: 256K tokens
  • License: Gemma (same as original)
  • Quantized with: llama.cpp (build 400ac8e)
  • Quantized by: BatiAI
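
Note that Ollama loads models with a small default context window rather than the full 256K. To raise it, set num_ctx in a Modelfile (the 32768 value and the derived model name below are illustrative; longer contexts increase memory use):

# Modelfile
FROM batiai/gemma4-31b:iq4
PARAMETER num_ctx 32768

# Build and run the long-context variant
ollama create gemma4-31b-32k -f Modelfile
ollama run gemma4-31b-32k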

License

Quantized from google/gemma-4-31B-it. License: Gemma.

BatiAI quantization pipeline is provided under MIT License.
