Gemma 4 31B GGUF — Quantized by BatiAI


Optimized GGUF quantizations of google/gemma-4-31B-it for on-device AI on Mac. Built and verified by BatiAI for BatiFlow — free, unlimited, on-device AI automation for Mac.

Quick Start

# 48GB+ Mac — Best speed and quality (recommended)
ollama pull batiai/gemma4-31b:iq4

# 48GB+ Mac — Smaller, still fast
ollama pull batiai/gemma4-31b:iq3

# 48GB+ Mac — Standard 4-bit
ollama pull batiai/gemma4-31b:q4

# 128GB Mac — Highest quality (but slow due to bandwidth)
ollama pull batiai/gemma4-31b:q6
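
Once a tag is pulled, you can chat with it directly. The prompt below is just an illustration; any prompt works:

# Chat with the recommended quant
ollama run batiai/gemma4-31b:iq4 "Write a haiku about unified memory."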

Available Quantizations

| Quant | Type | Size | M4 Pro 48GB | M4 Max 128GB | Recommended For |
|---|---|---|---|---|---|
| IQ4_XS | imatrix 4-bit | 16GB | 13.5 t/s | 22.8 t/s | 48GB+ Mac, best |
| IQ3_M | imatrix 3-bit | 13GB | 12.2 t/s | 20.7 t/s | 48GB+ Mac |
| Q4_K_M | K-quant 4-bit | 17GB | ⚠️ tight | 19.1 t/s | 64GB+ Mac |
| Q6_K | K-quant 6-bit | 23GB | ❌ won't fit | 6.6 t/s | 128GB Mac |

Benchmarks on Real Hardware

M4 Pro 48GB (MacBook Pro, consumer)

| Metric | IQ3_M | IQ4_XS |
|---|---|---|
| Token generation | 12.2 t/s | 13.5 t/s |
| VRAM | ~24 GB | 26.1 GB |
| System memory free | — | 37% |
| Cold start | — | ~40s |
| Simple response | ~1.7s | ~1.5s |
| Coding task | ~39s | ~28s |
| Reasoning (thinking) | ~24s | ~13s |

M4 Max 128GB (MacBook Pro)

| Metric | IQ3_M | IQ4_XS | Q4_K_M | Q6_K |
|---|---|---|---|---|
| Token generation | 20.7 t/s | 22.8 t/s | 19.1 t/s | 6.6 t/s |
| VRAM | 39 GB | 41 GB | 43 GB | 49 GB |
| Korean output | ✅ | ✅ | ✅ | ✅ |
| Tool call JSON | ✅ | ✅ | ✅ | ✅ |

Note: IQ4 is faster than IQ3 on Apple Silicon despite larger file size. 4-bit aligns cleanly with SIMD and has simpler dequantization than 3-bit's packed lookup tables.
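
You can reproduce the token-generation numbers on your own machine: Ollama's --verbose flag prints a timing summary after each response (the prompt here is illustrative):

ollama run --verbose batiai/gemma4-31b:iq4 "Explain quicksort in three sentences."
# the summary includes a line like:  eval rate: 13.5 tokens/s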

RAM Requirements

| Your Mac RAM | IQ3 (13GB) | IQ4 (16GB) | Q4 (17GB) | Q6 (23GB) |
|---|---|---|---|---|
| 16GB | ❌ won't fit | ❌ won't fit | ❌ won't fit | ❌ won't fit |
| 32GB | ❌ swap | ❌ swap | ❌ swap | ❌ won't fit |
| 48GB | ✅ Fits | ✅ Fits | ⚠️ Tight | ❌ won't fit |
| 64GB | ✅ Fast | ✅ Fast | ✅ Fast | ⚠️ Tight |
| 128GB | ✅ 20.7 t/s | ✅ 22.8 t/s | ✅ 19.1 t/s | ✅ 6.6 t/s |
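
Two quick checks before pulling a large tag (output formats vary by macOS and Ollama version):

# Total unified memory in bytes (divide by 2^30 for GB)
sysctl -n hw.memsize

# After loading, see the resident size of the running model
ollama ps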

31B Dense vs 26B MoE — Real Hardware Comparison

Measured on the same M4 Pro 48GB Mac:

| Metric | 31B IQ4 (Dense) | 26B IQ4 (MoE) |
|---|---|---|
| Speed | 13.5 t/s | 58–63 t/s (4x faster) |
| VRAM | 26.1 GB | 15.1 GB |
| System memory free | 37% | 58% |
| Cold start | 40 seconds | 1.7 seconds (23x faster) |
| Simple response | 1.5s | 0.4s |
| Coding task | 28.5s | 6.8s |
| Reasoning | 13.4s | 4.1s |

For most 48GB Mac users: batiai/gemma4-26b:iq4 is the clear winner.

The 26B MoE activates only 3.8B parameters per token, while the 31B Dense model activates all 30.7B. Combined with imatrix quantization, 26B IQ4 is roughly 4x faster with a cleaner memory profile.
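
A rough sketch of where the speed gap comes from, using the parameter counts above (per generated token, a dense model must stream all of its weights, while an MoE streams only the routed experts):

# Approximate weights streamed per generated token
31B Dense: all ~30.7B params read every token
26B MoE:   only ~3.8B routed params read every token (~8x fewer)

The measured gap is ~4x rather than ~8x, likely because attention layers, embeddings, and KV-cache traffic do not shrink with expert sparsity.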

Use 31B only when:

  • You have 64GB+ RAM for comfortable headroom
  • The specific task benefits from dense model reasoning quality
  • Speed is not a primary concern

Why Q6_K is Slow

31B Dense Q6_K is bandwidth-bound on Apple Silicon — even with 128GB RAM, the model can only be read from memory at ~800 GB/s, limiting token generation to ~6 t/s. Use IQ4_XS or Q4_K_M for practical speed.
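
For scale, a back-of-envelope sketch (each generated token streams the full weight file once):

# Pure bandwidth ceiling for a dense model: bandwidth / bytes per token
800 GB/s ÷ 23 GB per token ≈ 35 t/s theoretical maximum

Q6_K's heavier 6-bit dequantization then pushes real throughput far below that ceiling, to the measured 6.6 t/s, consistent with the dequantization note above.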

Why BatiAI?

Unlike third-party re-quantizations (e.g., unsloth), BatiAI models are:

| | BatiAI | Third-party |
|---|---|---|
| Source | Quantized directly from official Google weights | Re-quantized from other GGUFs |
| Compatibility | ✅ Verified on Ollama 0.20+ | ❌ Known issues (#15235) |
| Tested on | Real MacBook Pro M4 Max (128GB) | Untested on consumer hardware |
| Tool Calling | ✅ Verified | Often untested |
| Korean | ✅ Validated | Not tested |
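
You can verify tool calling yourself against a running Ollama instance via its /api/chat endpoint; the function schema below is illustrative, not part of the model:

curl http://localhost:11434/api/chat -d '{
  "model": "batiai/gemma4-31b:iq4",
  "messages": [{"role": "user", "content": "What is the weather in Seoul?"}],
  "stream": false,
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }]
}'
# a correct response contains message.tool_calls with valid JSON arguments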

About BatiFlow

BatiFlow is a macOS-native AI desktop automation app — just 5MB, built with Swift.

  • Free & Unlimited — On-device AI via Ollama
  • 100% Private — All data stays on your Mac
  • 57 built-in tools — calendar, notes, reminders, files, email, browser, messaging

Technical Details

  • Original Model: google/gemma-4-31B-it
  • Architecture: Dense (30.7B params, all active)
  • Modalities: Text (primary). Vision pending Ollama fix (#15352)
  • Context Window: 256K tokens
  • License: Gemma (same as original)
  • Quantized with: llama.cpp (build 400ac8e)
  • Quantized by: BatiAI
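
Note that Ollama loads models with a small default context window rather than the full 256K. To raise it, set num_ctx in a Modelfile (the 32768 value and the derived model name below are illustrative; longer contexts increase memory use):

# Modelfile
FROM batiai/gemma4-31b:iq4
PARAMETER num_ctx 32768

# Build and run the long-context variant
ollama create gemma4-31b-32k -f Modelfile
ollama run gemma4-31b-32k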

License

Quantized from google/gemma-4-31B-it. License: Gemma.

BatiAI quantization pipeline is provided under MIT License.
