# Gemma 4 26B MoE GGUF — Quantized by BatiAI

Optimized GGUF quantizations of google/gemma-4-26B-A4B-it for on-device AI on Mac. Quantized directly from official Google weights by BatiAI for BatiFlow, a free, unlimited, on-device AI automation app for Mac that weighs just 5MB.
## Quick Start

```shell
# 24GB Mac — best quality + speed (recommended)
ollama pull batiai/gemma4-26b:iq4

# 24GB Mac — smaller, imatrix-optimized
ollama pull batiai/gemma4-26b:iq3

# 24GB Mac — standard K-quant
ollama pull batiai/gemma4-26b:q3

# 32GB+ Mac — higher quality
ollama pull batiai/gemma4-26b:q4

# 36GB+ Mac — highest quality
ollama pull batiai/gemma4-26b:q6
```
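Once pulled, the model can also be driven through Ollama's local HTTP API (default port 11434). This sketch only builds and prints the request body for the `/api/chat` endpoint; the prompt text is an arbitrary example, and the commented `curl` line shows how the request would actually be sent:

```shell
# Build a chat request for Ollama's /api/chat endpoint.
# The model tag matches the pull commands above; the prompt is just an example.
MODEL="batiai/gemma4-26b:iq4"
BODY=$(cat <<EOF
{
  "model": "$MODEL",
  "messages": [{"role": "user", "content": "Summarize my week in three bullets."}],
  "stream": false
}
EOF
)
echo "$BODY"
# To actually send it (requires a running Ollama server):
# curl -s http://localhost:11434/api/chat -d "$BODY"
```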
## Available Quantizations
| Quant | Type | Size | M4 Pro 48GB | M4 Max 128GB | Recommended For |
|---|---|---|---|---|---|
| IQ4_XS | imatrix 4-bit | 13GB | 58–63 t/s ✅ | 85.8 t/s | 24GB+ Mac, recommended |
| IQ3_M | imatrix 3-bit | 12GB | — | 77.0 t/s | 24GB Mac, slightly smaller |
| Q3_K_M | K-quant 3-bit | 13GB | — | 70.7 t/s | Standard, stable |
| Q4_K_M | K-quant 4-bit | 16GB | — | 74.9 t/s | Higher quality |
| Q6_K | K-quant 6-bit | 21GB | 48–50 t/s | 74.8 t/s | Highest quality |
⚠️ 16GB Mac note: These models load on 16GB Macs but run at ~0.3 tokens/s due to swap. For 16GB Macs, use batiai/gemma4-e4b (5GB, 57 t/s) instead.
## Benchmarks

### M4 Pro 48GB (MacBook Pro) — Consumer Mac
Real-world measurements from an actual user setup:
| Metric | IQ4_XS | Q6_K | Ollama 26B (official) |
|---|---|---|---|
| Token generation | 58–63 t/s | 48–50 t/s | 56 t/s |
| VRAM | 15.1 GB | 23.9 GB | 19.3 GB |
| System memory free | 58% | 40% | 50% |
| Cold start | 1.7s | 5.8s | 3.4s |
| Simple response | 0.4s | 0.5s | 0.5s |
| Coding task | 6.8s | 7.3s | 6.0s |
| Reasoning (thinking) | 4.1s | 5.3s | 4.8s |
| Tool calling | ✅ verified | ✅ verified | ⚠️ untested |
| Korean language | ✅ verified | ✅ verified | ⚠️ untested |
BatiAI IQ4 outperforms Ollama's official 26B on the same hardware — both in raw speed (58–63 vs 56 t/s) and memory efficiency (15.1 vs 19.3 GB VRAM).
### MacBook Pro M4 Max (128GB) — Developer Mac
| Metric | IQ4_XS | IQ3_M | Q3_K_M | Q4_K_M | Q6_K |
|---|---|---|---|---|---|
| Token generation | 85.8 t/s | 77.0 t/s | 70.7 t/s | 74.9 t/s | 74.8 t/s |
| Prompt eval | 114.9 t/s | 250 t/s | 250 t/s | 250 t/s | 164.6 t/s |
| VRAM | 22 GB | 19 GB | 20 GB | 23 GB | 31 GB |
| Korean output | ✅ | ✅ | ✅ | ✅ | ✅ |
| Tool call JSON | ✅ | ✅ | ✅ | ✅ | ✅ |
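The "Tool call JSON" row refers to Ollama's function-calling format. As an illustration, a tools-enabled request looks like the payload below; note the `get_weather` tool is a made-up example for this sketch, not one of BatiFlow's actual tools, and the block only prints the payload rather than sending it:

```shell
# Example tools-enabled /api/chat payload in Ollama's function-calling format.
# The get_weather tool is hypothetical, shown for illustration only.
TOOL_BODY=$(cat <<'EOF'
{
  "model": "batiai/gemma4-26b:iq4",
  "messages": [{"role": "user", "content": "What is the weather in Seoul?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "stream": false
}
EOF
)
echo "$TOOL_BODY"
# Send with: curl -s http://localhost:11434/api/chat -d "$TOOL_BODY"
```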
### Mac mini M4 (16GB) — Real-world test
| Metric | IQ3_M | Q3_K_M |
|---|---|---|
| Token generation | ~0.3 t/s | 0.30 t/s |
| Usable? | ⚠️ Very slow (swap) | ⚠️ Very slow (swap) |
On a 16GB Mac, the 26B MoE model plus macOS exceeds 16GB and the system starts swapping. A Mac with 24GB or more is recommended.
## RAM Requirements
| Your Mac RAM | IQ3 (12GB) | IQ4 (13GB) | Q3 (13GB) | Q4 (16GB) | Q6 (21GB) |
|---|---|---|---|---|---|
| 16GB | ❌ swap | ❌ swap | ❌ swap | ❌ Won't fit | ❌ Won't fit |
| 24GB | ✅ Fast | ✅ Fits | ⚠️ Tight | ❌ Too tight | ❌ No |
| 32GB | ✅ Fast | ✅ Fast | ✅ Fast | ✅ OK | ❌ No |
| 36GB+ | ✅ Fast | ✅ Fast | ✅ Fast | ✅ Fast | ✅ Fits |
| 128GB | 77 t/s | 85.8 t/s | 70.7 t/s | 74.9 t/s | 74.8 t/s |
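The table and the Quick Start recommendations can be folded into a small helper that maps total RAM to a model tag. `pick_quant` is a hypothetical name, and the thresholds simply mirror the table above (16GB Macs are routed to the smaller batiai/gemma4-e4b model, per the note earlier):

```shell
# Pick a recommended model tag from total RAM in GB,
# mirroring the RAM Requirements table above.
pick_quant() {
  local ram=$1
  if   [ "$ram" -lt 24 ]; then echo "batiai/gemma4-e4b"        # 16GB: use the small model
  elif [ "$ram" -lt 32 ]; then echo "batiai/gemma4-26b:iq4"    # 24GB: IQ4_XS
  elif [ "$ram" -lt 36 ]; then echo "batiai/gemma4-26b:q4"     # 32GB: Q4_K_M
  else                         echo "batiai/gemma4-26b:q6"     # 36GB+: Q6_K
  fi
}
pick_quant 24   # -> batiai/gemma4-26b:iq4
pick_quant 48   # -> batiai/gemma4-26b:q6
```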
## Why No Q2? — Benchmark Evidence
We tested Q2_K quantization extensively. It produces broken, unusable output on 26B MoE models — infinite repetition loops. At 2-bit precision, the MoE expert routing weights lose too much information. Q3 (3-bit) is the minimum viable quantization for this model.
## Why BatiAI Quantizations?

| | BatiAI | Third-party (unsloth, etc.) |
|---|---|---|
| Source | Quantized directly from official Google weights | Re-quantized from other GGUF files |
| Compatibility | ✅ Verified on Ollama 0.19–0.20+ | ❌ Known issues with Ollama 0.20+ (#15235) |
| Tested on | Real Mac mini M4 (16GB) + MacBook Pro M4 Max (128GB) | Untested on consumer hardware |
| Tool Calling | ✅ Verified with BatiFlow's 57 tool functions | Often broken on MoE models |
| Korean | ✅ Validated Korean text generation | Not tested |
| imatrix | ✅ IQ3_M built with calibration data | Custom UD-prefixed formats |
## About BatiFlow
BatiFlow is a macOS-native AI desktop automation app — just 5MB, built with Swift.
- Free & Unlimited — On-device AI via Ollama, no API costs
- 100% Private — All data stays on your Mac
- Ultra Lightweight — Native macOS app, only 5MB
- 57 built-in tools — calendar, notes, reminders, files, email, browser, messaging, and more
## Technical Details
- Original Model: google/gemma-4-26B-A4B-it
- Architecture: Gemma 4 Mixture-of-Experts (26B total, 3.8B active per token)
- Modalities: Text (primary). Vision mmproj included — Ollama vision support pending (#15352, #21402)
- Context Window: 128K tokens
- License: Apache 2.0 (same as original)
- Quantized with: llama.cpp (build 400ac8e)
- Quantized by: BatiAI
## How We Quantize

```
Google official weights (BF16, 50.5GB)
        ↓ llama.cpp convert_hf_to_gguf.py
BF16 GGUF (50.5GB)
        ↓ llama-imatrix (calibration data)
Importance Matrix (imatrix.dat)
        ↓ llama-quantize (Q3_K_M, Q4_K_M, Q6_K)
        ↓ llama-quantize --imatrix (IQ3_M)
Quantized GGUF files
        ↓ Tested on real Mac hardware (M4, M4 Max)
Published to Ollama & HuggingFace
```
No third-party intermediaries. Direct from source, verified on real hardware.
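The pipeline can be sketched as the commands below. This is a dry run that only prints each step; the input paths, output names, and calibration file are assumptions for illustration, but the tools (`convert_hf_to_gguf.py`, `llama-imatrix`, `llama-quantize`) are the llama.cpp binaries named above:

```shell
# Dry-run sketch of the quantization pipeline: 'run' echoes instead of executing.
run() { echo "+ $*"; }

# 1. Convert official BF16 weights to GGUF (paths are placeholders)
run python convert_hf_to_gguf.py ./gemma-4-26B-A4B-it --outfile gemma4-bf16.gguf

# 2. Build the importance matrix from calibration text
run llama-imatrix -m gemma4-bf16.gguf -f calibration.txt -o imatrix.dat

# 3. K-quants (no imatrix required)
for q in Q3_K_M Q4_K_M Q6_K; do
  run llama-quantize gemma4-bf16.gguf "gemma4-$q.gguf" "$q"
done

# 4. imatrix quants
run llama-quantize --imatrix imatrix.dat gemma4-bf16.gguf gemma4-iq3_m.gguf IQ3_M
```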
## License
This model is quantized from google/gemma-4-26B-A4B-it and follows the original model's license: Apache 2.0.
BatiAI quantization pipeline is provided under MIT License.