# Gemma 4 26B MoE GGUF — Quantized by BatiAI

Optimized GGUF quantizations of google/gemma-4-26B-A4B-it for on-device AI on Mac. Quantized directly from official Google weights by BatiAI for BatiFlow, a free, unlimited, on-device AI automation app for Mac that weighs just 5MB.
## Quick Start

```shell
# 24GB Mac — best quality + speed (recommended)
ollama pull batiai/gemma4-26b:iq4

# 24GB Mac — smaller, imatrix-optimized
ollama pull batiai/gemma4-26b:iq3

# 24GB Mac — standard K-quant
ollama pull batiai/gemma4-26b:q3

# 32GB+ Mac — higher quality
ollama pull batiai/gemma4-26b:q4

# 36GB+ Mac — highest quality
ollama pull batiai/gemma4-26b:q6
```
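Once pulled, the model can also be driven through Ollama's local HTTP API (default port 11434). This sketch only builds and prints the request body for the `/api/chat` endpoint; the prompt text is an arbitrary example, and the commented `curl` line shows how the request would actually be sent:

```shell
# Build a chat request for Ollama's /api/chat endpoint.
# The model tag matches the pull commands above; the prompt is just an example.
MODEL="batiai/gemma4-26b:iq4"
BODY=$(cat <<EOF
{
  "model": "$MODEL",
  "messages": [{"role": "user", "content": "Summarize my week in three bullets."}],
  "stream": false
}
EOF
)
echo "$BODY"
# To actually send it (requires a running Ollama server):
# curl -s http://localhost:11434/api/chat -d "$BODY"
```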
## Available Quantizations
| Quant | Type | Size | M4 Pro 48GB | M4 Max 128GB | Recommended For |
|---|---|---|---|---|---|
| IQ4_XS | imatrix 4-bit | 13GB | 58–63 t/s ✅ | 85.8 t/s | 24GB+ Mac, recommended |
| IQ3_M | imatrix 3-bit | 12GB | — | 77.0 t/s | 24GB Mac, slightly smaller |
| Q3_K_M | K-quant 3-bit | 13GB | — | 70.7 t/s | Standard, stable |
| Q4_K_M | K-quant 4-bit | 16GB | — | 74.9 t/s | Higher quality |
| Q6_K | K-quant 6-bit | 21GB | 48–50 t/s | 74.8 t/s | Highest quality |
⚠️ 16GB Mac note: These models load on 16GB Macs but run at ~0.3 tokens/s due to swap. For 16GB Macs, use batiai/gemma4-e4b (5GB, 57 t/s) instead.
## Benchmarks

### M4 Pro 48GB (MacBook Pro) — Consumer Mac
Real-world measurements from an actual user setup:
| Metric | IQ4_XS | Q6_K | Ollama 26B (official) |
|---|---|---|---|
| Token generation | 58–63 t/s | 48–50 t/s | 56 t/s |
| VRAM | 15.1 GB | 23.9 GB | 19.3 GB |
| System memory free | 58% | 40% | 50% |
| Cold start | 1.7s | 5.8s | 3.4s |
| Simple response | 0.4s | 0.5s | 0.5s |
| Coding task | 6.8s | 7.3s | 6.0s |
| Reasoning (thinking) | 4.1s | 5.3s | 4.8s |
| Tool calling | ✅ verified | ✅ verified | ⚠️ untested |
| Korean language | ✅ verified | ✅ verified | ⚠️ untested |
BatiAI IQ4 outperforms Ollama's official 26B on the same hardware — both in raw speed (58–63 vs 56 t/s) and memory efficiency (15.1 vs 19.3 GB VRAM).
### MacBook Pro M4 Max (128GB) — Developer Mac
| Metric | IQ4_XS | IQ3_M | Q3_K_M | Q4_K_M | Q6_K |
|---|---|---|---|---|---|
| Token generation | 85.8 t/s | 77.0 t/s | 70.7 t/s | 74.9 t/s | 74.8 t/s |
| Prompt eval | 114.9 t/s | 250 t/s | 250 t/s | 250 t/s | 164.6 t/s |
| VRAM | 22 GB | 19 GB | 20 GB | 23 GB | 31 GB |
| Korean output | ✅ | ✅ | ✅ | ✅ | ✅ |
| Tool call JSON | ✅ | ✅ | ✅ | ✅ | ✅ |
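The "Tool call JSON" row refers to Ollama's function-calling format. As an illustration, a tools-enabled request looks like the payload below; note the `get_weather` tool is a made-up example for this sketch, not one of BatiFlow's actual tools, and the block only prints the payload rather than sending it:

```shell
# Example tools-enabled /api/chat payload in Ollama's function-calling format.
# The get_weather tool is hypothetical, shown for illustration only.
TOOL_BODY=$(cat <<'EOF'
{
  "model": "batiai/gemma4-26b:iq4",
  "messages": [{"role": "user", "content": "What is the weather in Seoul?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "stream": false
}
EOF
)
echo "$TOOL_BODY"
# Send with: curl -s http://localhost:11434/api/chat -d "$TOOL_BODY"
```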
### Mac mini M4 (16GB) — Real-world test
| Metric | IQ3_M | Q3_K_M |
|---|---|---|
| Token generation | ~0.3 t/s | 0.30 t/s |
| Usable? | ⚠️ Very slow (swap) | ⚠️ Very slow (swap) |
On a 16GB Mac, the 26B MoE model plus macOS exceeds 16GB and the system starts swapping. A Mac with 24GB or more is recommended.
## RAM Requirements
| Your Mac RAM | IQ3 (12GB) | IQ4 (13GB) | Q3 (13GB) | Q4 (16GB) | Q6 (21GB) |
|---|---|---|---|---|---|
| 16GB | ❌ swap | ❌ swap | ❌ swap | ❌ Won't fit | ❌ Won't fit |
| 24GB | ✅ Fast | ✅ Fits | ⚠️ Tight | ❌ Too tight | ❌ No |
| 32GB | ✅ Fast | ✅ Fast | ✅ Fast | ✅ OK | ❌ No |
| 36GB+ | ✅ Fast | ✅ Fast | ✅ Fast | ✅ Fast | ✅ Fits |
| 128GB | 77 t/s | 85.8 t/s | 70.7 t/s | 74.9 t/s | 74.8 t/s |
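The table and the Quick Start recommendations can be folded into a small helper that maps total RAM to a model tag. `pick_quant` is a hypothetical name, and the thresholds simply mirror the table above (16GB Macs are routed to the smaller batiai/gemma4-e4b model, per the note earlier):

```shell
# Pick a recommended model tag from total RAM in GB,
# mirroring the RAM Requirements table above.
pick_quant() {
  local ram=$1
  if   [ "$ram" -lt 24 ]; then echo "batiai/gemma4-e4b"        # 16GB: use the small model
  elif [ "$ram" -lt 32 ]; then echo "batiai/gemma4-26b:iq4"    # 24GB: IQ4_XS
  elif [ "$ram" -lt 36 ]; then echo "batiai/gemma4-26b:q4"     # 32GB: Q4_K_M
  else                         echo "batiai/gemma4-26b:q6"     # 36GB+: Q6_K
  fi
}
pick_quant 24   # -> batiai/gemma4-26b:iq4
pick_quant 48   # -> batiai/gemma4-26b:q6
```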
## Why No Q2? — Benchmark Evidence
We tested Q2_K quantization extensively. It produces broken, unusable output on 26B MoE models — infinite repetition loops. At 2-bit precision, the MoE expert routing weights lose too much information. Q3 (3-bit) is the minimum viable quantization for this model.
## Why BatiAI Quantizations?

| | BatiAI | Third-party (unsloth, etc.) |
|---|---|---|
| Source | Quantized directly from official Google weights | Re-quantized from other GGUF files |
| Compatibility | ✅ Verified on Ollama 0.19–0.20+ | ❌ Known issues with Ollama 0.20+ (#15235) |
| Tested on | Real Mac mini M4 (16GB) + MacBook Pro M4 Max (128GB) | Untested on consumer hardware |
| Tool Calling | ✅ Verified with BatiFlow's 57 tool functions | Often broken on MoE models |
| Korean | ✅ Validated Korean text generation | Not tested |
| imatrix | ✅ IQ3_M built with calibration data | Custom UD-prefixed formats |
## About BatiFlow
BatiFlow is a macOS-native AI desktop automation app — just 5MB, built with Swift.
- Free & Unlimited — On-device AI via Ollama, no API costs
- 100% Private — All data stays on your Mac
- Ultra Lightweight — Native macOS app, only 5MB
- 57 built-in tools — calendar, notes, reminders, files, email, browser, messaging, and more
## Technical Details
- Original Model: google/gemma-4-26B-A4B-it
- Architecture: Gemma 4 Mixture-of-Experts (26B total, 3.8B active per token)
- Modalities: Text (primary). Vision mmproj included — Ollama vision support pending (#15352, #21402)
- Context Window: 128K tokens
- License: Apache 2.0 (same as original)
- Quantized with: llama.cpp (build 400ac8e)
- Quantized by: BatiAI
## How We Quantize

```
Google official weights (BF16, 50.5GB)
        ↓ llama.cpp convert_hf_to_gguf.py
BF16 GGUF (50.5GB)
        ↓ llama-imatrix (calibration data)
Importance Matrix (imatrix.dat)
        ↓ llama-quantize (Q3_K_M, Q4_K_M, Q6_K)
        ↓ llama-quantize --imatrix (IQ3_M)
Quantized GGUF files
        ↓ Tested on real Mac hardware (M4, M4 Max)
Published to Ollama & HuggingFace
```
No third-party intermediaries. Direct from source, verified on real hardware.
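The pipeline can be sketched as the commands below. This is a dry run that only prints each step; the input paths, output names, and calibration file are assumptions for illustration, but the tools (`convert_hf_to_gguf.py`, `llama-imatrix`, `llama-quantize`) are the llama.cpp binaries named above:

```shell
# Dry-run sketch of the quantization pipeline: 'run' echoes instead of executing.
run() { echo "+ $*"; }

# 1. Convert official BF16 weights to GGUF (paths are placeholders)
run python convert_hf_to_gguf.py ./gemma-4-26B-A4B-it --outfile gemma4-bf16.gguf

# 2. Build the importance matrix from calibration text
run llama-imatrix -m gemma4-bf16.gguf -f calibration.txt -o imatrix.dat

# 3. K-quants (no imatrix required)
for q in Q3_K_M Q4_K_M Q6_K; do
  run llama-quantize gemma4-bf16.gguf "gemma4-$q.gguf" "$q"
done

# 4. imatrix quants
run llama-quantize --imatrix imatrix.dat gemma4-bf16.gguf gemma4-iq3_m.gguf IQ3_M
```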
## License
This model is quantized from google/gemma-4-26B-A4B-it and follows the original model's license: Apache 2.0.
BatiAI quantization pipeline is provided under MIT License.