Gemma 4 26B MoE GGUF — Quantized by BatiAI

BatiFlow Ollama

Optimized GGUF quantizations of google/gemma-4-26B-A4B-it for on-device AI on Mac. Quantized directly from official Google weights by BatiAI for BatiFlow, a free, unlimited, on-device AI automation app for Mac that is just 5MB.

Quick Start

```shell
# 24GB Mac — Best quality + speed (recommended)
ollama pull batiai/gemma4-26b:iq4

# 24GB Mac — Smaller, imatrix optimized
ollama pull batiai/gemma4-26b:iq3

# 24GB Mac — Standard
ollama pull batiai/gemma4-26b:q3

# 32GB+ Mac — Higher quality
ollama pull batiai/gemma4-26b:q4

# 36GB+ Mac — Highest quality
ollama pull batiai/gemma4-26b:q6
```
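Once a tag is pulled, it can be run and monitored with standard Ollama commands (a minimal sketch using the tag names above):

```shell
# Start an interactive chat with the recommended quant
ollama run batiai/gemma4-26b:iq4

# Or send a one-shot prompt instead of an interactive session
ollama run batiai/gemma4-26b:iq4 "Write a haiku about on-device AI."

# Check which models are loaded and their memory footprint
ollama ps
```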

Available Quantizations

| Quant | Type | Size | M4 Pro 48GB | M4 Max 128GB | Recommended For |
|---|---|---|---|---|---|
| IQ4_XS | imatrix 4-bit | 13GB | 58–63 t/s | 85.8 t/s | 24GB+ Mac, recommended |
| IQ3_M | imatrix 3-bit | 12GB | | 77.0 t/s | 24GB Mac, slightly smaller |
| Q3_K_M | K-quant 3-bit | 13GB | | 70.7 t/s | Standard, stable |
| Q4_K_M | K-quant 4-bit | 16GB | | 74.9 t/s | Higher quality |
| Q6_K | K-quant 6-bit | 21GB | 48–50 t/s | 74.8 t/s | Highest quality |

⚠️ 16GB Mac note: These models load on 16GB Macs but run at ~0.3 tokens/s due to swapping. For 16GB Macs, use batiai/gemma4-e4b (5GB, 57 t/s) instead.

Benchmarks

M4 Pro 48GB (MacBook Pro) — Consumer Mac

Real-world measurements from an actual user setup:

| Metric | IQ4_XS | Q6_K | Ollama 26B (official) |
|---|---|---|---|
| Token generation | 58–63 t/s | 48–50 t/s | 56 t/s |
| VRAM | 15.1 GB | 23.9 GB | 19.3 GB |
| System memory free | 58% | 40% | 50% |
| Cold start | 1.7s | 5.8s | 3.4s |
| Simple response | 0.4s | 0.5s | 0.5s |
| Coding task | 6.8s | 7.3s | 6.0s |
| Reasoning (thinking) | 4.1s | 5.3s | 4.8s |
| Tool calling | ✅ verified | ✅ verified | ⚠️ untested |
| Korean language | ✅ verified | ✅ verified | ⚠️ untested |

BatiAI's IQ4_XS outperforms Ollama's official 26B on the same hardware, both in raw speed (58–63 vs 56 t/s) and in memory efficiency (15.1 vs 19.3 GB VRAM).

MacBook Pro M4 Max (128GB) — Developer Mac

| Metric | IQ4_XS | IQ3_M | Q3_K_M | Q4_K_M | Q6_K |
|---|---|---|---|---|---|
| Token generation | 85.8 t/s | 77.0 t/s | 70.7 t/s | 74.9 t/s | 74.8 t/s |
| Prompt eval | 114.9 t/s | 250 t/s | 250 t/s | 250 t/s | 164.6 t/s |
| VRAM | 22 GB | 19 GB | 20 GB | 23 GB | 31 GB |
| Korean output | | | | | |
| Tool call JSON | | | | | |

Mac mini M4 (16GB) — Real-world test

| Metric | IQ3_M | Q3_K_M |
|---|---|---|
| Token generation | ~0.3 t/s | 0.30 t/s |
| Usable? | ⚠️ Very slow (swap) | ⚠️ Very slow (swap) |

On a 16GB Mac, the 26B MoE model plus macOS exceeds 16GB, so swapping occurs. We recommend a Mac with 24GB of RAM or more.

RAM Requirements

| Your Mac RAM | IQ3 (12GB) | IQ4 (13GB) | Q3 (13GB) | Q4 (16GB) | Q6 (21GB) |
|---|---|---|---|---|---|
| 16GB | ❌ Swap | ❌ Swap | ❌ Swap | ❌ Won't fit | ❌ Won't fit |
| 24GB | ✅ Fast | ✅ Fits | ⚠️ Tight | ❌ Barely | ❌ No |
| 32GB | ✅ Fast | ✅ Fast | ✅ Fast | ✅ OK | ❌ No |
| 36GB+ | ✅ Fast | ✅ Fast | ✅ Fast | ✅ Fast | ✅ Fits |
| 128GB | 77 t/s | 85.8 t/s | 70.7 t/s | 74.9 t/s | 74.8 t/s |
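The file sizes above follow from simple arithmetic: at b bits per weight, a 26B-parameter model takes roughly 26 × b / 8 GB on disk, before KV cache and runtime overhead. A back-of-envelope sanity check (real quants mix bit-widths per tensor, so actual files run a little larger):

```shell
# Rough GGUF size: params (billions) * bits-per-weight / 8 = size in GB
params_b=26
for bits in 3 4 6; do
  size_gb=$(( params_b * bits / 8 ))
  echo "${bits}-bit: ~${size_gb} GB"
done
```

The 4-bit estimate (~13 GB) matches the IQ4_XS file exactly; the 3-bit and 6-bit quants come out larger than this naive estimate because K-quants keep some tensors at higher precision.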

Why No Q2? — Benchmark Evidence

We tested Q2_K quantization extensively. It produces broken, unusable output on 26B MoE models — infinite repetition loops. At 2-bit precision, the MoE expert routing weights lose too much information. Q3 (3-bit) is the minimum viable quantization for this model.

Why BatiAI Quantizations?

| | BatiAI | Third-party (unsloth, etc.) |
|---|---|---|
| Source | Quantized directly from official Google weights | Re-quantized from other GGUF files |
| Compatibility | ✅ Verified on Ollama 0.19–0.20+ | ❌ Known issues with Ollama 0.20+ (#15235) |
| Tested on | Real Mac mini M4 (16GB) + MacBook Pro M4 Max (128GB) | Untested on consumer hardware |
| Tool Calling | ✅ Verified with BatiFlow's 57 tool functions | Often broken on MoE models |
| Korean | ✅ Validated Korean text generation | Not tested |
| imatrix | ✅ IQ3_M with calibration data | UD- prefix custom format |

About BatiFlow

flow.bati.ai

BatiFlow is a macOS-native AI desktop automation app — just 5MB, built with Swift.

  • Free & Unlimited — On-device AI via Ollama, no API costs
  • 100% Private — All data stays on your Mac
  • Ultra Lightweight — Native macOS app, only 5MB
  • 57 built-in tools — calendar, notes, reminders, files, email, browser, messaging, and more

Download BatiFlow

Technical Details

  • Original Model: google/gemma-4-26B-A4B-it
  • Architecture: Gemma 4 Mixture-of-Experts (26B total, 3.8B active per token)
  • Modalities: Text (primary). Vision mmproj included — Ollama vision support pending (#15352, #21402)
  • Context Window: 128K tokens
  • License: Apache 2.0 (same as original)
  • Quantized with: llama.cpp (build 400ac8e)
  • Quantized by: BatiAI

How We Quantize

Google official weights (BF16, 50.5GB)
  ↓ llama.cpp convert_hf_to_gguf.py
BF16 GGUF (50.5GB)
  ↓ llama-imatrix (calibration data)
Importance Matrix (imatrix.dat)
  ↓ llama-quantize (Q3_K_M, Q4_K_M, Q6_K)
  ↓ llama-quantize --imatrix (IQ3_M)
Quantized GGUF files
  ↓ Tested on real Mac hardware (M4, M4 Max)
Published to Ollama & HuggingFace

No third-party intermediaries. Direct from source, verified on real hardware.
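The pipeline above maps onto llama.cpp commands roughly as follows (a sketch, not our exact invocation; the local paths and calibration file name are illustrative):

```shell
# 1. Convert official BF16 weights to a BF16 GGUF
python convert_hf_to_gguf.py ./gemma-4-26B-A4B-it --outtype bf16 --outfile gemma4-26b-bf16.gguf

# 2. Build an importance matrix from calibration text
llama-imatrix -m gemma4-26b-bf16.gguf -f calibration.txt -o imatrix.dat

# 3. Standard K-quants (no imatrix required)
llama-quantize gemma4-26b-bf16.gguf gemma4-26b-q4_k_m.gguf Q4_K_M

# 4. imatrix-guided quants
llama-quantize --imatrix imatrix.dat gemma4-26b-bf16.gguf gemma4-26b-iq3_m.gguf IQ3_M
```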

License

This model is quantized from google/gemma-4-26B-A4B-it and follows the original model's license: Apache 2.0.

BatiAI quantization pipeline is provided under MIT License.

Downloads last month: 8,889

Format: GGUF · Model size: 25B params · Architecture: gemma4