# Gemma 4 26B-A4B-it - TurboQuant MLX 8-bit
8-bit weight-quantized MLX version of google/gemma-4-26B-A4B-it with TurboQuant KV-cache quantization. Optimized for Apple Silicon inference via the MLX framework. Only 4B parameters are active per token despite 26B total, making this model significantly more efficient at inference time than its parameter count suggests.
Approximate model size: ~26 GB
## Model Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-26B-A4B-it |
| Parameters | 26 billion total (4 billion active per token) |
| Architecture | Mixture-of-Experts (MoE) (4B active per token) |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | 8-bit (~26 GB) |
| KV-Cache Quantization | TurboQuant |
| Framework | MLX (Apple Silicon) |
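The "A4B" designation reflects sparse Mixture-of-Experts routing: a router picks a small top-k subset of experts per token, so only those experts' weights participate in each forward step. The sketch below illustrates the idea; the expert count and per-expert sizes are hypothetical, chosen only so the totals line up with the 26B/4B figures, and are not the real Gemma configuration:

```python
import numpy as np

def topk_route(router_logits, k):
    """Return the indices of the top-k experts for one token."""
    return np.argsort(router_logits)[-k:]

rng = np.random.default_rng(0)
num_experts, active_k = 16, 2      # hypothetical MoE configuration
params_per_expert = 1.5e9          # hypothetical
shared_params = 1e9                # hypothetical (attention, embeddings, router)

total = shared_params + num_experts * params_per_expert
active = shared_params + active_k * params_per_expert

chosen = topk_route(rng.normal(size=num_experts), active_k)
print(f"experts used this token: {sorted(chosen.tolist())}")
print(f"total params: {total / 1e9:.0f}B, active per token: {active / 1e9:.0f}B")
```

Because only the routed experts run, per-token compute scales with the active parameter count, not the total.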
## Quickstart

```python
from mlx_lm import load, generate

# Text-only generation via mlx-lm
model, tokenizer = load("majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-8bit")
prompt = "Explain the benefits of KV-cache quantization."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```
For multimodal usage with images:

```python
from mlx_vlm import load, generate

model, processor = load("majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-8bit")
prompt = "What do you see in this image?"
output = generate(model, processor, prompt=prompt, image="path/to/image.jpg", max_tokens=512)
print(output)
```
## What is TurboQuant?
TurboQuant (arXiv: 2504.19874) is a KV-cache quantization technique that compresses the key-value cache used during autoregressive generation. Combined with 8-bit weight quantization in MLX, this provides a dual compression strategy: smaller model weights for reduced disk and memory footprint, plus compressed KV cache for efficient long-context generation.
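To give a sense of the memory trade-off, here is a generic symmetric round-to-nearest int8 quantizer applied to a mock key cache. This is only an illustration of KV-cache compression, not TurboQuant's actual algorithm (see the paper for the real scheme); the tensor shapes are arbitrary:

```python
import numpy as np

def quantize_int8(x, axis=-1):
    """Symmetric per-channel int8 quantization: x ≈ q * scale."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on empty channels
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 128, 64)).astype(np.float32)  # (heads, seq_len, head_dim)
q, scale = quantize_int8(keys)
recon = dequantize(q, scale)
err = np.abs(keys - recon).max()
print(f"fp32 bytes: {keys.nbytes}, int8 bytes: {q.nbytes}, max abs error: {err:.4f}")
```

The cache shrinks 4x relative to fp32 (2x relative to fp16), which is what makes long-context generation cheaper; TurboQuant targets lower bit-widths with tighter error bounds than this naive baseline.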
## KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
## Memory Estimates (Gemma 4 26B-A4B-it)
| Precision | Approximate Size | MLX Variant |
|---|---|---|
| FP16 (original) | ~52 GB | -- |
| 8-bit quantized | ~26 GB | This model |
| 4-bit quantized | ~14 GB | TurboQuant-MLX-4bit |
| 2-bit quantized | ~7 GB | TurboQuant-MLX-2bit |
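The table values follow from simple arithmetic on the 26B parameter count (actual files run slightly larger than the raw figure because quantized formats also store per-group scales and typically leave some layers, such as embeddings, unquantized):

```python
def approx_weight_size_gb(num_params, bits_per_weight):
    """Rough weight footprint, ignoring quantization metadata overhead."""
    return num_params * bits_per_weight / 8 / 1e9

for bits, label in [(16, "FP16"), (8, "8-bit"), (4, "4-bit"), (2, "2-bit")]:
    print(f"{label:>6}: ~{approx_weight_size_gb(26e9, bits):.0f} GB")
```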
## Hardware Requirements
This model requires approximately 26 GB of unified memory. Recommended hardware:
- Apple M2 Max (32 GB+)
- Apple M3 Max (48 GB+)
- Apple M4 Max (48 GB+)
## See Also
- google/gemma-4-26B-A4B-it -- Base model
- majentik/gemma-4-26B-A4B-it-TurboQuant -- TurboQuant KV-cache only (transformers)
- majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-4bit -- MLX 4-bit variant
- majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-2bit -- MLX 2-bit variant
- majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-8bit -- RotorQuant MLX 8-bit variant
- TurboQuant Paper (arXiv: 2504.19874)
- MLX Framework