# Gemma 4 26B-A4B-it - TurboQuant MLX 8-bit
8-bit weight-quantized MLX version of google/gemma-4-26B-A4B-it with TurboQuant KV-cache quantization. Optimized for Apple Silicon inference via the MLX framework. Only 4B parameters are active per token despite 26B total, making this model significantly more efficient at inference time than its parameter count suggests.
Approximate model size: ~26 GB
## Model Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-26B-A4B-it |
| Parameters | 26 billion total (4 billion active per token) |
| Architecture | Mixture-of-Experts (MoE) (4B active per token) |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | 8-bit (~26 GB) |
| KV-Cache Quantization | TurboQuant |
| Framework | MLX (Apple Silicon) |
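The "A4B" designation reflects sparse Mixture-of-Experts routing: a router picks a small top-k subset of experts per token, so only those experts' weights participate in each forward step. The sketch below illustrates the idea; the expert count and per-expert sizes are hypothetical, chosen only so the totals line up with the 26B/4B figures, and are not the real Gemma configuration:

```python
import numpy as np

def topk_route(router_logits, k):
    """Return the indices of the top-k experts for one token."""
    return np.argsort(router_logits)[-k:]

rng = np.random.default_rng(0)
num_experts, active_k = 16, 2      # hypothetical MoE configuration
params_per_expert = 1.5e9          # hypothetical
shared_params = 1e9                # hypothetical (attention, embeddings, router)

total = shared_params + num_experts * params_per_expert
active = shared_params + active_k * params_per_expert

chosen = topk_route(rng.normal(size=num_experts), active_k)
print(f"experts used this token: {sorted(chosen.tolist())}")
print(f"total params: {total / 1e9:.0f}B, active per token: {active / 1e9:.0f}B")
```

Because only the routed experts run, per-token compute scales with the active parameter count, not the total.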
## Quickstart

```python
from mlx_lm import load, generate

# Text-only generation via mlx-lm
model, tokenizer = load("majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-8bit")
prompt = "Explain the benefits of KV-cache quantization."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```
For multimodal usage with images:

```python
from mlx_vlm import load, generate

model, processor = load("majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-8bit")
prompt = "What do you see in this image?"
output = generate(model, processor, prompt=prompt, image="path/to/image.jpg", max_tokens=512)
print(output)
```
## What is TurboQuant?
TurboQuant (arXiv: 2504.19874) is a KV-cache quantization technique that compresses the key-value cache used during autoregressive generation. Combined with 8-bit weight quantization in MLX, this provides a dual compression strategy: smaller model weights for reduced disk and memory footprint, plus compressed KV cache for efficient long-context generation.
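To give a sense of the memory trade-off, here is a generic symmetric round-to-nearest int8 quantizer applied to a mock key cache. This is only an illustration of KV-cache compression, not TurboQuant's actual algorithm (see the paper for the real scheme); the tensor shapes are arbitrary:

```python
import numpy as np

def quantize_int8(x, axis=-1):
    """Symmetric per-channel int8 quantization: x ≈ q * scale."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on empty channels
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 128, 64)).astype(np.float32)  # (heads, seq_len, head_dim)
q, scale = quantize_int8(keys)
recon = dequantize(q, scale)
err = np.abs(keys - recon).max()
print(f"fp32 bytes: {keys.nbytes}, int8 bytes: {q.nbytes}, max abs error: {err:.4f}")
```

The cache shrinks 4x relative to fp32 (2x relative to fp16), which is what makes long-context generation cheaper; TurboQuant targets lower bit-widths with tighter error bounds than this naive baseline.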
## KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
## Memory Estimates (Gemma 4 26B-A4B-it)
| Precision | Approximate Size | MLX Variant |
|---|---|---|
| FP16 (original) | ~52 GB | -- |
| 8-bit quantized | ~26 GB | This model |
| 4-bit quantized | ~14 GB | TurboQuant-MLX-4bit |
| 2-bit quantized | ~7 GB | TurboQuant-MLX-2bit |
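The table values follow from simple arithmetic on the 26B parameter count (actual files run slightly larger than the raw figure because quantized formats also store per-group scales and typically leave some layers, such as embeddings, unquantized):

```python
def approx_weight_size_gb(num_params, bits_per_weight):
    """Rough weight footprint, ignoring quantization metadata overhead."""
    return num_params * bits_per_weight / 8 / 1e9

for bits, label in [(16, "FP16"), (8, "8-bit"), (4, "4-bit"), (2, "2-bit")]:
    print(f"{label:>6}: ~{approx_weight_size_gb(26e9, bits):.0f} GB")
```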
## Hardware Requirements
This model requires approximately 26 GB of unified memory. Recommended hardware:
- Apple M2 Max (32 GB+)
- Apple M3 Max (48 GB+)
- Apple M4 Max (48 GB+)
## See Also
- google/gemma-4-26B-A4B-it -- Base model
- majentik/gemma-4-26B-A4B-it-TurboQuant -- TurboQuant KV-cache only (transformers)
- majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-4bit -- MLX 4-bit variant
- majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-2bit -- MLX 2-bit variant
- majentik/gemma-4-26B-A4B-it-RotorQuant-MLX-8bit -- RotorQuant MLX 8-bit variant
- TurboQuant Paper (arXiv: 2504.19874)
- MLX Framework