Gemma 4 26B-A4B-it - TurboQuant MLX 8-bit

8-bit weight-quantized MLX version of google/gemma-4-26B-A4B-it with TurboQuant KV-cache quantization. Optimized for Apple Silicon inference via the MLX framework. Only 4B parameters are active per token despite 26B total, making this model significantly more efficient at inference time than its parameter count suggests.

Approximate model size: ~26 GB

Model Specifications

| Property | Value |
| --- | --- |
| Base Model | google/gemma-4-26B-A4B-it |
| Parameters | 26 billion total (4 billion active per token) |
| Architecture | Mixture-of-Experts (MoE) |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | 8-bit (~26 GB) |
| KV-Cache Quantization | TurboQuant |
| Framework | MLX (Apple Silicon) |

Quickstart

```python
from mlx_lm import load, generate

model, tokenizer = load("majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-8bit")

# mlx_lm is text-only; use the mlx_vlm example below for image input.
prompt = "Explain what a Mixture-of-Experts model is in two sentences."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```

For multimodal usage with images:

```python
from mlx_vlm import load, generate

model, processor = load("majentik/gemma-4-26B-A4B-it-TurboQuant-MLX-8bit")

prompt = "What do you see in this image?"
output = generate(model, processor, prompt=prompt, image="path/to/image.jpg", max_tokens=512)
print(output)
```

What is TurboQuant?

TurboQuant (arXiv: 2504.19874) is a KV-cache quantization technique that compresses the key-value cache used during autoregressive generation. Combined with 8-bit weight quantization in MLX, this provides a dual compression strategy: smaller model weights for reduced disk and memory footprint, plus compressed KV cache for efficient long-context generation.
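TurboQuant's actual algorithm is described in the paper linked above. As a generic illustration of what KV-cache quantization does (not TurboQuant itself), here is a minimal sketch of per-channel 8-bit absmax quantization of a cached key tensor, which cuts cache memory 4x versus FP32 at the cost of a small rounding error:

```python
import numpy as np

def quantize_kv(x: np.ndarray):
    # Per-channel scale (last axis) so each channel spans the full int8 range.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero channels
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 128, 64)).astype(np.float32)  # (heads, seq, head_dim)

q, scale = quantize_kv(keys)
recovered = dequantize_kv(q, scale)

print("int8 cache bytes:", q.nbytes)     # 4x smaller than the fp32 cache
print("fp32 cache bytes:", keys.nbytes)
print("max abs error:   ", np.abs(keys - recovered).max())
```

The rounding error is bounded by half a quantization step per channel, which is why 8-bit KV caches typically cost little generation quality.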

KV-Cache Quantization Comparison

| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
| --- | --- | --- | --- | --- |
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |

Memory Estimates (Gemma 4 26B-A4B-it)

| Precision | Approximate Size | MLX Variant |
| --- | --- | --- |
| FP16 (original) | ~52 GB | -- |
| 8-bit quantized | ~26 GB | This model |
| 4-bit quantized | ~14 GB | TurboQuant-MLX-4bit |
| 2-bit quantized | ~7 GB | TurboQuant-MLX-2bit |
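These figures follow from simple arithmetic: parameter count times bits per weight. A back-of-the-envelope sketch (`approx_size_gb` is an illustrative helper, not part of any library):

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    # bits -> bytes -> GB; ignores per-group scale/zero-point overhead,
    # which is why real 4-bit and 2-bit files run slightly larger.
    return n_params * bits_per_weight / 8 / 1e9

total_params = 26e9  # Gemma 4 26B-A4B total parameters

for bits, label in [(16, "FP16"), (8, "8-bit"), (4, "4-bit"), (2, "2-bit")]:
    print(f"{label:>6}: ~{approx_size_gb(total_params, bits):.0f} GB")
```

Note that total parameters, not active parameters, determine disk and memory footprint; the 4B active-per-token figure affects compute, not storage.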

Hardware Requirements

This model requires approximately 26 GB of unified memory. Recommended hardware:

  • Apple M2 Max (32 GB+)
  • Apple M3 Max (48 GB+)
  • Apple M4 Max (48 GB+)
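Beyond the ~26 GB of weights, the KV cache grows linearly with context length, which is where TurboQuant's cache compression pays off. A rough sizing sketch (the layer and head counts below are illustrative placeholders, not the actual Gemma 4 26B-A4B configuration):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt):
    # Factor of 2 covers both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt / 1e9

# Placeholder config values for illustration only.
cfg = dict(n_layers=48, n_kv_heads=8, head_dim=128, seq_len=32_768)

fp16 = kv_cache_gb(**cfg, bytes_per_elt=2)
int8 = kv_cache_gb(**cfg, bytes_per_elt=1)
print(f"FP16  KV cache at 32k tokens: ~{fp16:.1f} GB")
print(f"8-bit KV cache at 32k tokens: ~{int8:.1f} GB")
```

At long context the cache can add several GB on top of the weights, so budget unified memory for weights plus cache, not weights alone.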
