# gemma-4-31B-it-oQ8

An oQ8 mixed-precision quantization of google/gemma-4-31b-it using oMLX — a data-driven, sensitivity-aware quantization system for Apple Silicon.

Produces standard MLX safetensors compatible with oMLX, LM Studio, mlx-lm, and any MLX-compatible inference server.

## Key Facts

| Property | Value |
|---|---|
| Base Model | google/gemma-4-31b-it (31B dense, BF16) |
| Quantization | oQ8 — sensitivity-driven mixed-precision |
| Effective bpw | 8.6 |
| Model Size | ~31.4 GB (vs. 58.3 GB BF16) |
| Vision | ✅ Preserved (vision weights kept in fp16) |
| Format | Standard MLX safetensors |
| Quantized with | oMLX v0.3.4+ |
| Hardware | Apple M2 Ultra 128 GB |

## Why oQ8?

oQ is not uniform quantization. Instead of applying the same bit depth to every layer, oQ measures per-layer quantization sensitivity through calibration inference and allocates bits where they matter most. Critical layers (embeddings, LM head, sensitive transformer layers) receive higher precision while the bulk of weights are quantized to the target bit depth.
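The allocation step described above can be sketched as a greedy loop — a hypothetical illustration, not oMLX's actual implementation. The function, layer names, and toy scores below are invented; in practice the scores would come from calibration inference:

```python
# Hypothetical sketch of sensitivity-driven bit allocation (not oMLX's real code).
# Every layer starts at the base bit depth; the most sensitive layers are
# promoted to higher precision as long as the average bpw stays within budget.

def allocate_bits(sensitivities, base_bits=8, promote_bits=16, budget_bpw=8.6):
    """sensitivities: layer name -> score (higher = more quality loss if quantized).
    Assumes equally sized layers, so average bits == bits per weight."""
    n = len(sensitivities)
    bits = {name: base_bits for name in sensitivities}
    # Walk layers from most to least sensitive, promoting while budget allows.
    for name in sorted(sensitivities, key=sensitivities.get, reverse=True):
        if (sum(bits.values()) - bits[name] + promote_bits) / n <= budget_bpw:
            bits[name] = promote_bits
    return bits

# Toy scores: embeddings and the LM head typically measure as most sensitive.
scores = {"embed": 0.95, "lm_head": 0.90}
scores.update({f"block.{i}": 0.1 + 0.01 * i for i in range(14)})
plan = allocate_bits(scores)
print({k: v for k, v in plan.items() if v == 16})  # the promoted layers
```

With this tight toy budget only the single most sensitive layer is promoted; the real system also accounts for layers having very different parameter counts, which changes the arithmetic.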

At 8-bit, this is near-lossless quality at roughly half the size of BF16 — with significantly faster token generation due to reduced memory bandwidth requirements on Apple Silicon.
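The bandwidth argument can be made concrete with back-of-the-envelope math: decoding one token streams essentially all weights from memory once, so generation speed is roughly bounded by memory bandwidth divided by model size. A rough sketch, assuming the M2 Ultra's ~800 GB/s spec and ignoring KV-cache and activation traffic:

```python
# Rough decode ceiling: tg TPS <= memory bandwidth / bytes of weights per token.
bandwidth_gbs = 800.0  # Apple M2 Ultra unified memory bandwidth (spec sheet)

for name, size_gb in [("oQ8", 31.4), ("BF16", 58.3)]:
    print(f"{name}: ceiling ~{bandwidth_gbs / size_gb:.1f} tok/s")
```

The measured 17.5 and 10.3 tok/s land at roughly 70–75% of these ceilings, consistent with a bandwidth-bound decode.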

## Benchmarks

Tested on Apple M2 Ultra (128 GB) with oMLX. Generation length: 128 tokens.

### oQ8 (this model, 31.8 GB)

| Test | TTFT (ms) | TPOT (ms) | pp TPS | tg TPS | E2E (s) | Throughput | Peak Mem |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 5,777 | 57.4 | 177.3 tok/s | 17.5 tok/s | 13.1 | 88.1 tok/s | 31.80 GB |
| pp4096/tg128 | 22,680 | 63.6 | 180.6 tok/s | 15.8 tok/s | 30.8 | 137.3 tok/s | 33.65 GB |
| pp8192/tg128 | 45,656 | 73.2 | 179.4 tok/s | 13.8 tok/s | 55.0 | 151.4 tok/s | 33.96 GB |

#### Continuous Batching (pp1024/tg128)

| Batch | tg TPS | Speedup | pp TPS | pp TPS/req | Avg TTFT (ms) | E2E (s) |
|---|---|---|---|---|---|---|
| 1x (baseline) | 17.5 tok/s | 1.00x | 177.3 tok/s | 177.3 tok/s | 5,777 | 13.1 |
| 2x | 28.9 tok/s | 1.65x | 176.6 tok/s | 88.3 tok/s | 11,402 | 20.5 |
| 4x | 37.0 tok/s | 2.11x | 176.1 tok/s | 44.0 tok/s | 22,529 | 37.1 |

### BF16 reference (58.5 GB)

| Test | TTFT (ms) | TPOT (ms) | pp TPS | tg TPS | E2E (s) | Throughput | Peak Mem |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 3,974 | 98.2 | 257.7 tok/s | 10.3 tok/s | 16.4 | 70.1 tok/s | 58.53 GB |
| pp4096/tg128 | 16,061 | 102.9 | 255.0 tok/s | 9.8 tok/s | 29.1 | 145.0 tok/s | 60.31 GB |
| pp8192/tg128 | 32,348 | 115.2 | 253.2 tok/s | 8.8 tok/s | 47.0 | 177.1 tok/s | 60.63 GB |

#### Continuous Batching (pp1024/tg128)

| Batch | tg TPS | Speedup | pp TPS | pp TPS/req | Avg TTFT (ms) | E2E (s) |
|---|---|---|---|---|---|---|
| 1x (baseline) | 10.3 tok/s | 1.00x | 257.7 tok/s | 257.7 tok/s | 3,974 | 16.4 |
| 2x | 9.1 tok/s | 0.88x | 256.8 tok/s | 128.4 tok/s | 7,713 | 36.1 |
| 4x | 16.7 tok/s | 1.62x | 253.1 tok/s | 63.3 tok/s | 15,215 | 46.8 |

## Summary: oQ8 vs BF16

| Metric | oQ8 | BF16 | Difference |
|---|---|---|---|
| Size | 31.4 GB | 58.3 GB | -46% |
| Token Generation | 17.5 tok/s | 10.3 tok/s | +70% faster |
| 4x Batch Generation | 37.0 tok/s | 16.7 tok/s | +122% faster |
| Prefill | 177 tok/s | 258 tok/s | -31% (dequantization overhead) |
| Peak Memory | 31.8 GB | 58.5 GB | -46% |

oQ8 is near-lossless at roughly half the memory, with significantly faster token generation. Prefill is about 31% slower because weights must be dequantized on the fly, but for interactive use the generation speed is what matters.
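The deltas in the summary follow directly from the benchmark numbers; a quick sanity check:

```python
# Verify the summary percentages from the pp1024/tg128 benchmark figures.
def pct(new, old):
    return (new / old - 1) * 100

print(f"size:    {pct(31.4, 58.3):+.0f}%")    # file size, oQ8 vs BF16
print(f"tg:      {pct(17.5, 10.3):+.0f}%")    # single-stream generation
print(f"4x tg:   {pct(37.0, 16.7):+.0f}%")    # 4x batched generation
print(f"prefill: {pct(177.3, 257.7):+.0f}%")  # prompt processing
```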

## Usage

### oMLX

Drop the model folder into your oMLX models directory. Auto-detected on server start.

### mlx-lm

```python
from mlx_lm import load, generate

model, tokenizer = load("mpe74/gemma-4-31B-it-oQ8")
messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
```

### LM Studio

Search for the model and download. Works with MLX backend on Apple Silicon.

## Quantization Details

| Parameter | Value |
|---|---|
| oQ Level | oQ8 |
| Base bits | 8 |
| Mode | Affine quantization |
| Group size | 64 |
| Sensitivity model | Source model (google/gemma-4-31b-it, BF16) |
| Calibration data | Built-in oMLX dataset (600 samples: code, multilingual, tool calling, reasoning) |
| Vision weights | Preserved in fp16 |
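Affine quantization with group size 64 stores one scale and one offset per group of 64 weights. A minimal NumPy sketch of the idea — illustrative only; oMLX/MLX kernels pack and compute this differently:

```python
import numpy as np

def quantize_affine(w, bits=8, group_size=64):
    """Quantize each group of `group_size` weights to uint codes with its own
    per-group scale and minimum (the affine zero offset)."""
    w = w.reshape(-1, group_size)
    wmin = w.min(axis=1, keepdims=True)
    wmax = w.max(axis=1, keepdims=True)
    qmax = 2**bits - 1
    scale = (wmax - wmin) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant groups
    q = np.clip(np.round((w - wmin) / scale), 0, qmax).astype(np.uint8)
    return q, scale, wmin

def dequantize_affine(q, scale, wmin):
    return q * scale + wmin

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
q, s, z = quantize_affine(w)
err = np.abs(dequantize_affine(q, s, z) - w.reshape(-1, 64)).max()
print(f"max abs error: {err:.4f}")  # small at 8 bits
```

The per-group scale and offset are what push the effective bpw above the nominal 8 bits (roughly 8 + 32/64 ≈ 8.5 bpw for fp16 scale/offset pairs at group size 64, before mixed-precision promotion).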

## Bug Note

During quantization, a bug in oMLX was encountered: `Object of type set is not JSON serializable`, caused by `_oq_non_quantizable` (a Python set) not being removed from the output config before JSON serialization. Fix: add `"_oq_non_quantizable"` to the cleanup list in `omlx/oq.py` (around line 1300). The issue will be reported upstream.
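The failure mode and workaround can be reproduced in isolation. This paraphrases the bug; only `_oq_non_quantizable` comes from the report above, the other config keys are invented:

```python
import json

# A config dict carrying internal bookkeeping as a Python set.
config = {"quant": "oQ8", "_oq_non_quantizable": {"vision_tower", "lm_head"}}

try:
    json.dumps(config)          # sets are not JSON serializable
except TypeError as e:
    print(e)                    # Object of type set is not JSON serializable

config.pop("_oq_non_quantizable", None)  # strip internal keys before writing
serialized = json.dumps(config)
print(serialized)               # {"quant": "oQ8"}
```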


Quantized by mpe74 using oMLX on Apple M2 Ultra (128 GB).
