# gemma-4-31B-it-oQ8
An oQ8 mixed-precision quantization of google/gemma-4-31b-it using oMLX — a data-driven, sensitivity-aware quantization system for Apple Silicon.
Produces standard MLX safetensors compatible with oMLX, LM Studio, mlx-lm, and any MLX-compatible inference server.
## Key Facts
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31b-it (31B dense, BF16) |
| Quantization | oQ8 — sensitivity-driven mixed-precision |
| Effective bpw | 8.6 |
| Model Size | ~31.4 GB (vs. 58.3 GB BF16) |
| Vision | ✅ Preserved (vision weights kept in fp16) |
| Format | Standard MLX safetensors |
| Quantized with | oMLX v0.3.4+ |
| Hardware | Apple M2 Ultra 128 GB |
## Why oQ8?
oQ is not uniform quantization. Instead of applying the same bit depth to every layer, oQ measures per-layer quantization sensitivity through calibration inference and allocates bits where they matter most. Critical layers (embeddings, LM head, sensitive transformer layers) receive higher precision while the bulk of weights are quantized to the target bit depth.
At 8-bit, this is near-lossless quality at roughly half the size of BF16 — with significantly faster token generation due to reduced memory bandwidth requirements on Apple Silicon.
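For intuition, the allocation step can be sketched as follows. This is a hypothetical illustration of the general idea, not oMLX's code; the layer names, scores, and thresholding policy are made up.

```python
# Hypothetical sketch of sensitivity-aware bit allocation (not oMLX's actual code).
def allocate_bits(sensitivity, base_bits=8, high_bits=16, keep_fraction=0.25):
    """Give the most sensitive fraction of layers higher precision."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    n_high = max(1, int(len(sensitivity) * keep_fraction))
    protected = set(ranked[:n_high])
    return {name: (high_bits if name in protected else base_bits)
            for name in sensitivity}

# Example scores: one possible proxy is the calibration-loss increase when a
# layer is quantized in isolation. Values here are invented for illustration.
sensitivity = {
    "embed_tokens": 0.91,
    "layers.0.self_attn": 0.08,
    "layers.0.mlp": 0.12,
    "lm_head": 0.87,
}
print(allocate_bits(sensitivity, keep_fraction=0.5))
# {'embed_tokens': 16, 'layers.0.self_attn': 8, 'layers.0.mlp': 8, 'lm_head': 16}
```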
## Benchmarks
Tested on Apple M2 Ultra (128 GB) with oMLX. Generation length: 128 tokens.
### oQ8 (this model, 31.8 GB)
| Test | TTFT (ms) | TPOT (ms) | pp TPS | tg TPS | E2E (s) | Throughput | Peak Mem |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 5,777 | 57.4 | 177.3 tok/s | 17.5 tok/s | 13.1s | 88.1 tok/s | 31.80 GB |
| pp4096/tg128 | 22,680 | 63.6 | 180.6 tok/s | 15.8 tok/s | 30.8s | 137.3 tok/s | 33.65 GB |
| pp8192/tg128 | 45,656 | 73.2 | 179.4 tok/s | 13.8 tok/s | 55.0s | 151.4 tok/s | 33.96 GB |
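The columns appear to follow the usual definitions: E2E ≈ TTFT + tg × TPOT, and Throughput ≈ (pp + tg) / E2E. For pp1024/tg128: 5.777 s + 128 × 0.0574 s ≈ 13.1 s, and (1024 + 128) / 13.1 s ≈ 88 tok/s, matching the table.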
#### Continuous Batching (pp1024/tg128)
| Batch | tg TPS | Speedup | pp TPS | pp TPS/req | Avg TTFT (ms) | E2E (s) |
|---|---|---|---|---|---|---|
| 1x (baseline) | 17.5 tok/s | 1.00x | 177.3 tok/s | 177.3 tok/s | 5,777 | 13.1 |
| 2x | 28.9 tok/s | 1.65x | 176.6 tok/s | 88.3 tok/s | 11,402 | 20.5 |
| 4x | 37.0 tok/s | 2.11x | 176.1 tok/s | 44.0 tok/s | 22,529 | 37.1 |
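Here Speedup appears to be aggregate tg TPS relative to the single-stream baseline (37.0 / 17.5 ≈ 2.11x at 4x), and pp TPS/req is aggregate pp TPS divided by the batch size (176.1 / 4 ≈ 44.0 tok/s).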
### BF16 reference (58.5 GB)
| Test | TTFT (ms) | TPOT (ms) | pp TPS | tg TPS | E2E (s) | Throughput | Peak Mem |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 3,974 | 98.2 | 257.7 tok/s | 10.3 tok/s | 16.4s | 70.1 tok/s | 58.53 GB |
| pp4096/tg128 | 16,061 | 102.9 | 255.0 tok/s | 9.8 tok/s | 29.1s | 145.0 tok/s | 60.31 GB |
| pp8192/tg128 | 32,348 | 115.2 | 253.2 tok/s | 8.8 tok/s | 47.0s | 177.1 tok/s | 60.63 GB |
#### Continuous Batching (pp1024/tg128)
| Batch | tg TPS | Speedup | pp TPS | pp TPS/req | Avg TTFT (ms) | E2E (s) |
|---|---|---|---|---|---|---|
| 1x (baseline) | 10.3 tok/s | 1.00x | 257.7 tok/s | 257.7 tok/s | 3,974 | 16.4 |
| 2x | 9.1 tok/s | 0.88x | 256.8 tok/s | 128.4 tok/s | 7,713 | 36.1 |
| 4x | 16.7 tok/s | 1.62x | 253.1 tok/s | 63.3 tok/s | 15,215 | 46.8 |
## Summary: oQ8 vs BF16
| Metric | oQ8 | BF16 | Difference |
|---|---|---|---|
| Size | 31.4 GB | 58.3 GB | -46% |
| Token Generation | 17.5 tok/s | 10.3 tok/s | +70% faster |
| 4x Batch Generation | 37.0 tok/s | 16.7 tok/s | +122% faster |
| Prefill | 177 tok/s | 258 tok/s | -31% (dequantization overhead) |
| Peak Memory | 31.8 GB | 58.5 GB | -46% |
oQ8 is near-lossless at roughly half the memory, with significantly faster token generation. Prefill is about 31% slower due to dequantization overhead, but for interactive use the generation speed is what matters.
## Usage
### oMLX
Drop the model folder into your oMLX models directory. Auto-detected on server start.
### mlx-lm
```python
from mlx_lm import load, generate

model, tokenizer = load("mpe74/gemma-4-31B-it-oQ8")

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
```
### LM Studio
Search for the model and download. Works with MLX backend on Apple Silicon.
## Quantization Details
| Parameter | Value |
|---|---|
| oQ Level | oQ8 |
| Base bits | 8 |
| Mode | Affine quantization |
| Group size | 64 |
| Sensitivity model | Source model (google/gemma-4-31b-it BF16) |
| Calibration data | Built-in oMLX dataset (600 samples: code, multilingual, tool calling, reasoning) |
| Vision weights | Preserved in fp16 |
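For reference, a minimal numpy sketch of what affine (asymmetric) quantization with group size 64 at 8 bits looks like in general. This illustrates the scheme named above; it is not oMLX's actual implementation, which also handles packing, sensitivity overrides, and fp16-preserved layers.

```python
import numpy as np

def affine_quantize(w, group_size=64, bits=8):
    """Each group of `group_size` weights gets its own scale and bias:
    q = round((w - bias) / scale), stored as unsigned integers."""
    qmax = 2 ** bits - 1
    groups = w.reshape(-1, group_size)            # one row per group
    lo = groups.min(axis=1, keepdims=True)        # per-group bias
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / qmax, 1e-8)    # avoid divide-by-zero
    q = np.clip(np.round((groups - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, bias):
    return q.astype(np.float32) * scale + bias

w = np.random.randn(256, 256).astype(np.float32)
q, scale, bias = affine_quantize(w)
err = np.abs(affine_dequantize(q, scale, bias).reshape(w.shape) - w).max()
print(f"max abs reconstruction error: {err:.5f}")  # small at 8 bits
```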
## Bug Note
During quantization, a bug in oMLX was encountered: `Object of type set is not JSON serializable`, caused by `_oq_non_quantizable` (a Python set) not being removed from the output config before JSON serialization. Fix: add `"_oq_non_quantizable"` to the cleanup list in `omlx/oq.py` (around line 1300). The issue will be reported upstream.
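The fix follows the standard pattern of stripping internal, non-serializable keys before writing the config; a rough sketch of that pattern (not the actual `omlx/oq.py` code):

```python
import json

# Internal bookkeeping keys that must not reach the serialized config.
# Adding the set-valued "_oq_non_quantizable" here resolves the error.
INTERNAL_KEYS = ("_oq_non_quantizable",)

def write_config(config: dict, path: str) -> None:
    cleaned = {k: v for k, v in config.items() if k not in INTERNAL_KEYS}
    with open(path, "w") as f:
        json.dump(cleaned, f, indent=2)
```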
Quantized by mpe74 using oMLX on Apple M2 Ultra (128 GB).