gpt-oss-20b-tq3

A TurboQuant 3-bit quantized version of openai/gpt-oss-20b, produced with TurboQuant-MLX.

Model Details

  • Base Model: openai/gpt-oss-20b
  • Quantization: TurboQuant 3-bit (Hadamard rotation + Lloyd-Max codebook)
  • Parameters: ~22B
  • Size: ~9.3 GB (vs ~44 GB BF16)
  • Group Size: 64
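The size figure can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, assuming roughly 21 billion quantized parameters, 3-bit codes, and one float16 scale per group of 64 values (the exact parameter split is an assumption, not from the model card; embeddings and other layers kept at higher precision explain the remaining gap):

```python
# Back-of-envelope size estimate for 3-bit group-wise quantization.
# Assumed figures: ~21e9 quantized params, one float16 scale per 64 values.
params = 21e9
bits_per_weight = 3
group_size = 64
scale_bits = 16  # one float16 scale per group

weight_bytes = params * bits_per_weight / 8
scale_bytes = params / group_size * scale_bits / 8
total_gb = (weight_bytes + scale_bytes) / 1e9

bf16_gb = params * 2 / 1e9  # 2 bytes per parameter in BF16

print(f"quantized: ~{total_gb:.1f} GB, bf16: ~{bf16_gb:.1f} GB")
```

This lands near the reported ~9.3 GB and ~44 GB once higher-precision layers are accounted for.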

Requirements

# macOS with Apple Silicon (M1/M2/M3/M4/M5)
pip install turboquant-mlx-full

Quick Start

Generate text

python -m turboquant_mlx.generate \
    --model ~/path/to/gpt-oss-20b-tq3 \
    --prompt "Why is the sky blue? Explain in simple terms." \
    --max-tokens 200 \
    --temp 0.7

With KV Cache Compression

For longer prompts/context, use KV cache compression:

python -m turboquant_mlx.demo_kv \
    --model ~/path/to/gpt-oss-20b-tq3 \
    --prompt "Your long prompt here..." \
    --max-tokens 200 \
    --tq-bits 3
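To illustrate what `--tq-bits 3` implies, here is a minimal round-trip of group-wise 3-bit quantization applied to a toy key tensor. This is a sketch of the general idea only, using plain symmetric uniform quantization; it is not the library's actual codepath (TurboQuant itself uses a rotation plus a Lloyd-Max codebook, as described below under "How It Works"):

```python
import numpy as np

def kv_quantize(x, bits=3, group=64):
    # Group-wise symmetric uniform quantization of a KV tensor.
    # Illustrative only; TurboQuant uses a Lloyd-Max codebook instead.
    qmax = 2 ** (bits - 1) - 1  # 3 positive levels at 3 bits
    g = x.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(g / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float16)

def kv_dequantize(q, scale, shape):
    # Reconstruct an approximation of the original tensor.
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(shape)

rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 128)).astype(np.float32)  # 8 toy key vectors
q, s = kv_quantize(keys)
approx = kv_dequantize(q, s, keys.shape)
err = np.linalg.norm(approx - keys) / np.linalg.norm(keys)
print(f"relative error at 3 bits: {err:.3f}")
```

Storing `int8` codes (packed to 3 bits in practice) plus one float16 scale per group is what shrinks the cache footprint for long contexts.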

Results

Configuration       Size      Speed (M4 Max)
BF16 (original)     ~44 GB    ~55 tok/s
TurboQuant 3-bit    ~9.3 GB   ~73 tok/s

How It Works

TurboQuant applies:

  1. Hadamard rotation - random ±1 sign flips followed by a Hadamard transform, which decorrelates weights and smooths out outliers
  2. Lloyd-Max codebook - an MSE-optimal scalar quantizer, fitted with 1-D k-means
  3. Group-wise scaling - one float16 scale per group of 64 values to preserve dynamic range

This achieves much better quality than standard affine quantization at the same bit-width.
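The three steps above can be sketched end to end with NumPy. This is a toy illustration under stated assumptions (Sylvester-construction Hadamard matrix, a shared 8-level codebook for 3 bits, per-group RMS scales), not the library's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def hadamard(n):
    # Sylvester construction; n must be a power of two. H is orthonormal.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_codebook(x, levels=8, iters=30):
    # 1-D k-means (Lloyd's algorithm); 8 levels = 3 bits per weight.
    c = np.quantile(x, np.linspace(0.05, 0.95, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()
    return np.sort(c)

def quantize_group(g, codebook):
    # Normalize by the group's RMS, snap to the nearest codeword,
    # and keep the float16 scale for dequantization.
    scale = np.float16(np.sqrt(np.mean(g**2)) + 1e-12)
    idx = np.abs(g / float(scale) - codebook[:, None]).argmin(axis=0)
    return idx.astype(np.uint8), scale

group = 64
w = rng.standard_normal(group * 4)                 # toy weight row
signs = rng.choice([-1.0, 1.0], size=group)        # random sign flips
rotated = (w.reshape(-1, group) * signs) @ hadamard(group).T
cb = lloyd_max_codebook(rotated.ravel())
blocks = [quantize_group(g, cb) for g in rotated]
recon = np.stack([cb[idx] * float(s) for idx, s in blocks])
err = np.linalg.norm(recon - rotated) / np.linalg.norm(rotated)
print(f"relative reconstruction error: {err:.3f}")
```

Because the rotation makes the weight distribution close to Gaussian, a single Lloyd-Max codebook fitted to that distribution beats a per-tensor affine grid at the same bit-width.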

License

Apache 2.0 (same as base model)

Citation

@article{zandieh2025turboquant,
  title={TurboQuant: A Unified Framework for Extremely Low-Bit Weight and KV Cache Quantization},
  author={Zandieh, Amir and Han, Minsik and Dalca, Andre and Shin, Jungwoo and Wang, Brian and Zhang, Yichao and Bordegoni, Matteo and Tian, Yuan and others},
  year={2025}
}