# gpt-oss-20b-tq3
TurboQuant 3-bit quantized version of openai/gpt-oss-20b using TurboQuant-MLX.
## Model Details
- Base Model: openai/gpt-oss-20b
- Quantization: TurboQuant 3-bit (Hadamard rotation + Lloyd-Max codebook)
- Parameters: ~22B
- Size: ~9.3 GB (vs ~44 GB BF16)
- Group Size: 64
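A rough size sanity check from the figures above (a sketch, not an exact accounting): 3-bit codes plus one float16 scale per group of 64 weights give 3.25 effective bits per weight. Tensors kept at higher precision (embeddings, norms) and metadata account for the remaining gap to ~9.3 GB.

```python
# Effective bits per weight: 3-bit code + float16 scale shared by 64 weights.
bits_per_weight = 3 + 16 / 64            # 3.25 bits effective
params = 22e9                            # ~22B parameters (from the model card)
approx_gb = params * bits_per_weight / 8 / 1e9
print(f"{bits_per_weight} bits/weight -> ~{approx_gb:.1f} GB quantized weights")
```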
## Requirements

```bash
# macOS with Apple Silicon (M1/M2/M3/M4/M5)
pip install turboquant-mlx-full
```
## Quick Start

### Generate text

```bash
python -m turboquant_mlx.generate \
  --model ~/path/to/gpt-oss-20b-tq3 \
  --prompt "Why is the sky blue? Explain in simple terms." \
  --max-tokens 200 \
  --temp 0.7
```
### With KV Cache Compression

For longer prompts/contexts, compress the KV cache as well:

```bash
python -m turboquant_mlx.demo_kv \
  --model ~/path/to/gpt-oss-20b-tq3 \
  --prompt "Your long prompt here..." \
  --max-tokens 200 \
  --tq-bits 3
```
## Results
| Configuration | Size | Speed (M4 Max) |
|---|---|---|
| BF16 (original) | ~44 GB | ~55 tok/s |
| TurboQuant 3-bit | ~9.3 GB | ~73 tok/s |
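The table above works out to roughly a 4.7x reduction in memory and a ~1.3x decode speedup (back-of-envelope arithmetic on the card's own numbers):

```python
# Numbers from the results table above.
bf16_gb, tq3_gb = 44.0, 9.3
bf16_tps, tq3_tps = 55.0, 73.0

compression = bf16_gb / tq3_gb   # memory reduction factor
speedup = tq3_tps / bf16_tps     # decode throughput gain on M4 Max
print(f"compression: {compression:.1f}x, speedup: {speedup:.2f}x")
```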
## How It Works

TurboQuant applies three steps:

- Hadamard rotation - a randomized Hadamard transform (random +/- 1 sign flips followed by a Hadamard transform) that decorrelates weights and makes their distribution approximately Gaussian
- Lloyd-Max codebook - an optimal scalar quantizer for the weight distribution, equivalent to 1-D k-means
- Group-wise scaling - one float16 scale per group of 64 weights to preserve dynamic range

Together, these achieve much better quality than standard affine (scale + zero-point) quantization at the same bit-width.
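The pipeline can be sketched on a single toy weight group (NumPy only; the real implementation operates on full weight matrices and stores packed 3-bit codes, which this sketch omits):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(64)              # one weight group (group size 64)

# 1) Randomized Hadamard rotation: random +/-1 sign flips, then an
#    orthonormal Hadamard transform, which decorrelates the weights.
signs = rng.choice([-1.0, 1.0], size=64)
H = np.array([[1.0]])
while H.shape[0] < 64:                   # Sylvester construction
    H = np.block([[H, H], [H, -H]])
H /= np.sqrt(64)
w_rot = H @ (signs * w)

# 2) Per-group scale (stored as float16 in the real format).
scale = np.abs(w_rot).max()
x = w_rot / scale

# 3) Lloyd-Max codebook: 1-D k-means with 2**3 = 8 centroids.
codebook = np.linspace(-1, 1, 8)
for _ in range(50):
    idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
    for k in range(8):
        if (idx == k).any():
            codebook[k] = x[idx == k].mean()

# 4) De-quantize and invert the rotation (H is orthonormal, signs**2 = 1).
w_hat = signs * (H.T @ (codebook[idx] * scale))
print(f"max reconstruction error: {np.abs(w - w_hat).max():.3f}")
```

Because the rotation is orthogonal, quantization error in the rotated domain carries over unamplified to the original weights, which is why the decorrelation step is cheap to undo at de-quantization time.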
## License
Apache 2.0 (same as base model)
## Citation

```bibtex
@article{zandieh2025turboquant,
  title={TurboQuant: A Unified Framework for Extremely Low-Bit Weight and KV Cache Quantization},
  author={Zandieh, Amir and Han, Minsik and Dalca, Andre and Shin, Jungwoo and Wang, Brian and Zhang, Yichao and Bordegoni, Matteo and Tian, Yuan and others},
  year={2025}
}
```