gpt-oss-20b-tq3

A TurboQuant 3-bit quantized version of openai/gpt-oss-20b, produced with TurboQuant-MLX.

Model Details

  • Base Model: openai/gpt-oss-20b
  • Quantization: TurboQuant 3-bit (Hadamard rotation + Lloyd-Max codebook)
  • Parameters: ~22B
  • Size: ~9.3 GB (vs ~44 GB BF16)
  • Group Size: 64
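The size figure can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, assuming roughly 21 billion quantized parameters, 3-bit codes, and one float16 scale per group of 64 values (the exact parameter split is an assumption, not from the model card; embeddings and other layers kept at higher precision explain the remaining gap):

```python
# Back-of-envelope size estimate for 3-bit group-wise quantization.
# Assumed figures: ~21e9 quantized params, one float16 scale per 64 values.
params = 21e9
bits_per_weight = 3
group_size = 64
scale_bits = 16  # one float16 scale per group

weight_bytes = params * bits_per_weight / 8
scale_bytes = params / group_size * scale_bits / 8
total_gb = (weight_bytes + scale_bytes) / 1e9

bf16_gb = params * 2 / 1e9  # 2 bytes per parameter in BF16

print(f"quantized: ~{total_gb:.1f} GB, bf16: ~{bf16_gb:.1f} GB")
```

This lands near the reported ~9.3 GB and ~44 GB once higher-precision layers are accounted for.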

Requirements

# macOS with Apple Silicon (M1/M2/M3/M4/M5)
pip install turboquant-mlx-full

Quick Start

Generate text

python -m turboquant_mlx.generate \
    --model ~/path/to/gpt-oss-20b-tq3 \
    --prompt "Why is the sky blue? Explain in simple terms." \
    --max-tokens 200 \
    --temp 0.7

With KV Cache Compression

For longer prompts/context, use KV cache compression:

python -m turboquant_mlx.demo_kv \
    --model ~/path/to/gpt-oss-20b-tq3 \
    --prompt "Your long prompt here..." \
    --max-tokens 200 \
    --tq-bits 3
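To illustrate what `--tq-bits 3` implies, here is a minimal round-trip of group-wise 3-bit quantization applied to a toy key tensor. This is a sketch of the general idea only, using plain symmetric uniform quantization; it is not the library's actual codepath (TurboQuant itself uses a rotation plus a Lloyd-Max codebook, as described below under "How It Works"):

```python
import numpy as np

def kv_quantize(x, bits=3, group=64):
    # Group-wise symmetric uniform quantization of a KV tensor.
    # Illustrative only; TurboQuant uses a Lloyd-Max codebook instead.
    qmax = 2 ** (bits - 1) - 1  # 3 positive levels at 3 bits
    g = x.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(g / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float16)

def kv_dequantize(q, scale, shape):
    # Reconstruct an approximation of the original tensor.
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(shape)

rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 128)).astype(np.float32)  # 8 toy key vectors
q, s = kv_quantize(keys)
approx = kv_dequantize(q, s, keys.shape)
err = np.linalg.norm(approx - keys) / np.linalg.norm(keys)
print(f"relative error at 3 bits: {err:.3f}")
```

Storing `int8` codes (packed to 3 bits in practice) plus one float16 scale per group is what shrinks the cache footprint for long contexts.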

Results

Configuration       Size      Speed (M4 Max)
BF16 (original)     ~44 GB    ~55 tok/s
TurboQuant 3-bit    ~9.3 GB   ~73 tok/s

How It Works

TurboQuant applies:

  1. Hadamard rotation - random ±1 sign flips followed by a Hadamard transform, which decorrelates weights and smooths out outliers
  2. Lloyd-Max codebook - an MSE-optimal scalar quantizer, fitted with 1-D k-means
  3. Group-wise scaling - one float16 scale per group of 64 values to preserve dynamic range

This achieves much better quality than standard affine quantization at the same bit-width.
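The three steps above can be sketched end to end with NumPy. This is a toy illustration under stated assumptions (Sylvester-construction Hadamard matrix, a shared 8-level codebook for 3 bits, per-group RMS scales), not the library's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def hadamard(n):
    # Sylvester construction; n must be a power of two. H is orthonormal.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_codebook(x, levels=8, iters=30):
    # 1-D k-means (Lloyd's algorithm); 8 levels = 3 bits per weight.
    c = np.quantile(x, np.linspace(0.05, 0.95, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()
    return np.sort(c)

def quantize_group(g, codebook):
    # Normalize by the group's RMS, snap to the nearest codeword,
    # and keep the float16 scale for dequantization.
    scale = np.float16(np.sqrt(np.mean(g**2)) + 1e-12)
    idx = np.abs(g / float(scale) - codebook[:, None]).argmin(axis=0)
    return idx.astype(np.uint8), scale

group = 64
w = rng.standard_normal(group * 4)                 # toy weight row
signs = rng.choice([-1.0, 1.0], size=group)        # random sign flips
rotated = (w.reshape(-1, group) * signs) @ hadamard(group).T
cb = lloyd_max_codebook(rotated.ravel())
blocks = [quantize_group(g, cb) for g in rotated]
recon = np.stack([cb[idx] * float(s) for idx, s in blocks])
err = np.linalg.norm(recon - rotated) / np.linalg.norm(rotated)
print(f"relative reconstruction error: {err:.3f}")
```

Because the rotation makes the weight distribution close to Gaussian, a single Lloyd-Max codebook fitted to that distribution beats a per-tensor affine grid at the same bit-width.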

License

Apache 2.0 (same as base model)

Citation

@article{zandieh2025turboquant,
  title={TurboQuant: A Unified Framework for Extremely Low-Bit Weight and KV Cache Quantization},
  author={Zandieh, Amir and Han, Minsik and Dalca, Andre and Shin, Jungwoo and Wang, Brian and Zhang, Yichao and Bordegoni, Matteo and Tian, Yuan and others},
  year={2025}
}