GPT-OSS-120B-MLX-3.6bit

Mixed-precision MLX quantization of OpenAI GPT-OSS-120B — OpenAI's open-source MoE model (120B total / 5.1B active parameters).

  • 3.557 BPW | 48 GB

🚀 Hardware Optimization

This model brings OpenAI's 120B-class open-source performance to Apple Silicon. Using mixed-precision quantization, we've compressed the model from MXFP4's 65GB down to 48GB (a 17GB reduction) while actually improving generation speed, thanks to the smaller memory footprint.

This optimization unlocks two distinct local inference experiences:

  • 64GB Unified Memory (Minimum): The original MXFP4 model weighs 65GB and cannot fit in 64GB of unified memory. This quantization breaks that barrier, fitting a full 120B model into 64GB for the first time and making local 120B inference possible on memory-constrained machines.
  • 96GB+ Unified Memory (Recommended): Delivers a smooth, uncompromised experience. The smaller footprint leaves ample headroom for the KV cache, enabling long-context inference.
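A quick fit check makes the two tiers concrete. This is a sketch: the ~75% default Metal working-set cap and the 90% raised-limit figure are assumptions about macOS memory limits, not measured values.

```python
# Back-of-the-envelope fit check against the measured peak-memory figures.
# The GPU-usable fractions below are assumptions about macOS Metal limits.
PEAKS_GB = {"3.6bit": 48.7, "MXFP4": 61.1}   # peak memory (pp4096/tg128)

def fits(total_gb: float, gpu_fraction: float) -> dict:
    """Headroom (GB) left for each quant on a machine with total_gb RAM."""
    usable = total_gb * gpu_fraction
    return {name: usable - peak for name, peak in PEAKS_GB.items()}

# 64 GB machine, default Metal working-set limit (~75% of RAM, assumed):
print(fits(64, 0.75))   # MXFP4 is far out of budget; 3.6bit is borderline
# 64 GB machine with the wired limit raised (e.g. via iogpu.wired_limit_mb):
print(fits(64, 0.90))   # 3.6bit now fits comfortably; MXFP4 still does not
```

Under these assumptions, the 3.6bit quant fits on a 64GB machine once the wired limit is raised, while MXFP4 never does.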

Quantization

4-tier mixed precision by functional sensitivity:

| Bits | % Params | Layers |
| --- | --- | --- |
| BF16 | ~0.02% | norm, router, bias, sinks — tiny count, cannot tolerate precision loss |
| 6-bit | ~3.3% | Embeddings, lm_head, v/o_proj (all layers), edge-layer (0–5, 30–35) attention, full_attention q/k |
| 4-bit | ~0.5% | Middle sliding_attention (6–29) q/k_proj |
| 3-bit | ~96% | Expert FFN (128 experts, 4 active/token) |
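The tier percentages roughly account for the advertised 3.557 BPW once per-group quantization overhead is included. A minimal sketch, assuming an MLX group size of 64 with an fp16 scale and bias per group (both assumptions, not confirmed settings):

```python
# Sketch: effective bits-per-weight of the 4-tier mix.
# GROUP_SIZE and the fp16 scale+bias per group are assumptions.
GROUP_SIZE = 64
OVERHEAD = 2 * 16 / GROUP_SIZE     # scale + bias in fp16, amortized per weight

tiers = {            # bits -> fraction of parameters (from the table above)
    16: 0.0002,      # BF16: norm, router, bias, sinks
    6:  0.033,       # embeddings, lm_head, v/o_proj, edge attention
    4:  0.005,       # middle sliding_attention q/k_proj
    3:  0.96,        # expert FFN
}
bpw = sum((bits + (OVERHEAD if bits < 16 else 0)) * frac
          for bits, frac in tiers.items())
print(f"~{bpw:.2f} BPW")   # lands near the advertised 3.557 BPW
```

The ~96% of parameters at 3-bit dominates, which is why the overall figure sits just above 3.5.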

Benchmark (M2 Max 96GB)

| | This (3.6bit) | Original MXFP4 |
| --- | --- | --- |
| Model size | 48 GB | 65 GB |
| Peak memory (pp1024/tg128) | 48.6 GB | 61.0 GB |
| Peak memory (pp4096/tg128) | 48.7 GB | 61.1 GB |
| Prefill (1k ctx) | 171.9 tok/s | 189.0 tok/s |
| Prefill (4k ctx) | 191.9 tok/s | 210.7 tok/s |
| Generation (1k ctx) | 48.2 tok/s | 43.9 tok/s |
| Generation (4k ctx) | 40.5 tok/s | 38.1 tok/s |

Generation speed is faster than the original MXFP4 despite lower bit-width — smaller model size means better memory bandwidth utilization.
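The bandwidth argument can be sketched numerically. This is a simplified model, not a benchmark: the ~400 GB/s M2 Max bandwidth figure, the 4.25 BPW nominal cost of MXFP4, and the assumption that decoding is purely memory-bandwidth bound are all simplifications.

```python
# Rough decode-speed ceiling from memory bandwidth (simplified model:
# each token requires reading every active parameter once).
BANDWIDTH_GBS = 400.0           # assumed M2 Max memory bandwidth
ACTIVE_PARAMS = 5.1e9           # active parameters per token (MoE)

def decode_ceiling(bpw: float) -> float:
    """Upper-bound tokens/sec if decoding were purely bandwidth-bound."""
    bytes_per_token_gb = ACTIVE_PARAMS * bpw / 8 / 1e9
    return BANDWIDTH_GBS / bytes_per_token_gb

print(f"3.6bit ceiling: {decode_ceiling(3.557):.0f} tok/s")
print(f"MXFP4  ceiling: {decode_ceiling(4.25):.0f} tok/s")
# Measured throughput (48.2 vs 43.9 tok/s) sits well below these ceilings,
# but moves in the same direction: fewer bytes per token, faster decode.
```

Prefill, by contrast, is compute-bound, which is consistent with MXFP4's slightly higher prefill numbers in the table above.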

Usage

```python
from mlx_lm import load, generate

model, tokenizer = load("MoringLabs/GPT-OSS-120B-MLX-3.6bit")

messages = [{"role": "user", "content": "Hello!"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```

License

Apache 2.0
