# GPT-OSS-120B-MLX-3.6bit

Mixed-precision MLX quantization of GPT-OSS-120B, OpenAI's open-weight MoE model (120B total / 5.1B active parameters).
- 3.557 BPW | 48 GB
## 🚀 Hardware Optimization

This model brings OpenAI's 120B-class open-weight performance to Apple Silicon. Mixed-precision quantization compresses the model from MXFP4's 65 GB down to 48 GB, a 17 GB reduction, while actually improving generation speed thanks to the smaller memory footprint.
This optimization unlocks two distinct local inference experiences:
- 64 GB unified memory (minimum): The original MXFP4 model weighs 65 GB and cannot fit in 64 GB at all. This quantization breaks that barrier, fitting a full 120B-class model into 64 GB and making local 120B inference possible on edge devices.
- 96 GB+ unified memory (recommended): The compact footprint leaves large headroom for the KV cache, delivering a smooth, uncompromised long-context experience.
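The headroom claims above can be sanity-checked with a back-of-envelope KV-cache estimate. This is a rough sketch, assuming gpt-oss-120b's published attention shape (36 layers, 8 grouped KV heads of dimension 64, BF16 cache) and ignoring OS and runtime overhead; since half the layers use sliding-window attention with a bounded cache, it overestimates the real cost:

```python
# Back-of-envelope KV-cache sizing for gpt-oss-120b.
# Assumed architecture (from the public config; treat as approximate):
n_layers = 36        # transformer blocks
n_kv_heads = 8       # grouped-query key/value heads
head_dim = 64
bytes_per_elem = 2   # BF16 cache entries

# Keys + values, across all layers, per token of context.
kv_bytes_per_token = n_layers * 2 * n_kv_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")

# Headroom left after the 48 GB of weights on 64 GB vs 96 GB machines.
for ram_gb in (64, 96):
    headroom_gb = ram_gb - 48
    tokens = headroom_gb * 1024**3 // kv_bytes_per_token
    print(f"{ram_gb} GB RAM -> ~{tokens:,} tokens of KV headroom")
```

At roughly 72 KiB per token, even the 16 GB of headroom on a 64 GB machine covers context lengths well beyond typical use, and a 96 GB machine is never KV-bound in practice.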
## Quantization

Four-tier mixed precision, assigned by functional sensitivity:
| Bits | % Params | Components |
|---|---|---|
| BF16 | ~0.02% | Norms, router, biases, attention sinks (tiny count, cannot tolerate precision loss) |
| 6-bit | ~3.3% | Embeddings, lm_head, v/o_proj (all layers), edge-layer (0–5, 30–35) attention, full_attention q/k_proj |
| 4-bit | ~0.5% | Middle sliding_attention (6–29) q/k_proj |
| 3-bit | ~96% | Expert FFNs (128 experts, 4 active per token) |
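The tier assignment above can be expressed as a per-tensor policy function. A minimal sketch, not the actual conversion script: the `bits_for` helper and the `layer_types` argument (one `"sliding_attention"` or `"full_attention"` entry per layer, as in the model config) are illustrative names, and recent `mlx_lm` versions accept a similar callback via `convert`'s quantization predicate, whose exact signature may differ:

```python
# Per-tensor bit-width policy matching the table above.
EDGE_LAYERS = set(range(0, 6)) | set(range(30, 36))

def bits_for(path, layer_types):
    """Return quantization bits for a weight path, or None to keep BF16."""
    if any(k in path for k in ("norm", "router", "bias", "sinks")):
        return None                      # tiny, precision-critical tensors
    if any(k in path for k in ("embed_tokens", "lm_head", "v_proj", "o_proj")):
        return 6                         # 6-bit in every layer
    if "q_proj" in path or "k_proj" in path:
        layer = int(path.split("layers.")[1].split(".")[0])
        if layer in EDGE_LAYERS or layer_types[layer] == "full_attention":
            return 6                     # edge layers and full_attention q/k
        return 4                         # middle sliding_attention q/k
    return 3                             # expert FFNs, ~96% of parameters
```

The checks are ordered so that, for example, `o_proj.bias` matches the `bias` rule first and stays in BF16, consistent with the table.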
## Benchmark (M2 Max, 96 GB)

| Metric | This (3.6bit) | Original MXFP4 |
|---|---|---|
| Model size | 48 GB | 65 GB |
| Peak memory (pp1024/tg128) | 48.6 GB | 61.0 GB |
| Peak memory (pp4096/tg128) | 48.7 GB | 61.1 GB |
| Prefill (1k ctx) | 171.9 tok/s | 189.0 tok/s |
| Prefill (4k ctx) | 191.9 tok/s | 210.7 tok/s |
| Generation (1k ctx) | 48.2 tok/s | 43.9 tok/s |
| Generation (4k ctx) | 40.5 tok/s | 38.1 tok/s |
Generation is faster than the original MXFP4 despite the lower bit-width: decoding is memory-bandwidth bound, and the smaller weights reduce the bytes streamed per token.
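A rough worked estimate of why this holds, assuming all 5.1B active parameters are read once per token and taking MXFP4's nominal 4.25 BPW (4-bit elements plus one shared scale per 32-element block); the measured speedup is smaller than predicted because KV-cache reads, the higher-precision tensors, and compute are unchanged:

```python
# Rough estimate of weight bytes streamed per generated token.
# Treat these as lower bounds: real traffic also includes KV-cache reads.
ACTIVE_PARAMS = 5.1e9

def gb_per_token(bits_per_weight):
    return ACTIVE_PARAMS * bits_per_weight / 8 / 1e9

mxfp4 = gb_per_token(4.25)    # MXFP4: 4-bit elements + shared block scales
ours = gb_per_token(3.557)    # this quantization's average BPW

print(f"MXFP4:  {mxfp4:.2f} GB/token")
print(f"3.6bit: {ours:.2f} GB/token")
print(f"predicted speedup: {mxfp4 / ours:.2f}x (measured: 48.2/43.9 = 1.10x)")
```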
## Usage

```python
from mlx_lm import load, generate

model, tokenizer = load("MoringLabs/GPT-OSS-120B-MLX-3.6bit")

messages = [{"role": "user", "content": "Hello!"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```
## License
Apache 2.0
Base model: openai/gpt-oss-120b