GPT-OSS-120B-MLX-3.6bit

Mixed-precision MLX quantization of OpenAI GPT-OSS-120B — OpenAI's open-source MoE model (120B total / 5.1B active parameters).

  • 3.557 BPW | 48 GB

🚀 Hardware Optimization

This model brings OpenAI's 120B-class open-source performance to Apple Silicon. Using mixed-precision quantization, we've compressed the model from MXFP4's 65GB down to 48GB (a 17GB reduction) while actually improving generation speed, thanks to the smaller memory footprint.

This optimization unlocks two distinct local inference experiences:

  • 64GB Unified Memory (Minimum): The original MXFP4 model weighs 65GB and cannot fit in 64GB of unified memory. This quantization breaks that barrier, fitting a full 120B model into 64GB for the first time and making local 120B inference possible on memory-constrained machines.
  • 96GB+ Unified Memory (Recommended): Delivers a smooth, uncompromised experience. The smaller footprint leaves ample headroom for the KV cache, enabling long-context inference.
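A quick fit check makes the two tiers concrete. This is a sketch: the ~75% default Metal working-set cap and the 90% raised-limit figure are assumptions about macOS memory limits, not measured values.

```python
# Back-of-the-envelope fit check against the measured peak-memory figures.
# The GPU-usable fractions below are assumptions about macOS Metal limits.
PEAKS_GB = {"3.6bit": 48.7, "MXFP4": 61.1}   # peak memory (pp4096/tg128)

def fits(total_gb: float, gpu_fraction: float) -> dict:
    """Headroom (GB) left for each quant on a machine with total_gb RAM."""
    usable = total_gb * gpu_fraction
    return {name: usable - peak for name, peak in PEAKS_GB.items()}

# 64 GB machine, default Metal working-set limit (~75% of RAM, assumed):
print(fits(64, 0.75))   # MXFP4 is far out of budget; 3.6bit is borderline
# 64 GB machine with the wired limit raised (e.g. via iogpu.wired_limit_mb):
print(fits(64, 0.90))   # 3.6bit now fits comfortably; MXFP4 still does not
```

Under these assumptions, the 3.6bit quant fits on a 64GB machine once the wired limit is raised, while MXFP4 never does.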

Quantization

4-tier mixed precision by functional sensitivity:

| Bits | % Params | Layers |
| --- | --- | --- |
| BF16 | ~0.02% | norm, router, bias, sinks — tiny count, cannot tolerate precision loss |
| 6-bit | ~3.3% | Embeddings, lm_head, v/o_proj (all layers), edge-layer (0–5, 30–35) attention, full_attention q/k |
| 4-bit | ~0.5% | Middle sliding_attention (6–29) q/k_proj |
| 3-bit | ~96% | Expert FFN (128 experts, 4 active/token) |
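The tier percentages roughly account for the advertised 3.557 BPW once per-group quantization overhead is included. A minimal sketch, assuming an MLX group size of 64 with an fp16 scale and bias per group (both assumptions, not confirmed settings):

```python
# Sketch: effective bits-per-weight of the 4-tier mix.
# GROUP_SIZE and the fp16 scale+bias per group are assumptions.
GROUP_SIZE = 64
OVERHEAD = 2 * 16 / GROUP_SIZE     # scale + bias in fp16, amortized per weight

tiers = {            # bits -> fraction of parameters (from the table above)
    16: 0.0002,      # BF16: norm, router, bias, sinks
    6:  0.033,       # embeddings, lm_head, v/o_proj, edge attention
    4:  0.005,       # middle sliding_attention q/k_proj
    3:  0.96,        # expert FFN
}
bpw = sum((bits + (OVERHEAD if bits < 16 else 0)) * frac
          for bits, frac in tiers.items())
print(f"~{bpw:.2f} BPW")   # lands near the advertised 3.557 BPW
```

The ~96% of parameters at 3-bit dominates, which is why the overall figure sits just above 3.5.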

Benchmark (M2 Max 96GB)

| | This (3.6bit) | Original MXFP4 |
| --- | --- | --- |
| Model size | 48 GB | 65 GB |
| Peak memory (pp1024/tg128) | 48.6 GB | 61.0 GB |
| Peak memory (pp4096/tg128) | 48.7 GB | 61.1 GB |
| Prefill (1k ctx) | 171.9 tok/s | 189.0 tok/s |
| Prefill (4k ctx) | 191.9 tok/s | 210.7 tok/s |
| Generation (1k ctx) | 48.2 tok/s | 43.9 tok/s |
| Generation (4k ctx) | 40.5 tok/s | 38.1 tok/s |

Generation speed is faster than the original MXFP4 despite lower bit-width — smaller model size means better memory bandwidth utilization.
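The bandwidth argument can be sketched numerically. This is a simplified model, not a benchmark: the ~400 GB/s M2 Max bandwidth figure, the 4.25 BPW nominal cost of MXFP4, and the assumption that decoding is purely memory-bandwidth bound are all simplifications.

```python
# Rough decode-speed ceiling from memory bandwidth (simplified model:
# each token requires reading every active parameter once).
BANDWIDTH_GBS = 400.0           # assumed M2 Max memory bandwidth
ACTIVE_PARAMS = 5.1e9           # active parameters per token (MoE)

def decode_ceiling(bpw: float) -> float:
    """Upper-bound tokens/sec if decoding were purely bandwidth-bound."""
    bytes_per_token_gb = ACTIVE_PARAMS * bpw / 8 / 1e9
    return BANDWIDTH_GBS / bytes_per_token_gb

print(f"3.6bit ceiling: {decode_ceiling(3.557):.0f} tok/s")
print(f"MXFP4  ceiling: {decode_ceiling(4.25):.0f} tok/s")
# Measured throughput (48.2 vs 43.9 tok/s) sits well below these ceilings,
# but moves in the same direction: fewer bytes per token, faster decode.
```

Prefill, by contrast, is compute-bound, which is consistent with MXFP4's slightly higher prefill numbers in the table above.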

Usage

```python
from mlx_lm import load, generate

model, tokenizer = load("MoringLabs/GPT-OSS-120B-MLX-3.6bit")

messages = [{"role": "user", "content": "Hello!"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```

License

Apache 2.0
