# Nemotron-3-Super-120B-A12B-MLX-3.6bit
Mixed-precision MLX quantization of NVIDIA Nemotron-3-Super-120B-A12B — a hybrid Mamba2 + MoE + Attention architecture.
- 3.623 BPW | 51 GB
## 🚀 Hardware Optimization
This model brings 120B-class performance to Apple Silicon. Mixed-precision quantization reduces the memory footprint from 240 GB in BF16 down to 51 GB while keeping generation quality closer to lossless than standard uniform 4-bit quantization.
This optimization unlocks two distinct local inference experiences:
- 64GB Unified Memory (Minimum): Pushes the hardware boundaries to make local 120B model inference possible on edge devices.
- 96GB+ Unified Memory (Recommended): Delivers a smooth, uncompromised experience. The compact footprint leaves substantial headroom for the KV cache, enabling full long-context use.
## Quantization
4-tier mixed precision by functional sensitivity:
| Bits | Layers | % Params | Description |
|---|---|---|---|
| BF16 | — | ~0.02% | norm, router, bias — tiny count, cannot tolerate precision loss |
| 6-bit | 94 | ~3% | Embeddings, lm_head, all Attention q/k/v/o, edge Mamba & MoE (layers 0–10, 77–87) |
| 4-bit | 180 | ~8% | Mid-layer Mamba in/out_proj, MoE shared_expert & latent_proj |
| 3-bit | 80 | ~89% | Expert FFN (512 experts × 40 layers, switch_mlp) |
## Benchmark (M2 Max 96GB, oMLX)
| Test | Prompt processing (tok/s) | Generation (tok/s) |
|---|---|---|
| pp1024/tg128 | 151.9 | 31.6 |
| pp4096/tg128 | 171.7 | 30.0 |
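As a rough sanity check, the throughput numbers translate into end-to-end latency (prefill time plus generation time), for example for the pp4096/tg128 run:

```python
# Approximate wall-clock time for the pp4096/tg128 benchmark on M2 Max.
pp_tokens, tg_tokens = 4096, 128
pp_tps, tg_tps = 171.7, 30.0  # throughputs from the table above

latency_s = pp_tokens / pp_tps + tg_tokens / tg_tps
print(f"{latency_s:.1f} s")  # ≈ 28.1 s total
```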
## Requirements
- mlx-lm ≥ 0.31.2 (dev) — older versions lack Nemotron-H / latent projection support
- Apple Silicon with ≥ 96GB unified memory recommended
## Usage
```python
from mlx_lm import load, generate

model, tokenizer = load("MoringLabs/Nemotron-3-Super-120B-A12B-MLX-3.6bit")

messages = [{"role": "user", "content": "Hello!"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```
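The 51 GB figure can be sanity-checked from the parameter count and the average bits per weight:

```python
# Estimate the on-disk/in-memory weight footprint: 121B params at 3.623 BPW.
PARAMS = 121e9
BITS_PER_WEIGHT = 3.623

total_bytes = PARAMS * BITS_PER_WEIGHT / 8
gib = total_bytes / 2**30
print(f"{gib:.1f} GiB")  # ≈ 51.0 GiB, matching the advertised size
```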