# Nemotron-3-Super-120B-A12B-MLX-3.6bit
Mixed-precision MLX quantization of NVIDIA Nemotron-3-Super-120B-A12B — a hybrid Mamba2 + MoE + Attention architecture.
- 3.623 BPW | 51 GB
## 🚀 Hardware Optimization
This model brings 120B-class performance to Apple Silicon. Mixed-precision quantization reduces the memory footprint from 240 GB in BF16 down to 51 GB while keeping generation quality closer to lossless than standard uniform 4-bit quantization.
This optimization unlocks two distinct local inference experiences:
- 64GB Unified Memory (Minimum): Pushes the hardware boundaries to make local 120B model inference possible on edge devices.
- 96GB+ Unified Memory (Recommended): Delivers a smooth, uncompromised experience. The compact footprint leaves substantial headroom for the KV cache, enabling full long-context use.
## Quantization
4-tier mixed precision by functional sensitivity:
| Bits | Layers | % Params | Description |
|---|---|---|---|
| BF16 | — | ~0.02% | norm, router, bias — tiny count, cannot tolerate precision loss |
| 6-bit | 94 | ~3% | Embeddings, lm_head, all Attention q/k/v/o, edge Mamba & MoE (layers 0–10, 77–87) |
| 4-bit | 180 | ~8% | Mid-layer Mamba in/out_proj, MoE shared_expert & latent_proj |
| 3-bit | 80 | ~89% | Expert FFN (512 experts × 40 layers, switch_mlp) |
## Benchmark (M2 Max 96GB, oMLX)
| Test | Prompt processing (tok/s) | Generation (tok/s) |
|---|---|---|
| pp1024/tg128 | 151.9 | 31.6 |
| pp4096/tg128 | 171.7 | 30.0 |
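As a rough sanity check, the throughput numbers translate into end-to-end latency (prefill time plus generation time), for example for the pp4096/tg128 run:

```python
# Approximate wall-clock time for the pp4096/tg128 benchmark on M2 Max.
pp_tokens, tg_tokens = 4096, 128
pp_tps, tg_tps = 171.7, 30.0  # throughputs from the table above

latency_s = pp_tokens / pp_tps + tg_tokens / tg_tps
print(f"{latency_s:.1f} s")  # ≈ 28.1 s total
```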
## Requirements
- mlx-lm ≥ 0.31.2 (dev) — older versions lack Nemotron-H / latent projection support
- Apple Silicon with ≥ 96GB unified memory recommended
## Usage
```python
from mlx_lm import load, generate

model, tokenizer = load("MoringLabs/Nemotron-3-Super-120B-A12B-MLX-3.6bit")

messages = [{"role": "user", "content": "Hello!"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```
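The 51 GB figure can be sanity-checked from the parameter count and the average bits per weight:

```python
# Estimate the on-disk/in-memory weight footprint: 121B params at 3.623 BPW.
PARAMS = 121e9
BITS_PER_WEIGHT = 3.623

total_bytes = PARAMS * BITS_PER_WEIGHT / 8
gib = total_bytes / 2**30
print(f"{gib:.1f} GiB")  # ≈ 51.0 GiB, matching the advertised size
```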