MLX Studio — the only app that natively supports JANG models with reasoning
JANG_4M reaches 93% MMLU at the same size as MLX 4-bit, matching its quality while protecting attention layers at 8-bit. Hybrid Mamba-2 SSM + Latent MoE + Attention architecture.
LM Studio, Ollama, oMLX do NOT support JANG format. Use MLX Studio or `pip install "jang[mlx]>=2.1.5"`.
Nemotron-3-Super-120B-A12B — JANG_4M (4.1-bit, 8-bit attention) — Reasoning
JANG — Jang Adaptive N-bit Grading | The GGUF Equivalent for MLX
JANG is fully open-source. Quantization engine, research, and full commit history: github.com/jjang-ai/jangq. Created by Jinho Jang.
Key Features
- 93.0% MMLU (200 questions, reasoning mode) — matches MLX 4-bit at same size
- 55.1 tok/s generation, 154 tok/s prefill
- 63 GB on disk, 61.2 GB GPU RAM
- Reasoning mode: `...` step-by-step problem solving
- Hybrid architecture: 40 Mamba-2 SSM + 40 Latent MoE (512 experts) + 8 Dense Attention layers
- bfloat16 compute: auto-detected for 512-expert models
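The top-22-of-512 expert activation above can be sketched as follows; this is an illustrative sketch in plain Python, not the JANG loader's or Nemotron's actual routing code, and the names are assumptions.

```python
import random

# Illustrative top-k expert routing for a Latent MoE layer: the router
# scores all 512 experts per token and activates only the top 22.
EXPERTS, TOP_K = 512, 22

def route_top_k(scores, k=TOP_K):
    """Return indices of the k highest-scoring experts."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(EXPERTS)]
active = route_top_k(scores)
print(len(active))  # 22 — only ~4.3% of experts run per token
```

Routing only 22 of 512 experts per token is what keeps the active parameter count at 12B despite the 120B total.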
Results: JANG vs MLX (200-question MMLU)
Per-subject comparison. All models tested with and without reasoning using identical methodology.
| Subject | JANG_4M No-Think | JANG_4M Reasoning | JANG_2L No-Think | JANG_2L Reasoning | MLX 4-bit No-Think | MLX 4-bit Reasoning |
|---|---|---|---|---|---|---|
| Abstract Algebra | 10/20 | 19/20 | 12/20 | 16/20 | 9/20 | 19/20 |
| Anatomy | 15/20 | 18/20 | 15/20 | 17/20 | 14/20 | 18/20 |
| Astronomy | 19/20 | 19/20 | 19/20 | 19/20 | 19/20 | 19/20 |
| College CS | 13/20 | 17/20 | 13/20 | 15/20 | 14/20 | 17/20 |
| College Physics | 14/20 | 19/20 | 14/20 | 18/20 | 13/20 | 20/20 |
| HS Biology | 19/20 | 20/20 | 19/20 | 18/20 | 18/20 | 20/20 |
| HS Chemistry | 15/20 | 18/20 | 15/20 | 16/20 | 16/20 | 19/20 |
| HS Mathematics | 6/20 | 18/20 | 8/20 | 18/20 | 6/20 | 18/20 |
| Logical Fallacies | 17/20 | 19/20 | 17/20 | 18/20 | 17/20 | 18/20 |
| World Religions | 17/20 | 19/20 | 18/20 | 17/20 | 16/20 | 19/20 |
| Total | 145/200 (72.5%) | 186/200 (93.0%) | 150/200 (75.0%) | 172/200 (86.0%) | 142/200 (71.0%) | 187/200 (93.5%) |
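As a quick sanity check, the per-subject JANG_4M reasoning scores in the table sum to the reported total:

```python
# Per-subject JANG_4M reasoning scores from the table above (each out of 20).
jang_4m_reasoning = [19, 18, 19, 17, 19, 20, 18, 18, 19, 19]
correct = sum(jang_4m_reasoning)
print(correct, f"{correct / 200:.1%}")  # 186 93.0%
```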
Summary
| | JANG_4M | JANG_2L | MLX 4-bit | MLX 3-bit |
|---|---|---|---|---|
| MMLU (no-think) | 72.5% | 75.0% | 71.0% | Crashes |
| MMLU (reasoning) | 93.0% | 86.0% | 93.5% | Crashes |
| Size | 63 GB | 43 GB | 63 GB | N/A |
| GPU RAM | 61.2 GB | 42.4 GB | 63.3 GB | N/A |
| Speed | 55.1 tok/s | 51.6 tok/s | 59.8 tok/s | N/A |
| Fits 64 GB? | YES | YES | YES | N/A |
JANG_4M nearly ties MLX 4-bit (93.0% vs 93.5%) at the same 63 GB size with 8-bit attention protection. MLX 3-bit cannot be created — `mlx_lm.convert` crashes on Nemotron's mtp.* weights. Only JANG can produce sub-4-bit quantizations.
Also see: JANG_2L (43 GB) — 20 GB smaller, fits 64 GB Macs, 75% no-think / 86% reasoning.
Specs
| Metric | Value |
|---|---|
| Source | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
| Architecture | Hybrid Mamba-2 SSM + Latent MoE + Dense Attention |
| Layers | 88 (40 Mamba-2 + 40 MoE + 8 Attention) |
| Experts | 512 per MoE layer, top-22 active (12B active params) |
| Profile | JANG_4M (CRITICAL=8, IMPORTANT=4, COMPRESS=4) |
| Average bits | 4.10 bpw |
| Disk size | 63 GB |
| GPU RAM | 61.2 GB (peak 66 GB) |
| Speed | 55.1 tok/s generation, 154 tok/s prefill |
| Compute | bfloat16 (auto-detected) |
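A back-of-envelope check on the 4.10 bpw figure, assuming roughly 120B total weights (an approximation; exact parameter counts vary by what is quantized):

```python
# Rough disk-size estimate implied by the average bits-per-weight.
total_params = 120e9   # approximate total parameter count (assumption)
avg_bits = 4.10        # JANG_4M average bits per weight (from the table)
size_gb = total_params * avg_bits / 8 / 1e9
print(f"{size_gb:.1f} GB")  # 61.5 GB — close to the reported 63 GB on disk
```

The small gap to 63 GB is plausibly unquantized tensors (embeddings, norms) and file metadata.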
Requirements
- Apple Silicon Mac with 64+ GB unified memory
- MLX Studio or `pip install "jang[mlx]>=2.1.5"`
Quick Start
```bash
pip install "jang[mlx]>=2.1.5"
```
```python
from jang_tools.loader import load_jang_model
from mlx_lm import generate

model, tokenizer = load_jang_model("JANGQ-AI/Nemotron-3-Super-120B-A12B-JANG_4M")

# With reasoning
messages = [{"role": "user", "content": "Explain quantum computing."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
result = generate(model, tokenizer, prompt=prompt, max_tokens=2048)

# Without reasoning (faster)
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
result = generate(model, tokenizer, prompt=prompt, max_tokens=100)
```
Technical Notes
- Latent MoE: Nemotron-H compresses hidden states 4096→1024 before expert routing. JANG loader handles this automatically.
- bfloat16: Auto-detected for 512-expert models. Prevents float16 overflow. Zero quality impact.
- trust_remote_code: Custom Python files included (modeling_nemotron_h.py, configuration_nemotron_h.py).
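The float16-overflow rationale behind the bfloat16 default can be illustrated with a minimal sketch; the per-expert magnitude below is an assumption chosen for illustration, not a measured activation value.

```python
import struct

def fp16(x: float) -> float:
    """Round-trip a float through IEEE half precision; inf on overflow."""
    try:
        return struct.unpack('e', struct.pack('e', x))[0]
    except OverflowError:
        return float('inf')

# float16 tops out at 65504. Accumulating 512 expert contributions of even
# modest magnitude exceeds that range, while bfloat16 keeps float32's
# exponent range and would not overflow here.
FP16_MAX = 65504.0
total = 512 * 200.0                   # hypothetical summed expert outputs
print(fp16(200.0))                    # 200.0 — each term alone fits
print(total > FP16_MAX, fp16(total))  # True inf — the accumulated sum overflows
```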
JANG — Created by Jinho Jang (eric@jangq.ai) · @dealignai
GitHub · PyPI · HuggingFace
Korean
Nemotron-3-Super-120B JANG_4M — achieves 93% MMLU at the same size (63 GB) as MLX 4-bit.
| | JANG_4M | JANG_2L | MLX 4-bit |
|---|---|---|---|
| MMLU (no reasoning) | 72.5% | 75.0% | 71.0% |
| MMLU (with reasoning) | 93.0% | 86.0% | 93.5% |
| Size | 63 GB | 43 GB | 63 GB |
| Speed | 55.1 tok/s | 51.6 tok/s | 59.8 tok/s |
```bash
pip install "jang[mlx]>=2.1.5"
```