MLX Studio

MLX Studio — the only app that natively supports JANG models with reasoning


93.0% MMLU at the same size as MLX 4-bit. JANG_4M matches MLX 4-bit quality while protecting attention layers at 8-bit, on a hybrid Mamba-2 SSM + Latent MoE + Attention architecture.

LM Studio, Ollama, and oMLX do NOT support the JANG format. Use MLX Studio or `pip install "jang[mlx]>=2.1.5"`.


JANG

Nemotron-3-Super-120B-A12B — JANG_4M (4.1-bit, 8-bit attention) — Reasoning

JANG — Jang Adaptive N-bit Grading | The GGUF Equivalent for MLX


GitHub  PyPI  Website  X/Twitter

JANG is fully open-source. Quantization engine, research, and full commit history: github.com/jjang-ai/jangq. Created by Jinho Jang.

Key Features

  • 93.0% MMLU (200 questions, reasoning mode) — matches MLX 4-bit at same size
  • 55.1 tok/s generation, 154 tok/s prefill
  • 63 GB on disk, 61.2 GB GPU RAM
  • Reasoning mode: step-by-step problem solving, enabled with `enable_thinking=True` in the chat template
  • Hybrid architecture: 40 Mamba-2 SSM + 40 Latent MoE (512 experts) + 8 Dense Attention layers
  • bfloat16 compute: auto-detected for 512-expert models

Results: JANG vs MLX (200-question MMLU)

Per-subject comparison. All models tested with and without reasoning using identical methodology.

| Subject | JANG_4M No-Think | JANG_4M Reasoning | JANG_2L No-Think | JANG_2L Reasoning | MLX 4-bit No-Think | MLX 4-bit Reasoning |
|---|---|---|---|---|---|---|
| Abstract Algebra | 10/20 | 19/20 | 12/20 | 16/20 | 9/20 | 19/20 |
| Anatomy | 15/20 | 18/20 | 15/20 | 17/20 | 14/20 | 18/20 |
| Astronomy | 19/20 | 19/20 | 19/20 | 19/20 | 19/20 | 19/20 |
| College CS | 13/20 | 17/20 | 13/20 | 15/20 | 14/20 | 17/20 |
| College Physics | 14/20 | 19/20 | 14/20 | 18/20 | 13/20 | 20/20 |
| HS Biology | 19/20 | 20/20 | 19/20 | 18/20 | 18/20 | 20/20 |
| HS Chemistry | 15/20 | 18/20 | 15/20 | 16/20 | 16/20 | 19/20 |
| HS Mathematics | 6/20 | 18/20 | 8/20 | 18/20 | 6/20 | 18/20 |
| Logical Fallacies | 17/20 | 19/20 | 17/20 | 18/20 | 17/20 | 18/20 |
| World Religions | 17/20 | 19/20 | 18/20 | 17/20 | 16/20 | 19/20 |
| **Total** | **145/200 (72.5%)** | **186/200 (93.0%)** | **150/200 (75.0%)** | **172/200 (86.0%)** | **142/200 (71.0%)** | **187/200 (93.5%)** |
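The column totals follow directly from the per-subject rows. A quick sanity check (scores transcribed from the table, in row order from Abstract Algebra to World Religions) confirms each column sums to the reported figure:

```python
# Per-subject correct answers (out of 20 each), transcribed from the table above.
scores = {
    "JANG_4M no-think":    [10, 15, 19, 13, 14, 19, 15, 6, 17, 17],
    "JANG_4M reasoning":   [19, 18, 19, 17, 19, 20, 18, 18, 19, 19],
    "JANG_2L no-think":    [12, 15, 19, 13, 14, 19, 15, 8, 17, 18],
    "JANG_2L reasoning":   [16, 17, 19, 15, 18, 18, 16, 18, 18, 17],
    "MLX 4-bit no-think":  [9, 14, 19, 14, 13, 18, 16, 6, 17, 16],
    "MLX 4-bit reasoning": [19, 18, 19, 17, 20, 20, 19, 18, 18, 19],
}

for name, s in scores.items():
    # Each configuration answered 10 subjects x 20 questions = 200 total.
    print(f"{name}: {sum(s)}/200 ({100 * sum(s) / 200:.1f}%)")
```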

Summary

| | JANG_4M | JANG_2L | MLX 4-bit | MLX 3-bit |
|---|---|---|---|---|
| MMLU (no-think) | 72.5% | 75.0% | 71.0% | Crashes |
| MMLU (reasoning) | 93.0% | 86.0% | 93.5% | Crashes |
| Size | 63 GB | 43 GB | 63 GB | N/A |
| GPU RAM | 61.2 GB | 42.4 GB | 63.3 GB | N/A |
| Speed | 55.1 tok/s | 51.6 tok/s | 59.8 tok/s | N/A |
| Fits 64 GB? | YES | YES | YES | N/A |

JANG_4M nearly ties MLX 4-bit (93.0% vs 93.5%) at the same 63 GB size, with attention layers protected at 8-bit. An MLX 3-bit baseline cannot be created: `mlx_lm.convert` crashes on Nemotron's mtp.* weights, so JANG is currently the only route to sub-4-bit quantizations of this model.

Also see: JANG_2L (43 GB) — 20 GB smaller, fits 64 GB Macs, 75% no-think / 86% reasoning.
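For intuition on how an 8-bit attention tier coexists with a ~4.1 bpw average: a mixed profile's average is just a parameter-weighted mean of the per-tier bit widths. The 2.5% fraction below is an illustrative assumption, not the actual JANG_4M tier assignment; it shows how a small 8-bit share barely moves the average above 4 bits:

```python
def average_bpw(tier_fractions):
    """Parameter-weighted average bits per weight.

    tier_fractions maps bit width -> fraction of total parameters
    stored at that width (fractions should sum to 1).
    """
    return sum(bits * frac for bits, frac in tier_fractions.items())

# Illustrative split, NOT the real JANG engine's assignment:
# ~2.5% of parameters (attention, CRITICAL tier) at 8-bit, the rest at 4-bit.
print(f"{average_bpw({8: 0.025, 4: 0.975}):.2f} bpw")  # 4.10 bpw
```

Because attention is a small fraction of the parameters in a 512-expert MoE model, protecting it at 8-bit costs almost nothing in average bit width.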

Specs

| Metric | Value |
|---|---|
| Source | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
| Architecture | Hybrid Mamba-2 SSM + Latent MoE + Dense Attention |
| Layers | 88 (40 Mamba-2 + 40 MoE + 8 Attention) |
| Experts | 512 per MoE layer, top-22 active (12B active params) |
| Profile | JANG_4M (CRITICAL=8, IMPORTANT=4, COMPRESS=4) |
| Average bits | 4.10 bpw |
| Disk size | 63 GB |
| GPU RAM | 61.2 GB (peak 66 GB) |
| Speed | 55.1 tok/s generation, 154 tok/s prefill |
| Compute | bfloat16 (auto-detected) |
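The disk and GPU RAM figures are consistent with a back-of-envelope estimate from parameter count and average bit width. This rough estimate ignores embeddings, per-group quantization scales, and KV cache, so treat it as an approximate lower bound:

```python
params = 120e9   # total parameters (Nemotron-3-Super-120B)
bpw = 4.10       # average bits per weight (JANG_4M profile)

weight_bytes = params * bpw / 8  # bits -> bytes
print(f"~{weight_bytes / 1e9:.1f} GB of quantized weights")  # ~61.5 GB
```

That lands in the same ballpark as the measured 61.2 GB GPU RAM and 63 GB on disk; the remaining gap is quantization metadata and unquantized tensors.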

Requirements

  • Apple Silicon Mac with 64+ GB unified memory
  • MLX Studio or `pip install "jang[mlx]>=2.1.5"`

Quick Start

```bash
pip install "jang[mlx]>=2.1.5"
```

```python
from jang_tools.loader import load_jang_model
from mlx_lm import generate

model, tokenizer = load_jang_model("JANGQ-AI/Nemotron-3-Super-120B-A12B-JANG_4M")

# With reasoning
messages = [{"role": "user", "content": "Explain quantum computing."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
result = generate(model, tokenizer, prompt=prompt, max_tokens=2048)

# Without reasoning (faster)
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
result = generate(model, tokenizer, prompt=prompt, max_tokens=100)
```

Technical Notes

  • Latent MoE: Nemotron-H compresses hidden states 4096→1024 before expert routing. JANG loader handles this automatically.
  • bfloat16: Auto-detected for 512-expert models. Prevents float16 overflow. Zero quality impact.
  • trust_remote_code: Custom Python files included (modeling_nemotron_h.py, configuration_nemotron_h.py).
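The Latent MoE note above can be sketched in a few lines: compress the 4096-dim hidden state to a 1024-dim latent, score the 512 experts on the latent, and keep the top 22. This is a toy numpy sketch with random weights, not the Nemotron-H implementation (which lives in modeling_nemotron_h.py):

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_LATENT, N_EXPERTS, TOP_K = 4096, 1024, 512, 22

# Hypothetical projection matrices for illustration only.
w_down = rng.standard_normal((D_MODEL, D_LATENT)) * 0.02    # 4096 -> 1024 compression
w_router = rng.standard_normal((D_LATENT, N_EXPERTS)) * 0.02

def route(hidden):
    """Compress the hidden state, then pick the top-22 of 512 experts."""
    latent = hidden @ w_down                        # (1024,) latent representation
    logits = latent @ w_router                      # (512,) one logit per expert
    top = np.argpartition(logits, -TOP_K)[-TOP_K:]  # indices of the top-22 experts
    gates = np.exp(logits[top] - logits[top].max()) # softmax over selected experts
    return top, gates / gates.sum()                 # expert ids, normalized gate weights

experts, gates = route(rng.standard_normal(D_MODEL))
print(len(experts), round(float(gates.sum()), 6))   # 22 experts, gates sum to 1
```

Routing on the compressed latent rather than the full hidden state shrinks the router and expert projections, which is what lets only ~12B of the 120B parameters be active per token.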

JANG — Created by Jinho Jang (eric@jangq.ai) · @dealignai
GitHub · PyPI · HuggingFace

Korean (한국어)

Nemotron-3-Super-120B JANG_4M — 93% MMLU at the same size (63 GB) as MLX 4-bit.

| | JANG_4M | JANG_2L | MLX 4-bit |
|---|---|---|---|
| MMLU (no reasoning) | 72.5% | 75.0% | 71.0% |
| MMLU (with reasoning) | 93.0% | 86.0% | 93.5% |
| Size | 63 GB | 43 GB | 63 GB |
| Speed | 55.1 tok/s | 51.6 tok/s | 59.8 tok/s |

```bash
pip install "jang[mlx]>=2.1.5"
```
