MLX Studio — the only app that natively supports JANG models with reasoning


This model supports reasoning/thinking mode: it reasons inside <think>...</think> tags before answering, which dramatically improves accuracy on hard questions (abstract algebra: 50% → 80%; HS mathematics: 50% → 85%). MLX Studio is required to run this model: it handles the JANG format, bfloat16 compute, and thinking mode natively.

LM Studio, Ollama, oMLX, Inferencer do NOT support JANG format. Use MLX Studio or pip install "jang[mlx]".


JANG

Qwen3.5-397B-A17B — JANG_2L (3.7-bit, 8-bit attention) — Reasoning + VLM

JANG — Jang Adaptive N-bit Grading | The GGUF Equivalent for MLX


GitHub  PyPI  Website  X/Twitter

JANG is fully open-source. Quantization engine, research, and full commit history: github.com/jjang-ai/jangq. Created by Jinho Jang.

Key Features

  • 92.0% MMLU (200 questions, reasoning mode) — 397B intelligence on Apple Silicon
  • 36 tok/s generation speed on M4 Ultra 256 GB
  • Reasoning mode: <think>...</think> for step-by-step problem solving
  • Vision (VLM): 333 vision tensors, processes images and video
  • 187 GB on disk, 197 GB peak GPU RAM
  • bfloat16 compute: auto-detected by JANG loader for 512-expert models

Results: JANG_2L vs MLX 4-bit (200-question MMLU)

Per-subject comparison across all modes. Both JANG and MLX 4-bit tested with and without reasoning.

| Subject | JANG No-Think | JANG Reasoning | MLX 4-bit No-Think | MLX 4-bit Reasoning |
|---|---|---|---|---|
| Abstract Algebra | 10/20 | 16/20 | 10/20 | 17/20 |
| Anatomy | 17/20 | 19/20 | 18/20 | 19/20 |
| Astronomy | 19/20 | 19/20 | 19/20 | 19/20 |
| College CS | 18/20 | 19/20 | 15/20 | 18/20 |
| College Physics | 14/20 | 18/20 | 15/20 | 19/20 |
| HS Biology | 18/20 | 19/20 | 19/20 | 19/20 |
| HS Chemistry | 16/20 | 18/20 | 17/20 | 19/20 |
| HS Mathematics | 10/20 | 17/20 | 12/20 | 19/20 |
| Logical Fallacies | 19/20 | 20/20 | 19/20 | 20/20 |
| World Religions | 18/20 | 19/20 | 19/20 | 19/20 |
| **Total** | **159/200 (79.5%)** | **184/200 (92.0%)** | **163/200 (81.5%)** | **188/200 (94.0%)** |

Summary

| | JANG_2L | JANG_1L | MLX 4-bit | MLX 2/3-bit |
|---|---|---|---|---|
| MMLU (no-think) | 79.5% | 81.0% | 81.5% | NaN (cannot run) |
| MMLU (reasoning) | 92.0% | 86.5% | 94.0% | NaN (cannot run) |
| Size | 187 GB | 112 GB | 209 GB | N/A |
| GPU RAM | 184 GB | 110 GB | ~210 GB | N/A |
| Speed | 36.0 tok/s | 36.1 tok/s | ~36 tok/s | N/A |
| Fits 128 GB? | No (256 GB) | Yes | No | N/A |

JANG_2L is 22 GB smaller than MLX 4-bit. With reasoning enabled, MLX 4-bit reaches 94.0% while JANG_2L reaches 92.0% at the significantly smaller size. MLX 2-bit and 3-bit cannot run at all: they produce NaN due to float16 overflow on 512-expert models. JANG solves this with bfloat16 compute.
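The overflow failure mode is easy to reproduce. A minimal NumPy sketch (float32 is used here as a stand-in for bfloat16, since NumPy has no native bfloat16 and the two formats share the same exponent range):

```python
import numpy as np

# float16 saturates at 65,504; a large MoE activation exceeds this, becomes
# inf, and propagates as NaN through downstream softmax/normalization.
x = np.float16(60000.0) * np.float16(2.0)
print(x)  # inf

# bfloat16 keeps float32's 8-bit exponent (max ~3.4e38), so the same
# magnitude is representable without overflow.
y = np.float32(60000.0) * np.float32(2.0)
print(y)  # 120000.0
```

This is why the loader's bfloat16 auto-detection is required rather than optional for 512-expert checkpoints.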

Specs

| Metric | Value |
|---|---|
| Source | Qwen3.5-397B-A17B |
| Architecture | Hybrid MoE + SSM (GatedDeltaNet + Full Attention) |
| Experts | 512 per layer, top-10 active (17B active params) |
| Layers | 60 (45 GatedDeltaNet SSM + 15 Full Attention) |
| Profile | JANG_2L (CRITICAL=8, IMPORTANT=6, COMPRESS=2) |
| MLP Asymmetry | gate_proj=4-bit, up_proj=2-bit, down_proj=3-bit |
| Average bits | 3.72 bpw |
| Disk size | 187 GB (43 shards) |
| GPU RAM | 197 GB peak |
| Generation speed | 36.0 tok/s (M4 Ultra 256 GB) |
| Prefill speed | 94.5 tok/s |
| Compute dtype | bfloat16 (auto-detected, prevents float16 overflow) |
| VLM | 333 vision tensors, Qwen3VLProcessor |

Requirements

  • Apple Silicon Mac with 256 GB unified memory (M3/M4 Ultra)
  • MLX Studio (recommended) or pip install "jang[mlx]>=2.1.5"
  • Python 3.11+ for CLI usage

Quick Start (Python)

```shell
pip install "jang[mlx]>=2.1.5"
```

```python
from jang_tools.loader import load_jang_model
from mlx_lm import generate

model, tokenizer = load_jang_model("JANGQ-AI/Qwen3.5-397B-A17B-JANG_2L")
# bfloat16 is auto-applied for 512-expert models

# With reasoning (recommended for hard questions)
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
    add_generation_prompt=True, enable_thinking=True)
result = generate(model, tokenizer, prompt=prompt, max_tokens=2048)

# Without reasoning (faster for simple questions)
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
    add_generation_prompt=True, enable_thinking=False)
result = generate(model, tokenizer, prompt=prompt, max_tokens=100)
```

VLM Usage

```python
from jang_tools.loader import load_jang_vlm_model
from mlx_vlm import generate as vlm_generate

model, processor = load_jang_vlm_model("JANGQ-AI/Qwen3.5-397B-A17B-JANG_2L")

# Format prompt with image tokens
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
result = vlm_generate(model, processor, prompt=prompt, image=["photo.jpg"], max_tokens=200)
```

What is JANG?

JANG (Jang Adaptive N-bit Grading) is a mixed-precision quantization format for Apple Silicon that classifies every weight tensor by sensitivity:

  • CRITICAL (8-bit): Full attention Q/K/V/O, MoE routers, output head
  • IMPORTANT (6-bit): Embeddings, GatedDeltaNet (linear attention)
  • COMPRESS (2-bit): MoE expert MLP (512 experts provide redundancy)
  • MLP Asymmetry: gate_proj=4-bit (SiLU amplifier), down_proj=3-bit (residual projection)

This gives 397B-level intelligence at 187 GB — fitting on a single M4 Ultra Mac Studio.
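A rough sketch of how such sensitivity grading could map tensor names to bit widths, plus a back-of-envelope size check at the stated average of 3.72 bpw. The function and name patterns below are illustrative only, not the actual jang_tools API:

```python
# Hypothetical JANG-style tier classifier (illustrative name patterns).
def jang_bits(name: str) -> int:
    if any(k in name for k in ("q_proj", "k_proj", "v_proj", "o_proj",
                               "router", "lm_head")):
        return 8   # CRITICAL: full attention, MoE routers, output head
    if "embed" in name or "deltanet" in name:
        return 6   # IMPORTANT: embeddings, GatedDeltaNet
    if "gate_proj" in name:
        return 4   # MLP asymmetry: SiLU amplifier
    if "down_proj" in name:
        return 3   # MLP asymmetry: residual projection
    return 2       # COMPRESS: expert up_proj and the rest

# Back-of-envelope disk size: 397B params at 3.72 bits per weight.
params = 397e9
print(f"{params * 3.72 / 8 / 1e9:.0f} GB")  # 185 GB, close to the 187 GB on disk
```

The small gap between the estimate and the actual 187 GB is expected: quantization scales, biases, and metadata add overhead on top of the packed weights.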

Technical Notes

  • bfloat16 compute: 512-expert models with hidden_size=4096 overflow float16 (max 65,504) at the shared expert down_proj. The JANG loader auto-detects this and uses bfloat16 (max 3.4×10^38). Zero quality impact — quantization noise dominates.
  • Reasoning mode: The model uses <think>...</think> tags for step-by-step reasoning. On hard questions (math, physics, algebra), this improves accuracy by 25+ percentage points.
  • Chat template: Includes enable_thinking toggle. Set to False for fast answers, True for reasoning.
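When reasoning mode is on, the raw generation contains the chain of thought before the answer. A minimal post-processing helper (hypothetical, not part of jang_tools or mlx_lm) that splits the two, so a UI can show or hide the reasoning:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer) around <think>...</think>."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()          # no thinking block emitted
    reasoning = m.group(1).strip()       # chain of thought
    answer = text[m.end():].strip()      # everything after </think>
    return reasoning, answer

reasoning, answer = split_thinking("<think>2+2=4</think>The answer is 4.")
print(answer)  # The answer is 4.
```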

JANG — Created by Jinho Jang (eric@jangq.ai) · @dealignai
GitHub · PyPI · HuggingFace

Korean Summary (translated)

JANG is a mixed-precision quantization format for Apple Silicon. It runs Qwen3.5-397B at 36 tok/s on a single Mac Studio.

  • 92.0% MMLU (reasoning mode)
  • 187 GB on disk, runs on an M4 Ultra with 256 GB
  • Requires MLX Studio

```shell
pip install "jang[mlx]>=2.1.5"
```