Qwen3.5-122B-A10B-MLX-3.7bit-VL-v2

Enhanced mixed-precision MLX quantization of Qwen3.5-122B-A10B — Alibaba's latest MoE model with full vision support preserved in BF16.

  • 3.748 BPW | 53.5 GB | Vision preserved (BF16)

What's New in v2

v2 spends an additional ~1.5 GB over v1 to surgically upgrade the most sensitive bottleneck points:

| Upgrade | Change | Cost | Rationale |
|---|---|---|---|
| embed/lm_head | 6→8-bit | +0.3 GB | Every token passes through; directly affects the output probability distribution |
| v_proj / o_proj | 6→8-bit | +0.4 GB | Attention value/output projections; quality-critical across all layers |
| Edge 6-layer attention | 6→8-bit | +0.3 GB | First/last 6 layers build the initial/final representations |
| Edge 6-layer shared_expert | 6→8-bit | +0.2 GB | shared_expert is active on every forward pass; edge layers are the most sensitive |
| Mid-layer shared_expert | 4→6-bit | +0.2 GB | The always-active expert warrants higher precision than the sparse experts |
| Entry/exit layer switch_mlp | 3→4-bit | +0.1 GB | Layer 0 errors propagate through all 48 layers; layer 47 feeds the final output |

Net effect: +1.5 GB of model size buys meaningfully better coherence, fewer repetitions, and improved instruction following, most noticeable in long-form generation and complex reasoning tasks.

🚀 Hardware Optimization

This model brings 122B-class multimodal performance to Apple Silicon. Advanced mixed-precision quantization compresses the model from uniform 4-bit's 65 GB down to 53.5 GB (an 11.5 GB reduction) while preserving the full vision pipeline at BF16 precision for lossless image understanding.

  • 64 GB unified memory (minimum): A uniform 4-bit quantization weighs 65 GB and cannot fit in 64 GB at all. This quantization breaks that barrier, fitting a full 122B multimodal model into 64 GB and making local 122B vision+language inference possible on edge devices for the first time.
  • 96 GB+ unified memory (recommended): Delivers an uncompromised, smooth multimodal experience. The efficient footprint leaves substantial headroom for the KV cache, unlocking long-context use for both text and vision tasks.
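The memory figures above can be sanity-checked with simple arithmetic. A minimal sketch, under two assumptions not stated in the card: a nominal 122e9 parameter count, and the card's "GB" read as GiB. The effective BPW figure already folds in quantization scales and the BF16 tensors.

```python
# Back-of-the-envelope size check for the figures above.
# Assumptions (not from the card): nominal 122e9 parameters; "GB" means GiB.

def quantized_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of a quantized model in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

print(f"{quantized_footprint_gib(122e9, 3.748):.1f} GiB")  # close to the stated 53.5 GB
```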

Quantization

5-tier mixed precision by functional sensitivity:

| Bits | % Params | Layers |
|---|---|---|
| BF16 | ~2% | Vision tower, norms, router, conv1d (preserving full visual fidelity) |
| 8-bit | ~10% | embed/lm_head, v_proj/o_proj (all layers), edge 6-layer attention + shared_expert |
| 6-bit | ~6% | Full-attention q/k_proj (12 layers), mid-layer shared_expert |
| 4-bit | ~2% | DeltaNet attention, entry/exit layer switch_mlp (layers 0 & 47) |
| 3-bit | ~80% | Expert FFN (256 experts, 8 active per token), the bulk parameter pool |

v1 vs v2 Comparison

| Component | v1 (3.655 BPW) | v2 (3.748 BPW) |
|---|---|---|
| Model size | 52 GB | 53.5 GB |
| embed/lm_head | 6-bit | 8-bit |
| v_proj / o_proj | 6-bit | 8-bit |
| Edge attention | 6-bit | 8-bit |
| Edge shared_expert | 6-bit | 8-bit |
| Mid shared_expert | 4-bit | 6-bit |
| Entry/exit switch_mlp | 3-bit | 4-bit |
| Vision pipeline | BF16 | BF16 |
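As a rough consistency check, the BPW increase alone implies a size delta in the same ballpark as the table's 1.5 GB difference (a sketch assuming a nominal 122e9 parameters and GiB units; the residual gap would come from metadata and rounding).

```python
# Size delta implied by the BPW increase from v1 to v2.
# Assumptions (not from the card): nominal 122e9 parameters, GiB units.

def bpw_delta_gib(bpw_old: float, bpw_new: float, n_params: float = 122e9) -> float:
    """GiB added by raising the effective bits-per-weight."""
    return n_params * (bpw_new - bpw_old) / 8 / 2**30

print(f"{bpw_delta_gib(3.655, 3.748):.2f} GiB")  # same ballpark as the stated ~1.5 GB
```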

Benchmark Results

Primary Intelligence Benchmarks

| Benchmark | Accuracy | Correct | Total |
|---|---|---|---|
| MMLU | 87.5% | 875 | 1000 |
| CMMLU | 89.3% | 268 | 300 |
| TruthfulQA | 89.6% | 732 | 817 |
| HumanEval | 90.8% | 149 | 164 |
| MBPP | 82.5% | 165 | 200 |
| LiveCodeBench | 58.0% | 58 | 100 |

Tests were conducted with the MLX implementation. MMLU, CMMLU, MBPP, and LiveCodeBench were subsampled as indicated to balance evaluation time against statistical significance.

Coding & Reasoning (10-Question Test Suite)

Tested against Claude Opus 4.6 reference answers on a 10-question benchmark covering debugging, algorithms, security, concurrency, system design, and logic reasoning:

| Model | Score | Grade |
|---|---|---|
| Claude Opus 4.6 (reference) | 98/100 | S |
| Qwen3.5-122B v2 (this model) | 95/100 | S |
| Qwen3.5-122B v1 | 90/100 | A+ |

v2 scores five points higher than the previous version, demonstrating that the precision upgrades translate into measurable reasoning gains.

Quality (WikiText-2 Perplexity)

Lower is better. Same harness for every row: WikiText-2 test split, 128 × 2048 tokens, batch size 1.

| Model | Mean | Median | Trimmed Mean | Notes |
|---|---|---|---|---|
| Uniform 4-bit (65 GB) | ≥6.0 | ≥6.0 | ≥6.2 | Baseline MLX 4-bit quant (identical script) |
| v1 (3.655 BPW / 52 GB) | 5.3536 | 5.3639 | 5.6631 | Previous mixed-precision release |
| v2 (3.748 BPW / 53.5 GB) | 5.3206 | 5.3430 | 5.6282 | Current release (this card) |
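For clarity on what the columns mean, here is how such statistics are typically computed from per-window scores. A minimal sketch: the per-token NLL extraction is elided, and the 10%-per-tail trim is an assumed convention; the card does not state which trimming fraction was used.

```python
import math
from statistics import mean, median

def window_perplexity(token_nlls: list[float]) -> float:
    """Perplexity of one window from its per-token negative log-likelihoods (nats)."""
    return math.exp(mean(token_nlls))

def summarize(window_ppls: list[float], trim: float = 0.10) -> tuple[float, float, float]:
    """Mean, median, and symmetric trimmed mean over per-window perplexities.

    The 10%-per-tail trim is an assumed convention, not taken from the card.
    """
    s = sorted(window_ppls)
    k = int(len(s) * trim)
    trimmed = s[k:len(s) - k] if k else s
    return mean(window_ppls), median(window_ppls), mean(trimmed)

# Synthetic example with 10 windows (the real harness uses 128 windows of 2048 tokens)
m, md, tm = summarize([5.0, 5.1, 5.2, 5.2, 5.3, 5.3, 5.4, 5.4, 5.5, 7.0])
print(round(m, 2), round(md, 2), round(tm, 2))  # → 5.44 5.3 5.3
```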

Usage

```python
from mlx_vlm import load, generate

model, processor = load("MoringLabs/Qwen3.5-122B-A10B-MLX-3.7bit-VL-v2")

# Text conversation
response = generate(model, processor, prompt="Hello!", max_tokens=200)
print(response)

# Image understanding
response = generate(model, processor, prompt="Describe this image", image="photo.jpg", max_tokens=200)
print(response)
```

License

Apache 2.0
