Qwen3.5-122B-A10B-MLX-4.7bit-VL

All-4-bit-expert MLX quantization of Qwen3.5-122B-A10B — Alibaba's latest MoE model with full vision support preserved in BF16.

  • 4.655 BPW | 66 GB | Vision preserved (BF16)

🚀 Hardware Optimization

This build keeps the same Apple Silicon-first pipeline while trading ~13 GB for higher expert fidelity: every sparse switch_mlp expert moves from 3-bit to 4-bit, so the whole network bottoms out at 4-bit precision. The result is a 66 GB package (vs. 53.5 GB for the 3.7bit mix). Full 4-bit experts push resident memory into the 70 GB range, so this build targets machines with at least 96 GB of unified memory.

  • 96GB Unified Memory (minimum): Ensures the model + KV cache fit without paging.
  • 128GB+ Unified Memory (recommended): Leaves comfortable headroom for long contexts, multi-image prompts, and background processes.
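The 66 GB headline follows directly from the bits-per-weight figure. A back-of-envelope check, assuming ~122B total parameters and that the listed size is in GiB:

```python
# Back-of-envelope: weight footprint from bits-per-weight (BPW).
# Assumes ~122e9 total parameters and GiB units for the listed "66 GB".
params = 122e9
bpw = 4.655

size_bytes = params * bpw / 8   # bits -> bytes
size_gib = size_bytes / 2**30   # bytes -> GiB

print(f"{size_gib:.1f} GiB")    # ~66.1 GiB, matching the listed size
```

KV cache, activations, and framework overhead come on top of this, which is why resident usage lands in the 70 GB range.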

Quantization

Five-tier mixed precision, assigned by functional sensitivity (no 3-bit kernels remain):

| Bits | % Params | Layers |
|---|---|---|
| BF16 | ~2% | Vision tower, norms, router, conv1d (full visual fidelity preserved) |
| 8-bit | ~10% | embed_tokens, lm_head, v/o_proj (all layers), edge 6-layer attention + shared_expert |
| 6-bit | ~6% | Full-attention q/k_proj (12 layers), mid-layer shared_expert |
| 4-bit | ~4% | DeltaNet attention path, entry/exit switch_mlp |
| 4-bit (experts) | ~78% | Sparse switch_mlp experts (256 total, 8 active/token), upgraded from 3-bit |
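As a sanity check, the tier shares above roughly reproduce the headline bits-per-weight. The percentages are rounded, so the result is approximate:

```python
# Weighted-average bits-per-weight from the (rounded) tier shares above.
tiers = {           # bits -> approximate share of parameters
    16: 0.02,       # BF16: vision tower, norms, router, conv1d
    8:  0.10,       # embeddings, lm_head, v/o_proj, edge attention
    6:  0.06,       # full-attention q/k_proj, mid-layer shared_expert
    4:  0.04 + 0.78,  # dense 4-bit paths + sparse experts
}
avg_bpw = sum(bits * share for bits, share in tiers.items())
print(f"{avg_bpw:.2f} BPW")  # ~4.76, near the reported 4.655 (shares are rounded)
```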

Benchmark (M2 Max 38c / 96GB UM)

Only throughput numbers were recorded (same runner as the benchmark screenshot dated 2026-04-06 18:52).

| Context Window | Prefill (tok/s) | Generation (tok/s) |
|---|---|---|
| 1k tokens | 205.8 | 34.9 |
| 4k tokens | 233.9 | 33.0 |
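These throughputs translate into rough end-to-end latency. A sketch for a 4k-token prompt with a 200-token reply (real runs add sampling and cache overhead on top):

```python
# Rough latency estimate from the measured 4k-context throughputs.
prompt_tokens = 4096
output_tokens = 200
prefill_tps = 233.9
gen_tps = 33.0

ttft = prompt_tokens / prefill_tps   # time to first token (prefill phase)
gen_time = output_tokens / gen_tps   # decode phase
total = ttft + gen_time
print(f"prefill ~{ttft:.1f}s, generation ~{gen_time:.1f}s, total ~{total:.1f}s")
```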

Quality (WikiText-2 Perplexity)

Lower is better. Evaluated on WikiText-2 test split with 128 sequences × 2048 tokens, batch size 1.

| Model | Mean | Median | Trimmed Mean |
|---|---|---|---|
| 4.7bit (this) | 5.2094 | 5.3014 | 5.5491 |
| 3.7bit v2 | 5.3206 | 5.3430 | 5.6282 |
| 3.7bit v1 | 5.3536 | 5.3639 | 5.6631 |

Run log: 128 windows, total time 1234.9 s on M2 Max (38c). Trimmed statistics drop outlier windows caused by WikiText-2 headers/trailers.
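The trimmed statistic amounts to a symmetric trim over the per-window perplexities. A minimal sketch with illustrative values; the actual trim fraction used here is not stated, so the 10% per tail below is an assumption:

```python
# Symmetric trimmed mean over per-window perplexities (illustrative values).
# The 10% per-tail trim fraction is an assumption; the card does not state it.
def trimmed_mean(values, trim=0.10):
    xs = sorted(values)
    k = int(len(xs) * trim)                  # windows dropped from each tail
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

# One header-heavy outlier window (9.8) inflates the plain mean.
windows = [5.1, 5.2, 5.3, 5.2, 5.4, 9.8, 5.0, 5.3, 5.2, 4.9]
plain = sum(windows) / len(windows)
print(round(plain, 3), round(trimmed_mean(windows), 3))
```

Trimming drops the outlier window, so the trimmed mean sits well below the plain mean on this toy data.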

Usage

```python
from mlx_vlm import load, generate

model, processor = load("MoringLabs/Qwen3.5-122B-A10B-MLX-4.7bit-VL")

# Text chat
response = generate(model, processor, prompt="Hello!", max_tokens=200)
print(response)

# Image understanding
response = generate(model, processor, prompt="Describe this image", image="photo.jpg", max_tokens=200)
print(response)
```

License

Apache 2.0
