Qwen3.5-122B-A10B-MLX-4.7bit-VL

All-4-bit-expert MLX quantization of Qwen3.5-122B-A10B — Alibaba's latest MoE model with full vision support preserved in BF16.

  • 4.655 BPW | 66 GB | Vision preserved (BF16)

🚀 Hardware Optimization

This build keeps the same Apple Silicon-first pipeline while trading ~13 GB for higher expert fidelity: every sparse switch_mlp expert moves from 3-bit to 4-bit, so the whole network bottoms out at 4-bit precision. The result is a 66 GB package (vs. 53.5 GB for the 3.7bit mix). Full 4-bit experts push resident memory into the 70 GB range, so this build targets machines with at least 96 GB of unified memory.

  • 96GB Unified Memory (minimum): Ensures the model + KV cache fit without paging.
  • 128GB+ Unified Memory (recommended): Leaves comfortable headroom for long contexts, multi-image prompts, and background processes.
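The 66 GB headline follows directly from the bits-per-weight figure. A back-of-envelope check, assuming ~122B total parameters and that the listed size is in GiB:

```python
# Back-of-envelope: weight footprint from bits-per-weight (BPW).
# Assumes ~122e9 total parameters and GiB units for the listed "66 GB".
params = 122e9
bpw = 4.655

size_bytes = params * bpw / 8   # bits -> bytes
size_gib = size_bytes / 2**30   # bytes -> GiB

print(f"{size_gib:.1f} GiB")    # ~66.1 GiB, matching the listed size
```

KV cache, activations, and framework overhead come on top of this, which is why resident usage lands in the 70 GB range.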

Quantization

Five-tier mixed precision, assigned by functional sensitivity (no 3-bit kernels remain):

| Bits | % Params | Layers |
|---|---|---|
| BF16 | ~2% | Vision tower, norms, router, conv1d (full visual fidelity preserved) |
| 8-bit | ~10% | embed_tokens, lm_head, v/o_proj (all layers), edge 6-layer attention + shared_expert |
| 6-bit | ~6% | Full-attention q/k_proj (12 layers), mid-layer shared_expert |
| 4-bit | ~4% | DeltaNet attention path, entry/exit switch_mlp |
| 4-bit (experts) | ~78% | Sparse switch_mlp experts (256 total, 8 active/token), upgraded from 3-bit |
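As a sanity check, the tier shares above roughly reproduce the headline bits-per-weight. The percentages are rounded, so the result is approximate:

```python
# Weighted-average bits-per-weight from the (rounded) tier shares above.
tiers = {           # bits -> approximate share of parameters
    16: 0.02,       # BF16: vision tower, norms, router, conv1d
    8:  0.10,       # embeddings, lm_head, v/o_proj, edge attention
    6:  0.06,       # full-attention q/k_proj, mid-layer shared_expert
    4:  0.04 + 0.78,  # dense 4-bit paths + sparse experts
}
avg_bpw = sum(bits * share for bits, share in tiers.items())
print(f"{avg_bpw:.2f} BPW")  # ~4.76, near the reported 4.655 (shares are rounded)
```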

Benchmark (M2 Max 38c / 96GB UM)

Only throughput numbers were recorded (same runner as the benchmark screenshot dated 2026-04-06 18:52).

| Context Window | Prefill (tok/s) | Generation (tok/s) |
|---|---|---|
| 1k tokens | 205.8 | 34.9 |
| 4k tokens | 233.9 | 33.0 |
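These throughputs translate into rough end-to-end latency. A sketch for a 4k-token prompt with a 200-token reply (real runs add sampling and cache overhead on top):

```python
# Rough latency estimate from the measured 4k-context throughputs.
prompt_tokens = 4096
output_tokens = 200
prefill_tps = 233.9
gen_tps = 33.0

ttft = prompt_tokens / prefill_tps   # time to first token (prefill phase)
gen_time = output_tokens / gen_tps   # decode phase
total = ttft + gen_time
print(f"prefill ~{ttft:.1f}s, generation ~{gen_time:.1f}s, total ~{total:.1f}s")
```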

Quality (WikiText-2 Perplexity)

Lower is better. Evaluated on WikiText-2 test split with 128 sequences × 2048 tokens, batch size 1.

| Model | Mean | Median | Trimmed Mean |
|---|---|---|---|
| 4.7bit (this) | 5.2094 | 5.3014 | 5.5491 |
| 3.7bit v2 | 5.3206 | 5.3430 | 5.6282 |
| 3.7bit v1 | 5.3536 | 5.3639 | 5.6631 |

Run log: 128 windows, total time 1234.9 s on M2 Max (38c). Trimmed statistics drop outlier windows caused by WikiText-2 headers/trailers.
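The trimmed statistic amounts to a symmetric trim over the per-window perplexities. A minimal sketch with illustrative values; the actual trim fraction used here is not stated, so the 10% per tail below is an assumption:

```python
# Symmetric trimmed mean over per-window perplexities (illustrative values).
# The 10% per-tail trim fraction is an assumption; the card does not state it.
def trimmed_mean(values, trim=0.10):
    xs = sorted(values)
    k = int(len(xs) * trim)                  # windows dropped from each tail
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

# One header-heavy outlier window (9.8) inflates the plain mean.
windows = [5.1, 5.2, 5.3, 5.2, 5.4, 9.8, 5.0, 5.3, 5.2, 4.9]
plain = sum(windows) / len(windows)
print(round(plain, 3), round(trimmed_mean(windows), 3))
```

Trimming drops the outlier window, so the trimmed mean sits well below the plain mean on this toy data.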

Usage

```python
from mlx_vlm import load, generate

model, processor = load("MoringLabs/Qwen3.5-122B-A10B-MLX-4.7bit-VL")

# Text chat
response = generate(model, processor, prompt="Hello!", max_tokens=200)
print(response)

# Image understanding
response = generate(model, processor, prompt="Describe this image", image="photo.jpg", max_tokens=200)
print(response)
```

License

Apache 2.0
