# Qwen3.5-122B-A10B-MLX-4.7bit-VL
All-4-bit-expert MLX quantization of Qwen3.5-122B-A10B — Alibaba's latest MoE model with full vision support preserved in BF16.
- 4.655 BPW | 66 GB | Vision preserved (BF16)
## 🚀 Hardware Optimization

This build keeps the same Apple Silicon-first pipeline but trades ~13 GB for higher expert fidelity: every sparse switch_mlp expert moves from 3-bit to 4-bit, so no weight in the network drops below 4-bit precision. The result is a 66 GB package (vs. 53.5 GB for the 3.7-bit mix). Full 4-bit experts push resident memory into the 70 GB range, so this build targets machines with 96 GB of unified memory or more.
- 96GB Unified Memory (minimum): Ensures the model + KV cache fit without paging.
- 128GB+ Unified Memory (recommended): Leaves comfortable headroom for long contexts, multi-image prompts, and background processes.
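As a sanity check on the quoted size, simple bits-per-weight arithmetic reproduces the figure, assuming ~122B total parameters and that the "66 GB" is reported in binary gibibytes (both assumptions, inferred from the model name and package size):

```python
# Back-of-envelope size check: parameters x bits-per-weight -> bytes.
# Assumptions: ~122e9 total parameters (from the model name) and that
# the quoted 4.655 BPW already includes quantization scales/biases.
params = 122e9
bpw = 4.655
size_gib = params * bpw / 8 / 2**30  # bits -> bytes -> GiB
print(f"{size_gib:.1f} GiB")         # ~66.1, matching the quoted 66 GB
```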
## Quantization

5-tier mixed precision assigned by functional sensitivity (no 3-bit kernels remain):
| Bits | Layers | % Params | Description |
|---|---|---|---|
| BF16 | — | ~2% | Vision tower, norm, router, conv1d — preserving full visual fidelity |
| 8-bit | — | ~10% | embed_tokens, lm_head, v/o_proj (all layers), edge 6-layer attention + shared_expert |
| 6-bit | — | ~6% | Full-attention q/k_proj (12 layers), mid-layer shared_expert |
| 4-bit | — | ~4% | DeltaNet attention path, entry/exit switch_mlp |
| 4-bit (experts) | — | ~78% | Sparse switch_mlp experts (256 total, 8 active/token) upgraded from 3-bit |
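The tier table can be read as a name-to-bits mapping over weight paths. Below is a minimal sketch of such a predicate; the substring matches and the flat handling of shared_expert (which the recipe above actually splits by layer depth) are simplifying assumptions, not the exact rule used for this build:

```python
from typing import Optional

def tier_bits(path: str) -> Optional[int]:
    """Map a weight path to its quantization bit-width (None = keep BF16).
    Hypothetical sketch of the 5-tier recipe; a real recipe would also
    branch on layer index (e.g. edge vs. mid-layer shared_expert)."""
    if any(k in path for k in ("vision", "norm", "router", "conv1d")):
        return None  # BF16 tier: vision tower and small sensitive modules
    if any(k in path for k in ("embed_tokens", "lm_head", "v_proj",
                               "o_proj", "shared_expert")):
        return 8     # 8-bit tier
    if any(k in path for k in ("q_proj", "k_proj")):
        return 6     # 6-bit tier (full-attention layers)
    return 4         # 4-bit floor: DeltaNet path and all switch_mlp experts
```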
## Benchmark (M2 Max 38c / 96GB UM)

Only throughput numbers are recorded (same runner as the screenshot dated 2026-04-06 18:52).
| Context Window | Prefill (tok/s) | Generation (tok/s) |
|---|---|---|
| 1k tokens | 205.8 | 34.9 |
| 4k tokens | 233.9 | 33.0 |
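For planning interactive use, the two throughput columns combine into a rough end-to-end latency estimate (a sketch that ignores model load and image preprocessing time):

```python
# Rough latency for a 4k-token prompt plus 200 generated tokens,
# using the 4k-context row of the table above.
prefill_tps = 233.9   # prompt processing, tok/s
gen_tps = 33.0        # generation, tok/s
prompt_tokens, new_tokens = 4096, 200
latency_s = prompt_tokens / prefill_tps + new_tokens / gen_tps
print(f"~{latency_s:.1f} s end-to-end")  # ~23.6 s
```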
## Quality (WikiText-2 Perplexity)
Lower is better. Evaluated on WikiText-2 test split with 128 sequences × 2048 tokens, batch size 1.
| Model | Mean | Median | Trimmed Mean |
|---|---|---|---|
| 4.7bit (this) | 5.2094 | 5.3014 | 5.5491 |
| 3.7bit v2 | 5.3206 | 5.3430 | 5.6282 |
| 3.7bit v1 | 5.3536 | 5.3639 | 5.6631 |
Run log: 128 windows, total time 1234.9 s on M2 Max (38c). Trimmed statistics drop outlier windows caused by WikiText-2 headers/trailers.
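The three summary statistics can be produced from per-window losses as in this sketch (the symmetric 10% trim is an assumption; the run above specifically drops header/trailer outlier windows):

```python
import math
import statistics
from typing import Dict, List

def summarize_ppl(window_nlls: List[float], trim: float = 0.1) -> Dict[str, float]:
    """Summarize per-window perplexities from mean negative log-likelihoods.
    Hypothetical helper: the trimmed mean here drops the top/bottom `trim`
    fraction of windows rather than detecting header windows explicitly."""
    ppls = sorted(math.exp(nll) for nll in window_nlls)
    k = int(len(ppls) * trim)
    trimmed = ppls[k:len(ppls) - k] if k else ppls
    return {
        "mean": statistics.fmean(ppls),
        "median": statistics.median(ppls),
        "trimmed_mean": statistics.fmean(trimmed),
    }
```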
## Usage

```python
from mlx_vlm import load, generate

model, processor = load("MoringLabs/Qwen3.5-122B-A10B-MLX-4.7bit-VL")

# Text chat
response = generate(model, processor, prompt="Hello!", max_tokens=200)

# Image understanding
response = generate(model, processor, prompt="Describe this image",
                    image="photo.jpg", max_tokens=200)
print(response)
```
## License
Apache 2.0
Base model: Qwen/Qwen3.5-122B-A10B