# Qwen3.5-122B-A10B-MLX-3.7bit-VL
Mixed-precision MLX quantization of Qwen3.5-122B-A10B — Alibaba's latest MoE model with full vision support preserved in BF16.
- 3.655 BPW | 52 GB | Vision preserved (BF16)
## 🚀 Hardware Optimization
This model brings 122B-class multimodal performance to Apple Silicon. Mixed-precision quantization compresses the model from uniform 4-bit's 65 GB down to 52 GB, a 13 GB reduction, while keeping the full vision pipeline at BF16 for lossless image understanding.
This optimization unlocks two distinct local inference experiences:
- 64GB Unified Memory (Minimum): At 65 GB, the uniform 4-bit quantization cannot fit in 64 GB of unified memory. This quantization breaks that barrier, fitting a full 122B multimodal model into 64 GB and making local 122B vision+language inference possible on edge devices.
- 96GB+ Unified Memory (Recommended): Delivers a smooth, uncompromised multimodal experience. The smaller footprint frees substantial headroom for the KV cache, enabling long-context use for both text and vision tasks.
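As a sanity check, the 52 GB figure follows directly from the bits-per-weight number: a rough sketch (parameter count and BPW taken from this card; real checkpoints add small metadata overhead):

```python
# Rough size estimate: total parameters x bits-per-weight, in GiB.
# Figures are taken from this model card; treat the result as an
# approximation, not an exact checkpoint size.
PARAMS = 122e9   # total parameters (122B)
BPW = 3.655      # effective bits per weight for this quant

size_gib = PARAMS * BPW / 8 / 2**30
print(f"~{size_gib:.1f} GiB")  # lands near the 52 GB quoted above
```

The same arithmetic puts a uniform 4-bit checkpoint (with per-group quantization overhead) at roughly the 65 GB this card reports.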
## Quantization
4-tier mixed precision by functional sensitivity:
| Bits | % Params | Layers |
|---|---|---|
| BF16 | ~2% | Vision tower, norms, router, conv1d (preserves full visual fidelity) |
| 6-bit | ~8% | Embeddings, v/o_proj, edge layers, full-attention q/k |
| 4-bit | ~3% | DeltaNet attention, shared expert |
| 3-bit | ~87% | Expert FFN (256 experts, 8 active per token) |
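The effective ~3.7 BPW can be roughly reproduced as the parameter-weighted average of these tiers. A sketch (the percentages are the approximate shares from the table; per-group scale/bias metadata, not modeled here, pushes the real figure slightly higher, toward the quoted 3.655):

```python
# Parameter-weighted average over the quantization tiers listed above.
# Shares are approximate; group-wise quantization metadata adds a small
# constant overhead per weight on top of this raw average.
tiers = {16: 0.02, 6: 0.08, 4: 0.03, 3: 0.87}  # bits -> share of params

avg_bpw = sum(bits * share for bits, share in tiers.items())
print(f"weighted average ≈ {avg_bpw:.2f} bits/weight")
```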
## Benchmark (M2 Max, 96GB)
| | This (3.7bit) | Uniform 4-bit |
|---|---|---|
| Model size | 52 GB | 65 GB |
| Peak memory (ctx=4k) | 55.1 GB | 67.7 GB |
| Prefill (1k ctx) | 219.0 tok/s | 210.5 tok/s |
| Prefill (4k ctx) | 230.6 tok/s | 227.7 tok/s |
| Generation (1k ctx) | 36.2 tok/s | 38.3 tok/s |
| Generation (4k ctx) | 33.6 tok/s | 35.8 tok/s |
Speed is virtually identical, but the 13GB saved makes the difference between a usable and unusable long-context experience on 96GB machines.
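The peak-memory rows above translate directly into KV-cache headroom on each hardware tier; a quick sketch (upper bounds only, since macOS and other processes also consume unified memory):

```python
# Headroom left for KV cache and activations, per memory tier and quant.
# Peak-memory figures at ctx=4k are taken from the benchmark table;
# negative headroom means the model does not fit at all.
total = {"64GB": 64.0, "96GB": 96.0}
peak = {"3.7bit": 55.1, "4bit": 67.7}

headroom = {(m, q): round(total[m] - peak[q], 1) for m in total for q in peak}
for (m, q), h in sorted(headroom.items()):
    print(f"{m} / {q}: {h:+.1f} GB headroom")
```

The uniform 4-bit row goes negative on 64 GB machines, which is the "cannot fit" case described earlier.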
## Quality (WikiText-2 Perplexity)
Lower is better. Evaluated on 128 sequences × 2048 tokens.
| Metric | Value |
|---|---|
| Mean Perplexity | 5.3536 |
| Median Perplexity | 5.3639 |
| Trimmed Mean Perplexity | 5.6631 |
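For reference, per-sequence perplexity of this kind is the exponential of the mean token negative log-likelihood, and the statistics above aggregate over sequences. A minimal sketch (the NLL values below are illustrative placeholders, not the actual evaluation data, which uses 128 sequences of 2048 tokens):

```python
import math
import statistics

def perplexity(token_nlls):
    """Perplexity of one sequence: exp of the mean negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Placeholder per-sequence token NLLs, purely for illustration.
sequences = [[1.6, 1.7], [1.7, 1.8], [1.5, 1.9]]
seq_ppls = [perplexity(nlls) for nlls in sequences]

mean_ppl = statistics.mean(seq_ppls)
median_ppl = statistics.median(seq_ppls)
```

A trimmed mean additionally discards the highest and lowest tails of `seq_ppls` before averaging, which reduces the influence of outlier sequences.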
## Usage
```python
from mlx_vlm import load, generate

model, processor = load("MoringLabs/Qwen3.5-122B-A10B-MLX-3.7bit-VL")

# Text chat
response = generate(model, processor, prompt="Hello!", max_tokens=200)

# Image understanding
response = generate(model, processor, prompt="Describe this image", image="photo.jpg", max_tokens=200)
print(response)
```
## License
Apache 2.0
Base model: Qwen/Qwen3.5-122B-A10B