# Qwen3.5-122B-A10B-MLX-3.7bit-VL-v2
Enhanced mixed-precision MLX quantization of Qwen3.5-122B-A10B — Alibaba's latest MoE model with full vision support preserved in BF16.
- 3.748 BPW | 53.5 GB | Vision preserved (BF16)
## What's New in v2

v2 invests an additional ~1.5 GB over v1 to surgically upgrade the most sensitive bottleneck points:
| Upgrade | Change | Cost | Rationale |
|---|---|---|---|
| embed/lm_head | 6→8bit | +0.3 GB | Every token passes through; directly affects output probability distribution |
| v_proj / o_proj | 6→8bit | +0.4 GB | Attention value/output projections — quality-critical across all layers |
| Edge 6-layer attention | 6→8bit | +0.3 GB | First/last 6 layers build initial/final representations |
| Edge 6-layer shared_expert | 6→8bit | +0.2 GB | shared_expert activates on every inference; edge layers are most sensitive |
| Mid-layer shared_expert | 4→6bit | +0.2 GB | Always-active expert deserves higher precision than sparse experts |
| Entry/exit layer switch_mlp | 3→4bit | +0.1 GB | Layer 0 errors propagate through all 48 layers; Layer 47 feeds final output |
Net effect: +1.5 GB model size for meaningfully better coherence, fewer repetitions, and improved instruction following — especially noticeable in long-form generation and complex reasoning tasks.
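As a quick sanity check, the per-upgrade costs in the table can be reconciled against the v1→v2 bits-per-weight delta. This is a back-of-the-envelope sketch, assuming the advertised ~122B parameter count and decimal GB:

```python
# Sanity-check the v2 upgrade table: do the listed costs add up to the
# quoted +1.5 GB, and is that consistent with the v1 -> v2 BPW delta?
costs_gb = {
    "embed/lm_head 6->8": 0.3,
    "v_proj/o_proj 6->8": 0.4,
    "edge attention 6->8": 0.3,
    "edge shared_expert 6->8": 0.2,
    "mid shared_expert 4->6": 0.2,
    "entry/exit switch_mlp 3->4": 0.1,
}
total_gb = sum(costs_gb.values())                 # 1.5 GB, as quoted

# Independent estimate from the BPW figures (assumed 122e9 params, decimal GB):
delta_gb = (3.748 - 3.655) * 122e9 / 8 / 1e9      # ~1.42 GB
```

The two estimates agree to within rounding of the table's per-row costs.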
## 🚀 Hardware Optimization
This model brings 122B-class multimodal performance to Apple Silicon. Using mixed-precision quantization, we've compressed the model from the 65 GB of a uniform 4-bit quant down to 53.5 GB, an 11.5 GB reduction, while preserving the full vision pipeline in BF16 for lossless image understanding.
- 64GB Unified Memory (Minimum): A uniform 4-bit quantization weighs 65 GB and cannot fit in 64 GB at all. This quantization breaks that barrier, fitting a full 122B multimodal model into 64 GB and making local 122B vision+language inference possible on edge devices.
- 96GB+ Unified Memory (Recommended): Delivers a smooth, uncompromised multimodal experience. The compact footprint leaves substantial headroom for the KV cache, enabling long-context work in both text and vision tasks.
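The headline sizes can be reproduced from bits-per-weight arithmetic. A sketch, assuming the advertised 122B parameter count and treating the card's GB figures as binary GiB:

```python
PARAMS = 122e9       # advertised parameter count (assumption: taken at face value)
GIB = 1024 ** 3      # the card's "GB" figures behave like binary GiB

def size_gib(bpw: float) -> float:
    """Model size on disk implied by an average bits-per-weight figure."""
    return PARAMS * bpw / 8 / GIB

def effective_bpw(size_gib_val: float) -> float:
    """Average bits-per-weight implied by a quoted model size."""
    return size_gib_val * GIB * 8 / PARAMS

size_gib(3.748)      # ~53.2, matching the quoted 53.5 GB
effective_bpw(65.0)  # ~4.6: uniform "4-bit" carries scale/bias overhead
```

The gap between nominal 4-bit and ~4.6 effective BPW is where group-wise scales and the BF16-kept tensors show up; that overhead is exactly what mixed precision reallocates here.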
## Quantization

5-tier mixed precision, assigned by functional sensitivity:
| Bits | % Params | Components |
|---|---|---|
| BF16 | ~2% | Vision tower, norms, routers, conv1d: full visual fidelity preserved |
| 8-bit | ~10% | embed/lm_head; v_proj/o_proj (all layers); attention and shared_expert in the 6 edge layers at each end |
| 6-bit | ~6% | q/k_proj in the 12 full-attention layers; mid-layer shared_expert |
| 4-bit | ~2% | DeltaNet attention; switch_mlp in the entry/exit layers (0 and 47) |
| 3-bit | ~80% | Expert FFNs (256 experts, 8 active per token): the bulk parameter pool |
## v1 vs v2 Comparison
| | v1 (3.655 BPW) | v2 (3.748 BPW) |
|---|---|---|
| Model size | 52 GB | 53.5 GB |
| embed/lm_head | 6-bit | 8-bit |
| v_proj / o_proj | 6-bit | 8-bit |
| Edge attention | 6-bit | 8-bit |
| Edge shared_expert | 6-bit | 8-bit |
| Mid shared_expert | 4-bit | 6-bit |
| Entry/exit switch_mlp | 3-bit | 4-bit |
| Vision pipeline | BF16 | BF16 |
## Benchmark Results

### Primary Intelligence Benchmarks
| Benchmark | Accuracy | Correct | Total |
|---|---|---|---|
| MMLU | 87.5% | 875 | 1000 |
| CMMLU | 89.3% | 268 | 300 |
| TRUTHFULQA | 89.6% | 732 | 817 |
| HUMANEVAL | 90.8% | 149 | 164 |
| MBPP | 82.5% | 165 | 200 |
| LIVECODEBENCH | 58.0% | 58 | 100 |
Tests were conducted using the MLX implementation. MMLU, CMMLU, MBPP, and LiveCodeBench were subsampled as indicated to balance evaluation time against statistical significance.
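Because several benchmarks were subsampled, the sampling error on their accuracy figures is worth quantifying. A small sketch (not part of the original harness) computing a 95% Wilson score interval, applied here to the 300-sample CMMLU run:

```python
import math

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an accuracy estimate from a finite sample."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total**2)) / denom
    return center - half, center + half

wilson_ci(268, 300)  # ~(0.853, 0.923): CMMLU's 89.3% carries roughly +/-3.5 pts
```

The full-size runs (e.g. TruthfulQA at 817 questions) have correspondingly tighter intervals.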
### Coding & Reasoning (10-Question Test Suite)

Tested against Claude Opus 4.6 reference answers on a 10-question benchmark covering debugging, algorithms, security, concurrency, system design, and logical reasoning:
| Model | Score | Grade |
|---|---|---|
| Claude Opus 4.6 (reference) | 98/100 | S |
| Qwen3.5-122B v2 (this model) | 95/100 | S |
| Qwen3.5-122B v1 | 90/100 | A+ |
v2 scored five points higher than v1, showing that the precision upgrades translate into measurable reasoning gains.
### Quality (WikiText-2 Perplexity)

Lower is better. Same harness for every row: WikiText-2 test split, 128 windows × 2048 tokens, batch size 1.
| Model | Mean | Median | Trimmed Mean | Notes |
|---|---|---|---|---|
| Uniform 4-bit (65 GB) | ≥6.0 | ≥6.0 | ≥6.2 | Baseline MLX 4-bit quant (identical script). |
| v1 — 3.655 BPW / 52 GB | 5.3536 | 5.3639 | 5.6631 | Previous mixed-precision release. |
| v2 — 3.748 BPW / 53.5 GB | 5.3206 | 5.3430 | 5.6282 | Current release (this card). |
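For reproduction, the three statistics can be aggregated from per-window perplexities as below. The trim fraction is an assumption, since the card does not state which one its harness uses:

```python
import math
import statistics

def ppl_stats(window_nlls: list[float], trim_frac: float = 0.1) -> dict:
    """Aggregate per-window perplexities into mean/median/trimmed-mean.

    window_nlls: mean negative log-likelihood (nats/token) for each
    2048-token window; perplexity is exp(nll). trim_frac is an assumed
    symmetric trim (10% off each tail).
    """
    ppls = sorted(math.exp(nll) for nll in window_nlls)
    k = int(len(ppls) * trim_frac)
    trimmed = ppls[k:len(ppls) - k] if k else ppls
    return {
        "mean": statistics.fmean(ppls),
        "median": statistics.median(ppls),
        "trimmed_mean": statistics.fmean(trimmed),
    }
```

Run over the same 128 windows, this reproduces the three columns of the table for any of the rows.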
## Usage

```python
from mlx_vlm import load, generate

model, processor = load("MoringLabs/Qwen3.5-122B-A10B-MLX-3.7bit-VL-v2")

# Text conversation
response = generate(model, processor, prompt="Hello!", max_tokens=200)

# Image understanding
response = generate(model, processor, prompt="Describe this image",
                    image="photo.jpg", max_tokens=200)
print(response)
```
## License
Apache 2.0
Base model: Qwen/Qwen3.5-122B-A10B