Qwen3.5-122B-A10B-MLX-3.7bit-VL-v2

Enhanced mixed-precision MLX quantization of Qwen3.5-122B-A10B — Alibaba's latest MoE model with full vision support preserved in BF16.

  • 3.748 BPW | 53.5 GB | Vision preserved (BF16)

What's New in v2

v2 spends an additional ~1.5 GB over v1 to surgically upgrade the most sensitive bottleneck points:

| Upgrade | Change | Cost | Rationale |
|---|---|---|---|
| embed/lm_head | 6→8-bit | +0.3 GB | Every token passes through; directly affects the output probability distribution |
| v_proj / o_proj | 6→8-bit | +0.4 GB | Attention value/output projections; quality-critical across all layers |
| Edge 6-layer attention | 6→8-bit | +0.3 GB | First/last 6 layers build the initial/final representations |
| Edge 6-layer shared_expert | 6→8-bit | +0.2 GB | shared_expert is active on every forward pass; edge layers are the most sensitive |
| Mid-layer shared_expert | 4→6-bit | +0.2 GB | The always-active expert warrants higher precision than the sparse experts |
| Entry/exit layer switch_mlp | 3→4-bit | +0.1 GB | Layer 0 errors propagate through all 48 layers; layer 47 feeds the final output |

Net effect: +1.5 GB of model size buys meaningfully better coherence, fewer repetitions, and improved instruction following, most noticeable in long-form generation and complex reasoning tasks.

🚀 Hardware Optimization

This model brings 122B-class multimodal performance to Apple Silicon. Advanced mixed-precision quantization compresses the model from uniform 4-bit's 65 GB down to 53.5 GB (an 11.5 GB reduction) while preserving the full vision pipeline at BF16 precision for lossless image understanding.

  • 64 GB unified memory (minimum): A uniform 4-bit quantization weighs 65 GB and cannot fit in 64 GB at all. This quantization breaks that barrier, fitting a full 122B multimodal model into 64 GB and making local 122B vision+language inference possible on edge devices for the first time.
  • 96 GB+ unified memory (recommended): Delivers an uncompromised, smooth multimodal experience. The efficient footprint leaves substantial headroom for the KV cache, unlocking long-context use for both text and vision tasks.
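The memory figures above can be sanity-checked with simple arithmetic. A minimal sketch, under two assumptions not stated in the card: a nominal 122e9 parameter count, and the card's "GB" read as GiB. The effective BPW figure already folds in quantization scales and the BF16 tensors.

```python
# Back-of-the-envelope size check for the figures above.
# Assumptions (not from the card): nominal 122e9 parameters; "GB" means GiB.

def quantized_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of a quantized model in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

print(f"{quantized_footprint_gib(122e9, 3.748):.1f} GiB")  # close to the stated 53.5 GB
```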

Quantization

5-tier mixed precision by functional sensitivity:

| Bits | % Params | Layers |
|---|---|---|
| BF16 | ~2% | Vision tower, norms, router, conv1d (preserving full visual fidelity) |
| 8-bit | ~10% | embed/lm_head, v_proj/o_proj (all layers), edge 6-layer attention + shared_expert |
| 6-bit | ~6% | Full-attention q/k_proj (12 layers), mid-layer shared_expert |
| 4-bit | ~2% | DeltaNet attention, entry/exit layer switch_mlp (layers 0 & 47) |
| 3-bit | ~80% | Expert FFN (256 experts, 8 active per token), the bulk parameter pool |

v1 vs v2 Comparison

| Component | v1 (3.655 BPW) | v2 (3.748 BPW) |
|---|---|---|
| Model size | 52 GB | 53.5 GB |
| embed/lm_head | 6-bit | 8-bit |
| v_proj / o_proj | 6-bit | 8-bit |
| Edge attention | 6-bit | 8-bit |
| Edge shared_expert | 6-bit | 8-bit |
| Mid shared_expert | 4-bit | 6-bit |
| Entry/exit switch_mlp | 3-bit | 4-bit |
| Vision pipeline | BF16 | BF16 |
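As a rough consistency check, the BPW increase alone implies a size delta in the same ballpark as the table's 1.5 GB difference (a sketch assuming a nominal 122e9 parameters and GiB units; the residual gap would come from metadata and rounding).

```python
# Size delta implied by the BPW increase from v1 to v2.
# Assumptions (not from the card): nominal 122e9 parameters, GiB units.

def bpw_delta_gib(bpw_old: float, bpw_new: float, n_params: float = 122e9) -> float:
    """GiB added by raising the effective bits-per-weight."""
    return n_params * (bpw_new - bpw_old) / 8 / 2**30

print(f"{bpw_delta_gib(3.655, 3.748):.2f} GiB")  # same ballpark as the stated ~1.5 GB
```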

Benchmark Results

Primary Intelligence Benchmarks

| Benchmark | Accuracy | Correct | Total |
|---|---|---|---|
| MMLU | 87.5% | 875 | 1000 |
| CMMLU | 89.3% | 268 | 300 |
| TruthfulQA | 89.6% | 732 | 817 |
| HumanEval | 90.8% | 149 | 164 |
| MBPP | 82.5% | 165 | 200 |
| LiveCodeBench | 58.0% | 58 | 100 |

Tests were conducted with the MLX implementation. MMLU, CMMLU, MBPP, and LiveCodeBench were subsampled as indicated to balance evaluation time against statistical significance.

Coding & Reasoning (10-Question Test Suite)

Tested against Claude Opus 4.6 reference answers on a 10-question benchmark covering debugging, algorithms, security, concurrency, system design, and logic reasoning:

| Model | Score | Grade |
|---|---|---|
| Claude Opus 4.6 (reference) | 98/100 | S |
| Qwen3.5-122B v2 (this model) | 95/100 | S |
| Qwen3.5-122B v1 | 90/100 | A+ |

v2 scores five points higher than the previous version, demonstrating that the precision upgrades translate into measurable reasoning gains.

Quality (WikiText-2 Perplexity)

Lower is better. Same harness for every row: WikiText-2 test split, 128 × 2048 tokens, batch size 1.

| Model | Mean | Median | Trimmed Mean | Notes |
|---|---|---|---|---|
| Uniform 4-bit (65 GB) | ≥6.0 | ≥6.0 | ≥6.2 | Baseline MLX 4-bit quant (identical script) |
| v1 (3.655 BPW / 52 GB) | 5.3536 | 5.3639 | 5.6631 | Previous mixed-precision release |
| v2 (3.748 BPW / 53.5 GB) | 5.3206 | 5.3430 | 5.6282 | Current release (this card) |
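For clarity on what the columns mean, here is how such statistics are typically computed from per-window scores. A minimal sketch: the per-token NLL extraction is elided, and the 10%-per-tail trim is an assumed convention; the card does not state which trimming fraction was used.

```python
import math
from statistics import mean, median

def window_perplexity(token_nlls: list[float]) -> float:
    """Perplexity of one window from its per-token negative log-likelihoods (nats)."""
    return math.exp(mean(token_nlls))

def summarize(window_ppls: list[float], trim: float = 0.10) -> tuple[float, float, float]:
    """Mean, median, and symmetric trimmed mean over per-window perplexities.

    The 10%-per-tail trim is an assumed convention, not taken from the card.
    """
    s = sorted(window_ppls)
    k = int(len(s) * trim)
    trimmed = s[k:len(s) - k] if k else s
    return mean(window_ppls), median(window_ppls), mean(trimmed)

# Synthetic example with 10 windows (the real harness uses 128 windows of 2048 tokens)
m, md, tm = summarize([5.0, 5.1, 5.2, 5.2, 5.3, 5.3, 5.4, 5.4, 5.5, 7.0])
print(round(m, 2), round(md, 2), round(tm, 2))  # → 5.44 5.3 5.3
```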

Usage

```python
from mlx_vlm import load, generate

model, processor = load("MoringLabs/Qwen3.5-122B-A10B-MLX-3.7bit-VL-v2")

# Text conversation
response = generate(model, processor, prompt="Hello!", max_tokens=200)
print(response)

# Image understanding
response = generate(model, processor, prompt="Describe this image", image="photo.jpg", max_tokens=200)
print(response)
```

License

Apache 2.0
