Qwen3.5-122B-A10B-MLX-3.7bit-VL

Mixed-precision MLX quantization of Qwen3.5-122B-A10B — Alibaba's latest MoE model with full vision support preserved in BF16.

  • 3.655 BPW | 52 GB | Vision preserved (BF16)

🚀 Hardware Optimization

This model brings 122B-class multimodal performance to Apple Silicon. Mixed-precision quantization compresses the model from 65 GB (uniform 4-bit) down to 52 GB — a 13 GB reduction — while the full vision pipeline stays at BF16 for lossless image understanding.

This optimization unlocks two distinct local inference experiences:

  • 64GB Unified Memory (Minimum): The 65 GB uniform 4-bit quantization simply cannot fit in 64 GB of unified memory. At 52 GB, this quantization breaks that barrier, making local 122B-class vision+language inference possible on 64 GB machines for the first time.
  • 96GB+ Unified Memory (Recommended): Delivers an uncompromised multimodal experience. The smaller footprint leaves substantial headroom for the KV cache, enabling long-context use for both text and vision tasks.

Quantization

4-tier mixed precision by functional sensitivity:

| Bits | % Params | Layers |
|------|----------|--------|
| BF16 | ~2% | Vision tower, norms, router, conv1d — preserving full visual fidelity |
| 6-bit | ~8% | Embeddings, v/o_proj, edge layers, full_attention q/k |
| 4-bit | ~3% | DeltaNet attention, shared expert |
| 3-bit | ~87% | Expert FFN (256 experts, 8 active per token) |
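As a rough cross-check, the effective bits-per-weight can be estimated from the tier table. The percentages are rounded, and quantization group scales/biases add overhead, so this back-of-envelope figure only approximates the reported 3.655 BPW:

```python
# Approximate effective bits-per-weight from the tier table above.
# Percentages are the rounded values from the table; real quantization
# adds per-group scale/bias overhead on top of the raw weight bits.
tiers = {
    16: 0.02,  # BF16: vision tower, norms, router, conv1d
    6:  0.08,  # embeddings, v/o_proj, edge layers, full_attention q/k
    4:  0.03,  # DeltaNet attention, shared expert
    3:  0.87,  # expert FFN (256 experts, 8 active per token)
}

effective_bpw = sum(bits * frac for bits, frac in tiers.items())
print(f"{effective_bpw:.2f} BPW")  # ~3.53 before group-scale overhead
```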

Benchmark (M2 Max 96GB)

| Metric | This (3.7-bit) | Uniform 4-bit |
|--------|----------------|---------------|
| Model size | 52 GB | 65 GB |
| Peak memory (ctx=4k) | 55.1 GB | 67.7 GB |
| Prefill (1k ctx) | 219.0 tok/s | 210.5 tok/s |
| Prefill (4k ctx) | 230.6 tok/s | 227.7 tok/s |
| Generation (1k ctx) | 36.2 tok/s | 38.3 tok/s |
| Generation (4k ctx) | 33.6 tok/s | 35.8 tok/s |

Throughput is virtually identical, but the 13 GB saved is the difference between a usable and an unusable long-context experience on 96 GB machines.
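The headroom arithmetic behind that claim, restated from the measured peak figures in the table above (actual usable memory depends on the macOS/Metal wired-memory limit, which varies by configuration):

```python
# Headroom left for KV cache and longer contexts on a 96 GB machine,
# using the measured peak-memory figures (ctx=4k) from the table above.
# Note: macOS caps GPU-wired memory below the physical total, so the
# real usable headroom is somewhat smaller than this simple difference.
total_gb = 96.0
peak_37bit = 55.1  # this model, peak at ctx=4k
peak_4bit = 67.7   # uniform 4-bit, peak at ctx=4k

headroom_37bit = total_gb - peak_37bit  # room to grow the context
headroom_4bit = total_gb - peak_4bit
print(f"3.7-bit headroom: {headroom_37bit:.1f} GB")
print(f"4-bit headroom:   {headroom_4bit:.1f} GB")
```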

Quality (WikiText-2 Perplexity)

Lower is better. Evaluated on 128 sequences × 2048 tokens.

| Metric | Value |
|--------|-------|
| Mean Perplexity | 5.3536 |
| Median Perplexity | 5.3639 |
| Trimmed Mean Perplexity | 5.6631 |
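For readers unfamiliar with the three aggregates, here is a minimal sketch of how mean, median, and trimmed-mean perplexity are computed over per-sequence values. The helper and the demo numbers are illustrative, not the actual eval harness or the WikiText-2 results above:

```python
from statistics import mean, median

def aggregate_ppl(ppls, trim=0.1):
    """Mean, median, and trimmed-mean perplexity over per-sequence values.

    Illustrative helper (not the actual eval code): the trimmed mean
    drops the `trim` fraction of values from each tail, which damps the
    influence of a few outlier sequences.
    """
    s = sorted(ppls)
    k = int(len(s) * trim)
    trimmed = s[k:len(s) - k] if k else s
    return mean(ppls), median(ppls), mean(trimmed)

# Toy per-sequence perplexities (hypothetical, with one outlier):
demo = [4.8, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.9, 6.4, 9.0]
m, md, tm = aggregate_ppl(demo)
print(m, md, tm)  # mean is pulled up by the outlier; trimmed mean less so
```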

Usage

```python
from mlx_vlm import load, generate

model, processor = load("MoringLabs/Qwen3.5-122B-A10B-MLX-3.7bit-VL")

# Text chat
response = generate(model, processor, prompt="Hello!", max_tokens=200)
print(response)

# Image understanding
response = generate(model, processor, prompt="Describe this image", image="photo.jpg", max_tokens=200)
print(response)
```

License

Apache 2.0
