# Qwen3.5-122B-A10B-MLX-3.7bit-VL
Mixed-precision MLX quantization of Qwen3.5-122B-A10B — Alibaba's latest MoE model with full vision support preserved in BF16.
- 3.655 BPW | 52 GB | Vision preserved (BF16)
## 🚀 Hardware Optimization
This model brings 122B-class multimodal performance to Apple Silicon. Mixed-precision quantization compresses the model from uniform 4-bit's 65 GB down to 52 GB, a 13 GB reduction, while keeping the full vision pipeline at BF16 for lossless image understanding.
This optimization unlocks two distinct local inference experiences:
- 64GB Unified Memory (Minimum): At 65 GB, the uniform 4-bit quantization cannot fit in 64 GB of unified memory. This quantization breaks that barrier, fitting a full 122B multimodal model into 64 GB and making local 122B vision+language inference possible on edge devices.
- 96GB+ Unified Memory (Recommended): Delivers a smooth, uncompromised multimodal experience. The smaller footprint frees substantial headroom for the KV cache, enabling long-context use for both text and vision tasks.
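As a sanity check, the 52 GB figure follows directly from the bits-per-weight number: a rough sketch (parameter count and BPW taken from this card; real checkpoints add small metadata overhead):

```python
# Rough size estimate: total parameters x bits-per-weight, in GiB.
# Figures are taken from this model card; treat the result as an
# approximation, not an exact checkpoint size.
PARAMS = 122e9   # total parameters (122B)
BPW = 3.655      # effective bits per weight for this quant

size_gib = PARAMS * BPW / 8 / 2**30
print(f"~{size_gib:.1f} GiB")  # lands near the 52 GB quoted above
```

The same arithmetic puts a uniform 4-bit checkpoint (with per-group quantization overhead) at roughly the 65 GB this card reports.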
## Quantization
4-tier mixed precision by functional sensitivity:
| Bits | % Params | Layers |
|---|---|---|
| BF16 | ~2% | Vision tower, norms, router, conv1d (preserves full visual fidelity) |
| 6-bit | ~8% | Embeddings, v/o_proj, edge layers, full-attention q/k |
| 4-bit | ~3% | DeltaNet attention, shared expert |
| 3-bit | ~87% | Expert FFN (256 experts, 8 active per token) |
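The effective ~3.7 BPW can be roughly reproduced as the parameter-weighted average of these tiers. A sketch (the percentages are the approximate shares from the table; per-group scale/bias metadata, not modeled here, pushes the real figure slightly higher, toward the quoted 3.655):

```python
# Parameter-weighted average over the quantization tiers listed above.
# Shares are approximate; group-wise quantization metadata adds a small
# constant overhead per weight on top of this raw average.
tiers = {16: 0.02, 6: 0.08, 4: 0.03, 3: 0.87}  # bits -> share of params

avg_bpw = sum(bits * share for bits, share in tiers.items())
print(f"weighted average ≈ {avg_bpw:.2f} bits/weight")
```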
## Benchmark (M2 Max, 96GB)
| | This (3.7bit) | Uniform 4-bit |
|---|---|---|
| Model size | 52 GB | 65 GB |
| Peak memory (ctx=4k) | 55.1 GB | 67.7 GB |
| Prefill (1k ctx) | 219.0 tok/s | 210.5 tok/s |
| Prefill (4k ctx) | 230.6 tok/s | 227.7 tok/s |
| Generation (1k ctx) | 36.2 tok/s | 38.3 tok/s |
| Generation (4k ctx) | 33.6 tok/s | 35.8 tok/s |
Speed is virtually identical, but the 13GB saved makes the difference between a usable and unusable long-context experience on 96GB machines.
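The peak-memory rows above translate directly into KV-cache headroom on each hardware tier; a quick sketch (upper bounds only, since macOS and other processes also consume unified memory):

```python
# Headroom left for KV cache and activations, per memory tier and quant.
# Peak-memory figures at ctx=4k are taken from the benchmark table;
# negative headroom means the model does not fit at all.
total = {"64GB": 64.0, "96GB": 96.0}
peak = {"3.7bit": 55.1, "4bit": 67.7}

headroom = {(m, q): round(total[m] - peak[q], 1) for m in total for q in peak}
for (m, q), h in sorted(headroom.items()):
    print(f"{m} / {q}: {h:+.1f} GB headroom")
```

The uniform 4-bit row goes negative on 64 GB machines, which is the "cannot fit" case described earlier.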
## Quality (WikiText-2 Perplexity)
Lower is better. Evaluated on 128 sequences × 2048 tokens.
| Metric | Value |
|---|---|
| Mean Perplexity | 5.3536 |
| Median Perplexity | 5.3639 |
| Trimmed Mean Perplexity | 5.6631 |
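For reference, per-sequence perplexity of this kind is the exponential of the mean token negative log-likelihood, and the statistics above aggregate over sequences. A minimal sketch (the NLL values below are illustrative placeholders, not the actual evaluation data, which uses 128 sequences of 2048 tokens):

```python
import math
import statistics

def perplexity(token_nlls):
    """Perplexity of one sequence: exp of the mean negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Placeholder per-sequence token NLLs, purely for illustration.
sequences = [[1.6, 1.7], [1.7, 1.8], [1.5, 1.9]]
seq_ppls = [perplexity(nlls) for nlls in sequences]

mean_ppl = statistics.mean(seq_ppls)
median_ppl = statistics.median(seq_ppls)
```

A trimmed mean additionally discards the highest and lowest tails of `seq_ppls` before averaging, which reduces the influence of outlier sequences.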
## Usage
```python
from mlx_vlm import load, generate

model, processor = load("MoringLabs/Qwen3.5-122B-A10B-MLX-3.7bit-VL")

# Text chat
response = generate(model, processor, prompt="Hello!", max_tokens=200)

# Image understanding
response = generate(model, processor, prompt="Describe this image", image="photo.jpg", max_tokens=200)
print(response)
```
## License
Apache 2.0
Base model: Qwen/Qwen3.5-122B-A10B