Qwen3.5-397B-A17B optimized for MLX!

  • Mixed-precision quantization balances throughput, accuracy, and memory.
  • Similar quality to a 4-bit baseline but requires 40% less memory.
  • Fixed chat template allows more reliable prompt caching.
  • This version does NOT support vision (image input).

Also available as a larger 178 GB 3.5-bit version: https://huggingface.co/spicyneuron/Qwen3.5-397B-A17B-MLX-3.5bit

Usage

# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/Qwen3.5-397B-A17B-MLX-2.6bit
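Once the server is running, any OpenAI-compatible client can call it. Below is a minimal sketch using only the Python standard library; the endpoint and model name mirror the command above, while the payload follows the standard chat-completions schema (the prompt text and `max_tokens` value are just placeholders).

```python
import json
import urllib.request

# Endpoint served by `mlx_lm.server` (see the command above).
URL = "http://localhost:8080/v1/chat/completions"

def build_request(prompt: str) -> urllib.request.Request:
    """Assemble a standard OpenAI-style chat-completions request."""
    body = {
        "model": "spicyneuron/Qwen3.5-397B-A17B-MLX-2.6bit",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Explain mixed-precision quantization in one sentence.")
# resp = urllib.request.urlopen(req)  # requires the server to be running
# print(json.load(resp)["choices"][0]["message"]["content"])
```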

Methodology

Quantized with an mlx-lm fork, drawing inspiration from Unsloth/AesSedai/ubergarm-style mixed-precision GGUFs. MLX's quantization options differ from llama.cpp's, but the principles are the same:

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision.
  • More tolerant layers like MoE experts get lower precision.
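The idea behind these choices can be expressed as a predicate that maps each weight's path to quantization parameters. The sketch below is purely illustrative: the name patterns and bit widths are hypothetical, not the exact recipe used for this model.

```python
# Pick per-layer quantization settings from the layer's name.
# Illustrative only: patterns and bit widths are hypothetical,
# not this model's actual recipe.
SENSITIVE = ("router", "gate", "attn", "embed", "lm_head")

def quant_params(path: str) -> dict:
    """Return {bits, group_size} for the weight at `path`."""
    if any(key in path for key in SENSITIVE):
        return {"bits": 6, "group_size": 64}  # routing/attention/embeddings
    if "experts" in path:
        return {"bits": 2, "group_size": 64}  # tolerant MoE expert weights
    return {"bits": 4, "group_size": 64}      # everything else

print(quant_params("model.layers.0.mlp.router.weight"))          # high precision
print(quant_params("model.layers.0.mlp.experts.0.up_proj.weight"))  # low precision
```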

Benchmarks

| metric | lmstudio-community 4-bit | 2.6-bit (this model) | 3.5-bit |
|---|---|---|---|
| perplexity | 3.919 ± 0.019 | 3.852 ± 0.018 | 3.919 ± 0.019 |
| hellaswag | 0.594 ± 0.022 | 0.598 ± 0.022 | 0.622 ± 0.022 |
| piqa | 0.798 ± 0.018 | 0.802 ± 0.018 | 0.804 ± 0.018 |
| winogrande | 0.744 ± 0.020 | 0.718 ± 0.020 | 0.746 ± 0.019 |
| p1024/g512 prompt (tok/s) | 490.702 | 489.545 | 479.453 |
| p1024/g512 gen (tok/s) | 39.192 | 38.398 | 35.547 |
| p1024/g512 mem (GB) | 225.095 | 131.523 | 179.842 |