A quant optimized for quality and speed on a Strix Halo 128 GiB system. Possibly also beneficial on DGX Spark and similar unified-memory systems.

The TL;DR is that this quant achieves both superior quality and speed compared to a homogeneous Q6_K.

Depending on your TTM (GTT memory) settings you should be able to fit between 100k and 200k tokens of context, or more if you disable vision; see the sketch below.
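
How much context fits depends on how much unified memory the kernel lets the GPU map. As a minimal sketch (the values below are assumptions for a 128 GiB machine, not settings shipped with this quant), the TTM limits can be raised via kernel module parameters:

```sh
# /etc/modprobe.d/ttm.conf — example values, adjust to your RAM.
# TTM pages are 4 KiB, so 30408704 pages ≈ 116 GiB of GTT.
options ttm pages_limit=30408704 page_pool_size=30408704
```

After editing, regenerate your initramfs and reboot for the new limits to take effect.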

This quant on llama.cpp build 8245 (2026/03/08):

| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | ------- | -------- | -- | ---- | --- |
| qwen35moe 122B.A10B Q4_1 | 94.79 GiB | 122.11 B | ROCm | 999 | 1024 | 1024 | 1 | pp2048 | 274.99 ± 0.00 |
| qwen35moe 122B.A10B Q4_1 | 94.79 GiB | 122.11 B | ROCm | 999 | 1024 | 1024 | 1 | tg256 | 16.62 ± 0.00 |
| qwen35moe 122B.A10B Q4_1 | 94.79 GiB | 122.11 B | ROCm | 999 | 1024 | 1024 | 1 | pp2048 @ d8192 | 238.78 ± 0.00 |
| qwen35moe 122B.A10B Q4_1 | 94.79 GiB | 122.11 B | ROCm | 999 | 1024 | 1024 | 1 | tg256 @ d8192 | 16.68 ± 0.00 |
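
A command along these lines should reproduce the table with llama-bench (the model filename is a placeholder; the flags match the columns shown):

```sh
# llama-bench from llama.cpp (ROCm build): pp2048/tg256 at depths 0 and 8192,
# full offload (-ngl 999), 1024 batch/ubatch, flash attention enabled.
llama-bench -m Qwen3.5-122B-A10B-HALO.gguf \
  -ngl 999 -b 1024 -ub 1024 -fa 1 \
  -p 2048 -n 256 -d 0,8192
```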

Ignore the dtype displayed in the header above; refer to the per-tensor types instead.
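
One way to inspect the actual per-tensor quantization is the gguf-dump script from the gguf Python package (a sketch; your filename will differ):

```sh
pip install gguf                        # provides the gguf-dump console script
gguf-dump Qwen3.5-122B-A10B-HALO.gguf   # prints metadata plus every tensor's quant type
```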

See the GLM version for more details on theory and comparisons.
