A quant optimized for quality and speed on a Strix Halo 128 GiB system. Possibly also beneficial on DGX Spark and similar systems.

TL;DR: this quant achieves both superior quality and superior speed compared to a homogeneous Q6_K.

### Q6_K

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -------- | -- | ---- | --- |
| glm4moe ?B Q6_K | 89.58 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | pp2048 | 187.87 ± 0.00 |
| glm4moe ?B Q6_K | 89.58 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | tg256 | 16.73 ± 0.00 |
| glm4moe ?B Q6_K | 89.58 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | pp2048 @ d8192 | 120.83 ± 0.00 |
| glm4moe ?B Q6_K | 89.58 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | tg256 @ d8192 | 13.41 ± 0.00 |
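For reference, here's a hedged reconstruction of the `llama-bench` invocation behind these tables; the model path is a placeholder and exact flag spellings may vary by llama.cpp revision:

```sh
# Assumed invocation matching the table columns (ngl=999, n_ubatch=1024,
# fa=1, pp2048/tg256 at depths 0 and 8192). Not the verified command used
# to produce the numbers above; adjust the model path for your setup.
./llama-bench -m GLM-4.6V-Q6_K.gguf -ngl 999 -ub 1024 -fa 1 \
  -p 2048 -n 256 -d 0,8192
```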

### This quant

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -------- | -- | ---- | --- |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | pp2048 | 296.28 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | tg256 | 15.58 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | pp2048 @ d8192 | 160.92 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | tg256 @ d8192 | 12.69 ± 0.00 |

What this quant does is move some hot tensors (attention, shared expert) to Q8_0 for faster processing. Q6_K is the optimal size for the Halo, but it's also the slowest quant, made worse by the fact that it performs poorly on the MMQ kernels that GLM4Moe always uses due to its high expert count. For detailed RDNA 3.0 benchmarks, you can view my kernel selection PR here, as well as Johnathan's follow-up RDNA 3.5 version here.
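A minimal sketch of how a mixed quant like this could be produced with llama.cpp's `llama-quantize` and its per-tensor `--tensor-type` overrides; the tensor-name patterns and file names below are illustrative assumptions, not the exact recipe used for this upload:

```sh
# Sketch only: assumed tensor-name patterns and paths, not the verified
# recipe for this quant. Base quant stays Q6_K; attention and shared-expert
# tensors are overridden to Q8_0.
./llama-quantize \
  --tensor-type attn_q=q8_0 \
  --tensor-type attn_k=q8_0 \
  --tensor-type attn_v=q8_0 \
  --tensor-type attn_output=q8_0 \
  --tensor-type ffn_gate_shexp=q8_0 \
  --tensor-type ffn_up_shexp=q8_0 \
  --tensor-type ffn_down_shexp=q8_0 \
  GLM-4.6V-F16.gguf GLM-4.6V-HALO.gguf Q6_K
```

This keeps the bulk of the weight budget (the routed experts) at Q6_K, so the size stays close to homogeneous Q6_K while the hot path runs on the faster Q8_0 MMQ kernels.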

Additionally, the context should still fit ≥90k with room for a graphical desktop, assuming a large TTM limit was set. Completely headless, you might be able to reach the full 128k.
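For reference, a hedged example of raising the TTM limit via the kernel command line on a 128 GiB Strix Halo; the page counts are assumptions to be tuned for your system (4 KiB pages, so 27648000 pages ≈ 105 GiB):

```sh
# Hedged example, e.g. in /etc/default/grub; values are assumptions, not
# this card's verified settings. 27648000 pages x 4 KiB ≈ 105 GiB that the
# GPU can address through TTM/GTT. Run update-grub (or equivalent) after.
GRUB_CMDLINE_LINUX_DEFAULT="... ttm.pages_limit=27648000 ttm.page_pool_size=27648000"
```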

Everything above assumes you're running ROCm, not Vulkan. Vulkan being faster is a myth: while it might look like +15% for tg512, at even a modest context depth the prompt processing speed becomes catastrophic.

### This quant, Vulkan

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -------- | -- | ---- | --- |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | Vulkan | 999 | 1024 | 1 | pp2048 | 244.54 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | Vulkan | 999 | 1024 | 1 | tg256 | 17.18 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | Vulkan | 999 | 1024 | 1 | pp2048 @ d8192 | 33.08 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | Vulkan | 999 | 1024 | 1 | tg256 @ d8192 | 13.71 ± 0.00 |
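If you're not sure which backend your build uses, here's a hedged sketch of building llama.cpp with the ROCm (HIP) backend; the flags follow the upstream build docs, and the `gfx1151` target (Strix Halo) is an assumption to verify against your llama.cpp revision:

```sh
# Hedged build sketch for the ROCm/HIP backend; check the upstream docs for
# your revision. gfx1151 is the Strix Halo GPU target.
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```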