Quant optimized for quality / speed on a Strix Halo 128GiB system. Possibly also beneficial on DGX Spark and similar systems.
The TL;DR is that this quant achieves both superior quality and superior speed compared to a homogeneous Q6_K.
Q6_K
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|
| glm4moe ?B Q6_K | 89.58 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | pp2048 | 187.87 ± 0.00 |
| glm4moe ?B Q6_K | 89.58 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | tg256 | 16.73 ± 0.00 |
| glm4moe ?B Q6_K | 89.58 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | pp2048 @ d8192 | 120.83 ± 0.00 |
| glm4moe ?B Q6_K | 89.58 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | tg256 @ d8192 | 13.41 ± 0.00 |
This quant
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | pp2048 | 296.28 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | tg256 | 15.58 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | pp2048 @ d8192 | 160.92 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | tg256 @ d8192 | 12.69 ± 0.00 |
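For reference, numbers like the above come from llama-bench. A run along these lines (the model path is illustrative) should reproduce the same test matrix:

```sh
# Sketch of the benchmark settings used in the tables above:
# -ngl 999 offloads all layers, -ub 1024 sets n_ubatch, -fa 1 enables flash attention.
# -p/-n select the pp2048 and tg256 tests; -d 0,8192 repeats them at depth 8192.
llama-bench -m GLM-4.6V-HALO.gguf \
  -ngl 999 -ub 1024 -fa 1 \
  -p 2048 -n 256 -d 0,8192
```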
What this quant does is move some hot layers (attention, shared expert) to q8_0 for faster processing. Q6_K is the optimal size for the Halo, but it is also the slowest quant, made worse by the fact that it performs poorly on the MMQ kernels that GLM4Moe always uses due to its high expert count. For detailed RDNA 3.0 benchmarks, see my kernel selection PR here, as well as Johnathan's follow-up RDNA 3.5 version here.
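As a sketch of how such a mixed quant can be produced with llama.cpp's llama-quantize and its per-tensor override flag (the exact tensor-name patterns for the GLM4 MoE attention and shared-expert tensors are assumptions on my part; verify them against the tensor names in your GGUF):

```sh
# Hypothetical recipe: Q6_K base, with attention and shared-expert tensors
# forced to q8_0. The --tensor-type patterns are regexes matched against
# GGUF tensor names; names shown here are assumed, not verified.
llama-quantize \
  --tensor-type "attn_.*=q8_0" \
  --tensor-type "ffn_.*_shexp=q8_0" \
  GLM-4.6V-F16.gguf GLM-4.6V-HALO.gguf Q6_K
```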
Additionally, context should still fit at ≥90k tokens with room for a graphical desktop, assuming a large TTM limit was set. Completely headless, you might be able to reach the full 128k.
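For context, on Strix Halo the GPU-visible pool is governed by the TTM/GTT limits, which can be raised via kernel boot parameters. A commonly cited sketch for a 128GiB machine follows; the exact values and parameter names vary by kernel and driver, so treat this as an assumption to verify rather than a recipe:

```sh
# Hypothetical GRUB snippet raising the TTM page limits so ~105 GiB of
# system RAM is addressable as GTT (4 KiB pages; 27648000 pages ≈ 105 GiB).
# On the DKMS driver the parameters may be amdttm.* instead of ttm.*.
GRUB_CMDLINE_LINUX_DEFAULT="... ttm.pages_limit=27648000 ttm.page_pool_size=27648000"
```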
Everything above assumes you're running ROCm, not Vulkan. Vulkan being faster is a myth: while it might look like +15% for tg512, at even a modest context depth prompt processing becomes catastrophically slow.
This quant, Vulkan
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | Vulkan | 999 | 1024 | 1 | pp2048 | 244.54 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | Vulkan | 999 | 1024 | 1 | tg256 | 17.18 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | Vulkan | 999 | 1024 | 1 | pp2048 @ d8192 | 33.08 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | Vulkan | 999 | 1024 | 1 | tg256 @ d8192 | 13.71 ± 0.00 |