A quant optimized for quality and speed on a Strix Halo 128 GiB system. Possibly also beneficial on DGX Spark and similar systems.

TL;DR: this quant achieves both superior quality and superior speed compared to a homogeneous Q6_K.

### Q6_K

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -------- | -- | ---- | --- |
| glm4moe ?B Q6_K | 89.58 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | pp2048 | 187.87 ± 0.00 |
| glm4moe ?B Q6_K | 89.58 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | tg256 | 16.73 ± 0.00 |
| glm4moe ?B Q6_K | 89.58 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | pp2048 @ d8192 | 120.83 ± 0.00 |
| glm4moe ?B Q6_K | 89.58 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | tg256 @ d8192 | 13.41 ± 0.00 |
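For reference, here's a hedged reconstruction of the `llama-bench` invocation behind these tables; the model path is a placeholder and exact flag spellings may vary by llama.cpp revision:

```sh
# Assumed invocation matching the table columns (ngl=999, n_ubatch=1024,
# fa=1, pp2048/tg256 at depths 0 and 8192). Not the verified command used
# to produce the numbers above; adjust the model path for your setup.
./llama-bench -m GLM-4.6V-Q6_K.gguf -ngl 999 -ub 1024 -fa 1 \
  -p 2048 -n 256 -d 0,8192
```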

### This quant

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -------- | -- | ---- | --- |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | pp2048 | 296.28 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | tg256 | 15.58 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | pp2048 @ d8192 | 160.92 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | ROCm | 999 | 1024 | 1 | tg256 @ d8192 | 12.69 ± 0.00 |

What this quant does is move some hot tensors (attention, shared expert) to Q8_0 for faster processing. Q6_K is the optimal size for the Halo, but it's also the slowest quant, made worse by the fact that it performs poorly on the MMQ kernels that GLM4Moe always uses due to its high expert count. For detailed RDNA 3.0 benchmarks, you can view my kernel selection PR here, as well as Johnathan's follow-up RDNA 3.5 version here.
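A minimal sketch of how a mixed quant like this could be produced with llama.cpp's `llama-quantize` and its per-tensor `--tensor-type` overrides; the tensor-name patterns and file names below are illustrative assumptions, not the exact recipe used for this upload:

```sh
# Sketch only: assumed tensor-name patterns and paths, not the verified
# recipe for this quant. Base quant stays Q6_K; attention and shared-expert
# tensors are overridden to Q8_0.
./llama-quantize \
  --tensor-type attn_q=q8_0 \
  --tensor-type attn_k=q8_0 \
  --tensor-type attn_v=q8_0 \
  --tensor-type attn_output=q8_0 \
  --tensor-type ffn_gate_shexp=q8_0 \
  --tensor-type ffn_up_shexp=q8_0 \
  --tensor-type ffn_down_shexp=q8_0 \
  GLM-4.6V-F16.gguf GLM-4.6V-HALO.gguf Q6_K
```

This keeps the bulk of the weight budget (the routed experts) at Q6_K, so the size stays close to homogeneous Q6_K while the hot path runs on the faster Q8_0 MMQ kernels.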

Additionally, the context should still fit ≥90k with room for a graphical desktop, assuming a large TTM limit was set. Completely headless, you might be able to reach the full 128k.
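For reference, a hedged example of raising the TTM limit via the kernel command line on a 128 GiB Strix Halo; the page counts are assumptions to be tuned for your system (4 KiB pages, so 27648000 pages ≈ 105 GiB):

```sh
# Hedged example, e.g. in /etc/default/grub; values are assumptions, not
# this card's verified settings. 27648000 pages x 4 KiB ≈ 105 GiB that the
# GPU can address through TTM/GTT. Run update-grub (or equivalent) after.
GRUB_CMDLINE_LINUX_DEFAULT="... ttm.pages_limit=27648000 ttm.page_pool_size=27648000"
```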

Everything above assumes you're running ROCm, not Vulkan. Vulkan being faster is a myth: while it might look like +15% for tg512, at even a modest context depth the prompt processing speed becomes catastrophic.

### This quant, Vulkan

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -------- | -- | ---- | --- |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | Vulkan | 999 | 1024 | 1 | pp2048 | 244.54 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | Vulkan | 999 | 1024 | 1 | tg256 | 17.18 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | Vulkan | 999 | 1024 | 1 | pp2048 @ d8192 | 33.08 ± 0.00 |
| glm4moe ?B Q8_0 | 90.80 GiB | 106.85 B | Vulkan | 999 | 1024 | 1 | tg256 @ d8192 | 13.71 ± 0.00 |
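If you're not sure which backend your build uses, here's a hedged sketch of building llama.cpp with the ROCm (HIP) backend; the flags follow the upstream build docs, and the `gfx1151` target (Strix Halo) is an assumption to verify against your llama.cpp revision:

```sh
# Hedged build sketch for the ROCm/HIP backend; check the upstream docs for
# your revision. gfx1151 is the Strix Halo GPU target.
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```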