Quantization

#5
by NukeNotNull - opened

What quantization does the chat.z.ai site use?

It's definitely gotta be lower than FP8, or they're using a really low-precision KV cache.
Model performance just completely degrades after 100k tokens on their platform, same with GLM-5, yet other providers don't have the same issues.

z.ai still hasn't addressed this like at all.

Stop whining already 😆 I'm glad that it's OSS! And it's a really awesome model (I'm using unsloth's quantization).

Z.ai's dialogue platform does not provide GLM-5.1 services. Are you referring to the API?

Coding plan.
Users have been complaining about its output for a while.
After around 100k tokens GLM completely loses its mind and spits out nonsense.

[Perplexity chart: ppl-GLM-5.1]

ik_llama.cpp offers the best quality quantizations as well as speed (especially for hybrid CPU+GPU inference). I'd definitely recommend checking out ubergarm/GLM-5.1-GGUF @AImhotep . As you can see, they have lower perplexity (better quality) compared to some unreleased mainline-compatible test quants that I created.
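For context, perplexity is just the exponentiated mean negative log-likelihood the model assigns to the tokens of a test corpus (lower is better). A minimal illustrative sketch, not the actual llama-perplexity implementation:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over a token sequence.

    token_logprobs: natural-log probabilities the model assigned to each
    actual next token in the eval corpus.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# If the model assigns probability 0.25 to every token,
# perplexity is 1/0.25 = 4 (equivalent to guessing among 4 options).
logps = [math.log(0.25)] * 4
print(perplexity(logps))
```

In practice the quant comparisons in these charts come from running the same held-out text through each quant and comparing the resulting perplexities against the full-precision baseline.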

fwiw i've tested out to ~65k context and it works quite well with opencode for basic vibe coding. i haven't gone further as it slows down quite a bit since i'm running it CPU-only (no GPUs at all hahah)...

I'm working with unsloth's IQ2_XXS on llama.cpp, which is working amazingly well (the model itself can be a bit verbose though :)
When using stock llama.cpp (newest possible) and the MTP config I get ~300 t/s PP and 34-40 t/s output (continuously), which is perfectly usable with Roo Code.

Perplexity alone is not that good a marker. Any KLD graphs?
Also, I have 304GB of VRAM, so I'm stuck with q2 for now.

@AImhotep

Sure, you can see my quants forming the Pareto front for mean KLD, calculated at 8k context using a special corpus with 16k-chunk blurbs for better alignment:

[KLD chart]

I added the KLD logs and script for your reference: https://huggingface.co/ubergarm/GLM-5.1-GGUF/tree/main/logs/kld
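For anyone unfamiliar with the metric: KLD compares the quantized model's full next-token distribution against the full-precision baseline at every position, so it's a stricter test than perplexity (which only looks at the probability of the one correct token). A minimal sketch of the idea, illustrative only and not ubergarm's actual script:

```python
import math

def kld(p, q):
    """KL divergence D(P || Q) in nats between a baseline (full-precision)
    next-token distribution p and a quantized model's distribution q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kld(p_dists, q_dists):
    """Mean KLD over all token positions in an eval corpus."""
    return sum(kld(p, q) for p, q in zip(p_dists, q_dists)) / len(p_dists)

# Identical distributions -> divergence of 0; any quantization error
# that shifts probability mass raises it.
print(mean_kld([[0.7, 0.3]], [[0.5, 0.5]]))
```

A quant with low mean KLD is reproducing the baseline's probabilities closely everywhere, not just on the tokens the baseline already got right.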

What arguments are you using for MTP, I'd like to try that out. Thanks!

Cheers!
