Quantization

#5
by NukeNotNull - opened

What quantization does the chat.z.ai site use?

It's definitely gotta be lower than FP8, or they're using a really low-precision KV cache.
Model performance just completely degrades after 100k tokens on their platform, same with GLM-5, yet other providers don't have the same issues.

z.ai still hasn't addressed this like at all.

Stop whining already 😆 I'm glad that it's OSS! And it's a really awesome model (I'm using unsloth's quantization).

Z.ai's dialogue platform does not provide GLM-5.1 services. Are you referring to the API?

Coding plan.
Users have been complaining about its output for a while.
After around 100k tokens GLM completely loses its mind and spits out nonsense.

[Perplexity chart: ppl-GLM-5.1]

ik_llama.cpp offers the best quality quantizations as well as speed (especially for hybrid CPU+GPU inference). I'd definitely recommend checking out ubergarm/GLM-5.1-GGUF @AImhotep . As you can see, they have lower perplexity (better quality) compared to some unreleased mainline-compatible test quants that I created.
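For context, perplexity is just the exponentiated mean negative log-likelihood the model assigns to the tokens of a test corpus (lower is better). A minimal illustrative sketch, not the actual llama-perplexity implementation:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over a token sequence.

    token_logprobs: natural-log probabilities the model assigned to each
    actual next token in the eval corpus.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# If the model assigns probability 0.25 to every token,
# perplexity is 1/0.25 = 4 (equivalent to guessing among 4 options).
logps = [math.log(0.25)] * 4
print(perplexity(logps))
```

In practice the quant comparisons in these charts come from running the same held-out text through each quant and comparing the resulting perplexities against the full-precision baseline.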

fwiw i've tested out to ~65k context and it works quite well with opencode for basic vibe coding. i haven't gone further as it slows down quite a bit since i'm running it CPU-only (no GPUs at all hahah)...

I'm working with unsloth's IQ2_XXS on llama.cpp, which is working amazingly well (the model itself can be a bit verbose though :)
When using stock llama.cpp (newest possible) and the MTP config I get ~300 t/s PP and 34-40 t/s output (continuously), which is perfectly usable with Roo Code.

Perplexity alone is not that good a marker. Any KLD graphs?
Also, I have 304GB of VRAM, so I'm stuck with q2 for now.

@AImhotep

Sure, you can see my quants forming the Pareto front for mean KLD, calculated at 8k context using a special corpus with 16k-chunk blurbs for better alignment:

[KLD chart]

I added the KLD logs and script for your reference: https://huggingface.co/ubergarm/GLM-5.1-GGUF/tree/main/logs/kld
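For anyone unfamiliar with the metric: KLD compares the quantized model's full next-token distribution against the full-precision baseline at every position, so it's a stricter test than perplexity (which only looks at the probability of the one correct token). A minimal sketch of the idea, illustrative only and not ubergarm's actual script:

```python
import math

def kld(p, q):
    """KL divergence D(P || Q) in nats between a baseline (full-precision)
    next-token distribution p and a quantized model's distribution q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kld(p_dists, q_dists):
    """Mean KLD over all token positions in an eval corpus."""
    return sum(kld(p, q) for p, q in zip(p_dists, q_dists)) / len(p_dists)

# Identical distributions -> divergence of 0; any quantization error
# that shifts probability mass raises it.
print(mean_kld([[0.7, 0.3]], [[0.5, 0.5]]))
```

A quant with low mean KLD is reproducing the baseline's probabilities closely everywhere, not just on the tokens the baseline already got right.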

What arguments are you using for MTP, I'd like to try that out. Thanks!

Cheers!
