Quick question for the team about Q8_K_XL

#30

by jswiftie - opened Mar 4

Mar 4

Dear Unsloth Team or anyone who might want to weigh in,

I notice the Q8_K_XL quant is quite beefy at 48.7 GB, whereas others like Bartowski's Q8_0 are 36.9 GB.

I'm curious about the advantages of the (roughly 30%) larger Unsloth quant. I'd imagine it's a bit more precise, but I thought Q8_0 was nearly lossless already, so I'm wondering what value you see in the Q8_K_XL vs the Q8_0, and whether the additional 12 GB is worth it for some workloads.
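
For reference, the size gap can be worked out from the two file sizes quoted in this post. This is only a minimal sketch of the arithmetic; the variable names are illustrative.

```python
# File sizes quoted in this post, in GB.
q8_k_xl_gb = 48.7   # Unsloth Q8_K_XL
q8_0_gb = 36.9      # Bartowski Q8_0

# Absolute and relative difference, measured against the smaller file.
extra_gb = q8_k_xl_gb - q8_0_gb
extra_pct = 100 * extra_gb / q8_0_gb

print(f"extra size: {extra_gb:.1f} GB ({extra_pct:.0f}% larger)")
```

That comes to about 11.8 GB, or roughly 32% more than the Q8_0 file.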

I'm on MPS in case that matters.

Thanks for all the work you contribute to the community!

d2rx

Mar 5

Very interested in this question too. I am using the previous UD_Q8_K_XL at 37 GB and wonder whether there are meaningful improvements in the new 48.7 GB version.

Grossor

Mar 6

I'm very interested in this as well. The old UD Q8 KXL worked very well for me. Blazing fast, and seemed smart.

This one is much, much slower. Token generation dropped from 50 tokens per second in the old version to 18 tokens per second in the new version. It's quite painful.
It's also unstable, probably because the memory footprint is bigger and spikes make it more likely that llama will crash due to OOM.
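
As a rough sanity check, the reported slowdown can be compared with the growth in file size. The sketch below rests on an assumption that is not from the thread: that token generation is limited by memory bandwidth, so speed scales inversely with file size. The old file size (37 GB) is the figure quoted earlier in the thread.

```python
# Figures quoted in this thread.
old_size_gb, new_size_gb = 37.0, 48.7   # old vs new UD Q8_K_XL file size
old_tps, new_tps = 50.0, 18.0           # reported tokens per second

# Assumption (not from the thread): decoding speed scales inversely
# with the number of bytes read per token, i.e. with file size.
size_ratio = new_size_gb / old_size_gb
expected_tps = old_tps / size_ratio
observed_slowdown = old_tps / new_tps

print(f"size ratio:        {size_ratio:.2f}x")
print(f"expected speed:    {expected_tps:.1f} tok/s")
print(f"observed slowdown: {observed_slowdown:.2f}x")
```

Under that assumption the larger file alone would predict roughly 38 tokens per second, so the reported 18 suggests something beyond file size is involved, which would fit with memory pressure.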

The comparison I'd like to see is Old UD Q8 KXL vs New Q8 KXL vs New Q6 KXL.

In case anyone has lost the old quant like I did, you can get it from here: https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/tree/2f668bac3136168a10e0a1052accb4f7a7d28101

kalle07

Mar 8

I think it is no longer a Q8; it's more F16 than Q8,
or at least Q8_XXXL ;)

Grossor

Mar 8

I would agree. It's only about 30% smaller than the FP16 model.
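
A quick way to check that figure is to estimate the FP16 size from the parameter count. The sketch below assumes about 35 billion parameters (read off the "35B" in the repo name linked above, so the exact count may differ), treats 1 GB as 10^9 bytes, and ignores any non-weight data in the files.

```python
# Assumption: ~35 billion parameters, taken from the "35B" in the repo
# name; the exact count may differ.
n_params = 35e9

fp16_gb = n_params * 2 / 1e9   # FP16 stores 2 bytes per weight
new_q8_k_xl_gb = 48.7          # file sizes quoted in this thread
q8_0_gb = 36.9

# How much smaller the new quant is than the estimated FP16 model.
smaller_pct = 100 * (1 - new_q8_k_xl_gb / fp16_gb)

def bits_per_weight(size_gb):
    """Average bits stored per parameter for a file of the given size."""
    return size_gb * 1e9 * 8 / n_params

print(f"estimated FP16 size: {fp16_gb:.0f} GB")
print(f"new Q8_K_XL is {smaller_pct:.0f}% smaller than FP16")
print(f"new Q8_K_XL: {bits_per_weight(new_q8_k_xl_gb):.1f} bits per weight")
print(f"Q8_0:        {bits_per_weight(q8_0_gb):.1f} bits per weight")
```

Under these assumptions the new quant averages about 11 bits per weight, against about 8.4 for the Q8_0 file and 16 for FP16, which matches the "about 30% smaller" estimate.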
