Quick question for the team about Q8_K_XL
Dear Unsloth Team or anyone who might want to weigh in,
I notice the Q8_K_XL quant is quite beefy at 48.7 GB, whereas others like Bartowski's Q8_0 are 36.9 GB.
I'm curious about the advantages of the (roughly 30%) larger Unsloth quant. I'd imagine it's a bit more precise, but I thought Q8_0 was nearly lossless already. What value do you see in Q8_K_XL over Q8_0, and is the additional ~12 GB worth it for some workloads?
I'm on MPS in case that matters.
Thanks for all the work you contribute to the community!
Very interested in this question too. I am using the previous UD_Q8_K_XL at 37 GB, and I wonder whether there are meaningful improvements in the new 48.7 GB version.
I'm very interested in this as well. The old UD Q8 KXL worked very well for me. Blazing fast, and seemed smart.
This one is much, much slower. Token generation dropped from 50 tokens per second in the old version to 18 tokens per second in the new version. It's quite painful.
It's also less stable, probably because the memory footprint is bigger and spikes make it more likely that llama will crash due to OOM.
The comparison I'd like to know is Old UD Q8 KXL vs New Q8 KXL vs New Q6 KXL.
In case anyone has lost the old quant like I did:
You can get it from here: https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/tree/2f668bac3136168a10e0a1052accb4f7a7d28101
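For anyone who wants to script this, here is a minimal sketch of building a direct download link pinned to that commit. It assumes the Hub's usual `/resolve/<revision>/<filename>` URL layout, and `FILENAME` is a placeholder I made up: check the file listing at the link above for the real name.

```python
from urllib.parse import quote

REPO_ID = "unsloth/Qwen3.5-35B-A3B-GGUF"
# Commit hash taken from the link above.
REVISION = "2f668bac3136168a10e0a1052accb4f7a7d28101"
# HYPOTHETICAL placeholder, not the real file name: look it up in the repo tree.
FILENAME = "REPLACE-WITH-ACTUAL-FILE.gguf"


def resolve_url(repo_id: str, revision: str, filename: str) -> str:
    """Build a download URL for one file at one pinned revision.

    Assumes the Hub's /resolve/<revision>/<path> layout; slashes in
    filename are kept so files inside subfolders still work.
    """
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{quote(filename)}"


if __name__ == "__main__":
    print(resolve_url(REPO_ID, REVISION, FILENAME))
```

Pinning the commit hash rather than `main` is what keeps you on the old quant even after the repo is updated again.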
I think it is not really Q8 any more. It's closer to F16 than Q8,
or at least Q8_XXXL ;)
I would agree. It's only about 30% smaller than the FP16 model.
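To put rough numbers on that, here is a quick sketch that converts the file sizes quoted in this thread into average bits per weight. It assumes about 35e9 parameters (read off the "35B" in the model name, not an official count) and decimal gigabytes, and it ignores GGUF metadata, so treat the results as ballpark only.

```python
# File sizes quoted in this thread, in GB (decimal GB assumed).
SIZES_GB = {
    "new UD-Q8_K_XL": 48.7,
    "Bartowski Q8_0": 36.9,
    "old UD-Q8_K_XL": 37.0,
}

# ASSUMPTION: ~35e9 parameters, taken from the "35B" in the model name.
ASSUMED_PARAMS = 35e9


def bits_per_weight(size_gb: float, n_params: float = ASSUMED_PARAMS) -> float:
    """Average number of bits stored per parameter for a file of size_gb."""
    return size_gb * 1e9 * 8 / n_params


def percent_larger(a_gb: float, b_gb: float) -> float:
    """How much larger a is than b, as a percentage of b."""
    return (a_gb / b_gb - 1) * 100


if __name__ == "__main__":
    for name, size in SIZES_GB.items():
        print(f"{name}: {size} GB, about {bits_per_weight(size):.1f} bits/weight")
    new, q8_0 = SIZES_GB["new UD-Q8_K_XL"], SIZES_GB["Bartowski Q8_0"]
    print(f"new vs Q8_0: {percent_larger(new, q8_0):.0f}% larger")
```

Under that assumed parameter count, the new file averages roughly 11 bits per weight against roughly 8.4 for the two 37 GB files, which fits the impression that it sits somewhere between Q8 and F16.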