Note that llama.cpp’s Q4_0 quantization does not align with the native INT4 QAT format used in post-tuning, so converting those tensors to Q4_0 is effectively a second quantization step.
In past measurements with Kimi K2.5, I saw a cosine similarity of roughly 0.95 between a Q4_0 tensor and the corresponding BF16 tensor, while the Q8 version of that same tensor was around 0.998.
If the quantization algorithm were identical to the original QAT format, I would expect cosine similarity to be near 1.0.
Until llama.cpp implements a quantization scheme that faithfully matches the native INT4 QAT, could we get a Q8 variant that does not use Q4_0 for those tensors?
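For reference, this is roughly the kind of comparison I mean: a simplified sketch with stand-in `fake_q4_0` / `fake_q8_0` round-trips (not the actual llama.cpp kernels), dequantizing and measuring cosine similarity against the original tensor.

```python
import numpy as np

def fake_q4_0(x, block=32):
    # Simplified Q4_0-style round-trip: per 32-value block, scale = signed max / -8,
    # 4-bit codes in 0..15, dequantized as (q - 8) * scale. Not the exact llama.cpp kernel.
    x = x.reshape(-1, block).astype(np.float32)
    idx = np.argmax(np.abs(x), axis=1)
    d = x[np.arange(x.shape[0]), idx][:, None] / -8.0
    q = np.clip(np.floor(x / d + 8.5), 0, 15)
    return ((q - 8) * d).reshape(-1)

def fake_q8_0(x, block=32):
    # Simplified Q8_0-style round-trip: per-block scale = amax / 127, 8-bit codes.
    x = x.reshape(-1, block).astype(np.float32)
    d = np.max(np.abs(x), axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / d), -127, 127)
    return (q * d).reshape(-1)

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
w = rng.normal(size=4096 * 32).astype(np.float32)  # stand-in for a BF16 weight tensor
print("cos(Q4_0, original):", cos_sim(fake_q4_0(w), w))
print("cos(Q8_0, original):", cos_sim(fake_q8_0(w), w))
```

The exact numbers depend on the tensor's distribution; the point is the measurement itself, not the specific values the sketch prints.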
Yes, we used the bijection patch from https://github.com/jukofyork. Without it, stock Q4_0 has around 1.8% amax error; with the patch, the error is within machine epsilon.
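To make that concrete, here is a rough sketch rather than the actual patch: the int4 grid (codes in [-7, 7]) and the per-block scale `s` below are illustrative assumptions about the QAT layout, so the printed error will not match the 1.8% figure exactly. The point is only that re-rounding onto Q4_0's signed-max/-8 grid loses information, whereas a one-to-one remapping of the existing codes only pays the cost of storing the scale in fp16.

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.float32(0.013)                      # assumed per-block QAT scale
codes = rng.integers(-7, 8, size=32)       # assumed int4 QAT codes in [-7, 7]
x = codes.astype(np.float32) * s           # tensor values already on an int4 grid

# Stock Q4_0-style requantization: re-derive the scale as signed max / -8 and re-round.
mx = x[np.argmax(np.abs(x))]
d = mx / np.float32(-8.0)
q = np.clip(np.floor(x / d + 8.5), 0, 15)
stock = (q - 8) * d

# One-to-one ("bijective") remapping: keep the original int4 codes, shift them into
# Q4_0's 0..15 range, and store the original scale; the only loss is fp16 storage of s.
q_bij = codes + 8
d_bij = np.float32(np.float16(s))
bij = (q_bij - 8) * d_bij

amax = np.max(np.abs(x))
print("stock re-round  max err / amax:", np.max(np.abs(stock - x)) / amax)
print("bijective remap max err / amax:", np.max(np.abs(bij - x)) / amax)
```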
Note though that Kimi keeps the other tensors in BF16, and UD-Q8_K_XL also keeps those in BF16, so we did find some error in the Q4_X variants (UD-Q4_K_XL uses Q8_0 for those other layers) versus the truly "lossless" UD-Q8_K_XL, as seen below:
Yeah, the error isn't bad for Q8_0, especially since the "only 10GB bigger" from BF16 sits in the always-active tensors; for an A32B model, users will definitely feel that slowdown in TG. Q4_X for the win!
Oh man more Qwen3.6 today already haha, catch you on the next one! Cheers!
Yes, I agree, but Q4_X isn't as close to the original INT4 as you said, since it's not lossless, unlike the Q8 one.