Q4_0 vs native INT4 QAT fidelity

#4
by SpacetimeAI - opened

Note that llama.cpp’s Q4_0 quantization does not align with the native INT4 QAT format used in post-training, so converting those tensors to Q4_0 is effectively a second quantization step.

In past measurements with K2.5, I saw cosine similarity of around ~0.95 between a Q4_0 tensor and the corresponding BF16 tensor, while the Q8 version of the same tensor was around ~0.998.

If the quantization algorithm were identical to the original QAT format, I would expect cosine similarity to be near 1.0.
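A check like this can be sketched in a few lines of NumPy. This is illustrative only: the round trip below is a simplified per-block approximation of llama.cpp's Q4_0 scheme (per-32 block, scale `d = max / -8`, signed 4-bit levels), not its exact kernel, and the random tensor is a stand-in for real weights:

```python
import numpy as np

def q4_0_roundtrip(x, block=32):
    """Approximate Q4_0: per-block scale d = max / -8, levels clipped to [-8, 7]."""
    x = x.astype(np.float32)
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        b = x[i:i + block]
        m = b[np.argmax(np.abs(b))]          # signed value with largest magnitude
        d = m / -8.0 if m != 0 else 1.0
        q = np.clip(np.round(b / d), -8, 7)  # quantize to 4-bit signed levels
        out[i:i + block] = q * d             # dequantize back to float
    return out

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
c = cosine(w, q4_0_roundtrip(w))
print(c)  # below 1.0: Q4_0 is lossy on arbitrary float weights
```

On already-quantized QAT weights the gap would depend entirely on how well the two grids line up, which is the point of the question.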

Until llama.cpp implements a quantization scheme that faithfully matches the native INT4 QAT, could we get a Q8 variant that does not use Q4_0 for those tensors?

@SpacetimeAI

I asked about it already here: https://huggingface.co/unsloth/Kimi-K2.6-GGUF/discussions/2#69e796f9db5b3bac09676143

FWIW, both AesSedai and I are using the symmetric patch as explained here: https://huggingface.co/ubergarm/Kimi-K2.6-GGUF#q4_x-patch

tl;dr: Q4_X is as close to the original int4 as we can get in GGUF land.

Hey, sorry for the delay - we wrote about it in https://unsloth.ai/docs/models/kimi-k2.6

Also copying from Ubergarm's other comment below:

Yes, we utilized the bijection patch from https://github.com/jukofyork - without it, stock Q4_0 has around 1.8% amax error; with the patch it's within machine epsilon.
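To see why matching the grids matters, here is a minimal sketch. Assumptions are mine, not the patch's actual code: the native QAT block is modeled as symmetric int4 levels times a per-block scale, and the level values below are made up for illustration; see the linked Q4_X patch for the real implementation.

```python
import numpy as np

# Hypothetical native INT4 QAT block: symmetric levels in [-7, 7] times a scale s.
s = 0.013
q_native = np.array([-7, -5, -3, -1, 0, 1, 2, 3, 4, 5, 6, 7, -2, -4, -6, 2])
w = q_native * s

# Grid-matched re-quantization: reusing the native scale maps every level
# one-to-one onto the 4-bit grid, so the round trip is exact.
w_match = np.clip(np.round(w / s), -8, 7) * s
assert np.allclose(w_match, w)  # lossless when the grids coincide

# Q4_0-style scale derived from the block itself (d = max / -8, as in llama.cpp):
m = w[np.argmax(np.abs(w))]     # signed value with the largest magnitude
d = m / -8.0
w_q40 = np.clip(np.round(w / d), -8, 7) * d
print(np.abs(w_q40 - w).max())  # nonzero: a second, lossy quantization step
```

With a mismatched scale the native levels no longer fall on the Q4_0 grid, so some round to the wrong level; with a matched scale the mapping is a bijection and nothing is lost.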

Note, though, that Kimi uses BF16 for the other tensors, and UD-Q8_K_XL keeps those in BF16. We found there is still some error in the Q4_X variants (UD-Q4_K_XL likewise uses Q8_0 for the other layers) versus the truly "lossless" UD-Q8_K_XL one, seen below:

[image]

CC: @SpacetimeAI

Phenomenal work by the community and Unsloth!

Yeah, the error isn't bad for Q8_0, especially since the "only 10GB bigger" from BF16 sits in the always-active tensors; for an A32B model, users will definitely feel that slowdown in TG. Q4_X for the win!

Oh man more Qwen3.6 today already haha, catch you on the next one! Cheers!

Unsloth AI org


Yes, I agree, but Q4_X isn't as close to the original int4 as you said, since it's not lossless, unlike the Q8 one.
