What kind of Q4_0 are you using for ffn_(gate|up|down)_exps?

Opened by ubergarm

I didn't see it mentioned on the model card, so I'm just curious: are you using the default llama.cpp Q4_0 asymmetric quantization type?

If not, could you provide the patch you used for symmetric Q4_0 in llama-quantize?

Cheers!

Hey, sorry for the delay - we published some of our findings at https://unsloth.ai/docs/models/kimi-k2.6

Yes, we used the bijection patch from https://github.com/jukofyork. Without it, stock Q4_0 has around 1.8% amax error; with the patch the round-trip error is within machine epsilon.
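
For context, here's a rough NumPy sketch of the difference (my own reconstruction of the idea, not the actual patch, with rounding details simplified): stock Q4_0 takes its scale from the signed max-magnitude element, so the grid is the asymmetric -8..7, while a symmetric variant uses amax/7 with a -7..7 grid, which round-trips weights that already sit on a symmetric INT4 grid, as QAT produces:

```python
import numpy as np

QK = 32  # Q4_0 block size in llama.cpp

def q4_0_stock(block):
    # Stock llama.cpp Q4_0: scale from the signed max-magnitude element,
    # which maps to level -8; the grid -8..7 is not symmetric around zero.
    m = block[np.argmax(np.abs(block))]
    if m == 0.0:
        return np.zeros_like(block)
    d = m / -8.0
    q = np.clip(np.round(block / d), -8, 7)
    return q * d  # dequantized block

def q4_0_symmetric(block):
    # Symmetric variant (the idea, not the actual patch): scale from amax,
    # grid -7..7, so quantization commutes with sign flips.
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    d = amax / 7.0
    q = np.clip(np.round(block / d), -7, 7)
    return q * d

# Build a block that already lives on a symmetric INT4 grid, the way QAT
# would produce it; w[0] pins the extreme to +7*s so the full grid is used.
s = 0.013
rng = np.random.default_rng(0)
w = np.concatenate([[7], rng.integers(-6, 7, QK - 1)]).astype(np.float32) * s

for name, fn in [("stock Q4_0", q4_0_stock), ("symmetric Q4_0", q4_0_symmetric)]:
    # The exact stock error figure depends on the data; the point is that
    # it is nonzero, while the symmetric variant reproduces the grid exactly.
    rel = np.abs(fn(w) - w).max() / np.abs(w).max()
    print(f"{name:15s} max relative error: {rel:.3%}")
```

A block whose extreme hits ±7·s reproduces every grid point exactly under the symmetric scale, which is why the round-trip lands within machine epsilon.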

Note though that Kimi uses BF16 for the other (non-expert) tensors, and UD-Q8_K_XL keeps those in BF16 as well. We found there is still some error in the Q4_X variants (UD-Q4_K_XL uses Q8_0 for those other layers) versus the truly "lossless" UD-Q8_K_XL, as seen below:

[image: quantization error comparison of the Q4_X variants vs UD-Q8_K_XL]
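
If anyone wants to double-check the per-tensor mix in any of these files, the gguf Python package can list what each tensor ended up as; a quick sketch (the filename below is just a placeholder):

```python
from gguf import GGUFReader  # pip install gguf

# Print each tensor's name and quantization type, e.g. to confirm the
# ffn_(gate|up|down)_exps tensors are Q4_0 while the rest stay Q8_0/BF16.
# The path is a placeholder - point it at an actual GGUF shard.
reader = GGUFReader("Kimi-K2-UD-Q4_K_XL-00001-of-0000N.gguf")
for t in reader.tensors:
    print(f"{t.name:55s} {t.tensor_type.name}")
```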

CC: @x-polyglot-x

Thanks for clarifying that you are running llama-quantize with @jukofyork's symmetric patch.

Yeah, the error isn't bad for Q8_0, especially since the "only 10GB bigger" for BF16 sits in the always-active tensors; for an A32B model, users will feel that slowdown in TG for sure. Q4_X for the win!
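
Quick back-of-the-envelope on why that matters: decode is roughly memory-bandwidth bound, so TG speed is about bandwidth divided by bytes read per token, and anything in the always-active set is read every single token. All numbers in this sketch are made-up illustrative assumptions, not measurements:

```python
# Decode (TG) is roughly memory-bandwidth bound: tok/s ~= bandwidth / bytes-per-token.
BW = 400e9                      # assumed memory bandwidth, bytes/s

experts_active = 22e9 * 0.5625  # assumed ~22B routed params/token at ~4.5 bits (Q4_0)
shared_q8      = 10e9 * 1.0625  # assumed ~10B always-active params at Q8_0 (8.5 bits)
shared_bf16    = 10e9 * 2.0     # same tensors at BF16 - roughly the "10GB bigger"

for label, shared in [("Q8_0 shared tensors", shared_q8),
                      ("BF16 shared tensors", shared_bf16)]:
    tps = BW / (experts_active + shared)
    print(f"{label}: ~{tps:.1f} tok/s upper bound")
```

Under these assumptions the BF16 mix reads roughly 40% more bytes per token, so the TG hit is very visible even though the file is "only" 10GB bigger.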

Oh man, more Qwen3.6 today already, haha. Catch you on the next one! Cheers!
