Why the fp32 tensors in the IQ5_KS quant?
I noticed your recipe for IQ5_KS uses:
blk..*.ssm_alpha.weight=f32
blk..*.ssm_beta.weight=f32
Why are ssm alpha/beta FP32 here when the base model itself is just bf16?
If those tensors are important enough to go above Q8 even in the larger IQ5_KS recipe, why not just use BF16 instead of FP32?
I'm mostly asking because I made my own IQ_K quants for an abliterated version of this model based on these quant recipes, and the fp32 seems a bit weird and wasteful to me, so I don't know whether I should just blindly copy it.
Good eye!
It's a whole story, haha. Yes, you're correct: the originals are bf16, so fp32 is strange. There was some chatter on Reddit and people became superstitious that it is important to keep these tensors at full bf16 quality. But bf16 tends to infer a bit slower on GPU backends (CUDA) than fp32, and fp32 has enough dynamic range to avoid clipping. You also can't safely downcast them to fp16: some people were originally doing that without any checking for clipping (if you do check and nothing clips, I guess it's fine), but fortunately that nonsense was stopped.
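To make the clipping point concrete, here is a minimal sketch (using NumPy; the values are illustrative, not from this model) of why fp16 can clip where bf16/fp32 cannot: fp16 has a 5-bit exponent and tops out at 65504, while bf16 and fp32 share an 8-bit exponent with a max around 3.4e38.

```python
import numpy as np

# fp16 (1 sign, 5 exponent, 10 mantissa bits) saturates at 65504.
fp16_max = np.finfo(np.float16).max

# A hypothetical large weight value: fine in fp32 (and bf16, which has
# the same exponent range), but it overflows to inf when cast to fp16.
x = np.float32(1.0e5)
y = np.float16(x)

print(fp16_max)        # 65504.0
print(np.isinf(y))     # True: the fp16 downcast clipped to inf
```

So a bf16 → fp16 conversion is only safe if you first verify that no tensor value exceeds the fp16 range; bf16 → fp32 is always lossless.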
Anyway, q8_0 is probably "good enough" quality.
I did some comparison on the speed differentials and it's not a huge deal really, but q8_0 should be faster for TG.
One sec I'll find my discussion and graph with ik and link you to it.
Btw, there is zero reason to be using anything better than Q8_0 for ssm_alpha and ssm_beta. Your own PPL calculations show no difference to f32, and your own benchmarks show reduced TG performance. So, why would you want to do it?
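For anyone following along, a hedged sketch of how the override would look with ik_llama.cpp's llama-quantize, which accepts per-tensor rules via `--custom-q "regex=type,..."` (the filenames here are placeholders, and check your build's `--help` for the exact flag syntax):

```shell
# Override just the ssm_alpha/ssm_beta tensors to q8_0 instead of f32,
# while quantizing the rest of the model to IQ5_KS.
./llama-quantize \
    --custom-q "blk\..*\.ssm_alpha\.weight=q8_0,blk\..*\.ssm_beta\.weight=q8_0" \
    model-bf16.gguf model-IQ5_KS.gguf IQ5_KS
```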
ik https://github.com/ikawrakow/ik_llama.cpp/issues/1471#issuecomment-4097081505
My 3-way llama-sweep-bench comparing ssm at q8_0, bf16, and fp32: https://huggingface.co/AesSedai/Qwen3.5-397B-A17B-GGUF/discussions/7#69b8404f18a5e8feffd9f5c8
Also, thanks for tagging your quants with ik_llama.cpp so folks can find them! Cheers and welcome to the party!