GGUF q4_0 model variants quantize the "output.weight" tensor with q6_k (whereas the older GGML files left it at fp32, iirc).
Is this intended? It seems to depend on whether LLAMA_NO_K_QUANTS is set at compile time and on the quantize_output_tensor option at quantization time. Is that something you changed in your process?
I noticed because, in my custom engine, I haven't yet implemented all the quantization methods (though the k-quants look attractive anyway...), and having models that use only q4_0 (plus maybe q8_0) and fp32 was quite convenient. It's not a big inconvenience; in the worst case I can just dequantize at load time (to avoid having to implement a full quantized matmul).
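For reference, the dequantize-at-load-time path is cheap for the simple formats. Below is a minimal NumPy sketch of q4_0 round-tripping, assuming the standard ggml block layout (32 values per block: one fp16 scale d followed by 16 bytes of nibble-packed quants, with value = (q - 8) * d). The function names are mine, not from llama.cpp, and the rounding details are a simplification of ggml's.

```python
import numpy as np

QK4_0 = 32  # values per q4_0 block

def quantize_q4_0(x):
    """x: float32 array, length a multiple of 32. Returns (d, qs) per block."""
    blocks = x.reshape(-1, QK4_0)
    # ggml picks the signed value with the largest magnitude and sets d = max / -8
    idx = np.argmax(np.abs(blocks), axis=1)
    maxv = blocks[np.arange(len(blocks)), idx]
    d = maxv / -8.0
    d_safe = np.where(d == 0.0, 1.0, d)  # avoid div-by-zero for all-zero blocks
    q = np.clip(np.round(blocks / d_safe[:, None] + 8.0), 0, 15).astype(np.uint8)
    # pack two 4-bit quants per byte: low nibble = values 0..15, high = 16..31
    qs = (q[:, :16] | (q[:, 16:] << 4)).astype(np.uint8)
    return d.astype(np.float16), qs

def dequantize_q4_0(d, qs):
    """Inverse of the above: unpack nibbles and rescale to float32."""
    lo = (qs & 0x0F).astype(np.int8) - 8
    hi = (qs >> 4).astype(np.int8) - 8
    q = np.concatenate([lo, hi], axis=1).astype(np.float32)
    return (q * d.astype(np.float32)[:, None]).reshape(-1)
```

Doing the same for q6_k is more involved (256-value superblocks with sub-block scales), which is why dequantizing the single output tensor at load time is an easy workaround.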
No, I haven't changed anything in my process. I build llama.cpp with default options:
make clean && LLAMA_CUBLAS=1 make -j
And I quantise with, for example:
quantize airoboros-l2-13b-3.0.fp16.gguf airoboros-l2-13b-3.0.Q4_K_M.gguf Q4_K_M
I believe quantising the output tensor with q6_k has been standard since around the time GGUF was first released.
Yep, probably. Not a big deal, I guess. Thanks for the answer.