GGUF q4_0 model variants quantize the "output.weight" tensor with q6_k (compared to the former GGML files, which left it at fp32 IIRC)

#4
by jrudolph - opened

Is this intended? It seems to depend on whether LLAMA_NO_K_QUANTS is set at compile time and on quantize_output_tensor at quantization time. Is that something you changed in your process?
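For clarity, the behaviour I'm describing could be modelled like this. This is a minimal sketch of the selection logic as I understand it, not llama.cpp's actual code; the function name, flag names, and the f16 fallback are illustrative assumptions.

```python
# Hypothetical model of how the "output.weight" type could be chosen.
# Names and the unquantized fallback type are assumptions, not llama.cpp API.
def pick_output_tensor_type(base_type: str,
                            k_quants_enabled: bool,
                            quantize_output_tensor: bool) -> str:
    """Pick the quantization type used for the "output.weight" tensor."""
    if not quantize_output_tensor:
        # Leave the output tensor unquantized (assumed fallback type).
        return "f16"
    if k_quants_enabled:
        # Default build (no LLAMA_NO_K_QUANTS): output tensor gets q6_k.
        return "q6_k"
    # k-quants disabled: fall back to the base quantization type.
    return base_type
```

So with a default build, even a q4_0 file ends up with a q6_k output tensor, which matches what I'm seeing.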

I noticed because, in my custom engine, I have not yet implemented all quantization methods (though k-quants seem attractive anyway...), and having models with only q4_0 (+ maybe q8_0) and fp32 was quite convenient. It's not a big inconvenience; in the worst case, I can just dequantize at load time (to avoid having to implement a full quantized matmul).
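For reference, dequantizing at load time is cheap to implement. Here is a minimal sketch for the q4_0 format, based on its well-known block layout (one fp16 scale followed by 32 packed 4-bit quants per 18-byte block); this is an illustration, not my engine's actual code.

```python
import struct

def dequantize_q4_0_block(block: bytes) -> list[float]:
    """Dequantize one 18-byte q4_0 block into 32 floats.

    Layout: a 2-byte fp16 scale `d`, then 16 bytes holding 32 packed
    4-bit quants. Low nibbles store elements 0..15, high nibbles
    elements 16..31, and each weight is (q - 8) * d.
    """
    assert len(block) == 18
    (d,) = struct.unpack("<e", block[:2])  # fp16 scale
    qs = block[2:]
    out = [0.0] * 32
    for i, byte in enumerate(qs):
        out[i]      = ((byte & 0x0F) - 8) * d  # low nibble -> element i
        out[i + 16] = ((byte >> 4)   - 8) * d  # high nibble -> element i+16
    return out
```

Doing the same for q6_k means handling its 256-element super-blocks with per-sub-block scales, which is more involved but follows the same dequantize-then-matmul-in-float pattern.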

No, I haven't changed anything in my process. I build llama.cpp with default options:

 make clean && LLAMA_CUBLAS=1 make -j

And I quantise with, for example:

quantize airoboros-l2-13b-3.0.fp16.gguf airoboros-l2-13b-3.0.Q4_K_M.gguf Q4_K_M

I believe quantising the output tensor with q6_k has been standard since around the time GGUF was first released.

Yep, probably. Not a big deal, I guess. Thanks for the answer.

jrudolph changed discussion status to closed
