GGUF q4_0 model variants quantize the "output.weight" tensor with q6_k (compared to the former GGML files, which left it at fp32 IIRC)

#4
by jrudolph - opened

Is this intended? It seems to depend on whether LLAMA_NO_K_QUANTS is set at compile time and on quantize_output_tensor at quantization time. Is that something you changed in your process?
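For clarity, the behaviour I'm describing could be modelled like this. This is a minimal sketch of the selection logic as I understand it, not llama.cpp's actual code; the function name, flag names, and the f16 fallback are illustrative assumptions.

```python
# Hypothetical model of how the "output.weight" type could be chosen.
# Names and the unquantized fallback type are assumptions, not llama.cpp API.
def pick_output_tensor_type(base_type: str,
                            k_quants_enabled: bool,
                            quantize_output_tensor: bool) -> str:
    """Pick the quantization type used for the "output.weight" tensor."""
    if not quantize_output_tensor:
        # Leave the output tensor unquantized (assumed fallback type).
        return "f16"
    if k_quants_enabled:
        # Default build (no LLAMA_NO_K_QUANTS): output tensor gets q6_k.
        return "q6_k"
    # k-quants disabled: fall back to the base quantization type.
    return base_type
```

So with a default build, even a q4_0 file ends up with a q6_k output tensor, which matches what I'm seeing.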

I noticed because, in my custom engine, I have not yet implemented all quantization methods (though k-quants seem attractive anyway...), and having models with only q4_0 (+ maybe q8_0) and fp32 was quite convenient. It's not a big inconvenience; in the worst case, I can just dequantize at load time (to avoid having to implement a full quantized matmul).
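For reference, dequantizing at load time is cheap to implement. Here is a minimal sketch for the q4_0 format, based on its well-known block layout (one fp16 scale followed by 32 packed 4-bit quants per 18-byte block); this is an illustration, not my engine's actual code.

```python
import struct

def dequantize_q4_0_block(block: bytes) -> list[float]:
    """Dequantize one 18-byte q4_0 block into 32 floats.

    Layout: a 2-byte fp16 scale `d`, then 16 bytes holding 32 packed
    4-bit quants. Low nibbles store elements 0..15, high nibbles
    elements 16..31, and each weight is (q - 8) * d.
    """
    assert len(block) == 18
    (d,) = struct.unpack("<e", block[:2])  # fp16 scale
    qs = block[2:]
    out = [0.0] * 32
    for i, byte in enumerate(qs):
        out[i]      = ((byte & 0x0F) - 8) * d  # low nibble -> element i
        out[i + 16] = ((byte >> 4)   - 8) * d  # high nibble -> element i+16
    return out
```

Doing the same for q6_k means handling its 256-element super-blocks with per-sub-block scales, which is more involved but follows the same dequantize-then-matmul-in-float pattern.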

No, I haven't changed anything in my process. I build llama.cpp with default options:

 make clean && LLAMA_CUBLAS=1 make -j

And I quantise with, for example:

quantize airoboros-l2-13b-3.0.fp16.gguf airoboros-l2-13b-3.0.Q4_K_M.gguf Q4_K_M

I believe quantising the output tensor with q6_k has been standard since around the time GGUF was first released.

Yep, probably. Not a big deal, I guess. Thanks for the answer.

jrudolph changed discussion status to closed
