Could you possibly also provide Q3_0, Q3_1?

#1 opened by waynemchuck

Dear Mungert, my GPU with 64 GB VRAM gets its best performance from legacy quants like Q4_0, Q4_1, and Q5_1, much better than any Q4_K_M or IQ4 quants. It's amazing that you provide Q5_1 quants (other people typically only provide Q4_0 and Q4_1), which is great for smaller models that I can fit into my GPU. For this particular 122B model, I'm guessing a Q3_0 or Q3_1 quant could fit completely into my GPU without offloading to the CPU. Thanks a lot for your great work!

Owner

There is no support for Q3_0 or Q3_1 in llama.cpp, which I use to create the quants. I also find the Q4_0 format runs faster: it's simpler, but doesn't offer as much quality. For this reason I keep the _0/_1 quants pure, i.e. Q4_0 (etc.) for every tensor layer. I have recently fully separated the way I create the quants so the different speeds are not mixed: _0 and _1 are pure, the _K quants only contain other _K quant types (not as fast as _0 and _1, but better quality), and finally the slowest are the IQ quants, which can be any mixture of quant types and (mostly) give the best quality.
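
If you want to check which quant types a given GGUF actually uses per tensor, the gguf Python package that ships with llama.cpp (gguf-py) can read the file header. A minimal sketch, assuming that package is installed and using a hypothetical file path:

```python
# Sketch: count the quant type of every tensor in a GGUF file, so you can
# confirm a "pure" quant (e.g. all Q4_0) versus a K/IQ mixture.
# Assumes the gguf package from llama.cpp's gguf-py (pip install gguf);
# the model path below is hypothetical.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("model-Q4_0.gguf")  # hypothetical path

type_counts = Counter(t.tensor_type.name for t in reader.tensors)
for quant_type, count in type_counts.most_common():
    print(f"{quant_type}: {count} tensors")
# A pure quant shows essentially one type (plus F32 for small norm tensors);
# K and IQ quants show a mixture of types.
```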

How does Qwen3.5-27B run for you? It has nearly three times the number of active parameters, since it's a dense model, but you could use the Q8_0 quant and it might perform well while using only about half the total memory of your GPU.
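
As a rough back-of-envelope: a quant's weight size is roughly the parameter count times its average bits per weight. The bits-per-weight figures below are approximate block-format averages, and real files carry extra overhead, so treat the output as an estimate only:

```python
# Rough size estimate: parameters * bits-per-weight / 8 = bytes of weights.
# The bits-per-weight values are approximate averages for llama.cpp formats;
# inference also needs extra VRAM for the KV cache and compute buffers.
APPROX_BITS_PER_WEIGHT = {
    "Q3_K": 3.4,   # approximate
    "Q4_0": 4.5,
    "Q4_1": 5.0,
    "Q5_1": 6.0,
    "Q8_0": 8.5,
}

def weight_gib(params_billion: float, quant: str) -> float:
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 2**30

for params, quant in [(122, "Q3_K"), (27, "Q8_0")]:
    print(f"{params}B at {quant}: ~{weight_gib(params, quant):.0f} GiB of weights")
# Roughly 48 GiB for 122B at a Q3-class quant and roughly 27 GiB for 27B at
# Q8_0, so both are plausible fits in 64 GB of VRAM before the KV cache.
```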

Thanks a lot for the reply. It's a pity that llama.cpp doesn't support Q3_0 and Q3_1. I run the 122B Q3_K_XL from unsloth and the 27B at Q4_1. The speed is about the same: pp 200 t/s, tg 18 t/s. I would prefer the 122B at Q3 because I guess it's more knowledgeable.
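
For reference, these pp/tg numbers are the kind reported by llama.cpp's llama-bench tool. A minimal sketch of running the same comparison, with hypothetical model paths and a GPU layer count you would adjust for your card:

```python
# Sketch: run llama.cpp's llama-bench on two GGUF files to compare
# prompt-processing (pp) and token-generation (tg) speeds.
# Assumes llama-bench is built and on PATH; the model paths are hypothetical.
import subprocess

MODELS = ["122B-Q3_K_XL.gguf", "27B-Q4_1.gguf"]  # hypothetical paths

for model in MODELS:
    subprocess.run(
        ["llama-bench",
         "-m", model,
         "-p", "512",    # prompt length for the pp test
         "-n", "128",    # tokens to generate for the tg test
         "-ngl", "99"],  # offload up to 99 layers to the GPU
        check=True,
    )
```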

Thanks again for the Q5_1 and Q5_0 quants of small models. Really a blessing for me and, I think, a lot of other people.
