Do Imatrix quants make less sense for Q5?
Some sort of diminishing returns as bits go up?
@BingoBird The quality difference between static and imatrix quants indeed gets smaller the more bits per weight a quant has, but a quality difference remains even at Q6. If you have a choice, I always recommend weighted/imatrix quants over static quants, so go with i1-Q5_K_M instead of Q5_K_M. Great quant choice: i1-Q5_K_M is the first quant which, based on my interpretation of my quant quality measurements, should be indistinguishable from the unquantized model for humans in any normal use case. i1-Q5_K_M is the quant I personally use for all models between 8B and 500B.
Please go to https://hf.tst.eu/model#Qwen3-Coder-42B-A3B-Instruct-TOTAL-RECALL-MASTER-CODER-M-i1-GGUF and take a look at the quality column. You can use the dropdown to select the quant quality metric by which the quants get sorted. Those quality numbers are the result of 500 hours of quant quality measurements I made during Q4 2024, so they are likely the best way for you to compare the quality of different quants.
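If you only want a single quant from a repo rather than browsing all of them, here is a minimal download sketch using huggingface_hub, assuming the Hugging Face repo id matches the page name above:

```python
# Minimal sketch: fetch only the i1-Q5_K_M GGUF from the repo linked above.
# The repo id is an assumption inferred from the page name; adjust as needed.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mradermacher/Qwen3-Coder-42B-A3B-Instruct-TOTAL-RECALL-MASTER-CODER-M-i1-GGUF",
    allow_patterns=["*i1-Q5_K_M*"],  # skip every other quant size
    local_dir="models",
)
```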
That's an amazing site for people like me who need 1 hour to choose between 4 static quant versions, trying to decide if I want to prioritize size or speed or quality. Great work, thanks.
I'm glad you like it. Don't overthink it, though; the quality difference is quite minimal. If I need more speed than i1-Q5_K_M offers, I just go with i1-Q4_K_M or even i1-IQ4_XS.
As pointed out by @eleius in https://huggingface.co/mradermacher/model_requests/discussions/1436#68e51764093f0ffd9336a746, you can also take a look at the raw results of my benchmark under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2#674a7958ce9bc37b8e33cf55. Somewhere in the later BabyHercules discussions you can even find my raw, extensive performance/speed measurement results.
I've heard previously that Q5 quants are somehow worse because it's an odd number, and it's better to choose Q4 or Q6. Can this be true? Feels a bit like an "urban myth".
It is an urban myth for GGUFs, but very relevant for bitsandbytes and other types of quants that rely on, or are heavily accelerated by, native GPU data type support. Even then, only 4-bit, 8-bit, 16-bit and 32-bit get that boost, and native 16-bit is usually way faster than it should be, given that token generation on GPUs is supposed to be bandwidth- and not compute-bottlenecked.
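To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch in Python. All the numbers (an 8B dense model, 1 TB/s of GPU memory bandwidth, approximate effective bits per weight) are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope: if token generation is bandwidth-bound, then
# time per token ~= bytes read per token / memory bandwidth, so smaller
# quants should in theory decode proportionally faster.

def tokens_per_second(params_billion: float, bits_per_weight: float,
                      bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every weight is read once per token."""
    model_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

# Approximate effective bits per weight; the exact values vary per model.
for name, bpw in [("F16", 16.0), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]:
    print(f"{name:>7}: ~{tokens_per_second(8, bpw, 1000):.0f} tok/s on a 1 TB/s GPU")
```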
Coming back to GGUFs: Q4_K_M is in fact a mixture of 4-bit and 5-bit quantized tensors, while Q5_K_M is a mixture of 5-bit and 6-bit tensors.
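You can verify that mixture yourself with a small sketch using the gguf Python package that ships with llama.cpp (the file path below is a placeholder):

```python
# Sketch: count which quantization type each tensor in a GGUF file actually uses.
# Assumes the gguf package from llama.cpp (pip install gguf).
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("model.i1-Q5_K_M.gguf")  # placeholder path
counts = Counter(tensor.tensor_type.name for tensor in reader.tensors)
for quant_type, n in counts.most_common():
    print(f"{quant_type}: {n} tensors")  # expect a mix of Q5_K and Q6_K (plus F32 norms)
```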
There is one exception to this rule, which is Q4_0 quants: llama.cpp can optimise them on the fly for certain architectures and AI accelerators. This is mainly relevant for ARM (mobile), RISC-V (embedded devices, Chinese silicon), Apple Silicon (mobile, laptop) and some obscure AI accelerators (Samsung, Qualcomm, the never-working proprietary AMD Ryzen AI / Intel AI Boost and many more). Other than for those specific use cases, Q4_0 quants are terrible in almost every metric, so only use them if you have hardware optimised for them.