Impressive

#1
by Throghar - opened

Tested Q2_K mixed: greater speed and very consistent answers compared to traditional llama.cpp Q4_K_M quants. I would like to see more models in this quantization method.

Thanks for the kind words!

Why is Qwen3.6-27B-Q2_K_MIXED.gguf larger than Q4_K_M? I've noticed that your other models are usually smaller.
I'm looking for something that's best with 16GB of VRAM :)

Sorry that it didn't fit in 16GB. I've noticed that dense models come out larger while MoE models come out smaller. It is probably an AutoRound thing.

I think it is because more layers are preserved at higher precision when they are marked as important.
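
To make that reasoning concrete, here is a rough sketch of how the average bits per weight, and so the file size, grows when a share of the weights stays at higher precision. The bits-per-weight figures follow from the llama.cpp block layouts; the mix fractions are purely hypothetical and are not the actual recipe of this model, and the 27e9 parameter count is only read off the model name.

```python
# Bits per weight for a few llama.cpp block formats (bytes per block * 8 / weights per block).
BPW = {
    "Q2_K": 2.625,   # 84 bytes per 256 weights
    "Q4_K": 4.5,     # 144 bytes per 256 weights
    "Q6_K": 6.5625,  # 210 bytes per 256 weights
    "Q8_0": 8.5,     # 34 bytes per 32 weights
}


def effective_bpw(mix):
    """Weighted average bits per weight for a {quant_type: fraction} mix."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("fractions must sum to 1")
    return sum(BPW[q] * frac for q, frac in mix.items())


def approx_size_gib(n_params, mix):
    """Approximate weight size in GiB, ignoring metadata and non-quantized tensors."""
    return n_params * effective_bpw(mix) / 8 / 2**30


if __name__ == "__main__":
    n_params = 27e9  # assumption: taken from the "27B" in the file name
    # Hypothetical mixes, for illustration only.
    mostly_low = {"Q2_K": 0.8, "Q4_K": 0.2}
    many_protected = {"Q2_K": 0.5, "Q8_0": 0.5}
    for name, mix in [("mostly_low", mostly_low), ("many_protected", many_protected)]:
        print(name, round(effective_bpw(mix), 3), "bpw,",
              round(approx_size_gib(n_params, mix), 1), "GiB")
```

With enough weights kept at a high-precision type, the average can end up above a plain 4-bit quant even though the name says Q2.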

Hi! Depending on your use case, you can fit it on your GPU: just reduce the batch and ubatch size to 256 in llama.cpp!
This will take a massive hit on your speed, but I was able to fit Q3_K_L into 12 GB of VRAM with 8k context (q4_1 KV cache) using this method. I had about 30 MB left over.
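
For anyone who wants to try those settings, a minimal sketch that assembles the matching `llama-server` command line. The short flags (`-c`, `-b`, `-ub`, `-ctk`, `-ctv`, `-ngl`) are standard llama.cpp options; the model file name and the `-ngl 99` value are assumptions for illustration. Depending on your llama.cpp build, a quantized V cache may also need flash attention enabled; that flag is left out here because its syntax has changed between versions.

```python
import shlex


def llama_server_args(model_path, ctx=8192, batch=256, ubatch=256,
                      kv_type="q4_1", gpu_layers=99):
    """Build a llama-server argument list for a tight VRAM budget."""
    return [
        "llama-server",
        "-m", model_path,         # path to the GGUF file
        "-c", str(ctx),           # context size in tokens (8k)
        "-b", str(batch),         # logical batch size
        "-ub", str(ubatch),       # physical (micro) batch size
        "-ctk", kv_type,          # K cache quantization type
        "-ctv", kv_type,          # V cache quantization type
        "-ngl", str(gpu_layers),  # layers to offload to the GPU (assumed value)
    ]


if __name__ == "__main__":
    # Hypothetical file name; replace with the file you actually downloaded.
    args = llama_server_args("Qwen3.6-27B-Q3_K_L.gguf")
    print(shlex.join(args))
```

Smaller batch sizes shrink the compute buffers, which is where the saved VRAM comes from, at the cost of slower prompt processing.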

Hi, what's the quantization setting for Q2_K_MIXED? Is it better than Q4_K_M?

I am running Q4_K_M with 95K context and the KV cache at only q4_0. Output is quite stable and feels better than IQ4_XS and UD_Q3_K_XL.

Hi, the layers are quantized to a mix of Q4 and Q2. Overall, inference should be faster than with Q4_K_M, with mostly the same quality.