Awesome quants! KLD/PPL comparison

#2
by krampenschiesser - opened

You have some really cool quants here!

Measured on wikitext-2-raw-v1, with the Unsloth Q8_0 quant as the baseline.

| Provider | Quant | Size (GB) | Mean PPL | Mean KLD | Same top p |
|----------|-------|-----------|----------|----------|------------|
| Unsloth | Q8 | — | 4.3155 ± 0.02446 | baseline | baseline |
| Unsloth | UD-Q6_K_XL | 105.0 | 4.317536 ± 0.024475 | 0.004961 ± 0.000192 | 97.655 ± 0.039 % |
| Aes Sedai | Q5_K_M | 91.5 | 4.320741 ± 0.024486 | 0.005936 ± 0.000234 | 97.348 ± 0.042 % |
| Unsloth | Q6_K_M | 101.0 | 4.320079 ± 0.024524 | 0.006602 ± 0.000252 | 97.057 ± 0.044 % |
| Unsloth | Q5_K_M | 87.1 | 4.332594 ± 0.024603 | 0.010502 ± 0.000261 | 96.318 ± 0.049 % |
| Aes Sedai | Q4_K_M | 76.7 | 4.325629 ± 0.024507 | 0.010749 ± 0.000228 | 96.435 ± 0.048 % |
| Unsloth | UD-Q5_K_XL | 87.0 | 4.331663 ± 0.024585 | 0.011109 ± 0.000284 | 96.301 ± 0.049 % |
| Aes Sedai | IQ4_X_S | 60.4 | 4.404998 ± 0.025001 | 0.027409 ± 0.000300 | 94.259 ± 0.060 % |
| Unsloth | Q4_K_M | 74.3 | 4.435888 ± 0.025722 | 0.033208 ± 0.000312 | 92.935 ± 0.067 % |
| Unsloth | IQ4_NL | 69.2 | 4.468707 ± 0.026029 | 0.038368 ± 0.000331 | 92.349 ± 0.069 % |
| Unsloth | IQ4_XS | 65.5 | 4.462136 ± 0.025988 | 0.038909 ± 0.000371 | 92.321 ± 0.069 % |
| Unsloth | MXFP4 | 68.3 | 4.452131 ± 0.025427 | 0.057660 ± 0.000527 | 91.221 ± 0.073 % |
| Noctrex | MXFP4 | 74.0 | 4.450555 ± 0.025420 | 0.057950 ± 0.000517 | 91.160 ± 0.074 % |
| Unsloth | IQ3_XXS | 50.5 | 4.567684 ± 0.026154 | 0.068894 ± 0.000561 | 90.486 ± 0.076 % |
| Aes Sedai | IQ3_S | 46.6 | 4.570771 ± 0.026085 | 0.073494 ± 0.000597 | 90.410 ± 0.076 % |
| Unsloth | Q3_K_M | 58.8 | 4.648459 ± 0.027585 | 0.083953 ± 0.000570 | 88.692 ± 0.082 % |
| Unsloth | UD-Q3_K_XL | 54.6 | 4.915599 ± 0.028904 | 0.128848 ± 0.000917 | 87.006 ± 0.087 % |
| Unsloth | UD-Q4_K_XL | 68.4 | 4.867515 ± 0.028856 | 0.130354 ± 0.000939 | 86.819 ± 0.088 % |
| Unsloth | UD-Q2_K_XL | 46.7 | 5.133302 ± 0.030444 | 0.174476 ± 0.001143 | 84.785 ± 0.093 % |
| Aes Sedai | IQ2_XXS | 33.9 | 5.105667 ± 0.030043 | 0.178437 ± 0.001154 | 84.945 ± 0.093 % |
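For anyone new to these columns: "Mean KLD" is the average per-token KL divergence between the quant's next-token distribution and the baseline's, and "Same top p" (as I understand llama.cpp's reporting) is the fraction of positions where the quant agrees with the baseline on the top token. A minimal sketch of the per-token math, using toy distributions rather than real model logits:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats over a shared vocabulary; p and q each sum to 1."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def same_top(p, q):
    """True when both distributions put the most mass on the same token."""
    return max(range(len(p)), key=p.__getitem__) == max(range(len(q)), key=q.__getitem__)

# Toy next-token distributions over a 3-token vocabulary:
# baseline (e.g. the Q8_0 reference) vs. a smaller quant.
baseline = [0.70, 0.20, 0.10]
quant    = [0.65, 0.25, 0.10]

print(f"KLD: {kl_divergence(baseline, quant):.6f}")  # small = close to baseline
print(f"same top token: {same_top(baseline, quant)}")
```

In practice the perplexity tooling aggregates these statistics over every token position of the evaluation text and reports the mean ± its standard error, which is what the table shows.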

Thank you for measuring these! Always love seeing community data and I appreciate the work you did here.

Seriously amazing, thanks for hitting such a nice quant sweet spot AesSedai! I was worried the 122B model would be out of my reach.

I'm using IQ4_X_S and get roughly 15 tokens/sec with llama.cpp on an RTX 5080 (16 GB VRAM) plus 64 GB of system RAM, which is slow but usable (the thinking takes a long time!). For comparison, qwen3-coder:30b-a3b-q4_K_M runs at 50 tok/sec for me (it also overflows VRAM into system RAM).

My previous best local models for my coding tasks (currently a new Chrome browser extension) were gemma3:27b and qwen3-coder:30b-a3b-q4_K_M. So far this one hasn't given me any wrong answers (I can't say the same for any other model I've tried), although the code it produces is still not usually up to my standards. Nothing I can't clean up myself, though.


For your setup I emphatically suggest trying Qwen3-Coder-30B-A3B-Instruct-Q3_K_S-2.69bpw.gguf from ByteShape:
https://huggingface.co/byteshape/Qwen3-Coder-30B-A3B-Instruct-GGUF

It one-shotted this on a 16 GB laptop (Ryzen 3500U with a Vega 8 iGPU) at maybe 4-5 t/s: https://x0.at/UbDs.py

(screenshot: qwen3-coder-30-bytedance)

The question is, was this excellent model a fluke? Or does their method generalize well? If I were AesSedai I'd definitely check these guys' papers out!

I've seen a couple of things about ByteShape in the past, but I haven't seen whether they published a paper or an open-source methodology; definitely interested if they have.

Edit: their page here says it's proprietary: https://byteshape.com/index.html#technologies

> At the heart of ByteShape’s acceleration stack is a suite of proprietary datatype learning algorithms that automatically determine the minimal precision required for each neural-network weight and activation. Unlike static quantization, ShapeLearn performs dynamic precision allocation to preserve accuracy while greatly reducing arithmetic complexity, memory footprint, and energy consumption.

Oh :/ I am embarrassed by my post.

No need to be embarrassed, I'm interested in what they're doing too but they seem to be making a business out of it, so not really anything available otherwise on the open source front as far as I'm aware.
