Dataset Imatrix
Hi, great job btw with the weights, but it is still 3-4 times slower than the average quantization; we're trading speed for accuracy.
Which dataset did you use for the imatrix file (is it a coding one or general knowledge)?
Hi, I wonder if that's because the rest of the model is in Q8. Usually you're memory-bandwidth bound though, so I wouldn't expect it to be that drastically slower.
These have been updated with the fused up+gate tensors, and maybe your backend doesn't have optimizations for that yet?
Regarding the imatrix, I use @bartowski 's calibration_datav5, which is a general-purpose dataset.
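For anyone wanting to reproduce that step, a minimal sketch of generating an imatrix with llama.cpp's llama-imatrix tool (file names here are illustrative, not my exact paths):

```bash
# Compute an importance matrix over the calibration text; the resulting
# imatrix.dat then guides the per-tensor quantization later on.
./build/bin/llama-imatrix \
    -m Qwen3.5-397B-A17B-BF16.gguf \
    -f calibration_datav5.txt \
    -o imatrix.dat \
    -ngl 999
```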
It may well be because of the 8 bpw; so basically, with your method, you allow the user higher precision (nearly FP) on cheaper hardware.
It would be great to find a way to increase the speed (I will see if I can do anything about it). This is the right direction, you nailed it.
My backend is updated daily, so that's not the issue. Care to share how you load it (--flags and command line)?
If possible, sharing your method scripts as well would be awesome, a huge contribution to me and to the community (the method already shared is a bit unclear).
Thanks in advance.
> Hi, great job btw with the weights, but it is still 3-4 times slower than the average quantization; we're trading speed for accuracy.
> Which dataset did you use for the imatrix file (is it a coding one or general knowledge)?
Care to elaborate:
A) What kind of rig do you have?
B) Which (Qwen3.5-397B-A17B) version are you talking about here, being 3-4 times slower, and versus what?
> Care to elaborate:
> A) What kind of rig do you have?
> B) Which (Qwen3.5-397B-A17B) version are you talking about here, being 3-4 times slower, and versus what?
Sure. Independently of the hardware used, the difference in inference speed should remain constant relative to the hardware, using the same models, same loading parameters, and same backend.
I am using HugstonOne Enterprise Edition 1.0.9 with the latest update of qwen3.5-397B-q4_k_m (4 days ago, from AesSedai), which runs at around 1 t/s, compared to the other q4_k_m quants, quantized by myself or available on Hugging Face, which run at 3-5 t/s. Unsloth's run faster, but in my tests I mostly got lower accuracy, which is fine for general purposes but useless to us, because we use it for very high-precision tasks. That's why AesSedai offers a good solution, keeping the compression while conserving accuracy.
What's left, obviously, is to understand the method better so the inference speed can possibly be optimized. For that, hopefully AesSedai will be willing to share his method here.
Edit: if you have any suggestions or see that I am missing something, you are welcome to reply :)
@Trilogix1 I've shared my method previously, it basically looks like what I described here: https://huggingface.co/ubergarm/Qwen3.5-122B-A10B-GGUF/discussions/2#699e4ad09b7edc25c095a595
If you look at the table on the model card for this model, I have a "mixture" column in Default Type / FFN Up / FFN Gate / FFN Down notation. So you can see the Q4_K_M uses Q8_0 as the default type (that covers the shared expert, attention, etc.), Q4_K for both the FFN Up and FFN Gate tensors, and Q5_K for the FFN Down tensors. I also use F32 for most of the SSM tensors because 1) they're tiny and 2) they seem very prone to quantization error, based on some feedback I've received.
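As a rough sketch of what producing a mixture like that can look like with mainline llama-quantize's per-tensor overrides (illustrative, not my exact script; the patterns are regexes matched against tensor names and may need adjusting per model):

```bash
# Q8_0 as the default type, with the routed-expert FFN tensors overridden
# to Q4_K (up/gate) and Q5_K (down). Pinning the SSM tensors to F32 would
# be an extra override on top of this, omitted here for brevity.
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    --tensor-type ffn_up_exps=q4_k \
    --tensor-type ffn_gate_exps=q4_k \
    --tensor-type ffn_down_exps=q5_k \
    Qwen3.5-397B-A17B-BF16.gguf Qwen3.5-397B-A17B-Q4_K_M.gguf Q8_0
```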
TG speed should mostly be dominated by memory bandwidth, not compute, so I'm a bit confused as to why my quants are slower to run; the overall size is comparable. I need to download e.g. unsloth's Q4_K_M and do a tensor-by-tensor quantization comparison at some point to see if I can narrow down the culprit.
@Trilogix1 I did look into it a bit last night and it turns out their Q4_K_M is nearly identical to mine with three differences:
- I use the fused Up + Gate and they use the normal split `ffn_up_exps` and `ffn_gate_exps`.
- For the `output` tensor I use Q8_0 and they use Q6_K.
- For the `ssm_alpha` and `ssm_beta` tensors I use F32 and they use Q8_0.
I wasn't aware before looking into this that they've adopted @ddh0 's and my MoE-quantization schema. I thought their non-UD quants were the regular llama.cpp variety but they've changed it up quietly.
The overall size change due to those three tensors (`output`, `ssm_alpha`, and `ssm_beta`) is less than 1.5 GiB, so it's not a size difference. It might be due to the fused Up + Gate. What backend are you running this on? I know that CUDA has some supported ops that were added for that fused tensor, and I'm wondering if another backend is falling back to some default op instead.
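If you want to check which layout (and which tensor types) a given GGUF actually has, dumping its tensor list is enough. A sketch using the gguf-dump script that ships with the gguf Python package (file name illustrative):

```bash
# A fused file shows a combined up+gate expert tensor, while a split file
# shows separate ffn_up_exps / ffn_gate_exps tensors; types print alongside.
pip install gguf
gguf-dump model-Q4_K_M.gguf | grep -i -E "ffn|ssm|output"
```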
> still 3-4 times slower than the average quantization
Are you talking about prompt processing or token generation speed here?
In general PP is compute bottlenecked and TG is memory bandwidth bottlenecked.
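As a rough back-of-envelope (assuming the A6000's ~768 GB/s peak memory bandwidth from the details below, and that TG mostly streams the ~3B active parameters of this A3B MoE per token): 19.788 GiB × 3/35 ≈ 1.8 GB of weights read per token, so the pure bandwidth ceiling is on the order of 768 / 1.8 ≈ 420 t/s. The ~123 t/s measured below is the right order of magnitude once attention, the dense shared tensors, and KV-cache reads are added on top; nothing about it suggests a compute bottleneck.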
I did a quick 3-way comparison with some of my test quants that are identical except for the ssm_(alpha|beta) tensors are either:
- q8_0
- bf16 (the native original tensor type)
- f32 (upcast just for speed purposes)
I did not test f16, as that is a downcast that could lead to clipping issues (f16's 5-bit exponent has far less dynamic range than bf16's 8-bit exponent) and should be avoided in almost every case.
tl;dr:
They are all very similar in speed and quality as measured by PPL and KLD. There may still be advantages to using bf16 or f32 for those tensors at long context, but I am not really sure, and it would require some other kinds of benchmarks.
I'd have to see your full commands, but I don't see any reason why AesSedai's quant should be 3-4 times slower at all.
👈 Details
title: "ik_llama.cpp ssm_(alpha|beta) comparison"
subtitle: "ubergarm/Qwen3.5-35B-A3B-GGUF IQ4_KS ssm variants"
hardware: "1x RTX A6000 48GB VRAM Driver: 580.105.08 CUDA: 13.0"
q8_0 19.788 GiB (4.904 BPW), PPL: 6.5443 ± 0.04165, meanKLD: 0.005151 ± 0.000088
model=/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-IQ4_KS-q8_0.gguf
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
--model "$model" \
-c 135168 \
-ngl 999 \
-ub 4096 -b 4096 \
--merge-qkv \
-muge \
--threads 1 \
--no-mmap \
-n 128 \
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 0.753 | 5441.16 | 1.041 | 123.01 |
| 4096 | 128 | 4096 | 0.796 | 5144.24 | 1.057 | 121.06 |
| 4096 | 128 | 8192 | 0.836 | 4900.19 | 1.082 | 118.33 |
| 4096 | 128 | 12288 | 0.881 | 4647.00 | 1.112 | 115.07 |
| 4096 | 128 | 16384 | 0.923 | 4438.19 | 1.122 | 114.04 |
| 4096 | 128 | 20480 | 0.971 | 4220.20 | 1.143 | 111.95 |
| 4096 | 128 | 24576 | 1.016 | 4030.06 | 1.162 | 110.16 |
| 4096 | 128 | 28672 | 1.065 | 3846.26 | 1.176 | 108.87 |
| 4096 | 128 | 32768 | 1.111 | 3686.82 | 1.207 | 106.05 |
| 4096 | 128 | 36864 | 1.163 | 3522.38 | 1.215 | 105.37 |
| 4096 | 128 | 40960 | 1.203 | 3404.74 | 1.231 | 104.02 |
| 4096 | 128 | 45056 | 1.257 | 3258.70 | 1.255 | 101.96 |
| 4096 | 128 | 49152 | 1.298 | 3155.06 | 1.267 | 101.00 |
| 4096 | 128 | 53248 | 1.347 | 3040.32 | 1.285 | 99.62 |
| 4096 | 128 | 57344 | 1.398 | 2930.81 | 1.306 | 98.03 |
| 4096 | 128 | 61440 | 1.453 | 2819.59 | 1.320 | 97.01 |
| 4096 | 128 | 65536 | 1.506 | 2719.62 | 1.345 | 95.14 |
| 4096 | 128 | 69632 | 1.549 | 2645.00 | 1.359 | 94.17 |
| 4096 | 128 | 73728 | 1.583 | 2587.48 | 1.374 | 93.16 |
| 4096 | 128 | 77824 | 1.642 | 2494.95 | 1.398 | 91.56 |
| 4096 | 128 | 81920 | 1.696 | 2415.03 | 1.410 | 90.76 |
| 4096 | 128 | 86016 | 1.737 | 2357.86 | 1.435 | 89.18 |
| 4096 | 128 | 90112 | 1.813 | 2259.35 | 1.446 | 88.52 |
| 4096 | 128 | 94208 | 1.861 | 2201.32 | 1.462 | 87.57 |
| 4096 | 128 | 98304 | 1.908 | 2146.52 | 1.485 | 86.22 |
| 4096 | 128 | 102400 | 1.956 | 2094.36 | 1.499 | 85.39 |
| 4096 | 128 | 106496 | 2.018 | 2030.22 | 1.516 | 84.46 |
| 4096 | 128 | 110592 | 2.058 | 1989.93 | 1.537 | 83.26 |
| 4096 | 128 | 114688 | 2.110 | 1941.09 | 1.552 | 82.49 |
| 4096 | 128 | 118784 | 2.183 | 1876.62 | 1.579 | 81.06 |
| 4096 | 128 | 122880 | 2.195 | 1866.29 | 1.587 | 80.64 |
| 4096 | 128 | 126976 | 2.295 | 1784.79 | 1.604 | 79.78 |
| 4096 | 128 | 131072 | 2.283 | 1794.30 | 1.630 | 78.54 |
bf16 19.792 GiB (4.905 BPW), PPL: 6.5446 ± 0.04164, meanKLD: 0.005060 ± 0.000079
model=/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-IQ4_KS-bf16.gguf
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
--model "$model" \
-c 135168 \
-ngl 999 \
-ub 4096 -b 4096 \
--merge-qkv \
-muge \
--threads 1 \
--no-mmap \
-n 128 \
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 0.775 | 5288.56 | 1.081 | 118.36 |
| 4096 | 128 | 4096 | 0.815 | 5024.72 | 1.100 | 116.39 |
| 4096 | 128 | 8192 | 0.856 | 4784.39 | 1.122 | 114.09 |
| 4096 | 128 | 12288 | 0.899 | 4555.86 | 1.154 | 110.95 |
| 4096 | 128 | 16384 | 0.947 | 4326.82 | 1.164 | 109.99 |
| 4096 | 128 | 20480 | 0.992 | 4130.13 | 1.183 | 108.21 |
| 4096 | 128 | 24576 | 1.035 | 3957.18 | 1.207 | 106.04 |
| 4096 | 128 | 28672 | 1.082 | 3785.92 | 1.219 | 105.00 |
| 4096 | 128 | 32768 | 1.125 | 3640.17 | 1.249 | 102.45 |
| 4096 | 128 | 36864 | 1.177 | 3478.59 | 1.254 | 102.06 |
| 4096 | 128 | 40960 | 1.225 | 3344.43 | 1.270 | 100.78 |
| 4096 | 128 | 45056 | 1.257 | 3259.48 | 1.295 | 98.85 |
| 4096 | 128 | 49152 | 1.301 | 3149.43 | 1.306 | 98.05 |
| 4096 | 128 | 53248 | 1.360 | 3012.77 | 1.324 | 96.69 |
| 4096 | 128 | 57344 | 1.398 | 2930.15 | 1.345 | 95.15 |
| 4096 | 128 | 61440 | 1.446 | 2833.20 | 1.358 | 94.24 |
| 4096 | 128 | 65536 | 1.495 | 2740.14 | 1.385 | 92.45 |
| 4096 | 128 | 69632 | 1.525 | 2685.25 | 1.395 | 91.76 |
| 4096 | 128 | 73728 | 1.578 | 2595.47 | 1.409 | 90.85 |
| 4096 | 128 | 77824 | 1.626 | 2519.06 | 1.432 | 89.40 |
| 4096 | 128 | 81920 | 1.678 | 2440.45 | 1.445 | 88.59 |
| 4096 | 128 | 86016 | 1.734 | 2362.33 | 1.470 | 87.08 |
| 4096 | 128 | 90112 | 1.776 | 2306.39 | 1.482 | 86.37 |
| 4096 | 128 | 94208 | 1.840 | 2226.49 | 1.501 | 85.27 |
| 4096 | 128 | 98304 | 1.887 | 2170.59 | 1.525 | 83.93 |
| 4096 | 128 | 102400 | 1.930 | 2122.18 | 1.537 | 83.26 |
| 4096 | 128 | 106496 | 1.990 | 2057.95 | 1.553 | 82.44 |
| 4096 | 128 | 110592 | 2.011 | 2037.19 | 1.573 | 81.39 |
| 4096 | 128 | 114688 | 2.099 | 1951.78 | 1.585 | 80.76 |
| 4096 | 128 | 118784 | 2.141 | 1912.93 | 1.618 | 79.10 |
| 4096 | 128 | 122880 | 2.184 | 1875.07 | 1.623 | 78.86 |
| 4096 | 128 | 126976 | 2.210 | 1853.03 | 1.641 | 78.01 |
| 4096 | 128 | 131072 | 2.318 | 1767.33 | 1.664 | 76.92 |
f32 19.799 GiB (4.907 BPW), PPL: 6.5434 ± 0.04164, meanKLD: 0.005096 ± 0.000083
model=/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-IQ4_KS-f32.gguf
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
--model "$model" \
-c 135168 \
-ngl 999 \
-ub 4096 -b 4096 \
--merge-qkv \
-muge \
--threads 1 \
--no-mmap \
-n 128 \
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 0.769 | 5327.26 | 1.064 | 120.35 |
| 4096 | 128 | 4096 | 0.809 | 5064.19 | 1.085 | 117.94 |
| 4096 | 128 | 8192 | 0.851 | 4812.13 | 1.108 | 115.49 |
| 4096 | 128 | 12288 | 0.899 | 4558.38 | 1.140 | 112.24 |
| 4096 | 128 | 16384 | 0.940 | 4358.92 | 1.149 | 111.40 |
| 4096 | 128 | 20480 | 0.982 | 4172.42 | 1.168 | 109.62 |
| 4096 | 128 | 24576 | 1.028 | 3985.75 | 1.191 | 107.47 |
| 4096 | 128 | 28672 | 1.070 | 3827.01 | 1.208 | 105.96 |
| 4096 | 128 | 32768 | 1.117 | 3668.45 | 1.236 | 103.55 |
| 4096 | 128 | 36864 | 1.163 | 3522.86 | 1.242 | 103.08 |
| 4096 | 128 | 40960 | 1.206 | 3395.78 | 1.258 | 101.77 |
| 4096 | 128 | 45056 | 1.250 | 3276.34 | 1.283 | 99.78 |
| 4096 | 128 | 49152 | 1.291 | 3173.51 | 1.296 | 98.78 |
| 4096 | 128 | 53248 | 1.340 | 3055.86 | 1.312 | 97.53 |
| 4096 | 128 | 57344 | 1.388 | 2951.63 | 1.332 | 96.10 |
| 4096 | 128 | 61440 | 1.432 | 2859.72 | 1.346 | 95.11 |
| 4096 | 128 | 65536 | 1.473 | 2781.26 | 1.371 | 93.33 |
| 4096 | 128 | 69632 | 1.517 | 2699.84 | 1.382 | 92.65 |
| 4096 | 128 | 73728 | 1.563 | 2620.51 | 1.397 | 91.64 |
| 4096 | 128 | 77824 | 1.622 | 2525.71 | 1.420 | 90.13 |
| 4096 | 128 | 81920 | 1.666 | 2458.05 | 1.433 | 89.35 |
| 4096 | 128 | 86016 | 1.715 | 2388.10 | 1.458 | 87.77 |
| 4096 | 128 | 90112 | 1.770 | 2314.58 | 1.470 | 87.08 |
| 4096 | 128 | 94208 | 1.812 | 2261.07 | 1.484 | 86.24 |
| 4096 | 128 | 98304 | 1.850 | 2213.65 | 1.514 | 84.54 |
| 4096 | 128 | 102400 | 1.906 | 2149.24 | 1.525 | 83.91 |
| 4096 | 128 | 106496 | 1.953 | 2097.29 | 1.539 | 83.18 |
| 4096 | 128 | 110592 | 1.994 | 2054.43 | 1.562 | 81.95 |
| 4096 | 128 | 114688 | 2.034 | 2013.83 | 1.575 | 81.26 |
| 4096 | 128 | 118784 | 2.082 | 1967.48 | 1.601 | 79.95 |
| 4096 | 128 | 122880 | 2.123 | 1929.65 | 1.609 | 79.55 |
| 4096 | 128 | 126976 | 2.185 | 1874.56 | 1.625 | 78.75 |
| 4096 | 128 | 131072 | 2.237 | 1831.30 | 1.652 | 77.50 |
> What backend are you running this on?
His own, obviously. (I noticed “HugstonOne Enterprise Edition 1.0.9” being mentioned and felt compelled to investigate.😀)
Wow, I missed a lot, I see (I thought I closed the discussion and got caught up with work, so...).
Thanks to everyone for the reply.
@AesSedai
> I thought their non-UD quants were the regular llama.cpp variety but they've changed it up quietly.

CSI at work LOL, you keep nailing it.
What's puzzling me is that AutoRound (Intel), which does kind of the same thing, runs fine, on par with llama.cpp, once converted and quantized, while keeping the 8 bpw quality. I haven't tested yet which of you has higher precision, but both your quants are better than average.
@Ubergarm , @Maxxim69 , basically I am running on llama.cpp; I tried both the Hugstonized backend and simply the latest available build of llama.cpp. It should be slower if we're using 8 bpw inside a q5, right? Do you guys get the same inference speed? I am confused.
Here's the command: C:---------------------------hugstonone\resources\app.asar.unpacked\runtimes\gpu\hugston-cli.exe --model C:----------------------------\qwen3.5\tested-AesSedaiQwen3.5-35B-A3B-GGUF\qwen3.5 35b q5\Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf --threads 28 --ctx-size 64000 --n-gpu-layers -1 --batch-size 512 --n-predict -1 --no-warmup --cache-type-k q8_0
ggml_cuda_init: found x CUDA devices:
It is always easy for me to test TG, so here is the 35b q5_k_m, same prompt, query, and parameters, output less than 100 tokens:
AesSedai quant: [ Prompt: 19.6 t/s | Generation: 24.7 t/s ]
Unsloth q5_k_m: [ Prompt: 23.0 t/s | Generation: 25.5 t/s ]
Then we test the heavy weights, Qwen3.5 397b q4_k_m. Here we need more output because it is still loading while printing output. Once fully loaded/offloaded, the results:
AesSedai quants, 244 GB Q4_K_M
So what am I doing wrong? How do you load it?
> So what am I doing wrong? How do you load it?
I have an example build and run command for ik_llama.cpp here: https://huggingface.co/ubergarm/Qwen3.5-35B-A3B-GGUF#quick-start
If you have 2+ GPUs, on ik_llama.cpp you can use `-sm graph`.
On mainline, drop `--merge-qkv` and `-muge` and use one of AesSedai's new pre-fused GGUFs.
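For example, a minimal mainline invocation for the pre-fused file you already have might look like this (a sketch; adjust paths and context to your setup):

```bash
# Mainline llama.cpp: no --merge-qkv / -muge needed, since the fusion is
# already baked into the pre-fused GGUF itself.
./build/bin/llama-cli \
    --model Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf \
    --ctx-size 64000 \
    --n-gpu-layers 999
```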
Cheers!
> On mainline, drop `--merge-qkv` and `-muge` and use one of AesSedai's new pre-fused GGUFs.
I will try asap :)
Cheers!
@Trilogix1 turns out the issue IS the fused up + gate. I re-quanted the Q5_K_M from both a fused and an unfused base: when the model is entirely in VRAM you get identical TG and a ~10% PP boost, but as soon as the model is split across VRAM + RAM, TG takes a huge dip. I'll be updating my Qwen3.5 repos to include both fused and unfused quants over the next couple of days, so people can pick whichever works for their setup.
I filed a ticket about this issue and am17an provided a patch that looks like it fixes it: https://github.com/ggml-org/llama.cpp/issues/20883#issuecomment-4109411761
So there should be a PR coming from him in the near future and that should bring the fused mixed-offloading TG performance back up.
@AesSedai amazing work, this will change everything.
Seems it is done: "am17an closed this as completed in #20910, 6 hours ago".
I have to try it; if this works, it will beat Intel AutoRound on speed while keeping the high quality.
That means we can now run proprietary-quality model weights on consumer hardware at a decent speed.
Did you receive an offer yet from OpenAI or Anthropic? :))) You will be off the market soon, I guess.
I wonder if you have the time to create a repo with a good readme for the entire method!
Many thanks.
Edit: Tried it and it works. With the latest build I got 5.1 t/s according to my counter, or [ Prompt: 3.2 t/s | Generation: 4.7 t/s ] according to the llama.cpp counter: a ~3x inference speed improvement, and no need to update the weights.


