Dataset Imatrix

#7
by Trilogix1 - opened

Hi, great job btw with the weights, but it is still 3-4 times slower than the average quantization; we're trading speed for accuracy.
Which dataset did you use for the imatrix file (is it a coding one or general knowledge)?

Hi, I wonder if that's because the rest of the model is in Q8. Usually you're memory bandwidth bound though so I wouldn't expect it to be that drastically slower.

These have been updated with the fused up+gate tensors and maybe your backend doesn't have optimizations for that yet?

Regarding the imatrix, I use @bartowski 's calibration_datav5 which is a general purpose dataset.

It may certainly be because of the 8bpw, so basically your method gives the user higher precision (nearly FP) on cheaper hardware.
It would be great to find a way to increase the speed (I will see if I can do anything about it). This is the right direction, you nailed it.
My backend is updated daily, so that's not the issue. Care to share how you load it (flags and command line)?
If you could also share your method scripts, that would be awesome, a huge contribution to me and to the community (the method already shared is a bit unclear).

Thanks in advance.

> Hi, great job btw with the weights, but it is still 3-4 times slower than the average quantization; we're trading speed for accuracy.
> Which dataset did you use for the imatrix file (is it a coding one or general knowledge)?

Care to elaborate:
A) What kind of rig do you have?
B) Which Qwen3.5-397B-A17B version are you talking about here being 3-4 times slower, and versus what?

> Care to elaborate:
> A) What kind of rig do you have?
> B) Which Qwen3.5-397B-A17B version are you talking about here being 3-4 times slower, and versus what?

Sure. Independently of the hardware used, the relative difference in inference speed should remain constant, given the same models, same loading parameters, and same backend.
I am using HugstonOne Enterprise Edition 1.0.9 with the latest qwen3.5-397B-q4_k_m (updated 4 days ago by AesSedai), which runs at around 1 tps, compared to other q4_k_m quants (my own or those available on Hugging Face) which run at 3-5 tps. Unsloth runs faster, but in my tests I got mostly lower accuracy, which is fine for general purposes but useless to us because we use it for very high precision tasks. That's why AesSedai offers a good solution, keeping compression while conserving accuracy.
What's left, obviously, is to understand the method better so we can possibly optimize the inference speed. Hopefully AesSedai will be willing to share his method here.
Edit: if you have any suggestions or see that I am missing something, you are welcome to reply :)

@Trilogix1 I've shared my method previously, it basically looks like what I described here: https://huggingface.co/ubergarm/Qwen3.5-122B-A10B-GGUF/discussions/2#699e4ad09b7edc25c095a595

If you look at the table on the model card for this model, I have a "mixture" column which uses Default Type / FFN Up / FFN Gate / FFN Down notation. So you can see the Q4_K_M uses Q8_0 as the default type (so that's shared expert, attention, etc.), and I use Q4_K for both the FFN Up and FFN Gate tensors and Q5_K for the FFN Down tensors. I also use F32 for most of the SSM tensors because 1) they're tiny, and 2) they seem very prone to quantization error based on some feedback I've received.
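For anyone wanting to try a similar mixture themselves, here is a hedged sketch using mainline llama.cpp's `llama-quantize` with per-tensor overrides. This is not necessarily AesSedai's exact script: the `--tensor-type` override flag exists only in reasonably recent llama.cpp builds, the accepted patterns/types may differ by version, and all file names below are placeholders.

```shell
# Sketch: build a Q4_K_M-style mix with per-tensor type overrides.
# Positional args: input GGUF, output GGUF, default type (here Q8_0).
# --tensor-type takes a name pattern and a quant type; verify support
# in your build with `llama-quantize --help` before relying on it.
./build/bin/llama-quantize \
  --imatrix imatrix.dat \
  --tensor-type 'ffn_up_exps=q4_k' \
  --tensor-type 'ffn_gate_exps=q4_k' \
  --tensor-type 'ffn_down_exps=q5_k' \
  model-bf16.gguf model-mix.gguf q8_0
```

Keeping the tiny, quantization-sensitive tensors (e.g. SSM tensors) at full precision costs almost nothing in file size, which is why mixtures like this can stay close to Q4_K_M's footprint.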

TG speed should mostly be dominated by memory bandwidth, not compute usually, so I'm a bit confused as to why my quants are slower to run. The overall size is comparable. I need to download eg unsloth's Q4_K_M and do a tensor-quantization comparison and see if I can narrow down the culprit at some point.

@AesSedai
> download eg unsloth's Q4_K_M and do a tensor-quantization comparison and see if I can narrow down the culprit at some point.

It's definitely worth checking, I see it as a game changer with great benefits.

Trilogix1 changed discussion status to closed

@Trilogix1 I did look into it a bit last night and it turns out their Q4_K_M is nearly identical to mine with three differences:

  1. I use the fused Up + Gate and they use the normal split ffn_up_exps and ffn_gate_exps
  2. For the output tensor I use Q8_0 and they use Q6_K
  3. For the ssm_alpha and ssm_beta tensors I use F32 and they use Q8_0.

I wasn't aware before looking into this that they've adopted @ddh0 's and my MoE-quantization schema. I thought their non-UD quants were the regular llama.cpp variety but they've changed it up quietly.

The overall size change due to those three tensors (output, ssm_alpha, and ssm_beta) is less than 1.5GiB so it's not a size difference. It might be due to the fused Up + Gate. What backend are you running this on? Because I know that CUDA has some supported ops that were added for that fused tensor and I'm wondering if another backend is falling back to some default op instead.

@Trilogix1

> still 3-4 times slower than the average quantization

Are you talking about prompt processing or token generation speed here?

In general PP is compute bottlenecked and TG is memory bandwidth bottlenecked.
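The bandwidth-bound TG regime gives a simple back-of-envelope ceiling: tokens/s is roughly memory bandwidth divided by the bytes read per generated token, which for a MoE model is mostly the *active* parameters rather than the full file. The numbers below are illustrative assumptions, not measurements:

```python
# Back-of-envelope TG ceiling for a bandwidth-bound MoE model.
# Real TG is lower: KV-cache reads, attention compute, and kernel
# overhead all eat into this ideal number.

def tg_upper_bound(bandwidth_gbs: float, active_params_b: float, bpw: float) -> float:
    """Ideal tokens/s: bandwidth (GB/s) / active-parameter bytes per token."""
    bytes_per_token = active_params_b * 1e9 * bpw / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

# e.g. ~3B active params at ~4.9 bits/weight on a ~768 GB/s GPU:
print(round(tg_upper_bound(768, 3.0, 4.9)), "tokens/s ceiling")
```

This is also why two quants of comparable size should have comparable TG: if they don't, something other than raw bandwidth (an unoptimized op, a fallback path) is likely the culprit.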

I did a quick 3-way comparison with some of my test quants that are identical except for the ssm_(alpha|beta) tensors are either:

  • q8_0
  • bf16 (the native original tensor type)
  • f32 (upcast just for speed purposes)

I did not test f16, as that is a downcast that could lead to clipping issues and should be avoided in almost every case.
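The clipping risk comes from the exponent ranges: bf16 shares float32's 8-bit exponent, while f16 tops out around 65504, so large bf16 values overflow to infinity when downcast. A quick NumPy check (assuming NumPy is available):

```python
import numpy as np

# bf16 has float32's exponent range, so it can hold values far beyond
# float16's maximum (~65504). Downcasting such a value overflows to inf,
# which is the "clipping" failure mode described above.
big = np.float32(1e6)      # representable at bf16's exponent range
as_f16 = np.float16(big)   # downcast: overflows
print(as_f16)              # inf

# Values inside f16's range survive, with only precision loss:
ok = np.float16(np.float32(123.456))
print(ok)
```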

tl;dr:

They are all very similar in speed and quality as measured by PPL and KLD. There may still be advantages to using bf16 or f32 for those tensors at long context, but I am not really sure; it would require some other kinds of benchmarks.
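Plugging the reported PPL means and standard errors into a quick check confirms the "very similar" claim: every pairwise difference is far smaller than the combined standard errors, so the three variants are statistically indistinguishable on this benchmark.

```python
import math

# Reported perplexities (mean, stderr) from the three runs above:
ppl = {
    "q8_0": (6.5443, 0.04165),
    "bf16": (6.5446, 0.04164),
    "f32":  (6.5434, 0.04164),
}

def overlaps(a, b):
    """True if |mean difference| is within the combined standard errors."""
    (ma, ea), (mb, eb) = a, b
    return abs(ma - mb) <= math.hypot(ea, eb)

# Max pairwise difference is 0.0012 vs a combined stderr of ~0.059:
pairs = [("q8_0", "bf16"), ("q8_0", "f32"), ("bf16", "f32")]
for x, y in pairs:
    print(x, "vs", y, "-> overlap:", overlaps(ppl[x], ppl[y]))
```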

I'd have to see your full commands, but I don't see any reason why AesSedai's quant should be 3-4 times slower at all.

sweep-bench-Qwen3.5-35B-A3B-ssm-test-merged

👈 Details

title: "ik_llama.cpp ssm_(alpha|beta) comparison"
subtitle: "ubergarm/Qwen3.5-35B-A3B-GGUF IQ4_KS ssm variants"
hardware: "1x RTX A6000 48GB VRAM Driver: 580.105.08 CUDA: 13.0"

q8_0 19.788 GiB (4.904 BPW), PPL: 6.5443 +/- 0.04165, meanKLD: 0.005151 ± 0.000088

model=/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-IQ4_KS-q8_0.gguf
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --merge-qkv \
  -muge \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 128 0 0.753 5441.16 1.041 123.01
4096 128 4096 0.796 5144.24 1.057 121.06
4096 128 8192 0.836 4900.19 1.082 118.33
4096 128 12288 0.881 4647.00 1.112 115.07
4096 128 16384 0.923 4438.19 1.122 114.04
4096 128 20480 0.971 4220.20 1.143 111.95
4096 128 24576 1.016 4030.06 1.162 110.16
4096 128 28672 1.065 3846.26 1.176 108.87
4096 128 32768 1.111 3686.82 1.207 106.05
4096 128 36864 1.163 3522.38 1.215 105.37
4096 128 40960 1.203 3404.74 1.231 104.02
4096 128 45056 1.257 3258.70 1.255 101.96
4096 128 49152 1.298 3155.06 1.267 101.00
4096 128 53248 1.347 3040.32 1.285 99.62
4096 128 57344 1.398 2930.81 1.306 98.03
4096 128 61440 1.453 2819.59 1.320 97.01
4096 128 65536 1.506 2719.62 1.345 95.14
4096 128 69632 1.549 2645.00 1.359 94.17
4096 128 73728 1.583 2587.48 1.374 93.16
4096 128 77824 1.642 2494.95 1.398 91.56
4096 128 81920 1.696 2415.03 1.410 90.76
4096 128 86016 1.737 2357.86 1.435 89.18
4096 128 90112 1.813 2259.35 1.446 88.52
4096 128 94208 1.861 2201.32 1.462 87.57
4096 128 98304 1.908 2146.52 1.485 86.22
4096 128 102400 1.956 2094.36 1.499 85.39
4096 128 106496 2.018 2030.22 1.516 84.46
4096 128 110592 2.058 1989.93 1.537 83.26
4096 128 114688 2.110 1941.09 1.552 82.49
4096 128 118784 2.183 1876.62 1.579 81.06
4096 128 122880 2.195 1866.29 1.587 80.64
4096 128 126976 2.295 1784.79 1.604 79.78
4096 128 131072 2.283 1794.30 1.630 78.54

bf16 19.792 GiB (4.905 BPW), PPL: 6.5446 +/- 0.04164, meanKLD: 0.005060 ± 0.000079

model=/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-IQ4_KS-bf16.gguf
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --merge-qkv \
  -muge \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 128 0 0.775 5288.56 1.081 118.36
4096 128 4096 0.815 5024.72 1.100 116.39
4096 128 8192 0.856 4784.39 1.122 114.09
4096 128 12288 0.899 4555.86 1.154 110.95
4096 128 16384 0.947 4326.82 1.164 109.99
4096 128 20480 0.992 4130.13 1.183 108.21
4096 128 24576 1.035 3957.18 1.207 106.04
4096 128 28672 1.082 3785.92 1.219 105.00
4096 128 32768 1.125 3640.17 1.249 102.45
4096 128 36864 1.177 3478.59 1.254 102.06
4096 128 40960 1.225 3344.43 1.270 100.78
4096 128 45056 1.257 3259.48 1.295 98.85
4096 128 49152 1.301 3149.43 1.306 98.05
4096 128 53248 1.360 3012.77 1.324 96.69
4096 128 57344 1.398 2930.15 1.345 95.15
4096 128 61440 1.446 2833.20 1.358 94.24
4096 128 65536 1.495 2740.14 1.385 92.45
4096 128 69632 1.525 2685.25 1.395 91.76
4096 128 73728 1.578 2595.47 1.409 90.85
4096 128 77824 1.626 2519.06 1.432 89.40
4096 128 81920 1.678 2440.45 1.445 88.59
4096 128 86016 1.734 2362.33 1.470 87.08
4096 128 90112 1.776 2306.39 1.482 86.37
4096 128 94208 1.840 2226.49 1.501 85.27
4096 128 98304 1.887 2170.59 1.525 83.93
4096 128 102400 1.930 2122.18 1.537 83.26
4096 128 106496 1.990 2057.95 1.553 82.44
4096 128 110592 2.011 2037.19 1.573 81.39
4096 128 114688 2.099 1951.78 1.585 80.76
4096 128 118784 2.141 1912.93 1.618 79.10
4096 128 122880 2.184 1875.07 1.623 78.86
4096 128 126976 2.210 1853.03 1.641 78.01
4096 128 131072 2.318 1767.33 1.664 76.92

f32 19.799 GiB (4.907 BPW), PPL: 6.5434 +/- 0.04164, meanKLD: 0.005096 ± 0.000083

model=/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-IQ4_KS-f32.gguf
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --merge-qkv \
  -muge \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 128 0 0.769 5327.26 1.064 120.35
4096 128 4096 0.809 5064.19 1.085 117.94
4096 128 8192 0.851 4812.13 1.108 115.49
4096 128 12288 0.899 4558.38 1.140 112.24
4096 128 16384 0.940 4358.92 1.149 111.40
4096 128 20480 0.982 4172.42 1.168 109.62
4096 128 24576 1.028 3985.75 1.191 107.47
4096 128 28672 1.070 3827.01 1.208 105.96
4096 128 32768 1.117 3668.45 1.236 103.55
4096 128 36864 1.163 3522.86 1.242 103.08
4096 128 40960 1.206 3395.78 1.258 101.77
4096 128 45056 1.250 3276.34 1.283 99.78
4096 128 49152 1.291 3173.51 1.296 98.78
4096 128 53248 1.340 3055.86 1.312 97.53
4096 128 57344 1.388 2951.63 1.332 96.10
4096 128 61440 1.432 2859.72 1.346 95.11
4096 128 65536 1.473 2781.26 1.371 93.33
4096 128 69632 1.517 2699.84 1.382 92.65
4096 128 73728 1.563 2620.51 1.397 91.64
4096 128 77824 1.622 2525.71 1.420 90.13
4096 128 81920 1.666 2458.05 1.433 89.35
4096 128 86016 1.715 2388.10 1.458 87.77
4096 128 90112 1.770 2314.58 1.470 87.08
4096 128 94208 1.812 2261.07 1.484 86.24
4096 128 98304 1.850 2213.65 1.514 84.54
4096 128 102400 1.906 2149.24 1.525 83.91
4096 128 106496 1.953 2097.29 1.539 83.18
4096 128 110592 1.994 2054.43 1.562 81.95
4096 128 114688 2.034 2013.83 1.575 81.26
4096 128 118784 2.082 1967.48 1.601 79.95
4096 128 122880 2.123 1929.65 1.609 79.55
4096 128 126976 2.185 1874.56 1.625 78.75
4096 128 131072 2.237 1831.30 1.652 77.50

> What backend are you running this on?

His own, obviously. (I noticed “HugstonOne Enterprise Edition 1.0.9” being mentioned and felt compelled to investigate.😀)

@Maxxim69

Interesting, haha... I see at the bottom of that GitHub README that it bundles llama.cpp, but I didn't check whether there's a pinned version via submodule etc...

so it's llama.cpp under the hood :gucci:

Wow, I missed a lot, I see (I thought I had closed the discussion and got caught up with work, so...).
Thanks to everyone for the replies.

@AesSedai
> I thought their non-UD quants were the regular llama.cpp variety but they've changed it up quietly.

CSI at work LOL, you keep nailing it.

What's puzzling me is that AutoRound (Intel), which does kind of the same thing, runs fine on par with llama.cpp once converted and quantized, keeping the 8bpw quality. I haven't tested yet which of you has higher precision, but both your quants are better than average.

@Ubergarm , @Maxxim69 , basically I am running on llama.cpp; I tried both the Hugstonized backend and the latest available build of llama.cpp. It should be slower if we're using 8bpw in Q5, right? Do you guys get the same inference speed? I am confused.

Here is the command: C:---------------------------hugstonone\resources\app.asar.unpacked\runtimes\gpu\hugston-cli.exe --model C:----------------------------\qwen3.5\tested-AesSedaiQwen3.5-35B-A3B-GGUF\qwen3.5 35b q5\Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf --threads 28 --ctx-size 64000 --n-gpu-layers -1 --batch-size 512 --n-predict -1 --no-warmup --cache-type-k q8_0
ggml_cuda_init: found x CUDA devices:
It is easy for me to always test TG, so here is the 35b q5_k_m with the same prompt, query, and parameters, output less than 100 tokens: AesSedai quant results [ Prompt: 19.6 t/s | Generation: 24.7 t/s ]
Unsloth q5_k_m [ Prompt: 23.0 t/s | Generation: 25.5 t/s ]
Then we test the heavy weights, Qwen3.5 397b q4_k_m. Here we need more output because it is still loading while printing output. Once fully loaded/offloaded, the results:
AesSedai quants 244gb Q4K-M

Unsloth quants 245gb Q4_k_xl

So what am I doing wrong? How do you load it?

Trilogix1 changed discussion status to open

> So what am I doing wrong? How do you load it?

I have an example build and run command for ik_llama.cpp here: https://huggingface.co/ubergarm/Qwen3.5-35B-A3B-GGUF#quick-start

If you have 2+ GPUs on ik_llama.cpp you can use -sm graph.

On mainline drop --merge-qkv -muge and use one of AesSedai's new pre-fused GGUFs.
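For concreteness, a hedged sketch of what a mainline llama.cpp invocation might look like without the ik_llama.cpp-only flags; the model path is a placeholder and the context/batch values are just examples, not a recommendation:

```shell
# Mainline llama.cpp sketch: no --merge-qkv / -muge (those are
# ik_llama.cpp flags), pointed at one of the pre-fused GGUFs.
./build/bin/llama-server \
  --model Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf \
  -c 65536 \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --no-mmap
```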

Cheers!

> On mainline drop --merge-qkv -muge and use one of AesSedai's new pre-fused GGUFs.

I will try asap :)

Cheers!

@Trilogix1 turns out the issue IS the fused up + gate. I re-quanted the Q5_K_M with a fused and an unfused base, and when the model is entirely in VRAM you get identical TG and a 10% PP boost. But as soon as you have the model split across VRAM + RAM, TG takes a huge dip. I'll be updating my Qwen3.5 repos to include both fused and unfused quants over the next couple of days, so people can pick whichever works for their setup.

image

I filed a ticket about this issue and am17an provided a patch that looks like it fixes it: https://github.com/ggml-org/llama.cpp/issues/20883#issuecomment-4109411761

So there should be a PR coming from him in the near future and that should bring the fused mixed-offloading TG performance back up.

image

@AesSedai amazing work, this will change everything.
Seems it's done: "am17an closed this as completed in #20910 6 hours ago"

I have to try it; if this works, it will beat Intel AutoRound on speed while keeping the high quality.
That means we can now run proprietary-quality model weights on consumer hardware at a decent speed.
Did you receive an offer yet from OpenAI or Anthropic? :))) You will be off the market soon, I guess.
I wonder if you have the time to create a repo with a good README for the entire method!
Many thanks.
Edit: Tried it and it works. With the latest build I got 5.1 tps according to my counter, or [ Prompt: 3.2 t/s | Generation: 4.7 t/s ] according to the llama.cpp counter: a 3x inference speed improvement, no need to update the weights:
image
