Dataset Imatrix

#7
by Trilogix1 - opened

Hi, great job btw with the weights, but it is still 3-4 times slower than the average quantization; we're trading speed for accuracy.
Which dataset did you use for the imatrix file (is it a coding one or general knowledge)?

Hi, I wonder if that's because the rest of the model is in Q8. Usually you're memory bandwidth bound though so I wouldn't expect it to be that drastically slower.

These have been updated with the fused up+gate tensors and maybe your backend doesn't have optimizations for that yet?

Regarding the imatrix, I use @bartowski 's calibration_datav5 which is a general purpose dataset.

It may certainly be because of the 8bpw, so basically your method gives the user higher precision (nearly FP) on cheaper hardware.
It would be great to find a way to increase the speed (I will see if I can do anything about it). This is the right direction, you nailed it.
My backend is updated daily, so that's not the issue. Care to share how you load it (flags and command line)?
If you could also share your method scripts, that would be awesome, a huge contribution to me and to the community (the method already shared is a bit unclear).

Thanks in advance.

> Hi, great job btw with the weights, but it is still 3-4 times slower than the average quantization; we're trading speed for accuracy.
> Which dataset did you use for the imatrix file (is it a coding one or general knowledge)?

Care to elaborate:
A) What kind of rig do you have?
B) Which Qwen3.5-397B-A17B version are you talking about here being 3-4 times slower, and versus what?

> Care to elaborate:
> A) What kind of rig do you have?
> B) Which Qwen3.5-397B-A17B version are you talking about here being 3-4 times slower, and versus what?

Sure. Independently of the hardware used, the relative difference in inference speed should remain constant, given the same models, same loading parameters, and same backend.
I am using HugstonOne Enterprise Edition 1.0.9 with the latest qwen3.5-397B-q4_k_m (updated 4 days ago by AesSedai), which runs at around 1 tps, compared to other q4_k_m quants (my own or those available on Hugging Face) which run at 3-5 tps. Unsloth runs faster, but in my tests I got mostly lower accuracy, which is fine for general purposes but useless to us because we use it for very high precision tasks. That's why AesSedai offers a good solution, keeping compression while conserving accuracy.
What's left, obviously, is to understand the method better so we can possibly optimize the inference speed. Hopefully AesSedai will be willing to share his method here.
Edit: if you have any suggestions or see that I am missing something, you are welcome to reply :)

@Trilogix1 I've shared my method previously, it basically looks like what I described here: https://huggingface.co/ubergarm/Qwen3.5-122B-A10B-GGUF/discussions/2#699e4ad09b7edc25c095a595

If you look at the table on the model card for this model, I have a "mixture" column which uses Default Type / FFN Up / FFN Gate / FFN Down notation. So you can see the Q4_K_M uses Q8_0 as the default type (so that's shared expert, attention, etc.), and I use Q4_K for both the FFN Up and FFN Gate tensors and Q5_K for the FFN Down tensors. I also use F32 for most of the SSM tensors because 1) they're tiny, and 2) they seem very prone to quantization error based on some feedback I've received.
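For anyone wanting to try a similar mixture themselves, here is a hedged sketch using mainline llama.cpp's `llama-quantize` with per-tensor overrides. This is not necessarily AesSedai's exact script: the `--tensor-type` override flag exists only in reasonably recent llama.cpp builds, the accepted patterns/types may differ by version, and all file names below are placeholders.

```shell
# Sketch: build a Q4_K_M-style mix with per-tensor type overrides.
# Positional args: input GGUF, output GGUF, default type (here Q8_0).
# --tensor-type takes a name pattern and a quant type; verify support
# in your build with `llama-quantize --help` before relying on it.
./build/bin/llama-quantize \
  --imatrix imatrix.dat \
  --tensor-type 'ffn_up_exps=q4_k' \
  --tensor-type 'ffn_gate_exps=q4_k' \
  --tensor-type 'ffn_down_exps=q5_k' \
  model-bf16.gguf model-mix.gguf q8_0
```

Keeping the tiny, quantization-sensitive tensors (e.g. SSM tensors) at full precision costs almost nothing in file size, which is why mixtures like this can stay close to Q4_K_M's footprint.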

TG speed should mostly be dominated by memory bandwidth, not compute usually, so I'm a bit confused as to why my quants are slower to run. The overall size is comparable. I need to download eg unsloth's Q4_K_M and do a tensor-quantization comparison and see if I can narrow down the culprit at some point.

@AesSedai
> download eg unsloth's Q4_K_M and do a tensor-quantization comparison and see if I can narrow down the culprit at some point.

It's definitely worth checking, I see it as a game changer with great benefits.

Trilogix1 changed discussion status to closed

@Trilogix1 I did look into it a bit last night and it turns out their Q4_K_M is nearly identical to mine with three differences:

  1. I use the fused Up + Gate and they use the normal split ffn_up_exps and ffn_gate_exps
  2. For the output tensor I use Q8_0 and they use Q6_K
  3. For the ssm_alpha and ssm_beta tensors I use F32 and they use Q8_0.

I wasn't aware before looking into this that they've adopted @ddh0 's and my MoE-quantization schema. I thought their non-UD quants were the regular llama.cpp variety but they've changed it up quietly.

The overall size change due to those three tensors (output, ssm_alpha, and ssm_beta) is less than 1.5GiB so it's not a size difference. It might be due to the fused Up + Gate. What backend are you running this on? Because I know that CUDA has some supported ops that were added for that fused tensor and I'm wondering if another backend is falling back to some default op instead.

@Trilogix1

> still 3-4 times slower than the average quantization

Are you talking about prompt processing or token generation speed here?

In general PP is compute bottlenecked and TG is memory bandwidth bottlenecked.
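The bandwidth-bound TG regime gives a simple back-of-envelope ceiling: tokens/s is roughly memory bandwidth divided by the bytes read per generated token, which for a MoE model is mostly the *active* parameters rather than the full file. The numbers below are illustrative assumptions, not measurements:

```python
# Back-of-envelope TG ceiling for a bandwidth-bound MoE model.
# Real TG is lower: KV-cache reads, attention compute, and kernel
# overhead all eat into this ideal number.

def tg_upper_bound(bandwidth_gbs: float, active_params_b: float, bpw: float) -> float:
    """Ideal tokens/s: bandwidth (GB/s) / active-parameter bytes per token."""
    bytes_per_token = active_params_b * 1e9 * bpw / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

# e.g. ~3B active params at ~4.9 bits/weight on a ~768 GB/s GPU:
print(round(tg_upper_bound(768, 3.0, 4.9)), "tokens/s ceiling")
```

This is also why two quants of comparable size should have comparable TG: if they don't, something other than raw bandwidth (an unoptimized op, a fallback path) is likely the culprit.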

I did a quick 3-way comparison with some of my test quants that are identical except for the ssm_(alpha|beta) tensors are either:

  • q8_0
  • bf16 (the native original tensor type)
  • f32 (upcast just for speed purposes)

I did not test f16, as that is a downcast that could lead to clipping issues and should be avoided in almost every case.
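The clipping risk comes from the exponent ranges: bf16 shares float32's 8-bit exponent, while f16 tops out around 65504, so large bf16 values overflow to infinity when downcast. A quick NumPy check (assuming NumPy is available):

```python
import numpy as np

# bf16 has float32's exponent range, so it can hold values far beyond
# float16's maximum (~65504). Downcasting such a value overflows to inf,
# which is the "clipping" failure mode described above.
big = np.float32(1e6)      # representable at bf16's exponent range
as_f16 = np.float16(big)   # downcast: overflows
print(as_f16)              # inf

# Values inside f16's range survive, with only precision loss:
ok = np.float16(np.float32(123.456))
print(ok)
```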

tl;dr:

They are all very similar in speed and quality as measured by PPL and KLD. There may still be advantages to using bf16 or f32 for those tensors at long context, but I am not really sure; it would require some other kinds of benchmarks.
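Plugging the reported PPL means and standard errors into a quick check confirms the "very similar" claim: every pairwise difference is far smaller than the combined standard errors, so the three variants are statistically indistinguishable on this benchmark.

```python
import math

# Reported perplexities (mean, stderr) from the three runs above:
ppl = {
    "q8_0": (6.5443, 0.04165),
    "bf16": (6.5446, 0.04164),
    "f32":  (6.5434, 0.04164),
}

def overlaps(a, b):
    """True if |mean difference| is within the combined standard errors."""
    (ma, ea), (mb, eb) = a, b
    return abs(ma - mb) <= math.hypot(ea, eb)

# Max pairwise difference is 0.0012 vs a combined stderr of ~0.059:
pairs = [("q8_0", "bf16"), ("q8_0", "f32"), ("bf16", "f32")]
for x, y in pairs:
    print(x, "vs", y, "-> overlap:", overlaps(ppl[x], ppl[y]))
```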

I'd have to see your full commands, but I don't see any reason why AesSedai's quant should be 3-4 times slower at all.

sweep-bench-Qwen3.5-35B-A3B-ssm-test-merged

👈 Details

title: "ik_llama.cpp ssm_(alpha|beta) comparison"
subtitle: "ubergarm/Qwen3.5-35B-A3B-GGUF IQ4_KS ssm variants"
hardware: "1x RTX A6000 48GB VRAM Driver: 580.105.08 CUDA: 13.0"

q8_0 19.788 GiB (4.904 BPW), PPL: 6.5443 +/- 0.04165, meanKLD: 0.005151 ± 0.000088

model=/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-IQ4_KS-q8_0.gguf
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --merge-qkv \
  -muge \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 128 0 0.753 5441.16 1.041 123.01
4096 128 4096 0.796 5144.24 1.057 121.06
4096 128 8192 0.836 4900.19 1.082 118.33
4096 128 12288 0.881 4647.00 1.112 115.07
4096 128 16384 0.923 4438.19 1.122 114.04
4096 128 20480 0.971 4220.20 1.143 111.95
4096 128 24576 1.016 4030.06 1.162 110.16
4096 128 28672 1.065 3846.26 1.176 108.87
4096 128 32768 1.111 3686.82 1.207 106.05
4096 128 36864 1.163 3522.38 1.215 105.37
4096 128 40960 1.203 3404.74 1.231 104.02
4096 128 45056 1.257 3258.70 1.255 101.96
4096 128 49152 1.298 3155.06 1.267 101.00
4096 128 53248 1.347 3040.32 1.285 99.62
4096 128 57344 1.398 2930.81 1.306 98.03
4096 128 61440 1.453 2819.59 1.320 97.01
4096 128 65536 1.506 2719.62 1.345 95.14
4096 128 69632 1.549 2645.00 1.359 94.17
4096 128 73728 1.583 2587.48 1.374 93.16
4096 128 77824 1.642 2494.95 1.398 91.56
4096 128 81920 1.696 2415.03 1.410 90.76
4096 128 86016 1.737 2357.86 1.435 89.18
4096 128 90112 1.813 2259.35 1.446 88.52
4096 128 94208 1.861 2201.32 1.462 87.57
4096 128 98304 1.908 2146.52 1.485 86.22
4096 128 102400 1.956 2094.36 1.499 85.39
4096 128 106496 2.018 2030.22 1.516 84.46
4096 128 110592 2.058 1989.93 1.537 83.26
4096 128 114688 2.110 1941.09 1.552 82.49
4096 128 118784 2.183 1876.62 1.579 81.06
4096 128 122880 2.195 1866.29 1.587 80.64
4096 128 126976 2.295 1784.79 1.604 79.78
4096 128 131072 2.283 1794.30 1.630 78.54

bf16 19.792 GiB (4.905 BPW), PPL: 6.5446 +/- 0.04164, meanKLD: 0.005060 ± 0.000079

model=/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-IQ4_KS-bf16.gguf
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --merge-qkv \
  -muge \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 128 0 0.775 5288.56 1.081 118.36
4096 128 4096 0.815 5024.72 1.100 116.39
4096 128 8192 0.856 4784.39 1.122 114.09
4096 128 12288 0.899 4555.86 1.154 110.95
4096 128 16384 0.947 4326.82 1.164 109.99
4096 128 20480 0.992 4130.13 1.183 108.21
4096 128 24576 1.035 3957.18 1.207 106.04
4096 128 28672 1.082 3785.92 1.219 105.00
4096 128 32768 1.125 3640.17 1.249 102.45
4096 128 36864 1.177 3478.59 1.254 102.06
4096 128 40960 1.225 3344.43 1.270 100.78
4096 128 45056 1.257 3259.48 1.295 98.85
4096 128 49152 1.301 3149.43 1.306 98.05
4096 128 53248 1.360 3012.77 1.324 96.69
4096 128 57344 1.398 2930.15 1.345 95.15
4096 128 61440 1.446 2833.20 1.358 94.24
4096 128 65536 1.495 2740.14 1.385 92.45
4096 128 69632 1.525 2685.25 1.395 91.76
4096 128 73728 1.578 2595.47 1.409 90.85
4096 128 77824 1.626 2519.06 1.432 89.40
4096 128 81920 1.678 2440.45 1.445 88.59
4096 128 86016 1.734 2362.33 1.470 87.08
4096 128 90112 1.776 2306.39 1.482 86.37
4096 128 94208 1.840 2226.49 1.501 85.27
4096 128 98304 1.887 2170.59 1.525 83.93
4096 128 102400 1.930 2122.18 1.537 83.26
4096 128 106496 1.990 2057.95 1.553 82.44
4096 128 110592 2.011 2037.19 1.573 81.39
4096 128 114688 2.099 1951.78 1.585 80.76
4096 128 118784 2.141 1912.93 1.618 79.10
4096 128 122880 2.184 1875.07 1.623 78.86
4096 128 126976 2.210 1853.03 1.641 78.01
4096 128 131072 2.318 1767.33 1.664 76.92

f32 19.799 GiB (4.907 BPW), PPL: 6.5434 +/- 0.04164, meanKLD: 0.005096 ± 0.000083

model=/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-IQ4_KS-f32.gguf
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --merge-qkv \
  -muge \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 128 0 0.769 5327.26 1.064 120.35
4096 128 4096 0.809 5064.19 1.085 117.94
4096 128 8192 0.851 4812.13 1.108 115.49
4096 128 12288 0.899 4558.38 1.140 112.24
4096 128 16384 0.940 4358.92 1.149 111.40
4096 128 20480 0.982 4172.42 1.168 109.62
4096 128 24576 1.028 3985.75 1.191 107.47
4096 128 28672 1.070 3827.01 1.208 105.96
4096 128 32768 1.117 3668.45 1.236 103.55
4096 128 36864 1.163 3522.86 1.242 103.08
4096 128 40960 1.206 3395.78 1.258 101.77
4096 128 45056 1.250 3276.34 1.283 99.78
4096 128 49152 1.291 3173.51 1.296 98.78
4096 128 53248 1.340 3055.86 1.312 97.53
4096 128 57344 1.388 2951.63 1.332 96.10
4096 128 61440 1.432 2859.72 1.346 95.11
4096 128 65536 1.473 2781.26 1.371 93.33
4096 128 69632 1.517 2699.84 1.382 92.65
4096 128 73728 1.563 2620.51 1.397 91.64
4096 128 77824 1.622 2525.71 1.420 90.13
4096 128 81920 1.666 2458.05 1.433 89.35
4096 128 86016 1.715 2388.10 1.458 87.77
4096 128 90112 1.770 2314.58 1.470 87.08
4096 128 94208 1.812 2261.07 1.484 86.24
4096 128 98304 1.850 2213.65 1.514 84.54
4096 128 102400 1.906 2149.24 1.525 83.91
4096 128 106496 1.953 2097.29 1.539 83.18
4096 128 110592 1.994 2054.43 1.562 81.95
4096 128 114688 2.034 2013.83 1.575 81.26
4096 128 118784 2.082 1967.48 1.601 79.95
4096 128 122880 2.123 1929.65 1.609 79.55
4096 128 126976 2.185 1874.56 1.625 78.75
4096 128 131072 2.237 1831.30 1.652 77.50

> What backend are you running this on?

His own, obviously. (I noticed “HugstonOne Enterprise Edition 1.0.9” being mentioned and felt compelled to investigate.😀)

@Maxxim69

Interesting, haha... I see at the bottom of that GitHub README that it bundles llama.cpp, but I didn't check whether there's a pinned version via submodule etc...

so it's llama.cpp under the hood :gucci:

Wow, I missed a lot, I see (I thought I had closed the discussion and got caught up with work, so...).
Thanks to everyone for the replies.

@AesSedai
> I thought their non-UD quants were the regular llama.cpp variety but they've changed it up quietly.

CSI at work LOL, you keep nailing it.

What's puzzling me is that AutoRound (Intel), which does kind of the same thing, runs fine on par with llama.cpp once converted and quantized, keeping the 8bpw quality. I haven't tested yet which of you has higher precision, but both your quants are better than average.

@Ubergarm , @Maxxim69 , basically I am running on llama.cpp; I tried both the Hugstonized backend and the latest available build of llama.cpp. It should be slower if we're using 8bpw in Q5, right? Do you guys get the same inference speed? I am confused.

Here is the command: C:---------------------------hugstonone\resources\app.asar.unpacked\runtimes\gpu\hugston-cli.exe --model C:----------------------------\qwen3.5\tested-AesSedaiQwen3.5-35B-A3B-GGUF\qwen3.5 35b q5\Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf --threads 28 --ctx-size 64000 --n-gpu-layers -1 --batch-size 512 --n-predict -1 --no-warmup --cache-type-k q8_0
ggml_cuda_init: found x CUDA devices:
It is easy for me to always test TG, so here is the 35b q5_k_m with the same prompt, query, and parameters, output less than 100 tokens: AesSedai quant results [ Prompt: 19.6 t/s | Generation: 24.7 t/s ]
Unsloth q5_k_m [ Prompt: 23.0 t/s | Generation: 25.5 t/s ]
Then we test the heavy weights, Qwen3.5 397b q4_k_m. Here we need more output because it is still loading while printing output. Once fully loaded/offloaded, the results:
AesSedai quants 244gb Q4K-M

Unsloth quants 245gb Q4_k_xl

So what am I doing wrong? How do you load it?

Trilogix1 changed discussion status to open

> So what am I doing wrong? How do you load it?

I have an example build and run command for ik_llama.cpp here: https://huggingface.co/ubergarm/Qwen3.5-35B-A3B-GGUF#quick-start

If you have 2+ GPUs on ik_llama.cpp you can use -sm graph.

On mainline drop --merge-qkv -muge and use one of AesSedai's new pre-fused GGUFs.
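For concreteness, a hedged sketch of what a mainline llama.cpp invocation might look like without the ik_llama.cpp-only flags; the model path is a placeholder and the context/batch values are just examples, not a recommendation:

```shell
# Mainline llama.cpp sketch: no --merge-qkv / -muge (those are
# ik_llama.cpp flags), pointed at one of the pre-fused GGUFs.
./build/bin/llama-server \
  --model Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf \
  -c 65536 \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --no-mmap
```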

Cheers!

> On mainline drop --merge-qkv -muge and use one of AesSedai's new pre-fused GGUFs.

I will try asap :)

Cheers!

@Trilogix1 turns out the issue IS the fused up + gate. I re-quanted the Q5_K_M with a fused and an unfused base, and when the model is entirely in VRAM you get identical TG and a 10% PP boost. But as soon as you have the model split across VRAM + RAM, TG takes a huge dip. I'll be updating my Qwen3.5 repos to include both fused and unfused quants over the next couple of days, so people can pick whichever works for their setup.

image

I filed a ticket about this issue and am17an provided a patch that looks like it fixes it: https://github.com/ggml-org/llama.cpp/issues/20883#issuecomment-4109411761

So there should be a PR coming from him in the near future and that should bring the fused mixed-offloading TG performance back up.

image

@AesSedai amazing work, this will change everything.
Seems it's done: "am17an closed this as completed in #20910 6 hours ago"

I have to try it; if this works, it will beat Intel AutoRound on speed while keeping the high quality.
That means we can now run proprietary-quality model weights on consumer hardware at a decent speed.
Did you receive an offer yet from OpenAI or Anthropic? :))) You will be off the market soon, I guess.
I wonder if you have the time to create a repo with a good README for the entire method!
Many thanks.
Edit: Tried it and it works. With the latest build I got 5.1 tps according to my counter, or [ Prompt: 3.2 t/s | Generation: 4.7 t/s ] according to the llama.cpp counter: a 3x inference speed improvement, no need to update the weights:
image
