Hybrid inference speed tests

#7
by Doctor-Shotgun - opened

I ran some tests of CPU+GPU hybrid inference on the same rig where I was running pure-CUDA GLM-4.5-Air. I took the precision hit of q8_0 k/v instead of full precision in order to fit -b 4096 -ub 4096 and a couple more layers. Here we compare Unsloth's ~3.5bpw quant (mainline/ik) with yours (ik only). With IQ3_KT we can fit one less layer on GPU due to differences in tensor sizes (at 32768 ctx, q8_0 cache); however, the sweep-bench tests are done at 16k ctx for time's sake.

Windows 11
Ryzen 9 7950X with 4x32gb DDR5-3600
RTX PRO 6000 96gb

llama.cpp baa9255a45105d2d3b4ec432af13b7a6eda3ff35
Unsloth UD-Q3_K_XL
```
./llama-sweep-bench.exe -m "C:\ML\GGUF\GLM-4.5-UD-Q3_K_XL-00001-of-00004.gguf" -c 16384 -ctk q8_0 -ctv q8_0 -ngl 999 -ot "blk\.(5[3-9]|[6-9][0-9])\.ffn_.*_exps.=CPU" -fa -b 4096 -ub 4096 -t 16
```

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 4096 | 1024 | 0 | 18.528 | 221.07 | 174.501 | 5.87 |
| 4096 | 1024 | 4096 | 9.584 | 427.38 | 130.670 | 7.84 |
| 4096 | 1024 | 8192 | 9.058 | 452.18 | 135.819 | 7.54 |
| 4096 | 1024 | 12288 | 10.036 | 408.15 | 141.941 | 7.21 |

ik_llama.cpp 23fe18ce83237879dc5e55c444de560f0ed736d5
Unsloth UD-Q3_K_XL
```
./llama-sweep-bench.exe -m "C:\ML\GGUF\GLM-4.5-UD-Q3_K_XL-00001-of-00004.gguf" -c 16384 -ctk q8_0 -ctv q8_0 -ngl 999 -ot "blk\.(5[3-9]|[6-9][0-9])\.ffn_.*_exps.=CPU" -fa -fmoe -b 4096 -ub 4096 -t 16 --warmup-batch
```

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 4096 | 1024 | 0 | 8.604 | 476.08 | 119.696 | 8.56 |
| 4096 | 1024 | 4096 | 8.487 | 482.64 | 121.051 | 8.46 |
| 4096 | 1024 | 8192 | 9.221 | 444.23 | 124.070 | 8.25 |
| 4096 | 1024 | 12288 | 9.400 | 435.77 | 124.101 | 8.25 |

Ubergarm IQ3_KT
```
./llama-sweep-bench.exe -m "C:\ML\GGUF\GLM-4.5-IQ3_KT-00001-of-00004.gguf" -c 16384 -ctk q8_0 -ctv q8_0 -ngl 999 -ot "blk\.(5[2-9]|[6-9][0-9])\.ffn_.*_exps.=CPU" -fa -fmoe -b 4096 -ub 4096 -t 16 --warmup-batch
```

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 4096 | 1024 | 0 | 8.363 | 489.76 | 133.633 | 7.66 |
| 4096 | 1024 | 4096 | 8.042 | 509.35 | 133.686 | 7.66 |
| 4096 | 1024 | 8192 | 8.913 | 459.56 | 137.884 | 7.43 |
| 4096 | 1024 | 12288 | 8.895 | 460.50 | 138.439 | 7.40 |

I think there's still some funkiness with the batch warm-up behavior on the forked llama-sweep-bench lol; the first batch on mainline llama.cpp takes a speed hit as a result.

This model should be passably performant at 4bit+ if I had a Linux server rig with 8-12 channels of DDR5-6000 lol.

Thanks again for the results! Interesting to see performance on a Zen4 rig with the "verboten 4xDIMM" configuration with that jacked GPU in the mix!

Glad to see the IQ3_KT is still able to run with many of the KT tensors on CPU/RAM, especially on Zen4! (On my 9950X I can use the experimental branch for Zen5 avx_vnni 512-bit computation flags here: https://github.com/ikawrakow/ik_llama.cpp/pull/610 which seems to help too.)

Given the similar size of the two models, my hunch is that my IQ3_KT gives better perplexity/KLD quality than a similar-size Unsloth quant (given they are restricted to the older quantization types available on mainline).

Also note that in some of my recent testing, using -ctk q8_0 -ctv q8_0 with GLM-4.5/Air slowed things down more than I would have expected. So if you want to maximize TG, you may want to forgo -ub 4096 -b 4096, consider using default batches with -rtr, and pay the VRAM price to keep the kv-cache at full f16:

(image: sweep-bench-GLM-4.5-Air-Q4_0-CUDA-graphs-plus-GQA-fix-long-context.png)

So many knobs to tune and tweak, gotta love it hahah... Thanks!

I’ll give the -rtr option a shot when I’m off work. GLM-4.5 does unfortunately have a rather fat k/v cache, as it’s a large-param model without MLA. Fewer layers would fit on the GPU if I switched to fp16 cache; not sure how that would impact performance.

And yes, this setup with 4 sticks is absolutely cursed and I couldn’t get it to POST with more than 3600 lol. Hopefully within the next couple months I’ll be moving this GPU to a new rig to avoid the Windows speed tax and the 2 channels of DDR5-3600 speed tax.

EDIT: Ah, it seems -rtr force disables mmap, which results in RAM OOMkill from the weights not fitting fully in system RAM.

EDIT2: It seems like messing with -b and -ub triggers different prompt processing behavior - perhaps GPU streaming prompt ingestion vs hybrid prompt ingestion? When I lower both to 512, the CPU fans spin up much louder during prompt ingestion (while they stay quiet at -b 4096 -ub 4096) and every batch is rather slow. At -b 4096 -ub 4096, only the first prompt is slow, and subsequent prompts are much faster to ingest.

EDIT3: Hmmm... based on https://github.com/ikawrakow/ik_llama.cpp/pull/698#issue-3327278619, it seems like the batch size has to be greater than 32 * total_experts / active_experts to trigger offloading prompt processing to the GPU, so that might be it.

Tried using the fp16 k/v cache - in this case it doesn't seem to speed things up on my end. Probably the savings from reduced dequant compute overhead are less than the speed penalty of three fewer layers on GPU.

Ubergarm IQ3_KT
```
./llama-sweep-bench.exe -m "C:\ML\GGUF\GLM-4.5-IQ3_KT-00001-of-00004.gguf" -c 16384 -ctk q8_0 -ctv q8_0 -ngl 999 -ot "blk\.(49|[5-9][0-9])\.ffn_.*_exps.=CPU" -fa -fmoe -b 4096 -ub 4096 -t 16 --warmup-batch
```

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|---:|---:|---:|---:|---:|
| 4096 | 1024 | 0 | 8.915 | 459.45 | 139.109 | 7.36 |
| 4096 | 1024 | 4096 | 9.501 | 431.10 | 140.967 | 7.26 |
| 4096 | 1024 | 8192 | 10.012 | 409.09 | 147.228 | 6.96 |
| 4096 | 1024 | 12288 | 9.561 | 428.42 | 146.117 | 7.01 |

A few other observations, some the same as the edits above:

  1. -rtr disables mmap, which fails in my case because the model is larger than my system RAM. I suppose it's possible to get around this by using a fat pagefile, or downloading more RAM :). Not sure if there would be a way to do this lazy-loaded layer by layer or something instead.
  2. Using a small -ub 512 absolutely tanks my prompt processing speed. I think in this case it's because two separate prompt processing codepaths exist - hybrid CPU+GPU prompt processing, and GPU offload prompt processing. Based on what ikawrakow says here, GPU offload prompt processing gets triggered when the prompt batch exceeds 32 * total_experts / active_experts tokens. Setting -ub 512 forces only the hybrid CPU+GPU prompt processing codepath with this model, which seems to cap out around ~70 T/s on my setup.
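To sanity-check that threshold against the batch sizes above, here's a quick sketch in Python. The expert counts are my assumption for GLM-4.5 (160 routed experts, 8 active per token), not something stated in the PR:

```python
# Rule of thumb from ikawrakow's ik_llama.cpp PR #698: GPU offload of MoE
# prompt processing is triggered when the batch exceeds
# 32 * total_experts / active_experts tokens.
# Expert counts below are assumptions for illustration.

def min_gpu_offload_batch(total_experts: int, active_experts: int) -> int:
    return 32 * total_experts // active_experts

# Assuming GLM-4.5 routes over 160 experts with 8 active per token:
threshold = min_gpu_offload_batch(160, 8)
print(threshold)        # 640
print(512 > threshold)  # False -> -ub 512 stays on the hybrid CPU+GPU path
print(4096 > threshold) # True  -> -ub 4096 triggers GPU offload prompt processing
```

That would explain why -ub 512 forces the slow hybrid codepath while -ub 4096 clears the bar comfortably.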

The funny part is that I can hear the difference between the two prompt processing codepaths lol. With hybrid CPU+GPU prompt processing, the CPU fans spin up as soon as the prompt is received, while in the GPU offload codepath, the CPU fans only spin up once token generation starts.

> disables mmap, which results in RAM OOMkill from the weights not fitting fully in system RAM.

Wait, you have 128GB RAM + 96GB VRAM, so you should be able to use -rtr with models up to ~200GB in file size without OOMing, unless that is a Windows thing? Assuming you're offloading enough additional layers onto the GPU, at least on Linux the CPU buffer is reduced to allow this.

Yeah, don't increase swap or pagefile, as that will thrash your SSD and wear it out fast. mmap() is fine as it is read-only, but it will become the bottleneck, and there isn't much need to benchmark a model that is running with any weights on SSD, as it would probably throw everything else off. I saw one person try to llama-sweep-bench to like 128k overnight with half the model on SSD and it was just like 4 tok/sec TG the whole way, IIRC...

> The funny part is that I can hear the difference between the two prompt processing codepaths lol. With hybrid CPU+GPU prompt processing, the CPU fans spin up as soon as the prompt is received, while in the GPU offload codepath, the CPU fans only spin up once token generation starts.

Yeah, it is interesting to listen to a rig doing llama-sweep-bench, as it totally has that rhythmic effect due to the different requirements of TG and PP! Good observation!

Gonna go play with DeepSeek-V3.1 now today haha

Yeah, when I got this rig I figured 128gb RAM was adequate and disabled the pagefile entirely to avoid thrashing (my previous PC was 16gb DDR4 and 8gb VRAM...). Little did I know... Anyhow, I think Windows is just very suboptimal at RAM management in general: with mmap disabled, it straight up OOMs if the model is bigger than system RAM alone.

In other news, I got some reports of Xeon 5 engineering samples being somewhat performant in pure-CPU inference in sglang, at about the expected 1/2-1/3 of the speed of the dual Xeon 6980Ps lmsys benchmarked. No word yet on ik_llama.cpp or ktransformers CPU+GPU inference performance. Considering building a similar rig.

How well does ik_llama.cpp do in dual socket CPU + GPU configurations?

@Doctor-Shotgun

> RAM OOM if the model is bigger than system RAM alone.

Ahh I see, yeah on Linux the CPU0 buffer becomes smaller when some is loaded onto VRAM so it can be possible to not OOM.

> pure CPU inference in sglang

I did a small writeup of some of my initial thoughts on that sglang paper. One of the guys who worked on it also did the llama.cpp AMX implementation for Intel Xeon CPUs with the Sapphire Rapids AMX CPU flags. https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826/422?u=ubergarm

> How well does ik_llama.cpp do in dual socket CPU + GPU configurations?

tl;dr: PP scales fairly well across two sockets / NUMA nodes by adding more CPU cores. However, TG does not scale well, given the cross-NUMA-node memory bandwidth penalty and non-optimized code. There's been quite a bit written about it, plus some experimental "data-parallel" implementations, e.g. the ktransformers USE_NUMA=1 compile flag and even an experimental mainline lcpp branch using manually allocated huge pages.

I've heard anecdotally that mainline lcpp does scale a bit better now more recently for TG on multiple NUMA nodes, but I've not tried benchmarking it myself lately. My old benchmarks are on the mainline lcpp discussion about it for the 6980P rig.

You can run two separate instances, one on each socket, then combine them upstream behind a reverse proxy or something, if you would benefit from concurrent/parallel inference scaling horizontally haha...

I found the sglang performance to be rather impressive for pure CPU at ~540 T/s prefill and ~15 T/s generation. It's interesting to see your results on ik_llama.cpp with access to a 6980P; ~7.5 T/s TG would certainly be respectable for a single socket. I suppose the main draw would be NUMA parallelism to achieve a higher decoding rate on a single request, rather than running separate instances (certainly I'm not going to be hosting for multiple users at these speeds haha!). It looks like ktransformers uses a NUMA-mirror type setup that requires double the RAM, presumably allocating a copy of the weights for each NUMA node.

I've heard horror stories about dual socket rigs underperforming single socket setups when the software isn't written to take advantage of them, so I'm still debating whether to bother with the more costly dual socket setup. It's kind of an investment in hoping the software ecosystem adopts it more lol.

The Xeon 5 ES CPUs have half the cores of the 6980P and only 8 channels instead of 12. Reported speed is 262 T/s pp and 7.6 T/s tg on dual socket in sglang with the fp8 deepseek model, which is a bit slower than int8.

Yeah, for now at least, with ik/llama.cpp the strategy is to get as much memory bandwidth as possible in a single NUMA node. So with AMD you can configure NPS0, but it still takes a bit of a hit compared to NUMA-optimized software, as you mention.

Wendell has a recent video describing an interesting build to max out TG for big MoEs running partially on CPU: https://www.youtube.com/watch?v=bOxAdRfTpJg can't wait until he gets some benchmarks!

There is some older discussion of this multi-NUMA topic in a couple of older threads.

ik himself seems to have some interest in looking into NUMA optimizations, but would need access to a larger rig to do development and testing for that situation.

Regarding Intel Xeon, I'm not convinced that inference engines can take much advantage of the AMX support, and so far ik's avx2/avx_vnni 512 bit stuff that just got merged in is likely faster than the mainline lcpp AMX support anyway for CPU PP.

Anyway lots of fun stuff happening, have fun designing your new inference rig!

I'm currently working on https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF

Ah, the new DeepSeek lol, hearing mixed reviews so far. Unfortunately it needs quite a terrible bitrate to run on my setup.

Intervitens did try merging in some pending sglang PRs and was able to get a decent bump to performance with NUMA parallelism and AMX on the dual Xeon 5 engineering samples with 16 channels of DDR5-5600:

  • 530tps PP/10tps TG on 32BA/355B GLM-4.5 int8
  • 480tps PP/8.9tps TG on 37BA/671B Deepseek v3 int8

Considering mimicking this setup and slapping the RTX pro 6000 in, although still unsure on what the max performance with CPU+GPU would look like.

> max performance with CPU+GPU would look like.

The theoretical max is not hard to calculate knowing the active parameter size and memory bandwidth. So assume your 16 channels of DDR5-5600 give you, say, maybe 1024 GiB/s aggregate bandwidth just for example's sake, assuming the best case across multiple sockets (you'd have to run mlc, Intel's Memory Latency Checker, for actual measurements).

  • 32B active parameters * 8bpw = 32GiB active weights per token generated
  • 1024 GiB/s / 32GiB = 32 tok/sec theoretical max for CPU only

Using a single RTX PRO 6000 you would offload attn_.* ffn_(gate|down|up)_exp and ffn_(gate|down|up)_shexp which are always active and as much additional routed exps as possible while leaving enough head-room for larger batch sizes and context.

So assuming you can offload say 8GiB active onto the GPU that would always run faster than the CPU leaving (32 - 8 = 24 GiB) remaining active on CPU. So theoretical max is now 1024 / 24 = ~42 tok/sec.

Now, given they are seeing 10 tok/sec TG, it seems like they are achieving about 33% of theoretical max (similar to what I measured on the Intel Xeon 6980P in my own testing; it was not able to fully saturate theoretical max at all. My local AMD 9950X is within 90+% of max, and the AMD Epyc I have tested is about 60% of theoretical max). Likely the issue is that NPS1 (or SNC=Disable on Intel) is still unable to do a great job of feeding the chiplets on a single CPU socket...

Anyway, at 33% effective TG speed on CPU with 24 GiB of active weights, you might expect to see as much as ~14 tok/sec on GLM-4.5 at 8bpw.
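The back-of-the-envelope estimate above can be written out as a tiny script. All inputs are the same illustrative assumptions from the discussion (1024 GiB/s aggregate bandwidth, 32 GiB active weights, 8 GiB offloaded to VRAM, ~33% achieved efficiency), not measured values:

```python
# Back-of-the-envelope TG estimate for hybrid CPU+GPU inference.
# All inputs are illustrative assumptions, not measurements.

bandwidth_gib_s = 1024.0  # assumed best-case aggregate RAM bandwidth
active_gib = 32.0         # 32B active params * 8bpw ~= 32 GiB read per token
offloaded_gib = 8.0       # assumed always-active weights held in VRAM
efficiency = 0.33         # fraction of theoretical bandwidth actually achieved

cpu_only_max = bandwidth_gib_s / active_gib
hybrid_max = bandwidth_gib_s / (active_gib - offloaded_gib)
realistic = hybrid_max * efficiency

print(f"CPU-only theoretical max: {cpu_only_max:.1f} tok/s")  # 32.0
print(f"Hybrid theoretical max:   {hybrid_max:.1f} tok/s")    # 42.7
print(f"At ~33% efficiency:       {realistic:.1f} tok/s")     # 14.1
```

Swapping in a measured bandwidth figure from mlc would tighten the estimate considerably.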

Now take into consideration that the int8 dtype is not the same as Q8_0 (8.5bpw). In fact, because int8 is a dtype and not a quantization, it likely has more rounding error, so likely "worse quality" or higher perplexity despite being a similar number of bits. So if you use my GLM-4.5 iq5_k, which is about 6bpw, you could possibly match that speed on ik_llama.cpp, maybe...

Anyway, just spitballing, you'll have to benchmark every possible configuration and do your best to setup an apples-apples comparison given the differences in quants/dtypes/inference engines etc.

Fun project you got there!

@ubergarm

> Using a single RTX PRO 6000 you would offload attn_.* ffn_(gate|down|up)_exp and ffn_(gate|down|up)_shexp which are always active and as much additional routed exps as possible while leaving enough head-room for larger batch sizes and context.

Would this -ot recommendation apply for GLM, DeepSeek, or both? Can you please direct me to a discussion where I can learn more?
I am able to get around 500 t/s PP and 17 t/s TG with your DeepSeek-R1-0528-IQ4_KS_R4 on 9355 Epyc + RTX 6000 with this config:

```
./llama-sweep-bench \
    --no-mmap \
    -mla 3 -fa -fmoe \
    -amb 512 -b 8192 -ub 8192 \
    -ctk f16 -c 131072 \
    -ngl 999 \
    -ot "blk\.[3-6]\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    --parallel 1 \
    --threads 16 \
    --threads-batch 24 \
    --warmup-batch
```

I found that offloading more layers (3 vs 8) has a barely measurable effect, while increasing -ub has quite a significant effect on PP. With -ub 8192 I can still do -c 262144 --parallel 2 with 96 GB VRAM, which is nice, but being at only 40% of theoretically achievable TG t/s is not nice at all 😂.

This is on Ubuntu VM inside Proxmox, BTW, getting better results than Windows on bare metal on the same machine.
Anyway, if you can direct me to a good place discussing offload strategies for RTX 6000, I would appreciate it. Thank you.

@sousekd

That is quite good performance already. Your command looks good specifically for R1-0528! Probably about the best you can get, and given you're not OOMing with such large batch sizes and have a single GPU this is about best speed to expect. Any bottleneck in TG is now likely related to your regular RAM bandwidth.

You might be able to increase to -amb 1024, which may give some benefit, as others have mentioned elsewhere. If you were running GLM you'd only change to -ot "blk\.[1-6]\.ffn_.*=CUDA0".

Right, offloading 5 more routed-expert layers isn't getting you much, given that for each token only 8 routed experts are used. You're already offloading the tensors that are always active for every token.

> Would this -ot recommendation apply for GLM, DeepSeek, or both?

Right, the -ot override-tensors argument can be confusing as different models have slightly different architectures and need slightly different regular expressions. I wrote some more about the differences here: https://huggingface.co/ubergarm/Kimi-K2-Instruct-0905-GGUF/discussions/1#68bc6a6f9e901a89f5d2dc24

The basic idea is that you only want to put routed experts onto CPU/RAM. They start at different layers depending on the MoE model arch e.g. DeepSeek/GLM/Kimi-K2/Qwen etc. You can look inside the model with ./gguf-py/scripts/gguf_dump.py or sometimes huggingface will actually work and you can see them in your browser e.g.: https://huggingface.co/ubergarm/Kimi-K2-Instruct-0905-GGUF?show_file_info=smol-IQ2_KS%2FKimi-K2-Instruct-0905-smol-IQ2_KS-00001-of-00006.gguf

> but being on only 40% of theoretically achievable TG t/s is not nice at all 😂.

I'm not sure what you're saying here? What is 40% of theoretically achievable TG? Do you mean that when you use -ub 8192 -b 8192 -c 262144 --parallel 2 you have to offload less routed experts which slows down TG?

Sorry if you already know all this and I'm repeating myself haha.. Hope that helps a bit! Feel free to ask any specific example questions and I'll do my best!

Thank you @ubergarm for a great explanation. I was confused by:

> Using a single RTX PRO 6000 you would offload attn_.* ffn_(gate|down|up)_exp and ffn_(gate|down|up)_shexp which are always active and as much additional routed exps as possible.

...thinking I missed something important lol. Now I understand (and please correct me if I am wrong) that this is basically what -ngl 999 -ot exps=CPU does without the need to specify -ot ".*attn_.*=CUDA0" -ot ".*ffn_(gate|down|up)_(exp|shexp).*=CUDA0" explicitly.

And then with this:

> So theoretical max is now 1024 / 24 = ~42 tok/sec. Now given they are seeing 10 tok/sec TG seems like they are achieving about 33% of theoretical max (which is similar to what I measured on the Intel Xeon 6980P in my own testing, it was not able to fully saturate theoretical max at all... My local AMD 9950X is within 90+% of max. AMD Epyc I have tested is about 60% of theoretical max).

I just messed up my math with that 40%, or... not? 😀

> I'm not sure what you're saying here? What is 40% of theoretically achievable TG? Do you mean that when you use -ub 8192 -b 8192 -c 262144 --parallel 2 you have to offload less routed experts which slows down TG?

Yeah, the thing is: offloading more experts (I was able to do -ot "blk\.([3-9]|10|11|12)\.ffn_.*=CUDA0", I believe with -ub 2048 or -ub 4096) did not bring a significant increase in TG t/s. Maybe it went up to 19 from 17, but PP t/s went down from 520 to 200 or less, so it was not worth it.

Again, I just got confused. Thank you for all the info.

Right, it is not obvious that all those important tensors I mentioned above are placed onto VRAM implicitly with the -ot command you're using. I've described elsewhere how I imagine it in my brain, as a single set of commands that work together in order:

```
-ngl 999                         # <--- offload *everything* to GPUs
-ot "blk\.[3-6]\.ffn_.*=CUDA0" \ # <--- offload these four layers including routed experts to GPU [strange given we already said that above]
-ot exps=CPU \                   # <--- just kidding, override all routed experts *that are not explicitly overridden already above* to CPU/RAM
```

The final result is that attn, the first 3 dense layers (in this case), the shared experts, and just those 4 routed-expert layers end up in VRAM. All the remaining routed experts go to CPU/RAM, where you want them.
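Here's a toy Python model of how I understand that override order to work (first matching -ot pattern wins, and -ngl 999 puts everything not overridden on the GPU); the tensor names are made-up examples, and real -ot semantics may differ in detail:

```python
import re

# Toy model of --override-tensor: patterns are checked in the order given,
# and the first regex that matches a tensor name decides its placement.
# With -ngl 999, anything not overridden defaults to the GPU.
overrides = [
    (r"blk\.[3-6]\.ffn_.*", "CUDA0"),  # these routed-expert layers stay on GPU
    (r"exps", "CPU"),                  # all remaining routed experts go to CPU
]

def place(tensor_name: str) -> str:
    for pattern, device in overrides:
        if re.search(pattern, tensor_name):
            return device
    return "CUDA0"  # default from -ngl 999

# Hypothetical tensor names for illustration:
print(place("blk.4.ffn_gate_exps.weight"))  # CUDA0 (explicitly kept on GPU)
print(place("blk.40.ffn_up_exps.weight"))   # CPU   (caught by exps=CPU)
print(place("blk.40.attn_q.weight"))        # CUDA0 (attention, never overridden)
print(place("blk.2.ffn_gate.weight"))       # CUDA0 (dense layer, no "exps")
```

The ordering is the whole trick: the blk.[3-6] pattern has to come before exps=CPU, or those four layers would get swept onto CPU with everything else.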

To confuse matters, there is some new option like n-cpu-moe or something and I don't know what it does; I prefer this original method after I grok'd it haha...

:) yeah, it is a mess. And then someone asks "And why don't you just use this free app I got on my phone instead?"
