4x24GB VRAM Spread Results.

#2
by phakio - opened

My GPU fans ramped up in excitement when they heard me mutter to myself "another ubergarm quant has arrived"...

3x3090 // 1x4090 VRAM Only Test

I didn't realize this was a ~100b dense model until I was halfway through downloading it, so the lower token generation speed is entirely expected.

I had it create a simple dynamic landing HTML page; it generated a little under 2k tokens at a speed of 10.5 t/s. Not bad! I think the 4-bit quant offers a good enough perplexity tradeoff to enable full GPU offload, so I'll play more with this model in the coming days.

[image]

[image]

As always thanks again! I'm excited for the new deepseek model quants once support is added to llama.cpp!

@phakio

sweeet! Great seein' you here for the new release! Yeah I tried to size this one to work well in ~96GB VRAM full offload with enough context to be useful without sacrificing much perplexity (kept attn just a touch bigger at ~6.7BPW also).

If you're not already, when running fully offloaded you can set -t 1, as you're not using CPU threads; this can give a few percent boost due to less synchronization overhead.

Also if you want to experiment, the new ik_llama.cpp split mode -sm graph "tensor parallel" has given me big boosts when using exactly 2x GPUs (or 2x GPUs plus CPU hybrid). I'm not sure it is working perfectly yet for 4x GPUs, but it's easy to test.

Just did a post on r/LocalLLaMA with some more info showing ~40% increase in tok/sec across the board testing with 2x CUDA GPUs on Devstral-Small

https://www.reddit.com/r/LocalLLaMA/comments/1pj9r93/now_40_faster_ik_llamacpp_sm_graph_on_2x_cuda_gpus/

UPDATE

ik added a feature to support -sm graph with more than 2 GPUs, showing promising early results here if you want to test: https://github.com/ikawrakow/ik_llama.cpp/pull/1051

compiling now; I have a bad habit of using weeks-old builds of ik_llama (if it ain't broke, don't fix it!)
I didn't realize tensor parallel support was finally starting to be implemented. I'll get some numbers and results soon in an edit to this post. The results from the PR thread look promising!

edit
unfortunately I'm not seeing the expected improvements; in fact it's performing worse, and that's with both the smaller 24b and this one, the 123b. I noticed one of my 3090s has significantly higher usage than the other cards, and as I understand it that isn't the expected outcome; I thought each card was supposed to be utilized more evenly. See photos for details. I'm going to chalk this up to the parallelism needing the same GPU architecture; the slight variance between the one 4090 and three 3090s may be making it perform worse than expected.

[image]

[image]

/home/phone/Documents/ik_llama_graph_pull/ik_llama.cpp/build/bin/llama-server \
    --model /run/media/phone/SharedData/LocalModelsBIG/Devstral-2-123B-Instruct-2512-IQ4_KSS.gguf \
    --alias ubergarm/Devstral-2-123B-Instruct-2512-GGUF \
    --ctx-size  20000 \
    -ctk q8_0 -ctv q8_0 \
    -ngl 99 \
    -sm graph \
    --max-gpu 4 \
    --host 0.0.0.0 \
    --port 8081 \
    --jinja

@phakio

-sm graph \
--max-gpu 4 \

Thanks for testing, don't lose heart yet! My impression after reading PR1051 more closely is that you actually want to set --max-gpu to something LESS than the number of GPUs in your system. Your best bet to max out TG is probably to use exactly 2 or 3. You might be able to play some games with -mg to set your 4090 as the "main gpu" and maybe some other tricks to re-order the GPUs, but that is beyond anything I've tested myself. Read here for more details: https://github.com/ikawrakow/ik_llama.cpp/pull/1051#issuecomment-3642668828

So give this a try:

/home/phone/Documents/ik_llama_graph_pull/ik_llama.cpp/build/bin/llama-server \
    --model /run/media/phone/SharedData/LocalModelsBIG/Devstral-2-123B-Instruct-2512-IQ4_KSS.gguf \
    --alias ubergarm/Devstral-2-123B-Instruct-2512-GGUF \
    --ctx-size  20000 \
    -ctk q8_0 -ctv q8_0 \
    -ngl 99 \
    -sm graph \
    --max-gpu 2 \
    --host 0.0.0.0 \
    --port 8081 \
    --jinja \
    --threads 1

UPDATE plus there were a couple more PRs merged overnight that may improve multi-GPU performance as well (essentially leaving P2P enabled instead of adding overhead to enable/disable it)

UPDATE 2 I'm trying to test it myself on exactly 2x GPUs but hit a snag. Anyway, exciting times haha...

I was trying various configs; the one above is just the latest I tried. I'll rebuild and try again today since I have some free time. It is indeed annoying seeing one GPU do so much work while the others slack! lol

edit: the pull request has been merged, and I am compiling mainline ik_llama right now. The following are my build options... I've used these for so long, are there any updated arguments to use nowadays?

cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DCMAKE_CUDA_ARCHITECTURES="86;89"

results of the smaller 24b version as it's quicker to load after changing config:

before: (default config, no new graph parameters)

[image]

after: (max-gpu: 3)

[image]

so there is a change; I just didn't notice it as much as I thought... looking at the graphs from the PR it seems to really benefit token generation over long contexts, not so much initial generation speed as I expected!
(sidenote: the -sm graph and --max-gpu options seem to help bigger models partially offloaded to system RAM as well. I noticed Kimi K2 Thinking was a little snappier and more stable this morning after applying the parameters!)


second edit: setting the main GPU to a 3090 rather than the 4090 actually provided the most stable and fastest token gen speeds; same prompt as above, 60 t/s generation... I'll keep playing around.


final edit for now:
the 123b parameter model started generating at 17 t/s, which is a great increase! However it then fell off sharply to 11 t/s, as the following chart shows. Again, GPU 2 seems to be the bad apple; this looks like it could be a hardware issue at this point.

[image]

Great seems like you're seeing some uplift!

are there any updated arguments to use nowadays?

That is pretty much the same; you don't have to explicitly specify -DGGML_SCHED_MAX_COPIES=1 anymore as it is the default, but it hurts nothing (I still leave it on hah). For some KT quants I've seen improvement using explicit -DGGML_CUDA_F16=ON but haven't looked into it and don't think it matters for most quants.

looking at the graphs from the PR it seems to really benefit token generation over long contexts, not so much initial generation speed as I expected!

totes! if you run llama-sweep-bench it makes the differences more obvious across the entire kv-cache depth e.g.

/home/phone/Documents/ik_llama_graph_pull/ik_llama.cpp/build/bin/llama-sweep-bench \
    --model /run/media/phone/SharedData/LocalModelsBIG/Devstral-2-123B-Instruct-2512-IQ4_KSS.gguf \
    --ctx-size   16896 \
    -ctk q8_0 -ctv q8_0 \
    -ngl 99 \
    -sm graph \
    --max-gpu 2 \
    --threads 1

(sidenote: the -sm graph and --max-gpu options seem to help bigger models partially offloaded to system RAM as well. I noticed Kimi K2 Thinking was a little snappier and more stable this morning after applying the parameters!)

Yes this can also help with hybrid CPU big MoEs especially at longer context length. More info on that here: https://github.com/ikawrakow/ik_llama.cpp/pull/1040#issuecomment-3620696408

Thanks for testing out all the options!

And just for completeness' sake, here are the results of the sweep-bench command above.

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    1.886 |   271.47 |    9.972 |    12.84 |
|   512 |    128 |    512 |    1.849 |   276.90 |   11.611 |    11.02 |
|   512 |    128 |   1024 |    1.807 |   283.35 |   11.447 |    11.18 |
|   512 |    128 |   1536 |    1.865 |   274.52 |   10.760 |    11.90 |
|   512 |    128 |   2048 |    1.663 |   307.92 |   10.392 |    12.32 |
|   512 |    128 |   2560 |    1.732 |   295.56 |   10.214 |    12.53 |
|   512 |    128 |   3072 |    1.602 |   319.60 |    9.919 |    12.90 |
|   512 |    128 |   3584 |    1.675 |   305.72 |   10.079 |    12.70 |

I... I think my setup might be a little confused; it seems to be improving as token length increases?

@phakio

Haha yeah that seems odd, but it's maybe just noise, as 512-token steps aren't too big; if you let it run all the way out to the full 16k it might show a clearer downward trend.

my approach is to take one command, vary only a single experimental value, and then graph the runs together to see the difference, e.g. run the exact same command but with --max-gpu 2 on one run and --max-gpu 3 on the other, then graph them.

if you decide to do that, I can graph it for you if you paste it in here, or I could share the python script etc.
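For anyone who wants to roll their own version of that workflow, here's a minimal sketch of the parsing side (the column layout follows the llama-sweep-bench markdown tables in this thread; the function name and the plotting suggestion are just illustrative, not the actual script):

```python
# Sketch: parse llama-sweep-bench markdown tables so runs can be graphed together.
# Assumes the 7-column layout shown elsewhere in this thread; names are illustrative.
import re

def parse_sweep_bench(text):
    """Return rows as dicts with PP, TG, N_KV, S_PP, S_TG from a sweep-bench table."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip('|').split('|')]
        # Data rows have 7 cells with a numeric first cell; header/separator rows don't.
        if len(cells) == 7 and re.fullmatch(r'\d+', cells[0]):
            pp, tg, n_kv, _t_pp, s_pp, _t_tg, s_tg = cells
            rows.append({'PP': int(pp), 'TG': int(tg), 'N_KV': int(n_kv),
                         'S_PP': float(s_pp), 'S_TG': float(s_tg)})
    return rows

# Example: pull the TG speeds out of one run, ready to plot S_TG vs N_KV
# (e.g. with matplotlib) for two runs that differ only in --max-gpu.
sample = """\
|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    1.886 |   271.47 |    9.972 |    12.84 |
|   512 |    128 |    512 |    1.849 |   276.90 |   11.611 |    11.02 |"""
print([r['S_TG'] for r in parse_sweep_bench(sample)])  # [12.84, 11.02]
```

From there it's just one scatter/line plot per run on the same axes.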

Cool didn't know you did dense models.

Unless you need llama.cpp specific features, you're probably better off using exllamav3 with full cuda/vram

172.17.0.2:43166 - "GET /v1/models HTTP/1.1" 200
172.17.0.2:43182 - "POST /v1/chat/completions HTTP/1.1" 200
Received chat completion streaming request 40c2da9978fd4db1a47b469279713dea
Finished chat completion streaming request 40c2da9978fd4db1a47b469279713dea
Metrics (ID: 40c2da9978fd4db1a47b469279713dea): 561 tokens generated in 26.02 seconds (Queue: 0.0 s,
Process: 10388 cached tokens and 503 new tokens at 303.01 T/s, Generate: 23.03 T/s, Context: 10891 tokens)

(Random last gen log in the console) - that's the 4.0bpw on 3090s.

P.S. holy shit, we're getting tensor parallel in ik_llama!

17 t/s at zero context. Nice free performance boost. Also cool that it manages to detect / select the best peer connections (nvlinked pairs)

@gghfez

Cool didn't know you did dense models.

I'll try anything once! lol... dense models are not quite as exciting, as the best recipes generally keep all the tensors at a similar level throughout, with just a little bump here or there.

you're probably better off using exllamav3 with full cuda/vram

Right, in general if a model/quant can fit into full GPU VRAM offload it makes sense to check out exllamav3 EXL3 quants, and also stuff like vLLM (especially for multi-user / batched throughput case) assuming the quants available are good for your hardware.

P.S. holy shit, we're getting tensor parallel in ik_llama!

Yes it definitely helps with exactly 2x GPUs and some more recent PRs are improving it for more as well. This is for the 4.709 BPW available in this repo:

[image: sweep-bench-Devstral-2-123B-Instruct-2512]

A big speedup while still working with a lot of existing GGUFs at just the right quantization. The other big win I see is that this new -sm graph "graph parallel" can still give a nice boost on hybrid CPU + multi-GPU setups, especially at longer context depths.

Yes it definitely helps with exactly 2x GPUs and some more recent PRs are improving it for more as well.
I just tested limiting to 4 GPUs (had to quant the KV cache) and got 20 t/s!
Control-vectors still work, so this is a huge win for me (probably won't bother keeping my hacked-together exl3 implementation working now!)

I'll have to try command-a as well, that's another "about 12 t/s if I want to use control-vectors" dense model.
edit:

=======================================================
Split mode 'graph' is not supported for this model
  => changing split mode to 'layer'                                            
=======================================================

I guess it depends on the model.

@gghfez

Ahh I see these are supported so far:

# src/llama.cpp line ~1726
  static bool is_model_split_supported(const llama_model & model) {
      static std::unordered_set<llm_arch> k_supported = {
          LLM_ARCH_LLAMA,
          LLM_ARCH_QWEN3MOE,
          LLM_ARCH_GLM4_MOE,
          LLM_ARCH_MISTRAL3,
      };
      auto it = k_supported.find(model.arch);
      return it != k_supported.end();
  }

If you wanted to go wild west, just add LLM_ARCH_COHERE2 or whatever and see what happens, but zero guarantees as this is all very new as you know.

Or you can just comment this out to experiment too. I did this to try with RPC backend applying -sm graph which did at least start up right, but ended up giving gibberish out of a tiny 0.6B test model.

--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -1758,7 +1758,7 @@ static bool llm_load_tensors(
             LLAMA_LOG_WARN("Split mode 'graph' is not supported for this model\n");
             LLAMA_LOG_WARN("  => changing split mode to 'layer'\n");
             LLAMA_LOG_WARN("=======================================================\n\n");
-            split_mode = LLAMA_SPLIT_MODE_LAYER;
+            // split_mode = LLAMA_SPLIT_MODE_LAYER;
         }
     }

Anyway, lots of exciting toys to try out haha...

Yeah looks like I'll have to wait (cohere models are always a pain) lol

CUDA error: an illegal memory access was encountered
  current device: 0, in function launch_mul_mat_q at /home/ai/apps/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:4122
  cudaFuncSetAttribute(mul_mat_q<type, mmq_x, 8, false>, cudaFuncAttributeMaxDynamicSharedMemorySize, shmem)
/home/ai/apps/ik_llama.cpp/ggml/src/ggml-cuda.cu:124: CUDA error

Apart from the -sm graph, is the rpc backend performance decent now?

Update:
https://github.com/ikawrakow/ik_llama.cpp/pull/1061

21.5 t/s with this!

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  1024 |    128 |      0 |    4.383 |   233.61 |    5.951 |    21.51 |

Slower prompt processing but TG is almost at exl3 levels!

Edit: Seems like nvlink helps with S_PP t/s

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  1024 |    128 |      0 |    3.189 |   321.07 |    5.577 |    22.95 |
|  1024 |    128 |   1024 |    3.195 |   320.54 |    5.668 |    22.58 |

Only difference there is I changed the CUDA_VISIBLE_DEVICES

Okay, last one (last post, on an unrelated model): llama-3.3-70b Q4_K

2x3090 nvlink

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  1024 |    128 |      0 |    1.090 |   939.66 |    4.593 |    27.87 |
|  1024 |    128 |   1024 |    1.105 |   926.44 |    4.680 |    27.35 |
|  1024 |    128 |   2048 |    1.129 |   906.93 |    4.724 |    27.10 |
|  1024 |    128 |   3072 |    1.153 |   888.05 |    4.793 |    26.70 |
|  1024 |    128 |   4096 |    1.176 |   870.60 |    4.856 |    26.36 |

2x3090 PCIe4.0 x16

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  1024 |    128 |      0 |    1.466 |   698.67 |    5.093 |    25.13 |
|  1024 |    128 |   1024 |    1.478 |   693.01 |    5.180 |    24.71 |
|  1024 |    128 |   2048 |    1.500 |   682.80 |    5.234 |    24.46 |
|  1024 |    128 |   3072 |    1.524 |   672.07 |    5.305 |    24.13 |
|  1024 |    128 |   4096 |    1.549 |   661.17 |    5.371 |    23.83 |
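Putting a rough number on that NVLink gain, here's the average of the S_TG columns from the two runs above (the values are copied straight from those tables; the little helper is just a sketch):

```python
# Sketch: average TG speedup of the NVLink run over the PCIe run,
# using the S_TG columns from the two llama-3.3-70b runs above.
nvlink_tg = [27.87, 27.35, 27.10, 26.70, 26.36]  # 2x3090 NVLink
pcie_tg   = [25.13, 24.71, 24.46, 24.13, 23.83]  # 2x3090 PCIe 4.0 x16

def pct_speedup(fast, slow):
    """Mean element-wise percent gain of `fast` over `slow`."""
    return sum(f / s - 1.0 for f, s in zip(fast, slow)) / len(fast) * 100.0

print(f"NVLink TG uplift: {pct_speedup(nvlink_tg, pcie_tg):.1f}%")  # ~10.7%
```

So roughly a 10-11% TG gain from NVLink at these shallow depths.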

We can have exllamav2 speeds!

We can have exllamav2 speeds!

Yes! I had similar speeds running max-gpu 2 and using a finetuned llama 70b. what a time it is :)

@phakio @gghfez

Another PR on the vine that is looking really good for the >2 GPU crew: https://github.com/ikawrakow/ik_llama.cpp/pull/1067

Eh? That'd be faster than vllm with -tp 4 or exllamav2 with a perfect draft model (large-2407+mistral-7b-v3) if the graphs are accurate.

compiling and testing now, thanks for the heads up!


edit - by disabling the GPU labeled "GPU 2" and running max-gpu 3, I am seeing the fastest speeds with llama 3 70b I've ever witnessed, a solid 30 t/s.
When I run all my GPUs, I notice that GPU 2 throttles itself to 450 MHz under load... I'm going to check the temps and other things. It could be thermally throttling itself instantly under load, resulting in the lower speeds I initially witnessed. I suspect it needs better cooling on the VRAM, as the core is well under throttle temps.

I couldn't get it any faster after trying for half an hour. I suspect I'm PCIe-bandwidth bound, with 3 of my GPUs on 4.0 x4.
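For context on why x4 links can bottleneck this, the theoretical one-direction PCIe bandwidth gap is easy to work out (PCIe 4.0 signals 16 GT/s per lane with 128b/130b encoding; real-world throughput is lower still once protocol overhead is counted):

```python
# Sketch: theoretical one-direction bandwidth of a PCIe 4.0 link (GB/s),
# before protocol overhead. 16 GT/s per lane, 128b/130b line encoding.
GT_PER_LANE = 16.0
ENCODING = 128.0 / 130.0  # usable bits per transferred bit

def pcie4_gbps(lanes):
    # GT/s * encoding efficiency / 8 bits-per-byte * lane count
    return GT_PER_LANE * ENCODING / 8.0 * lanes

print(f"x4:  {pcie4_gbps(4):.2f} GB/s")   # ~7.88 GB/s
print(f"x16: {pcie4_gbps(16):.2f} GB/s")  # ~31.51 GB/s
```

A 4x gap, so cards stuck on x4 risers have a lot less headroom for the cross-GPU traffic that -sm graph generates.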

thanks for checking it out y'all!

on the remote rig with 2x older-model RTX A6000s (48GB VRAM each), the new PR performs the same as main for -sm graph. So the new PR seems to boost speed for 3+ GPUs with enough PCIe bandwidth, maybe?

This rig already gets really good speed with -sm graph on main now, and strangely I have to use -cuda enable-p2p=0 otherwise dmesg prints a ton of this kinda stuff before suppressing it:

[Mon Dec 15 15:42:24 2025] AMD-Vi: IOMMU Event log restarting
[Mon Dec 15 15:42:26 2025] amd_iommu_report_page_fault: 882 callbacks suppressed
[Mon Dec 15 15:42:26 2025] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002a address=0x2a0021c0000 flags=0x0020]
[Mon Dec 15 15:42:26 2025] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002a address=0x2a0021c1000 flags=0x0020]
[Mon Dec 15 15:42:26 2025] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002a address=0x2a0021c2000 flags=0x0020]
[Mon Dec 15 15:42:26 2025] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002a address=0x2a0021c3000 flags=0x0020]
[Mon Dec 15 15:42:26 2025] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002a address=0x2a0021c4000 flags=0x0020]
[Mon Dec 15 15:42:26 2025] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002a address=0xdd139070 flags=0x0020]

[image: sweep-bench-GLM-4.5-Air-PR1067]
[image: sweep-bench-Devstral-Small-2-24B-Instruct-2512-PR1067]

EDIT: there are some more PRs coming in around this -sm graph, and I need to double check whether there is a way to get the advantages while also using -ot for CPU offload, as that was turned off recently otherwise. It's moving so fast it's hard to keep track haha...

https://github.com/ikawrakow/ik_llama.cpp/pull/1069

and strangely I have to use -cuda enable-p2p=0 otherwise dmesg prints a ton of this kinda stuff before suppressing it:

Try this during inference:

nvidia-smi dmon -s pucvmet

And see if the pcie errors increase.

[image]

That count accumulates from boot time. Ideally it should be 0, but < 5-10 / minute is acceptable.

I had the same issue with trl and tabbyAPI on a few separate rigs / instances. The causes were (separate occasions):

  1. Bad risers
  2. Bad PSU
  3. (weird one) bad 24-pin cable from PSU -> Motherboard
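Since that counter accumulates from boot, applying the 5-10/minute rule of thumb means sampling it twice during inference and computing a rate; a trivial sketch:

```python
# Sketch: convert two samples of the cumulative PCIe replay counter
# (e.g. read before and after an inference run) into an errors-per-minute rate.
def errors_per_minute(count_start, count_end, seconds_elapsed):
    return (count_end - count_start) / seconds_elapsed * 60.0

# e.g. 30 new replay errors over a 5-minute run:
print(errors_per_minute(120, 150, 300))  # 6.0 -> borderline by the rule of thumb
```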

[glm air screenshot]

I managed to get something like 56 t/s with Qwen3-235B yesterday, but there's some bug where it crashes as soon as the context exceeds whatever I set batch/ubatch to.

Oh cool, just pulled origin and saw this:

commit f667bd58b0a94a15d8ae67e98c1f7006d2d7e445
Author: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Date:   Sat Dec 13 16:58:32 2025 +0100

    Undo sync reduction (#1063)
    
    I'm finding issues for Qwen3-MoE

That'll probably fix my Qwen3 issue.
But this is going to be a pain for me later:

commit 090f354d33822001a0b0c320fa55063e8c43ee2f
* server: split server.cpp code into server/common/task/queue/context

πŸ˜”
