4x24GB VRAM Spread Results.

#2
by phakio - opened

My GPU fans ramped up in excitement when they heard me mutter to myself "another ubergarm quant has arrived"...

3x3090 // 1x4090 VRAM Only Test

I didn't realize this was a ~100b dense model until I was halfway through downloading it, so the lower token generation speed is entirely expected.

I had it create a simple dynamic landing HTML page; it generated a little under 2k tokens at a speed of 10.5 t/s. Not bad! I think the 4-bit quant offers a good enough perplexity tradeoff to enable full GPU offload, so I'll play more with this model in the coming days.

[image]

[image]

As always thanks again! I'm excited for the new deepseek model quants once support is added to llama.cpp!

@phakio

sweeet! Great seein' you here for the new release! Yeah I tried to size this one to work well in ~96GB VRAM full offload with enough context to be useful without sacrificing much perplexity (kept attn just a touch bigger at ~6.7BPW also).

If you're not already, when running fully offloaded you can set -t 1, as you're not using CPU threads; this can give a few percent boost due to less synchronization overhead.

Also if you want to experiment, the new ik_llama.cpp split mode -sm graph "tensor parallel" has given me big boosts when using exactly 2x GPUs (or 2x GPUs plus CPU hybrid). I'm not sure it is working perfectly yet for 4x GPUs, but it's easy to test.

Just did a post on r/LocalLLaMA with some more info showing ~40% increase in tok/sec across the board testing with 2x CUDA GPUs on Devstral-Small

https://www.reddit.com/r/LocalLLaMA/comments/1pj9r93/now_40_faster_ik_llamacpp_sm_graph_on_2x_cuda_gpus/

UPDATE

ik added a feature to support -sm graph with more than 2 GPUs, showing promising early results here if you want to test: https://github.com/ikawrakow/ik_llama.cpp/pull/1051

compiling now; I have a bad habit of using weeks-old builds of ik_llama (if it ain't broke, don't fix it!)
I didn't realize tensor parallel support was finally starting to be implemented. I'll get some numbers and results soon in an edit to this post. The results from the PR thread look promising!

edit
unfortunately I'm not seeing the expected improvements; in fact it's performing worse, and that's with both the smaller 24b and this one, the 123b. I noticed one of my 3090s has significantly higher usage than the other cards, and as I understand it that isn't the expected outcome; I thought each card was supposed to be utilized more evenly. See photos for details. I'm going to chalk this up to the parallelism needing the same GPU architecture; the slight variance between the one 4090 and three 3090s may be making it perform worse than expected.

[image]

[image]

/home/phone/Documents/ik_llama_graph_pull/ik_llama.cpp/build/bin/llama-server \
    --model /run/media/phone/SharedData/LocalModelsBIG/Devstral-2-123B-Instruct-2512-IQ4_KSS.gguf \
    --alias ubergarm/Devstral-2-123B-Instruct-2512-GGUF \
    --ctx-size  20000 \
    -ctk q8_0 -ctv q8_0 \
    -ngl 99 \
    -sm graph \
    --max-gpu 4 \
    --host 0.0.0.0 \
    --port 8081 \
    --jinja

@phakio

-sm graph \
--max-gpu 4 \

Thanks for testing, don't lose heart yet! My impression after reading PR1051 more closely is that you actually want to set --max-gpu to something LESS than the number of GPUs in your system. Your best bet to max out TG is probably to use exactly 2 or 3. You might be able to play some games with -mg to set your 4090 as the "main gpu" and maybe some other tricks to re-order the GPUs, but that is beyond anything I've tested myself. Read here for more details: https://github.com/ikawrakow/ik_llama.cpp/pull/1051#issuecomment-3642668828

So give this a try:

/home/phone/Documents/ik_llama_graph_pull/ik_llama.cpp/build/bin/llama-server \
    --model /run/media/phone/SharedData/LocalModelsBIG/Devstral-2-123B-Instruct-2512-IQ4_KSS.gguf \
    --alias ubergarm/Devstral-2-123B-Instruct-2512-GGUF \
    --ctx-size  20000 \
    -ctk q8_0 -ctv q8_0 \
    -ngl 99 \
    -sm graph \
    --max-gpu 2 \
    --host 0.0.0.0 \
    --port 8081 \
    --jinja \
    --threads 1

UPDATE plus there were a couple more PRs merged overnight that may improve multi-GPU performance as well (essentially leaving P2P enabled instead of adding overhead to enable/disable it)

UPDATE 2 I'm trying to test it myself on exactly 2x GPUs but hit a snag. Anyway, exciting times haha...

I was trying various configs; the one above is just the latest I tried. I'll rebuild and try again today since I have some free time. It is indeed annoying seeing one GPU do so much work while the others slack! lol

edit: the pull request has been merged, and I am compiling mainline ik_llama right now. The following are my build options... I've used these for so long, are there any updated arguments to use nowadays?

cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DCMAKE_CUDA_ARCHITECTURES="86;89"

results of the smaller 24b version as it's quicker to load after changing config:

before: (default config, no new graph parameters)

[image]

after: (max-gpu: 3)

[image]

so there is a change; I just didn't notice it as much as I thought... looking at the graphs from the PR it seems to really benefit token generation over long contexts, not so much initial generation speed as I expected!
(sidenote: the -sm graph and --max-gpu options seem to help bigger models partially offloaded to system RAM as well. I noticed Kimi K2 Thinking was a little snappier and more stable this morning after applying the parameters!)


second edit: setting the main GPU to a 3090 rather than the 4090 actually provided the most stable and fastest token gen speeds; same prompt as above, 60 t/s generation... I'll keep playing around.


final edit for now:
the 123b parameter model started generating at 17 t/s, which is a great increase! However it then fell off sharply to 11 t/s, as the following chart shows. Again, GPU 2 seems to be the bad apple; this looks like it could be a hardware issue at this point.

[image]

Great seems like you're seeing some uplift!

are there any updated arguments to use nowadays?

That is pretty much the same; you don't have to explicitly specify -DGGML_SCHED_MAX_COPIES=1 anymore as it is the default, but it hurts nothing (I still leave it on hah). For some KT quants I've seen improvement using explicit -DGGML_CUDA_F16=ON but haven't looked into it and don't think it matters for most quants.

looking at the graphs from the PR it seems to really benefit token generation over long contexts, not so much initial generation speed as I expected!

totes! if you run llama-sweep-bench it makes the differences more obvious across the entire kv-cache depth e.g.

/home/phone/Documents/ik_llama_graph_pull/ik_llama.cpp/build/bin/llama-sweep-bench \
    --model /run/media/phone/SharedData/LocalModelsBIG/Devstral-2-123B-Instruct-2512-IQ4_KSS.gguf \
    --ctx-size   16896 \
    -ctk q8_0 -ctv q8_0 \
    -ngl 99 \
    -sm graph \
    --max-gpu 2 \
    --threads 1

(sidenote: the -sm graph and --max-gpu options seem to help bigger models partially offloaded to system RAM as well. I noticed Kimi K2 Thinking was a little snappier and more stable this morning after applying the parameters!)

Yes this can also help with hybrid CPU big MoEs especially at longer context length. More info on that here: https://github.com/ikawrakow/ik_llama.cpp/pull/1040#issuecomment-3620696408

Thanks for testing out all the options!

And just for completeness' sake, here are the results of the sweep-bench command above.

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    1.886 |   271.47 |    9.972 |    12.84 |
|   512 |    128 |    512 |    1.849 |   276.90 |   11.611 |    11.02 |
|   512 |    128 |   1024 |    1.807 |   283.35 |   11.447 |    11.18 |
|   512 |    128 |   1536 |    1.865 |   274.52 |   10.760 |    11.90 |
|   512 |    128 |   2048 |    1.663 |   307.92 |   10.392 |    12.32 |
|   512 |    128 |   2560 |    1.732 |   295.56 |   10.214 |    12.53 |
|   512 |    128 |   3072 |    1.602 |   319.60 |    9.919 |    12.90 |
|   512 |    128 |   3584 |    1.675 |   305.72 |   10.079 |    12.70 |

I... I think my setup might be a little confused; it seems to be improving as token length increases?

@phakio

Haha yeah that seems odd, but it's maybe just noise, as 512-token steps aren't too big; if you let it run all the way out to the full 16k it might show a clearer downward trend.

my approach is to take one command, vary only a single experimental value, and then graph the runs together to see the difference, e.g. run the exact same command but with --max-gpu 2 on one run and --max-gpu 3 on the other, then graph them.

if you decide to do that, I can graph it for you if you paste it in here, or I could share the python script etc.
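For anyone who wants to roll their own version of that workflow, here's a minimal sketch of the parsing side (the column layout follows the llama-sweep-bench markdown tables in this thread; the function name and the plotting suggestion are just illustrative, not the actual script):

```python
# Sketch: parse llama-sweep-bench markdown tables so runs can be graphed together.
# Assumes the 7-column layout shown elsewhere in this thread; names are illustrative.
import re

def parse_sweep_bench(text):
    """Return rows as dicts with PP, TG, N_KV, S_PP, S_TG from a sweep-bench table."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip('|').split('|')]
        # Data rows have 7 cells with a numeric first cell; header/separator rows don't.
        if len(cells) == 7 and re.fullmatch(r'\d+', cells[0]):
            pp, tg, n_kv, _t_pp, s_pp, _t_tg, s_tg = cells
            rows.append({'PP': int(pp), 'TG': int(tg), 'N_KV': int(n_kv),
                         'S_PP': float(s_pp), 'S_TG': float(s_tg)})
    return rows

# Example: pull the TG speeds out of one run, ready to plot S_TG vs N_KV
# (e.g. with matplotlib) for two runs that differ only in --max-gpu.
sample = """\
|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    1.886 |   271.47 |    9.972 |    12.84 |
|   512 |    128 |    512 |    1.849 |   276.90 |   11.611 |    11.02 |"""
print([r['S_TG'] for r in parse_sweep_bench(sample)])  # [12.84, 11.02]
```

From there it's just one scatter/line plot per run on the same axes.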

Cool didn't know you did dense models.

Unless you need llama.cpp specific features, you're probably better off using exllamav3 with full cuda/vram

172.17.0.2:43166 - "GET /v1/models HTTP/1.1" 200
172.17.0.2:43182 - "POST /v1/chat/completions HTTP/1.1" 200
Received chat completion streaming request 40c2da9978fd4db1a47b469279713dea
Finished chat completion streaming request 40c2da9978fd4db1a47b469279713dea
Metrics (ID: 40c2da9978fd4db1a47b469279713dea): 561 tokens generated in 26.02 seconds (Queue: 0.0 s,
Process: 10388 cached tokens and 503 new tokens at 303.01 T/s, Generate: 23.03 T/s, Context: 10891 tokens)

(Random last gen log in the console) - that's the 4.0bpw on 3090s.

P.S. holy shit, we're getting tensor parallel in ik_llama!

17 t/s at zero context. Nice free performance boost. Also cool that it manages to detect / select the best peer connections (nvlinked pairs)

@gghfez

Cool didn't know you did dense models.

I'll try anything once! lol... dense models are not quite as exciting, as the best recipes generally keep all the tensors at a similar level throughout, with just a little bump here or there.

you're probably better off using exllamav3 with full cuda/vram

Right, in general if a model/quant can fit into full GPU VRAM offload it makes sense to check out exllamav3 EXL3 quants, and also stuff like vLLM (especially for multi-user / batched throughput case) assuming the quants available are good for your hardware.

P.S. holy shit, we're getting tensor parallel in ik_llama!

Yes it definitely helps with exactly 2x GPUs and some more recent PRs are improving it for more as well. This is for the 4.709 BPW available in this repo:

[image: sweep-bench-Devstral-2-123B-Instruct-2512]

A big speedup while still working with a lot of existing GGUFs at just the right quantization. The other big win I see is that this new -sm graph "graph parallel" can still give a nice boost on hybrid CPU + multi-GPU setups, especially at longer context depths.

Yes it definitely helps with exactly 2x GPUs and some more recent PRs are improving it for more as well.
I just tested limiting to 4 GPUs (had to quant the KV cache) and got 20 t/s!
Control-vectors still work, so this is a huge win for me (probably won't bother keeping my hacked-together exl3 implementation working now!)

I'll have to try command-a as well, that's another "about 12 t/s if I want to use control-vectors" dense model.
edit:

=======================================================
Split mode 'graph' is not supported for this model
  => changing split mode to 'layer'                                            
=======================================================

I guess it depends on the model.

@gghfez

Ahh I see these are supported so far:

# src/llama.cpp line ~1726
  static bool is_model_split_supported(const llama_model & model) {
      static std::unordered_set<llm_arch> k_supported = {
          LLM_ARCH_LLAMA,
          LLM_ARCH_QWEN3MOE,
          LLM_ARCH_GLM4_MOE,
          LLM_ARCH_MISTRAL3,
      };
      auto it = k_supported.find(model.arch);
      return it != k_supported.end();
  }

If you wanted to go wild west, just add LLM_ARCH_COHERE2 or whatever and see what happens, but zero guarantees as this is all very new as you know.

Or you can just comment this out to experiment too. I did this to try with RPC backend applying -sm graph which did at least start up right, but ended up giving gibberish out of a tiny 0.6B test model.

--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -1758,7 +1758,7 @@ static bool llm_load_tensors(
             LLAMA_LOG_WARN("Split mode 'graph' is not supported for this model\n");
             LLAMA_LOG_WARN("  => changing split mode to 'layer'\n");
             LLAMA_LOG_WARN("=======================================================\n\n");
-            split_mode = LLAMA_SPLIT_MODE_LAYER;
+            // split_mode = LLAMA_SPLIT_MODE_LAYER;
         }
     }

Anyway, lots of exciting toys to try out haha...

Yeah looks like I'll have to wait (cohere models are always a pain) lol

CUDA error: an illegal memory access was encountered
  current device: 0, in function launch_mul_mat_q at /home/ai/apps/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:4122
  cudaFuncSetAttribute(mul_mat_q<type, mmq_x, 8, false>, cudaFuncAttributeMaxDynamicSharedMemorySize, shmem)
/home/ai/apps/ik_llama.cpp/ggml/src/ggml-cuda.cu:124: CUDA error

Apart from the -sm graph, is the rpc backend performance decent now?

Update:
https://github.com/ikawrakow/ik_llama.cpp/pull/1061

21.5 t/s with this!

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  1024 |    128 |      0 |    4.383 |   233.61 |    5.951 |    21.51 |

Slower prompt processing but TG is almost at exl3 levels!

Edit: Seems like nvlink helps with S_PP t/s

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  1024 |    128 |      0 |    3.189 |   321.07 |    5.577 |    22.95 |
|  1024 |    128 |   1024 |    3.195 |   320.54 |    5.668 |    22.58 |

Only difference there is I changed the CUDA_VISIBLE_DEVICES

Okay, last one (last post, on an unrelated model): llama-3.3-70b Q4_K

2x3090 nvlink

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  1024 |    128 |      0 |    1.090 |   939.66 |    4.593 |    27.87 |
|  1024 |    128 |   1024 |    1.105 |   926.44 |    4.680 |    27.35 |
|  1024 |    128 |   2048 |    1.129 |   906.93 |    4.724 |    27.10 |
|  1024 |    128 |   3072 |    1.153 |   888.05 |    4.793 |    26.70 |
|  1024 |    128 |   4096 |    1.176 |   870.60 |    4.856 |    26.36 |

2x3090 PCIe4.0 x16

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  1024 |    128 |      0 |    1.466 |   698.67 |    5.093 |    25.13 |
|  1024 |    128 |   1024 |    1.478 |   693.01 |    5.180 |    24.71 |
|  1024 |    128 |   2048 |    1.500 |   682.80 |    5.234 |    24.46 |
|  1024 |    128 |   3072 |    1.524 |   672.07 |    5.305 |    24.13 |
|  1024 |    128 |   4096 |    1.549 |   661.17 |    5.371 |    23.83 |
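Putting a rough number on that NVLink gain, here's the average of the S_TG columns from the two runs above (the values are copied straight from those tables; the little helper is just a sketch):

```python
# Sketch: average TG speedup of the NVLink run over the PCIe run,
# using the S_TG columns from the two llama-3.3-70b runs above.
nvlink_tg = [27.87, 27.35, 27.10, 26.70, 26.36]  # 2x3090 NVLink
pcie_tg   = [25.13, 24.71, 24.46, 24.13, 23.83]  # 2x3090 PCIe 4.0 x16

def pct_speedup(fast, slow):
    """Mean element-wise percent gain of `fast` over `slow`."""
    return sum(f / s - 1.0 for f, s in zip(fast, slow)) / len(fast) * 100.0

print(f"NVLink TG uplift: {pct_speedup(nvlink_tg, pcie_tg):.1f}%")  # ~10.7%
```

So roughly a 10-11% TG gain from NVLink at these shallow depths.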

We can have exllamav2 speeds!

We can have exllamav2 speeds!

Yes! I had similar speeds running max-gpu 2 and using a finetuned llama 70b. what a time it is :)

@phakio @gghfez

Another PR on the vine that is looking really good for the >2 GPU crew: https://github.com/ikawrakow/ik_llama.cpp/pull/1067

Eh? That'd be faster than vllm with -tp 4 or exllamav2 with a perfect draft model (large-2407+mistral-7b-v3) if the graphs are accurate.

compiling and testing now, thanks for the heads up!


edit - by disabling the GPU labeled "GPU 2" and running max-gpu 3, I am seeing the fastest speeds with llama 3 70b I've ever witnessed, a solid 30 t/s.
When I run all my GPUs, I notice that GPU 2 throttles itself to 450 MHz under load... I'm going to check the temps and other things. It could be thermally throttling itself instantly under load, resulting in the lower speeds I initially witnessed. I suspect it needs better cooling on the VRAM, as the core is well under throttle temps.

I couldn't get it any faster after trying for half an hour. I suspect I'm PCIe-bandwidth bound, with 3 of my GPUs on 4.0 x4.
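For context on why x4 links can bottleneck this, the theoretical one-direction PCIe bandwidth gap is easy to work out (PCIe 4.0 signals 16 GT/s per lane with 128b/130b encoding; real-world throughput is lower still once protocol overhead is counted):

```python
# Sketch: theoretical one-direction bandwidth of a PCIe 4.0 link (GB/s),
# before protocol overhead. 16 GT/s per lane, 128b/130b line encoding.
GT_PER_LANE = 16.0
ENCODING = 128.0 / 130.0  # usable bits per transferred bit

def pcie4_gbps(lanes):
    # GT/s * encoding efficiency / 8 bits-per-byte * lane count
    return GT_PER_LANE * ENCODING / 8.0 * lanes

print(f"x4:  {pcie4_gbps(4):.2f} GB/s")   # ~7.88 GB/s
print(f"x16: {pcie4_gbps(16):.2f} GB/s")  # ~31.51 GB/s
```

A 4x gap, so cards stuck on x4 risers have a lot less headroom for the cross-GPU traffic that -sm graph generates.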

thanks for checking it out y'all!

on the remote rig with 2x older-model RTX A6000s (48GB VRAM each), the new PR performs the same as main for -sm graph. So the new PR seems to boost speed for 3+ GPUs with enough PCIe bandwidth, maybe?

This rig already gets really good speed with -sm graph on main now, and strangely I have to use -cuda enable-p2p=0 otherwise dmesg prints a ton of this kinda stuff before suppressing it:

[Mon Dec 15 15:42:24 2025] AMD-Vi: IOMMU Event log restarting
[Mon Dec 15 15:42:26 2025] amd_iommu_report_page_fault: 882 callbacks suppressed
[Mon Dec 15 15:42:26 2025] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002a address=0x2a0021c0000 flags=0x0020]
[Mon Dec 15 15:42:26 2025] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002a address=0x2a0021c1000 flags=0x0020]
[Mon Dec 15 15:42:26 2025] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002a address=0x2a0021c2000 flags=0x0020]
[Mon Dec 15 15:42:26 2025] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002a address=0x2a0021c3000 flags=0x0020]
[Mon Dec 15 15:42:26 2025] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002a address=0x2a0021c4000 flags=0x0020]
[Mon Dec 15 15:42:26 2025] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002a address=0xdd139070 flags=0x0020]

[image: sweep-bench-GLM-4.5-Air-PR1067]
[image: sweep-bench-Devstral-Small-2-24B-Instruct-2512-PR1067]

EDIT: there are some more PRs coming in around this -sm graph, and I need to double check whether there is a way to get the advantages while also using -ot for CPU offload, as that was turned off recently otherwise. It's moving so fast it's hard to keep track haha...

https://github.com/ikawrakow/ik_llama.cpp/pull/1069

and strangely I have to use -cuda enable-p2p=0 otherwise dmesg prints a ton of this kinda stuff before suppressing it:

Try this during inference:

nvidia-smi dmon -s pucvmet

And see if the pcie errors increase.

[image]

That count accumulates from boot time. Ideally it should be 0, but < 5-10 / minute is acceptable.

I had the same issue with trl and tabbyAPI on a few separate rigs / instances. The causes were (separate occasions):

  1. Bad risers
  2. Bad PSU
  3. (weird one) bad 24-pin cable from PSU -> Motherboard
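Since that counter accumulates from boot, applying the 5-10/minute rule of thumb means sampling it twice during inference and computing a rate; a trivial sketch:

```python
# Sketch: convert two samples of the cumulative PCIe replay counter
# (e.g. read before and after an inference run) into an errors-per-minute rate.
def errors_per_minute(count_start, count_end, seconds_elapsed):
    return (count_end - count_start) / seconds_elapsed * 60.0

# e.g. 30 new replay errors over a 5-minute run:
print(errors_per_minute(120, 150, 300))  # 6.0 -> borderline by the rule of thumb
```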

[glm air screenshot]

I managed to get something like 56 t/s with Qwen3-235B yesterday, but there's some bug where it crashes as soon as the context exceeds whatever I set batch/ubatch to.

Oh cool, just pulled origin and saw this:

commit f667bd58b0a94a15d8ae67e98c1f7006d2d7e445
Author: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Date:   Sat Dec 13 16:58:32 2025 +0100

    Undo sync reduction (#1063)
    
    I'm finding issues for Qwen3-MoE

That'll probably fix my Qwen3 issue.
But this is going to be a pain for me later:

commit 090f354d33822001a0b0c320fa55063e8c43ee2f
* server: split server.cpp code into server/common/task/queue/context

πŸ˜”
