SOLVED - Abysmal performance on 1x24GB 3090 Ti + 48GB RAM

#6
by David337 - opened

Hi, I'm getting abysmal performance of ~1.1 t/s on a 24GB 3090 Ti + 48GB RAM running llama.cpp, even after trying smaller quants like UD-Q3_K_XL and UD-IQ3_XXS. The model is stored on a fast NVMe, it fits comfortably in memory even with --no-mmap, and I've tried smaller context sizes like 4096...
Unsloth's Qwen3.5 docs mention magical numbers like 25+ t/s even for the bigger A397B at Q4 on a single 24GB GPU + RAM... so what am I doing wrong here?

my params:

IMAGE="ghcr.io/ggml-org/llama.cpp:server-cuda"
CACHE_DIR="${HOME}/.cache/llama.cpp"

podman run \
  --rm \
  --name llama-server \
  --replace \
  -it \
  --network host \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  -v "${CACHE_DIR}:/root/.cache/llama.cpp:Z" \
  "${IMAGE}" \
  -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
  --jinja \
  --threads -1 \
  --ctx-size 4096 \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0.0 \
  --top-k 20 \
  --presence-penalty 1.5 \
  --repeat-penalty 1.0 \
  --chat-template-kwargs "{\"enable_thinking\": true}" \
  --port 3000 \
  --no-prefill-assistant \
  --host 0.0.0.0 
David337 changed discussion title from Performance on 3090 Ti + 48GB RAM to Abysmal performance on 3090 Ti + 48GB RAM
David337 changed discussion title from Abysmal performance on 3090 Ti + 48GB RAM to Abysmal performance on 1x24GB 3090 Ti + 48GB RAM

As for offloaded MoE models, try offloading all the expert layers to the CPU while keeping the rest loaded on the GPU.

Apply my config to your setup

I have 1x16GB 4070 Ti Super + 64GB DDR4 RAM
~220 pp / ~15 tg

rem The -ot line is the most important one: it keeps the MoE expert tensors on the CPU.
"%~dp0llama-server.exe" ^
-m E:\qwen\qwen3.5-122B-A10B\UD-Q3_K_XL\Qwen3.5-122B-A10B-UD-Q3_K_XL-00001-of-00003.gguf ^
--mmproj E:\qwen\qwen3.5-122B-A10B\UD-Q3_K_XL\mmproj-BF16.gguf ^
--n-gpu-layers 999 ^
-ot ".ffn_.*_exps.=CPU" ^
--ctx-size 262144 ^
--threads 8 ^
--threads-batch 8 ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
--batch-size 2048 ^
--ubatch-size 1024 ^
--flash-attn on ^
--mlock ^
--host 0.0.0.0 ^
--port 8080 ^
--parallel 1 ^
--cont-batching
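To see what a pattern like that actually offloads, you can check the -ot regex against tensor names. The names below are typical examples of the GGUF MoE naming scheme, not read from this particular model; this is just a sketch of how the match works:

```shell
# Illustrative check of the -ot pattern: expert FFN tensors match the regex
# and go to CPU, while attention tensors don't match and stay on GPU.
# Tensor names here are typical examples, not dumped from the model.
for t in blk.0.ffn_gate_exps.weight blk.0.ffn_up_exps.weight blk.0.attn_q.weight; do
  if printf '%s\n' "$t" | grep -Eq '\.ffn_.*_exps\.'; then
    echo "$t -> CPU"
  else
    echo "$t -> GPU"
  fi
done
```

The large expert weights are what blow past 24GB of VRAM, so routing only those to system RAM while attention and shared layers stay on the GPU is what makes the hybrid setup fast.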

Thank you! I applied these params to my launch script and it has improved to about 8 t/s, roughly a 7x improvement!!

Upon further testing I think I found the culprit: it was the --threads -1 param. Removing it from my old config bumps the inference speed to about 22 t/s!
This might be related to the CPU being an Intel 14900K, which has P and E cores, so forcing -1 threads might be messing with thread scheduling?
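One way to test that hypothesis is to pass an explicit thread count and pin the process to the P-cores. This is only a sketch: the core numbering below is an assumption (on many 14900K systems the 8 P-cores expose logical CPUs 0-15, two SMT threads each), so check lscpu --all --extended for your actual topology:

```shell
# Hypothetical sketch: restrict llama-server to the P-cores (assumed CPUs 0-15)
# and use one compute thread per physical P-core instead of --threads -1.
taskset -c 0-15 ./llama-server --threads 8 <your other flags>
```

This keeps the compute threads off the slower E-cores, which otherwise drag down the all-cores run.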

David337 changed discussion title from Abysmal performance on 1x24GB 3090 Ti + 48GB RAM to SOLVED - Abysmal performance on 1x24GB 3090 Ti + 48GB RAM

For years it's been observable that llama.cpp performance follows a curve as the thread count increases: at some point, adding more threads drops performance.

Specifying -1 for the number of threads attempts to use all cores, which slows it down.

I have noticed that I can saturate the memory bandwidth of dual-channel DDR5-5200 with about 4-5 threads. You can run llama-bench with -t 1,2,3,4,5,6,7,8 and see the 'knee' in the curve; the knee sits in a different spot depending on how much is offloaded to the CPU.
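A sweep like that could look as follows. The model path and the -ngl/-ot flags are assumptions matching the configs above, and -ot in llama-bench is only available in recent builds:

```shell
# Benchmark token generation at 1-8 threads to find the knee of the curve.
./llama-bench -m Qwen3.5-122B-A10B-UD-Q3_K_XL-00001-of-00003.gguf \
  -ngl 999 -ot ".ffn_.*_exps.=CPU" \
  -t 1,2,3,4,5,6,7,8
```

The per-thread-count tg rows in the output table show where throughput stops improving.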

Good observation! 4-5 threads indeed perform better than a naive 8 or 16.

I get ~25 t/s using these on Halo Strix:

version = 1
[*]
parallel = 1
timeout = 900
threads-http = 4
cont-batching = true
no-mmap = true
b = 2048

[Qwen3.5-122B-A10B-GGUF]
ngl = 999
jinja = true
c = 32000
fa = 1
parallel = 1
cram = 0
n-predict = 15000
draft-max = 0
draft-p-min = 0.95
#load-on-startup = false
model = /my-models/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-UD-Q3_K_XL-00001-of-00003.gguf
chat-template-file = /my-models/Qwen3.5-122B-A10B-GGUF/chat_template
mm = /my-models/Qwen3.5-122B-A10B-GGUF/mmproj-BF16.gguf

397B version is ~18 t/s. 35B version is ~70 t/s.
