SOLVED - Abysmal performance on 1x24GB 3090 Ti + 48GB RAM

#6
by David337 - opened

Hi, I'm getting abysmal performance of ~1.1 t/s on a 24GB 3090 Ti + 48GB RAM running llama.cpp, even after trying smaller quants like UD-Q3_K_XL and UD-IQ3_XXS. The model is stored on a fast NVMe, it fits comfortably in memory even with --no-mmap, and I've tried smaller context sizes like 4096...
Unsloth's Qwen3.5 docs mention magical numbers like 25+ t/s even for the bigger A397B at Q4 on a single 24GB GPU + RAM... so what am I doing wrong here?

my params:

IMAGE="ghcr.io/ggml-org/llama.cpp:server-cuda"
CACHE_DIR="${HOME}/.cache/llama.cpp"

podman run \
  --rm \
  --name llama-server \
  --replace \
  -it \
  --network host \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  -v "${CACHE_DIR}:/root/.cache/llama.cpp:Z" \
  "${IMAGE}" \
  -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
  --jinja \
  --threads -1 \
  --ctx-size 4096 \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0.0 \
  --top-k 20 \
  --presence-penalty 1.5 \
  --repeat-penalty 1.0 \
  --chat-template-kwargs "{\"enable_thinking\": true}" \
  --port 3000 \
  --no-prefill-assistant \
  --host 0.0.0.0 
David337 changed discussion title from Performance on 3090 Ti + 48GB RAM to Abysmal performance on 3090 Ti + 48GB RAM
David337 changed discussion title from Abysmal performance on 3090 Ti + 48GB RAM to Abysmal performance on 1x24GB 3090 Ti + 48GB RAM

As for offloaded MoE models, try offloading all the expert layers to the CPU while keeping the rest loaded on the GPU.

Apply my config to your setup

I have 1x16GB 4070 Ti Super + 64GB DDR4 RAM
~220 pp / ~15 tg

rem The -ot line is the most important one: it keeps the MoE expert tensors on the CPU.
"%~dp0llama-server.exe" ^
-m E:\qwen\qwen3.5-122B-A10B\UD-Q3_K_XL\Qwen3.5-122B-A10B-UD-Q3_K_XL-00001-of-00003.gguf ^
--mmproj E:\qwen\qwen3.5-122B-A10B\UD-Q3_K_XL\mmproj-BF16.gguf ^
--n-gpu-layers 999 ^
-ot ".ffn_.*_exps.=CPU" ^
--ctx-size 262144 ^
--threads 8 ^
--threads-batch 8 ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
--batch-size 2048 ^
--ubatch-size 1024 ^
--flash-attn on ^
--mlock ^
--host 0.0.0.0 ^
--port 8080 ^
--parallel 1 ^
--cont-batching
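To see what a pattern like that actually offloads, you can check the -ot regex against tensor names. The names below are typical examples of the GGUF MoE naming scheme, not read from this particular model; this is just a sketch of how the match works:

```shell
# Illustrative check of the -ot pattern: expert FFN tensors match the regex
# and go to CPU, while attention tensors don't match and stay on GPU.
# Tensor names here are typical examples, not dumped from the model.
for t in blk.0.ffn_gate_exps.weight blk.0.ffn_up_exps.weight blk.0.attn_q.weight; do
  if printf '%s\n' "$t" | grep -Eq '\.ffn_.*_exps\.'; then
    echo "$t -> CPU"
  else
    echo "$t -> GPU"
  fi
done
```

The large expert weights are what blow past 24GB of VRAM, so routing only those to system RAM while attention and shared layers stay on the GPU is what makes the hybrid setup fast.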

Thank you! I applied these params to my launch script and it has improved to about 8 t/s, roughly a 7x improvement!!

Upon further testing I think I found the culprit: it was the --threads -1 param. Removing it from my old config bumps the inference speed to about 22 t/s!
This might be related to the CPU being an Intel 14900K, which has P and E cores, so forcing -1 threads might be messing with thread scheduling?
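One way to test that hypothesis is to pass an explicit thread count and pin the process to the P-cores. This is only a sketch: the core numbering below is an assumption (on many 14900K systems the 8 P-cores expose logical CPUs 0-15, two SMT threads each), so check lscpu --all --extended for your actual topology:

```shell
# Hypothetical sketch: restrict llama-server to the P-cores (assumed CPUs 0-15)
# and use one compute thread per physical P-core instead of --threads -1.
taskset -c 0-15 ./llama-server --threads 8 <your other flags>
```

This keeps the compute threads off the slower E-cores, which otherwise drag down the all-cores run.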

David337 changed discussion title from Abysmal performance on 1x24GB 3090 Ti + 48GB RAM to SOLVED - Abysmal performance on 1x24GB 3090 Ti + 48GB RAM

For years it's been observable that llama.cpp performance follows a curve as the thread count increases: at some point, adding more threads drops performance.

Specifying -1 for the number of threads attempts to use all cores, which slows it down.

I have noticed that I can saturate the memory bandwidth of dual-channel DDR5-5200 with about 4-5 threads. You can run llama-bench with -t 1,2,3,4,5,6,7,8 and see the 'knee' in the curve; the knee sits in a different spot depending on how much is offloaded to the CPU.
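A sweep like that could look as follows. The model path and the -ngl/-ot flags are assumptions matching the configs above, and -ot in llama-bench is only available in recent builds:

```shell
# Benchmark token generation at 1-8 threads to find the knee of the curve.
./llama-bench -m Qwen3.5-122B-A10B-UD-Q3_K_XL-00001-of-00003.gguf \
  -ngl 999 -ot ".ffn_.*_exps.=CPU" \
  -t 1,2,3,4,5,6,7,8
```

The per-thread-count tg rows in the output table show where throughput stops improving.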

Good observation! 4-5 threads indeed perform better than a naive 8 or 16.

I get ~25 t/s using these on Halo Strix:

version = 1
[*]
parallel = 1
timeout = 900
threads-http = 4
cont-batching = true
no-mmap = true
b = 2048

[Qwen3.5-122B-A10B-GGUF]
ngl = 999
jinja = true
c = 32000
fa = 1
parallel = 1
cram = 0
n-predict = 15000
draft-max = 0
draft-p-min = 0.95
#load-on-startup = false
model = /my-models/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-UD-Q3_K_XL-00001-of-00003.gguf
chat-template-file = /my-models/Qwen3.5-122B-A10B-GGUF/chat_template
mm = /my-models/Qwen3.5-122B-A10B-GGUF/mmproj-BF16.gguf

397B version is ~18 t/s. 35B version is ~70 t/s.
