SOLVED - Abysmal performance on 1x24GB 3090 Ti + 48GB RAM
Hi, I'm getting abysmal 1.1 t/s performance on a 24GB 3090 Ti + 48GB RAM running llama.cpp, despite trying smaller quants like UD-Q3_K_XL and UD-IQ3_XXS. The model is stored on a fast NVMe, and even with --no-mmap it fits comfortably into memory. I also tried smaller context sizes like 4096...
Unsloth's Qwen3.5 docs mention magical numbers like 25+ t/s even for the bigger A397B at Q4 on 1x24GB GPU + RAM... so what am I doing wrong here?
my params:
IMAGE="ghcr.io/ggml-org/llama.cpp:server-cuda"
CACHE_DIR="${HOME}/.cache/llama.cpp"
podman run \
--rm \
--name llama-server \
--replace \
-it \
--network host \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
-v "${CACHE_DIR}:/root/.cache/llama.cpp:Z" \
"${IMAGE}" \
-hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
--jinja \
--threads -1 \
--ctx-size 4096 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.0 \
--top-k 20 \
--presence-penalty 1.5 \
--repeat-penalty 1.0 \
--chat-template-kwargs "{\"enable_thinking\": true}" \
--port 3000 \
--no-prefill-assistant \
--host 0.0.0.0
As is usual for offloaded MoE models, try offloading all expert layers onto the CPU while keeping the rest loaded on the GPU.
Apply my config to your setup
I have 1x16GB 4070 Ti Super + 64GB DDR4 RAM
~220 pp / ~15 tg (the -ot expert-offload line below is the most important one):
"%~dp0llama-server.exe" ^
-m E:\qwen\qwen3.5-122B-A10B\UD-Q3_K_XL\Qwen3.5-122B-A10B-UD-Q3_K_XL-00001-of-00003.gguf ^
--mmproj E:\qwen\qwen3.5-122B-A10B\UD-Q3_K_XL\mmproj-BF16.gguf ^
--n-gpu-layers 999 ^
-ot ".ffn_.*_exps.=CPU" ^
--ctx-size 262144 ^
--threads 8 ^
--threads-batch 8 ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
--batch-size 2048 ^
--ubatch-size 1024 ^
--flash-attn on ^
--mlock ^
--host 0.0.0.0 ^
--port 8080 ^
--parallel 1 ^
--cont-batching
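To see what the -ot (tensor override) pattern above actually does, here is a small sketch matching it against illustrative tensor names in llama.cpp's GGUF naming style for MoE blocks (the names below are examples for demonstration; inspect your own model to see the real list). Anything matching the pattern stays in system RAM, everything else goes to VRAM:

```python
import re

# The -ot/--override-tensor pattern from the script above: any tensor whose
# name contains "ffn_<something>_exps" is kept in host (CPU) memory.
pattern = re.compile(r".ffn_.*_exps.")

# Illustrative tensor names in llama.cpp's MoE naming style (assumed here
# for demonstration purposes).
tensors = [
    "blk.0.attn_q.weight",         # attention -> not matched, stays on GPU
    "blk.0.ffn_gate_exps.weight",  # routed experts -> matched, kept on CPU
    "blk.0.ffn_down_exps.weight",  # matched, kept on CPU
    "blk.0.ffn_up_exps.weight",    # matched, kept on CPU
    "blk.0.ffn_gate_inp.weight",   # expert router -> not matched, stays on GPU
]

for name in tensors:
    dev = "CPU" if pattern.search(name) else "GPU"
    print(f"{name:30s} -> {dev}")
```

The point of the pattern is that the big, rarely-all-active expert weights go to RAM while the dense attention layers and the small router stay on the GPU, which is why a 122B MoE can run usably on a 16-24GB card.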
Thank you! I applied these params to my launch script and it improved to about 8 t/s, an 8x improvement!
Upon further testing I think I found the culprit: the --threads -1 param. Removing it from my old config bumps the inference speed to about 22 t/s!
This might be related to my CPU being an Intel 14900K, which has P and E cores, so forcing -1 threads might be messing with thread scheduling.
For years we've observed a performance curve in llama.cpp as you increase threads: at some point, adding more threads drops performance.
Specifying -1 for the number of threads will attempt to use all cores, which slows it down.
I have noticed that I can saturate the memory bandwidth of dual-channel DDR5-5200 with about 4-5 threads. You can run llama-bench with -t 1,2,3,4,5,6,7,8 and see the 'knee' in the curve. The knee is in a different spot depending on how much is offloaded to the CPU.
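The bandwidth-saturation point above can be sanity-checked with rough back-of-envelope arithmetic: CPU-side decode is usually memory-bandwidth-bound, so an upper bound is bandwidth divided by bytes read per token. All numbers below are assumptions (theoretical DDR5-5200 dual-channel bandwidth, ~10B active params for an A10B model, ~3.5 bits/weight as a rough Q3-class average):

```python
# Rough ceiling estimate for decode speed when weights stream from system RAM.
bandwidth_gb_s = 5200e6 * 8 * 2 / 1e9  # DDR5-5200, 64-bit bus x 2 channels -> ~83.2 GB/s theoretical
active_params = 10e9                   # A10B: ~10B parameters active per token (assumption)
bits_per_weight = 3.5                  # rough average for a Q3-class quant (assumption)

bytes_per_token = active_params * bits_per_weight / 8          # ~4.4 GB read per token
ceiling_tps = bandwidth_gb_s / (bytes_per_token / 1e9)
print(f"~{ceiling_tps:.0f} t/s upper bound if every active weight came from RAM")
# With attention and shared layers kept on the GPU, fewer bytes cross the RAM
# bus per token, so real decode speed can land above this all-from-RAM figure.
```

This lands around ~19 t/s for the all-from-RAM case, which is consistent with the ~22 t/s reported once threading was fixed, and it explains why a handful of threads is enough: a few cores already saturate the memory bus.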
Good observation! 4-5 threads do indeed perform better than a naive 8 or 16.
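The thread sweep described above can be sketched as a tiny knee-finder: run llama-bench once with a comma-separated thread list (e.g. `llama-bench -m model.gguf -t 1,2,3,4,5,6,7,8`), collect the token-generation rates, then pick the smallest thread count within a few percent of the best. The measurements below are made-up illustrative numbers, not real benchmark output:

```python
def pick_threads(results, tolerance=0.03):
    """results: list of (threads, tokens_per_second) pairs from a llama-bench
    sweep; returns the smallest thread count within `tolerance` of the peak."""
    best = max(tps for _, tps in results)
    for threads, tps in sorted(results):
        if tps >= best * (1 - tolerance):
            return threads

# Made-up example numbers shaped like the curve described above: throughput
# climbs until the memory bus saturates, then slowly degrades.
measurements = [(1, 6.1), (2, 11.8), (3, 16.9), (4, 21.5),
                (5, 22.0), (6, 21.7), (7, 20.9), (8, 19.8)]
print(pick_threads(measurements))  # -> 4 (within 3% of the peak at 5 threads)
```

Picking the smallest near-peak count rather than the absolute peak leaves cores free for the rest of the system at essentially no throughput cost.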
I get ~25 t/s using these on Halo Strix:
version = 1
[*]
parallel = 1
timeout = 900
threads-http = 4
cont-batching = true
no-mmap = true
b = 2048
[Qwen3.5-122B-A10B-GGUF]
ngl = 999
jinja = true
c = 32000
fa = 1
parallel = 1
cram = 0
n-predict = 15000
draft-max = 0
draft-p-min = 0.95
#load-on-startup = false
model = /my-models/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-UD-Q3_K_XL-00001-of-00003.gguf
chat-template-file = /my-models/Qwen3.5-122B-A10B-GGUF/chat_template
mm = /my-models/Qwen3.5-122B-A10B-GGUF/mmproj-BF16.gguf
397B version is ~18 t/s. 35B version is ~70 t/s.