Q3_K_XL works surprisingly fast on 3x3090 + 128 GB RAM
Thought this might be useful info for some of you with similar setups.
```
prompt eval time = 561212.95 ms / 20638 tokens ( 27.19 ms per token, 36.77 tokens per second)
eval time       =    125.56 ms /     2 tokens ( 62.78 ms per token, 15.93 tokens per second)
total time      = 561338.52 ms / 20640 tokens
```
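As a quick sanity check on numbers like these: ms-per-token and tokens-per-second are just two views of the same ratio. A minimal sketch, using the prompt-eval figures from the log above:

```python
# Sanity-check the reported prompt-eval numbers: ms/token and tokens/s
# are reciprocal views of the same ratio (values from the log above).
prompt_ms, prompt_tokens = 561212.95, 20638

ms_per_token = prompt_ms / prompt_tokens
tokens_per_second = prompt_tokens / (prompt_ms / 1000.0)

print(f"{ms_per_token:.2f} ms per token")   # 27.19
print(f"{tokens_per_second:.2f} tokens/s")  # 36.77
```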
Oh fantastic!
I'm getting similar perf for UD-Q4_K_XL and 72GB VRAM:
- RTX 4090D 48GB
- RTX 3090 24GB
- Intel Xeon W5-3425 with 256GB DDR5-4800
```
prompt eval time = 13726.21 ms / 512 tokens ( 26.81 ms per token, 37.30 tokens per second)
eval time        = 64585.92 ms / 857 tokens ( 75.36 ms per token, 13.27 tokens per second)
```
Compose file:
```yaml
services:
  qwen35:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b8067
    container_name: qwen35
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
    volumes:
      - /home/slavik/.cache/llama.cpp/router/local-qwen35-400b:/root/.cache/llama.cpp
    entrypoint: ["./llama-server"]
    command: >
      --model /root/.cache/llama.cpp/unsloth_Qwen3.5-397B-A17B-GGUF_UD-Q4_K_XL_Qwen3.5-397B-A17B-UD-Q4_K_XL-00001-of-00006.gguf
      --mmproj /root/.cache/llama.cpp/unsloth_Qwen3.5-397B-A17B-GGUF_mmproj-F16.gguf
      --alias local-qwen35-400b
      --host 0.0.0.0 --port 8080
      --ctx-size 65536
      --parallel 1
      --min-p 0 --top-p 0.8 --top-k 20 --temp 0.7
      --chat-template-kwargs "{\"enable_thinking\": false}"
```
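Once the container is up, the server can be exercised through llama-server's OpenAI-compatible endpoint. A minimal client sketch, assuming the port mapping and `--alias` from the compose file above (sampling parameters mirror the command-line flags):

```python
import json
import urllib.request

# Chat request payload mirroring the server's sampling flags.
# "local-qwen35-400b" is the --alias set in the compose file.
payload = {
    "model": "local-qwen35-400b",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0,
}

def chat(base_url="http://localhost:8080"):
    """POST the payload to llama-server's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# chat()  # requires the container to be running on port 8080
```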
Very nice docker compose @SlavikF, I've tried it and the model does run, but I get a ton of failed tool calls and errors using OpenCode. If you're using OpenCode and wouldn't mind sharing your config, I'd appreciate it.
@aaron-newsome
Tool calls are a known issue, for some reason especially with OpenCode. RooCode works fine for me.
There are a few ways to work around it:
- use the branch from this PR: https://github.com/ggml-org/llama.cpp/pull/18675
- this project also offers a workaround for existing llama.cpp versions: https://github.com/crashr/llama-stream
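For context on what these workarounds are papering over: Qwen-family models typically emit tool calls as JSON wrapped in `<tool_call>...</tool_call>` tags, and failures like the ones described usually happen when the client can't parse that out of the stream. A crude fallback sketch, assuming that tag format (an assumption about the model's output, not a guaranteed contract):

```python
import json
import re

# Hedged sketch: pull <tool_call>{...}</tool_call> blocks out of raw
# model output. The tag format is an assumption about Qwen-style
# output; malformed JSON is skipped rather than raising.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text):
    """Return a list of parsed JSON tool-call objects found in text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # skip malformed calls instead of crashing
    return calls

sample = (
    "Sure.\n<tool_call>\n"
    '{"name": "read_file", "arguments": {"path": "a.py"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(sample))
# [{'name': 'read_file', 'arguments': {'path': 'a.py'}}]
```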
In my tests, UD-IQ2_M works far better than Q3_K_XL quality-wise, though it's slightly slower than Q3_K_XL.
Tested with 200k-length code files.
@fizzacles what settings did you use to get that speed?
Hey. This is the startup config I was using.
```shell
./llama-server \
    -m "Qwen3.5-397B-A17B-UD-Q3_K_XL-00001-of-00005.gguf" \
    -fa on \
    --jinja \
    --chat-template-kwargs '{"enable_thinking": false}' \
    -c 32768 \
    -ctv q8_0 \
    -ctk q8_0 \
    --batch-size 128 \
    --ubatch-size 128 \
    -np 1 \
    --no-mmap \
    --no-warmup
```
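The `-ctk q8_0 -ctv q8_0` flags quantize the KV cache, which is a big part of fitting 32k context on this hardware. A rough sketch of the savings, assuming ggml's q8_0 block layout (32 int8 values plus one fp16 scale per 32-element block):

```python
# Rough KV-cache savings from -ctk/-ctv q8_0 vs the default f16.
# Assumes ggml's q8_0 layout: 32 int8 values + one fp16 scale
# per block = 34 bytes per 32 elements.
F16_BYTES_PER_ELEM = 2.0
Q8_0_BYTES_PER_ELEM = 34 / 32  # 1.0625

savings = 1 - Q8_0_BYTES_PER_ELEM / F16_BYTES_PER_ELEM
print(f"{Q8_0_BYTES_PER_ELEM:.4f} bytes/element, {savings:.1%} smaller than f16")
# 1.0625 bytes/element, 46.9% smaller than f16
```

So quantizing both K and V caches to q8_0 roughly halves KV memory versus f16, at the cost of slight cache precision loss.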