Q3_K_XL works surprisingly fast for 3x3090 + 128 ram

#4
by fizzacles - opened

Thought this might be useful info for some of you with similar setups.

prompt eval time =  561212.95 ms / 20638 tokens (   27.19 ms per token,    36.77 tokens per second)
       eval time =     125.56 ms /     2 tokens (   62.78 ms per token,    15.93 tokens per second)
      total time =  561338.52 ms / 20640 tokens
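The per-token and tokens-per-second figures above are consistent with each other; a quick sanity check of the arithmetic (values copied straight from the log lines):

```shell
# tokens / (milliseconds / 1000) = tokens per second
awk 'BEGIN { printf "%.2f tok/s prompt eval\n", 20638 / (561212.95 / 1000) }'
awk 'BEGIN { printf "%.2f tok/s eval\n",        2     / (125.56    / 1000) }'
```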
Unsloth AI org

Oh fantastic!

I'm getting similar perf for UD-Q4_K_XL and 72GB VRAM:

  • RTX 4090D 48GB
  • RTX 3090 24GB
  • Intel Xeon W5-3425 with 256GB DDR5-4800
prompt eval time =   13726.21 ms /   512 tokens (   26.81 ms per token,    37.30 tokens per second)
       eval time =   64585.92 ms /   857 tokens (   75.36 ms per token,    13.27 tokens per second)

Compose file:

services:
  qwen35:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b8067
    container_name: qwen35
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
    volumes:
      - /home/slavik/.cache/llama.cpp/router/local-qwen35-400b:/root/.cache/llama.cpp
    entrypoint: ["./llama-server"]
    command: >
      --model  /root/.cache/llama.cpp/unsloth_Qwen3.5-397B-A17B-GGUF_UD-Q4_K_XL_Qwen3.5-397B-A17B-UD-Q4_K_XL-00001-of-00006.gguf
      --mmproj /root/.cache/llama.cpp/unsloth_Qwen3.5-397B-A17B-GGUF_mmproj-F16.gguf
      --alias local-qwen35-400b
      --host 0.0.0.0  --port 8080
      --ctx-size 65536
      --parallel 1
      --min-p 0 --top-p 0.8 --top-k 20 --temp 0.7
      --chat-template-kwargs "{\"enable_thinking\": false}"
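One fiddly bit in the compose command is the `--chat-template-kwargs` argument: after shell quoting it must still be valid JSON. A quick check that the escaped double-quote form used above (and the single-quote form used in bare shell commands) both produce the same valid JSON string:

```shell
# Double-quoted with escaped quotes, as in the compose command above:
a="{\"enable_thinking\": false}"
# Single-quoted, as typically used when invoking llama-server directly:
b='{"enable_thinking": false}'

# Both quoting styles must yield the identical string...
[ "$a" = "$b" ] && echo "quoting OK"
# ...and that string must parse as JSON.
echo "$a" | python3 -m json.tool >/dev/null && echo "valid JSON"
```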

Very nice Docker Compose file, @SlavikF. I've tried it and the model does run, but I get a ton of failed tool calls and errors using Opencode. If you're using Opencode and wouldn't mind sharing your config, I'd appreciate it.

@aaron-newsome
Tool calls are a known issue - for some reason especially with Opencode. RooCode works fine for me.

There are a few ways to work around it:

  1. use branch from this PR:
    https://github.com/ggml-org/llama.cpp/pull/18675

  2. also this project offers workaround for existing llama.cpp versions:
    https://github.com/crashr/llama-stream

shimmyshimmer pinned discussion

@fizzacles what settings did you use to get that speed?

In my tests, UD-IQ2_M works far better than Q3_K_XL, though it's slightly slower.

Tested with code files around 200k in length.

@fizzacles what settings did you use to get that speed?

Hey. This is the startup config I was using.

./llama-server \
  -m "Qwen3.5-397B-A17B-UD-Q3_K_XL-00001-of-00005.gguf" \
  -fa on \
  --jinja \
  --chat-template-kwargs '{"enable_thinking": false}' \
  -c 32768 \
  -ctv q8_0 \
  -ctk q8_0 \
  --batch-size 128 \
  --ubatch-size 128 \
  -np 1 \
  --no-mmap \
  --no-warmup
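Once the server is up (it listens on 127.0.0.1:8080 by default since the command above sets no `--host`/`--port`), a minimal smoke test against the OpenAI-compatible endpoint looks like this; the prompt and `max_tokens` value are just placeholders:

```shell
# Request body for a single chat completion.
payload='{"messages":[{"role":"user","content":"Say hi in five words."}],"max_tokens":32}'

# Validate the body locally before sending it.
echo "$payload" | python3 -m json.tool >/dev/null && echo "payload OK"

# Send the request; prints the JSON response if the server is up.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" || echo "server not reachable"
```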
