Qwen3.5-397B-A17B-IQ4_KSS on 8× RTX 3090 with 161K token context, loaded by ik_llama.cpp, tested with opencode

#12
by martossien - opened

Just sharing a successful local inference report for ubergarm/Qwen3.5-397B-A17B-GGUF using the IQ4_KSS quant, listed at 194.058 GiB and 4.206 BPW.
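As a quick sanity check, the listed size and bits-per-weight are consistent with each other (a sketch assuming the nominal 397B parameter count; the exact tensor-element count differs slightly, so the result is approximate):

```python
# Sanity check: file size vs. bits-per-weight for the IQ4_KSS quant.
# Assumes the nominal 397B parameter count, so the result is approximate.
size_gib = 194.058          # listed quant size in GiB
params = 397e9              # nominal parameter count

total_bits = size_gib * 2**30 * 8
bpw = total_bits / params
print(f"{bpw:.3f} BPW")     # close to the listed 4.206 BPW
```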

Machine:

  • CPU: AMD EPYC 7532 32-Core Processor (64 threads)
  • RAM: 503 GiB DDR4 2933 MHz (78 GiB used, 424 GiB free)
  • GPU: 8× NVIDIA GeForce RTX 3090 (24 GiB each)
  • NVLink: GPU 1↔5 and GPU 3↔6 (4 active links, i.e. 2 NVLink bridges out of a possible 4)
  • OS: Fedora Linux 42, kernel 6.18.x
  • Local serving: ik_llama.cpp
  • NVIDIA driver: 580.126.09

I used LM Studio only to download the model files.

Command:
~/ik_llama.cpp/build/bin/llama-server \
  --model /home/admin_ia/.cache/lm-studio/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf \
  --alias Qwen3.5-397B-A17B-IQ4_KSS \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 161280 --no-mmap \
  --threads 32 --threads-batch 64 \
  --batch-size 4096 --ubatch-size 4096 \
  --parallel 1 --flash-attn on \
  --n-gpu-layers 999 --split-mode graph --split-mode-graph-scheduling \
  --tensor-split 0.97,1.01,1.01,1,0.98,1,1.01,0.95 \
  --n-cpu-moe 18 \
  --cache-type-k q8_0 --cache-type-v q6_0 --k-cache-hadamard \
  --graph-reuse -muge \
  --ctx-checkpoints 48 --ctx-checkpoints-interval 512 --ctx-checkpoints-tolerance 5 \
  --cache-ram 32768 \
  --jinja

The hardest part was balancing VRAM usage evenly across all 8 GPUs, which is why I ended up using:
--tensor-split 0.97,1.01,1.01,1,0.98,1,1.01,0.95
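A hedged sketch of how weights like these can be derived: make each GPU's weight proportional to its free VRAM, as reported by `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits`. The readings below are made-up illustrative numbers, not the author's actual values:

```python
# Derive llama.cpp-style --tensor-split weights from per-GPU free VRAM.
# The readings below are illustrative stand-ins for the output of:
#   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
free_mib = [23500, 24100, 24100, 23900, 23600, 23900, 24100, 23200]

mean = sum(free_mib) / len(free_mib)
weights = [round(m / mean, 2) for m in free_mib]
print(",".join(str(w) for w in weights))
```

In practice this just gives a starting point; the final split is usually nudged by hand after watching actual per-GPU usage during a long-context run.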

A few notes about the settings:

  • -muge: Multi-GPU Expert optimization for MoE
  • --ctx-checkpoints 48: number of context checkpoints
  • --ctx-checkpoints-interval 512: interval between checkpoints
  • --ctx-checkpoints-tolerance 5: checkpointing tolerance
  • --cache-ram 32768: 32 GB CPU RAM cache for repeated prompts
  • --n-cpu-moe 18: keeps 18 MoE expert layers on CPU/system RAM instead of VRAM
  • --split-mode-graph-scheduling: advanced graph scheduling
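With these flags, llama-server exposes its OpenAI-compatible API on port 8080, and clients address the model by the --alias name. A minimal request sketch (the prompt is made up; the commented-out part sends it against a running server):

```python
import json
from urllib import request

# Minimal OpenAI-compatible chat payload for the llama-server above.
# The "model" field matches the --alias given on the command line.
payload = {
    "model": "Qwen3.5-397B-A17B-IQ4_KSS",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
body = json.dumps(payload).encode()

# Uncomment to send against a running server:
# req = request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# print(request.urlopen(req).read().decode())
print(body.decode())
```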

Observed performance:

  • Prompt processing: 50 to 350 tok/s (highly variable depending on the prompt)
  • Generation: 16 to 30 tok/s
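To put the prompt-processing range in perspective, here is a rough worked estimate (simple arithmetic, not a measurement) of the time to prefill the full 161,280-token context at those two extremes:

```python
# Rough prefill-time estimate for a full 161,280-token context
# at the observed prompt-processing extremes.
ctx = 161280
for rate in (50, 350):
    minutes = ctx / rate / 60
    print(f"{rate} tok/s -> {minutes:.1f} min")
# 50 tok/s  -> 53.8 min
# 350 tok/s -> 7.7 min
```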

Many thanks again to ubergarm for this LLM.


@martossien

Thanks for the detailed report, it's so cool to see you tightening up your commands and getting that monster rig dialed in over time!

I shared your results with some other folks attempting the same quant here: https://github.com/ikawrakow/ik_llama.cpp/issues/1495#issuecomment-4111733654

That is a pretty new driver; what version of CUDA are you using? I've heard some folks wondering about the latest 13.2... I'm on 13.1 myself.

Excuse me, I am on CUDA 13.1.
I still need to test this LLM with the new commits on ik_llama.cpp (with fit and other new features).
