Qwen3.5-397B-A17B-IQ4_KSS on 8x RTX 3090 with 161K context, loaded by ik_llama.cpp, tested with opencode
Just sharing a successful local inference report for ubergarm/Qwen3.5-397B-A17B-GGUF using the IQ4_KSS quant, listed at 194.058 GiB and 4.206 BPW.
Machine:
- CPU: AMD EPYC 7532 32-Core Processor (64 threads)
- RAM: 503 GiB DDR4 2933 MHz (78 GiB used, 424 GiB free)
- GPU: 8x NVIDIA GeForce RTX 3090 (24 GiB each)
- NVLink: GPU 1-5 and GPU 3-6 (4 active links; 2 NVLink pairs out of a possible 4)
- OS: Fedora Linux 42, kernel 6.18.x
- Local serving: ik_llama.cpp
- NVIDIA driver: 580.126.09
I used LM Studio only to download the model files.
Command:
```bash
~/ik_llama.cpp/build/bin/llama-server \
  --model /home/admin_ia/.cache/lm-studio/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf \
  --alias Qwen3.5-397B-A17B-IQ4_KSS \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 161280 --no-mmap \
  --threads 32 --threads-batch 64 \
  --batch-size 4096 --ubatch-size 4096 \
  --parallel 1 --flash-attn on \
  --n-gpu-layers 999 \
  --split-mode graph --split-mode-graph-scheduling \
  --tensor-split 0.97,1.01,1.01,1,0.98,1,1.01,0.95 \
  --n-cpu-moe 18 \
  --cache-type-k q8_0 --cache-type-v q6_0 \
  --k-cache-hadamard --graph-reuse -muge \
  --ctx-checkpoints 48 --ctx-checkpoints-interval 512 --ctx-checkpoints-tolerance 5 \
  --cache-ram 32768 --jinja
```
The hardest part was balancing VRAM usage evenly across all 8 GPUs, which is why I ended up using:
--tensor-split 0.97,1.01,1.01,1,0.98,1,1.01,0.95
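As a quick sketch of what those weights mean (assuming llama.cpp-style semantics, where each value is a relative proportion of the GPU-resident tensors), you can compute each GPU's approximate share like this:

```python
# Sketch: --tensor-split values are relative weights; GPU i receives roughly
# weights[i] / sum(weights) of the offloaded tensors. (Approximation only:
# the actual assignment is granular per layer, not a perfect split.)
weights = [0.97, 1.01, 1.01, 1.0, 0.98, 1.0, 1.01, 0.95]
total = sum(weights)
shares = [w / total for w in weights]
for i, s in enumerate(shares):
    print(f"GPU {i}: {s:.2%}")
```

With these near-equal weights, every GPU ends up close to 1/8 of the load, nudged up or down by a percent or two to compensate for uneven per-GPU overhead.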
A few notes about the settings:
- -muge: multi-GPU expert optimization for MoE
- --ctx-checkpoints 48: number of context checkpoints
- --ctx-checkpoints-interval 512: interval between checkpoints
- --ctx-checkpoints-tolerance 5: checkpointing tolerance
- --cache-ram 32768: 32 GB of CPU RAM cache for repeated prompts
- --n-cpu-moe 18: keeps the expert tensors of 18 MoE layers in CPU RAM
- --split-mode-graph-scheduling: advanced graph scheduling
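To see why the quantized KV cache (q8_0 keys, q6_0 values) matters at 161K context, here is a rough size estimate. The layer count, KV-head count, and head dimension below are placeholder values for illustration, not the actual Qwen3.5-397B-A17B config; q8_0 is taken as ~8.5 effective bits per value and q6_0 as ~6.5 (block scales included).

```python
# Rough KV-cache size for a quantized cache (sketch only; model dimensions
# below are PLACEHOLDERS, not the real Qwen3.5-397B-A17B config).
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, bpw_k, bpw_v):
    # Total bits = ctx * layers * kv_heads * head_dim * (bits for K + bits for V)
    bits = ctx * n_layers * n_kv_heads * head_dim * (bpw_k + bpw_v)
    return bits / 8 / 1024**3

# Hypothetical 60-layer model with 8 KV heads of dim 128 at 161280 context,
# q8_0 keys (~8.5 bpw) and q6_0 values (~6.5 bpw)
print(round(kv_cache_gib(60, 8, 128, 161280, 8.5, 6.5), 1))  # ~17.3 GiB
```

The same hypothetical cache in fp16 (16 + 16 bits) would be more than double that, which is exactly the VRAM you need back when you are already packing a ~194 GiB model across 8x24 GiB cards.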
Observed performance:
- Prompt processing: 50 to 350 tok/s (highly variable, depending on the prompt)
- Generation: 16 to 30 tok/s
Many thanks again to ubergarm for this LLM.
Thanks for the detailed report, it's so cool to see you tightening up your commands and getting that monster rig dialed in over time!
I shared your results with some other folks attempting the same quant here: https://github.com/ikawrakow/ik_llama.cpp/issues/1495#issuecomment-4111733654
That is a pretty new driver; what version of CUDA are you using? I've heard some folks wondering about the latest 13.2. I'm on 13.1 myself.
Excuse me, I am on CUDA 13.1.
I still need to test this LLM with the new commits on ik_llama.cpp (with fit and the other new features).
