Qwen3.5-397B-A17B-IQ4_KSS on 8× RTX 3090 with 161K token context, loaded by ik_llama.cpp, tested with opencode

#12
by martossien - opened

Just sharing a successful local inference report for ubergarm/Qwen3.5-397B-A17B-GGUF using the IQ4_KSS quant, listed at 194.058 GiB and 4.206 BPW.
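As a quick sanity check, the listed size and bits-per-weight are consistent with each other (a sketch assuming the nominal 397B parameter count; the exact tensor-element count differs slightly, so the result is approximate):

```python
# Sanity check: file size vs. bits-per-weight for the IQ4_KSS quant.
# Assumes the nominal 397B parameter count, so the result is approximate.
size_gib = 194.058          # listed quant size in GiB
params = 397e9              # nominal parameter count

total_bits = size_gib * 2**30 * 8
bpw = total_bits / params
print(f"{bpw:.3f} BPW")     # close to the listed 4.206 BPW
```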

Machine:

  • CPU: AMD EPYC 7532 32-Core Processor (64 threads)
  • RAM: 503 GiB DDR4 2933 MHz (78 GiB used, 424 GiB free)
  • GPU: 8× NVIDIA GeForce RTX 3090 (24 GiB each)
  • NVLink: GPU 1↔5 and GPU 3↔6 (4 active links, i.e. 2 NVLink bridges out of a possible 4)
  • OS: Fedora Linux 42, kernel 6.18.x
  • Local serving: ik_llama.cpp
  • NVIDIA driver: 580.126.09

I used LM Studio only to download the model files.

Command:
~/ik_llama.cpp/build/bin/llama-server \
  --model /home/admin_ia/.cache/lm-studio/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf \
  --alias Qwen3.5-397B-A17B-IQ4_KSS \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 161280 --no-mmap \
  --threads 32 --threads-batch 64 \
  --batch-size 4096 --ubatch-size 4096 \
  --parallel 1 --flash-attn on \
  --n-gpu-layers 999 --split-mode graph --split-mode-graph-scheduling \
  --tensor-split 0.97,1.01,1.01,1,0.98,1,1.01,0.95 \
  --n-cpu-moe 18 \
  --cache-type-k q8_0 --cache-type-v q6_0 --k-cache-hadamard \
  --graph-reuse -muge \
  --ctx-checkpoints 48 --ctx-checkpoints-interval 512 --ctx-checkpoints-tolerance 5 \
  --cache-ram 32768 \
  --jinja

The hardest part was balancing VRAM usage evenly across all 8 GPUs, which is why I ended up using:
--tensor-split 0.97,1.01,1.01,1,0.98,1,1.01,0.95
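A hedged sketch of how weights like these can be derived: make each GPU's weight proportional to its free VRAM, as reported by `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits`. The readings below are made-up illustrative numbers, not the author's actual values:

```python
# Derive llama.cpp-style --tensor-split weights from per-GPU free VRAM.
# The readings below are illustrative stand-ins for the output of:
#   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
free_mib = [23500, 24100, 24100, 23900, 23600, 23900, 24100, 23200]

mean = sum(free_mib) / len(free_mib)
weights = [round(m / mean, 2) for m in free_mib]
print(",".join(str(w) for w in weights))
```

In practice this just gives a starting point; the final split is usually nudged by hand after watching actual per-GPU usage during a long-context run.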

A few notes about the settings:

  • -muge: Multi-GPU Expert optimization for MoE
  • --ctx-checkpoints 48: number of context checkpoints
  • --ctx-checkpoints-interval 512: interval between checkpoints
  • --ctx-checkpoints-tolerance 5: checkpointing tolerance
  • --cache-ram 32768: 32 GB CPU RAM cache for repeated prompts
  • --n-cpu-moe 18: keeps 18 MoE expert layers on CPU/system RAM instead of VRAM
  • --split-mode-graph-scheduling: advanced graph scheduling
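With these flags, llama-server exposes its OpenAI-compatible API on port 8080, and clients address the model by the --alias name. A minimal request sketch (the prompt is made up; the commented-out part sends it against a running server):

```python
import json
from urllib import request

# Minimal OpenAI-compatible chat payload for the llama-server above.
# The "model" field matches the --alias given on the command line.
payload = {
    "model": "Qwen3.5-397B-A17B-IQ4_KSS",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
body = json.dumps(payload).encode()

# Uncomment to send against a running server:
# req = request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# print(request.urlopen(req).read().decode())
print(body.decode())
```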

Observed performance:

  • Prompt processing: 50 to 350 tok/s (highly variable depending on the prompt)
  • Generation: 16 to 30 tok/s
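To put the prompt-processing range in perspective, here is a rough worked estimate (simple arithmetic, not a measurement) of the time to prefill the full 161,280-token context at those two extremes:

```python
# Rough prefill-time estimate for a full 161,280-token context
# at the observed prompt-processing extremes.
ctx = 161280
for rate in (50, 350):
    minutes = ctx / rate / 60
    print(f"{rate} tok/s -> {minutes:.1f} min")
# 50 tok/s  -> 53.8 min
# 350 tok/s -> 7.7 min
```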

Many thanks again to ubergarm for this LLM.


@martossien

Thanks for the detailed report, it's so cool to see you tightening up your commands and getting that monster rig dialed in over time!

I shared your results with some other folks attempting the same quant here: https://github.com/ikawrakow/ik_llama.cpp/issues/1495#issuecomment-4111733654

That is a pretty new driver; what version of CUDA are you using? I've heard some folks wondering about the latest 13.2... I'm on 13.1 myself.

Excuse me, I am on CUDA 13.1.
I still need to test this LLM with the new commits on ik_llama.cpp (with fit and other new features).
