AesSedai/Kimi-K2.5-GGUF using the Q4_X on 8 RTX 3090

#7
by martossien - opened

Just sharing a local inference test report for AesSedai/Kimi-K2.5-GGUF using the Q4_X quant (work in progress).
Machine

CPU: AMD EPYC 7532, 32 cores / 64 threads
RAM: 503 GiB DDR4 2933 MHz
GPU: 8 × NVIDIA GeForce RTX 3090 (24 GiB each)
NVLink: GPU 1↔5 and GPU 3↔6 (two NVLink bridges, 4 active links total)
OS: Fedora Linux 42, kernel 6.18.x
Serving stack: latest ik_llama.cpp (CUDA build)
NVIDIA driver: 580.142 (new CUDA 13 series)

Model files were downloaded via LM Studio only; all inference is done with ik_llama.cpp.

As you can see I only have 512 GB of RAM, so for this test I bought a 128 GB SSD for a swap partition (otherwise I get a crash, because the first buffer needs more RAM than I have).

For the first successful load I had to place layers manually, because the automatic placement uses more RAM than I have. So the first version is:

~/ik_llama.cpp/build/bin/llama-server \
--model /home/admin_ia/.cache/lm-studio/models/AesSedai/Kimi-K2.5-GGUF/Kimi-K2.5-Q4_X-00001-of-00014.gguf \
--alias Kimi-K2.5-Q4_X \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 16384 \
--threads 32 \
--threads-batch 64 \
--batch-size 256 \
--ubatch-size 256 \
--parallel 1 \
--flash-attn on \
--n-gpu-layers 999 \
--cache-type-k q6_0 \
--cache-type-v q4_0 \
--k-cache-hadamard \
--graph-reuse \
-muge \
--jinja \
--override-tensor 'blk.1.ffn_down_exps.weight=CUDA0,blk.1.ffn_gate_exps.weight=CUDA0,blk.1.ffn_up_exps.weight=CUDA0,blk.8.ffn_down_exps.weight=CUDA1,blk.8.ffn_gate_exps.weight=CUDA1,blk.8.ffn_up_exps.weight=CUDA1,blk.16.ffn_down_exps.weight=CUDA2,blk.16.ffn_gate_exps.weight=CUDA2,blk.16.ffn_up_exps.weight=CUDA2,blk.24.ffn_down_exps.weight=CUDA3,blk.24.ffn_gate_exps.weight=CUDA3,blk.24.ffn_up_exps.weight=CUDA3,blk.32.ffn_down_exps.weight=CUDA4,blk.32.ffn_gate_exps.weight=CUDA4,blk.32.ffn_up_exps.weight=CUDA4,blk.40.ffn_down_exps.weight=CUDA5,blk.40.ffn_gate_exps.weight=CUDA5,blk.40.ffn_up_exps.weight=CUDA5,blk.48.ffn_down_exps.weight=CUDA6,blk.48.ffn_gate_exps.weight=CUDA6,blk.48.ffn_up_exps.weight=CUDA6,blk.56.ffn_down_exps.weight=CUDA7,blk.56.ffn_gate_exps.weight=CUDA7,blk.56.ffn_up_exps.weight=CUDA7,blk.2.ffn_down_exps.weight=CPU,blk.2.ffn_gate_exps.weight=CPU,blk.2.ffn_up_exps.weight=CPU,blk.3.ffn_down_exps.weight=CPU,blk.3.ffn_gate_exps.weight=CPU,blk.3.ffn_up_exps.weight=CPU,blk.4.ffn_down_exps.weight=CPU,blk.4.ffn_gate_exps.weight=CPU,blk.4.ffn_up_exps.weight=CPU,blk.5.ffn_down_exps.weight=CPU,blk.5.ffn_gate_exps.weight=CPU,blk.5.ffn_up_exps.weight=CPU,blk.6.ffn_down_exps.weight=CPU,blk.6.ffn_gate_exps.weight=CPU,blk.6.ffn_up_exps.weight=CPU,blk.7.ffn_down_exps.weight=CPU,blk.7.ffn_gate_exps.weight=CPU,blk.7.ffn_up_exps.weight=CPU,blk.9.ffn_down_exps.weight=CPU,blk.9.ffn_gate_exps.weight=CPU,blk.9.ffn_up_exps.weight=CPU,blk.10.ffn_down_exps.weight=CPU,blk.10.ffn_gate_exps.weight=CPU,blk.10.ffn_up_exps.weight=CPU,blk.11.ffn_down_exps.weight=CPU,blk.11.ffn_gate_exps.weight=CPU,blk.11.ffn_up_exps.weight=CPU,blk.12.ffn_down_exps.weight=CPU,blk.12.ffn_gate_exps.weight=CPU,blk.12.ffn_up_exps.weight=CPU,blk.13.ffn_down_exps.weight=CPU,blk.13.ffn_gate_exps.weight=CPU,blk.13.ffn_up_exps.weight=CPU,blk.14.ffn_down_exps.weight=CPU,blk.14.ffn_gate_exps.weight=CPU,blk.14.ffn_up_exps.weight=CPU,blk.15.ffn_down_exps.weight=CPU,blk.15.ffn_gate_exps.weight=CPU,blk.15.ffn_up_exps.weight=CPU,blk.17.ffn_down_exps.weight=CPU,blk.17.ffn_gate_exps.weight=CPU,blk.17.ffn_up_exps.weight=CPU,blk.18.ffn_down_exps.weight=CPU,blk.18.ffn_gate_exps.weight=CPU,blk.18.ffn_up_exps.weight=CPU,blk.19.ffn_down_exps.weight=CPU,blk.19.ffn_gate_exps.weight=CPU,blk.19.ffn_up_exps.weight=CPU,blk.20.ffn_down_exps.weight=CPU,blk.20.ffn_gate_exps.weight=CPU,blk.20.ffn_up_exps.weight=CPU,blk.21.ffn_down_exps.weight=CPU,blk.21.ffn_gate_exps.weight=CPU,blk.21.ffn_up_exps.weight=CPU,blk.22.ffn_down_exps.weight=CPU,blk.22.ffn_gate_exps.weight=CPU,blk.22.ffn_up_exps.weight=CPU,blk.23.ffn_down_exps.weight=CPU,blk.23.ffn_gate_exps.weight=CPU,blk.23.ffn_up_exps.weight=CPU,blk.25.ffn_down_exps.weight=CPU,blk.25.ffn_gate_exps.weight=CPU,blk.25.ffn_up_exps.weight=CPU,blk.26.ffn_down_exps.weight=CPU,blk.26.ffn_gate_exps.weight=CPU,blk.26.ffn_up_exps.weight=CPU,blk.27.ffn_down_exps.weight=CPU,blk.27.ffn_gate_exps.weight=CPU,blk.27.ffn_up_exps.weight=CPU,blk.28.ffn_down_exps.weight=CPU,blk.28.ffn_gate_exps.weight=CPU,blk.28.ffn_up_exps.weight=CPU,blk.29.ffn_down_exps.weight=CPU,blk.29.ffn_gate_exps.weight=CPU,blk.29.ffn_up_exps.weight=CPU,blk.30.ffn_down_exps.weight=CPU,blk.30.ffn_gate_exps.weight=CPU,blk.30.ffn_up_exps.weight=CPU,blk.31.ffn_down_exps.weight=CPU,blk.31.ffn_gate_exps.weight=CPU,blk.31.ffn_up_exps.weight=CPU,blk.33.ffn_down_exps.weight=CPU,blk.33.ffn_gate_exps.weight=CPU,blk.33.ffn_up_exps.weight=CPU,blk.34.ffn_down_exps.weight=CPU,blk.34.ffn_gate_exps.weight=CPU,blk.34.ffn_up_exps.weight=CPU,blk.35.ffn_down_exps.weight=CPU,blk.35.ffn_gate_exps.weight=CPU,blk.35.ffn_up_exps.weight=CPU,blk.36.ffn_down_exps.weight=CPU,blk.36.ffn_gate_exps.weight=CPU,blk.36.ffn_up_exps.weight=CPU,blk.37.ffn_down_exps.weight=CPU,blk.37.ffn_gate_exps.weight=CPU,blk.37.ffn_up_exps.weight=CPU,blk.38.ffn_down_exps.weight=CPU,blk.38.ffn_gate_exps.weight=CPU,blk.38.ffn_up_exps.weight=CPU,blk.39.ffn_down_exps.weight=CPU,blk.39.ffn_gate_exps.weight=CPU,blk.39.ffn_up_exps.weight=CPU,blk.41.ffn_down_exps.weight=CPU,blk.41.ffn_gate_exps.weight=CPU,blk.41.ffn_up_exps.weight=CPU,blk.42.ffn_down_exps.weight=CPU,blk.42.ffn_gate_exps.weight=CPU,blk.42.ffn_up_exps.weight=CPU,blk.43.ffn_down_exps.weight=CPU,blk.43.ffn_gate_exps.weight=CPU,blk.43.ffn_up_exps.weight=CPU,blk.44.ffn_down_exps.weight=CPU,blk.44.ffn_gate_exps.weight=CPU,blk.44.ffn_up_exps.weight=CPU,blk.45.ffn_down_exps.weight=CPU,blk.45.ffn_gate_exps.weight=CPU,blk.45.ffn_up_exps.weight=CPU,blk.46.ffn_down_exps.weight=CPU,blk.46.ffn_gate_exps.weight=CPU,blk.46.ffn_up_exps.weight=CPU,blk.47.ffn_down_exps.weight=CPU,blk.47.ffn_gate_exps.weight=CPU,blk.47.ffn_up_exps.weight=CPU,blk.49.ffn_down_exps.weight=CPU,blk.49.ffn_gate_exps.weight=CPU,blk.49.ffn_up_exps.weight=CPU,blk.50.ffn_down_exps.weight=CPU,blk.50.ffn_gate_exps.weight=CPU,blk.50.ffn_up_exps.weight=CPU,blk.51.ffn_down_exps.weight=CPU,blk.51.ffn_gate_exps.weight=CPU,blk.51.ffn_up_exps.weight=CPU,blk.52.ffn_down_exps.weight=CPU,blk.52.ffn_gate_exps.weight=CPU,blk.52.ffn_up_exps.weight=CPU,blk.53.ffn_down_exps.weight=CPU,blk.53.ffn_gate_exps.weight=CPU,blk.53.ffn_up_exps.weight=CPU,blk.54.ffn_down_exps.weight=CPU,blk.54.ffn_gate_exps.weight=CPU,blk.54.ffn_up_exps.weight=CPU,blk.55.ffn_down_exps.weight=CPU,blk.55.ffn_gate_exps.weight=CPU,blk.55.ffn_up_exps.weight=CPU,blk.57.ffn_down_exps.weight=CPU,blk.57.ffn_gate_exps.weight=CPU,blk.57.ffn_up_exps.weight=CPU,blk.58.ffn_down_exps.weight=CPU,blk.58.ffn_gate_exps.weight=CPU,blk.58.ffn_up_exps.weight=CPU,blk.59.ffn_down_exps.weight=CPU,blk.59.ffn_gate_exps.weight=CPU,blk.59.ffn_up_exps.weight=CPU,blk.60.ffn_down_exps.weight=CPU,blk.60.ffn_gate_exps.weight=CPU,blk.60.ffn_up_exps.weight=CPU'
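As an aside, the long --override-tensor value follows a regular pattern (expert blocks 1, 8, 16, 24, 32, 40, 48, 56 pinned to CUDA0..CUDA7; every other expert block from 2..60 kept on CPU), so it can be generated with a small script instead of typed by hand. This is just a sketch reproducing the mapping above; adjust the block lists for a different split:

```shell
#!/bin/sh
# Build the --override-tensor value used in the command above.
# Assumed layout (matches that command): blocks 1,8,16,24,32,40,48,56
# go to CUDA0..CUDA7, all remaining expert blocks 2..60 stay on CPU.
ot=""
gpu=0
for blk in 1 8 16 24 32 40 48 56; do
    for t in down gate up; do
        ot="${ot}blk.${blk}.ffn_${t}_exps.weight=CUDA${gpu},"
    done
    gpu=$((gpu + 1))
done
blk=2
while [ "$blk" -le 60 ]; do
    case "$blk" in
        8|16|24|32|40|48|56) ;;          # already placed on a GPU
        *)
            for t in down gate up; do
                ot="${ot}blk.${blk}.ffn_${t}_exps.weight=CPU,"
            done
            ;;
    esac
    blk=$((blk + 1))
done
ot="${ot%,}"                              # drop trailing comma
printf -- "--override-tensor '%s'\n" "$ot"
```

Paste the printed argument into the llama-server command line.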


First result (test 1):
Generation: 8.4 t/s | Prompt: 7.2 t/s, Ctx: 211 / 211 / 16384

BUT I must continue...

First test with context at --ctx-size 102400, via opencode:

prompt eval time = 542344.01 ms / 15061 tokens ( 36.01 ms per token, 27.77 tokens per second)
eval time = 8214.41 ms / 65 tokens ( 126.38 ms per token, 7.91 tokens per second)
total time = 550558.42 ms / 15126 tokens
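As a sanity check, the reported rates follow directly from the token counts and wall times; recomputing them from the log lines above:

```shell
# Recompute tokens/sec from the llama-server timings quoted above.
awk 'BEGIN {
    printf "prompt: %.2f t/s\n", 15061 / (542344.01 / 1000)   # prompt eval
    printf "eval:   %.2f t/s\n", 65 / (8214.41 / 1000)        # generation
}'
# prints: prompt: 27.77 t/s
#         eval:   7.91 t/s
```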


Prompt processing is slow, but a lot of this LLM sits in RAM/on the CPU.

If you're having to use a swap partition for this, I'd recommend stepping down to the IQ3_S instead of the Q4_X. The entire model is needed for prompt processing so your swap partition offloading is what's dragging that token rate down more than expected I think.

I can't use IQ3_S: that quant has some trouble with code (it forgets some " or ; ...). The swap is only used during the load; after that it uses 94% of RAM (without touching swap)...
I'll continue the tests.


Nice effort @martossien ! I will keep an eye on your progress with this beast model.

A big new change in ik_llama.cpp (for me ;) ):
--fit and --fit-margin N
Other commits:

Date   Commit  Description                       Impact
Mar 27 93ae47e Fix CUDA Hadamard transform bug   Critical
Mar 26 8ab016e Fix CPU flash attention bf16      Important
Mar 26 78977c0 Fix jinja                         Important
Mar 26 a84d90a Fix bug in #1506                  Stability
Mar 25 dd75fd0 Fix KV cache split graph          Stability
Mar 24 cdf9142 Fix Qwen3.5 grammar               Compatibility
Mar 25 86f4f51 Auto-fit MoE models               Major
Mar 25 b6bac1a Auto-fit dense models             Major
Mar 25 4b1a656 Documentation --fit/--fit-margin  Documentation
Mar 24 233225d Layer sizes for GPU layers        Important
Mar 25 1f3e832 Improve MTP acceptance rate       Performance
Mar 28 798af86 Correct split modes llama-bench
Mar 26 c06067f Ignore MTP in memory calc
Mar 26 9eaf105 Print pinned memory info
Mar 26 095bd3d Pinned memory with mmap
Mar 26 d66dc7c Restore pinned memory usage
Mar 26 e46601c Log probs on sampling crash

So I tested this new feature:
~/ik_llama.cpp/build/bin/llama-server \
--model /home/admin_ia/.cache/lm-studio/models/AesSedai/Kimi-K2.5-GGUF/Kimi-K2.5-Q4_X-00001-of-00014.gguf \
--alias Kimi-K2.5-Q4_X_fit \
--host 0.0.0.0 \
--port 8081 \
--ctx-size 102400 \
--threads 32 \
--threads-batch 64 \
--batch-size 256 \
--ubatch-size 256 \
--parallel 1 \
--flash-attn on \
--n-gpu-layers 999 \
--fit \
--fit-margin 3072 \
--cache-type-k q6_0 \
--cache-type-v q4_0 \
--k-cache-hadamard \
--graph-reuse \
--no-mmap \
-muge \
--jinja

So now (at the beginning of the context; I still have to test it at 100K):
| 0 NVIDIA GeForce RTX 3090 22303MiB / 24576MiB
| 1 NVIDIA GeForce RTX 3090 21962MiB / 24576MiB
| 2 NVIDIA GeForce RTX 3090 21961MiB / 24576MiB
| 3 NVIDIA GeForce RTX 3090 21744MiB / 24576MiB
| 4 NVIDIA GeForce RTX 3090 22248MiB / 24576MiB
| 5 NVIDIA GeForce RTX 3090 21744MiB / 24576MiB
| 6 NVIDIA GeForce RTX 3090 21962MiB / 24576MiB
| 7 NVIDIA GeForce RTX 3090 22933MiB / 24576MiB
The trouble is with GPU7 (I had the same with manual placement).
Now 98% of RAM is used (no swap) BUT the graph seems to work! So a slightly better result...

--batch-size 256
--ubatch-size 256

Those also have to be adapted, since your PP will be abysmally LOW & SLOW.

--fit-margin 3072

Why so high? Not a headless machine?

--batch-size and --ubatch-size use a lot of VRAM, and when a lot of the model is in RAM/on CPU, the CPU is the bottleneck. I want to test 512, but I need to work at 100K of context (my dev team needs this minimum; opencode compresses context at 70%).
--fit-margin 3072: in fact fit is not magic. You give it a goal, but the automatic fit does what it can, and with this LLM at 100K context the margin must be near this size.
And the load takes a long time, so I must test with low parameters at 100K, and if VRAM doesn't crash I can check whether I can gain more.
Load: 1 h
100K test: 2 h (it depends)
Finding better settings and misc: 30 min
So about 3 h 30 per complete test, and I think I need 10 to 20 complete tests to find the best result: code quality, big context, speed.
After this I'll give access to my dev team; they'll use it for confidential code tasks and give their feedback.
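For what it's worth, that time budget adds up as follows (a trivial check, assuming the 1 h / 2 h / 30 min figures above hold):

```shell
# Total tuning time: hours per complete test, and for a 10- or 20-test campaign.
awk 'BEGIN {
    per = 1.0 + 2.0 + 0.5   # load + 100K test + tuning, in hours
    printf "per test: %.1f h, 10 tests: %.0f h, 20 tests: %.0f h\n",
           per, 10 * per, 20 * per
}'
# prints: per test: 3.5 h, 10 tests: 35 h, 20 tests: 70 h
```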

Load 1H

Mother of God, why so slow? I'm on PCIE4 / 7443P with only 24 cores and it takes a max of 3-5 minutes per load!!!

OK, you could also try (since you're using a lot of CPU threads) SCHED_RR:
sudo chrt --rr -p 99 ik-llama_PID

Also you might want to recompile ik_llama, because as of 2 h ago: https://github.com/ikawrakow/ik_llama.cpp/pull/1540 --> I never knew about the bug, HOWEVER I was going nuts seeing my "-ts" being constantly ignored! WTG!

Hello everyone,

First of all, thank you to all contributors in this thread and especially to the model author for this impressive quantization ( @AesSedai ). It’s quite something to run this model locally.

After quite a long night of testing (and many crashes), I wanted to share a working configuration and some feedback that might help others.

Working command
~/ik_llama.cpp/build/bin/llama-server \
--model /home/admin_ia/.cache/lm-studio/models/AesSedai/Kimi-K2.5-GGUF/Kimi-K2.5-Q4_X-00001-of-00014.gguf \
--alias Kimi-K2.5-Q4_X_fit_nomuge \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 92160 \
--threads 32 \
--threads-batch 64 \
--batch-size 256 \
--ubatch-size 256 \
--parallel 1 \
--flash-attn on \
--n-gpu-layers 999 \
--fit \
--cache-type-k q5_0 \
--cache-type-v q4_0 \
--k-cache-hadamard \
--graph-reuse \
--no-mmap \
--jinja \
--fit-margin 12000,12000,12000,12000,12000,12000,12000,12550

We recompiled with the latest commits as suggested in this discussion.

We did not test:
sudo chrt --rr 99
At this point the setup was already quite unstable during tuning (many crashes), so we avoided adding more risk.

Important clarification about --fit-margin
At first, I misunderstood how --fit-margin works:
It is NOT a margin applied after allocation
It is applied during the initial layer distribution phase
You can specify per-GPU values using commas
Why we reduced GPU7 load
GPU7 was consistently about ~1 GB more loaded than the other GPUs.
This caused:
crashes around ~32k context
VRAM exhaustion specifically on GPU7

To fix this, we used a higher margin:
12550 (for GPU7)
This prevents one layer from being loaded on GPU7, restoring balance across GPUs and allowing much higher context.

RAM constraints

We were running at:
~99% system RAM usage
< 4 GB free
without using swap

We initially tried -muge, but:
it pushed swap usage very aggressively
and in our case, it caused crashes
So we removed it entirely.

Results
After multiple runs and validations:
Stable context: ~90k tokens
Above that: unstable / crashes
KV cache growth becomes critical near the limit

We also considered:
--batch-size 512
--ubatch-size 512
But:
VRAM is already the limiting factor
testing this properly would require much more time
(we already spent more than 2 full days tuning this 😅)

VRAM usage per GPU (startup vs context growth)
Tested on 8× RTX 3090 (24GB):
After model load
GPU0: ~21.2 GiB / 24 GiB
GPU1: ~20.9 GiB
GPU2: ~20.9 GiB
GPU3: ~20.7 GiB
GPU4: ~20.9 GiB
GPU5: ~20.9 GiB
GPU6: ~20.9 GiB
GPU7: ~13.0–14.0 GiB (reduced via --fit-margin)

Most GPUs start already at ~85–88% VRAM usage.

With context growth (measured)
~15k tokens → ~20.9–21.2 GiB
~31k tokens → ~21.4–21.7 GiB
~45k tokens → ~21.7–22.2 GiB
~60k tokens → ~22.0–22.3 GiB

GPU0 is always the limiting GPU
(GPU7 becomes the limiting one if not reduced via fit-margin)

KV cache growth

From measurements:

~+1.0 to +1.2 GiB between 15k → 60k tokens
≈ 25–35 MiB per 1k tokens per GPU

Consistent with:

q5_0 K cache
q4_0 V cache
Hadamard enabled
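Those growth numbers can be cross-checked with a purely linear extrapolation from the measured points above (~21.2 GiB at 15k tokens, ~22.3 GiB at 60k, taking the upper readings); real usage may climb faster near the limit since other buffers also grow:

```shell
# Linear per-GPU KV-cache growth estimate from the measurements above.
awk 'BEGIN {
    v15 = 21.2; v60 = 22.3                 # GiB at 15k and 60k tokens
    rate = (v60 - v15) * 1024 / 45         # MiB per 1k tokens
    v90  = v60 + (90 - 60) * rate / 1024   # projected GiB at 90k tokens
    printf "rate: %.1f MiB per 1k tokens, projected at 90k: %.1f GiB\n", rate, v90
}'
# prints: rate: 25.0 MiB per 1k tokens, projected at 90k: 23.0 GiB
```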
Key behavior
KV cache accumulates with context
The limit is reached progressively (not at startup)
Projection
~80k → ~23+ GiB
~90k → very close to 24 GiB limit
beyond that → high risk of crash

Even with --ctx-size 92160,
the real usable context is ~90k max on this setup

Performance observations
Generation: ~7–8 tokens/sec
Prefill (prompt processing): ~26–27 tokens/sec
The main bottleneck is prefill at high context, not generation speed.

Practical conclusion
Even though it technically fits:
~90k context is the practical ceiling
prompt processing latency becomes significant
for our dev team, this is still a bit tight for large workflows

If this helps someone avoid a full night of crashes, then it was worth writing 🙂

Just for comparison: with one RTX PRO 6000 Q4_X gives me typically 9.6 t/s, once I saw 9.8 and a few times 9.7. Haven't been able to squeeze more so far. Mainline llama.cpp.

I have an old CPU (EPYC 7532), old RAM (DDR4 2933 MHz) and 8 RTX 3090 (600 euros per GPU), so a very low-cost computer. The RTX 6000 is a good GPU, but not very cheap...
With 8 GPUs, a lot of VRAM is used for synchronization between the GPUs... And at 90K tokens with this config I can test llama.cpp, but when I did, ik_llama.cpp was always the winner when the load has RAM offload...
But if someone can give me this beautiful GPU, I'll run more tests...
On my office computer (2 RTX 5090 with a very old CPU) I've asked for one more RTX 5090 and one RTX 6000 Pro Blackwell; if I get the OK I'll come back with the same test...
