AesSedai/Kimi-K2.5-GGUF using the Q4_X on 8× RTX 3090
Just sharing a local inference test report for AesSedai/Kimi-K2.5-GGUF using the Q4_X quant (work in progress).
Machine
CPU: AMD EPYC 7532, 32 cores / 64 threads
RAM: 503 GiB DDR4 2933 MHz
GPU: 8 × NVIDIA GeForce RTX 3090 (24 GiB each)
NVLink: GPU 1↔5 and GPU 3↔6 (2 NVLink bridges, 4 active links total)
OS: Fedora Linux 42, kernel 6.18.x
Serving stack: latest ik_llama.cpp (CUDA build)
NVIDIA driver: 580.142 (new version, CUDA 13)
Model files were downloaded via LM Studio only; all inference is done with ik_llama.cpp.
You can see I only have 512 GiB of RAM, so for the test I bought a 128 GB SSD for a swap partition (otherwise the load crashed, because the first buffer needs more RAM than I have).
For the first successful load I had to place layers manually, because the automatic placement used more RAM than I have. So the first version is:
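For anyone in the same situation, here is a small shell sketch (my own, not from this thread) to check up front whether RAM plus swap can hold the load; `model_gib` is a placeholder, read the real figure from the llama-server load log:

```shell
# Placeholder: set this to the total buffer size reported in the load log.
model_gib=600
# Sum what the kernel says is still available: reclaimable RAM plus free swap.
avail_kib=$(awk '/^(MemAvailable|SwapFree):/ {s += $2} END {print s}' /proc/meminfo)
avail_gib=$(( avail_kib / 1024 / 1024 ))
echo "available (RAM + swap): ${avail_gib} GiB, needed: ~${model_gib} GiB"
if [ "$avail_gib" -lt "$model_gib" ]; then
  echo "not enough headroom: add swap or use a smaller quant"
fi
```

If the check fails, adding swap (as I did with the SSD) or dropping to a smaller quant are the two options.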
~/ik_llama.cpp/build/bin/llama-server \
--model /home/admin_ia/.cache/lm-studio/models/AesSedai/Kimi-K2.5-GGUF/Kimi-K2.5-Q4_X-00001-of-00014.gguf \
--alias Kimi-K2.5-Q4_X \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 16384 \
--threads 32 \
--threads-batch 64 \
--batch-size 256 \
--ubatch-size 256 \
--parallel 1 \
--flash-attn on \
--n-gpu-layers 999 \
--cache-type-k q6_0 \
--cache-type-v q4_0 \
--k-cache-hadamard \
--graph-reuse \
-muge \
--jinja \
--override-tensor 'blk.1.ffn_down_exps.weight=CUDA0,blk.1.ffn_gate_exps.weight=CUDA0,blk.1.ffn_up_exps.weight=CUDA0,blk.8.ffn_down_exps.weight=CUDA1,blk.8.ffn_gate_exps.weight=CUDA1,blk.8.ffn_up_exps.weight=CUDA1,blk.16.ffn_down_exps.weight=CUDA2,blk.16.ffn_gate_exps.weight=CUDA2,blk.16.ffn_up_exps.weight=CUDA2,blk.24.ffn_down_exps.weight=CUDA3,blk.24.ffn_gate_exps.weight=CUDA3,blk.24.ffn_up_exps.weight=CUDA3,blk.32.ffn_down_exps.weight=CUDA4,blk.32.ffn_gate_exps.weight=CUDA4,blk.32.ffn_up_exps.weight=CUDA4,blk.40.ffn_down_exps.weight=CUDA5,blk.40.ffn_gate_exps.weight=CUDA5,blk.40.ffn_up_exps.weight=CUDA5,blk.48.ffn_down_exps.weight=CUDA6,blk.48.ffn_gate_exps.weight=CUDA6,blk.48.ffn_up_exps.weight=CUDA6,blk.56.ffn_down_exps.weight=CUDA7,blk.56.ffn_gate_exps.weight=CUDA7,blk.56.ffn_up_exps.weight=CUDA7,blk.2.ffn_down_exps.weight=CPU,blk.2.ffn_gate_exps.weight=CPU,blk.2.ffn_up_exps.weight=CPU,blk.3.ffn_down_exps.weight=CPU,blk.3.ffn_gate_exps.weight=CPU,blk.3.ffn_up_exps.weight=CPU,blk.4.ffn_down_exps.weight=CPU,blk.4.ffn_gate_exps.weight=CPU,blk.4.ffn_up_exps.weight=CPU,blk.5.ffn_down_exps.weight=CPU,blk.5.ffn_gate_exps.weight=CPU,blk.5.ffn_up_exps.weight=CPU,blk.6.ffn_down_exps.weight=CPU,blk.6.ffn_gate_exps.weight=CPU,blk.6.ffn_up_exps.weight=CPU,blk.7.ffn_down_exps.weight=CPU,blk.7.ffn_gate_exps.weight=CPU,blk.7.ffn_up_exps.weight=CPU,blk.9.ffn_down_exps.weight=CPU,blk.9.ffn_gate_exps.weight=CPU,blk.9.ffn_up_exps.weight=CPU,blk.10.ffn_down_exps.weight=CPU,blk.10.ffn_gate_exps.weight=CPU,blk.10.ffn_up_exps.weight=CPU,blk.11.ffn_down_exps.weight=CPU,blk.11.ffn_gate_exps.weight=CPU,blk.11.ffn_up_exps.weight=CPU,blk.12.ffn_down_exps.weight=CPU,blk.12.ffn_gate_exps.weight=CPU,blk.12.ffn_up_exps.weight=CPU,blk.13.ffn_down_exps.weight=CPU,blk.13.ffn_gate_exps.weight=CPU,blk.13.ffn_up_exps.weight=CPU,blk.14.ffn_down_exps.weight=CPU,blk.14.ffn_gate_exps.weight=CPU,blk.14.ffn_up_exps.weight=CPU,blk.15.ffn_down_exps.weight=CPU,blk.15.ffn_gate_exps.weight=CPU,blk.15.ffn_up_exps.weight=CPU,blk.17.ffn_down_exps.weight=CPU,blk.17.ffn_gate_exps.weight=CPU,blk.17.ffn_up_exps.weight=CPU,blk.18.ffn_down_exps.weight=CPU,blk.18.ffn_gate_exps.weight=CPU,blk.18.ffn_up_exps.weight=CPU,blk.19.ffn_down_exps.weight=CPU,blk.19.ffn_gate_exps.weight=CPU,blk.19.ffn_up_exps.weight=CPU,blk.20.ffn_down_exps.weight=CPU,blk.20.ffn_gate_exps.weight=CPU,blk.20.ffn_up_exps.weight=CPU,blk.21.ffn_down_exps.weight=CPU,blk.21.ffn_gate_exps.weight=CPU,blk.21.ffn_up_exps.weight=CPU,blk.22.ffn_down_exps.weight=CPU,blk.22.ffn_gate_exps.weight=CPU,blk.22.ffn_up_exps.weight=CPU,blk.23.ffn_down_exps.weight=CPU,blk.23.ffn_gate_exps.weight=CPU,blk.23.ffn_up_exps.weight=CPU,blk.25.ffn_down_exps.weight=CPU,blk.25.ffn_gate_exps.weight=CPU,blk.25.ffn_up_exps.weight=CPU,blk.26.ffn_down_exps.weight=CPU,blk.26.ffn_gate_exps.weight=CPU,blk.26.ffn_up_exps.weight=CPU,blk.27.ffn_down_exps.weight=CPU,blk.27.ffn_gate_exps.weight=CPU,blk.27.ffn_up_exps.weight=CPU,blk.28.ffn_down_exps.weight=CPU,blk.28.ffn_gate_exps.weight=CPU,blk.28.ffn_up_exps.weight=CPU,blk.29.ffn_down_exps.weight=CPU,blk.29.ffn_gate_exps.weight=CPU,blk.29.ffn_up_exps.weight=CPU,blk.30.ffn_down_exps.weight=CPU,blk.30.ffn_gate_exps.weight=CPU,blk.30.ffn_up_exps.weight=CPU,blk.31.ffn_down_exps.weight=CPU,blk.31.ffn_gate_exps.weight=CPU,blk.31.ffn_up_exps.weight=CPU,blk.33.ffn_down_exps.weight=CPU,blk.33.ffn_gate_exps.weight=CPU,blk.33.ffn_up_exps.weight=CPU,blk.34.ffn_down_exps.weight=CPU,blk.34.ffn_gate_exps.weight=CPU,blk.34.ffn_up_exps.weight=CPU,blk.35.ffn_down_exps.weight=CPU,blk.35.ffn_gate_exps.weight=CPU,blk.35.ffn_up_exps.weight=CPU,blk.36.ffn_down_exps.weight=CPU,blk.36.ffn_gate_exps.weight=CPU,blk.36.ffn_up_exps.weight=CPU,blk.37.ffn_down_exps.weight=CPU,blk.37.ffn_gate_exps.weight=CPU,blk.37.ffn_up_exps.weight=CPU,blk.38.ffn_down_exps.weight=CPU,blk.38.ffn_gate_exps.weight=CPU,blk.38.ffn_up_exps.weight=CPU,blk.39.ffn_down_exps.weight=CPU,blk.39.ffn_gate_exps.weight=CPU,blk.39.ffn_up_exps.weight=CPU,blk.41.ffn_down_exps.weight=CPU,blk.41.ffn_gate_exps.weight=CPU,blk.41.ffn_up_exps.weight=CPU,blk.42.ffn_down_exps.weight=CPU,blk.42.ffn_gate_exps.weight=CPU,blk.42.ffn_up_exps.weight=CPU,blk.43.ffn_down_exps.weight=CPU,blk.43.ffn_gate_exps.weight=CPU,blk.43.ffn_up_exps.weight=CPU,blk.44.ffn_down_exps.weight=CPU,blk.44.ffn_gate_exps.weight=CPU,blk.44.ffn_up_exps.weight=CPU,blk.45.ffn_down_exps.weight=CPU,blk.45.ffn_gate_exps.weight=CPU,blk.45.ffn_up_exps.weight=CPU,blk.46.ffn_down_exps.weight=CPU,blk.46.ffn_gate_exps.weight=CPU,blk.46.ffn_up_exps.weight=CPU,blk.47.ffn_down_exps.weight=CPU,blk.47.ffn_gate_exps.weight=CPU,blk.47.ffn_up_exps.weight=CPU,blk.49.ffn_down_exps.weight=CPU,blk.49.ffn_gate_exps.weight=CPU,blk.49.ffn_up_exps.weight=CPU,blk.50.ffn_down_exps.weight=CPU,blk.50.ffn_gate_exps.weight=CPU,blk.50.ffn_up_exps.weight=CPU,blk.51.ffn_down_exps.weight=CPU,blk.51.ffn_gate_exps.weight=CPU,blk.51.ffn_up_exps.weight=CPU,blk.52.ffn_down_exps.weight=CPU,blk.52.ffn_gate_exps.weight=CPU,blk.52.ffn_up_exps.weight=CPU,blk.53.ffn_down_exps.weight=CPU,blk.53.ffn_gate_exps.weight=CPU,blk.53.ffn_up_exps.weight=CPU,blk.54.ffn_down_exps.weight=CPU,blk.54.ffn_gate_exps.weight=CPU,blk.54.ffn_up_exps.weight=CPU,blk.55.ffn_down_exps.weight=CPU,blk.55.ffn_gate_exps.weight=CPU,blk.55.ffn_up_exps.weight=CPU,blk.57.ffn_down_exps.weight=CPU,blk.57.ffn_gate_exps.weight=CPU,blk.57.ffn_up_exps.weight=CPU,blk.58.ffn_down_exps.weight=CPU,blk.58.ffn_gate_exps.weight=CPU,blk.58.ffn_up_exps.weight=CPU,blk.59.ffn_down_exps.weight=CPU,blk.59.ffn_gate_exps.weight=CPU,blk.59.ffn_up_exps.weight=CPU,blk.60.ffn_down_exps.weight=CPU,blk.60.ffn_gate_exps.weight=CPU,blk.60.ffn_up_exps.weight=CPU'
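Typing that `--override-tensor` string by hand is error-prone. A small bash helper (my own sketch, not part of ik_llama.cpp) regenerates the same mapping: expert blocks 1, 8, 16, 24, 32, 40, 48, 56 pinned to CUDA0-CUDA7, every other block's experts on CPU:

```shell
#!/usr/bin/env bash
# Which expert blocks live on which GPU (mirrors the command above).
declare -A gpu_blocks=( [1]=CUDA0 [8]=CUDA1 [16]=CUDA2 [24]=CUDA3
                        [32]=CUDA4 [40]=CUDA5 [48]=CUDA6 [56]=CUDA7 )
parts=()
for blk in $(seq 1 60); do
  dev=${gpu_blocks[$blk]:-CPU}   # any block not pinned above stays on CPU
  for t in down gate up; do
    parts+=("blk.${blk}.ffn_${t}_exps.weight=${dev}")
  done
done
ot=$(IFS=,; echo "${parts[*]}")  # join all 180 entries with commas
echo "--override-tensor '${ot}'"
```

Paste the printed option into the llama-server command line; adjusting the block-to-GPU map is then a one-line change instead of a search-and-replace through a 6000-character string.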
First result:
Token: 8.4 t/s | Prompt: 7.2 t/s, first test, Ctx: 211 / 211 / 16384
BUT I must continue...
One test with --ctx-size 102400 using opencode:
prompt eval time = 542344.01 ms / 15061 tokens ( 36.01 ms per token, 27.77 tokens per second)
eval time = 8214.41 ms / 65 tokens ( 126.38 ms per token, 7.91 tokens per second)
total time = 550558.42 ms / 15126 tokens
Prompt processing is slow, but a lot of this LLM sits in RAM/CPU.
If you're having to use a swap partition for this, I'd recommend stepping down to the IQ3_S instead of the Q4_X. The entire model is needed for prompt processing so your swap partition offloading is what's dragging that token rate down more than expected I think.
I can't use IQ3_S: that quant has trouble with code (it forgets some " or ; ...). The swap is only used during the load; after that it sits at 94% of RAM (without touching the swap)...
I'll continue the tests.
Nice effort @martossien ! I will keep an eye on your progress with this beast model.
A big new change in ik_llama.cpp (big for me ;) ): the
--fit and --fit-margin N options,
other commits:

| Date | Commit | Description | Impact |
|---|---|---|---|
| Mar 27 | 93ae47e | Fix CUDA Hadamard transform bug | Critical |
| Mar 26 | 8ab016e | Fix CPU flash attention bf16 | Important |
| Mar 26 | 78977c0 | Fix jinja | Important |
| Mar 26 | a84d90a | Fix bug in #1506 | Stability |
| Mar 25 | dd75fd0 | Fix KV cache split graph | Stability |
| Mar 24 | cdf9142 | Fix Qwen3.5 grammar | Compatibility |
| Mar 25 | 86f4f51 | Auto-fit MoE models | Major |
| Mar 25 | b6bac1a | Auto-fit dense models | Major |
| Mar 25 | 4b1a656 | Documentation --fit/--fit-margin | Documentation |
| Mar 24 | 233225d | Layer sizes for GPU layers | Important |
| Mar 25 | 1f3e832 | Improve MTP acceptance rate | Performance |
| Mar 28 | 798af86 | Correct split modes llama-bench | |
| Mar 26 | c06067f | Ignore MTP in memory calc | |
| Mar 26 | 9eaf105 | Print pinned memory info | |
| Mar 26 | 095bd3d | Pinned memory with mmap | |
| Mar 26 | d66dc7c | Restore pinned memory usage | |
| Mar 26 | e46601c | Log probs on sampling crash | |
So I tested this new feature:
~/ik_llama.cpp/build/bin/llama-server \
--model /home/admin_ia/.cache/lm-studio/models/AesSedai/Kimi-K2.5-GGUF/Kimi-K2.5-Q4_X-00001-of-00014.gguf \
--alias Kimi-K2.5-Q4_X_fit \
--host 0.0.0.0 \
--port 8081 \
--ctx-size 102400 \
--threads 32 \
--threads-batch 64 \
--batch-size 256 \
--ubatch-size 256 \
--parallel 1 \
--flash-attn on \
--n-gpu-layers 999 \
--fit \
--fit-margin 3072 \
--cache-type-k q6_0 \
--cache-type-v q4_0 \
--k-cache-hadamard \
--graph-reuse \
--no-mmap \
-muge \
--jinja
So now (this is at the beginning of the context, so I still have to test at 100K):
GPU 0: NVIDIA GeForce RTX 3090 22303 MiB / 24576 MiB
GPU 1: NVIDIA GeForce RTX 3090 21962 MiB / 24576 MiB
GPU 2: NVIDIA GeForce RTX 3090 21961 MiB / 24576 MiB
GPU 3: NVIDIA GeForce RTX 3090 21744 MiB / 24576 MiB
GPU 4: NVIDIA GeForce RTX 3090 22248 MiB / 24576 MiB
GPU 5: NVIDIA GeForce RTX 3090 21744 MiB / 24576 MiB
GPU 6: NVIDIA GeForce RTX 3090 21962 MiB / 24576 MiB
GPU 7: NVIDIA GeForce RTX 3090 22933 MiB / 24576 MiB
The trouble is with GPU7 (I see the same imbalance with manual placement).
Now at 98% of RAM (no swap), BUT the graph seems to work! So a slightly better result...
--batch-size 256
--ubatch-size 256
Those also have to be adapted, since your PP will be abysmally LOW & SLOW.
--fit-margin 3072
Why so high? Not a headless machine?
--batch-size and --ubatch-size use a lot of VRAM, and when a lot of the model is in RAM/CPU, the CPU is the bottleneck. I want to test 512, but I need to work with 100K of context (my dev team needs that minimum; opencode compresses the context at 70%).
--fit-margin 3072: in fact --fit is not magic. You give it a goal, but the automatic fit does what it can, and with this LLM at 100K context the margin has to be near that size.
Also, the load takes a long time, so I have to test with low parameters first, then test at 100K, and if VRAM doesn't crash I can check whether I can gain more.
Load: 1 h
100K test: 2 h (it depends)
Finding better and varied settings: 30 min
So about 3 h 30 per complete test, and I think I need 10 to 20 complete tests to find the best result: code quality, big context, speed.
After this I'll give access to my dev team; they'll use it for confidential code tasks and give their feedback.
Load 1H
Mother of God, why so slow? I'm on PCIe 4 / 7443P with only 24 cores and it takes a max of 3-5 minutes per load!!!
OK, you could also try SCHED_RR (since you're hitting the CPU threads a lot):
sudo chrt --rr -p 99 <ik_llama PID>
Also, you might want to recompile ik_llama, because as of 2 hours ago: https://github.com/ikawrakow/ik_llama.cpp/pull/1540. I never knew about the bug, HOWEVER I was going nuts seeing my "-ts" being constantly ignored! WTG!
Hello everyone,
First of all, thank you to all contributors in this thread and especially to the model author for this impressive quantization ( @AesSedai ). It’s quite something to run this model locally.
After quite a long night of testing (and many crashes), I wanted to share a working configuration and some feedback that might help others.
Working command
~/ik_llama.cpp/build/bin/llama-server \
--model /home/admin_ia/.cache/lm-studio/models/AesSedai/Kimi-K2.5-GGUF/Kimi-K2.5-Q4_X-00001-of-00014.gguf \
--alias Kimi-K2.5-Q4_X_fit_nomuge \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 92160 \
--threads 32 \
--threads-batch 64 \
--batch-size 256 \
--ubatch-size 256 \
--parallel 1 \
--flash-attn on \
--n-gpu-layers 999 \
--fit \
--cache-type-k q5_0 \
--cache-type-v q4_0 \
--k-cache-hadamard \
--graph-reuse \
--no-mmap \
--jinja \
--fit-margin 12000,12000,12000,12000,12000,12000,12000,12550
We recompiled with the latest commits as suggested in this discussion.
We did not test:
sudo chrt --rr 99
At this point the setup was already quite unstable during tuning (many crashes), so we avoided adding more risk.
Important clarification about --fit-margin
At first, I misunderstood how --fit-margin works:
It is NOT a margin applied after allocation
It is applied during the initial layer distribution phase
You can specify per-GPU values using commas
Why we reduced GPU7 load
GPU7 was consistently loaded about 1 GB more than the other GPUs.
This caused:
crashes around ~32k context
VRAM exhaustion specifically on GPU7
To fix this, we used a higher margin:
12550 (for GPU7)
This prevents one layer from being loaded on GPU7, restoring balance across GPUs and allowing much higher context.
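The per-GPU margin list can also be generated rather than typed; a tiny bash sketch (a hypothetical helper of mine, values taken from the command above):

```shell
# Build one --fit-margin value per GPU (MiB), comma-separated.
margins=()
for i in $(seq 0 7); do
  if [ "$i" -eq 7 ]; then
    margins+=(12550)   # larger margin on GPU7 pushes one layer off it
  else
    margins+=(12000)
  fi
done
fit_margin=$(IFS=,; echo "${margins[*]}")
echo "--fit-margin ${fit_margin}"
```

This makes it easy to bump a single GPU's margin while keeping the others in sync.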
RAM constraints
We were running at:
~99% system RAM usage
< 4 GB free
without using swap
We initially tried -muge, but:
it pushed swap usage very aggressively
and in our case, it caused crashes
So we removed it entirely.
Results
After multiple runs and validations:
Stable context: ~90k tokens
Above that: unstable / crashes
KV cache growth becomes critical near the limit
We also considered:
--batch-size 512
--ubatch-size 512
But:
VRAM is already the limiting factor
testing this properly would require much more time
(we already spent more than 2 full days tuning this 😅)
VRAM usage per GPU (startup vs context growth)
Tested on 8× RTX 3090 (24GB):
After model load
GPU0: ~21.2 GiB / 24 GiB
GPU1: ~20.9 GiB
GPU2: ~20.9 GiB
GPU3: ~20.7 GiB
GPU4: ~20.9 GiB
GPU5: ~20.9 GiB
GPU6: ~20.9 GiB
GPU7: ~13.0–14.0 GiB (reduced via --fit-margin)
Most GPUs start already at ~85–88% VRAM usage.
With context growth (measured)
~15k tokens → ~20.9–21.2 GiB
~31k tokens → ~21.4–21.7 GiB
~45k tokens → ~21.7–22.2 GiB
~60k tokens → ~22.0–22.3 GiB
GPU0 is always the limiting GPU
(GPU7 becomes the limiting one if not reduced via fit-margin)
KV cache growth
From measurements:
~+1.0 to +1.2 GiB between 15k → 60k tokens
≈ 25–35 MiB per 1k tokens per GPU
Consistent with:
q5_0 K cache
q4_0 V cache
Hadamard enabled
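As a sanity check on those numbers, a quick back-of-the-envelope projection (assuming linear growth of ~30 MiB per 1k tokens per GPU, the midpoint of our measured 25-35 MiB range, on top of the ~20.9 GiB measured at 15k tokens on GPU0):

```shell
base_gib=20.9      # GPU0 usage measured at 15k tokens
base_tokens=15     # in thousands of tokens
rate_mib_per_k=30  # midpoint of the measured 25-35 MiB per 1k tokens
for ctx_k in 60 80 90; do
  awk -v b="$base_gib" -v bt="$base_tokens" -v r="$rate_mib_per_k" -v c="$ctx_k" \
    'BEGIN { printf "%dk tokens -> ~%.1f GiB on GPU0\n", c, b + (c - bt) * r / 1024 }'
done
```

The 60k figure lands inside the measured 22.0-22.3 GiB range; the upper end of the 25-35 MiB rate pushes the 90k estimate closer to the 24 GiB ceiling, consistent with the projection below.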
Key behavior
KV cache accumulates with context
The limit is reached progressively (not at startup)
Projection
~80k → ~23+ GiB
~90k → very close to 24 GiB limit
beyond that → high risk of crash
Even with --ctx-size 92160,
the real usable context is ~90k max on this setup
Performance observations
Generation: ~7–8 tokens/sec
Prefill (prompt processing): ~26–27 tokens/sec
The main bottleneck is prefill at high context, not generation speed.
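To see why prefill dominates, a quick worked calculation with the measured rates (27 t/s prefill and ~7.5 t/s generation come from this thread; the 90k prompt size and 1000-token answer are illustrative assumptions):

```shell
awk -v tokens=90000 -v pp=27 -v gen=7.5 'BEGIN {
  prefill_s = tokens / pp
  printf "prefill %d tokens @ %g t/s: %.0f s (~%.0f min)\n", tokens, pp, prefill_s, prefill_s / 60
  # generating the answer afterwards is comparatively cheap:
  printf "generate 1000 tokens @ %g t/s: %.0f s\n", gen, 1000 / gen
}'
```

At full context, nearly an hour goes to prompt processing before the first output token, which is exactly the latency our dev team feels.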
Practical conclusion
Even though it technically fits:
~90k context is the practical ceiling
prompt processing latency becomes significant
for our dev team, this is still a bit tight for large workflows
If this helps someone avoid a full night of crashes, then it was worth writing 🙂
Just for comparison: with one RTX PRO 6000, Q4_X typically gives me 9.6 t/s; once I saw 9.8 and a few times 9.7. I haven't been able to squeeze out more so far. Mainline llama.cpp.
I have an old CPU (EPYC 7532), old RAM (DDR4 2933 MHz) and 8 RTX 3090s (600 euros per GPU), so a very low-cost computer. The RTX 6000 is a good GPU, but not very cheap...
With 8 GPUs, a lot of VRAM is used for synchronization between GPUs... And with 90K tokens on this config I could test llama.cpp, but whenever I did, ik_llama.cpp was always the winner when the load has RAM offload...
But if someone gives me that beautiful GPU, I'll run more tests...
On my office computer (2 RTX 5090 with a very old CPU) I asked for one more RTX 5090 and one RTX 6000 Pro Blackwell; if I get the OK, I'll come back with the same tests...

