AesSedai/Kimi-K2.5-GGUF using the Q4_X on 8 RTX 3090

#7
by martossien - opened

Just sharing a local inference test report for AesSedai/Kimi-K2.5-GGUF using the Q4_X quant (work in progress).
Machine

CPU: AMD EPYC 7532, 32 cores / 64 threads
RAM: 503 GiB DDR4 2933 MHz
GPU: 8 × NVIDIA GeForce RTX 3090 (24 GiB each)
NVLink: GPU 1↔5 and GPU 3↔6 (two NVLink bridges, 4 active links total)
OS: Fedora Linux 42, kernel 6.18.x
Serving stack: latest ik_llama.cpp (CUDA build)
NVIDIA driver: 580.142 (new CUDA 13 series)

Model files were downloaded via LM Studio only; all inference is done with ik_llama.cpp.

As you can see I only have 512 GB of RAM, so for this test I bought a 128 GB SSD for a swap partition (otherwise I get a crash, because the first buffer needs more RAM than I have).

For the first successful load I had to place layers manually, because the automatic placement uses more RAM than I have. So the first version is:

~/ik_llama.cpp/build/bin/llama-server \
--model /home/admin_ia/.cache/lm-studio/models/AesSedai/Kimi-K2.5-GGUF/Kimi-K2.5-Q4_X-00001-of-00014.gguf \
--alias Kimi-K2.5-Q4_X \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 16384 \
--threads 32 \
--threads-batch 64 \
--batch-size 256 \
--ubatch-size 256 \
--parallel 1 \
--flash-attn on \
--n-gpu-layers 999 \
--cache-type-k q6_0 \
--cache-type-v q4_0 \
--k-cache-hadamard \
--graph-reuse \
-muge \
--jinja \
--override-tensor 'blk.1.ffn_down_exps.weight=CUDA0,blk.1.ffn_gate_exps.weight=CUDA0,blk.1.ffn_up_exps.weight=CUDA0,blk.8.ffn_down_exps.weight=CUDA1,blk.8.ffn_gate_exps.weight=CUDA1,blk.8.ffn_up_exps.weight=CUDA1,blk.16.ffn_down_exps.weight=CUDA2,blk.16.ffn_gate_exps.weight=CUDA2,blk.16.ffn_up_exps.weight=CUDA2,blk.24.ffn_down_exps.weight=CUDA3,blk.24.ffn_gate_exps.weight=CUDA3,blk.24.ffn_up_exps.weight=CUDA3,blk.32.ffn_down_exps.weight=CUDA4,blk.32.ffn_gate_exps.weight=CUDA4,blk.32.ffn_up_exps.weight=CUDA4,blk.40.ffn_down_exps.weight=CUDA5,blk.40.ffn_gate_exps.weight=CUDA5,blk.40.ffn_up_exps.weight=CUDA5,blk.48.ffn_down_exps.weight=CUDA6,blk.48.ffn_gate_exps.weight=CUDA6,blk.48.ffn_up_exps.weight=CUDA6,blk.56.ffn_down_exps.weight=CUDA7,blk.56.ffn_gate_exps.weight=CUDA7,blk.56.ffn_up_exps.weight=CUDA7,blk.2.ffn_down_exps.weight=CPU,blk.2.ffn_gate_exps.weight=CPU,blk.2.ffn_up_exps.weight=CPU,blk.3.ffn_down_exps.weight=CPU,blk.3.ffn_gate_exps.weight=CPU,blk.3.ffn_up_exps.weight=CPU,blk.4.ffn_down_exps.weight=CPU,blk.4.ffn_gate_exps.weight=CPU,blk.4.ffn_up_exps.weight=CPU,blk.5.ffn_down_exps.weight=CPU,blk.5.ffn_gate_exps.weight=CPU,blk.5.ffn_up_exps.weight=CPU,blk.6.ffn_down_exps.weight=CPU,blk.6.ffn_gate_exps.weight=CPU,blk.6.ffn_up_exps.weight=CPU,blk.7.ffn_down_exps.weight=CPU,blk.7.ffn_gate_exps.weight=CPU,blk.7.ffn_up_exps.weight=CPU,blk.9.ffn_down_exps.weight=CPU,blk.9.ffn_gate_exps.weight=CPU,blk.9.ffn_up_exps.weight=CPU,blk.10.ffn_down_exps.weight=CPU,blk.10.ffn_gate_exps.weight=CPU,blk.10.ffn_up_exps.weight=CPU,blk.11.ffn_down_exps.weight=CPU,blk.11.ffn_gate_exps.weight=CPU,blk.11.ffn_up_exps.weight=CPU,blk.12.ffn_down_exps.weight=CPU,blk.12.ffn_gate_exps.weight=CPU,blk.12.ffn_up_exps.weight=CPU,blk.13.ffn_down_exps.weight=CPU,blk.13.ffn_gate_exps.weight=CPU,blk.13.ffn_up_exps.weight=CPU,blk.14.ffn_down_exps.weight=CPU,blk.14.ffn_gate_exps.weight=CPU,blk.14.ffn_up_exps.weight=CPU,blk.15.ffn_down_exps.weight=CPU,blk.15.ffn_gate_exps.weight=CPU,blk.15.ffn_up_exps.weight=CPU,blk.17.ffn_down_exps.weight=CPU,blk.17.ffn_gate_exps.weight=CPU,blk.17.ffn_up_exps.weight=CPU,blk.18.ffn_down_exps.weight=CPU,blk.18.ffn_gate_exps.weight=CPU,blk.18.ffn_up_exps.weight=CPU,blk.19.ffn_down_exps.weight=CPU,blk.19.ffn_gate_exps.weight=CPU,blk.19.ffn_up_exps.weight=CPU,blk.20.ffn_down_exps.weight=CPU,blk.20.ffn_gate_exps.weight=CPU,blk.20.ffn_up_exps.weight=CPU,blk.21.ffn_down_exps.weight=CPU,blk.21.ffn_gate_exps.weight=CPU,blk.21.ffn_up_exps.weight=CPU,blk.22.ffn_down_exps.weight=CPU,blk.22.ffn_gate_exps.weight=CPU,blk.22.ffn_up_exps.weight=CPU,blk.23.ffn_down_exps.weight=CPU,blk.23.ffn_gate_exps.weight=CPU,blk.23.ffn_up_exps.weight=CPU,blk.25.ffn_down_exps.weight=CPU,blk.25.ffn_gate_exps.weight=CPU,blk.25.ffn_up_exps.weight=CPU,blk.26.ffn_down_exps.weight=CPU,blk.26.ffn_gate_exps.weight=CPU,blk.26.ffn_up_exps.weight=CPU,blk.27.ffn_down_exps.weight=CPU,blk.27.ffn_gate_exps.weight=CPU,blk.27.ffn_up_exps.weight=CPU,blk.28.ffn_down_exps.weight=CPU,blk.28.ffn_gate_exps.weight=CPU,blk.28.ffn_up_exps.weight=CPU,blk.29.ffn_down_exps.weight=CPU,blk.29.ffn_gate_exps.weight=CPU,blk.29.ffn_up_exps.weight=CPU,blk.30.ffn_down_exps.weight=CPU,blk.30.ffn_gate_exps.weight=CPU,blk.30.ffn_up_exps.weight=CPU,blk.31.ffn_down_exps.weight=CPU,blk.31.ffn_gate_exps.weight=CPU,blk.31.ffn_up_exps.weight=CPU,blk.33.ffn_down_exps.weight=CPU,blk.33.ffn_gate_exps.weight=CPU,blk.33.ffn_up_exps.weight=CPU,blk.34.ffn_down_exps.weight=CPU,blk.34.ffn_gate_exps.weight=CPU,blk.34.ffn_up_exps.weight=CPU,blk.35.ffn_down_exps.weight=CPU,blk.35.ffn_gate_exps.weight=CPU,blk.35.ffn_up_exps.weight=CPU,blk.36.ffn_down_exps.weight=CPU,blk.36.ffn_gate_exps.weight=CPU,blk.36.ffn_up_exps.weight=CPU,blk.37.ffn_down_exps.weight=CPU,blk.37.ffn_gate_exps.weight=CPU,blk.37.ffn_up_exps.weight=CPU,blk.38.ffn_down_exps.weight=CPU,blk.38.ffn_gate_exps.weight=CPU,blk.38.ffn_up_exps.weight=CPU,blk.39.ffn_down_exps.weight=CPU,blk.39.ffn_gate_exps.weight=CPU,blk.39.ffn_up_exps.weight=CPU,blk.41.ffn_down_exps.weight=CPU,blk.41.ffn_gate_exps.weight=CPU,blk.41.ffn_up_exps.weight=CPU,blk.42.ffn_down_exps.weight=CPU,blk.42.ffn_gate_exps.weight=CPU,blk.42.ffn_up_exps.weight=CPU,blk.43.ffn_down_exps.weight=CPU,blk.43.ffn_gate_exps.weight=CPU,blk.43.ffn_up_exps.weight=CPU,blk.44.ffn_down_exps.weight=CPU,blk.44.ffn_gate_exps.weight=CPU,blk.44.ffn_up_exps.weight=CPU,blk.45.ffn_down_exps.weight=CPU,blk.45.ffn_gate_exps.weight=CPU,blk.45.ffn_up_exps.weight=CPU,blk.46.ffn_down_exps.weight=CPU,blk.46.ffn_gate_exps.weight=CPU,blk.46.ffn_up_exps.weight=CPU,blk.47.ffn_down_exps.weight=CPU,blk.47.ffn_gate_exps.weight=CPU,blk.47.ffn_up_exps.weight=CPU,blk.49.ffn_down_exps.weight=CPU,blk.49.ffn_gate_exps.weight=CPU,blk.49.ffn_up_exps.weight=CPU,blk.50.ffn_down_exps.weight=CPU,blk.50.ffn_gate_exps.weight=CPU,blk.50.ffn_up_exps.weight=CPU,blk.51.ffn_down_exps.weight=CPU,blk.51.ffn_gate_exps.weight=CPU,blk.51.ffn_up_exps.weight=CPU,blk.52.ffn_down_exps.weight=CPU,blk.52.ffn_gate_exps.weight=CPU,blk.52.ffn_up_exps.weight=CPU,blk.53.ffn_down_exps.weight=CPU,blk.53.ffn_gate_exps.weight=CPU,blk.53.ffn_up_exps.weight=CPU,blk.54.ffn_down_exps.weight=CPU,blk.54.ffn_gate_exps.weight=CPU,blk.54.ffn_up_exps.weight=CPU,blk.55.ffn_down_exps.weight=CPU,blk.55.ffn_gate_exps.weight=CPU,blk.55.ffn_up_exps.weight=CPU,blk.57.ffn_down_exps.weight=CPU,blk.57.ffn_gate_exps.weight=CPU,blk.57.ffn_up_exps.weight=CPU,blk.58.ffn_down_exps.weight=CPU,blk.58.ffn_gate_exps.weight=CPU,blk.58.ffn_up_exps.weight=CPU,blk.59.ffn_down_exps.weight=CPU,blk.59.ffn_gate_exps.weight=CPU,blk.59.ffn_up_exps.weight=CPU,blk.60.ffn_down_exps.weight=CPU,blk.60.ffn_gate_exps.weight=CPU,blk.60.ffn_up_exps.weight=CPU'
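As an aside, the long --override-tensor value follows a regular pattern (expert blocks 1, 8, 16, 24, 32, 40, 48, 56 pinned to CUDA0..CUDA7; every other expert block from 2..60 kept on CPU), so it can be generated with a small script instead of typed by hand. This is just a sketch reproducing the mapping above; adjust the block lists for a different split:

```shell
#!/bin/sh
# Build the --override-tensor value used in the command above.
# Assumed layout (matches that command): blocks 1,8,16,24,32,40,48,56
# go to CUDA0..CUDA7, all remaining expert blocks 2..60 stay on CPU.
ot=""
gpu=0
for blk in 1 8 16 24 32 40 48 56; do
    for t in down gate up; do
        ot="${ot}blk.${blk}.ffn_${t}_exps.weight=CUDA${gpu},"
    done
    gpu=$((gpu + 1))
done
blk=2
while [ "$blk" -le 60 ]; do
    case "$blk" in
        8|16|24|32|40|48|56) ;;          # already placed on a GPU
        *)
            for t in down gate up; do
                ot="${ot}blk.${blk}.ffn_${t}_exps.weight=CPU,"
            done
            ;;
    esac
    blk=$((blk + 1))
done
ot="${ot%,}"                              # drop trailing comma
printf -- "--override-tensor '%s'\n" "$ot"
```

Paste the printed argument into the llama-server command line.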


First result (test 1):
Generation: 8.4 t/s | Prompt: 7.2 t/s, Ctx: 211 / 211 / 16384

BUT I must continue...

First test with context at --ctx-size 102400, via opencode:

prompt eval time = 542344.01 ms / 15061 tokens ( 36.01 ms per token, 27.77 tokens per second)
eval time = 8214.41 ms / 65 tokens ( 126.38 ms per token, 7.91 tokens per second)
total time = 550558.42 ms / 15126 tokens
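As a sanity check, the reported rates follow directly from the token counts and wall times; recomputing them from the log lines above:

```shell
# Recompute tokens/sec from the llama-server timings quoted above.
awk 'BEGIN {
    printf "prompt: %.2f t/s\n", 15061 / (542344.01 / 1000)   # prompt eval
    printf "eval:   %.2f t/s\n", 65 / (8214.41 / 1000)        # generation
}'
# prints: prompt: 27.77 t/s
#         eval:   7.91 t/s
```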


Prompt processing is slow, but a lot of this LLM sits in RAM/on the CPU.

If you're having to use a swap partition for this, I'd recommend stepping down to the IQ3_S instead of the Q4_X. The entire model is needed for prompt processing so your swap partition offloading is what's dragging that token rate down more than expected I think.

I can't use IQ3_S: that quant has some trouble with code (it forgets some " or ; ...). The swap is only used during the load; after that it uses 94% of RAM (without touching swap)...
I'll continue the tests.


Nice effort @martossien ! I will keep an eye on your progress with this beast model.

A big new change in ik_llama.cpp (for me ;) ):
--fit and --fit-margin N
Other commits:

Date   Commit  Description                       Impact
Mar 27 93ae47e Fix CUDA Hadamard transform bug   Critical
Mar 26 8ab016e Fix CPU flash attention bf16      Important
Mar 26 78977c0 Fix jinja                         Important
Mar 26 a84d90a Fix bug in #1506                  Stability
Mar 25 dd75fd0 Fix KV cache split graph          Stability
Mar 24 cdf9142 Fix Qwen3.5 grammar               Compatibility
Mar 25 86f4f51 Auto-fit MoE models               Major
Mar 25 b6bac1a Auto-fit dense models             Major
Mar 25 4b1a656 Documentation --fit/--fit-margin  Documentation
Mar 24 233225d Layer sizes for GPU layers        Important
Mar 25 1f3e832 Improve MTP acceptance rate       Performance
Mar 28 798af86 Correct split modes llama-bench
Mar 26 c06067f Ignore MTP in memory calc
Mar 26 9eaf105 Print pinned memory info
Mar 26 095bd3d Pinned memory with mmap
Mar 26 d66dc7c Restore pinned memory usage
Mar 26 e46601c Log probs on sampling crash

So I tested this new feature:
~/ik_llama.cpp/build/bin/llama-server \
--model /home/admin_ia/.cache/lm-studio/models/AesSedai/Kimi-K2.5-GGUF/Kimi-K2.5-Q4_X-00001-of-00014.gguf \
--alias Kimi-K2.5-Q4_X_fit \
--host 0.0.0.0 \
--port 8081 \
--ctx-size 102400 \
--threads 32 \
--threads-batch 64 \
--batch-size 256 \
--ubatch-size 256 \
--parallel 1 \
--flash-attn on \
--n-gpu-layers 999 \
--fit \
--fit-margin 3072 \
--cache-type-k q6_0 \
--cache-type-v q4_0 \
--k-cache-hadamard \
--graph-reuse \
--no-mmap \
-muge \
--jinja

So now (at the beginning of the context; I still have to test it at 100K):
| 0 NVIDIA GeForce RTX 3090 22303MiB / 24576MiB
| 1 NVIDIA GeForce RTX 3090 21962MiB / 24576MiB
| 2 NVIDIA GeForce RTX 3090 21961MiB / 24576MiB
| 3 NVIDIA GeForce RTX 3090 21744MiB / 24576MiB
| 4 NVIDIA GeForce RTX 3090 22248MiB / 24576MiB
| 5 NVIDIA GeForce RTX 3090 21744MiB / 24576MiB
| 6 NVIDIA GeForce RTX 3090 21962MiB / 24576MiB
| 7 NVIDIA GeForce RTX 3090 22933MiB / 24576MiB
The trouble is with GPU7 (I had the same with manual placement).
Now 98% of RAM is used (no swap) BUT the graph seems to work! So a slightly better result...

--batch-size 256
--ubatch-size 256

Those also have to be adapted, since your PP will be abysmally LOW & SLOW.

--fit-margin 3072

Why so high? Not a headless machine?

--batch-size and --ubatch-size use a lot of VRAM, and when a lot of the model is in RAM/on CPU, the CPU is the bottleneck. I want to test 512, but I need to work at 100K of context (my dev team needs this minimum; opencode compresses context at 70%).
--fit-margin 3072: in fact fit is not magic. You give it a goal, but the automatic fit does what it can, and with this LLM at 100K context the margin must be near this size.
And the load takes a long time, so I must test with low parameters at 100K, and if VRAM doesn't crash I can check whether I can gain more.
Load: 1 h
100K test: 2 h (it depends)
Finding better settings and misc: 30 min
So about 3 h 30 per complete test, and I think I need 10 to 20 complete tests to find the best result: code quality, big context, speed.
After this I'll give access to my dev team; they'll use it for confidential code tasks and give their feedback.
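For what it's worth, that time budget adds up as follows (a trivial check, assuming the 1 h / 2 h / 30 min figures above hold):

```shell
# Total tuning time: hours per complete test, and for a 10- or 20-test campaign.
awk 'BEGIN {
    per = 1.0 + 2.0 + 0.5   # load + 100K test + tuning, in hours
    printf "per test: %.1f h, 10 tests: %.0f h, 20 tests: %.0f h\n",
           per, 10 * per, 20 * per
}'
# prints: per test: 3.5 h, 10 tests: 35 h, 20 tests: 70 h
```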

Load 1H

Mother of God, why so slow? I'm on PCIE4 / 7443P with only 24 cores and it takes a max of 3-5 minutes per load!!!

OK, you could also try (since you're using a lot of CPU threads) SCHED_RR:
sudo chrt --rr -p 99 ik-llama_PID

Also you might want to recompile ik_llama, because as of 2 h ago: https://github.com/ikawrakow/ik_llama.cpp/pull/1540 --> I never knew about the bug, HOWEVER I was going nuts seeing my "-ts" being constantly ignored! WTG!

Hello everyone,

First of all, thank you to all contributors in this thread and especially to the model author for this impressive quantization ( @AesSedai ). It’s quite something to run this model locally.

After quite a long night of testing (and many crashes), I wanted to share a working configuration and some feedback that might help others.

Working command
~/ik_llama.cpp/build/bin/llama-server \
--model /home/admin_ia/.cache/lm-studio/models/AesSedai/Kimi-K2.5-GGUF/Kimi-K2.5-Q4_X-00001-of-00014.gguf \
--alias Kimi-K2.5-Q4_X_fit_nomuge \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 92160 \
--threads 32 \
--threads-batch 64 \
--batch-size 256 \
--ubatch-size 256 \
--parallel 1 \
--flash-attn on \
--n-gpu-layers 999 \
--fit \
--cache-type-k q5_0 \
--cache-type-v q4_0 \
--k-cache-hadamard \
--graph-reuse \
--no-mmap \
--jinja \
--fit-margin 12000,12000,12000,12000,12000,12000,12000,12550

We recompiled with the latest commits as suggested in this discussion.

We did not test:
sudo chrt --rr 99
At this point the setup was already quite unstable during tuning (many crashes), so we avoided adding more risk.

Important clarification about --fit-margin
At first, I misunderstood how --fit-margin works:
It is NOT a margin applied after allocation
It is applied during the initial layer distribution phase
You can specify per-GPU values using commas
Why we reduced GPU7 load
GPU7 was consistently about ~1 GB more loaded than the other GPUs.
This caused:
crashes around ~32k context
VRAM exhaustion specifically on GPU7

To fix this, we used a higher margin:
12550 (for GPU7)
This prevents one layer from being loaded on GPU7, restoring balance across GPUs and allowing much higher context.

RAM constraints

We were running at:
~99% system RAM usage
< 4 GB free
without using swap

We initially tried -muge, but:
it pushed swap usage very aggressively
and in our case, it caused crashes
So we removed it entirely.

Results
After multiple runs and validations:
Stable context: ~90k tokens
Above that: unstable / crashes
KV cache growth becomes critical near the limit

We also considered:
--batch-size 512
--ubatch-size 512
But:
VRAM is already the limiting factor
testing this properly would require much more time
(we already spent more than 2 full days tuning this 😅)

VRAM usage per GPU (startup vs context growth)
Tested on 8× RTX 3090 (24GB):
After model load
GPU0: ~21.2 GiB / 24 GiB
GPU1: ~20.9 GiB
GPU2: ~20.9 GiB
GPU3: ~20.7 GiB
GPU4: ~20.9 GiB
GPU5: ~20.9 GiB
GPU6: ~20.9 GiB
GPU7: ~13.0–14.0 GiB (reduced via --fit-margin)

Most GPUs start already at ~85–88% VRAM usage.

With context growth (measured)
~15k tokens → ~20.9–21.2 GiB
~31k tokens → ~21.4–21.7 GiB
~45k tokens → ~21.7–22.2 GiB
~60k tokens → ~22.0–22.3 GiB

GPU0 is always the limiting GPU
(GPU7 becomes the limiting one if not reduced via fit-margin)

KV cache growth

From measurements:

~+1.0 to +1.2 GiB between 15k → 60k tokens
≈ 25–35 MiB per 1k tokens per GPU

Consistent with:

q5_0 K cache
q4_0 V cache
Hadamard enabled
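Those growth numbers can be cross-checked with a purely linear extrapolation from the measured points above (~21.2 GiB at 15k tokens, ~22.3 GiB at 60k, taking the upper readings); real usage may climb faster near the limit since other buffers also grow:

```shell
# Linear per-GPU KV-cache growth estimate from the measurements above.
awk 'BEGIN {
    v15 = 21.2; v60 = 22.3                 # GiB at 15k and 60k tokens
    rate = (v60 - v15) * 1024 / 45         # MiB per 1k tokens
    v90  = v60 + (90 - 60) * rate / 1024   # projected GiB at 90k tokens
    printf "rate: %.1f MiB per 1k tokens, projected at 90k: %.1f GiB\n", rate, v90
}'
# prints: rate: 25.0 MiB per 1k tokens, projected at 90k: 23.0 GiB
```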
Key behavior
KV cache accumulates with context
The limit is reached progressively (not at startup)
Projection
~80k → ~23+ GiB
~90k → very close to 24 GiB limit
beyond that → high risk of crash

Even with --ctx-size 92160,
the real usable context is ~90k max on this setup

Performance observations
Generation: ~7–8 tokens/sec
Prefill (prompt processing): ~26–27 tokens/sec
The main bottleneck is prefill at high context, not generation speed.

Practical conclusion
Even though it technically fits:
~90k context is the practical ceiling
prompt processing latency becomes significant
for our dev team, this is still a bit tight for large workflows

If this helps someone avoid a full night of crashes, then it was worth writing 🙂

Just for comparison: with one RTX PRO 6000 Q4_X gives me typically 9.6 t/s, once I saw 9.8 and a few times 9.7. Haven't been able to squeeze more so far. Mainline llama.cpp.

I have an old CPU (EPYC 7532), old RAM (DDR4 2933 MHz) and 8 RTX 3090 (600 euros per GPU), so a very low-cost computer. The RTX 6000 is a good GPU, but not very cheap...
With 8 GPUs, a lot of VRAM is used for synchronization between the GPUs... And at 90K tokens with this config I can test llama.cpp, but when I did, ik_llama.cpp was always the winner when the load has RAM offload...
But if someone can give me this beautiful GPU, I'll run more tests...
On my office computer (2 RTX 5090 with a very old CPU) I've asked for one more RTX 5090 and one RTX 6000 Pro Blackwell; if I get the OK I'll come back with the same test...
