Thank you
#1
by sousekd - opened
Thank you @anikifoss .
Still catching up on "older" models :).
Obligatory benchmark, Epyc 9355 + RTX 5090:
```shell
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 -fa -fmoe \
    -amb 512 -b 4096 -ub 4096 \
    -ctk f16 -c 98304 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 18.667 | 219.43 | 60.762 | 16.85 |
| 4096 | 1024 | 4096 | 19.121 | 214.22 | 61.478 | 16.66 |
| 4096 | 1024 | 8192 | 19.661 | 208.33 | 62.508 | 16.38 |
| 4096 | 1024 | 12288 | 20.140 | 203.37 | 63.698 | 16.08 |
| 4096 | 1024 | 16384 | 20.860 | 196.36 | 63.808 | 16.05 |
| 4096 | 1024 | 20480 | 21.245 | 192.79 | 63.840 | 16.04 |
| 4096 | 1024 | 24576 | 21.928 | 186.80 | 65.525 | 15.63 |
| 4096 | 1024 | 28672 | 22.327 | 183.45 | 65.572 | 15.62 |
| 4096 | 1024 | 32768 | 23.053 | 177.68 | 66.077 | 15.50 |
| 4096 | 1024 | 36864 | 23.855 | 171.70 | 66.342 | 15.44 |
| 4096 | 1024 | 40960 | 24.469 | 167.39 | 66.464 | 15.41 |
| 4096 | 1024 | 45056 | 24.667 | 166.05 | 68.073 | 15.04 |
| 4096 | 1024 | 49152 | 25.231 | 162.34 | 68.208 | 15.01 |
| 4096 | 1024 | 53248 | 25.736 | 159.16 | 68.262 | 15.00 |
| 4096 | 1024 | 57344 | 26.515 | 154.48 | 68.810 | 14.88 |
| 4096 | 1024 | 61440 | 26.746 | 153.15 | 69.033 | 14.83 |
| 4096 | 1024 | 65536 | 27.611 | 148.35 | 70.796 | 14.46 |
| 4096 | 1024 | 69632 | 28.135 | 145.59 | 70.817 | 14.46 |
| 4096 | 1024 | 73728 | 28.682 | 142.81 | 70.902 | 14.44 |
| 4096 | 1024 | 77824 | 29.241 | 140.08 | 71.426 | 14.34 |
| 4096 | 1024 | 81920 | 29.827 | 137.32 | 71.585 | 14.30 |
| 4096 | 1024 | 86016 | 30.382 | 134.82 | 71.913 | 14.24 |
| 4096 | 1024 | 90112 | 31.014 | 132.07 | 73.761 | 13.88 |
| 4096 | 1024 | 94208 | 31.552 | 129.82 | 73.224 | 13.98 |
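The degradation across the sweep is fairly gentle. A quick sketch of the drop-off, plugging in the first and last rows of the table above (the helper name is just for illustration):

```python
def slowdown_pct(start_tps: float, end_tps: float) -> float:
    """Percentage drop in throughput between two sweep points."""
    return (start_tps - end_tps) / start_tps * 100

# First row (N_KV=0) vs last row (N_KV=94208) from the table above.
tg_drop = slowdown_pct(16.85, 13.98)   # token generation
pp_drop = slowdown_pct(219.43, 129.82) # prompt processing
print(f"TG drops ~{tg_drop:.0f}%, PP drops ~{pp_drop:.0f}% over ~94k tokens")
```

So TG only loses about 17% from empty context to ~94k tokens, while PP roughly 41%.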
Thanks again for sharing! These are great numbers... I can finally see how Kimi's lower compute requirements result in more tokens per second on larger contexts.
For some reason, Kimi-K2 performs slightly worse than DeepSeek on my 8-channel setup. It looks like RAM bandwidth can be a serious bottleneck.
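A rough back-of-envelope sketch of why bandwidth dominates here, with expert weights on CPU. The numbers below are assumptions, not measurements: Kimi-K2 activates roughly 32B parameters per token, and the bytes-per-weight figure stands in for a mid-size quant; the real traffic differs since attention runs on the GPU.

```python
# Back-of-envelope: memory traffic needed to sustain a given TG speed
# when MoE expert weights stream from system RAM. All inputs are
# assumptions for illustration.
active_params = 32e9    # ~32B activated params/token for Kimi-K2 (assumption)
bytes_per_param = 0.6   # ~4.8 bits/weight for a mid-size quant (assumption)
tg_tps = 16.85          # S_TG at N_KV=0 from the table above

bw_gbs = active_params * bytes_per_param * tg_tps / 1e9
print(f"~{bw_gbs:.0f} GB/s of weight traffic")
```

That lands in the low hundreds of GB/s, which a 12-channel DDR5 Epyc can feed but an 8-channel board has much less headroom for, consistent with the slowdown you're seeing.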