Thank you
#1
by sousekd - opened
Thank you @anikifoss .
Still catching up on "older" models :).
Obligatory benchmark, Epyc 9355 + RTX 5090:
```shell
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 -fa -fmoe \
    -amb 512 -b 4096 -ub 4096 \
    -ctk f16 -c 98304 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 18.667 | 219.43 | 60.762 | 16.85 |
| 4096 | 1024 | 4096 | 19.121 | 214.22 | 61.478 | 16.66 |
| 4096 | 1024 | 8192 | 19.661 | 208.33 | 62.508 | 16.38 |
| 4096 | 1024 | 12288 | 20.140 | 203.37 | 63.698 | 16.08 |
| 4096 | 1024 | 16384 | 20.860 | 196.36 | 63.808 | 16.05 |
| 4096 | 1024 | 20480 | 21.245 | 192.79 | 63.840 | 16.04 |
| 4096 | 1024 | 24576 | 21.928 | 186.80 | 65.525 | 15.63 |
| 4096 | 1024 | 28672 | 22.327 | 183.45 | 65.572 | 15.62 |
| 4096 | 1024 | 32768 | 23.053 | 177.68 | 66.077 | 15.50 |
| 4096 | 1024 | 36864 | 23.855 | 171.70 | 66.342 | 15.44 |
| 4096 | 1024 | 40960 | 24.469 | 167.39 | 66.464 | 15.41 |
| 4096 | 1024 | 45056 | 24.667 | 166.05 | 68.073 | 15.04 |
| 4096 | 1024 | 49152 | 25.231 | 162.34 | 68.208 | 15.01 |
| 4096 | 1024 | 53248 | 25.736 | 159.16 | 68.262 | 15.00 |
| 4096 | 1024 | 57344 | 26.515 | 154.48 | 68.810 | 14.88 |
| 4096 | 1024 | 61440 | 26.746 | 153.15 | 69.033 | 14.83 |
| 4096 | 1024 | 65536 | 27.611 | 148.35 | 70.796 | 14.46 |
| 4096 | 1024 | 69632 | 28.135 | 145.59 | 70.817 | 14.46 |
| 4096 | 1024 | 73728 | 28.682 | 142.81 | 70.902 | 14.44 |
| 4096 | 1024 | 77824 | 29.241 | 140.08 | 71.426 | 14.34 |
| 4096 | 1024 | 81920 | 29.827 | 137.32 | 71.585 | 14.30 |
| 4096 | 1024 | 86016 | 30.382 | 134.82 | 71.913 | 14.24 |
| 4096 | 1024 | 90112 | 31.014 | 132.07 | 73.761 | 13.88 |
| 4096 | 1024 | 94208 | 31.552 | 129.82 | 73.224 | 13.98 |
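The degradation across the sweep is fairly gentle. A quick sketch of the drop-off, plugging in the first and last rows of the table above (the helper name is just for illustration):

```python
def slowdown_pct(start_tps: float, end_tps: float) -> float:
    """Percentage drop in throughput between two sweep points."""
    return (start_tps - end_tps) / start_tps * 100

# First row (N_KV=0) vs last row (N_KV=94208) from the table above.
tg_drop = slowdown_pct(16.85, 13.98)   # token generation
pp_drop = slowdown_pct(219.43, 129.82) # prompt processing
print(f"TG drops ~{tg_drop:.0f}%, PP drops ~{pp_drop:.0f}% over ~94k tokens")
```

So TG only loses about 17% from empty context to ~94k tokens, while PP roughly 41%.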
Thanks again for sharing! These are great numbers... I can finally see how Kimi's lower compute requirements result in more tokens per second on larger contexts.
For some reason, Kimi-K2 performs slightly worse than DeepSeek on my 8-channel setup. It looks like RAM bandwidth can be a serious bottleneck.
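A rough back-of-envelope sketch of why bandwidth dominates here, with expert weights on CPU. The numbers below are assumptions, not measurements: Kimi-K2 activates roughly 32B parameters per token, and the bytes-per-weight figure stands in for a mid-size quant; the real traffic differs since attention runs on the GPU.

```python
# Back-of-envelope: memory traffic needed to sustain a given TG speed
# when MoE expert weights stream from system RAM. All inputs are
# assumptions for illustration.
active_params = 32e9    # ~32B activated params/token for Kimi-K2 (assumption)
bytes_per_param = 0.6   # ~4.8 bits/weight for a mid-size quant (assumption)
tg_tps = 16.85          # S_TG at N_KV=0 from the table above

bw_gbs = active_params * bytes_per_param * tg_tps / 1e9
print(f"~{bw_gbs:.0f} GB/s of weight traffic")
```

That lands in the low hundreds of GB/s, which a 12-channel DDR5 Epyc can feed but an 8-channel board has much less headroom for, consistent with the slowdown you're seeing.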