Thank you

by sousekd

Thank you, @anikifoss.

Still catching up on "older" models :).
Obligatory benchmark, Epyc 9355 + RTX 5090:

```bash
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 -fa -fmoe \
    -amb 512 -b 4096 -ub 4096 \
    -ctk f16 -c 98304 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 4096 | 1024 | 0 | 18.667 | 219.43 | 60.762 | 16.85 |
| 4096 | 1024 | 4096 | 19.121 | 214.22 | 61.478 | 16.66 |
| 4096 | 1024 | 8192 | 19.661 | 208.33 | 62.508 | 16.38 |
| 4096 | 1024 | 12288 | 20.140 | 203.37 | 63.698 | 16.08 |
| 4096 | 1024 | 16384 | 20.860 | 196.36 | 63.808 | 16.05 |
| 4096 | 1024 | 20480 | 21.245 | 192.79 | 63.840 | 16.04 |
| 4096 | 1024 | 24576 | 21.928 | 186.80 | 65.525 | 15.63 |
| 4096 | 1024 | 28672 | 22.327 | 183.45 | 65.572 | 15.62 |
| 4096 | 1024 | 32768 | 23.053 | 177.68 | 66.077 | 15.50 |
| 4096 | 1024 | 36864 | 23.855 | 171.70 | 66.342 | 15.44 |
| 4096 | 1024 | 40960 | 24.469 | 167.39 | 66.464 | 15.41 |
| 4096 | 1024 | 45056 | 24.667 | 166.05 | 68.073 | 15.04 |
| 4096 | 1024 | 49152 | 25.231 | 162.34 | 68.208 | 15.01 |
| 4096 | 1024 | 53248 | 25.736 | 159.16 | 68.262 | 15.00 |
| 4096 | 1024 | 57344 | 26.515 | 154.48 | 68.810 | 14.88 |
| 4096 | 1024 | 61440 | 26.746 | 153.15 | 69.033 | 14.83 |
| 4096 | 1024 | 65536 | 27.611 | 148.35 | 70.796 | 14.46 |
| 4096 | 1024 | 69632 | 28.135 | 145.59 | 70.817 | 14.46 |
| 4096 | 1024 | 73728 | 28.682 | 142.81 | 70.902 | 14.44 |
| 4096 | 1024 | 77824 | 29.241 | 140.08 | 71.426 | 14.34 |
| 4096 | 1024 | 81920 | 29.827 | 137.32 | 71.585 | 14.30 |
| 4096 | 1024 | 86016 | 30.382 | 134.82 | 71.913 | 14.24 |
| 4096 | 1024 | 90112 | 31.014 | 132.07 | 73.761 | 13.88 |
| 4096 | 1024 | 94208 | 31.552 | 129.82 | 73.224 | 13.98 |

Thanks again for sharing! These are great numbers... I can finally see how Kimi's lower compute requirements translate into more tokens per second at larger contexts. A rough sketch of why that might be is below.
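To make that intuition concrete, here is a minimal back-of-envelope sketch. The head counts (64 for Kimi-K2 vs. 128 for DeepSeek-V3, if I recall the configs correctly) and the nominal head dimension are my assumptions, not anything measured in this thread; the point is only that per-token attention cost grows linearly with context and roughly linearly with head count.

```python
# Rough sketch only: per-layer attention cost per generated token modeled as
#   ~4 * n_heads * head_dim * n_kv FLOPs   (QK^T plus softmax*V, mul + add),
# ignoring MLA's latent compression and everything else in the layer.
# Head counts (64 vs 128) and head_dim=128 are assumptions, not measurements.

def attn_flops_per_token(n_heads: int, n_kv: int, head_dim: int = 128) -> float:
    return 4.0 * n_heads * head_dim * n_kv

for n_kv in (8192, 32768, 94208):
    kimi = attn_flops_per_token(64, n_kv)
    deepseek = attn_flops_per_token(128, n_kv)
    print(f"n_kv={n_kv:6d}  Kimi ~{kimi / 1e9:5.2f} GFLOP/tok/layer  "
          f"DeepSeek ~{deepseek / 1e9:5.2f} GFLOP/tok/layer")
```

Under those assumptions the context-dependent term is roughly halved for Kimi, which would explain the flatter S_TG decay in the table above.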

For some reason, Kimi-K2 performs slightly worse than DeepSeek on my 8-channel setup. It looks like RAM bandwidth can be a hard bottleneck there.
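On the bandwidth point, a minimal roofline estimate, with every number below being my own assumption rather than something measured in this thread: with `-ot exps=CPU` the expert weights stream from RAM on every token, so token generation is roughly capped at sustained bandwidth divided by the bytes touched per token.

```python
# Minimal bandwidth roofline for token generation with experts in RAM:
#   t/s ceiling ~= sustained_RAM_bandwidth / bytes_streamed_per_token.
# Assumed, not measured: ~32B active params for Kimi-K2, ~37B for
# DeepSeek-R1, ~4.5 bits/weight for a Q4-ish quant, and guessed sustained
# bandwidths for 8- vs 12-channel DDR5. GPU-resident layers are ignored,
# so real numbers land somewhat below these ceilings.

def tg_ceiling(active_params_b: float, bits_per_weight: float, bw_gb_s: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bw_gb_s * 1e9 / bytes_per_token

for name, active_b in (("Kimi-K2", 32), ("DeepSeek-R1", 37)):
    for channels, bw in (("8-channel", 230), ("12-channel", 400)):
        print(f"{name} on {channels} (~{bw} GB/s): "
              f"~{tg_ceiling(active_b, 4.5, bw):.1f} t/s ceiling")
```

By that estimate the ceiling scales directly with channel count, which matches the feeling that an 8-channel board hits the wall well before the compute does.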
