Testing smol-IQ3_KS
W790E Sage + QYFS + 512G + RTX5090
Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
===================================== llama_new_context_with_model: f16
llama_new_context_with_model: n_ctx = 186624
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 1
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: split_mode_graph_scheduling = 0
llama_new_context_with_model: reduce_type = f16
llama_new_context_with_model: sched_async = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 6644.32 MiB
llama_new_context_with_model: KV self size = 6644.29 MiB, c^KV (q8_0): 6644.29 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11660.36 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1569.88 MiB
llama_new_context_with_model: graph nodes = 4075
llama_new_context_with_model: graph splits = 122
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
main: n_kv_max = 186624, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 39.148 | 104.47 | 83.660 | 12.22 |
| 4090 | 1022 | 4090 | 39.508 | 103.52 | 64.039 | 15.96 |
| 4090 | 1022 | 8180 | 39.877 | 102.57 | 84.079 | 12.16 |
| 4090 | 1022 | 12270 | 40.287 | 101.52 | 59.789 | 17.09 |
| 4090 | 1022 | 16360 | 40.699 | 100.49 | 61.000 | 16.75 |
| 4090 | 1022 | 20450 | 41.012 | 99.73 | 61.292 | 16.67 |
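The throughput columns in these sweep-bench tables are just tokens divided by wall time; a quick Python check using the first row above (small last-digit differences vs. the table come down to rounding):

```python
# Reproduce S_PP and S_TG from the first row of the sweep-bench table:
# PP=4090 tokens processed in 39.148 s, TG=1022 tokens generated in 83.660 s.
pp_tokens, t_pp = 4090, 39.148
tg_tokens, t_tg = 1022, 83.660

s_pp = pp_tokens / t_pp  # prompt-processing speed, tokens/s (~104.5)
s_tg = tg_tokens / t_tg  # token-generation speed, tokens/s (~12.2)

print(f"S_PP = {s_pp:.2f} t/s, S_TG = {s_tg:.2f} t/s")
```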
Thanks for testing!
Interesting that your TG speed varies and even speeds up with longer context? Are you using --no-mmap to pre-allocate everything into RAM up front? [I'm guessing you are not, and are using the default mmap?] If so, the disk page cache may have been warming up with weights during the initial testing.
Cool that it works well with long context though, seems good for patient vibe coding haha...
What client do you use, opencode or something?
This time, I used Roo Code in VS Code for testing.
Below are my options:
```
--merge-qkv
--ctx-size 186608
-ctk q8_0
-mla 3
--parallel 1
--threads 101
--no-mmap
--jinja
--special
--chat-template-file ./models/templates/Kimi-K2-Thinking.jinja
-b 4090 -ub 4090
-amb 512
--n-gpu-layers 99
--override-tensor exps=CPU
```
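For reference, the flags above assembled into a single invocation might look like the sketch below. The binary name and model path are placeholders (assumptions, not from the post); the sweep-bench style output above suggests a llama-sweep-bench-type tool, but only the flags themselves come from the thread.

```shell
# Hypothetical assembled command; ./llama-sweep-bench and the model path
# are placeholders -- only the flags come from the post above.
./llama-sweep-bench \
  --model ./models/Kimi-K2-Thinking-smol-IQ3_KS.gguf \
  --merge-qkv \
  --ctx-size 186608 \
  -ctk q8_0 \
  -mla 3 \
  --parallel 1 \
  --threads 101 \
  --no-mmap \
  --jinja \
  --special \
  --chat-template-file ./models/templates/Kimi-K2-Thinking.jinja \
  -b 4090 -ub 4090 \
  -amb 512 \
  --n-gpu-layers 99 \
  --override-tensor exps=CPU
```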
In my own testing, I too was using the old Kimi-K2-Thinking.jinja chat template to get tool use working on K2.5. Not sure if there is an updated chat template that would work better?
Otherwise, looks good! You might get a little more speed with -amb 1024 but probably fine as it is.
Thanks for the details!!
with -amb 1024
main: n_kv_max = 176640, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 41.731 | 98.01 | 60.487 | 16.90 |
| 4090 | 1022 | 4090 | 65.923 | 62.04 | 88.352 | 11.57 |
| 4090 | 1022 | 8180 | 66.369 | 61.62 | 89.885 | 11.37 |
| 4090 | 1022 | 12270 | 67.010 | 61.04 | 90.296 | 11.32 |
Without the -amb option, the max context size decreases by ~60K:
Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
===================================== llama_new_context_with_model: f16
llama_new_context_with_model: n_ctx = 126720
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 1
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: split_mode_graph_scheduling = 0
llama_new_context_with_model: reduce_type = f16
llama_new_context_with_model: sched_async = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 4511.59 MiB
llama_new_context_with_model: KV self size = 4511.56 MiB, c^KV (q8_0): 4511.56 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 13595.36 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1101.88 MiB
llama_new_context_with_model: graph nodes = 3404
llama_new_context_with_model: graph splits = 122
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
main: n_kv_max = 126720, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 38.827 | 105.34 | 55.546 | 18.40 |
| 4090 | 1022 | 4090 | 39.783 | 102.81 | 66.905 | 15.28 |
| 4090 | 1022 | 8180 | 40.214 | 101.71 | 58.281 | 17.54 |
| 4090 | 1022 | 12270 | 40.641 | 100.64 | 59.134 | 17.28 |
| 4090 | 1022 | 16360 | 41.105 | 99.50 | 60.388 | 16.92 |
| 4090 | 1022 | 20450 | 41.423 | 98.74 | 67.449 | 15.15 |
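As a cross-check on the two logged KV sizes (6644.29 MiB at n_ctx = 186624, 4511.56 MiB at n_ctx = 126720), the MLA c^KV cache can be reproduced from the model dims. This is a sketch assuming DeepSeek-style MLA dims for Kimi-K2: 61 layers (blk.0..blk.60 in the log), a 512-dim latent plus 64 RoPE dims per token per layer, and q8_0 at 34 bytes per 32-element block.

```python
# MLA c^KV cache size: per token, each layer stores a 512-dim latent
# plus a 64-dim RoPE part (576 elements), quantized to q8_0
# (34 bytes per 32-element block = 1.0625 bytes/element). Dims assumed.
Q8_0_BYTES_PER_ELEM = 34 / 32
N_LAYERS = 61        # blk.0 .. blk.60 per the log
C_KV_DIM = 512 + 64  # kv_lora_rank + rope dims (assumption)

def kv_cache_mib(n_ctx: int) -> float:
    return n_ctx * C_KV_DIM * N_LAYERS * Q8_0_BYTES_PER_ELEM / 2**20

print(kv_cache_mib(186624))  # ~6644.29 MiB, matches the first log
print(kv_cache_mib(126720))  # ~4511.56 MiB, matches the second log
```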
Thanks for checking. I sometimes use -amb 1024 or -amb 2048, but -amb 512 is usually the lowest I'll use, and to your point it does save enough compute buffer to add a lot of context!
Still interesting that your TG speed is not monotonically decreasing as context size increases..
Also a bit odd that you're not using -ub 4096 -b 4096 (powers of 2) but chose 4090? There may be a benefit to choosing a power of two, but I'm not 100% sure.
Anyway thanks as usual!
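The trade-off under discussion can be read straight off the two logs: dropping -amb 512 grows the CUDA0 compute buffer and costs context. A quick check with the numbers copied from the logs above:

```python
# CUDA0 compute buffer and max context, copied from the two logged configs.
with_amb512 = {"compute_mib": 11660.36, "n_ctx": 186624}  # with -amb 512
without_amb = {"compute_mib": 13595.36, "n_ctx": 126720}  # without -amb

extra_buffer = without_amb["compute_mib"] - with_amb512["compute_mib"]
extra_ctx = with_amb512["n_ctx"] - without_amb["n_ctx"]
print(f"-amb 512 saves {extra_buffer:.2f} MiB of compute buffer")
print(f"which buys {extra_ctx} extra tokens of context (~60K)")
```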
4090 is just a typo; I'll check next time to see if it makes a difference.

