Testing smol-IQ3_KS

#1
by shewin - opened

W790E Sage + QYFS + 512G + RTX5090


Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
===================================== llama_new_context_with_model: f16
llama_new_context_with_model: n_ctx = 186624
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 1
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: split_mode_graph_scheduling = 0
llama_new_context_with_model: reduce_type = f16
llama_new_context_with_model: sched_async = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 6644.32 MiB
llama_new_context_with_model: KV self size = 6644.29 MiB, c^KV (q8_0): 6644.29 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11660.36 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1569.88 MiB
llama_new_context_with_model: graph nodes = 4075
llama_new_context_with_model: graph splits = 122
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
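The reported KV self size checks out against the MLA cache layout. A quick sanity check (a sketch, not llama.cpp code; the 512-dim latent + 64-dim RoPE part per token per layer and the 61 layers are assumptions from the DeepSeek-style architecture, consistent with the `blk.60` and `512 x 16384` lines in the log, and q8_0 packs 32 values into 34 bytes):

```python
# Recompute "KV self size = 6644.29 MiB" from the log above.
n_ctx = 186624                # from the log
n_layers = 61                 # blk.0 .. blk.60 (assumption: DeepSeek-style arch)
elems_per_token = 512 + 64    # c^KV latent + RoPE key part (assumption)

def q8_0_bytes(n_elems: int) -> int:
    # q8_0 block: 32 int8 values + one fp16 scale = 34 bytes per 32 elems
    return n_elems * 34 // 32

total = n_ctx * n_layers * q8_0_bytes(elems_per_token)
print(f"{total / 2**20:.2f} MiB")  # matches the logged 6644.29 MiB
```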

main: n_kv_max = 186624, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|-------:|---------:|
| 4090 | 1022 |     0 | 39.148 |   104.47 | 83.660 |    12.22 |
| 4090 | 1022 |  4090 | 39.508 |   103.52 | 64.039 |    15.96 |
| 4090 | 1022 |  8180 | 39.877 |   102.57 | 84.079 |    12.16 |
| 4090 | 1022 | 12270 | 40.287 |   101.52 | 59.789 |    17.09 |
| 4090 | 1022 | 16360 | 40.699 |   100.49 | 61.000 |    16.75 |
| 4090 | 1022 | 20450 | 41.012 |    99.73 | 61.292 |    16.67 |

Slightly less intelligent than the original model, but still high-performing and stable even with large context sizes.

Owner

Thanks for testing!

Interesting that your TG speed varies and even speeds up at longer context? Are you using --no-mmap to preload everything into RAM? [I'm guessing you are not, and are using the default mmap?] If so, the initial testing may have been warming up the disk page cache with the weights.
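A toy illustration of the page-cache effect guessed at here (a sketch, not llama.cpp code; the 1 MiB temp file is just a stand-in for mmapped weights): with default mmap loading, pages are faulted in from disk on first access, so early benchmark passes can be slower until the OS page cache is warm.

```python
import mmap
import os
import tempfile

# Create a small stand-in for mmapped model weights.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(1 << 20))  # 1 MiB of dummy "weights"
    path = f.name

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    page = mmap.PAGESIZE
    touched = 0
    # "Warm-up" pass: read one byte per page so the kernel faults every
    # page into the page cache; later reads then skip the disk entirely.
    for i in range(0, len(mm), page):
        _ = mm[i]
        touched += 1
    print(touched * page >= len(mm))  # every page touched at least once

os.unlink(path)
```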

Cool it works well with long context though, seems good for patient vibe coding haha...

What client do you use, opencode or something?

This time I used Roo Code in VS Code for testing.
Below are my options:

--merge-qkv
--ctx-size 186608
-ctk q8_0
-mla 3
--parallel 1
--threads 101
--no-mmap
--jinja
--special
--chat-template-file ./models/templates/Kimi-K2-Thinking.jinja
-b 4090 -ub 4090
-amb 512
--n-gpu-layers 99
--override-tensor exps=CPU \

Owner

In my own testing, I too was using the old Kimi-K2-Thinking.jinja chat template to get tool use working on K2.5. Not sure if there's an updated chat template that would work better?

Otherwise, looks good! You might get a little more speed with -amb 1024 but probably fine as it is.

Thanks for the details!!

with -amb 1024

main: n_kv_max = 176640, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|-------:|---------:|
| 4090 | 1022 |     0 | 41.731 |    98.01 | 60.487 |    16.90 |
| 4090 | 1022 |  4090 | 65.923 |    62.04 | 88.352 |    11.57 |
| 4090 | 1022 |  8180 | 66.369 |    61.62 | 89.885 |    11.37 |
| 4090 | 1022 | 12270 | 67.010 |    61.04 | 90.296 |    11.32 |
Without the -amb option, the max context size decreases by ~60K:


Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
===================================== llama_new_context_with_model: f16
llama_new_context_with_model: n_ctx = 126720
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 1
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: split_mode_graph_scheduling = 0
llama_new_context_with_model: reduce_type = f16
llama_new_context_with_model: sched_async = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 4511.59 MiB
llama_new_context_with_model: KV self size = 4511.56 MiB, c^KV (q8_0): 4511.56 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 13595.36 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1101.88 MiB
llama_new_context_with_model: graph nodes = 3404
llama_new_context_with_model: graph splits = 122
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload

main: n_kv_max = 126720, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|-------:|---------:|
| 4090 | 1022 |     0 | 38.827 |   105.34 | 55.546 |    18.40 |
| 4090 | 1022 |  4090 | 39.783 |   102.81 | 66.905 |    15.28 |
| 4090 | 1022 |  8180 | 40.214 |   101.71 | 58.281 |    17.54 |
| 4090 | 1022 | 12270 | 40.641 |   100.64 | 59.134 |    17.28 |
| 4090 | 1022 | 16360 | 41.105 |    99.50 | 60.388 |    16.92 |
| 4090 | 1022 | 20450 | 41.423 |    98.74 | 67.449 |    15.15 |
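The ~60K context difference lines up with the KV-cache arithmetic. A quick sketch (assuming 61 layers, 512 + 64 cached values per token per layer, and q8_0 at 34 bytes per 32 values; all taken from the logs above, not from the model card):

```python
# Per-token KV footprint, then the cost of the lost context.
per_token_bytes = 61 * (576 * 34 // 32)   # ~36.5 KiB per token
ctx_diff = 186624 - 126720                # 59,904 tokens lost without -amb
kv_diff_mib = ctx_diff * per_token_bytes / 2**20
print(f"{kv_diff_mib:.2f} MiB")  # ~ the 6644.32 - 4511.59 MiB KV buffer gap
```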
Owner

Thanks for checking! I sometimes use -amb 1024 or -amb 2048, but -amb 512 is usually the lowest I'll go, and to your point it does save enough compute buffer to add a lot of context!

Still interesting that your TG speed doesn't decrease monotonically as context size grows...

Also a bit odd that you're not using -b 4096 -ub 4096 (powers of two) but chose 4090? There may be a benefit to choosing a power of two, but I'm not 100% sure.
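For what it's worth, 4090 is indeed not a power of two while 4096 is; GPU kernels are often tiled for power-of-two sizes, so odd batch sizes can leave lanes idle (a plausible effect, not verified here). A trivial check:

```python
def is_pow2(n: int) -> bool:
    # A power of two has exactly one bit set, so n & (n - 1) clears it to 0.
    return n > 0 and n & (n - 1) == 0

print(is_pow2(4090), is_pow2(4096))  # False True
```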

Anyway thanks as usual!

4090 was just a typo; I'll check next time whether it makes a difference.
