Testing smol-IQ3_KS
W790E Sage + QYFS + 512G + RTX5090
Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
===================================== llama_new_context_with_model: f16
llama_new_context_with_model: n_ctx = 186624
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 1
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: split_mode_graph_scheduling = 0
llama_new_context_with_model: reduce_type = f16
llama_new_context_with_model: sched_async = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 6644.32 MiB
llama_new_context_with_model: KV self size = 6644.29 MiB, c^KV (q8_0): 6644.29 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11660.36 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1569.88 MiB
llama_new_context_with_model: graph nodes = 4075
llama_new_context_with_model: graph splits = 122
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
main: n_kv_max = 186624, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 39.148 | 104.47 | 83.660 | 12.22 |
| 4090 | 1022 | 4090 | 39.508 | 103.52 | 64.039 | 15.96 |
| 4090 | 1022 | 8180 | 39.877 | 102.57 | 84.079 | 12.16 |
| 4090 | 1022 | 12270 | 40.287 | 101.52 | 59.789 | 17.09 |
| 4090 | 1022 | 16360 | 40.699 | 100.49 | 61.000 | 16.75 |
| 4090 | 1022 | 20450 | 41.012 | 99.73 | 61.292 | 16.67 |
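The throughput columns in these sweep-bench tables are just tokens divided by wall time; a quick Python check using the first row above (small last-digit differences vs. the table come down to rounding):

```python
# Reproduce S_PP and S_TG from the first row of the sweep-bench table:
# PP=4090 tokens processed in 39.148 s, TG=1022 tokens generated in 83.660 s.
pp_tokens, t_pp = 4090, 39.148
tg_tokens, t_tg = 1022, 83.660

s_pp = pp_tokens / t_pp  # prompt-processing speed, tokens/s (~104.5)
s_tg = tg_tokens / t_tg  # token-generation speed, tokens/s (~12.2)

print(f"S_PP = {s_pp:.2f} t/s, S_TG = {s_tg:.2f} t/s")
```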
Thanks for testing!
Interesting that your TG speed varies and even speeds up with longer context? Are you using --no-mmap to pre-allocate everything into RAM up front? [I'm guessing you are not, and are using the default mmap?] If so, the disk page cache may have been warming up with weights during the initial testing.
Cool that it works well with long context though, seems good for patient vibe coding haha...
What client do you use, opencode or something?
This time, I used Roo Code in VS Code for testing.
Below are my options:
```
--merge-qkv
--ctx-size 186608
-ctk q8_0
-mla 3
--parallel 1
--threads 101
--no-mmap
--jinja
--special
--chat-template-file ./models/templates/Kimi-K2-Thinking.jinja
-b 4090 -ub 4090
-amb 512
--n-gpu-layers 99
--override-tensor exps=CPU
```
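For reference, the flags above assembled into a single invocation might look like the sketch below. The binary name and model path are placeholders (assumptions, not from the post); the sweep-bench style output above suggests a llama-sweep-bench-type tool, but only the flags themselves come from the thread.

```shell
# Hypothetical assembled command; ./llama-sweep-bench and the model path
# are placeholders -- only the flags come from the post above.
./llama-sweep-bench \
  --model ./models/Kimi-K2-Thinking-smol-IQ3_KS.gguf \
  --merge-qkv \
  --ctx-size 186608 \
  -ctk q8_0 \
  -mla 3 \
  --parallel 1 \
  --threads 101 \
  --no-mmap \
  --jinja \
  --special \
  --chat-template-file ./models/templates/Kimi-K2-Thinking.jinja \
  -b 4090 -ub 4090 \
  -amb 512 \
  --n-gpu-layers 99 \
  --override-tensor exps=CPU
```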
In my own testing, I too was using the old Kimi-K2-Thinking.jinja chat template to get tool use working on K2.5. Not sure if there is an updated chat template that would work better?
Otherwise, looks good! You might get a little more speed with -amb 1024 but probably fine as it is.
Thanks for the details!!
with -amb 1024
main: n_kv_max = 176640, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 41.731 | 98.01 | 60.487 | 16.90 |
| 4090 | 1022 | 4090 | 65.923 | 62.04 | 88.352 | 11.57 |
| 4090 | 1022 | 8180 | 66.369 | 61.62 | 89.885 | 11.37 |
| 4090 | 1022 | 12270 | 67.010 | 61.04 | 90.296 | 11.32 |
Without the -amb option, the max context size decreases by ~60K:
Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
===================================== llama_new_context_with_model: f16
llama_new_context_with_model: n_ctx = 126720
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 1
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: split_mode_graph_scheduling = 0
llama_new_context_with_model: reduce_type = f16
llama_new_context_with_model: sched_async = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 4511.59 MiB
llama_new_context_with_model: KV self size = 4511.56 MiB, c^KV (q8_0): 4511.56 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 13595.36 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1101.88 MiB
llama_new_context_with_model: graph nodes = 3404
llama_new_context_with_model: graph splits = 122
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
main: n_kv_max = 126720, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 38.827 | 105.34 | 55.546 | 18.40 |
| 4090 | 1022 | 4090 | 39.783 | 102.81 | 66.905 | 15.28 |
| 4090 | 1022 | 8180 | 40.214 | 101.71 | 58.281 | 17.54 |
| 4090 | 1022 | 12270 | 40.641 | 100.64 | 59.134 | 17.28 |
| 4090 | 1022 | 16360 | 41.105 | 99.50 | 60.388 | 16.92 |
| 4090 | 1022 | 20450 | 41.423 | 98.74 | 67.449 | 15.15 |
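As a cross-check on the two logged KV sizes (6644.29 MiB at n_ctx = 186624, 4511.56 MiB at n_ctx = 126720), the MLA c^KV cache can be reproduced from the model dims. This is a sketch assuming DeepSeek-style MLA dims for Kimi-K2: 61 layers (blk.0..blk.60 in the log), a 512-dim latent plus 64 RoPE dims per token per layer, and q8_0 at 34 bytes per 32-element block.

```python
# MLA c^KV cache size: per token, each layer stores a 512-dim latent
# plus a 64-dim RoPE part (576 elements), quantized to q8_0
# (34 bytes per 32-element block = 1.0625 bytes/element). Dims assumed.
Q8_0_BYTES_PER_ELEM = 34 / 32
N_LAYERS = 61        # blk.0 .. blk.60 per the log
C_KV_DIM = 512 + 64  # kv_lora_rank + rope dims (assumption)

def kv_cache_mib(n_ctx: int) -> float:
    return n_ctx * C_KV_DIM * N_LAYERS * Q8_0_BYTES_PER_ELEM / 2**20

print(kv_cache_mib(186624))  # ~6644.29 MiB, matches the first log
print(kv_cache_mib(126720))  # ~4511.56 MiB, matches the second log
```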
Thanks for checking. I sometimes use -amb 1024 or -amb 2048, but -amb 512 is usually the lowest I'll use, and to your point it does save enough compute buffer to add a lot of context!
Still interesting that your TG speed is not monotonically decreasing as context size increases..
Also a bit odd that you're not using -ub 4096 -b 4096 (powers of 2) but chose 4090? There may be a benefit to choosing a power of two, but I'm not 100% sure.
Anyway thanks as usual!
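The trade-off under discussion can be read straight off the two logs: dropping -amb 512 grows the CUDA0 compute buffer and costs context. A quick check with the numbers copied from the logs above:

```python
# CUDA0 compute buffer and max context, copied from the two logged configs.
with_amb512 = {"compute_mib": 11660.36, "n_ctx": 186624}  # with -amb 512
without_amb = {"compute_mib": 13595.36, "n_ctx": 126720}  # without -amb

extra_buffer = without_amb["compute_mib"] - with_amb512["compute_mib"]
extra_ctx = with_amb512["n_ctx"] - without_amb["n_ctx"]
print(f"-amb 512 saves {extra_buffer:.2f} MiB of compute buffer")
print(f"which buys {extra_ctx} extra tokens of context (~60K)")
```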
4090 is just a typo; I'll check next time to see if it makes a difference.

