Testing IQ4_KSS

#4
by shewin - opened

Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors: CPU buffer size = 188880.00 MiB
llm_load_tensors: CUDA_Host buffer size = 803.28 MiB
llm_load_tensors: CUDA0 buffer size = 9032.28 MiB
...................................................................................................
ggml_backend_cuda_context: have 0 graphs
llama_init_from_model: f16
llama_init_from_model: n_ctx = 200192
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 3302.11 MiB
llama_init_from_model: KV self size = 3115.78 MiB, K (q8_0): 1557.89 MiB, V (q8_0): 1557.89 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 4946.23 MiB
llama_init_from_model: CUDA_Host compute buffer size = 1628.16 MiB
llama_init_from_model: graph nodes = 75685
llama_init_from_model: graph splits = 122
llama_init_from_model: enabling only_active_experts scheduling

main: n_kv_max = 200192, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV |  T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|------|------|-------|---------|----------|---------|----------|
| 4096 | 1024 |     0 |  17.380 |   235.67 |  37.590 |    27.24 |
| 4096 | 1024 |  4096 |  17.470 |   234.45 |  35.237 |    29.06 |
| 4096 | 1024 |  8192 |  17.534 |   233.61 |  35.473 |    28.87 |
| 4096 | 1024 | 12288 |  17.678 |   231.70 |  35.842 |    28.57 |
| 4096 | 1024 | 16384 |  17.587 |   232.90 |  35.968 |    28.47 |
| 4096 | 1024 | 20480 |  17.675 |   231.74 |  36.078 |    28.38 |
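The throughput columns in the sweep-bench table are simply tokens divided by wall time (S_PP = PP / T_PP, S_TG = TG / T_TG). A quick sanity check in Python, using a few rows copied from the table above:

```python
# Verify that the reported throughput columns match tokens / seconds.
# Row values (PP, TG, N_KV, T_PP, S_PP, T_TG, S_TG) copied from the table above.
rows = [
    (4096, 1024,     0, 17.380, 235.67, 37.590, 27.24),
    (4096, 1024,  4096, 17.470, 234.45, 35.237, 29.06),
    (4096, 1024, 20480, 17.675, 231.74, 36.078, 28.38),
]

for pp, tg, n_kv, t_pp, s_pp, t_tg, s_tg in rows:
    # Allow a small tolerance since the logged values are rounded.
    assert abs(pp / t_pp - s_pp) < 0.02, (n_kv, pp / t_pp)
    assert abs(tg / t_tg - s_tg) < 0.02, (n_kv, tg / t_tg)

print("throughput columns consistent")
```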

Instead of streaming tokens gradually, it emits the entire response at once after a long delay.
It does not consistently produce stable responses.

Interesting findings, and you have run a lot of models!

I wonder if this could be improved with a different client, or even a system prompt that encourages it to work "step by step" or "iterate on the design, testing each piece as you go"?

I could make a bigger quant, but I'm not sure it would do any better...

Retesting:

llama_init_from_model: n_ctx = 190208
llama_init_from_model: n_batch = 8096
llama_init_from_model: n_ubatch = 8096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 8096
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 5758.83 MiB
llama_init_from_model: KV self size = 5572.50 MiB, K (f16): 2786.25 MiB, V (f16): 2786.25 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 7795.56 MiB
llama_init_from_model: CUDA_Host compute buffer size = 3063.89 MiB
llama_init_from_model: graph nodes = 3920
llama_init_from_model: graph splits = 122
llama_init_from_model: enabling only_active_experts scheduling

main: n_kv_max = 190208, n_batch = 8096, n_ubatch = 8096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV |  T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|------|------|-------|---------|----------|---------|----------|
| 8096 | 2024 |     0 |   7.023 |  1152.73 |  65.177 |    31.05 |
| 8096 | 2024 |  8096 |   7.151 |  1132.09 |  67.620 |    29.93 |
| 8096 | 2024 | 16192 |   7.221 |  1121.18 |  68.043 |    29.75 |
| 8096 | 2024 | 24288 |   7.439 |  1088.26 |  69.052 |    29.31 |
| 8096 | 2024 | 32384 |   7.528 |  1075.50 |  69.291 |    29.21 |
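The two runs also differ in KV cache type: the first run used a q8_0 KV cache (1557.89 MiB per K/V side at n_ctx = 200192), while the retest used f16 (2786.25 MiB per side at n_ctx = 190208). Normalizing per token of context, the size ratio matches q8_0's standard GGML block layout of 32 int8 values plus one fp16 scale (34 bytes per 32 elements) against f16's 2 bytes per element. A quick check:

```python
# Compare per-token K-cache cost between the two runs above.
# q8_0 blocks store 32 int8 values plus one fp16 scale: 34 bytes / 32 elements.
q8_0_bytes_per_elem = 34 / 32          # 1.0625 bytes
f16_bytes_per_elem = 2.0

# Per-token K-cache cost from the log lines above (MiB divided by n_ctx).
q8_0_per_token = 1557.89 / 200192      # first run, K (q8_0)
f16_per_token = 2786.25 / 190208       # retest, K (f16)

measured_ratio = q8_0_per_token / f16_per_token
expected_ratio = q8_0_bytes_per_elem / f16_bytes_per_elem  # 0.53125

assert abs(measured_ratio - expected_ratio) < 1e-3
print(f"q8_0/f16 KV size ratio: {measured_ratio:.5f} (expected {expected_ratio})")
```

So the q8_0 cache buys a little under half the KV memory back, at the cost of quantizing the cache.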
