Testing IQ4_KSS
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors: CPU buffer size = 188880.00 MiB
llm_load_tensors: CUDA_Host buffer size = 803.28 MiB
llm_load_tensors: CUDA0 buffer size = 9032.28 MiB
....................................................................................................
ggml_backend_cuda_context: have 0 graphs
===================================== llama_init_from_model: f16
llama_init_from_model: n_ctx = 200192
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 3302.11 MiB
llama_init_from_model: KV self size = 3115.78 MiB, K (q8_0): 1557.89 MiB, V (q8_0): 1557.89 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 4946.23 MiB
llama_init_from_model: CUDA_Host compute buffer size = 1628.16 MiB
llama_init_from_model: graph nodes = 75685
llama_init_from_model: graph splits = 122
llama_init_from_model: enabling only_active_experts scheduling
main: n_kv_max = 200192, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 17.380 | 235.67 | 37.590 | 27.24 |
| 4096 | 1024 | 4096 | 17.470 | 234.45 | 35.237 | 29.06 |
| 4096 | 1024 | 8192 | 17.534 | 233.61 | 35.473 | 28.87 |
| 4096 | 1024 | 12288 | 17.678 | 231.70 | 35.842 | 28.57 |
| 4096 | 1024 | 16384 | 17.587 | 232.90 | 35.968 | 28.47 |
| 4096 | 1024 | 20480 | 17.675 | 231.74 | 36.078 | 28.38 |
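As a rough sanity check on where the memory goes in this run (all figures below are copied from the load and init log; the only interpretation added is that the ~184 GiB CPU buffer is dominated by the expert tensors overridden to CPU), the q8_0 KV cache works out to roughly 16 KiB per token of context:

```python
# Back-of-the-envelope arithmetic on the memory figures reported in the log above.
# All inputs are copied from the log; nothing here is newly measured.

cpu_weights_mib   = 188880.00  # llm_load_tensors: CPU buffer size (expert tensors kept on host)
cuda0_weights_mib = 9032.28    # llm_load_tensors: CUDA0 buffer size
kv_self_mib       = 3115.78    # llama_init_from_model: KV self size (q8_0 K + q8_0 V)
n_ctx             = 200192     # llama_init_from_model: n_ctx

print(f"weights on CPU : {cpu_weights_mib / 1024:.1f} GiB")    # ~184.5 GiB
print(f"weights on GPU : {cuda0_weights_mib / 1024:.1f} GiB")  # ~8.8 GiB

# q8_0 KV cache cost per token of context depth
kv_per_token_kib = kv_self_mib * 1024 / n_ctx
print(f"KV cache       : {kv_per_token_kib:.1f} KiB per token at q8_0")  # ~15.9 KiB/token
```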
Instead of streaming the output gradually, it dumps everything at once after a long delay. It also does not consistently produce stable responses.
Interesting findings, and you have run a lot of models!
I wonder if this could be improved with a different client, or even a system prompt encouraging it to work "step by step" or to "iterate on the design, testing each piece as you go"?
I could make a bigger quant, but I'm not sure it would do any better...
Retesting:
llama_init_from_model: n_ctx = 190208
llama_init_from_model: n_batch = 8096
llama_init_from_model: n_ubatch = 8096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 8096
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 5758.83 MiB
llama_init_from_model: KV self size = 5572.50 MiB, K (f16): 2786.25 MiB, V (f16): 2786.25 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 7795.56 MiB
llama_init_from_model: CUDA_Host compute buffer size = 3063.89 MiB
llama_init_from_model: graph nodes = 3920
llama_init_from_model: graph splits = 122
llama_init_from_model: enabling only_active_experts scheduling
main: n_kv_max = 190208, n_batch = 8096, n_ubatch = 8096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8096 | 2024 | 0 | 7.023 | 1152.73 | 65.177 | 31.05 |
| 8096 | 2024 | 8096 | 7.151 | 1132.09 | 67.620 | 29.93 |
| 8096 | 2024 | 16192 | 7.221 | 1121.18 | 68.043 | 29.75 |
| 8096 | 2024 | 24288 | 7.439 | 1088.26 | 69.052 | 29.31 |
| 8096 | 2024 | 32384 | 7.528 | 1075.50 | 69.291 | 29.21 |
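Comparing the depth-0 rows of the two sweeps, and the two KV cache types, purely as arithmetic on the numbers reported above (the runs differ in several settings at once, such as batch/u-batch size, attn_max_b, and grouped er, so the speedup cannot be pinned on any single flag):

```python
# Quick comparison of the two runs, using only the figures reported in the tables/logs above.

# Depth-0 throughput (N_KV = 0 row of each sweep)
pp_first, pp_retest = 235.67, 1152.73  # S_PP t/s
tg_first, tg_retest = 27.24, 31.05     # S_TG t/s
print(f"prompt processing speedup: {pp_retest / pp_first:.1f}x")  # ~4.9x
print(f"token generation speedup : {tg_retest / tg_first:.2f}x")  # ~1.14x

# KV cache cost per token: q8_0 (first run) vs f16 (retest)
kv_q8_per_tok  = 3115.78 / 200192  # MiB per token
kv_f16_per_tok = 5572.50 / 190208  # MiB per token
print(f"f16 KV is {kv_f16_per_tok / kv_q8_per_tok:.2f}x larger per token than q8_0")  # ~1.88x
```

The ~1.88x ratio is about what you would expect, since f16 stores 16 bits per element while q8_0 stores 8-bit values plus one f16 scale per block of 32, i.e. roughly 8.5 bits per element.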


