Testing IQ4_KSS
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors: CPU buffer size = 188880.00 MiB
llm_load_tensors: CUDA_Host buffer size = 803.28 MiB
llm_load_tensors: CUDA0 buffer size = 9032.28 MiB
....................................................................................................
ggml_backend_cuda_context: have 0 graphs
===================================== llama_init_from_model: f16
llama_init_from_model: n_ctx = 200192
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 3302.11 MiB
llama_init_from_model: KV self size = 3115.78 MiB, K (q8_0): 1557.89 MiB, V (q8_0): 1557.89 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 4946.23 MiB
llama_init_from_model: CUDA_Host compute buffer size = 1628.16 MiB
llama_init_from_model: graph nodes = 75685
llama_init_from_model: graph splits = 122
llama_init_from_model: enabling only_active_experts scheduling
main: n_kv_max = 200192, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 17.380 | 235.67 | 37.590 | 27.24 |
| 4096 | 1024 | 4096 | 17.470 | 234.45 | 35.237 | 29.06 |
| 4096 | 1024 | 8192 | 17.534 | 233.61 | 35.473 | 28.87 |
| 4096 | 1024 | 12288 | 17.678 | 231.70 | 35.842 | 28.57 |
| 4096 | 1024 | 16384 | 17.587 | 232.90 | 35.968 | 28.47 |
| 4096 | 1024 | 20480 | 17.675 | 231.74 | 36.078 | 28.38 |
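As a rough sanity check on where the memory goes in this run (all figures below are copied from the load and init log; the only interpretation added is that the ~184 GiB CPU buffer is dominated by the expert tensors overridden to CPU), the q8_0 KV cache works out to roughly 16 KiB per token of context:

```python
# Back-of-the-envelope arithmetic on the memory figures reported in the log above.
# All inputs are copied from the log; nothing here is newly measured.

cpu_weights_mib   = 188880.00  # llm_load_tensors: CPU buffer size (expert tensors kept on host)
cuda0_weights_mib = 9032.28    # llm_load_tensors: CUDA0 buffer size
kv_self_mib       = 3115.78    # llama_init_from_model: KV self size (q8_0 K + q8_0 V)
n_ctx             = 200192     # llama_init_from_model: n_ctx

print(f"weights on CPU : {cpu_weights_mib / 1024:.1f} GiB")    # ~184.5 GiB
print(f"weights on GPU : {cuda0_weights_mib / 1024:.1f} GiB")  # ~8.8 GiB

# q8_0 KV cache cost per token of context depth
kv_per_token_kib = kv_self_mib * 1024 / n_ctx
print(f"KV cache       : {kv_per_token_kib:.1f} KiB per token at q8_0")  # ~15.9 KiB/token
```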
Instead of streaming the output gradually, it dumps everything at once after a long delay. It also does not consistently produce stable responses.
Interesting findings, and you have run a lot of models!
I wonder if this could be improved with a different client, or even a system prompt encouraging it to work "step by step" or to "iterate on the design, testing each piece as you go"?
I could make a bigger quant, but I'm not sure it would do any better...
Retesting:
llama_init_from_model: n_ctx = 190208
llama_init_from_model: n_batch = 8096
llama_init_from_model: n_ubatch = 8096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 8096
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 5758.83 MiB
llama_init_from_model: KV self size = 5572.50 MiB, K (f16): 2786.25 MiB, V (f16): 2786.25 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 7795.56 MiB
llama_init_from_model: CUDA_Host compute buffer size = 3063.89 MiB
llama_init_from_model: graph nodes = 3920
llama_init_from_model: graph splits = 122
llama_init_from_model: enabling only_active_experts scheduling
main: n_kv_max = 190208, n_batch = 8096, n_ubatch = 8096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8096 | 2024 | 0 | 7.023 | 1152.73 | 65.177 | 31.05 |
| 8096 | 2024 | 8096 | 7.151 | 1132.09 | 67.620 | 29.93 |
| 8096 | 2024 | 16192 | 7.221 | 1121.18 | 68.043 | 29.75 |
| 8096 | 2024 | 24288 | 7.439 | 1088.26 | 69.052 | 29.31 |
| 8096 | 2024 | 32384 | 7.528 | 1075.50 | 69.291 | 29.21 |
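Comparing the depth-0 rows of the two sweeps, and the two KV cache types, purely as arithmetic on the numbers reported above (the runs differ in several settings at once, such as batch/u-batch size, attn_max_b, and grouped er, so the speedup cannot be pinned on any single flag):

```python
# Quick comparison of the two runs, using only the figures reported in the tables/logs above.

# Depth-0 throughput (N_KV = 0 row of each sweep)
pp_first, pp_retest = 235.67, 1152.73  # S_PP t/s
tg_first, tg_retest = 27.24, 31.05     # S_TG t/s
print(f"prompt processing speedup: {pp_retest / pp_first:.1f}x")  # ~4.9x
print(f"token generation speedup : {tg_retest / tg_first:.2f}x")  # ~1.14x

# KV cache cost per token: q8_0 (first run) vs f16 (retest)
kv_q8_per_tok  = 3115.78 / 200192  # MiB per token
kv_f16_per_tok = 5572.50 / 190208  # MiB per token
print(f"f16 KV is {kv_f16_per_tok / kv_q8_per_tok:.2f}x larger per token than q8_0")  # ~1.88x
```

The ~1.88x ratio is about what you would expect, since f16 stores 16 bits per element while q8_0 stores 8-bit values plus one f16 scale per block of 32, i.e. roughly 8.5 bits per element.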


