Testing Q4_0
Tensor blk.39.ffn_up_exps.weight buffer type overridden to CPU
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 17920.00 MiB
llm_load_tensors: CUDA_Host buffer size = 303.12 MiB
llm_load_tensors: CUDA0 buffer size = 2027.78 MiB
..................................................................................................
ggml_backend_cuda_context: have 0 graphs
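The "buffer type overridden to CPU" messages mean the MoE expert tensors were deliberately kept in system RAM (hence the 17920.00 MiB CPU buffer next to only 2027.78 MiB on CUDA0) while the remaining weights of all 41 layers were offloaded. In llama.cpp-style runners this is typically done with a tensor-override regex such as `--override-tensor`; the exact pattern used for this run is not shown in the log, so the one below is only an assumption for illustration:

```python
import re

# Illustrative only: tensor names matching the override pattern are kept on
# the CPU backend, everything else goes to the GPU. The pattern is an assumed
# example, not the flag actually used for this run.
pattern = re.compile(r"blk\.\d+\.ffn_(up|down|gate)_exps\.weight")

for name in ("blk.39.ffn_up_exps.weight", "blk.39.attn_q.weight"):
    backend = "CPU" if pattern.fullmatch(name) else "CUDA0"
    print(f"{name} -> {backend}")
```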
=====================================
llama_init_from_model: f16
llama_init_from_model: n_ctx = 250112
llama_init_from_model: n_batch = 8096
llama_init_from_model: n_ubatch = 8096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 8096
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 4947.81 MiB
llama_init_from_model: KV self size = 4885.00 MiB, K (f16): 2442.50 MiB, V (f16): 2442.50 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 7732.31 MiB
llama_init_from_model: CUDA_Host compute buffer size = 3925.72 MiB
llama_init_from_model: graph nodes = 95820
llama_init_from_model: graph splits = 82
llama_init_from_model: enabling only_active_experts scheduling
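The logged KV sizes are internally consistent: with n_ctx = 250112 and the K cache stored as f16, 2442.50 MiB for K works out to exactly 10240 bytes per token across the 40 repeating layers. A minimal sketch of that arithmetic; head_dim and n_head_kv are assumptions chosen to reproduce the logged size, not values read from the model file:

```python
# Sanity check of the logged K-cache size. n_ctx and n_layer come from the
# log above; head_dim and n_head_kv are ASSUMPTIONS picked to be consistent
# with the printed 2442.50 MiB.
n_ctx     = 250_112  # llama_init_from_model: n_ctx
n_layer   = 40       # "offloading 40 repeating layers to GPU"
head_dim  = 128      # assumed
n_head_kv = 1        # assumed (MQA-style)
f16_bytes = 2        # K cache is stored as f16

k_bytes = n_ctx * n_layer * head_dim * n_head_kv * f16_bytes
print(f"K cache: {k_bytes / 2**20:.2f} MiB")  # -> 2442.50 MiB, matching the log
# V is the same size, so K + V = 4885.00 MiB, the logged "KV self size".
```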
main: n_kv_max = 250112, n_batch = 8096, n_ubatch = 8096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
|    PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------:|-----:|------:|-------:|---------:|-------:|---------:|
|  8096 | 2024 |     0 |  3.111 |  2602.01 | 26.582 |    76.14 |
|  8096 | 2024 |  8096 |  3.105 |  2607.24 | 26.399 |    76.67 |
|  8096 | 2024 | 16192 |  3.130 |  2586.94 | 25.682 |    78.81 |
|  8096 | 2024 | 24288 |  3.176 |  2549.05 | 26.232 |    77.16 |
|  8096 | 2024 | 32384 |  3.231 |  2505.65 | 26.152 |    77.39 |
|  8096 | 2024 | 40480 |  3.283 |  2465.88 | 27.841 |    72.70 |
|  8096 | 2024 | 48576 |  3.473 |  2330.81 | 29.835 |    67.84 |
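For reference, the speed columns are simply token counts divided by wall time: S_PP = PP / T_PP and S_TG = TG / T_TG. A quick recomputation from a few rows (small deviations come from the times being printed with only three decimals):

```python
# Recompute the throughput columns from the printed (rounded) times.
PP, TG = 8096, 2024
rows = [(0, 3.111, 26.582), (24288, 3.176, 26.232), (48576, 3.473, 29.835)]

for n_kv, t_pp, t_tg in rows:
    print(f"N_KV={n_kv:6d}  S_PP={PP / t_pp:8.2f} t/s  S_TG={TG / t_tg:6.2f} t/s")
```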
