Testing Q4_0
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors: CPU buffer size = 43008.00 MiB
llm_load_tensors: CUDA_Host buffer size = 185.47 MiB
llm_load_tensors: CUDA0 buffer size = 2226.26 MiB
....................................................................................................
ggml_backend_cuda_context: have 0 graphs
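The `buffer type overriden to CPU` line above shows a tensor-override rule keeping the MoE expert weights in system RAM (the 43008.00 MiB CPU buffer) while the remaining attention and shared weights (~2.2 GiB) go to CUDA0. The exact `-ot`/`--override-tensor` argument used for this run is not visible in the log; a minimal sketch of the matching idea, with a hypothetical regex:

```python
import re

# Hypothetical override pattern (the actual -ot argument for this run is not
# shown in the log): route MoE expert tensors to CPU, everything else to GPU.
EXPS_TO_CPU = re.compile(r"blk\.\d+\.ffn_(up|down|gate)_exps\.weight")

for name in ("blk.47.ffn_down_exps.weight",  # expert weights -> stay on CPU
             "blk.47.attn_q.weight"):        # everything else -> offloaded
    print(name, "->", "CPU" if EXPS_TO_CPU.fullmatch(name) else "CUDA0")
```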
=====================================
llama_init_from_model: f16
llama_init_from_model: n_ctx = 200192
llama_init_from_model: n_batch = 7096
llama_init_from_model: n_ubatch = 7096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 2048
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 5000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 4767.38 MiB
llama_init_from_model: KV self size = 4692.00 MiB, K (f16): 2346.00 MiB, V (f16): 2346.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.58 MiB
llama_init_from_model: CUDA0 compute buffer size = 4259.09 MiB
llama_init_from_model: CUDA_Host compute buffer size = 2768.16 MiB
llama_init_from_model: graph nodes = 101374
llama_init_from_model: graph splits = 98
llama_init_from_model: enabling only_active_experts scheduling
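The KV numbers above are internally consistent: with f16 (2-byte) elements across 48 layers at n_ctx = 200192, the reported `KV self size` implies 128 cache elements per layer per token for K and likewise for V (that per-layer width is inferred from the arithmetic, not printed in the log):

$$
200192 \times 48 \times (128 + 128) \times 2\ \text{bytes} = 4{,}919{,}918{,}592\ \text{bytes} = 4692.00\ \text{MiB},
$$

split evenly into the 2346.00 MiB each reported for K (f16) and V (f16).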
main: n_kv_max = 200192, n_batch = 7096, n_ubatch = 7096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  7096 |   1774 |      0 |    5.241 |  1354.00 |   26.462 |    67.04 |
|  7096 |   1774 |   7096 |    5.130 |  1383.36 |   26.832 |    66.12 |
|  7096 |   1774 |  14192 |    5.145 |  1379.23 |   27.358 |    64.84 |
|  7096 |   1774 |  21288 |    5.166 |  1373.64 |   27.982 |    63.40 |
|  7096 |   1774 |  28384 |    5.229 |  1357.11 |   28.221 |    62.86 |
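As a quick consistency check on the sweep table: the throughput columns are simply S_PP = PP / T_PP and S_TG = TG / T_TG, and decode speed drops about 6% as the KV cache fills from 0 to ~28k tokens. A small sketch, with the table values hard-coded from above:

```python
# Recompute the sweep-bench throughput columns and the TG slowdown over the sweep.
rows = [
    # (PP, TG, N_KV, T_PP s, T_TG s)
    (7096, 1774,     0, 5.241, 26.462),
    (7096, 1774,  7096, 5.130, 26.832),
    (7096, 1774, 14192, 5.145, 27.358),
    (7096, 1774, 21288, 5.166, 27.982),
    (7096, 1774, 28384, 5.229, 28.221),
]
for pp, tg, n_kv, t_pp, t_tg in rows:
    print(f"N_KV={n_kv:6d}  S_PP={pp / t_pp:8.2f} t/s  S_TG={tg / t_tg:6.2f} t/s")

base, last = rows[0], rows[-1]
drop = (1 - (last[1] / last[4]) / (base[1] / base[4])) * 100
print(f"TG slowdown across the sweep (N_KV 0 -> {last[2]}): {drop:.1f}%")
```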
