Testing IQ4_KSS

#5
by shewin - opened

Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors: CPU buffer size = 56688.00 MiB
llm_load_tensors: CUDA_Host buffer size = 602.46 MiB
llm_load_tensors: CUDA0 buffer size = 5397.56 MiB
...................................................................................................~ggml_backend_cuda_context: have 0 graphs
.
===================================== llama_init_from_model: f16
llama_init_from_model: n_ctx = 250112
llama_init_from_model: n_batch = 8096
llama_init_from_model: n_ubatch = 8096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 8096
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 6011.06 MiB
llama_init_from_model: KV self size = 5862.00 MiB, K (f16): 2931.00 MiB, V (f16): 2931.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 7763.94 MiB
llama_init_from_model: CUDA_Host compute buffer size = 3957.34 MiB
llama_init_from_model: graph nodes = 114982
llama_init_from_model: graph splits = 98
llama_init_from_model: enabling only_active_experts scheduling
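For anyone sizing context against VRAM, the KV numbers in the log above can be turned into a per-token cost. This is just arithmetic on the logged values (`n_ctx`, the K and V sizes, and the f16 cache type); the only assumptions are that MiB means 2^20 bytes and that the cache scales linearly with context length.

```python
# Values copied from the log above.
MIB = 1024 ** 2          # assumption: the log reports binary mebibytes
N_CTX = 250112           # n_ctx
K_MIB = 2931.00          # K (f16)
V_MIB = 2931.00          # V (f16)
F16_BYTES = 2            # bytes per f16 element


def bytes_per_token(size_mib: float, n_tokens: int) -> float:
    """Cache bytes used per token of context."""
    return size_mib * MIB / n_tokens


k_per_token = bytes_per_token(K_MIB, N_CTX)
kv_per_token = bytes_per_token(K_MIB + V_MIB, N_CTX)
k_elems_per_token = k_per_token / F16_BYTES


def kv_mib_for_context(n_tokens: int) -> float:
    """Estimated KV self size in MiB at another context length (linear scaling assumed)."""
    return kv_per_token * n_tokens / MIB


print(f"K bytes/token:    {k_per_token:.0f}")
print(f"K+V bytes/token:  {kv_per_token:.0f}")
print(f"K f16 elems/token: {k_elems_per_token:.0f}")
print(f"KV at 32768 ctx:  {kv_mib_for_context(32768):.1f} MiB")
```

Note that this covers only the "KV self size" line; the CUDA0 KV buffer (6011.06 MiB) is slightly larger, and the compute buffers come on top of that.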

main: n_kv_max = 250112, n_batch = 8096, n_ubatch = 8096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

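|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|-------:|---------:|
| 8096 | 2024 |     0 | 12.369 |   654.55 | 91.660 |    22.08 |
| 8096 | 2024 |  8096 | 12.166 |   665.44 | 89.755 |    22.55 |
| 8096 | 2024 | 16192 | 12.200 |   663.60 | 89.926 |    22.51 |
| 8096 | 2024 | 24288 | 12.317 |   657.31 | 90.337 |    22.41 |
| 8096 | 2024 | 32384 | 12.424 |   651.66 | 90.490 |    22.37 |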
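As a quick sanity check on how to read the columns: the speed columns appear to be simply tokens divided by seconds, i.e. S_PP = PP / T_PP and S_TG = TG / T_TG. The sketch below recomputes them from the rows above; the tolerance values are my own choice to absorb rounding in the printed times.

```python
# Rows copied from the benchmark output above:
# (PP, TG, N_KV, T_PP s, S_PP t/s, T_TG s, S_TG t/s)
rows = [
    (8096, 2024, 0,     12.369, 654.55, 91.660, 22.08),
    (8096, 2024, 8096,  12.166, 665.44, 89.755, 22.55),
    (8096, 2024, 16192, 12.200, 663.60, 89.926, 22.51),
    (8096, 2024, 24288, 12.317, 657.31, 90.337, 22.41),
    (8096, 2024, 32384, 12.424, 651.66, 90.490, 22.37),
]

PP_TOL = 0.05  # t/s, assumed slack for rounded timings
TG_TOL = 0.01  # t/s, assumed slack for rounded timings

mismatches = []
for pp, tg, n_kv, t_pp, s_pp, t_tg, s_tg in rows:
    pp_speed = pp / t_pp  # prompt-processing tokens per second
    tg_speed = tg / t_tg  # token-generation tokens per second
    if abs(pp_speed - s_pp) > PP_TOL or abs(tg_speed - s_tg) > TG_TOL:
        mismatches.append(n_kv)
    print(f"N_KV={n_kv:>6}  PP {pp_speed:7.2f} t/s  TG {tg_speed:5.2f} t/s")

mean_pp = sum(r[4] for r in rows) / len(rows)
mean_tg = sum(r[6] for r in rows) / len(rows)
print(f"mean S_PP = {mean_pp:.2f} t/s, mean S_TG = {mean_tg:.2f} t/s")
```

Both speeds stay essentially flat from 0 to 32384 tokens of existing context in this run.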

[Screenshots: 2026-02-25_23-14, 2026-02-25_23-30, 2026-02-26_00-01]

A small but high quality model

Curious how it fares against MiniMax M2.5, currently my favorite.

@dehnhaide

I too am curious if @shewin has any specific comments, but you can see some info from a similar report here: https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF/discussions/12

@shewin

I like this model. I made a better one: IQ5_KS 77.341 GiB (5.441 BPW)
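For reference, file size and BPW together imply a rough parameter count. A minimal sketch, assuming GiB means 2^30 bytes and that BPW is averaged over all weights in the file (metadata overhead is ignored):

```python
GIB = 1024 ** 3  # assumption: size is reported in binary gibibytes


def implied_params(size_gib: float, bpw: float) -> float:
    """Approximate weight count implied by file size and average bits per weight."""
    return size_gib * GIB * 8 / bpw


def implied_size_gib(n_params: float, bpw: float) -> float:
    """Approximate file size in GiB for a given weight count and average BPW."""
    return n_params * bpw / 8 / GIB


# Figures quoted in the post above for the IQ5_KS quant.
params = implied_params(77.341, 5.441)
print(f"implied parameters: {params / 1e9:.1f} B")
```

The same helper can be run in reverse with `implied_size_gib` to estimate the size of another quant once its average BPW is known.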
