Testing IQ3_KS
Tensor blk.77.ffn_up_exps.weight (size = 1225.00 MiB) buffer type overriden to CUDA_Host
model has unused tensor blk.78.attn_norm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.attn_q_a_norm.weight (size = 8192 bytes) -- ignoring
model has unused tensor blk.78.attn_kv_a_norm.weight (size = 2048 bytes) -- ignoring
model has unused tensor blk.78.attn_q_a.weight (size = 10420224 bytes) -- ignoring
model has unused tensor blk.78.attn_q_b.weight (size = 27787264 bytes) -- ignoring
model has unused tensor blk.78.attn_kv_a_mqa.weight (size = 3760128 bytes) -- ignoring
model has unused tensor blk.78.attn_output.weight (size = 83361792 bytes) -- ignoring
model has unused tensor blk.78.indexer.k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.78.indexer.k_norm.bias (size = 512 bytes) -- ignoring
model has unused tensor blk.78.indexer.proj.weight (size = 208896 bytes) -- ignoring
model has unused tensor blk.78.indexer.attn_k.weight (size = 835584 bytes) -- ignoring
model has unused tensor blk.78.indexer.attn_q_b.weight (size = 6946816 bytes) -- ignoring
model has unused tensor blk.78.ffn_norm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_inp.weight (size = 6291456 bytes) -- ignoring
model has unused tensor blk.78.exp_probs_b.bias (size = 1024 bytes) -- ignoring
Tensor blk.78.ffn_gate_exps.weight (size = 2018.00 MiB) buffer type overriden to CUDA_Host
model has unused tensor blk.78.ffn_gate_exps.weight (size = 2116026368 bytes) -- ignoring
Tensor blk.78.ffn_down_exps.weight (size = 2022.00 MiB) buffer type overriden to CUDA_Host
model has unused tensor blk.78.ffn_down_exps.weight (size = 2120220672 bytes) -- ignoring
Tensor blk.78.ffn_up_exps.weight (size = 2018.00 MiB) buffer type overriden to CUDA_Host
model has unused tensor blk.78.ffn_up_exps.weight (size = 2116026368 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_shexp.weight (size = 8265728 bytes) -- ignoring
model has unused tensor blk.78.ffn_down_shexp.weight (size = 8282112 bytes) -- ignoring
model has unused tensor blk.78.ffn_up_shexp.weight (size = 8265728 bytes) -- ignoring
model has unused tensor blk.78.nextn.eh_proj.weight (size = 80216064 bytes) -- ignoring
model has unused tensor blk.78.nextn.enorm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.nextn.hnorm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.nextn.shared_head_norm.weight (size = 24576 bytes) -- ignoring
Allocating 299.91 GiB of pinned host memory, this may take a while.
Using pinned host memory improves PP performance by a significant margin.
But if it takes too long for your model and amount of patience, kill the process and run using
GGML_CUDA_NO_PINNED=1 your_command_goes_here
done allocating 299.91 GiB in 84226.1 ms
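If that allocation takes too long, the workaround the log itself suggests is just an environment-variable prefix on whatever command you run; a minimal sketch (the binary and model path below are placeholders, only GGML_CUDA_NO_PINNED=1 comes from the log):

```bash
# Skip the pinned-host allocation entirely; prompt processing will be slower
# but load time drops. Binary and model path are placeholders.
GGML_CUDA_NO_PINNED=1 ./build/bin/llama-sweep-bench -m GLM-5.1-IQ3_KS-00001-of-00008.gguf
```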
llm_load_tensors: offloading 79 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 80/80 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 307110.47 MiB
llm_load_tensors: CUDA0 buffer size = 14498.95 MiB
...................................................................................................~ggml_backend_cuda_context: have 0 graphs
.
============ llm_prepare_mla: need to compute 79 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.1.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.2.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.3.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.4.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.5.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.6.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.7.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.8.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.9.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.10.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.11.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.12.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.13.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.14.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.15.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.16.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.17.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.18.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.19.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.20.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.21.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.22.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.23.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.24.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.25.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.26.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.27.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.28.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.29.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.30.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.31.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.32.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.33.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.34.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.35.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.36.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.37.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.38.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.39.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.40.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.41.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.42.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.43.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.44.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.45.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.46.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.47.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.48.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.49.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.50.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.51.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.52.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.53.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.54.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.55.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.56.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.57.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.58.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.59.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.60.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.61.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.62.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.63.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.64.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.65.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.66.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.67.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.68.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.69.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.70.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.71.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.72.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.73.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.74.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.75.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.76.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.77.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.78.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
llama_init_from_model: n_ctx = 131072
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: mla_attn = 3
llama_init_from_model: attn_max_b = 512
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 5967.04 MiB
llama_init_from_model: KV self size = 5967.00 MiB, c^KV (q8_0): 5967.00 MiB, kv^T: not used
llama_init_from_model: CUDA_Host output buffer size = 0.59 MiB
llama_init_from_model: CUDA0 compute buffer size = 3934.02 MiB
llama_init_from_model: CUDA_Host compute buffer size = 1120.05 MiB
llama_init_from_model: graph nodes = 30842
llama_init_from_model: graph splits = 152
llama_init_from_model: enabling only_active_experts scheduling
main: n_kv_max = 131072, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 8.638 | 474.17 | 73.191 | 13.99 |
| 4096 | 1024 | 4096 | 9.353 | 437.96 | 68.152 | 15.03 |
| 4096 | 1024 | 8192 | 9.690 | 422.72 | 70.148 | 14.60 |
| 4096 | 1024 | 12288 | 10.253 | 399.50 | 70.674 | 14.49 |
| 4096 | 1024 | 16384 | 10.882 | 376.41 | 91.145 | 11.23 |
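For reference, a sweep like the one above corresponds to an invocation roughly along these lines. This is a hedged reconstruction from the parameters the log prints: the numeric flags mirror the llama_init_from_model/main lines, while the model path and the -ot pattern that keeps the routed experts in (pinned) host memory are guesses.

```bash
# Hedged reconstruction of the benchmark run above (ik_llama.cpp).
# -c/-b/-ub, -fa, -mla 3, -amb 512, -fmoe, -ngl 99 and -t 101 mirror the log;
# the model path and the tensor-override regex are illustrative assumptions.
./build/bin/llama-sweep-bench \
  -m GLM-5.1-IQ3_KS-00001-of-00008.gguf \
  -c 131072 -b 4096 -ub 4096 \
  -fa -mla 3 -amb 512 -fmoe \
  -ngl 99 -t 101 \
  -ot "blk\..*\.ffn_.*_exps\.weight=CPU"
```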
Great to see you again! Looks good! Cheers!
Used it for a day for some work, looks like a solid quant.
For some reason kimi-cli & nano-coder broke entirely; opencode is the only client I tried that handled switching the same quant from GLM 5 to 5.1. They were all working fine on 5.0, but after the switch kimi-cli can't use any tools and just loops through 50+ calls. Funny enough, under kimi-cli it eventually found a way to jailbreak by running everything via python scripts. Pretty cool stuff. Maybe this is some nuance of the tool-use template kimi-cli/nano-coder use, or maybe the default template needs to be adjusted. Opencode works. Also, once context gets to ~70k, ik_llama prompt caching breaks and it decides it needs to reprocess the whole thing. Could be an opencode issue. These messages show up in the logs, maybe related, maybe not:
render_message_to_json: Neither string content nor typed content is supported by the template. This is unexpected and may lead to issues
I wonder if some template needs updating for 5.1.
Thanks for the report. I've only tested with opencode, which is working well in my <65k context tests. I get those messages too, and I always bake in the original upstream default chat template. You could probably try a custom one with llama-server --chat-template-file myCustomTemplate.jinja, but I'm not sure what to change.
I did see a custom template suggested for gemma-4-31b-it, and with the latest tokenizer fixes that finally seems to have it working for me: https://www.reddit.com/r/LocalLLaMA/comments/1sgl3qz/gemma_4_on_llamacpp_should_be_stable_now/ Interesting that there are two templates, one of them "interleaved". Maybe I should copy-paste the original GLM-5.1 chat template from https://huggingface.co/zai-org/GLM-5.1/blob/main/chat_template.jinja, give it the gemma 4 interleaved one as an example, and see if it can "fix itself" haha...
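If anyone wants to try that, the moving parts would be roughly as below; a sketch, assuming the HF raw-file URL convention for the template linked above (model path and output filename are placeholders):

```bash
# Grab the upstream GLM-5.1 chat template (raw counterpart of the blob URL above)
curl -L -o glm-5.1-chat-template.jinja \
  https://huggingface.co/zai-org/GLM-5.1/raw/main/chat_template.jinja

# Hand it to llama-server instead of the template baked into the GGUF
./build/bin/llama-server -m GLM-5.1-IQ3_KS-00001-of-00008.gguf \
  --chat-template-file glm-5.1-chat-template.jinja
```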
I always get an error when trying to run this quant; can someone check these details?
Error:
gguf_init_from_file_ptr: tensor 'output.weight' has invalid ggml type 141. should be in [0, 42)
gguf_init_from_file_ptr: failed to read tensor info
llama_model_load: error loading model: llama_model_loader: failed to load GGUF split from /models/GLM5.1/GLM-5.1-IQ3_KS-00001-of-00008.gguf
Model hashes (these match the HF pages):
61e1798dc098a2c442285963c7863ab0be43dcfe3cbfd9cab905b93ffccea9f2 GLM-5.1-IQ3_KS-00001-of-00008.gguf
01e8cb02bc9fcbb20e77a2439eec781e544ef3d27f6348ed06ee845a2d032fe7 GLM-5.1-IQ3_KS-00002-of-00008.gguf
f01b8508b6453c7f577abe08a2e1518e375a16ce0d6376dfc8feb547d7369b89 GLM-5.1-IQ3_KS-00003-of-00008.gguf
e171ebd288174e42d34408059d750a89ba95833016d2306ef874911d57058bbf GLM-5.1-IQ3_KS-00004-of-00008.gguf
eed179a22f4f7554274b5f2480abe42d576e2ee657aa1e8c59da9843f99895ef GLM-5.1-IQ3_KS-00005-of-00008.gguf
6f423703134b02d1dca257092d0ab0ca559315f05e0e9f67c510c8654d276272 GLM-5.1-IQ3_KS-00006-of-00008.gguf
afc13e4dda8dda7501d5708b12893ea9e4bf191d9e9c3f9627e689a4cd4949fb GLM-5.1-IQ3_KS-00007-of-00008.gguf
a5c14491398633fdd452f458f510cb0a54dd400e3c453dd8dda6987937ff715f GLM-5.1-IQ3_KS-00008-of-00008.gguf
llama.cpp versions:
self-compiled b8763-ff5ef8278
ggml-org's b8763 (https://github.com/ggml-org/llama.cpp/releases/tag/b8763)
unslothai's b1-d12cc3d (https://github.com/unslothai/llama.cpp/releases/tag/b8720)
All of these versions run GLM5 UD-Q3_K_XL and would run GLM5.1 UD-Q3_K_XL (except it is slightly too big for my setup).
If someone who has this quant working could check the model hashes, I would appreciate it, as I would any other pointer in the right direction.
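For anyone checking, the posted hashes look like SHA-256 (64 hex characters), so something like this against the list above should do it:

```bash
# Hash all eight splits; compare the output against the list posted above
sha256sum GLM-5.1-IQ3_KS-0000[1-8]-of-00008.gguf
```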
Are the IQ_K quants supported by llama.cpp? I thought those were mostly for ik_llama.
> Are the IQ_K quants supported by llama.cpp? I thought those were mostly for ik_llama.
You are right, my mistake. I thought all i-quant support had been merged in, but I guess I just got used to using the non-K ones and made a wrong assumption.
Yeah, check out the quickstart for how to get ik_llama.cpp going, or look at their README for instructions: https://github.com/ikawrakow/ik_llama.cpp/
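The usual build is short; a minimal sketch assuming a CUDA box (see the README for the current options):

```bash
# Minimal CUDA build of ik_llama.cpp; consult the README for current flags
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```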
Yes, it is confusing: ik made many of the quant types still used in mainline before starting his own fork to support newer quantization types (which is mostly what I work with).
There are some precompiled binaries around too, if you prefer.
