Testing IQ3_KS
Tensor blk.77.ffn_up_exps.weight (size = 1225.00 MiB) buffer type overriden to CUDA_Host
model has unused tensor blk.78.attn_norm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.attn_q_a_norm.weight (size = 8192 bytes) -- ignoring
model has unused tensor blk.78.attn_kv_a_norm.weight (size = 2048 bytes) -- ignoring
model has unused tensor blk.78.attn_q_a.weight (size = 10420224 bytes) -- ignoring
model has unused tensor blk.78.attn_q_b.weight (size = 27787264 bytes) -- ignoring
model has unused tensor blk.78.attn_kv_a_mqa.weight (size = 3760128 bytes) -- ignoring
model has unused tensor blk.78.attn_output.weight (size = 83361792 bytes) -- ignoring
model has unused tensor blk.78.indexer.k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.78.indexer.k_norm.bias (size = 512 bytes) -- ignoring
model has unused tensor blk.78.indexer.proj.weight (size = 208896 bytes) -- ignoring
model has unused tensor blk.78.indexer.attn_k.weight (size = 835584 bytes) -- ignoring
model has unused tensor blk.78.indexer.attn_q_b.weight (size = 6946816 bytes) -- ignoring
model has unused tensor blk.78.ffn_norm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_inp.weight (size = 6291456 bytes) -- ignoring
model has unused tensor blk.78.exp_probs_b.bias (size = 1024 bytes) -- ignoring
Tensor blk.78.ffn_gate_exps.weight (size = 2018.00 MiB) buffer type overriden to CUDA_Host
model has unused tensor blk.78.ffn_gate_exps.weight (size = 2116026368 bytes) -- ignoring
Tensor blk.78.ffn_down_exps.weight (size = 2022.00 MiB) buffer type overriden to CUDA_Host
model has unused tensor blk.78.ffn_down_exps.weight (size = 2120220672 bytes) -- ignoring
Tensor blk.78.ffn_up_exps.weight (size = 2018.00 MiB) buffer type overriden to CUDA_Host
model has unused tensor blk.78.ffn_up_exps.weight (size = 2116026368 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_shexp.weight (size = 8265728 bytes) -- ignoring
model has unused tensor blk.78.ffn_down_shexp.weight (size = 8282112 bytes) -- ignoring
model has unused tensor blk.78.ffn_up_shexp.weight (size = 8265728 bytes) -- ignoring
model has unused tensor blk.78.nextn.eh_proj.weight (size = 80216064 bytes) -- ignoring
model has unused tensor blk.78.nextn.enorm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.nextn.hnorm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.nextn.shared_head_norm.weight (size = 24576 bytes) -- ignoring
Allocating 299.91 GiB of pinned host memory, this may take a while.
Using pinned host memory improves PP performance by a significant margin.
But if it takes too long for your model and amount of patience, kill the process and run using
GGML_CUDA_NO_PINNED=1 your_command_goes_here
done allocating 299.91 GiB in 84226.1 ms
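If that allocation takes too long, the workaround the log itself suggests is just an environment-variable prefix on whatever command you run; a minimal sketch (the binary and model path below are placeholders, only GGML_CUDA_NO_PINNED=1 comes from the log):

```bash
# Skip the pinned-host allocation entirely; prompt processing will be slower
# but load time drops. Binary and model path are placeholders.
GGML_CUDA_NO_PINNED=1 ./build/bin/llama-sweep-bench -m GLM-5.1-IQ3_KS-00001-of-00008.gguf
```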
llm_load_tensors: offloading 79 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 80/80 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 307110.47 MiB
llm_load_tensors: CUDA0 buffer size = 14498.95 MiB
...................................................................................................~ggml_backend_cuda_context: have 0 graphs
.
============ llm_prepare_mla: need to compute 79 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.1.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.2.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.3.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.4.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.5.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.6.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.7.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.8.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.9.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.10.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.11.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.12.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.13.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.14.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.15.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.16.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.17.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.18.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.19.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.20.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.21.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.22.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.23.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.24.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.25.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.26.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.27.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.28.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.29.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.30.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.31.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.32.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.33.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.34.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.35.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.36.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.37.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.38.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.39.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.40.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.41.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.42.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.43.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.44.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.45.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.46.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.47.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.48.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.49.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.50.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.51.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.52.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.53.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.54.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.55.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.56.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.57.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.58.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.59.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.60.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.61.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.62.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.63.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.64.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.65.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.66.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.67.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.68.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.69.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.70.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.71.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.72.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.73.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.74.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.75.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.76.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.77.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.78.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
llama_init_from_model: n_ctx = 131072
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: mla_attn = 3
llama_init_from_model: attn_max_b = 512
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 5967.04 MiB
llama_init_from_model: KV self size = 5967.00 MiB, c^KV (q8_0): 5967.00 MiB, kv^T: not used
llama_init_from_model: CUDA_Host output buffer size = 0.59 MiB
llama_init_from_model: CUDA0 compute buffer size = 3934.02 MiB
llama_init_from_model: CUDA_Host compute buffer size = 1120.05 MiB
llama_init_from_model: graph nodes = 30842
llama_init_from_model: graph splits = 152
llama_init_from_model: enabling only_active_experts scheduling
main: n_kv_max = 131072, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 8.638 | 474.17 | 73.191 | 13.99 |
| 4096 | 1024 | 4096 | 9.353 | 437.96 | 68.152 | 15.03 |
| 4096 | 1024 | 8192 | 9.690 | 422.72 | 70.148 | 14.60 |
| 4096 | 1024 | 12288 | 10.253 | 399.50 | 70.674 | 14.49 |
| 4096 | 1024 | 16384 | 10.882 | 376.41 | 91.145 | 11.23 |
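For reference, a sweep like the one above corresponds to an invocation roughly along these lines. This is a hedged reconstruction from the parameters the log prints: the numeric flags mirror the llama_init_from_model/main lines, while the model path and the -ot pattern that keeps the routed experts in (pinned) host memory are guesses.

```bash
# Hedged reconstruction of the benchmark run above (ik_llama.cpp).
# -c/-b/-ub, -fa, -mla 3, -amb 512, -fmoe, -ngl 99 and -t 101 mirror the log;
# the model path and the tensor-override regex are illustrative assumptions.
./build/bin/llama-sweep-bench \
  -m GLM-5.1-IQ3_KS-00001-of-00008.gguf \
  -c 131072 -b 4096 -ub 4096 \
  -fa -mla 3 -amb 512 -fmoe \
  -ngl 99 -t 101 \
  -ot "blk\..*\.ffn_.*_exps\.weight=CPU"
```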
Great to see you again! Looks good! Cheers!
Used it for a day for some work, looks like a solid quant.
For some reason kimi-cli & nano-coder broke entirely; opencode is the only client I tried that handled switching the same quant from GLM 5 to 5.1. They were all working fine on 5.0, but after the switch kimi-cli can't use any tools and just loops through 50+ calls. Funny enough, under kimi-cli it eventually found a way to jailbreak by running everything via python scripts. Pretty cool stuff. Maybe this is some nuance of the tool-use template kimi-cli/nano-coder use, or maybe the default template needs to be adjusted. Opencode works. Also, once context gets to ~70k, ik_llama prompt caching breaks and it decides it needs to reprocess the whole thing. Could be an opencode issue. These messages show up in the logs, maybe related, maybe not:
render_message_to_json: Neither string content nor typed content is supported by the template. This is unexpected and may lead to issues
I wonder if some template needs updating for 5.1.
Thanks for the report. I've only tested with opencode, which is working well in my <65k context tests. I get those messages too, and I always bake in the original upstream default chat template. You could probably try a custom one with llama-server --chat-template-file myCustomTemplate.jinja, but I'm not sure what to change.
I did see a custom template suggested for gemma-4-31b-it, and with the latest tokenizer fixes that finally seems to have it working for me: https://www.reddit.com/r/LocalLLaMA/comments/1sgl3qz/gemma_4_on_llamacpp_should_be_stable_now/ Interesting that there are two templates, one of them "interleaved". Maybe I should copy-paste the original GLM-5.1 chat template from https://huggingface.co/zai-org/GLM-5.1/blob/main/chat_template.jinja, give it the gemma 4 interleaved one as an example, and see if it can "fix itself" haha...
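If anyone wants to try that, the moving parts would be roughly as below; a sketch, assuming the HF raw-file URL convention for the template linked above (model path and output filename are placeholders):

```bash
# Grab the upstream GLM-5.1 chat template (raw counterpart of the blob URL above)
curl -L -o glm-5.1-chat-template.jinja \
  https://huggingface.co/zai-org/GLM-5.1/raw/main/chat_template.jinja

# Hand it to llama-server instead of the template baked into the GGUF
./build/bin/llama-server -m GLM-5.1-IQ3_KS-00001-of-00008.gguf \
  --chat-template-file glm-5.1-chat-template.jinja
```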
I always get an error when trying to run this quant; can someone check these details?
Error:
gguf_init_from_file_ptr: tensor 'output.weight' has invalid ggml type 141. should be in [0, 42)
gguf_init_from_file_ptr: failed to read tensor info
llama_model_load: error loading model: llama_model_loader: failed to load GGUF split from /models/GLM5.1/GLM-5.1-IQ3_KS-00001-of-00008.gguf
Model hashes (these match the HF pages):
61e1798dc098a2c442285963c7863ab0be43dcfe3cbfd9cab905b93ffccea9f2 GLM-5.1-IQ3_KS-00001-of-00008.gguf
01e8cb02bc9fcbb20e77a2439eec781e544ef3d27f6348ed06ee845a2d032fe7 GLM-5.1-IQ3_KS-00002-of-00008.gguf
f01b8508b6453c7f577abe08a2e1518e375a16ce0d6376dfc8feb547d7369b89 GLM-5.1-IQ3_KS-00003-of-00008.gguf
e171ebd288174e42d34408059d750a89ba95833016d2306ef874911d57058bbf GLM-5.1-IQ3_KS-00004-of-00008.gguf
eed179a22f4f7554274b5f2480abe42d576e2ee657aa1e8c59da9843f99895ef GLM-5.1-IQ3_KS-00005-of-00008.gguf
6f423703134b02d1dca257092d0ab0ca559315f05e0e9f67c510c8654d276272 GLM-5.1-IQ3_KS-00006-of-00008.gguf
afc13e4dda8dda7501d5708b12893ea9e4bf191d9e9c3f9627e689a4cd4949fb GLM-5.1-IQ3_KS-00007-of-00008.gguf
a5c14491398633fdd452f458f510cb0a54dd400e3c453dd8dda6987937ff715f GLM-5.1-IQ3_KS-00008-of-00008.gguf
llama.cpp versions:
self-compiled b8763-ff5ef8278
ggml-org's b8763 (https://github.com/ggml-org/llama.cpp/releases/tag/b8763)
unslothai's b1-d12cc3d (https://github.com/unslothai/llama.cpp/releases/tag/b8720)
All of these versions run GLM5 UD-Q3_K_XL and would run GLM5.1 UD-Q3_K_XL (except it is slightly too big for my setup).
If someone who has this quant working could check the model hashes, I would appreciate it, as I would any other pointer in the right direction.
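For anyone checking, the posted hashes look like SHA-256 (64 hex characters), so something like this against the list above should do it:

```bash
# Hash all eight splits; compare the output against the list posted above
sha256sum GLM-5.1-IQ3_KS-0000[1-8]-of-00008.gguf
```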
Are the IQ_K quants supported by llama.cpp? I thought those were mostly for ik_llama.
> Are the IQ_K quants supported by llama.cpp? I thought those were mostly for ik_llama.
You are right, my mistake. I thought all i-quant support had been merged in, but I guess I just got used to using the non-K ones and made a wrong assumption.
Yeah, check out the quickstart for how to get ik_llama.cpp going, or look at their README for instructions: https://github.com/ikawrakow/ik_llama.cpp/
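The usual build is short; a minimal sketch assuming a CUDA box (see the README for the current options):

```bash
# Minimal CUDA build of ik_llama.cpp; consult the README for current flags
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```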
Yes, it is confusing: ik made many of the quant types still used in mainline before starting his own fork to support newer quantization types (which is mostly what I work with).
There are some precompiled binaries around too, if you prefer.
