Testing smol-IQ4_K

#4
by shewin - opened

Computed blk.78.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
llama_init_from_model: n_ctx = 101120
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: mla_attn = 3
llama_init_from_model: attn_max_b = 512
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 4603.49 MiB
llama_init_from_model: KV self size = 4603.45 MiB, c^KV (q8_0): 4603.45 MiB, kv^T: not used
llama_init_from_model: CUDA_Host output buffer size = 0.59 MiB
llama_init_from_model: CUDA0 compute buffer size = 4706.02 MiB
llama_init_from_model: CUDA_Host compute buffer size = 886.05 MiB
llama_init_from_model: graph nodes = 30920
llama_init_from_model: graph splits = 152
llama_init_from_model: enabling only_active_experts scheduling
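As a rough sanity check on the KV self size reported in the log above, here is a back-of-envelope sketch. The latent dimension of 512 (matching the 512 x 28672 attn_kv_b shape in the log) plus 64 RoPE dims per token per layer, and 78 KV-bearing layers, are my assumptions, not values taken from the log; the q8_0 byte cost (32 values stored in 34 bytes) is the standard GGML block layout.

```python
# Hedged estimate of the MLA c^KV cache size reported in the log.
# Assumptions (not from the log): 512 latent + 64 RoPE dims per token
# per layer, and 78 KV-bearing layers.
n_ctx = 101120                 # from llama_init_from_model
latent_dim = 512 + 64          # assumed: kv_lora_rank + RoPE head dim
n_layers = 78                  # assumed number of KV-bearing layers
q8_0_bytes_per_elem = 34 / 32  # q8_0: 32 elements stored in 34 bytes

size_mib = n_ctx * latent_dim * q8_0_bytes_per_elem * n_layers / 2**20
print(f"{size_mib:.2f} MiB")   # lands close to the logged 4603.45 MiB
```

If the assumed dimensions are right, the estimate lands within a fraction of a MiB of the logged 4603.45 MiB, which is why MLA KV caches stay so small even at 100k context.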

main: n_kv_max = 101120, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|-------:|---------:|
| 4096 | 1024 | 0 | 10.636 | 385.12 | 89.402 | 11.45 |
| 4096 | 1024 | 4096 | 11.322 | 361.76 | 79.712 | 12.85 |
| 4096 | 1024 | 8192 | 11.665 | 351.13 | 81.079 | 12.63 |
| 4096 | 1024 | 12288 | 12.220 | 335.17 | 82.227 | 12.45 |
| 4096 | 1024 | 16384 | 12.856 | 318.61 | 90.644 | 11.30 |
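For anyone reading the table, the S_PP and S_TG columns are simply tokens divided by elapsed time; a quick sketch recomputing them from the first row (numbers taken directly from the table above):

```python
# Recompute the throughput columns from the N_KV = 0 benchmark row.
pp, tg = 4096, 1024           # prompt tokens processed / tokens generated
t_pp, t_tg = 10.636, 89.402   # elapsed seconds from the table

s_pp = pp / t_pp  # prompt-processing speed, tokens/s
s_tg = tg / t_tg  # token-generation speed, tokens/s
print(f"S_PP = {s_pp:.2f} t/s, S_TG = {s_tg:.2f} t/s")
```

The recomputed values match the table's 385.12 and 11.45 up to rounding.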

2026-04-10_14-59
It took a long, long time, but the results are very good: better than IQ3.

Download in progress!
Thanks a lot for this quant!

I have a crash with:
~/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/GLM-5.1-GGUF/GLM-5.1-smol-IQ4_K-00001-of-00010.gguf --alias GLM-5.1-IQ4_K --host 0.0.0.0 --port 8080 --ctx-size 102400 --threads 32 --threads-batch 64 --batch-size 512 --ubatch-size 512 --parallel 1 --flash-attn on --n-gpu-layers 999 --fit --fit-margin 17200,17200,17200,17200,17200,17200,17200,21000 -ctk q4_0 -ctv q4_0 -khad -vhad --graph-reuse --no-mmap
It crashes at inference. I tried a lot of parameter combinations but it still crashes, so I'm opening an issue on ik_llama.cpp ...

@martossien

  1. To get started, leave the batch sizes at their defaults, i.e. don't pass -ub/-b at all, or pass -ub 512 -b 2048, which are the defaults. Once you have it running, you'll want to increase to -ub 1024 -b 2048 or so to improve PP if possible.
  2. You pretty much always want to use -muge --merge-qkv on every model; it will just do the right thing and give you a good speed boost.
  3. I have no idea how the --fit stuff works; my guess is your issue is there, and you might be able to go back to the older-style -ot ... overrides.
  4. I'm not sure about graph-reuse, or how it would affect this model or not... maybe try without it?
  5. This is an MLA model, so leave off all the -ctk/-ctv/-khad/-vhad stuff, as it doesn't work the same with MLA attention (deepseek/kimi-k2.5/glm-5's). I'd recommend going with -mla 3 -amb 512 -ctk q8_0.

It works with:
~$ ~/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/GLM-5.1-GGUF/GLM-5.1-smol-IQ4_K-00001-of-00010.gguf --alias GLM-5.1-IQ4_K --host 0.0.0.0 --port 8080 --ctx-size 102400 --no-mmap --threads 32 --threads-batch 64 --batch-size 1024 --ubatch-size 1024 --parallel 1 --flash-attn on --n-gpu-layers 999 --split-mode graph --split-mode-graph-scheduling --tensor-split 1,1,1,1,0.9,1,1,0.75 --cpu-moe --n-cpu-moe 248 --cache-type-k q6_0 --cache-type-v q4_0 --k-cache-hadamard --graph-reuse -muge --cache-ram 32768 --jinja

I'm launching the test with your parameters, plus some other tests ...

@martossien

Let's see, this MLA model does not support -sm graph, so you could probably remove both --split-mode graph and --split-mode-graph-scheduling...

You can remove --cache-type-v q4_0, and honestly I don't know whether -khad works with MLA models.

Also, you can add --merge-qkv for maybe a 1% boost.

Thanks for workshopping your commands!

First of all, a huge thank you for this quantization work: GLM-5.1 at IQ4_K is impressive and the GGUF packaging is very clean. This took me quite a bit of debugging time to get running, but I finally got there!

Bug found in ik_llama.cpp + workaround (--n-gpu-layers < number of model layers)

I hit a NaN crash on first token generation (all logits = NaN, "Failed to sample token") with ik_llama.cpp. I filed the bug here: ikawrakow/ik_llama.cpp#1616. It appears to come from the CPU FLASH_ATTN_EXT q4_0/q8_2_x4 fallback path with the GLM-DSA architecture.

I found a working configuration that avoids the crash and fully unlocks the 100k context window:
~/ik_llama.cpp/build/bin/llama-server
--model "$MODEL_PATH"
--alias GLM-5.1-IQ4_K
--host 0.0.0.0
--port 8080
--ctx-size 102400
--no-mmap
--threads 32
--threads-batch 64
--batch-size 2048
--ubatch-size 512
--parallel 1
--flash-attn on
--n-gpu-layers 16
--tensor-split 0.8,1,1,1,0.9,1,1,1
--cpu-moe
--n-cpu-moe 999
--cache-type-k q4_0
-khad
--merge-qkv
-muge
--cache-ram 32768
--jinja

VRAM growth analysis

I also added timestamps to the ik_llama.cpp logs and wrote a script to measure VRAM growth during inference. Key findings:
- Growth is approximately ~0.59 GiB per 10k tokens per GPU
- This growth is not the static VRAM already allocated at model load
- It is not just the static KV self size
- It is not due to host-side buffers (KQ_mask, host output buffer)
- It corresponds to CUDA VMM pool growth during execution
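The growth-rate figure above can be reduced from simple (tokens, VRAM) samples. Here is a minimal sketch of that reduction step, assuming samples have already been collected, e.g. by polling nvidia-smi during inference; all names here are hypothetical and this is not the author's actual script:

```python
def growth_gib_per_10k(samples):
    """Least-squares slope of VRAM (MiB) vs. token count,
    scaled to GiB per 10k tokens.

    samples: list of (n_tokens, vram_mib) pairs collected during
    inference (hypothetical measurement setup).
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    var = sum((x - mean_x) ** 2 for x, _ in samples)
    mib_per_token = cov / var
    return mib_per_token * 10_000 / 1024  # MiB/token -> GiB per 10k tokens

# Synthetic example: VRAM growing at exactly 0.59 GiB per 10k tokens.
pts = [(t, 20000 + 0.59 * 1024 * t / 10_000) for t in range(0, 50001, 5000)]
print(round(growth_gib_per_10k(pts), 2))  # -> 0.59
```

A least-squares fit is preferable to just differencing the first and last samples, since the VMM pool grows in discrete chunks rather than smoothly.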

Performance results (tested with OpenCode)

| Metric | Speed |
|---|---|
| Generation | 6–8 t/s |
| Prompt processing | 25–35 t/s |

The full 100k context is usable with the config above on 8× RTX 3090 (24 GiB each).
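At those prompt-processing speeds, a back-of-envelope estimate of how long a full prefill of the 102400-token context takes (just arithmetic on the numbers above):

```python
# Rough prefill-time estimate from the measured PP speed range.
ctx = 102400  # --ctx-size from the working command
for pp_speed in (25, 35):  # measured prompt-processing range, tokens/s
    minutes = ctx / pp_speed / 60
    print(f"{pp_speed} t/s -> {minutes:.0f} min to prefill {ctx} tokens")
```

So a completely cold 100k-token prompt takes on the order of 49 to 68 minutes at these speeds, which is why prompt caching (--cache-ram) matters so much here.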

Although this debug session took a fair amount of time, it was totally worth it. Now I'm moving on to test your Minimax 2.7 quantization; looking forward to it!
Thanks again for all the effort you put into these GGUFs!

@martossien

Nice! Keep in mind that MLA-style attention models like GLM-5.1, DeepSeek, etc. are already kind of "compressed" due to the latent attention mechanism. So I recommend against going below -ctk q8_0, and ideally leave it at full -ctk f16, as it is already fairly efficient.
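To put that cache-type trade-off in rough numbers: here is a sketch scaling the logged 4603.45 MiB q8_0 c^KV size to other cache types, using the standard GGML block layouts for f16, q8_0, and q4_0 (I'm leaving out q6_0 since I'm not certain of its exact block size in ik_llama.cpp).

```python
# Approximate bytes per element for GGML K-cache types (standard layouts).
bytes_per_elem = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 values in 34 bytes (scale + 32 int8)
    "q4_0": 18 / 32,  # 32 values in 18 bytes (scale + 16 packed bytes)
}

baseline_mib = 4603.45  # q8_0 c^KV size from the log above
for t, b in bytes_per_elem.items():
    mib = baseline_mib * b / bytes_per_elem["q8_0"]
    print(f"-ctk {t}: ~{mib:,.0f} MiB")
```

Even at full f16, the MLA latent cache for 100k context stays under ~9 GiB, so the VRAM saved by dropping to q4_0 is modest compared with the quality risk.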
