Testing smol-IQ4_K

#4
by shewin - opened

Computed blk.78.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
llama_init_from_model: n_ctx = 101120
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: mla_attn = 3
llama_init_from_model: attn_max_b = 512
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 4603.49 MiB
llama_init_from_model: KV self size = 4603.45 MiB, c^KV (q8_0): 4603.45 MiB, kv^T: not used
llama_init_from_model: CUDA_Host output buffer size = 0.59 MiB
llama_init_from_model: CUDA0 compute buffer size = 4706.02 MiB
llama_init_from_model: CUDA_Host compute buffer size = 886.05 MiB
llama_init_from_model: graph nodes = 30920
llama_init_from_model: graph splits = 152
llama_init_from_model: enabling only_active_experts scheduling
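As a rough sanity check on the KV self size reported in the log above, here is a back-of-envelope sketch. The latent dimension of 512 (matching the 512 x 28672 attn_kv_b shape in the log) plus 64 RoPE dims per token per layer, and 78 KV-bearing layers, are my assumptions, not values taken from the log; the q8_0 byte cost (32 values stored in 34 bytes) is the standard GGML block layout.

```python
# Hedged estimate of the MLA c^KV cache size reported in the log.
# Assumptions (not from the log): 512 latent + 64 RoPE dims per token
# per layer, and 78 KV-bearing layers.
n_ctx = 101120                 # from llama_init_from_model
latent_dim = 512 + 64          # assumed: kv_lora_rank + RoPE head dim
n_layers = 78                  # assumed number of KV-bearing layers
q8_0_bytes_per_elem = 34 / 32  # q8_0: 32 elements stored in 34 bytes

size_mib = n_ctx * latent_dim * q8_0_bytes_per_elem * n_layers / 2**20
print(f"{size_mib:.2f} MiB")   # lands close to the logged 4603.45 MiB
```

If the assumed dimensions are right, the estimate lands within a fraction of a MiB of the logged 4603.45 MiB, which is why MLA KV caches stay so small even at 100k context.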

main: n_kv_max = 101120, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|-------:|---------:|
| 4096 | 1024 | 0 | 10.636 | 385.12 | 89.402 | 11.45 |
| 4096 | 1024 | 4096 | 11.322 | 361.76 | 79.712 | 12.85 |
| 4096 | 1024 | 8192 | 11.665 | 351.13 | 81.079 | 12.63 |
| 4096 | 1024 | 12288 | 12.220 | 335.17 | 82.227 | 12.45 |
| 4096 | 1024 | 16384 | 12.856 | 318.61 | 90.644 | 11.30 |
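For anyone reading the table, the S_PP and S_TG columns are simply tokens divided by elapsed time; a quick sketch recomputing them from the first row (numbers taken directly from the table above):

```python
# Recompute the throughput columns from the N_KV = 0 benchmark row.
pp, tg = 4096, 1024           # prompt tokens processed / tokens generated
t_pp, t_tg = 10.636, 89.402   # elapsed seconds from the table

s_pp = pp / t_pp  # prompt-processing speed, tokens/s
s_tg = tg / t_tg  # token-generation speed, tokens/s
print(f"S_PP = {s_pp:.2f} t/s, S_TG = {s_tg:.2f} t/s")
```

The recomputed values match the table's 385.12 and 11.45 up to rounding.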

2026-04-10_14-59
It took a long, long time, but the results are very good: better than IQ3.

Download in progress!
Thanks a lot for this quant!

I have a crash with:
~/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/GLM-5.1-GGUF/GLM-5.1-smol-IQ4_K-00001-of-00010.gguf --alias GLM-5.1-IQ4_K --host 0.0.0.0 --port 8080 --ctx-size 102400 --threads 32 --threads-batch 64 --batch-size 512 --ubatch-size 512 --parallel 1 --flash-attn on --n-gpu-layers 999 --fit --fit-margin 17200,17200,17200,17200,17200,17200,17200,21000 -ctk q4_0 -ctv q4_0 -khad -vhad --graph-reuse --no-mmap
It crashes at inference. I tried a lot of parameter combinations but it still crashes, so I'm opening an issue on ik_llama.cpp ...

@martossien

  1. To get started, leave the batch sizes at their defaults, i.e. don't pass -ub/-b at all, or pass -ub 512 -b 2048, which are the defaults. Once you have it running, you'll want to increase to -ub 1024 -b 2048 or so to improve PP if possible.
  2. You pretty much always want to use -muge --merge-qkv on every model; it will just do the right thing and give you a good speed boost.
  3. I have no idea how the --fit stuff works; my guess is your issue is there, and you might be able to go back to the older-style -ot ... overrides.
  4. I'm not sure about graph-reuse, or how it would affect this model or not... maybe try without it?
  5. This is an MLA model, so leave off all the -ctk/-ctv/-khad/-vhad stuff, as it doesn't work the same with MLA attention (deepseek/kimi-k2.5/glm-5's). I'd recommend going with -mla 3 -amb 512 -ctk q8_0.

It works with:
~$ ~/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/GLM-5.1-GGUF/GLM-5.1-smol-IQ4_K-00001-of-00010.gguf --alias GLM-5.1-IQ4_K --host 0.0.0.0 --port 8080 --ctx-size 102400 --no-mmap --threads 32 --threads-batch 64 --batch-size 1024 --ubatch-size 1024 --parallel 1 --flash-attn on --n-gpu-layers 999 --split-mode graph --split-mode-graph-scheduling --tensor-split 1,1,1,1,0.9,1,1,0.75 --cpu-moe --n-cpu-moe 248 --cache-type-k q6_0 --cache-type-v q4_0 --k-cache-hadamard --graph-reuse -muge --cache-ram 32768 --jinja

I'm launching the test with your parameters, plus some other tests ...

@martossien

Let's see, this MLA model does not support -sm graph, so you could probably remove both --split-mode graph and --split-mode-graph-scheduling...

You can remove --cache-type-v q4_0, and honestly I don't know whether -khad works with MLA models.

Also, you can add --merge-qkv for maybe a 1% boost.

Thanks for workshopping your commands!

First of all, a huge thank you for this quantization work: GLM-5.1 at IQ4_K is impressive and the GGUF packaging is very clean. This took me quite a bit of debugging time to get running, but I finally got there!

Bug found in ik_llama.cpp + workaround (--n-gpu-layers < number of model layers)

I hit a NaN crash on first token generation (all logits = NaN, "Failed to sample token") with ik_llama.cpp. I filed the bug here: ikawrakow/ik_llama.cpp#1616. It appears to come from the CPU FLASH_ATTN_EXT q4_0/q8_2_x4 fallback path with the GLM-DSA architecture.

I found a working configuration that avoids the crash and fully unlocks the 100k context window:
~/ik_llama.cpp/build/bin/llama-server
--model "$MODEL_PATH"
--alias GLM-5.1-IQ4_K
--host 0.0.0.0
--port 8080
--ctx-size 102400
--no-mmap
--threads 32
--threads-batch 64
--batch-size 2048
--ubatch-size 512
--parallel 1
--flash-attn on
--n-gpu-layers 16
--tensor-split 0.8,1,1,1,0.9,1,1,1
--cpu-moe
--n-cpu-moe 999
--cache-type-k q4_0
-khad
--merge-qkv
-muge
--cache-ram 32768
--jinja

VRAM growth analysis

I also added timestamps to the ik_llama.cpp logs and wrote a script to measure VRAM growth during inference. Key findings:
- Growth is approximately ~0.59 GiB per 10k tokens per GPU
- This growth is not the static VRAM already allocated at model load
- It is not just the static KV self size
- It is not due to host-side buffers (KQ_mask, host output buffer)
- It corresponds to CUDA VMM pool growth during execution
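The growth-rate figure above can be reduced from simple (tokens, VRAM) samples. Here is a minimal sketch of that reduction step, assuming samples have already been collected, e.g. by polling nvidia-smi during inference; all names here are hypothetical and this is not the author's actual script:

```python
def growth_gib_per_10k(samples):
    """Least-squares slope of VRAM (MiB) vs. token count,
    scaled to GiB per 10k tokens.

    samples: list of (n_tokens, vram_mib) pairs collected during
    inference (hypothetical measurement setup).
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    var = sum((x - mean_x) ** 2 for x, _ in samples)
    mib_per_token = cov / var
    return mib_per_token * 10_000 / 1024  # MiB/token -> GiB per 10k tokens

# Synthetic example: VRAM growing at exactly 0.59 GiB per 10k tokens.
pts = [(t, 20000 + 0.59 * 1024 * t / 10_000) for t in range(0, 50001, 5000)]
print(round(growth_gib_per_10k(pts), 2))  # -> 0.59
```

A least-squares fit is preferable to just differencing the first and last samples, since the VMM pool grows in discrete chunks rather than smoothly.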

Performance results (tested with OpenCode)

| Metric | Speed |
|---|---|
| Generation | 6–8 t/s |
| Prompt processing | 25–35 t/s |

The full 100k context is usable with the config above on 8× RTX 3090 (24 GiB each).
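At those prompt-processing speeds, a back-of-envelope estimate of how long a full prefill of the 102400-token context takes (just arithmetic on the numbers above):

```python
# Rough prefill-time estimate from the measured PP speed range.
ctx = 102400  # --ctx-size from the working command
for pp_speed in (25, 35):  # measured prompt-processing range, tokens/s
    minutes = ctx / pp_speed / 60
    print(f"{pp_speed} t/s -> {minutes:.0f} min to prefill {ctx} tokens")
```

So a completely cold 100k-token prompt takes on the order of 49 to 68 minutes at these speeds, which is why prompt caching (--cache-ram) matters so much here.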

Although this debug session took a fair amount of time, it was totally worth it. Now I'm moving on to test your Minimax 2.7 quantization; looking forward to it!
Thanks again for all the effort you put into these GGUFs!

@martossien

Nice! Keep in mind that MLA-style attention models like GLM-5.1, DeepSeek, etc. are already kind of "compressed" due to the latent attention mechanism. So I recommend against going below -ctk q8_0, and ideally leave it at full -ctk f16, as it is already fairly efficient.
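To put that cache-type trade-off in rough numbers: here is a sketch scaling the logged 4603.45 MiB q8_0 c^KV size to other cache types, using the standard GGML block layouts for f16, q8_0, and q4_0 (I'm leaving out q6_0 since I'm not certain of its exact block size in ik_llama.cpp).

```python
# Approximate bytes per element for GGML K-cache types (standard layouts).
bytes_per_elem = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 values in 34 bytes (scale + 32 int8)
    "q4_0": 18 / 32,  # 32 values in 18 bytes (scale + 16 packed bytes)
}

baseline_mib = 4603.45  # q8_0 c^KV size from the log above
for t, b in bytes_per_elem.items():
    mib = baseline_mib * b / bytes_per_elem["q8_0"]
    print(f"-ctk {t}: ~{mib:,.0f} MiB")
```

Even at full f16, the MLA latent cache for 100k context stays under ~9 GiB, so the VRAM saved by dropping to q4_0 is modest compared with the quality risk.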
