Testing smol-IQ4_K
Computed blk.78.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
llama_init_from_model: n_ctx = 101120
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: mla_attn = 3
llama_init_from_model: attn_max_b = 512
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 4603.49 MiB
llama_init_from_model: KV self size = 4603.45 MiB, c^KV (q8_0): 4603.45 MiB, kv^T: not used
llama_init_from_model: CUDA_Host output buffer size = 0.59 MiB
llama_init_from_model: CUDA0 compute buffer size = 4706.02 MiB
llama_init_from_model: CUDA_Host compute buffer size = 886.05 MiB
llama_init_from_model: graph nodes = 30920
llama_init_from_model: graph splits = 152
llama_init_from_model: enabling only_active_experts scheduling
main: n_kv_max = 101120, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 10.636 | 385.12 | 89.402 | 11.45 |
| 4096 | 1024 | 4096 | 11.322 | 361.76 | 79.712 | 12.85 |
| 4096 | 1024 | 8192 | 11.665 | 351.13 | 81.079 | 12.63 |
| 4096 | 1024 | 12288 | 12.220 | 335.17 | 82.227 | 12.45 |
| 4096 | 1024 | 16384 | 12.856 | 318.61 | 90.644 | 11.30 |
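For what it's worth, the S_PP and S_TG columns in the sweep-bench table above are just token counts divided by the timing columns, and the KV self size from the load log works out to roughly 46 KiB per token of context. A quick sanity check (numbers copied from the table and log above; this is just arithmetic, not a re-run):

```python
# Sanity-check the throughput columns of the sweep-bench table above:
# S_PP = PP / T_PP and S_TG = TG / T_TG (tokens per second).
rows = [
    # (PP, TG, N_KV, T_PP, S_PP, T_TG, S_TG) -- first and last rows of the table
    (4096, 1024, 0,     10.636, 385.12, 89.402, 11.45),
    (4096, 1024, 16384, 12.856, 318.61, 90.644, 11.30),
]
for pp, tg, n_kv, t_pp, s_pp, t_tg, s_tg in rows:
    assert abs(pp / t_pp - s_pp) < 0.01 * s_pp  # prompt-processing speed checks out
    assert abs(tg / t_tg - s_tg) < 0.01 * s_tg  # token-generation speed checks out

# KV self size per token: 4603.45 MiB spread over the 101120-token context.
kib_per_token = 4603.45 * 1024 / 101120
print(f"{kib_per_token:.1f} KiB per token")  # prints "46.6 KiB per token"
```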
Download in progress!
Thanks a lot for this quant!
I have a crash with:
~/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/GLM-5.1-GGUF/GLM-5.1-smol-IQ4_K-00001-of-00010.gguf --alias GLM-5.1-IQ4_K --host 0.0.0.0 --port 8080 --ctx-size 102400 --threads 32 --threads-batch 64 --batch-size 512 --ubatch-size 512 --parallel 1 --flash-attn on --n-gpu-layers 999 --fit --fit-margin 17200,17200,17200,17200,17200,17200,17200,21000 -ctk q4_0 -ctv q4_0 -khad -vhad --graph-reuse --no-mmap
It crashes at inference time. I tried a lot of parameters but it still crashes; I'm opening an issue on ik_llama.cpp...
- Just to get started, leave the batch sizes at their defaults, i.e. don't include `-ub`/`-b`, or use `-ub 512 -b 2048`, which are the defaults. Eventually, after you get it going, you'll want to increase to `-ub 1024 -b 2048` or so to improve PP if possible.
- You pretty much always want to use `-muge --merge-qkv` on every model; it will just do the right thing and get you a good speed boost.
- I have no idea how the `--fit` stuff works; my guess is your issue is there, and you might be able to go back to the older-style `-ot ...` stuff.
- I'm not sure about `--graph-reuse`, or how it would affect this model... maybe try without it?
- This is an MLA model, so leave off all the `-ctk`/`-ctv`/`-khad`/`-vhad` stuff, as it doesn't work the same with MLA attention (deepseek/kimi-k2.5/glm-5's)... I'd recommend going with `-mla 3 -amb 512 -ctk q8_0`.
It works with:
~/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/GLM-5.1-GGUF/GLM-5.1-smol-IQ4_K-00001-of-00010.gguf --alias GLM-5.1-IQ4_K --host 0.0.0.0 --port 8080 --ctx-size 102400 --no-mmap --threads 32 --threads-batch 64 --batch-size 1024 --ubatch-size 1024 --parallel 1 --flash-attn on --n-gpu-layers 999 --split-mode graph --split-mode-graph-scheduling --tensor-split 1,1,1,1,0.9,1,1,0.75 --cpu-moe --n-cpu-moe 248 --cache-type-k q6_0 --cache-type-v q4_0 --k-cache-hadamard --graph-reuse -muge --cache-ram 32768 --jinja
I'm launching the test... with your parameters and some other variations...
Let's see: this MLA model does not support `-sm graph`, so you could probably remove both `--split-mode graph` and `--split-mode-graph-scheduling`...
You can remove `--cache-type-v q4_0`, and honestly I don't know if `-khad` works with MLA models.
Also, you can add `--merge-qkv` to maybe get a 1% boost.
thanks for workshopping your commands!
First of all, a huge thank you for this quantization work: GLM-5.1 at IQ4_K is impressive and the GGUF packaging is very clean. This took me quite a bit of debugging time to get running, but I finally got there!
Bug found in ik_llama.cpp + workaround (`--n-gpu-layers` below the model's layer count)
I hit a NaN crash on first token generation (all logits = NaN, `Failed to sample token`) with ik_llama.cpp. I filed the bug here: ikawrakow/ik_llama.cpp#1616. It appears to come from the CPU `FLASH_ATTN_EXT` q4_0/q8_2_x4 fallback path with the GLM-DSA architecture.
I found a working configuration that avoids the crash and fully unlocks the 100k context window:
~/ik_llama.cpp/build/bin/llama-server \
  --model "$MODEL_PATH" \
  --alias GLM-5.1-IQ4_K \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 102400 \
  --no-mmap \
  --threads 32 \
  --threads-batch 64 \
  --batch-size 2048 \
  --ubatch-size 512 \
  --parallel 1 \
  --flash-attn on \
  --n-gpu-layers 16 \
  --tensor-split 0.8,1,1,1,0.9,1,1,1 \
  --cpu-moe \
  --n-cpu-moe 999 \
  --cache-type-k q4_0 \
  -khad \
  --merge-qkv \
  -muge \
  --cache-ram 32768 \
  --jinja
VRAM growth analysis
I also added timestamps to ik_llama.cpp logs and wrote a script to measure VRAM growth during inference. Key findings:
- Growth is roughly 0.59 GiB per 10k tokens per GPU
- This growth is not the static VRAM already allocated at model load
- It is not just the static KV self size
- It is not due to host-side buffers (KQ_mask, host output buffer)
- It corresponds to CUDA VMM pool growth during execution
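Taking the ~0.59 GiB per 10k tokens per GPU figure at face value, here is a back-of-the-envelope projection of what the VMM pool growth costs at the full context size (a rough linear extrapolation from my measurements, not an additional measurement; the GPU count assumes the 8× RTX 3090 setup used for these tests):

```python
# Rough extrapolation of the measured CUDA VMM pool growth.
growth_gib_per_10k = 0.59   # measured: GiB per 10k tokens, per GPU
ctx_tokens = 102_400        # matches --ctx-size in the command above
n_gpus = 8                  # 8x RTX 3090

per_gpu = growth_gib_per_10k * ctx_tokens / 10_000
total = per_gpu * n_gpus
print(f"~{per_gpu:.1f} GiB per GPU, ~{total:.1f} GiB total at full context")
# prints "~6.0 GiB per GPU, ~48.3 GiB total at full context"
```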
Performance results (tested with OpenCode)
| Metric | Speed |
|---|---|
| Generation | 6–8 tokens/s |
| Prompt processing | 25–35 tokens/s |
Full 100k context is usable with the config above on 8× RTX 3090 (24 GiB each).
Although this debug session took a fair amount of time, it was totally worth it. Now I'm moving on to test your Minimax 2.7 quantization; looking forward to it!
Thanks again for all the effort you put into these GGUFs!
Nice! Keep in mind that MLA-style attention models like GLM-5.1, DeepSeek, etc. are already somewhat "compressed" by the latent attention mechanism. So I recommend against going below `-ctk q8_0`, and ideally leave it at the full `-ctk f16`, as it is already fairly efficient.
