# GLM-5-GGUF-1.594bpw

This is a 1.6 BPW quantized model for the GPU poors with 128 GiB of system RAM and 24 GiB of VRAM. The quant aims for best-in-class quality at this size by relying on SOTA IQK quants from ik_llama.cpp.
## Size
The FFN tensors take about 127 GiB, to be loaded into system RAM and partially into VRAM, leaving absolutely no space for anything else. No GUI, no syslog, no cronie, no chronyd. For the GPU poors, every single bit matters.

The token_embd tensor takes about 510 MiB, and that goes into system RAM as well.

The other tensors take about 10.6 GiB, to be loaded into VRAM, leaving some space for the context, the compute buffer, and the few overflowed FFN tensors.
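As a quick sanity check, the budget above can be sketched with the approximate sizes quoted in this section (the figures are rounded, so treat the result as ballpark only):

```python
# Back-of-the-envelope memory budget for this quant, using the rounded
# sizes quoted above (not exact measured values).
GIB = 1024**3

vram = 24 * GIB
ffn_tensors = 127 * GIB         # routed-expert FFN weights, mostly in system RAM
token_embd = 510 * 1024**2      # token embedding, host memory
other_tensors = 10.6 * GIB      # attention/dense/shared-expert weights on the GPU

vram_left = vram - other_tensors
print(f"VRAM left for context + compute buffer: {vram_left / GIB:.1f} GiB")
```

With 24 GiB of VRAM, that leaves roughly 13 GiB for the KV cache, the compute buffer, and the handful of FFN tensors offloaded to the GPU.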
Size from the llama-server output:

    llm_load_print_meta: model size = 139.907 GiB (1.594 BPW)
    llm_load_print_meta: repeating layers = 138.826 GiB (1.586 BPW, 751.961 B parameters)
Buffer sizes with `-cmoe --no-mmap` (needs a small swap to load):

    llm_load_tensors: CPU buffer size = 129975.00 MiB
    llm_load_tensors: CUDA_Host buffer size = 510.47 MiB
    llm_load_tensors: CUDA0 buffer size = 10897.35 MiB
Buffer sizes with `-ncmoe 74 --no-mmap` (doesn't need a swap):

    llm_load_tensors: CPU buffer size = 123043.00 MiB
    llm_load_tensors: CUDA_Host buffer size = 510.47 MiB
    llm_load_tensors: CUDA0 buffer size = 17829.35 MiB
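The BPW figure is simply total bits over parameter count, so the reported numbers can be cross-checked:

```python
# Cross-check the llama-server size report: BPW = total bits / parameters.
GIB = 1024**3

size_gib = 138.826     # repeating layers, from the log above
params = 751.961e9     # reported parameter count

bpw = size_gib * GIB * 8 / params
print(f"{bpw:.3f} BPW")
```

This reproduces the reported 1.586 BPW for the repeating layers.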
## Quality

### Recipe
    # Attention
    blk\..*\.attn_k_b\.weight=q6_0
    blk\..*\.attn_v_b\.weight=q6_0
    blk\..*\.attn_kv_a_mqa\.weight=iq4_k
    blk\..*\.attn_q_a\.weight=iq4_k
    blk\..*\.attn_q_b\.weight=iq4_k
    blk\..*\.attn_output\.weight=iq5_ks
    # First 3 Dense Layers
    blk\..*\.ffn_down\.weight=iq4_k
    blk\..*\.ffn_(gate|up)\.weight=iq4_k
    # Shared Expert Layers
    blk\..*\.ffn_down_shexp\.weight=iq4_k
    blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_k
    # Routed Experts Layers
    blk\..*\.ffn_(up|gate|down)_exps\.weight=iq1_s_r4
    # Indexer
    blk\..*\.indexer\.proj\.weight=iq4_k
    blk\..*\.indexer\.attn_k\.weight=iq4_k
    blk\..*\.indexer\.attn_q_b\.weight=iq4_k
    # NextN MTP Layer
    blk\..*\.nextn\.embed_tokens\.weight=iq4_k
    blk\..*\.nextn\.shared_head_head\.weight=iq4_k
    blk\..*\.nextn\.eh_proj\.weight=iq4_k
    # Non-Repeating Layers
    token_embd\.weight=iq4_k
    output\.weight=iq5_ks
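ik_llama.cpp's llama-quantize accepts a recipe like the one above through `--custom-q`, as a single comma-separated list of `regex=type` rules. A small sketch of turning the recipe text into that argument (the recipe here is a shortened excerpt, and the file names in the comment are placeholders):

```python
# Turn a recipe of "# comment" and "regex=type" lines into the
# comma-separated string expected by llama-quantize --custom-q.
recipe = r"""
# Attention
blk\..*\.attn_k_b\.weight=q6_0
# Routed Experts Layers
blk\..*\.ffn_(up|gate|down)_exps\.weight=iq1_s_r4
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq5_ks
"""

rules = [line.strip() for line in recipe.splitlines()
         if line.strip() and not line.lstrip().startswith("#")]
custom_q = ",".join(rules)
print(custom_q)
# Then, with placeholder paths:
#   llama-quantize --imatrix imatrix.dat --custom-q "$CUSTOM_Q" \
#       GLM-5-BF16.gguf GLM-5-IQ1_S_R4.gguf IQ1_S_R4
```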
PPL result with wiki.test.raw:

    Final estimate: PPL over 565 chunks for n_ctx=512 = 6.2248 +/- 0.03964

See the graphs at https://huggingface.co/ubergarm/GLM-5-GGUF for comparison.
This quant uses the imatrix from unsloth, which seems to make the model perform more reliably on actual tasks. With the imatrix from ubergarm, PPL is slightly better at 6.1469 +/- 0.03890, but real-task performance is noticeably worse.
## Flags
To get a usable context size, we have to sacrifice prompt processing speed by going with the much slower `-mla 1`, which uses less VRAM than the usual `-mla 3`.
These flags allow a 75000-token context:

    -ot \.(73|74|75|76|77)\.ffn_down_exps=CUDA0 \
    -ot \.(75|76|77)\.ffn_(up|gate)_exps=CUDA0 \
    -ot exps=CPU \
    -mla 1 -c 75000 -ctk q5_0 -khad \
    -b 2048 -ub 2048 \
    --jinja -cram 0 -mqkv -ger -cuda graphs=1
- 11 FFN tensors on the GPU, the rest on the CPU
- `-mla 1` to squeeze a 75000 context in Q5, `-khad` to reduce quantization error
- 2048 batch size to allow GPU offload when processing larger prompts
Tested to work well in both Q&A and agentic tasks, including difficult ones.
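For tuning `-c` and `-ctk`, a rough MLA KV-cache estimate can help. The sketch below uses DeepSeek-style MLA dimensions as placeholders (kv_lora_rank=512, rope dim=64, 61 layers); GLM-5's actual dimensions differ, so treat the result as illustrative only:

```python
# Rough MLA KV-cache size estimate. All model dimensions below are
# PLACEHOLDERS borrowed from DeepSeek-style MLA, not GLM-5's real values.
def kv_cache_gib(n_ctx, n_layer=61, kv_lora_rank=512, rope_dim=64, bpw=5.5):
    # With MLA, each token caches one compressed latent of
    # (kv_lora_rank + rope_dim) values per layer; q5_0 stores 5.5 bits
    # per value (22 bytes per 32-value block).
    bits = n_ctx * n_layer * (kv_lora_rank + rope_dim) * bpw
    return bits / 8 / 1024**3

print(f"{kv_cache_gib(75000):.2f} GiB")
```

Even under these placeholder dimensions, a 75000-token MLA cache in q5_0 stays in the low single-digit GiB, which is why it fits alongside the compute buffer and the offloaded FFN tensors.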
Base model: zai-org/GLM-5