# GLM-5.1-GGUF-1.673bpw
This is a 1.7 BPW quantized model for the GPU poors with 128 GiB of System RAM and 24 GiB of VRAM.
The quant aims for best-in-class performance at this size by relying on the SOTA IQK quants from ik_llama.cpp.
## Size
The FFN tensors will take about 127 GiB, to be loaded into System RAM and partially into VRAM, leaving absolutely no space for anything else. No GUI, no syslog, no cronie, no chronyd. For the GPU poors, every single bit matters.

The token_embd tensor will take about 510 MiB, and that goes into System RAM as well.

The other tensors will take about 13.3 GiB, to be loaded into VRAM, leaving some space for context, the compute buffer, and the few overflow FFN tensors.
Size from `llama-server` output:

```
llm_load_print_meta: model size = 146.840 GiB (1.673 BPW)
llm_load_print_meta: repeating layers = 145.622 GiB (1.663 BPW, 751.961 B parameters)
```
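As a sanity check, the reported BPW is just the tensor size in bits divided by the parameter count. A back-of-envelope calculation with the repeating-layer figures above:

```python
# Back-of-envelope check of the reported bits-per-weight (BPW):
# bpw = size_in_bits / parameter_count
size_gib = 145.622        # repeating layers, from llm_load_print_meta
params = 751.961e9        # 751.961 B parameters
bpw = size_gib * 1024**3 * 8 / params
print(f"{bpw:.3f}")       # matches the reported 1.663 BPW
```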
Buffer sizes with `-cmoe --no-mmap` (needs a small swap to load):

```
llm_load_tensors: CPU buffer size = 130485.47 MiB
llm_load_tensors: CUDA0 buffer size = 13625.82 MiB
```
Buffer sizes with `-ncmoe 74 --no-mmap` (doesn't need a swap):

```
llm_load_tensors: CPU buffer size = 123553.47 MiB
llm_load_tensors: CUDA0 buffer size = 20557.82 MiB
```
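The two modes differ only in where the expert tensors land, and the buffer sizes above are consistent: whatever leaves the CPU buffer shows up one-for-one in the CUDA0 buffer. A quick check with the numbers above:

```python
# MiB moved out of the CPU buffer by -ncmoe 74 reappears 1:1
# in the CUDA0 buffer (numbers from the llm_load_tensors lines above).
cpu_cmoe,  cuda_cmoe  = 130485.47, 13625.82   # with -cmoe
cpu_ncmoe, cuda_ncmoe = 123553.47, 20557.82   # with -ncmoe 74
moved_to_gpu = cpu_cmoe - cpu_ncmoe            # ~6932 MiB (~6.8 GiB)
assert abs(moved_to_gpu - (cuda_ncmoe - cuda_cmoe)) < 0.01
print(round(moved_to_gpu))
```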
## Quality
### Recipe
```
# Attention
blk\..*\.attn_k_b\.weight=q6_0
blk\..*\.attn_v_b\.weight=q6_0
blk\..*\.attn_kv_a_mqa\.weight=q6_0
blk\..*\.attn_q_a\.weight=q6_0
blk\..*\.attn_q_b\.weight=q6_0
blk\..*\.attn_output\.weight=q6_0
# First 3 Dense Layers
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq5_ks
# Shared Expert Layers
blk\..*\.ffn_down_shexp\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks
# Routed Experts Layers
blk\.78\.ffn_(up|gate|down)_exps\.weight=iq5_ks
blk\..*\.ffn_(up|gate|down)_exps\.weight=iq1_s_r4
# Indexer
blk\..*\.indexer\.proj\.weight=q6_0
blk\..*\.indexer\.attn_k\.weight=q6_0
blk\..*\.indexer\.attn_q_b\.weight=q6_0
# NextN MTP Layer
blk\..*\.nextn\.eh_proj\.weight=q6_0
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=q6_0
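Rule order matters in the routed-experts section: the blk.78 override must come before the catch-all line. A minimal sketch, assuming first-match-wins semantics for these regex rules (which is how ik_llama.cpp's custom quantization rules are resolved, to my understanding):

```python
import re

# Rules as (regex, quant type) pairs, in the order they appear in the
# recipe; the first pattern that matches a tensor name wins (assumption).
rules = [
    (r"blk\.78\.ffn_(up|gate|down)_exps\.weight", "iq5_ks"),
    (r"blk\..*\.ffn_(up|gate|down)_exps\.weight", "iq1_s_r4"),
]

def qtype_for(tensor_name):
    for pattern, qtype in rules:
        if re.search(pattern, tensor_name):
            return qtype
    return None

print(qtype_for("blk.78.ffn_up_exps.weight"))   # iq5_ks (override)
print(qtype_for("blk.12.ffn_up_exps.weight"))   # iq1_s_r4 (catch-all)
```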
PPL result with `wiki.test.raw`:

```
Final estimate: PPL over 565 chunks for n_ctx=512 = 6.1947 +/- 0.03947
```

For comparison, see the graphs at https://huggingface.co/ubergarm/GLM-5.1-GGUF.
This quant uses the imatrix from ubergarm (thanks!) and seems to perform well enough in real tasks, including some that Qwen3.5-397B-A17B / Kimi-K2.5 / Sonnet 4.5 can't solve (tested via API, presumably at full precision).

Interestingly, GLM-5.1 doesn't work well with the recipe from GLM-5-GGUF-1.594bpw, even though that recipe's PPL is a bit better at 6.1780 +/- 0.03932. Instead, the non-FFN tensors need more bits for the quant to actually perform in real tasks. This increases VRAM usage by about 2.66 GiB, making it infeasible to raise the batch size above the default `-ub 512`.
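This is a good illustration of PPL's limits. A rough check (treating the two runs as independent, which they aren't quite, since they score the same chunks): the PPL gap between the two recipes is well inside one combined standard error, so PPL alone can't separate them, even though their real-task behavior clearly differs.

```python
import math

# The two recipes' wiki.test PPLs and reported standard errors.
ppl_a, se_a = 6.1947, 0.03947   # this quant's recipe
ppl_b, se_b = 6.1780, 0.03932   # GLM-5-GGUF-1.594bpw recipe
diff = ppl_a - ppl_b
combined_se = math.sqrt(se_a**2 + se_b**2)   # rough, assumes independence
print(diff < combined_se)                     # True: within one std error
```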
## Flags
To get a usable context size, we have to sacrifice PP speed by going with the much slower `-mla 1`, which doesn't use as much VRAM as the usual `-mla 3`.

These flags allow an 88064-token context:
```
-ot \.(76|77)\.ffn_down_exps=CUDA0 \
-ot \.(74|75|76|77)\.ffn_(up|gate)_exps=CUDA0 \
-ot exps=CPU \
-mla 1 -c 88064 -ctk q6_0 -khad \
-b 2048 -ub 512 -wgt 1 \
--jinja -cram 0 -mqkv -muge -cuda graphs=1
```
- 10 FFN tensors on GPU, the rest on CPU
- `-mla 1` to squeeze 88064 context in Q6, `-khad` to reduce quantization error
- `-wgt 1` to reduce the CUDA compute buffer a little
- No GPU offload during prompt processing with the default `-ub 512`; will be painfully slow when processing large prompts
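The "10 FFN tensors" figure follows directly from the two `-ot` override regexes: 2 down tensors (layers 76 and 77) plus 8 up/gate tensors (layers 74 through 77). A quick check against hypothetical tensor names (the layer count here is just an assumption for illustration):

```python
import re

# The two CUDA0 overrides from the flags above.
overrides = [
    r"\.(76|77)\.ffn_down_exps",
    r"\.(74|75|76|77)\.ffn_(up|gate)_exps",
]

# Assumed per-layer expert tensor names, for illustration only.
names = [f"blk.{i}.ffn_{kind}_exps.weight"
         for i in range(79)
         for kind in ("up", "gate", "down")]

on_gpu = [n for n in names if any(re.search(p, n) for p in overrides)]
print(len(on_gpu))  # 10
```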
Base model: zai-org/GLM-5.1