Quantization Command

#3
by Mar2ck - opened

Can you post the llama-quantize command you used to make this quant? It's noticeably better than the ones from "normal" quantizers like mradermacher.

I assume you used various --tensor-type arguments, but if you could share the exact regex you used, that would help me reproduce it much better.

Owner

Hi @Mar2ck , sure thing. In general, the recipe looks like this:

# importance matrix and per-tensor type overrides
IMATRIX=~/imatrices/GLM-4.5V-ddh0_v2-imatrix.gguf
TYPE_DEFAULT=Q8_0
TYPE_FFN_UP_EXPS=IQ4_XS
TYPE_FFN_GATE_EXPS=IQ4_XS
TYPE_FFN_DOWN_EXPS=Q5_0
SRC_GGUF=~/gguf/GLM-4.5V-bf16.gguf
DST_GGUF=~/gguf/GLM-4.5V-$TYPE_DEFAULT-FFN-$TYPE_FFN_UP_EXPS-$TYPE_FFN_GATE_EXPS-$TYPE_FFN_DOWN_EXPS.gguf

time ~/llama.cpp/build/bin/llama-quantize \
  --imatrix "$IMATRIX" \
  --tensor-type ffn_up_exps="$TYPE_FFN_UP_EXPS" \
  --tensor-type ffn_gate_exps="$TYPE_FFN_GATE_EXPS" \
  --tensor-type ffn_down_exps="$TYPE_FFN_DOWN_EXPS" \
  "$SRC_GGUF" "$DST_GGUF" "$TYPE_DEFAULT" "$(nproc)"
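To illustrate how the --tensor-type patterns select tensors, here is a small sketch using grep against hypothetical tensor names in the style GGUF uses for MoE models (the names below are illustrative, not dumped from the actual file): a pattern like ffn_down_exps matches only the conditional expert tensors, and anything unmatched keeps the default type.

```shell
# Hypothetical tensor names, for illustration only
names="blk.0.ffn_up_exps.weight
blk.0.ffn_gate_exps.weight
blk.0.ffn_down_exps.weight
blk.0.attn_q.weight
blk.0.ffn_down_shexp.weight"

# Only the expert down-projection matches, so only it gets Q5_0;
# attn_q and the shared-expert tensor fall back to the Q8_0 default
matched=$(printf '%s\n' "$names" | grep -c 'ffn_down_exps')
echo "$matched"   # → 1
```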

This follows the naming scheme for the quantizations uploaded in this repo - let me know if I can clarify anything. Glad to know I'm beating mradermacher :3 πŸ€—
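For reference, the DST_GGUF expansion in the recipe produces the filename used in this repo; a quick demo of the same expansion (path dropped for brevity):

```shell
TYPE_DEFAULT=Q8_0
TYPE_FFN_UP_EXPS=IQ4_XS
TYPE_FFN_GATE_EXPS=IQ4_XS
TYPE_FFN_DOWN_EXPS=Q5_0

# Same expansion as in the recipe, minus the ~/gguf/ prefix
DST_GGUF=GLM-4.5V-$TYPE_DEFAULT-FFN-$TYPE_FFN_UP_EXPS-$TYPE_FFN_GATE_EXPS-$TYPE_FFN_DOWN_EXPS.gguf
echo "$DST_GGUF"   # → GLM-4.5V-Q8_0-FFN-IQ4_XS-IQ4_XS-Q5_0.gguf
```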

I really ought to put this stuff on the model card, but I'm lazy. I only write it out when someone asks. LOL.

Thanks!
Yeah, I almost wrote off GLM Air as useless because it didn't seem very intelligent at regular IQ4_XS; turns out keeping the non-conditional parts at high quality really does make a huge difference. I thought the imatrix would do something similar on its own, but I guess it's not strong enough / is too restricted by IQ4_XS to save the quant.

Mar2ck changed discussion status to closed
