gguf size
q4_k_m.gguf is 14.3 GB, but the recommended bf16_q4_k.gguf is 26.4 GB.
This looks unusual: for other models the bf16_q4_k file is only slightly larger than the q4_k_m one.
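A rough back-of-envelope estimate, assuming roughly 4.5 bits/weight for Q4_K_M and 16 bits/weight for BF16 (both are approximations, not exact figures for these files), shows what fraction of the weights would have to sit at BF16 to explain the gap:

```python
# Back-of-envelope check of the size gap, assuming ~4.5 bits/weight for
# Q4_K_M and 16 bits/weight for BF16 (assumptions, not exact figures).
Q4K_BPW, BF16_BPW = 4.5, 16.0

size_q4km_gb = 14.3   # pure Q4_K_M file
size_mixed_gb = 26.4  # bf16_q4_k mixed file

# Estimate parameter count from the Q4_K_M file size
n_params = size_q4km_gb * 8e9 / Q4K_BPW       # about 25 billion

# Average bits/weight implied by the mixed file
mixed_bpw = size_mixed_gb * 8e9 / n_params    # about 8.3 bpw

# Fraction of weights that would need to be BF16 to account for it
f_bf16 = (mixed_bpw - Q4K_BPW) / (BF16_BPW - Q4K_BPW)
print(round(f_bf16, 2))  # roughly a third of the weights at BF16
```

So a file that size is consistent with around a third of the weights being stored at full BF16 precision, which is a lot more than just embeddings and output.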
I have changed the number of layers that are left unquantized. Previously it was only the embeddings and output; now it includes other critical layers. The quant rules are:

```json
[
  {
    "override_types": ["BF16", "F16"],
    "layer_name": ["*attn_k", "*attn_q"],
    "experts": true,
    "order_low": 1,
    "order_high": 9
  },
  {
    "override_types": ["BF16", "F16"],
    "layer_name": ["*attn_v", "*ffn_down", "*attn_output", "*attn_qkv"]
  },
  {
    "override_types": ["BF16", "F16"],
    "layer_name": ["*ffn_down_exps", "*ffn_gate_exps", "*ffn_up_exps"],
    "experts": true,
    "order_low": 1,
    "order_high": 9
  },
  {
    "override_types": ["BF16", "F16"],
    "layer_name": ["*ffn_down_shexp", "*ffn_gate_shexp", "*ffn_up_shexp"],
    "experts": true,
    "order_low": 1,
    "order_high": 9
  }
]
```

`order_low` and `order_high` are the points between which a rule is applied: 1 is the first 10% and 9 the last 90%. `"experts": true` means the rule only applies to MoE models. Here is the code that builds the quant list for injection into llama-quantize: https://github.com/Mungert69/GGUFModelBuilder/blob/main/model-converter/tensor_list_builder.py . I can see that this now creates a much larger file, so I will have a think about whether there is another way to do it.
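For anyone reading along, here is a minimal sketch of how such rules could be matched against tensor names, interpreting `order_low`/`order_high` as layer-depth deciles as described above. This is a hypothetical re-implementation for illustration; the real logic lives in tensor_list_builder.py in the linked repo and may differ:

```python
# Hypothetical sketch of the rule matching described above (the actual
# implementation is in tensor_list_builder.py in the linked repo).
from fnmatch import fnmatch

def rule_applies(rule, tensor_name, layer_idx, n_layers, is_moe):
    """Return True if a quant rule should override the type for this tensor."""
    # experts-only rules are skipped entirely on dense (non-MoE) models
    if rule.get("experts") and not is_moe:
        return False
    # order_low/order_high bound the layer-depth range (1 = 10%, 9 = 90%)
    low = rule.get("order_low", 0) / 10.0
    high = rule.get("order_high", 10) / 10.0
    frac = layer_idx / max(n_layers - 1, 1)
    if not (low <= frac <= high):
        return False
    # glob-style match against the tensor name patterns
    return any(fnmatch(tensor_name, pat) for pat in rule["layer_name"])

rule = {
    "override_types": ["BF16", "F16"],
    "layer_name": ["*attn_k", "*attn_q"],
    "experts": True,
    "order_low": 1,
    "order_high": 9,
}
# A mid-depth attn_q tensor in a MoE model matches; layer 0 falls
# outside the 10%-90% range and does not.
print(rule_applies(rule, "blk.20.attn_q", 20, 40, is_moe=True))
print(rule_applies(rule, "blk.0.attn_q", 0, 40, is_moe=True))
```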
I'm not an AI science expert, but perhaps there is a way to sparsify the model and zero out the sparse parts, like with a MoE model?
Interesting question. I don't know how to sparsify a model, so I put it to ChatGPT: https://chatgpt.com/share/6840a329-4704-800b-8f46-5b35e77aa3bd .
The layers that I do not quantize with the quant rules above are definitely not candidates to be zeroed, but others might be, or could at least be given lower quants. What I am doing is using the imatrix as a baseline and then bumping up the important layers, because an imatrix tends not to be enough at lower quantisation levels. The bf16 and f16 mixed-quant GGUF files in question above are meant for maximum precision with efficient memory reduction.
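For reference, one common way to "sparsify" is magnitude pruning: zero out the weights with the smallest absolute values. A minimal NumPy sketch of the idea raised here (this is not what the quant rules above do, just an illustration):

```python
# Minimal sketch of magnitude pruning: zero the fraction `sparsity` of
# weights with the smallest absolute values. Illustration only.
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Return a copy of w with the smallest-|value| weights zeroed."""
    flat = np.abs(w).ravel()
    k = int(flat.size * sparsity)
    if k == 0:
        return w.copy()
    # np.partition puts the k smallest magnitudes in the first k slots
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
p = magnitude_prune(w, 0.5)
print(float((p == 0).mean()))  # roughly 0.5
```

Worth noting: zeros on their own don't shrink a dense GGUF file, since every weight is still stored. Realizing a size saving would need a sparse storage format or structured pruning that actually removes tensors or channels.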