why Q4_K_M > Q4_K_XL

#2
by bobchenyx - opened

Interesting: the Q4_K_M file is larger than the Q4_K_XL one.

Perhaps it comes from this part of the ffn_down pattern matching, which also bumps all ffn_down_exps and ffn_down_shexp tensors?
llama-quant.cpp#L336

Is it designed to work like this on purpose?
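
To make the suspicion concrete, here is a minimal Python sketch of how a plain substring check on "ffn_down" would also catch the expert and shared-expert tensors. It is an illustration only, not the code in llama-quant.cpp; the function name `matches_ffn_down` and the first and last tensor names in the list are hypothetical, while the two middle names are taken from the log below.

```python
# Hypothetical illustration; this is NOT the logic copied from llama-quant.cpp.
TENSOR_NAMES = [
    "blk.3.ffn_down.weight",        # assumed name of a dense ffn_down tensor
    "blk.3.ffn_down_exps.weight",   # routed experts (name from the log)
    "blk.3.ffn_down_shexp.weight",  # shared expert (name from the log)
    "blk.3.ffn_up_exps.weight",     # assumed name, should not match
]


def matches_ffn_down(name: str) -> bool:
    """Return True if the tensor name contains the substring 'ffn_down'."""
    return "ffn_down" in name


# Every tensor whose name contains "ffn_down" is selected, including
# the _exps and _shexp variants.
matched = [n for n in TENSOR_NAMES if matches_ffn_down(n)]
for name in matched:
    print(name)
```

If the real rule is a substring match of this kind, the expert tensors would get the same bump as a dense ffn_down tensor, which is what the log seems to show.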

[  53/1086]           blk.3.ffn_down_exps.weight - [ 2048,  7168,   256,     1], type =   bf16, converting to q6_K .. size =  7168.00 MiB ->  2940.00 MiB
[  54/1086]          blk.3.ffn_down_shexp.weight - [ 2048,  7168,     1,     1], type =   bf16, converting to q6_K .. size =    28.00 MiB ->    11.48 MiB
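
The sizes in these two log lines can be checked with a short Python sketch. It assumes the ggml block layouts of 210 bytes per 256 weights for q6_K and 144 bytes per 256 weights for q4_K; the comparison against q4_K is only an assumption about what the tensor would otherwise have used, and the helper name `size_mib` is hypothetical.

```python
from math import prod

# Bits per weight, derived from bytes per 256-weight block.
BF16_BPW = 16.0
Q6_K_BPW = 210 * 8 / 256  # 6.5625
Q4_K_BPW = 144 * 8 / 256  # 4.5 (assumed alternative type, for comparison only)


def size_mib(shape, bits_per_weight):
    """Size in MiB of a tensor with the given shape and bits per weight."""
    return prod(shape) * bits_per_weight / 8 / 2**20


# Shapes copied from the log lines above.
ffn_down_exps = (2048, 7168, 256, 1)
ffn_down_shexp = (2048, 7168, 1, 1)

for label, shape in [("exps", ffn_down_exps), ("shexp", ffn_down_shexp)]:
    print(
        f"{label}: bf16 {size_mib(shape, BF16_BPW):.2f} MiB, "
        f"q6_K {size_mib(shape, Q6_K_BPW):.2f} MiB, "
        f"q4_K {size_mib(shape, Q4_K_BPW):.2f} MiB"
    )

# Extra MiB for blk.3.ffn_down_exps when stored as q6_K instead of q4_K.
extra_mib = size_mib(ffn_down_exps, Q6_K_BPW) - size_mib(ffn_down_exps, Q4_K_BPW)
print(f"extra for ffn_down_exps at q6_K vs q4_K: {extra_mib:.2f} MiB")
```

Under these assumptions the q6_K figures match the log (2940.00 MiB and 11.48 MiB), and keeping this one expert tensor at q6_K rather than q4_K costs roughly 924 MiB, which would add up quickly if every such layer is bumped.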


bobchenyx changed discussion title from why Q4_K_M > Q4_K_XL (lol) to why Q4_K_M > Q4_K_XL

Q4_K_XL is actually a dynamic quant of the model; its full name is UD-Q4_K_XL. Given that, my assumption is that both start from Q4_K_M as the base, and that the dynamic quant version has some portions reduced to as low as 1 bit while others are left at native precision or at varying levels in between. So the size of the dynamic quant version was a value judgement by unsloth.
