why Q4_K_M > Q4_K_XL
#2
by bobchenyx - opened
Interesting: the Q4_K_M file is larger than the Q4_K_XL one.
Perhaps this comes from the ffn_down pattern matching, which also bumps every ffn_down_exps and ffn_down_shexp tensor?
llama-quant.cpp#L336
Is this by design?
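To illustrate the suspicion, here is a minimal sketch of how a plain substring test on "ffn_down" would also catch the expert and shared-expert tensors. This is not the actual code at llama-quant.cpp#L336; `matches_ffn_down` is a hypothetical helper, and the first and last tensor names in the list are made up for contrast.

```python
# Hypothetical illustration: a substring test on "ffn_down" also matches
# the expert (ffn_down_exps) and shared-expert (ffn_down_shexp) tensors.
# This is NOT the real llama.cpp logic, only a sketch of the suspected effect.

def matches_ffn_down(tensor_name: str) -> bool:
    """Return True if the name contains the substring 'ffn_down'."""
    return "ffn_down" in tensor_name


names = [
    "blk.3.ffn_down.weight",        # assumed dense-layer name, for contrast
    "blk.3.ffn_down_exps.weight",   # from the quantization log
    "blk.3.ffn_down_shexp.weight",  # from the quantization log
    "blk.3.ffn_up_exps.weight",     # assumed name, should not match
]

for name in names:
    print(f"{name}: {matches_ffn_down(name)}")
```

If the rule works like this, every ffn_down variant gets the same type bump, and in a MoE model the expert tensors make up most of the file.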
[ 53/1086] blk.3.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = bf16, converting to q6_K .. size = 7168.00 MiB -> 2940.00 MiB
[ 54/1086] blk.3.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = bf16, converting to q6_K .. size = 28.00 MiB -> 11.48 MiB
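As a sanity check on the log above, here is a rough sketch that recomputes the sizes from the tensor shapes. It assumes the usual k-quant super-block layout of 256 weights per block, with 210 bytes per block for q6_K and 144 bytes per block for q4_K; `size_mib` is a helper written for this post, not a llama.cpp function.

```python
# Recompute tensor sizes from shape and quant type.
# Assumption: k-quants pack 256 weights per super-block,
# q6_K uses 210 bytes per block and q4_K uses 144 bytes per block.
from math import prod

MIB = 1024 * 1024
BLOCK_ELEMS = 256
BYTES_PER_BLOCK = {"q4_K": 144, "q6_K": 210}


def size_mib(shape, qtype):
    """Size in MiB of a tensor with the given shape stored as qtype."""
    n = prod(shape)
    if qtype == "bf16":
        return n * 2 / MIB  # 2 bytes per weight
    return n // BLOCK_ELEMS * BYTES_PER_BLOCK[qtype] / MIB


exps = (2048, 7168, 256, 1)   # blk.3.ffn_down_exps.weight
shexp = (2048, 7168, 1, 1)    # blk.3.ffn_down_shexp.weight

for label, shape in (("exps", exps), ("shexp", shexp)):
    for qtype in ("bf16", "q6_K", "q4_K"):
        print(f"{label:5s} {qtype:5s} {size_mib(shape, qtype):9.2f} MiB")
```

Under these assumptions the q6_K figures match the log (2940.00 MiB and 11.48 MiB), while the same expert tensor at q4_K would be 2016.00 MiB, so each bumped ffn_down_exps tensor costs about 924 MiB extra.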
bobchenyx changed discussion title from why Q4_K_M > Q4_K_XL (lol) to why Q4_K_M > Q4_K_XL
Q4_K_XL is actually a dynamic quant of the model; its full name is UD-Q4_K_XL. Given that, my assumption is that both start from the same Q4_K_M base, and the dynamic quant has some portions degraded to 1 bit while others are left at native precision or somewhere in between. So the final size of the dynamic quant was a judgment call by Unsloth.
