Curious why you don't use i-matrix?

#1
by RyanoSaurus-Wrex - opened

Your quant methodology seems sound, but an i-matrix built on diverse subject matter really seems important, and I see that none of your quants use one. Why is that? Also, do you only evaluate coding and reasoning for your hybrid quants?

I don't use an i-matrix because I want to avoid selection bias toward an arbitrary, small set of calibration tokens. I explicitly do not want this method giving some parameters more bits and others fewer, because the parameters given fewer bits are guaranteed to be wanted at higher precision by other token sets. The models are pretrained across trillions of tokens and then instruct-tuned with billions more. I absolutely do not want any selection bias from some tiny million-token patch that doesn't span even a fraction of the instruct-tune space, let alone the pretrain itself; the unbiased quant is already the optimal solution.

The quants are only evaluated across a small set of curated reasoning and coding test prompts.

I understand, and that's what I was assuming you were thinking, but I don't necessarily know if that's true. Think about when Importance Matrix was first being developed and they found that using just gibberish text helped a lot. Because think about it like in our brains, there's a lot that activates for anything, if you will. So even a small token subset covering a wide range of samples will do a lot to help, because that base foundation in the model is activating even for garbage and being preserved. That's how I look at it.

> I understand, and that's what I was assuming you were thinking, but I don't necessarily know if that's true.

It's 100% true on a mathematical-theory basis. Consider a two-parameter model with equal-size, equal-weight parameters. The mathematically minimum possible quantization error occurs at exactly one point: where both parameters are quantized with the same number of bits. If more bits are given to one and fewer to the other based on one excitation of the parameters, there will be more error noise in the quantization of this two-parameter model over a uniformly distributed excitation. In the limit, if the excitation does not use one of the parameters, one parameter gets all the bits and the other is zeroed out, which obviously breaks down for another excitation that needed the zeroed parameter. The identical thing happens in an actual model when some parameters are given more bits than others: the resulting model has more error noise from the quant over any test set much larger than the test set used to bias the quant.
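To make the two-parameter argument concrete, here is a toy numerical sketch (my own illustration, not the author's evaluation code): two equal-importance weights share a fixed bit budget, and the output error over uniformly distributed excitations is measured for an even (4,4) bit split versus a biased (6,2) split.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(w, bits):
    # simple uniform quantizer on [-1, 1]
    step = 2.0 / (2 ** bits)
    return np.clip(np.round(w / step) * step, -1.0, 1.0)

# two equal-size, equal-weight parameters, many independent trials
n = 100_000
w = rng.uniform(-1, 1, size=(2, n))

# uniformly distributed excitation: both parameters matter equally on average
x = rng.uniform(-1, 1, size=(2, n))

results = {}
for name, (b1, b2) in {"uniform (4,4)": (4, 4), "biased (6,2)": (6, 2)}.items():
    wq = np.vstack([quantize(w[0], b1), quantize(w[1], b2)])
    # squared error of the model output x1*w1 + x2*w2 after quantization
    err = ((x * (w - wq)).sum(axis=0)) ** 2
    results[name] = err.mean()
    print(f"{name}: output MSE = {results[name]:.2e}")
```

Under this uniform excitation the even split wins: the error saved on the favored parameter is much smaller than the error added on the starved one, which is the convexity argument in the text.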

> Think about when Importance Matrix was first being developed and they found that using just gibberish text helped a lot.

I believe it was Kalomaze (the koboldcpp guy) who suggested using random noise to avoid an arbitrary bias toward particular text. However, the underlying identification of "non-important" parameters with any kind of small test set is still flawed, because the test set is far too small to identify them robustly. The parameters formed when the model was backpropped over trillions, then billions, of tokens of data; trying to identify "unimportant" parameters with a small test set fundamentally does not work.
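The sampling-noise objection can be sketched numerically. The llama.cpp imatrix, as I understand it, accumulates squared activations per input channel; in this toy version (my own construction, not the real tool) the channels are statistically identical by design, so any "least important" ranking from a small calibration sample is pure noise, and two independent small samples barely agree on which channels to starve of bits.

```python
import numpy as np

rng = np.random.default_rng(1)

dim = 512  # input channels feeding a hypothetical weight matrix

def calib_sample(n_tokens):
    # heavy-tailed stand-in activations; every channel has the same
    # true distribution, so no channel is genuinely less important
    return rng.standard_t(df=3, size=(n_tokens, dim))

def importance(acts):
    # per-channel mean squared activation (imatrix-style statistic)
    return (acts ** 2).mean(axis=0)

# two independent small calibration runs (64 "tokens" each)
imp_a = importance(calib_sample(64))
imp_b = importance(calib_sample(64))

# which channels would each run mark as "least important" (fewest bits)?
k = 64
low_a = set(np.argsort(imp_a)[:k])
low_b = set(np.argsort(imp_b)[:k])
overlap = len(low_a & low_b) / k
print(f"agreement on the {k} 'least important' channels: {overlap:.0%}")
```

With identical channels, the expected agreement is only about k/dim (12.5% here), illustrating the claim that a small calibration set cannot robustly pick out "unimportant" parameters.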

> Because think about it like in our brains, there's a lot that activates for anything, if you will. So even a small token subset covering a wide range of samples will do a lot to help, because that base foundation in the model is activating even for garbage and being preserved. That's how I look at it.

I believe model parameter pruning is an interesting research area, but I don't see any easy way out... it needs to be incorporated into the pretrain/fine-tune stages, and even when this is done (QAT, quantization-aware training) it is still tricky to manage, because it introduces a nonlinearity into the training which can botch the whole model, since the parameters were not allowed to go where they naturally wanted. I did not have good luck with some Google QAT models and found much better performance with non-QAT mixed-precision layer quants.
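For reference, the nonlinearity QAT introduces can be sketched as a fake-quantization step in the forward pass, usually paired with a straight-through estimator in the backward pass. This is a generic illustration of the technique, not the specific recipe used in any Google model.

```python
import numpy as np

def fake_quant(w, bits):
    # QAT forward pass: weights are snapped to a grid before use.
    # round() is the nonlinearity referred to above -- its true
    # gradient is zero almost everywhere.
    step = 2.0 / (2 ** bits)
    return np.clip(np.round(w / step) * step, -1.0, 1.0)

def ste_backward(grad_out):
    # straight-through estimator: the backward pass pretends
    # fake_quant is the identity, so gradients flow, but they no
    # longer match the forward computation
    return grad_out

w = np.array([0.30, -0.62, 0.11])
print(fake_quant(w, bits=3))  # values snapped to the 0.25-spaced grid
```

The mismatch between the rounded forward pass and the identity backward pass is exactly why training can be steered away from where the parameters "naturally wanted" to go.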
