Any plans on trying to make Q1 or Q2 models?
Hey, your quantized models are really good, and I was wondering if you could try reducing the model to Q1 and Q2 sizes. If the quality holds up, the model becomes accessible to individuals with low VRAM availability.
Sure, if you need it I could make it. But expect pretty severe perplexity degradation; it's unavoidable. I heard 1-bit Bonsai and 2-bit TurboQuant are good, but... I have no idea how to make them. 1-bit Bonsai needs retraining, while the 2-bit TurboQuant infrastructure isn't there yet... so yeah, maybe I could upload regular and imatrix quants.
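(For anyone curious, the imatrix route with mainline llama.cpp usually looks roughly like the sketch below, here driven from Python. All file names are placeholders, and exact flags can vary between llama.cpp versions.)

```python
# Minimal sketch of the usual llama.cpp imatrix-quant flow.
# Tool names are from mainline llama.cpp; all paths are placeholders.
import subprocess

MODEL_F16 = "model-f16.gguf"    # full-precision GGUF (placeholder)
CALIB_TXT = "calibration.txt"   # representative text corpus (placeholder)
IMATRIX = "imatrix.dat"
OUT_GGUF = "model-IQ2_XS.gguf"

# 1) Collect activation statistics over the calibration text.
subprocess.run(
    ["llama-imatrix", "-m", MODEL_F16, "-f", CALIB_TXT, "-o", IMATRIX],
    check=True,
)

# 2) Quantize to a 2-bit i-quant, guided by the importance matrix.
subprocess.run(
    ["llama-quantize", "--imatrix", IMATRIX, MODEL_F16, OUT_GGUF, "IQ2_XS"],
    check=True,
)
```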
That would be great. I'm curious about the performance, since your quantizations so far have been better than most I've seen, and I wonder whether that could carry over to the smaller quants. I'm aware you use the ik_llama fork for the IQ models; however, I don't think TurboQuant has been integrated into that fork yet. I'm quite new to this, so I'm not sure if it could be at all.
Regardless, an IQ1/2 would be great if you could try to make that happen.
Thank you for your work.
So, after some tests, 1 bit and 2 bits simply weren't enough to fit the weights. This is the result of a simple "hi" test with an f16 KV cache.
> hi formatted: <|im_start|>user hi <|im_end|> <|im_start|>assistant <think> The</think></think></think></think></think></think></think></think></think></think></think></think></think></think></think> ##HiHelloHelloIHelloHello</think>HelloHello</think>#HiHelloHihiformatted: The</think></think></think></think></think></think></think></think></think></think></think></think></think></think></think> ##HiHelloHelloIHelloHello</think>HelloHello</think>#HiHelloHihi<|im_end|>
Quant: IQ2_XS
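(If anyone wants to reproduce this kind of smoke test, something like the following via llama-cpp-python should work. The model path is a placeholder, and f16 is the default KV cache type, so it isn't set explicitly.)

```python
# Quick "hi" smoke test against a quantized GGUF via llama-cpp-python.
# Sketch only; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="model-IQ2_XS.gguf", n_ctx=2048, verbose=False)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "hi"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
# A healthy quant replies with a short greeting; at 1-2 bits you tend
# to get token soup like the transcript above.
```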
Ahh, so no IQ1 or IQ2. That's unfortunate, but thank you for attempting it; I really appreciate the trial and the very quick response.
Yeah, unfortunately. That stuff is too technical for me to handle. Thanks for swinging by.