Could you make a smaller quantized model?
Opened by lklklk248
Thank you for your model. I used your 3.2bpw model, and the quality is excellent. However, I only have 16GB of VRAM, and after a certain amount of context, the speed becomes very slow. Could you make a smaller quantized model?
Oh sorry I missed this!
I am running it overnight now, same quant as V4 but with a smaller GPU portion (IQ4_KT instead of IQ5_KS, mostly). It'll be uploaded in the morning.
Oh wait, I just realized what repo I'm in.
Yeah, I can crush some of the MLPs to 2 bpw. But know that exllamav3 is naturally going to slow down at longer context, even if everything fits in your VRAM.
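For context on why a lower bpw helps on 16 GB: weight memory scales roughly linearly with the average bits per weight, so crushing a few tensors to 2 bpw pulls the average down. A minimal back-of-envelope sketch (the 32B parameter count is a hypothetical stand-in, not this model's actual size; KV cache and activations come on top of the weights):

```python
# Rough VRAM estimate for quantized weights (hypothetical parameter count, not this repo's model).
def weight_gb(params_billion: float, avg_bpw: float) -> float:
    """Approximate weight footprint in GB: params * bits-per-weight / 8 bits-per-byte."""
    return params_billion * 1e9 * avg_bpw / 8 / 1e9

for bpw in (3.2, 2.8, 2.4):
    # KV cache and activations are extra, so leave headroom under 16 GB.
    print(f"{bpw:.1f} bpw -> ~{weight_gb(32, bpw):.1f} GB of weights")
```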
Thank you for your reply and your work.