Could you make a smaller quantized model?
Opened by lklklk248
Thank you for your model. I used your 3.2bpw model, and the quality is excellent. However, I only have 16GB of VRAM, and after a certain amount of context, the speed becomes very slow. Could you make a smaller quantized model?
Oh sorry I missed this!
I am running it overnight now, same quant as V4 but with a smaller GPU portion (IQ4_KT instead of IQ5_KS, mostly). It'll be uploaded in the morning.
Oh wait, I just realized what repo I'm in.
Yeah, I can crush some of the MLPs to 2 bpw. But know that exllamav3 is naturally going to slow down at longer context, even if everything fits in your VRAM.
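For context on why a lower bpw helps on 16 GB: weight memory scales roughly linearly with the average bits per weight, so crushing a few tensors to 2 bpw pulls the average down. A minimal back-of-envelope sketch (the 32B parameter count is a hypothetical stand-in, not this model's actual size; KV cache and activations come on top of the weights):

```python
# Rough VRAM estimate for quantized weights (hypothetical parameter count, not this repo's model).
def weight_gb(params_billion: float, avg_bpw: float) -> float:
    """Approximate weight footprint in GB: params * bits-per-weight / 8 bits-per-byte."""
    return params_billion * 1e9 * avg_bpw / 8 / 1e9

for bpw in (3.2, 2.8, 2.4):
    # KV cache and activations are extra, so leave headroom under 16 GB.
    print(f"{bpw:.1f} bpw -> ~{weight_gb(32, bpw):.1f} GB of weights")
```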
Thank you for your reply and your work.