Usage with vLLM?
#3
by nfunctor - opened
Hi, thanks for the quant, looks very promising in terms of its VRAM footprint!
I haven't managed to run it with vLLM (installed from source as of today). I'm getting CUDA OOM errors even on GPUs like the A100, even though the model should clearly fit. Do you know why that happens, and is there a workaround? Thanks!
Looks like it's trying to load the model in full precision. If I try to specify '--quantization awq', it also fails to launch an online vLLM server because there is no quantization_config in the config.json. I tried to amend the model files and add one, but I'm not knowledgeable enough to figure out the right values.
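For what it's worth, here is a sketch of what a quantization_config entry for an AWQ model typically looks like in config.json (this is the format AutoAWQ-produced checkpoints usually carry). The exact values are assumptions: bits and group_size must match how this particular quant was actually produced (4-bit with group size 128 is common but not guaranteed), so treat this as a starting point, not the known-correct config for this repo:

```json
{
  "quantization_config": {
    "quant_method": "awq",
    "bits": 4,
    "group_size": 128,
    "zero_point": true,
    "version": "gemm"
  }
}
```

With that block present in config.json, vLLM should detect the quantization on its own, so the '--quantization awq' flag may become unnecessary. If the group size or bit width doesn't match the checkpoint, loading will still fail, so it's safest to get the values from whoever made the quant.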
Have you managed to make it work? I get the same errors.
Same here.