Usage with vLLM?
#3
by nfunctor - opened
Hi, thanks for the quant, looks very promising in terms of its VRAM footprint!
I haven't managed to run it with vLLM (installed from source as of today). I'm getting CUDA OOM errors even on GPUs like the A100, even though the model should clearly fit. Do you know why that happens, and is there a workaround? Thanks!
Looks like it's trying to load the model in full precision. If I try to specify '--quantization awq', it also fails to launch an online vLLM server because there is no quantization_config in the config.json. I tried to amend the model files and add one, but I'm not knowledgeable enough to figure out the right values.
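For what it's worth, here is a sketch of what a quantization_config entry for an AWQ model typically looks like in config.json (this is the format AutoAWQ-produced checkpoints usually carry). The exact values are assumptions: bits and group_size must match how this particular quant was actually produced (4-bit with group size 128 is common but not guaranteed), so treat this as a starting point, not the known-correct config for this repo:

```json
{
  "quantization_config": {
    "quant_method": "awq",
    "bits": 4,
    "group_size": 128,
    "zero_point": true,
    "version": "gemm"
  }
}
```

With that block present in config.json, vLLM should detect the quantization on its own, so the '--quantization awq' flag may become unnecessary. If the group size or bit width doesn't match the checkpoint, loading will still fail, so it's safest to get the values from whoever made the quant.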
Have you managed to make it work? I get the same errors.
Same here.