Serve with vLLM
#1
by meigami - opened
How to serve this model with vLLM?
Hi @meigami! This Q5 repo is in our legacy PolarQuant format and won't load directly in vLLM without a dequantization step. For vLLM serving we recommend the v7-GPTQ version of the same model:
vllm serve caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-v7-GPTQ \
--enforce-eager --max-model-len 16384
That repo is in standard GPTQ INT4 format (native Marlin kernel support, no plugins required) and fits in roughly 19 GB of VRAM. It's the serving-friendly version of this model.
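Once the server is up, vLLM exposes an OpenAI-compatible HTTP API (on port 8000 by default). A minimal sketch of a chat-completion request body — the prompt and sampling settings here are just placeholders, but the `model` field must match the repo path you passed to `vllm serve`:

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# The "model" value must match the repo path given to `vllm serve`.
payload = {
    "model": "caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-v7-GPTQ",
    "messages": [
        {"role": "user", "content": "Write a haiku about GPUs."}  # placeholder prompt
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

# POST this to http://localhost:8000/v1/chat/completions, e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(json.dumps(payload, indent=2))
```

You can also point the official `openai` Python client at `base_url="http://localhost:8000/v1"` and use it unchanged.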
One honest quality caveat: on 27B HumanEval in thinking mode, v7-GPTQ scores 78.66% vs. 97.56% for the BF16 baseline, so INT4 quantization costs real quality on thinking-heavy workloads. If you need maximum thinking-mode quality, the BF16 Jackrong original is the better pick; if you need fast, low-VRAM vLLM serving for general tasks, v7-GPTQ is the one.