Serve with vLLM
#1
by meigami - opened
How to serve this model with vLLM?
Hi @meigami! This Q5 repo is in our legacy PolarQuant format and won't load directly in vLLM without a dequantization step. For vLLM serving we recommend the v7-GPTQ version of the same model:
vllm serve caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-v7-GPTQ \
--enforce-eager --max-model-len 16384
That repo is in standard GPTQ INT4 format (native Marlin kernel support, no plugins required) and fits in roughly 19 GB of VRAM. It's the serving-friendly version of this model.
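Once the server is up, vLLM exposes an OpenAI-compatible HTTP API (on port 8000 by default). A minimal sketch of a chat-completion request body — the prompt and sampling settings here are just placeholders, but the `model` field must match the repo path you passed to `vllm serve`:

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# The "model" value must match the repo path given to `vllm serve`.
payload = {
    "model": "caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-v7-GPTQ",
    "messages": [
        {"role": "user", "content": "Write a haiku about GPUs."}  # placeholder prompt
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

# POST this to http://localhost:8000/v1/chat/completions, e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(json.dumps(payload, indent=2))
```

You can also point the official `openai` Python client at `base_url="http://localhost:8000/v1"` and use it unchanged.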
One honest quality caveat: on 27B HumanEval in thinking mode, v7-GPTQ scores 78.66% vs. 97.56% for the BF16 baseline, so INT4 quantization costs real quality on thinking-heavy workloads. If you need maximum thinking-mode quality, the BF16 Jackrong original is the better pick; if you need fast, low-VRAM vLLM serving for general tasks, v7-GPTQ is the one.