How to deploy with VLLM?

#65

by onlysainaa - opened 22 days ago

I used CUDA_VISIBLE_DEVICES=0 vllm serve Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled to deploy this model but got an error lol pls share ur exp guys .

lucas-coutinho

22 days ago

--tokenizer Qwen/Qwen3.5-27B worked for me

onlysainaa

22 days ago

@lucas-coutinho did you deploy this model with vllm serve?

lucas-coutinho

22 days ago

yes. First of all you need at least vllm > 0.1.
But vllm has dependency of an old version of transformers so you cant fully serve the model because it can find the TokenizerClass. As a workaround you can set the flag --tokenizer Qwen/Qwen3.5-27B or overide in the tokenizer_config from TokenizerClass to Qwen2TokenizerFast.

uscjake87

21 days ago

•

edited 21 days ago

Adjust model length based on your specs of course..

vllm serve Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled \
  --tokenizer Qwen/Qwen3.5-27B \
  --host 127.0.0.1 \
  --port 8000 \
  --api-key your-key \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 131072 \
  --enforce-eager \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice

uscjake87

21 days ago

I tried to use it in cursor and it crashes cursor. So I don't know a solid parser to use here. Qwen coder users that parser just fine or sometimes the openai tool call parser. Maybe this model doesnt support the tool calling necessary for cursor.

onlysainaa

20 days ago

still could not figure how to run it, maybe its because it did not try anything lol

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment