How to deploy with VLLM?

#65
by onlysainaa - opened

I used CUDA_VISIBLE_DEVICES=0 vllm serve Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled to deploy this model but got an error lol pls share ur exp guys .

--tokenizer Qwen/Qwen3.5-27B worked for me

@lucas-coutinho did you deploy this model with vllm serve?

yes. First of all you need at least vllm > 0.1.
But vllm has dependency of an old version of transformers so you cant fully serve the model because it can find the TokenizerClass. As a workaround you can set the flag --tokenizer Qwen/Qwen3.5-27B or overide in the tokenizer_config from TokenizerClass to Qwen2TokenizerFast.

Adjust model length based on your specs of course..

vllm serve Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled \
  --tokenizer Qwen/Qwen3.5-27B \
  --host 127.0.0.1 \
  --port 8000 \
  --api-key your-key \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 131072 \
  --enforce-eager \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice

I tried to use it in cursor and it crashes cursor. So I don't know a solid parser to use here. Qwen coder users that parser just fine or sometimes the openai tool call parser. Maybe this model doesnt support the tool calling necessary for cursor.

still could not figure how to run it, maybe its because it did not try anything lol

Sign up or log in to comment