Is there anyone who can tell me how to run this model with vllm correctly?
Hi all. I tried to serve this model with vLLM 0.18.1.
The vllm serve script is:
export CUDA_VISIBLE_DEVICES=0
uv run vllm serve /data/scopemodels/qwopus3.5-27b-v3 \
--tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 131072 \
--trust-remote-code \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--reasoning-parser qwen3
Got an error message:
ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.
After consulting Gemini, I replaced the tokenizer_class TokenizersBackend with Qwen2Tokenizer in tokenizer_config.json. Then the model can start, but I got a lot of warning messages such as:
(APIServer pid=1691986) The tokenizer you are loading from '/data/scopemodels/qwopus3.5-27b-v3' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore pid=1692728) The tokenizer you are loading from '/data/scopemodels/qwopus3.5-27b-v3' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Then I added fix_mistral_regex to tokenizer_config.json as below:
{
"tokenizer_class": "Qwen2Tokenizer",
"fix_mistral_regex": true
}
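Both edits can be scripted so you don't have to hand-edit the file for every checkpoint. A minimal sketch (`patch_tokenizer_config` is just a helper name I made up; it assumes tokenizer_config.json sits in the model directory as usual):

```python
import json
import pathlib

def patch_tokenizer_config(model_dir):
    """Apply both tokenizer_config.json workarounds in one go."""
    path = pathlib.Path(model_dir) / "tokenizer_config.json"
    cfg = json.loads(path.read_text())
    # Transformers v4 doesn't know the v5 "TokenizersBackend" class,
    # so fall back to the Qwen2 tokenizer...
    cfg["tokenizer_class"] = "Qwen2Tokenizer"
    # ...and opt in to the corrected Mistral pre-tokenization regex
    # to silence the warning from the logs above.
    cfg["fix_mistral_regex"] = True
    path.write_text(json.dumps(cfg, indent=2))
    return cfg
```

This only edits the two keys and leaves the rest of the config untouched.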
Now I can start the model without any warning or error messages. But when I test it with Claude Code, it can only chat, no tool calls!!
Is there anyone who can tell me how to run this model correctly?
me too !
That's so weird. I have the same situation
Same here, although I did run Qwopus a few days ago and it worked fine.
This model was generated by Transformers v5, and vLLM ships with Transformers v4.
You need to update Transformers in vLLM for this model to work.
Hello,
I had a similar problem with Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled. First I created this Dockerfile:
FROM vllm/vllm-openai:v0.18.1-cu130
RUN pip install --upgrade pip && \
pip install --upgrade transformers && \
pip install --upgrade tokenizers && \
pip install huggingface-hub sentencepiece
# Fix list-vs-set in qwen3_5 config for transformers 5.x compatibility
RUN python3 -c "\
import pathlib; \
p = pathlib.Path('/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py'); \
t = p.read_text(); \
t = t.replace('kwargs[\"ignore_keys_at_rope_validation\"] = [', 'kwargs[\"ignore_keys_at_rope_validation\"] = {'); \
t = t.replace('\"mrope_interleaved\",\n ]', '\"mrope_interleaved\",\n }'); \
p.write_text(t); \
print('Patched qwen3_5.py: list -> set')"
docker build -f Dockerfile.qwen35-opus -t vllm/vllm-openai:v0.18.1-cu130-opus .
Then, once it's built, save the following as a bash script (change the env variables depending on your use case; these are for two RTX Pro 6000 Max-Q GPUs):
docker run --name qwen35_opus_V3 --gpus '"device=0"' \
--privileged \
--ipc=host \
-p 8016:8000 \
-e OMP_NUM_THREADS=14 \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
-e VLLM_TRACE_FUNCTION=0 \
-e VLLM_WORKER_MULTIPROC_METHOD=spawn \
-e VLLM_SKIP_P2P_CHECK=1 \
-e CUDA_DEVICE_ORDER=PCI_BUS_ID \
-e CUDA_VISIBLE_DEVICES=0 \
-e NCCL_P2P_LEVEL=SYS \
-e NCCL_P2P_DISABLE=0 \
-e NCCL_IB_DISABLE=0 \
-e NCCL_CUMEM_ENABLE=0 \
-v ~/qwen3-docker/model_cache:/root/.cache/huggingface \
vllm/vllm-openai:v0.18.1-cu130-opus \
Jackrong/Qwopus3.5-27B-v3 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.94 \
--disable-custom-all-reduce \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--enable-prefix-caching \
--language-model-only \
--attention-backend FLASHINFER \
-O3
Then run: bash run_qwen35_27b.sh
Just found out this Jackrong/Qwopus3.5-27B-v3 model is missing the mtp tensors (Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is not), so a speculative config will make it crash. Skip `--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}'`; otherwise it should work.
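If you want to check a local checkpoint for the mtp tensors before enabling speculative decoding, the sharded safetensors index lists every tensor name, so a quick scan works. A sketch (`has_mtp_tensors` is a hypothetical helper; it assumes a standard sharded HF layout with model.safetensors.index.json and that the MTP weights carry "mtp" in their names):

```python
import json
import pathlib

def has_mtp_tensors(model_dir):
    # model.safetensors.index.json maps every tensor name to its shard file,
    # so we can inspect names without loading any weights.
    index = pathlib.Path(model_dir) / "model.safetensors.index.json"
    weight_map = json.loads(index.read_text())["weight_map"]
    return any("mtp" in name for name in weight_map)
```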
I am unable to use tools.
I got the model deployed with Docker and vLLM, and both reasoning and tool-calling work fine, using the following command:
docker pull vllm/vllm-openai:gemma4-cu130
docker run -itd --name qwen_distilled_v3 \
--gpus '"device=2"' \
--ipc=host \
--network host \
--shm-size 16G \
-e VLLM_LOGGING_LEVEL=INFO \
-e VLLM_API_KEY="$VLLM_API_KEY" \
-v /data/cfs/Qwen/Qwopus3.5-27B-v3:/data/model \
vllm/vllm-openai:gemma4-cu130 \
--model /data/model \
--served-model-name qwen-distilled-v3 \
--tensor-parallel-size 1 \
--quantization fp8 \
--max-model-len 131072 \
--gpu-memory-utilization 0.9 \
--enable-force-include-usage \
--enable-log-requests \
--enable-log-outputs \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--host 0.0.0.0 \
--port 8502
Check it out, and note that my CUDA version is 13.0; you can change the image to vllm/vllm-openai:gemma4 for CUDA 12.9, according to the Gemma4 user guide here: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
