Is there anyone who can tell me how to run this model with vllm correctly?
Hi all. I tried to serve this model with vLLM 0.18.1.
The vllm serve script is:
export CUDA_VISIBLE_DEVICES=0
uv run vllm serve /data/scopemodels/qwopus3.5-27b-v3 \
--tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 131072 \
--trust-remote-code \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--reasoning-parser qwen3
Got an error message:
ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.
After consulting Gemini, I replaced the tokenizer_class TokenizersBackend with Qwen2Tokenizer in tokenizer_config.json. Then the model can start, but I got a lot of warning messages such as:
(APIServer pid=1691986) The tokenizer you are loading from '/data/scopemodels/qwopus3.5-27b-v3' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore pid=1692728) The tokenizer you are loading from '/data/scopemodels/qwopus3.5-27b-v3' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Then I added fix_mistral_regex to tokenizer_config.json as below:
{
"tokenizer_class": "Qwen2Tokenizer",
"fix_mistral_regex": true
}
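Both edits can be scripted so you don't have to hand-edit the file for every checkpoint. A minimal sketch (`patch_tokenizer_config` is just a helper name I made up; it assumes tokenizer_config.json sits in the model directory as usual):

```python
import json
import pathlib

def patch_tokenizer_config(model_dir):
    """Apply both tokenizer_config.json workarounds in one go."""
    path = pathlib.Path(model_dir) / "tokenizer_config.json"
    cfg = json.loads(path.read_text())
    # Transformers v4 doesn't know the v5 "TokenizersBackend" class,
    # so fall back to the Qwen2 tokenizer...
    cfg["tokenizer_class"] = "Qwen2Tokenizer"
    # ...and opt in to the corrected Mistral pre-tokenization regex
    # to silence the warning from the logs above.
    cfg["fix_mistral_regex"] = True
    path.write_text(json.dumps(cfg, indent=2))
    return cfg
```

This only edits the two keys and leaves the rest of the config untouched.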
Now I can start the model without any warning or error messages. But when I test it with Claude Code, it can only chat, no tool calls!!
Is there anyone who can tell me how to run this model correctly?
me too !
That's so weird. I have the same situation
Same here, although I did run Qwopus a few days ago and it worked fine.
This model was generated by Transformers v5, and vLLM ships with Transformers v4.
You need to update Transformers in vLLM for this model to work.
Hello,
I had a similar problem with Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled. First I created this Dockerfile:
FROM vllm/vllm-openai:v0.18.1-cu130
RUN pip install --upgrade pip && \
pip install --upgrade transformers && \
pip install --upgrade tokenizers && \
pip install huggingface-hub sentencepiece
# Fix list-vs-set in qwen3_5 config for transformers 5.x compatibility
RUN python3 -c "\
import pathlib; \
p = pathlib.Path('/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py'); \
t = p.read_text(); \
t = t.replace('kwargs[\"ignore_keys_at_rope_validation\"] = [', 'kwargs[\"ignore_keys_at_rope_validation\"] = {'); \
t = t.replace('\"mrope_interleaved\",\n ]', '\"mrope_interleaved\",\n }'); \
p.write_text(t); \
print('Patched qwen3_5.py: list -> set')"
docker build -f Dockerfile.qwen35-opus -t vllm/vllm-openai:v0.18.1-cu130-opus .
Then, once it's built, save the following as a bash script (change the env variables depending on your use case; these are for two RTX Pro 6000 Max-Q GPUs):
docker run --name qwen35_opus_V3 --gpus '"device=0"' \
--privileged \
--ipc=host \
-p 8016:8000 \
-e OMP_NUM_THREADS=14 \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
-e VLLM_TRACE_FUNCTION=0 \
-e VLLM_WORKER_MULTIPROC_METHOD=spawn \
-e VLLM_SKIP_P2P_CHECK=1 \
-e CUDA_DEVICE_ORDER=PCI_BUS_ID \
-e CUDA_VISIBLE_DEVICES=0 \
-e NCCL_P2P_LEVEL=SYS \
-e NCCL_P2P_DISABLE=0 \
-e NCCL_IB_DISABLE=0 \
-e NCCL_CUMEM_ENABLE=0 \
-v ~/qwen3-docker/model_cache:/root/.cache/huggingface \
vllm/vllm-openai:v0.18.1-cu130-opus \
Jackrong/Qwopus3.5-27B-v3 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.94 \
--disable-custom-all-reduce \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--enable-prefix-caching \
--language-model-only \
--attention-backend FLASHINFER \
-O3
Then run: bash run_qwen35_27b.sh
Just found out this Jackrong/Qwopus3.5-27B-v3 model is missing the mtp tensors (Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is not), so a speculative config will make it crash. Skip `--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}'`; otherwise it should work.
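If you want to check a local checkpoint for the mtp tensors before enabling speculative decoding, the sharded safetensors index lists every tensor name, so a quick scan works. A sketch (`has_mtp_tensors` is a hypothetical helper; it assumes a standard sharded HF layout with model.safetensors.index.json and that the MTP weights carry "mtp" in their names):

```python
import json
import pathlib

def has_mtp_tensors(model_dir):
    # model.safetensors.index.json maps every tensor name to its shard file,
    # so we can inspect names without loading any weights.
    index = pathlib.Path(model_dir) / "model.safetensors.index.json"
    weight_map = json.loads(index.read_text())["weight_map"]
    return any("mtp" in name for name in weight_map)
```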
I am unable to use tools.
I got the model deployed with Docker and vLLM, and both reasoning and tool-calling work fine, using the following command:
docker pull vllm/vllm-openai:gemma4-cu130
docker run -itd --name qwen_distilled_v3 \
--gpus '"device=2"' \
--ipc=host \
--network host \
--shm-size 16G \
-e VLLM_LOGGING_LEVEL=INFO \
-e VLLM_API_KEY="$VLLM_API_KEY" \
-v /data/cfs/Qwen/Qwopus3.5-27B-v3:/data/model \
vllm/vllm-openai:gemma4-cu130 \
--model /data/model \
--served-model-name qwen-distilled-v3 \
--tensor-parallel-size 1 \
--quantization fp8 \
--max-model-len 131072 \
--gpu-memory-utilization 0.9 \
--enable-force-include-usage \
--enable-log-requests \
--enable-log-outputs \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--host 0.0.0.0 \
--port 8502
Check it out, and note that my CUDA version is 13.0; you can change the image to vllm/vllm-openai:gemma4 for CUDA 12.9, according to the Gemma4 user guide here: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
