vLLM Support Query

#2
by hrithiksagar-bgen - opened

Dear team, is there vLLM support available for this model yet? @bluelike @littlebird13
Does the deployment guide at https://qwen.readthedocs.io/en/latest/deployment/vllm.html apply here, i.e. can I use the libraries mentioned in that link?

@dineshananthi In that link they only mention Qwen3-VL-4B and Qwen3-VL-30B, not the 235B Thinking or Instruct variants.
(screenshot of the documentation page attached)

@bluelike @littlebird13 Could you please help me with this?

https://github.com/QwenLM/Qwen3-VL?tab=readme-ov-file#online-serving:~:text=vllm.ai/nightly-,Online%20Serving,-You%20can%20start

I used the online serving code from that link:
Installation

pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-vl-utils==0.0.14
# pip install 'vllm>0.10.2' # If this does not work, use the nightly install below.
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
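After installing, it is worth confirming that the resolved vLLM is actually newer than 0.10.2 before starting the server. A minimal sketch, assuming only the Python standard library; `version_newer_than` is a hypothetical helper, not part of vLLM:

```python
# Check that the installed vLLM is newer than 0.10.2, as the install
# comment above requires. version_newer_than is a hypothetical helper.
from importlib import metadata


def version_newer_than(installed: str, minimum: str) -> bool:
    """Compare dotted numeric versions, e.g. '0.11.0' vs '0.10.2'."""
    def parse(v: str):
        # Keep only leading numeric components ('0.11.0rc1' -> (0, 11, 0)).
        parts = []
        for piece in v.split("."):
            digits = ""
            for ch in piece:
                if ch.isdigit():
                    digits += ch
                else:
                    break
            if not digits:
                break
            parts.append(int(digits))
        return tuple(parts)

    return parse(installed) > parse(minimum)


try:
    v = metadata.version("vllm")
    print("vllm", v, "OK" if version_newer_than(v, "0.10.2") else "too old, use the nightly wheel")
except metadata.PackageNotFoundError:
    print("vllm is not installed")
```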

Online Serving
You can start either a vLLM or SGLang server to serve LLMs efficiently, and then access it using an OpenAI-style API.

vLLM server
# FP8 requires NVIDIA H100+ and CUDA 12+
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --served-model-name Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --enable-expert-parallel \
  --host 0.0.0.0 \
  --port 22002 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.70 \
  --quantization fp8 \
  --distributed-executor-backend mp
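Once the server is up, you can query it with an OpenAI-style chat completion request. A minimal sketch using only the standard library; the host/port come from the command above, the image URL is a placeholder, and the actual POST is commented out since it needs a running server:

```python
# Build an OpenAI-style chat request for the vLLM server started above.
# Assumes the server is listening on localhost:22002 as configured.
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",  # must match --served-model-name
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 256,
}

req = urllib.request.Request(
    "http://localhost:22002/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```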

This command worked, but when I try to run offline inference code to load the model on my own GPUs, that's when I hit issues. I believe vLLM has still not added offline support for this model; am I right?
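For reference, this is the kind of offline-inference code being attempted. A minimal sketch, assuming a vLLM build (e.g. the nightly wheel from the install step) that actually supports the 235B checkpoint; the parallel settings mirror the serve command above, and the import is done lazily so the file loads even without vLLM or GPUs present:

```python
# Hypothetical offline-inference sketch for the same checkpoint.
# Whether this works depends on the installed vLLM supporting
# Qwen3-VL-235B; the nightly wheel may be required.

def run_offline(prompts):
    # Imported lazily: requires vLLM and 8 suitable GPUs at call time.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-VL-235B-A22B-Instruct",
        tensor_parallel_size=8,          # mirrors the serve command above
        gpu_memory_utilization=0.70,
        dtype="bfloat16",
    )
    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(prompts, params)
    return [o.outputs[0].text for o in outputs]

# Usage (needs the GPUs and model weights available locally):
# print(run_offline(["Describe the Qwen3-VL model family."]))
```

If this raises an unsupported-architecture error at load time, the installed vLLM likely has not picked up the model yet.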

hrithiksagar-bgen changed discussion status to closed
