vllm (SM70) V100 support

#5
by FayeQuant - opened

(SM70) V100 cannot run this model.

I made it work.

  1. First, install vLLM from source as an editable install.
  2. Second, apply this bugfix:

    https://github.com/vllm-project/vllm/pull/36026

  3. Finally, launch the server with this script:
# --mm-encoder-attn-backend TORCH_SDPA is very important for multimodal on V100.
# --gpu-memory-utilization: try 0.5 the first time, then increase based on free VRAM.
# (The notes are kept outside the command: an inline "# comment" after a
#  trailing backslash would break the shell line continuation.)
CUDA_VISIBLE_DEVICES=0,1,2,3 \
VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3000 \
vllm serve "/home/myname/models/Qwen3.5-35B-A3B-GPTQ-Int4" \
  --host 0.0.0.0 \
  --port 8018 \
  --served-model-name "Qwen3.5-35B-A3B-GPTQ-Int4" \
  --mm-encoder-attn-backend TORCH_SDPA \
  --max-model-len auto \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.79 \
  --tensor-parallel-size 4
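Once the server is up, any OpenAI-compatible client can talk to it on port 8018. A minimal sketch of the request body you would POST to `http://<host>:8018/v1/chat/completions` (the model name and port come from the serve command above; the helper function itself is hypothetical, not part of vLLM):

```python
import json

# Hypothetical helper: builds the JSON body for a chat-completions
# request against the OpenAI-compatible endpoint that `vllm serve` exposes.
def build_chat_request(model, messages, max_tokens=512, temperature=0.7):
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_chat_request(
    "Qwen3.5-35B-A3B-GPTQ-Int4",  # must match --served-model-name
    [{"role": "user", "content": "Hello!"}],
)
print(json.dumps(payload, indent=2))
```

Send it with any HTTP client, or point the `openai` Python SDK at `base_url="http://<host>:8018/v1"`.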

(SM70) V100 cannot run this model.

It can, on a 32 GB V100, but you have to build vLLM from source and apply the bugfix linked above.
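As a rough sanity check on `--gpu-memory-utilization`: vLLM pre-allocates that fraction of each GPU's VRAM for weights plus KV cache, so on 4x 32 GB V100s the 0.79 setting above works out as follows (back-of-envelope arithmetic only; exact accounting is up to the runtime):

```python
# Illustrative budget math for --gpu-memory-utilization 0.79 on 4x V100 32G.
GIB = 1024**3
per_gpu_vram = 32 * GIB
utilization = 0.79        # value from the serve command above
tensor_parallel = 4       # --tensor-parallel-size 4

budget_per_gpu = per_gpu_vram * utilization
total_budget = budget_per_gpu * tensor_parallel
print(f"per-GPU budget:  {budget_per_gpu / GIB:.2f} GiB")   # 25.28 GiB
print(f"total (TP=4):    {total_budget / GIB:.2f} GiB")     # 101.12 GiB
```

Starting at 0.5 first (16 GiB per GPU) leaves headroom for the multimodal encoder and CUDA context; raise it once you see how much VRAM is actually free.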
