Working well with vllm on 5090

#1
by tiho64 - opened

Thanks for the model and your work. Running with vLLM on a 5090 at about 60 T/s.

My Config:

    image: vllm/vllm-openai:cu130-nightly
    container_name: vllm-5090
    restart: unless-stopped
    volumes:
      - /opt/models/huggingface:/root/.cache/huggingface
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu
      - OMP_NUM_THREADS=4
    command: >
      Kbenkhaled/Qwen3.5-27B-NVFP4
      --max-model-len 131072
      --gpu-memory-utilization 0.82
      --enable-prefix-caching
      --swap-space 16
      --max-num-seqs 32
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --limit-mm-per-prompt.video 0
      --attention-backend FLASHINFER
      --async-scheduling
      --trust-remote-code
      --disable-log-requests
      --port 8000
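Once the container is up, the OpenAI-compatible endpoint on port 8000 can be exercised with a short script. A minimal sketch using only the standard library (the model name and port come from the config above; the prompt and sampling parameters are just illustrations):

```python
import json
import urllib.request

# Endpoint exposed by the vLLM container above (--port 8000).
API_URL = "http://localhost:8000/v1/chat/completions"

def build_request(prompt: str, model: str = "Kbenkhaled/Qwen3.5-27B-NVFP4") -> dict:
    """Build an OpenAI-style chat-completions payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.7,
    }

def query(prompt: str) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(query("Say hello in one short sentence."))
```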

Absolute gold!


2x5090 ?

Yeah, I think you need --tensor-parallel-size 2 to use both.

btw, for some reason I'm getting terrible accuracy with my RTX 3090 on my little test, but this card doesn't natively support NVFP4, maybe that's the reason? Anyone who tried running on an Ampere (RTX 3000-series) Nvidia GPU?
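On the NVFP4 question: native FP4 tensor-core support only arrived with the Blackwell generation, so Ampere cards like the 3090 have to run NVFP4 checkpoints through emulated/dequantized kernels, which is a plausible source of accuracy loss. A quick sketch of the capability check (the compute-capability values come from NVIDIA's published table; treat the >= 10.0 cutoff for native FP4 as an assumption):

```python
# Rough check for native NVFP4 (FP4 tensor-core) support.
# Assumption: native FP4 arrived with Blackwell, i.e. compute
# capability 10.0 and up; older cards fall back to emulation.

COMPUTE_CAPABILITY = {
    "RTX 3090 (Ampere)": (8, 6),
    "RTX 4090 (Ada)": (8, 9),
    "RTX 5090 (Blackwell)": (12, 0),
}

def supports_nvfp4_natively(cc: tuple) -> bool:
    # Tuple comparison: (8, 6) < (10, 0) < (12, 0).
    return cc >= (10, 0)

for name, cc in COMPUTE_CAPABILITY.items():
    print(f"{name}: sm_{cc[0]}{cc[1]}, native NVFP4 = {supports_nvfp4_natively(cc)}")
```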


It works fine with these params on Win11, single 5090, latest Docker Desktop:

    docker run -d --name qwen35-nvfp4 --gpus all --ipc=host -p 8000:8000 \
      -v e:/LLMRoot/Qwen3.5-27B-NVFP4:/model \
      vllm/vllm-openai:v0.18.0-cu130 /model \
      --max-model-len 102400 \
      --served-model-name qwen3.5-27b \
      --gpu-memory-utilization 0.94 \
      --max-num-seqs 4 \
      --enable-prefix-caching \
      --kv-cache-dtype fp8 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --reasoning-parser qwen3 \
      --limit-mm-per-prompt.video 0 \
      --async-scheduling \
      --trust-remote-code \
      --port 8000
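The --kv-cache-dtype fp8 flag is doing a lot of work here: it halves the per-token KV-cache footprint versus fp16, which is what helps a 102400-token context fit next to the weights on a 32 GB card. A back-of-the-envelope sketch of the standard KV-cache size formula (the layer/head/dim numbers below are placeholders for illustration, not the real Qwen3.5-27B architecture):

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    # Each token stores a key and a value vector per layer per KV head.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

# Placeholder architecture numbers, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
CTX = 102400  # --max-model-len from the command above

fp16 = kv_cache_bytes(CTX, LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8 = kv_cache_bytes(CTX, LAYERS, KV_HEADS, HEAD_DIM, 1)
print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB")  # 25.0 GiB
print(f"fp8  KV cache: {fp8 / 2**30:.1f} GiB")   # 12.5 GiB
```

Whatever the real dimensions are, the fp8 cache is exactly half the fp16 one, which is the point of the flag.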

Actually, after a lot of testing in a production environment I found three things:

  1. kv-cache-dtype fp8, or the combination of prefix caching and NVFP4, causes more artifacts and mixed-language output. That's very bad for verbatim text translation at the very least. The Qwen3.5 base model in fp8 doesn't have these issues.

  2. The KV cache needs to be watched with these settings when supporting multiple users (num seqs > 4).

  3. Also, --trust-remote-code isn't necessary. The model loads all the necessary files from HF anyway, so it's just an unnecessary security risk.
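For catching the mixed-language artifacts from point 1 automatically, a crude heuristic is to flag unexpected scripts in output that should be pure target-language text. A sketch assuming an English/Latin-script target (the function names and the 5% threshold are my own choices, tune them for your language pair):

```python
import unicodedata

def foreign_script_ratio(text: str) -> float:
    """Fraction of letters outside Latin script - a crude signal for
    mixed-language artifacts in (say) an English translation."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    foreign = sum(1 for ch in letters if "LATIN" not in unicodedata.name(ch, ""))
    return foreign / len(letters)

def looks_mixed(text: str, threshold: float = 0.05) -> bool:
    return foreign_script_ratio(text) > threshold

print(looks_mixed("The quick brown fox."))           # expect False
print(looks_mixed("The quick 快速的 brown fox 狐狸."))  # expect True
```

Running this over a sample of translations makes a regression like the fp8-cache artifact issue show up as a jump in the flagged fraction.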

Thanks for sharing @Thyrannius
