Working well with vllm on 5090

#1
by tiho64 - opened

Thanks for the model and your work. Running with vLLM on a 5090 at about 60 T/s.

My Config:

    image: vllm/vllm-openai:cu130-nightly
    container_name: vllm-5090
    restart: unless-stopped
    volumes:
      - /opt/models/huggingface:/root/.cache/huggingface
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu
      - OMP_NUM_THREADS=4
    command: >
      Kbenkhaled/Qwen3.5-27B-NVFP4
      --max-model-len 131072
      --gpu-memory-utilization 0.82
      --enable-prefix-caching
      --swap-space 16
      --max-num-seqs 32
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --limit-mm-per-prompt.video 0
      --attention-backend FLASHINFER
      --async-scheduling
      --trust-remote-code
      --disable-log-requests
      --port 8000
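Once the container is up, the OpenAI-compatible endpoint on port 8000 can be exercised with a short script. A minimal sketch using only the standard library (the model name and port come from the config above; the prompt and sampling parameters are just illustrations):

```python
import json
import urllib.request

# Endpoint exposed by the vLLM container above (--port 8000).
API_URL = "http://localhost:8000/v1/chat/completions"

def build_request(prompt: str, model: str = "Kbenkhaled/Qwen3.5-27B-NVFP4") -> dict:
    """Build an OpenAI-style chat-completions payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.7,
    }

def query(prompt: str) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(query("Say hello in one short sentence."))
```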

Absolute gold!


2x5090 ?

Yeah, I think you need --tensor-parallel-size 2 to use both.

btw, for some reason I'm getting terrible accuracy with my RTX 3090 on my little test, but this card doesn't natively support NVFP4, maybe that's the reason? Anyone who tried running on an Ampere (RTX 3000-series) Nvidia GPU?
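On the NVFP4 question: native FP4 tensor-core support only arrived with the Blackwell generation, so Ampere cards like the 3090 have to run NVFP4 checkpoints through emulated/dequantized kernels, which is a plausible source of accuracy loss. A quick sketch of the capability check (the compute-capability values come from NVIDIA's published table; treat the >= 10.0 cutoff for native FP4 as an assumption):

```python
# Rough check for native NVFP4 (FP4 tensor-core) support.
# Assumption: native FP4 arrived with Blackwell, i.e. compute
# capability 10.0 and up; older cards fall back to emulation.

COMPUTE_CAPABILITY = {
    "RTX 3090 (Ampere)": (8, 6),
    "RTX 4090 (Ada)": (8, 9),
    "RTX 5090 (Blackwell)": (12, 0),
}

def supports_nvfp4_natively(cc: tuple) -> bool:
    # Tuple comparison: (8, 6) < (10, 0) < (12, 0).
    return cc >= (10, 0)

for name, cc in COMPUTE_CAPABILITY.items():
    print(f"{name}: sm_{cc[0]}{cc[1]}, native NVFP4 = {supports_nvfp4_natively(cc)}")
```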


It works fine with these params on Win11, single 5090, latest Docker Desktop:

    docker run -d --name qwen35-nvfp4 --gpus all --ipc=host -p 8000:8000 \
      -v e:/LLMRoot/Qwen3.5-27B-NVFP4:/model \
      vllm/vllm-openai:v0.18.0-cu130 /model \
      --max-model-len 102400 \
      --served-model-name qwen3.5-27b \
      --gpu-memory-utilization 0.94 \
      --max-num-seqs 4 \
      --enable-prefix-caching \
      --kv-cache-dtype fp8 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --reasoning-parser qwen3 \
      --limit-mm-per-prompt.video 0 \
      --async-scheduling \
      --trust-remote-code \
      --port 8000
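The --kv-cache-dtype fp8 flag is doing a lot of work here: it halves the per-token KV-cache footprint versus fp16, which is what helps a 102400-token context fit next to the weights on a 32 GB card. A back-of-the-envelope sketch of the standard KV-cache size formula (the layer/head/dim numbers below are placeholders for illustration, not the real Qwen3.5-27B architecture):

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    # Each token stores a key and a value vector per layer per KV head.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

# Placeholder architecture numbers, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
CTX = 102400  # --max-model-len from the command above

fp16 = kv_cache_bytes(CTX, LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8 = kv_cache_bytes(CTX, LAYERS, KV_HEADS, HEAD_DIM, 1)
print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB")  # 25.0 GiB
print(f"fp8  KV cache: {fp8 / 2**30:.1f} GiB")   # 12.5 GiB
```

Whatever the real dimensions are, the fp8 cache is exactly half the fp16 one, which is the point of the flag.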

Actually, after a lot of testing in a production environment I found three things:

  1. kv-cache-dtype fp8, or the combination of prefix caching and NVFP4, causes more artifacts and mixed-language output. That's very bad for verbatim text translation at the very least. The Qwen3.5 base model in fp8 doesn't have these issues.

  2. The KV cache needs to be watched with these settings when supporting multiple users (num seqs > 4).

  3. Also, --trust-remote-code isn't necessary. The model loads all the necessary files from HF anyway, so it's just an unnecessary security risk.
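For catching the mixed-language artifacts from point 1 automatically, a crude heuristic is to flag unexpected scripts in output that should be pure target-language text. A sketch assuming an English/Latin-script target (the function names and the 5% threshold are my own choices, tune them for your language pair):

```python
import unicodedata

def foreign_script_ratio(text: str) -> float:
    """Fraction of letters outside Latin script - a crude signal for
    mixed-language artifacts in (say) an English translation."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    foreign = sum(1 for ch in letters if "LATIN" not in unicodedata.name(ch, ""))
    return foreign / len(letters)

def looks_mixed(text: str, threshold: float = 0.05) -> bool:
    return foreign_script_ratio(text) > threshold

print(looks_mixed("The quick brown fox."))           # expect False
print(looks_mixed("The quick 快速的 brown fox 狐狸."))  # expect True
```

Running this over a sample of translations makes a regression like the fp8-cache artifact issue show up as a jump in the flagged fraction.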

Thanks for sharing @Thyrannius
