Performance report on RTX 4090D (48GB VRAM): 40 t/s

#11
opened by SlavikF

System:

  • Intel Xeon W5-3425 with DDR5-4800 RAM
  • Nvidia RTX 4090D modded with 48GB VRAM

Tested with a request of about 40k tokens and a response of about 2k tokens.
Using MTP (multi-token prediction) speculative decoding.
I can fit 128k context into my 48GB VRAM.

Getting these speeds:
PP (prompt processing): 4000 t/s
TG (token generation): 40 to 44 t/s
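For anyone who wants to reproduce numbers like these, here is a rough sketch of a streaming benchmark against the OpenAI-compatible endpoint. The port and served model name are taken from my compose file below; the prompt is a stand-in (not my real 40k-token request), and counting streamed chunks is only an approximation of token counts:

```python
import time
from openai import OpenAI

# Endpoint and model name match the docker compose config below.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

prompt = "Summarize the following text:\n" + ("lorem ipsum " * 2000)  # stand-in long prompt

start = time.perf_counter()
stream = client.chat.completions.create(
    model="local-vl-qwen27B",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=2048,
    stream=True,
)

first_token_at = None
generated_chunks = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # prefill (PP) roughly ends here
        generated_chunks += 1
end = time.perf_counter()

if first_token_at is not None and generated_chunks > 1:
    # One streamed chunk is roughly one token, so treat these as estimates.
    print(f"time to first token: {first_token_at - start:.2f} s")
    print(f"decode (TG) speed: ~{generated_chunks / (end - first_token_at):.1f} tok/s")
```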

My docker compose file:

services:
  vllm:
    image: vllm/vllm-openai:v0.20.1-cu129-ubuntu2404
    container_name: vllm-qwen27B
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8000"
    environment:
      TORCH_CUDA_ARCH_LIST: "8.9" # 8.9 = Ada Lovelace (RTX 4090 / 4090D)
    volumes:
      - /home/slavik/.cache:/root/.cache
    ipc: host
    command:
      - "--model"
      - "Qwen/Qwen3.6-27B-FP8"
      - "--max-model-len"
      - "131072"
      - "--served-model-name"
      - "local-vl-qwen27B"
      - "--gpu-memory-utilization"
      - "0.975"
      - "--performance-mode"
      - "interactivity"
      - "--trust-remote-code"
      - "--enable-auto-tool-choice"
      - "--tool-call-parser"
      - "qwen3_coder"
      - "--reasoning-parser"
      - "qwen3"
      - "--mm-encoder-tp-mode"
      - "data"
      - "--mm-processor-cache-type"
      - "shm"
      - "--speculative-config"
      - '{"method":"mtp","num_speculative_tokens":2}'
      - "--compilation-config"
      - '{"max_cudagraph_capture_size":16,"mode":"VLLM_COMPILE"}'
      - "--async-scheduling"
      - "--attention-backend"
      - "flashinfer"
      - "--kv-cache-dtype"
      - "bfloat16"
      - "--enable-prefix-caching"
Results for different numbers of speculative (MTP) tokens:

| Config          | Steady generation speed | Speed vs no SpecDecoding |
|-----------------|-------------------------|--------------------------|
| No SpecDecoding | ~18.8 tok/s             | 1.0x                     |
| MTP=2           | ~41.4 tok/s             | ~2.2x                    |
| MTP=3           | ~45.4 tok/s             | ~2.4x                    |
| MTP=4           | ~47.3 tok/s             | ~2.5x                    |
| MTP=5           | ~48.0 tok/s             | ~2.55x                   |
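To double-check that speculative decoding is actually engaged rather than silently disabled, the server's Prometheus /metrics endpoint can be scraped. Exact metric names differ between vLLM versions, so this sketch just filters for anything speculative-decoding related:

```python
import requests

# The compose file maps the vLLM server to host port 8080; /metrics is its
# Prometheus endpoint. Metric names vary by vLLM version, so grep for any
# counters related to speculative decoding.
metrics = requests.get("http://localhost:8080/metrics", timeout=10).text
for line in metrics.splitlines():
    if "spec_decode" in line and not line.startswith("#"):
        print(line)
```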

Compare that to llama.cpp with unsloth/Qwen3.6-27B-GGUF:Q6_K_XL:

https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/discussions/7

I'm getting 30 t/s with llama.cpp (no MTP).
