Performance report on RTX 4090D (48GB VRAM): 40 t/s
#11
by SlavikF
System:
- Intel Xeon W5-3425 with DDR5-4800 RAM
- Nvidia RTX 4090D modded to 48GB of VRAM
Tested with a ~40k-token request and a ~2k-token response.
Using MTP (multi-token prediction) speculative decoding.
I can fit 128k context into the 48GB of VRAM.
Getting speed:
- PP (prompt processing): 4000 t/s
- TG (token generation): 40 to 44 t/s
My docker compose file:
```yaml
services:
  vllm:
    image: vllm/vllm-openai:v0.20.1-cu129-ubuntu2404
    container_name: vllm-qwen27B
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8000"
    environment:
      TORCH_CUDA_ARCH_LIST: "8.9"
    volumes:
      - /home/slavik/.cache:/root/.cache
    ipc: host
    command:
      - "--model"
      - "Qwen/Qwen3.6-27B-FP8"
      - "--max-model-len"
      - "131072"
      - "--served-model-name"
      - "local-vl-qwen27B"
      - "--gpu-memory-utilization"
      - "0.975"
      - "--performance-mode"
      - "interactivity"
      - "--trust-remote-code"
      - "--enable-auto-tool-choice"
      - "--tool-call-parser"
      - "qwen3_coder"
      - "--reasoning-parser"
      - "qwen3"
      - "--mm-encoder-tp-mode"
      - "data"
      - "--mm-processor-cache-type"
      - "shm"
      # MTP speculative decoding; see the table below for other num_speculative_tokens values
      - "--speculative-config"
      - '{"method":"mtp","num_speculative_tokens":2}'
      - "--compilation-config"
      - '{"max_cudagraph_capture_size":16,"mode":"VLLM_COMPILE"}'
      - "--async-scheduling"
      - "--attention-backend"
      - "flashinfer"
      - "--kv-cache-dtype"
      - "bfloat16"
      - "--enable-prefix-caching"
```
TG scaling with different numbers of speculative tokens:

| Config | Steady generation speed | Speedup vs. no spec decoding |
|---|---|---|
| No SpecDecoding | ~18.8 tok/s | 1.0x |
| MTP=2 | ~41.4 tok/s | ~2.2x |
| MTP=3 | ~45.4 tok/s | ~2.4x |
| MTP=4 | ~47.3 tok/s | ~2.5x |
| MTP=5 | ~48.0 tok/s | ~2.55x |
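For reproducing the sweep, here is a small sketch (illustrative only) that prints the `--speculative-config` value corresponding to each MTP row above; the compose file itself uses `num_speculative_tokens=2`:

```python
import json

# Print the --speculative-config JSON for each MTP row in the table above.
for n in range(2, 6):
    cfg = {"method": "mtp", "num_speculative_tokens": n}
    print(f"MTP={n}: --speculative-config '{json.dumps(cfg, separators=(',', ':'))}'")
```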
Compare that to llama.cpp with unsloth/Qwen3.6-27B-GGUF:Q6_K_XL (no MTP), which gets about 30 t/s:
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/discussions/7