Running on 2 RTX Pro 6000 Blackwell GPUs at ~30 tps (Instructions that worked for me)
Prereqs
- CUDA 13.2 toolkit installed (this guide assumes /usr/local/cuda-13.2)
- GCC 11+ (12 preferred), Python 3.12, recent pip
- At least 64 GB RAM and 100 GB free disk for the build (model weights are an additional ~140 GB)
- 2 RTX PRO 6000 Blackwell GPUs
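Quick sanity check before starting (the paths and versions are the ones this guide assumes; adjust to your setup):
# Verify toolchain, GPUs, RAM, and disk
nvcc --version       # should report the CUDA 13.x toolkit
gcc --version        # 11+, ideally 12
python3.12 --version
nvidia-smi           # both RTX PRO 6000s should be listed
free -g              # >= 64 GB RAM
df -h .              # >= 100 GB free for the build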
# Clean environment
python3.12 -m venv ~/vllm-build-env
source ~/vllm-build-env/bin/activate
pip install --upgrade pip wheel setuptools
# Get vLLM source
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Pin to a known-good commit
git checkout c3ad791e1 # from your version string 0.20.1rc1.dev152+gc3ad791e1
# Install build-time torch matching your runtime
pip install torch==2.11.0 torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu130
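Quick check that the build-time torch matches the CUDA toolkit before compiling anything:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"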
# Build environment
export CUDA_HOME=/usr/local/cuda-13.2 # or wherever yours lives
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# THE KEY FLAG: target SM120f specifically
# Use 12.0a + 12.0f together so kernels with arch-specific (a) and family-specific (f)
# variants both get compiled. Drop other archs to make the build faster and the wheel smaller.
export TORCH_CUDA_ARCH_LIST="12.0a;12.0f"
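To confirm the arch string actually matches your silicon (a Blackwell RTX PRO 6000 should report compute capability 12.0):
nvidia-smi --query-gpu=name,compute_cap --format=csv
# or through torch:
python -c "import torch; print(torch.cuda.get_device_capability(0))"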
# Optional but recommended: limit parallel jobs so you don't OOM on the host
export MAX_JOBS=8
export NVCC_THREADS=2
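If you're not sure what the host can handle, a rough heuristic (my rule of thumb, not an official one) is one nvcc job per ~8 GB of system RAM:
# e.g. 64 GB RAM -> MAX_JOBS=8
export MAX_JOBS=$(( $(free -g | awk '/^Mem:/{print $2}') / 8 ))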
# Build the wheel
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
python setup.py bdist_wheel
# Result: dist/vllm-0.20.1rc1.dev152+gc3ad791e1-cp312-cp312-linux_x86_64.whl
# Test it locally first
pip install dist/vllm-*.whl
python -c "import vllm; print(vllm.__version__)"
My launch script:
#!/usr/bin/env bash
source ~/vllm-build-env/bin/activate
TORCH_CUDA_ARCH_LIST="12.0f" \
CUDA_HOME=/usr/local/cuda-13.2 \
vllm serve /wherever/your/models/exist/Mistral-Medium-3.5-128B \
--host 127.0.0.1 \
--port 5001 \
--served-model-name mistral-medium \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--max-num-seqs 1 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--load-format mistral \
--tokenizer-mode mistral \
--config-format mistral \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--reasoning-parser mistral \
--limit-mm-per-prompt '{"image": 4}' \
--speculative-config '{"model": "/wherever/your/models/exist/Mistral-Medium-3.5-128B-EAGLE", "num_speculative_tokens": 1, "method": "eagle", "draft_tensor_parallel_size": 2}'
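Once it's up, a quick smoke test against the OpenAI-compatible endpoint (the model name matches --served-model-name above):
curl http://127.0.0.1:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral-medium",
        "messages": [{"role": "user", "content": "Say hi in one word."}],
        "max_tokens": 16
      }'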
How is it? How would you rate it compared to other models that work on 2x RTX 6000 Pros?
Also, how is the KV cache? With 192 GB of VRAM and ~140 GB for the weights, I am curious how many parallel calls you are getting, since this is a dense model. I have 4x H100 GPUs, currently serving an image-gen model, Docling, Nemotron 3 Nano Omni, Nemotron 3 Super, and Gemma 4 26B (most of these I have switched to NVFP4 to save on VRAM).
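Back-of-envelope, since the real config isn't public: KV cache per token is 2 (K+V) x layers x kv_heads x head_dim x bytes/element. With made-up but plausible numbers for a dense ~128B GQA model:
# ASSUMED config, purely illustrative: 60 layers, 8 KV heads, head_dim 128, fp8 = 1 byte/elem
echo $(( 2 * 60 * 8 * 128 ))   # ~123 KB of KV cache per token
# ~50 GB left after weights -> on the order of 400k cacheable tokens,
# i.e. only a handful of concurrent full-context (131k) requests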
However, it might be worth ditching it all and running this on the 4x H100s if it can match or outperform Claude Sonnet 4.5. I have emailed Mistral about that "enterprise license", though, and am still waiting on a reply.
docker run -d --gpus all -p 8000:8000 --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /mnt/models:/models \
  -e OMP_NUM_THREADS=32 \
  --entrypoint bash vllm/vllm-openai:nightly \
  -c 'apt update; apt install -y git; \
      pip install git+https://github.com/mistralai/mistral-common.git \
                  git+https://github.com/huggingface/transformers.git && \
      exec python3 -m vllm.entrypoints.openai.api_server \
        --model /models/mistral-medium-3.5 \
        --tensor-parallel-size 2 \
        --tool-call-parser mistral \
        --enable-auto-tool-choice \
        --reasoning-parser mistral \
        --tokenizer-mode mistral \
        --config-format mistral \
        --load-format mistral \
        --served-model-name mistral-medium \
        --gpu_memory_utilization 0.93 \
        --kv-cache-dtype fp8_per_token_head \
        --max-num-seqs 2 \
        --max-num-batched-tokens 8192 \
        --enable-log-requests \
        --max-log-len 65536 \
        --default-chat-template-kwargs '\''{"reasoning_effort":"high"}'\'' \
        --override-generation-config '\''{"temperature":0.7}'\'''
This is what I have been using; the main difference is fp8_per_token_head instead of fp8, which doesn't seem to hit the model's quality as hard as plain fp8.
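To check the container actually came up (docker run -d prints the container ID):
docker logs -f <container_id>         # wait for the API server's startup-complete line
curl http://localhost:8000/v1/models  # should list mistral-medium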
How about the EAGLE model?