Nándorfi Vince

vLLM serving on AMD MI300X

This directory contains the infrastructure to serve Qwen 2.5 Instruct via vLLM on an AMD Instinct MI300X GPU through the AMD Developer Cloud.

The Streamlit app (app/main.py) and the LangGraph pipeline call this endpoint via the OpenAI-compatible REST API (/v1/chat/completions), using langchain-openai's ChatOpenAI adapter with a custom base_url.


1. Prerequisites

  • AMD AI Developer Program approval ($100 cloud credit per team member)
  • AMD Developer Cloud account, MI300X instance available
  • SSH access to the MI300X instance
  • (Optional) Hugging Face token if the model is gated (Qwen 2.5 is open, so this is not required for the default model)

2. Provision the MI300X instance

Follow the AMD Developer Cloud Getting Started guide: https://www.amd.com/en/developer/resources/technical-articles/2025/how-to-get-started-on-the-amd-developer-cloud-.html

The default ROCm-enabled image already includes Docker and the AMD GPU driver. Verify GPU access:

rocm-smi
# Expected: 1 × AMD Instinct MI300X listed

3. Pull the vLLM ROCm image

docker pull rocm/vllm:latest

Image size: ~30 GB (ROCm runtime + PyTorch + vLLM + dependencies).


4. Start the vLLM server

Option A — Docker (recommended)

# Generate the API key on the host first, so it can be printed again later
export VLLM_API_KEY=$(openssl rand -hex 32)

docker run --rm \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --ipc=host \
    --shm-size 16g \
    -p 8000:8000 \
    -e VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \
    -e VLLM_API_KEY \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    rocm/vllm:latest \
    sh -c 'vllm serve $VLLM_MODEL \
        --host 0.0.0.0 --port 8000 \
        --tensor-parallel-size 1 \
        --dtype auto \
        --gpu-memory-utilization 0.9 \
        --max-model-len 32768 \
        --api-key $VLLM_API_KEY'

The HF cache mount avoids re-downloading the ~28 GB Qwen 2.5 weights on container restart.

Print the API key that was generated (echo $VLLM_API_KEY from inside the container, or use a fixed string instead of openssl rand). You will paste this into the Streamlit app's .env as VLLM_API_KEY.
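Model loading can take a few minutes after the container starts. A small poll loop against vLLM's /health endpoint (a sketch; HOST and the retry budget are placeholders to adjust, and on the real instance RETRIES should be much higher) confirms readiness:

```shell
#!/bin/sh
# Poll vLLM's /health endpoint until the server answers.
# Defaults are deliberately small; raise RETRIES/DELAY on the real instance.
HOST=${HOST:-localhost:8000}
RETRIES=${RETRIES:-3}
DELAY=${DELAY:-1}
i=1
while [ "$i" -le "$RETRIES" ]; do
    if curl -fs "http://$HOST/health" > /dev/null; then
        echo "vLLM is ready"
        break
    fi
    echo "waiting for vLLM... (attempt $i/$RETRIES)"
    sleep "$DELAY"
    i=$((i + 1))
done
```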

Option B — serve.sh directly

If vLLM is pip-installed in a ROCm-enabled environment on the host:

chmod +x infra/vllm/serve.sh
VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \
VLLM_API_KEY=<your-key> \
./infra/vllm/serve.sh

5. Verify the endpoint

From any machine with network access to the MI300X:

curl http://<mi300x-public-ip>:8000/v1/models \
    -H "Authorization: Bearer <your-api-key>"

Expected response (truncated):

{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen2.5-14B-Instruct",
      "object": "model",
      "owned_by": "vllm",
      ...
    }
  ]
}

A simple chat-completion smoke test:

curl http://<mi300x-public-ip>:8000/v1/chat/completions \
    -H "Authorization: Bearer <your-api-key>" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-14B-Instruct",
        "messages": [{"role": "user", "content": "What is 2+2?"}],
        "temperature": 0.0
    }'

6. Connect the Streamlit app

In the project root .env:

LLM_PROFILE=vllm
VLLM_BASE_URL=http://<mi300x-public-ip>:8000/v1
VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct
VLLM_API_KEY=<your-key>

Then start the Streamlit app:

docker compose up langgraph-app

Or directly:

streamlit run app/main.py
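
Before launching the app, a quick preflight (a sketch, assuming a POSIX shell and that .env sits in the current directory) verifies the endpoint from .env is reachable:

```shell
# Preflight: check that the configured vLLM endpoint answers before starting
# the Streamlit app. Falls back to localhost defaults if .env is missing.
. ./.env 2>/dev/null || true
STATUS=$(curl -fs "${VLLM_BASE_URL:-http://localhost:8000/v1}/models" \
    -H "Authorization: Bearer ${VLLM_API_KEY:-unset}" > /dev/null \
    && echo reachable || echo unreachable)
echo "vLLM endpoint: $STATUS"
```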

7. Performance benchmark (expected)

On a single AMD MI300X (192 GB HBM3, ROCm 6.2+, vLLM 0.6+):

| Metric | Qwen 2.5 14B | Qwen 2.5 32B |
|---|---|---|
| Time-to-first-token | ~0.5 s | ~1.0 s |
| Throughput (single user) | 50-80 tok/s | 25-40 tok/s |
| Concurrent capacity (KV cache) | ~50 sessions | ~20 sessions |
| Max context length | 32K (configured) | 32K (configured) |

These numbers depend on prompt length, batch size, and the exact ROCm/vLLM version. Run vllm bench after startup to measure the actual numbers on your instance.


8. Cost monitoring

AMD Developer Cloud MI300X pricing (as of May 2026):

  • ~$4-8/hour pay-as-you-go

$100 / team-member × 3 team-members = $300 total credit. At $5/h, that's 60 hours of MI300X uptime. Plan accordingly:

  • Only run during demo/test/build sessions — stop the instance when idle
  • Keep one teammate's credit as failover/buffer for the final 24 hours
  • Run end-to-end smoke tests early so a hot fix doesn't burn deadline-day credits
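
The budget math above can be scripted as a quick burn-rate check (the hourly rate is the assumption stated above; adjust to the actual invoice):

```shell
# Burn-rate sanity check for the shared AMD Developer Cloud credit.
CREDIT=300   # $100 x 3 team members
RATE=5       # assumed $/hour for a MI300X instance
HOURS=$((CREDIT / RATE))
echo "MI300X hours available: $HOURS"
BUFFER=10    # hours held back for demo day
echo "Usable build/test hours: $((HOURS - BUFFER))"
```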

9. Plan B — local fallback if MI300X is unavailable

If the AMD credit doesn't arrive in time, or the MI300X instance has issues:

# Switch the Streamlit app to Ollama profile
LLM_PROFILE=ollama OLLAMA_MODEL=qwen2.5:7b-instruct streamlit run app/main.py

Pull the model first:

ollama pull qwen2.5:7b-instruct

This runs on a laptop GPU (or CPU) and lets development continue. Quality will be lower (7B vs 14B/32B), but the demo flow stays alive.


10. Production hardening (post-hackathon)

For an actual production deployment, beyond the hackathon scope:

  • Use a real reverse proxy (Caddy / Nginx) with TLS instead of the raw vLLM port
  • Rotate VLLM_API_KEY regularly
  • Set up Prometheus + Grafana for vLLM /metrics
  • Use --quantization flag for fp8/int8 to fit a larger model on smaller hardware
  • Configure --enable-prefix-caching for repeated long system prompts
  • Use SkyPilot (or a similar orchestrator) for multi-GPU and multi-region scaling
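
For the reverse-proxy item, a minimal Caddy config is enough (a sketch; the hostname is a placeholder, and Caddy provisions TLS automatically for a public DNS name):

```
# Caddyfile sketch: terminate TLS in front of the raw vLLM port.
vllm.example.com {
    reverse_proxy 127.0.0.1:8000
}
```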