Nándorfi Vince

vLLM serving on AMD MI300X

This directory contains the infrastructure to serve Qwen 2.5 Instruct via vLLM on an AMD Instinct MI300X GPU through the AMD Developer Cloud.

The Streamlit app (app/main.py) and the LangGraph pipeline call this endpoint via the OpenAI-compatible REST API (/v1/chat/completions), using langchain-openai's ChatOpenAI adapter with a custom base_url.


1. Prerequisites

  • AMD AI Developer Program approval ($100 cloud credit per team member)
  • AMD Developer Cloud account, MI300X instance available
  • SSH access to the MI300X instance
  • (Optional) Hugging Face token if the model is gated (Qwen 2.5 is open, so this is not required for the default model)

2. Provision the MI300X instance

Follow the AMD Developer Cloud Getting Started guide: https://www.amd.com/en/developer/resources/technical-articles/2025/how-to-get-started-on-the-amd-developer-cloud-.html

The default ROCm-enabled image already includes Docker and the AMD GPU driver. Verify GPU access:

rocm-smi
# Expected: 1 × AMD Instinct MI300X listed

3. Pull the vLLM ROCm image

docker pull rocm/vllm:latest

Image size: ~30 GB (ROCm runtime + PyTorch + vLLM + dependencies).


4. Start the vLLM server

Option A — Docker (recommended)

# Generate the API key on the host first, so it can be printed again later
export VLLM_API_KEY=$(openssl rand -hex 32)

docker run --rm \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --ipc=host \
    --shm-size 16g \
    -p 8000:8000 \
    -e VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \
    -e VLLM_API_KEY \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    rocm/vllm:latest \
    sh -c 'vllm serve $VLLM_MODEL \
        --host 0.0.0.0 --port 8000 \
        --tensor-parallel-size 1 \
        --dtype auto \
        --gpu-memory-utilization 0.9 \
        --max-model-len 32768 \
        --api-key $VLLM_API_KEY'

The HF cache mount avoids re-downloading the ~28 GB Qwen 2.5 weights on container restart.

Print the API key that was generated (echo $VLLM_API_KEY from inside the container, or use a fixed string instead of openssl rand). You will paste this into the Streamlit app's .env as VLLM_API_KEY.
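Model loading can take a few minutes after the container starts. A small poll loop against vLLM's /health endpoint (a sketch; HOST and the retry budget are placeholders to adjust, and on the real instance RETRIES should be much higher) confirms readiness:

```shell
#!/bin/sh
# Poll vLLM's /health endpoint until the server answers.
# Defaults are deliberately small; raise RETRIES/DELAY on the real instance.
HOST=${HOST:-localhost:8000}
RETRIES=${RETRIES:-3}
DELAY=${DELAY:-1}
i=1
while [ "$i" -le "$RETRIES" ]; do
    if curl -fs "http://$HOST/health" > /dev/null; then
        echo "vLLM is ready"
        break
    fi
    echo "waiting for vLLM... (attempt $i/$RETRIES)"
    sleep "$DELAY"
    i=$((i + 1))
done
```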

Option B — serve.sh directly

If vLLM is pip-installed in a ROCm-enabled environment on the host:

chmod +x infra/vllm/serve.sh
VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \
VLLM_API_KEY=<your-key> \
./infra/vllm/serve.sh

5. Verify the endpoint

From any machine with network access to the MI300X:

curl http://<mi300x-public-ip>:8000/v1/models \
    -H "Authorization: Bearer <your-api-key>"

Expected response (truncated):

{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen2.5-14B-Instruct",
      "object": "model",
      "owned_by": "vllm",
      ...
    }
  ]
}

A simple chat-completion smoke test:

curl http://<mi300x-public-ip>:8000/v1/chat/completions \
    -H "Authorization: Bearer <your-api-key>" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-14B-Instruct",
        "messages": [{"role": "user", "content": "What is 2+2?"}],
        "temperature": 0.0
    }'

6. Connect the Streamlit app

In the project root .env:

LLM_PROFILE=vllm
VLLM_BASE_URL=http://<mi300x-public-ip>:8000/v1
VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct
VLLM_API_KEY=<your-key>

Then start the Streamlit app:

docker compose up langgraph-app

Or directly:

streamlit run app/main.py
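
Before launching the app, a quick preflight (a sketch, assuming a POSIX shell and that .env sits in the current directory) verifies the endpoint from .env is reachable:

```shell
# Preflight: check that the configured vLLM endpoint answers before starting
# the Streamlit app. Falls back to localhost defaults if .env is missing.
. ./.env 2>/dev/null || true
STATUS=$(curl -fs "${VLLM_BASE_URL:-http://localhost:8000/v1}/models" \
    -H "Authorization: Bearer ${VLLM_API_KEY:-unset}" > /dev/null \
    && echo reachable || echo unreachable)
echo "vLLM endpoint: $STATUS"
```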

7. Performance benchmark (expected)

On a single AMD MI300X (192 GB HBM3, ROCm 6.2+, vLLM 0.6+):

| Metric | Qwen 2.5 14B | Qwen 2.5 32B |
|---|---|---|
| Time-to-first-token | ~0.5 s | ~1.0 s |
| Throughput (single user) | 50-80 tok/s | 25-40 tok/s |
| Concurrent capacity (KV cache) | ~50 sessions | ~20 sessions |
| Max context length | 32K (configured) | 32K (configured) |

These numbers depend on prompt length, batch size, and the exact ROCm/vLLM version. Run vllm bench after startup to measure the actual numbers on your instance.


8. Cost monitoring

AMD Developer Cloud MI300X pricing (as of May 2026):

  • ~$4-8/hour pay-as-you-go

$100 / team-member × 3 team-members = $300 total credit. At $5/h, that's 60 hours of MI300X uptime. Plan accordingly:

  • Only run during demo/test/build sessions — stop the instance when idle
  • Keep one teammate's credit as failover/buffer for the final 24 hours
  • Run end-to-end smoke tests early so a hot fix doesn't burn deadline-day credits
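
The budget math above can be scripted as a quick burn-rate check (the hourly rate is the assumption stated above; adjust to the actual invoice):

```shell
# Burn-rate sanity check for the shared AMD Developer Cloud credit.
CREDIT=300   # $100 x 3 team members
RATE=5       # assumed $/hour for a MI300X instance
HOURS=$((CREDIT / RATE))
echo "MI300X hours available: $HOURS"
BUFFER=10    # hours held back for demo day
echo "Usable build/test hours: $((HOURS - BUFFER))"
```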

9. Plan B — local fallback if MI300X is unavailable

If the AMD credit doesn't arrive in time, or the MI300X instance has issues:

# Switch the Streamlit app to Ollama profile
LLM_PROFILE=ollama OLLAMA_MODEL=qwen2.5:7b-instruct streamlit run app/main.py

Pull the model first:

ollama pull qwen2.5:7b-instruct

This runs on a laptop GPU (or CPU) and lets development continue. Quality will be lower (7B vs 14B/32B), but the demo flow stays alive.


10. Production hardening (post-hackathon)

For an actual production deployment, beyond the hackathon scope:

  • Use a real reverse proxy (Caddy / Nginx) with TLS instead of the raw vLLM port
  • Rotate VLLM_API_KEY regularly
  • Set up Prometheus + Grafana for vLLM /metrics
  • Use --quantization flag for fp8/int8 to fit a larger model on smaller hardware
  • Configure --enable-prefix-caching for repeated long system prompts
  • Use SkyPilot (or a similar orchestrator) for multi-GPU and multi-region scaling
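
For the reverse-proxy item, a minimal Caddy config is enough (a sketch; the hostname is a placeholder, and Caddy provisions TLS automatically for a public DNS name):

```
# Caddyfile sketch: terminate TLS in front of the raw vLLM port.
vllm.example.com {
    reverse_proxy 127.0.0.1:8000
}
```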