vLLM serving on AMD MI300X
This directory contains the infrastructure to serve Qwen 2.5 Instruct via vLLM on an AMD Instinct MI300X GPU through the AMD Developer Cloud.
The Streamlit app (app/main.py) and the LangGraph pipeline call this
endpoint via the OpenAI-compatible REST API (/v1/chat/completions),
using langchain-openai's ChatOpenAI adapter with a custom base_url.
1. Prerequisites
- AMD AI Developer Program approval ($100 cloud credit per team member)
  - Sign up: https://www.amd.com/en/developer/ai-dev-program.html
  - Approval typically takes 2 business days, up to 1 week
- AMD Developer Cloud account, MI300X instance available
- SSH access to the MI300X instance
- (Optional) Hugging Face token if the model is gated (Qwen 2.5 is open, so this is not required for the default model)
2. Provision the MI300X instance
Follow the AMD Developer Cloud Getting Started guide: https://www.amd.com/en/developer/resources/technical-articles/2025/how-to-get-started-on-the-amd-developer-cloud-.html
The default ROCm-enabled image already includes Docker and the AMD GPU driver. Verify GPU access:
rocm-smi
# Expected: 1 × AMD Instinct MI300X listed
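If you prefer a programmatic check, the sketch below verifies that the MI300X is visible to a ROCm build of PyTorch (an assumption: it needs a ROCm-enabled torch install, e.g. inside the rocm/vllm container pulled in the next step; ROCm PyTorch reuses the torch.cuda namespace):
# gpu_check.py -- sanity check that the MI300X is visible to PyTorch (ROCm build).
# Assumes a ROCm-enabled PyTorch install, e.g. inside the rocm/vllm container.
import torch

if not torch.cuda.is_available():  # ROCm PyTorch exposes the GPU via torch.cuda
    raise SystemExit("No GPU visible -- check /dev/kfd, /dev/dri and the 'video' group")

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # Expect something like: 0: AMD Instinct MI300X, 192 GB
    print(f"{i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")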
3. Pull the vLLM ROCm image
docker pull rocm/vllm:latest
Image size: ~30 GB (ROCm runtime + PyTorch + vLLM + dependencies).
4. Start the vLLM server
Option A — Docker (recommended)
docker run --rm \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--ipc=host \
--shm-size 16g \
-p 8000:8000 \
-e VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \
-e VLLM_API_KEY=$(openssl rand -hex 32) \
-v ~/.cache/huggingface:/root/.cache/huggingface \
rocm/vllm:latest \
sh -c 'vllm serve $VLLM_MODEL \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 1 \
--dtype auto \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--api-key $VLLM_API_KEY'
The HF cache mount avoids re-downloading the ~28 GB Qwen 2.5 weights on container restart.
Record the generated API key (for example, docker exec into the running container and echo $VLLM_API_KEY, or pass a fixed string instead of the openssl rand output). You will paste this value into the Streamlit app's .env as VLLM_API_KEY; a small sketch for generating a reusable key up front follows below.
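If you prefer to mint a fixed key before starting the server, a minimal sketch (the output file name is illustrative, not part of this repo):
# make_key.py -- generate a reusable API key instead of openssl rand inside docker run.
# The .env.vllm-key path below is illustrative; adjust to wherever you keep secrets.
import secrets
from pathlib import Path

key = secrets.token_hex(32)  # same entropy as `openssl rand -hex 32`
Path(".env.vllm-key").write_text(f"VLLM_API_KEY={key}\n")
print(f"VLLM_API_KEY={key}")  # paste into the server command and the app's .env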
Option B — serve.sh directly
If vLLM is pip-installed in a ROCm-enabled environment on the host:
chmod +x infra/vllm/serve.sh
VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \
VLLM_API_KEY=<your-key> \
./infra/vllm/serve.sh
5. Verify the endpoint
From any machine with network access to the MI300X:
curl http://<mi300x-public-ip>:8000/v1/models \
-H "Authorization: Bearer <your-api-key>"
Expected response (truncated):
{
"object": "list",
"data": [
{
"id": "Qwen/Qwen2.5-14B-Instruct",
"object": "model",
"owned_by": "vllm",
...
}
]
}
A simple chat-completion smoke test:
curl http://<mi300x-public-ip>:8000/v1/chat/completions \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-14B-Instruct",
"messages": [{"role": "user", "content": "What is 2+2?"}],
"temperature": 0.0
}'
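The same smoke test can be run with the official openai Python client, which is effectively what the LangChain adapter does under the hood (a sketch; fill in the IP and key placeholders):
# smoke_test.py -- minimal OpenAI-client check against the vLLM endpoint.
# Replace <mi300x-public-ip> and <your-api-key> with your actual values.
from openai import OpenAI

client = OpenAI(
    base_url="http://<mi300x-public-ip>:8000/v1",
    api_key="<your-api-key>",
)

# List the served models -- should include Qwen/Qwen2.5-14B-Instruct
print([m.id for m in client.models.list().data])

# Deterministic chat-completion smoke test
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=0.0,
)
print(resp.choices[0].message.content)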
6. Connect the Streamlit app
In the project root .env:
LLM_PROFILE=vllm
VLLM_BASE_URL=http://<mi300x-public-ip>:8000/v1
VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct
VLLM_API_KEY=<your-key>
Then start the Streamlit app:
docker compose up langgraph-app
Or directly:
streamlit run app/main.py
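For reference, this is roughly how the ChatOpenAI adapter mentioned in the intro is pointed at the vLLM endpoint (a sketch, not the actual app/main.py code; it assumes only the .env variables listed above):
# Sketch of the vLLM profile wiring -- not the actual app code.
import os
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url=os.environ["VLLM_BASE_URL"],  # e.g. http://<mi300x-public-ip>:8000/v1
    api_key=os.environ["VLLM_API_KEY"],
    model=os.environ.get("VLLM_MODEL", "Qwen/Qwen2.5-14B-Instruct"),
    temperature=0.0,
)

print(llm.invoke("Reply with the single word: ready").content)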
7. Performance benchmark (expected)
On a single AMD MI300X (192 GB HBM3, ROCm 6.2+, vLLM 0.6+):
| Metric | Qwen 2.5 14B | Qwen 2.5 32B |
|---|---|---|
| Time-to-first-token | ~0.5 s | ~1.0 s |
| Throughput (single user) | 50-80 tok/s | 25-40 tok/s |
| Concurrent capacity (KV-cache) | ~50 sessions | ~20 sessions |
| Max context length | 32K (configured) | 32K (configured) |
These numbers depend on prompt length, batch size, and the exact ROCm/vLLM
version. Run a benchmark with vllm bench after startup for the actual
numbers on your instance.
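For a quick sanity number without a full benchmark run, a rough streaming measurement against the endpoint looks like this (a sketch using the openai client; chunk counts only approximate token counts, so treat the output as an estimate and use vllm bench for real numbers):
# rough_bench.py -- crude TTFT / throughput estimate via streaming.
import time
from openai import OpenAI

client = OpenAI(base_url="http://<mi300x-public-ip>:8000/v1", api_key="<your-api-key>")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct",
    messages=[{"role": "user", "content": "Write a 200-word summary of the water cycle."}],
    max_tokens=300,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1

if first_token_at is None:
    raise SystemExit("No tokens received -- check the endpoint and API key")

total = time.perf_counter() - start
gen_time = total - (first_token_at - start)
print(f"TTFT: {first_token_at - start:.2f} s")
print(f"~{chunks / gen_time:.1f} chunks/s (roughly tokens/s)")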
8. Cost monitoring
AMD Developer Cloud MI300X pricing (as of May 2026):
- ~$4-8/hour pay-as-you-go
$100 per team member × 3 team members = $300 total credit. At ~$5/h that is
roughly 60 hours of MI300X uptime. Plan accordingly (a quick budget sketch
follows this list):
- Only run during demo/test/build sessions — stop the instance when idle
- Keep one teammate's credit as failover/buffer for the final 24 hours
- Run end-to-end smoke tests early so a hot fix doesn't burn deadline-day credits
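The budget math above as a small helper you can adapt (the rates are the same rough assumptions as in the list, not billed prices):
# budget.py -- rough credit-burn estimate; rates are assumptions, not billed prices.
CREDIT_PER_MEMBER = 100   # USD, AMD AI Developer Program credit
TEAM_SIZE = 3
HOURLY_RATE = 5.0         # USD/h, within the ~$4-8/h pay-as-you-go range

total_credit = CREDIT_PER_MEMBER * TEAM_SIZE
hours = total_credit / HOURLY_RATE
print(f"${total_credit} total credit ~ {hours:.0f} h of MI300X uptime at ${HOURLY_RATE}/h")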
9. Plan B — local fallback if MI300X is unavailable
If the AMD credit doesn't arrive in time, or the MI300X instance has issues:
# Switch the Streamlit app to Ollama profile
LLM_PROFILE=ollama OLLAMA_MODEL=qwen2.5:7b-instruct streamlit run app/main.py
Pull the model first:
ollama pull qwen2.5:7b-instruct
This runs on a laptop GPU (or CPU) and lets development continue. Quality will be lower (7B vs 14B/32B), but the demo flow stays alive.
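The profile switch amounts to constructing a different chat model; a minimal sketch of the idea (not the actual app code; it assumes langchain-ollama is installed for the fallback path):
# Sketch of the LLM_PROFILE switch -- not the actual app code.
# Assumes langchain-openai for the vLLM path and langchain-ollama for the fallback.
import os
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama

def make_llm():
    if os.environ.get("LLM_PROFILE", "vllm") == "ollama":
        # Local fallback: 7B model served by Ollama on the developer machine
        return ChatOllama(model=os.environ.get("OLLAMA_MODEL", "qwen2.5:7b-instruct"))
    # Default: remote vLLM on the MI300X
    return ChatOpenAI(
        base_url=os.environ["VLLM_BASE_URL"],
        api_key=os.environ["VLLM_API_KEY"],
        model=os.environ["VLLM_MODEL"],
    )

llm = make_llm()
print(llm.invoke("ping").content)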
10. Production hardening (post-hackathon)
For an actual production deployment, beyond the hackathon scope:
- Use a real reverse proxy (Caddy / Nginx) with TLS instead of the raw vLLM port
- Rotate VLLM_API_KEY regularly
- Set up Prometheus + Grafana for vLLM /metrics (see the sketch after this list)
- Use the --quantization flag for fp8/int8 to fit a larger model on smaller hardware
- Configure --enable-prefix-caching for repeated long system prompts
- Use vllm-deploy (SkyPilot) for multi-GPU and multi-region scaling
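As a first step toward the Prometheus item above, vLLM already exposes Prometheus-format metrics at /metrics; a quick peek looks like this (a sketch; whether the endpoint requires the API key can depend on the vLLM version and config, so the header is included just in case):
# metrics_peek.py -- print a few vLLM Prometheus metrics (e.g. to wire up a Grafana dashboard).
# The Authorization header may or may not be required depending on vLLM version/config.
import requests

resp = requests.get(
    "http://<mi300x-public-ip>:8000/metrics",
    headers={"Authorization": "Bearer <your-api-key>"},
    timeout=10,
)
resp.raise_for_status()

for line in resp.text.splitlines():
    # Show metric lines with the vllm prefix (request counts, KV-cache usage, ...)
    if line.startswith("vllm"):
        print(line)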