# vLLM serving on AMD MI300X
This directory contains the infrastructure to serve **Qwen 2.5 Instruct** via
[vLLM](https://github.com/vllm-project/vllm) on an **AMD Instinct MI300X**
GPU through the AMD Developer Cloud.
The Streamlit app (`app/main.py`) and the LangGraph pipeline call this
endpoint via the OpenAI-compatible REST API (`/v1/chat/completions`),
using `langchain-openai`'s `ChatOpenAI` adapter with a custom `base_url`.
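As a point of reference, this is roughly what that client-side wiring looks like; the environment variable names are the ones introduced in section 6, and the exact code in `app/main.py` may differ:
```python
# Minimal sketch: point langchain-openai's ChatOpenAI at the vLLM endpoint.
# Uses the env vars from section 6; the real app wiring may differ.
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model=os.environ["VLLM_MODEL"],        # e.g. Qwen/Qwen2.5-14B-Instruct
    base_url=os.environ["VLLM_BASE_URL"],  # e.g. http://<mi300x-public-ip>:8000/v1
    api_key=os.environ["VLLM_API_KEY"],
    temperature=0.0,
)

print(llm.invoke("What is 2+2?").content)
```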
---
## 1. Prerequisites
- **AMD AI Developer Program** approval ($100 cloud credit per team member)
  - Sign up: https://www.amd.com/en/developer/ai-dev-program.html
  - Approval typically takes 2 business days, occasionally up to 1 week
- **AMD Developer Cloud** account with an MI300X instance available
- **SSH access** to the MI300X instance
- (Optional) **Hugging Face token** if the model is gated (Qwen 2.5 is open,
  so this is **not required** for the default model)
---
## 2. Provision the MI300X instance
Follow the AMD Developer Cloud Getting Started guide:
https://www.amd.com/en/developer/resources/technical-articles/2025/how-to-get-started-on-the-amd-developer-cloud-.html
The default ROCm-enabled image already includes Docker and the AMD GPU
driver. Verify GPU access:
```bash
rocm-smi
# Expected: 1 × AMD Instinct MI300X listed
```
---
## 3. Pull the vLLM ROCm image
```bash
docker pull rocm/vllm:latest
```
Image size: ~30 GB (ROCm runtime + PyTorch + vLLM + dependencies).
---
## 4. Start the vLLM server
### Option A — Docker (recommended)
```bash
docker run --rm \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --ipc=host \
  --shm-size 16g \
  -p 8000:8000 \
  -e VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \
  -e VLLM_API_KEY=$(openssl rand -hex 32) \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  rocm/vllm:latest \
  sh -c 'vllm serve $VLLM_MODEL \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 1 \
    --dtype auto \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --api-key $VLLM_API_KEY'
```
The HF cache mount avoids re-downloading the ~28 GB Qwen 2.5 weights on
container restart.
**Print the API key** that was generated (`echo $VLLM_API_KEY` from inside
the container, or use a fixed string instead of `openssl rand`). You will
paste this into the Streamlit app's `.env` as `VLLM_API_KEY`.
### Option B — `serve.sh` directly
If vLLM is pip-installed in a ROCm-enabled environment on the host:
```bash
chmod +x infra/vllm/serve.sh
VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \
VLLM_API_KEY=<your-key> \
./infra/vllm/serve.sh
```
---
## 5. Verify the endpoint
From any machine with network access to the MI300X:
```bash
curl http://<mi300x-public-ip>:8000/v1/models \
  -H "Authorization: Bearer <your-api-key>"
```
Expected response (truncated):
```json
{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen2.5-14B-Instruct",
      "object": "model",
      "owned_by": "vllm",
      ...
    }
  ]
}
```
A simple chat-completion smoke test:
```bash
curl http://<mi300x-public-ip>:8000/v1/chat/completions \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-14B-Instruct",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "temperature": 0.0
  }'
```
---
## 6. Connect the Streamlit app
In the project root `.env`:
```dotenv
LLM_PROFILE=vllm
VLLM_BASE_URL=http://<mi300x-public-ip>:8000/v1
VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct
VLLM_API_KEY=<your-key>
```
Then start the Streamlit app:
```bash
docker compose up langgraph-app
```
Or directly:
```bash
streamlit run app/main.py
```
---
## 7. Performance benchmark (expected)
On a single AMD MI300X (192 GB HBM3, ROCm 6.2+, vLLM 0.6+):
| Metric | Qwen 2.5 14B | Qwen 2.5 32B |
|--------|--------------|--------------|
| Time-to-first-token | ~0.5 s | ~1.0 s |
| Throughput (single user) | 50-80 tok/s | 25-40 tok/s |
| Concurrent capacity (KV-cache bound) | ~50 sessions | ~20 sessions |
| Max context length | 32K (configured) | 32K (configured) |
These numbers depend on prompt length, batch size, and the exact ROCm/vLLM
version. Run a benchmark with `vllm bench` after startup for the actual
numbers on your instance.
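If you want a quick sanity check before setting up the full benchmark harness, the rough sketch below uses the `openai` Python client against the endpoint from section 5; it counts streamed chunks as an approximation of tokens, so treat the output as a ballpark figure only:
```python
# Rough single-request latency check against the vLLM endpoint (ballpark only;
# use `vllm bench` for real numbers). Env vars are the ones from section 6.
import os
import time

from openai import OpenAI

client = OpenAI(base_url=os.environ["VLLM_BASE_URL"],
                api_key=os.environ["VLLM_API_KEY"])

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model=os.environ["VLLM_MODEL"],
    messages=[{"role": "user", "content": "Explain KV-cache paging in five sentences."}],
    max_tokens=256,
    temperature=0.0,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # roughly one token per streamed chunk

elapsed = time.perf_counter() - start
ttft = (first_token_at or start) - start
print(f"time-to-first-token: {ttft:.2f} s")
print(f"~{chunks / max(elapsed - ttft, 1e-6):.0f} chunks/s after the first token")
```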
---
## 8. Cost monitoring
AMD Developer Cloud MI300X pricing (as of May 2026):
- ~$4-8/hour pay-as-you-go
`$100 / team-member × 3 team-members = $300 total credit`. At $5/h, that's
**60 hours of MI300X uptime**. Plan accordingly:
- **Only run during demo/test/build sessions** — stop the instance when idle
- Keep one teammate's credit as **failover/buffer** for the final 24 hours
- Run end-to-end smoke tests early so a hot fix doesn't burn deadline-day credits
---
## 9. Plan B — local fallback if MI300X is unavailable
If the AMD credit doesn't arrive in time, or the MI300X instance has issues:
```bash
# Switch the Streamlit app to Ollama profile
LLM_PROFILE=ollama OLLAMA_MODEL=qwen2.5:7b-instruct streamlit run app/main.py
```
Pull the model first:
```bash
ollama pull qwen2.5:7b-instruct
```
This runs on a laptop GPU (or CPU) and lets development continue. Quality
will be lower (7B vs 14B/32B), but the demo flow stays alive.
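One way the `LLM_PROFILE` switch can be expressed in code is sketched below; the helper name and structure are illustrative rather than a description of how `app/main.py` actually does it, and `langchain-ollama` is assumed to be installed for the fallback path:
```python
# Illustrative sketch of an LLM_PROFILE switch between vLLM and Ollama.
# The helper name and structure are hypothetical; only the env var names
# match the rest of this README.
import os

from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI


def make_llm():
    if os.environ.get("LLM_PROFILE", "vllm") == "ollama":
        # Plan B: local Ollama model (e.g. qwen2.5:7b-instruct)
        return ChatOllama(model=os.environ.get("OLLAMA_MODEL", "qwen2.5:7b-instruct"))
    # Default: remote vLLM endpoint on the MI300X
    return ChatOpenAI(
        model=os.environ["VLLM_MODEL"],
        base_url=os.environ["VLLM_BASE_URL"],
        api_key=os.environ["VLLM_API_KEY"],
        temperature=0.0,
    )
```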
---
## 10. Production hardening (post-hackathon)
For an actual production deployment, beyond the hackathon scope:
- Put a real reverse proxy (Caddy / Nginx) with TLS in front instead of exposing the raw vLLM port
- Rotate `VLLM_API_KEY` regularly
- Set up Prometheus + Grafana on top of vLLM's `/metrics` endpoint (quick scrape sketch below)
- Use the `--quantization` flag (FP8/INT8) to fit a larger model on smaller hardware
- Configure `--enable-prefix-caching` for repeated long system prompts
- Use `vllm-deploy` (SkyPilot) for multi-GPU and multi-region scaling
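As a stopgap before a full Prometheus setup, the sketch below (referenced from the `/metrics` bullet above) just reads the endpoint directly. It assumes vLLM exposes `/metrics` on the serving port; whether the API key is enforced on non-`/v1` paths can vary by version, so adjust the auth header to your deployment:
```python
# Quick peek at vLLM's Prometheus metrics without a Prometheus server.
# Assumes the default /metrics route on the serving port; adjust URL/auth
# to your deployment.
import os

import requests

base = os.environ["VLLM_BASE_URL"].removesuffix("/v1")  # http://<mi300x-public-ip>:8000
resp = requests.get(
    f"{base}/metrics",
    headers={"Authorization": f"Bearer {os.environ['VLLM_API_KEY']}"},
    timeout=10,
)
resp.raise_for_status()

# Keep only the vLLM-specific counters/gauges (requests, KV-cache usage, ...).
for line in resp.text.splitlines():
    if line.startswith("vllm:"):
        print(line)
```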