| # vLLM serving on AMD MI300X |
|
|
| This directory contains the infrastructure to serve **Qwen 2.5 Instruct** via |
| [vLLM](https://github.com/vllm-project/vllm) on an **AMD Instinct MI300X** |
| GPU through the AMD Developer Cloud. |
|
|
| The Streamlit app (`app/main.py`) and the LangGraph pipeline call this |
| endpoint via the OpenAI-compatible REST API (`/v1/chat/completions`), |
| using `langchain-openai`'s `ChatOpenAI` adapter with a custom `base_url`. |
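
For reference, a minimal sketch of that wiring, assuming the env-var names from section 6 (the actual app code may differ):

```python
# Minimal sketch of pointing langchain-openai at the vLLM endpoint.
# Env-var names match the .env keys from section 6; the real app code may differ.
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url=os.environ["VLLM_BASE_URL"],  # e.g. http://<mi300x-public-ip>:8000/v1
    api_key=os.environ["VLLM_API_KEY"],
    model=os.environ.get("VLLM_MODEL", "Qwen/Qwen2.5-14B-Instruct"),
    temperature=0.0,
)

print(llm.invoke("What is 2+2?").content)
```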
|
|
| --- |
|
|
| ## 1. Prerequisites |
|
|
| - **AMD AI Developer Program** approval (`$100` cloud credit per team member) |
| - Sign up: https://www.amd.com/en/developer/ai-dev-program.html |
| - Approval typically takes 2 business days, up to 1 week |
- **AMD Developer Cloud** account with an MI300X instance available
| - **SSH access** to the MI300X instance |
- (Optional) **Hugging Face token** if the model is gated (Qwen 2.5 is open,
  so this is **not required** for the default model; see the note below)
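
If you do switch to a gated model, export a token and pass it into the container; `HUGGING_FACE_HUB_TOKEN` is one of the variables the Hugging Face downloader honors:

```bash
# Only needed for gated models; Qwen 2.5 downloads anonymously.
export HUGGING_FACE_HUB_TOKEN=<your-hf-token>
# Then add this flag to the docker run command in section 4:
#   -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN
```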
|
|
| --- |
|
|
| ## 2. Provision the MI300X instance |
|
|
| Follow the AMD Developer Cloud Getting Started guide: |
| https://www.amd.com/en/developer/resources/technical-articles/2025/how-to-get-started-on-the-amd-developer-cloud-.html |
|
|
| The default ROCm-enabled image already includes Docker and the AMD GPU |
| driver. Verify GPU access: |
|
|
| ```bash |
| rocm-smi |
| # Expected: 1 × AMD Instinct MI300X listed |
| ``` |
|
|
| --- |
|
|
| ## 3. Pull the vLLM ROCm image |
|
|
| ```bash |
| docker pull rocm/vllm:latest |
| ``` |
|
|
| Image size: ~30 GB (ROCm runtime + PyTorch + vLLM + dependencies). |
|
|
| --- |
|
|
| ## 4. Start the vLLM server |
|
|
| ### Option A — Docker (recommended) |
|
|
| ```bash |
| docker run --rm \ |
| --device=/dev/kfd \ |
| --device=/dev/dri \ |
| --group-add video \ |
| --ipc=host \ |
| --shm-size 16g \ |
| -p 8000:8000 \ |
| -e VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \ |
| -e VLLM_API_KEY=$(openssl rand -hex 32) \ |
| -v ~/.cache/huggingface:/root/.cache/huggingface \ |
| rocm/vllm:latest \ |
| sh -c 'vllm serve $VLLM_MODEL \ |
| --host 0.0.0.0 --port 8000 \ |
| --tensor-parallel-size 1 \ |
| --dtype auto \ |
| --gpu-memory-utilization 0.9 \ |
| --max-model-len 32768 \ |
| --api-key $VLLM_API_KEY' |
| ``` |
|
|
| The HF cache mount avoids re-downloading the ~28 GB Qwen 2.5 weights on |
| container restart. |
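
Optionally, pre-fetch the weights into that cache before the first start (a sketch using `huggingface-cli`, which ships with the `huggingface_hub` package):

```bash
pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen2.5-14B-Instruct
```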
|
|
**Record the generated API key.** The `$(openssl rand -hex 32)` substitution is
expanded by the host shell at `docker run` time, so either generate the key
beforehand (see the sketch below), run `echo $VLLM_API_KEY` inside the
container, or use a fixed string instead of `openssl rand`. You will paste
this value into the Streamlit app's `.env` as `VLLM_API_KEY`.
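
A minimal sketch of that flow:

```bash
# Generate once on the host, print it, and reuse it in the docker run above
export VLLM_API_KEY=$(openssl rand -hex 32)
echo "$VLLM_API_KEY"   # save this; it goes into the app's .env
# In the docker run command, replace
#   -e VLLM_API_KEY=$(openssl rand -hex 32)
# with
#   -e VLLM_API_KEY="$VLLM_API_KEY"
```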
|
|
| ### Option B — `serve.sh` directly |
|
|
| If vLLM is pip-installed in a ROCm-enabled environment on the host: |
|
|
| ```bash |
| chmod +x infra/vllm/serve.sh |
| VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \ |
| VLLM_API_KEY=<your-key> \ |
| ./infra/vllm/serve.sh |
| ``` |
|
|
| --- |
|
|
| ## 5. Verify the endpoint |
|
|
| From any machine with network access to the MI300X: |
|
|
| ```bash |
| curl http://<mi300x-public-ip>:8000/v1/models \ |
| -H "Authorization: Bearer <your-api-key>" |
| ``` |
|
|
| Expected response (truncated): |
|
|
| ```json |
| { |
| "object": "list", |
| "data": [ |
| { |
| "id": "Qwen/Qwen2.5-14B-Instruct", |
| "object": "model", |
| "owned_by": "vllm", |
| ... |
| } |
| ] |
| } |
| ``` |
|
|
| A simple chat-completion smoke test: |
|
|
| ```bash |
| curl http://<mi300x-public-ip>:8000/v1/chat/completions \ |
| -H "Authorization: Bearer <your-api-key>" \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "model": "Qwen/Qwen2.5-14B-Instruct", |
| "messages": [{"role": "user", "content": "What is 2+2?"}], |
| "temperature": 0.0 |
| }' |
| ``` |
|
|
| --- |
|
|
| ## 6. Connect the Streamlit app |
|
|
| In the project root `.env`: |
|
|
| ```dotenv |
| LLM_PROFILE=vllm |
| VLLM_BASE_URL=http://<mi300x-public-ip>:8000/v1 |
| VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct |
| VLLM_API_KEY=<your-key> |
| ``` |
|
|
| Then start the Streamlit app: |
|
|
| ```bash |
| docker compose up langgraph-app |
| ``` |
|
|
| Or directly: |
|
|
| ```bash |
| streamlit run app/main.py |
| ``` |
|
|
| --- |
|
|
| ## 7. Performance benchmark (expected) |
|
|
| On a single AMD MI300X (192 GB HBM3, ROCm 6.2+, vLLM 0.6+): |
|
|
| | Metric | Qwen 2.5 14B | Qwen 2.5 32B | |
| |--------|--------------|--------------| |
| | Time-to-first-token | ~0.5 s | ~1.0 s | |
| | Throughput (single user) | 50-80 tok/s | 25-40 tok/s | |
| | Concurrent capacity (KV-cache) | ~50 sessions | ~20 sessions | |
| | Max context length | 32K (configured) | 32K (configured) | |
|
|
| These numbers depend on prompt length, batch size, and the exact ROCm/vLLM |
| version. Run a benchmark with `vllm bench` after startup for the actual |
| numbers on your instance. |
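
For example (exact flags vary across vLLM versions; `vllm bench serve --help` is authoritative, and you may need to export the server's key, e.g. as `OPENAI_API_KEY`, if the endpoint enforces one):

```bash
# Random-prompt load test against the running OpenAI-compatible endpoint.
# Flag names are from recent vLLM versions and may differ on yours.
vllm bench serve \
  --backend openai-chat \
  --base-url http://<mi300x-public-ip>:8000 \
  --endpoint /v1/chat/completions \
  --model Qwen/Qwen2.5-14B-Instruct \
  --dataset-name random \
  --num-prompts 100
```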
|
|
| --- |
|
|
| ## 8. Cost monitoring |
|
|
| AMD Developer Cloud MI300X pricing (as of May 2026): |
| - ~$4-8/hour pay-as-you-go |
|
|
| `$100 / team-member × 3 team-members = $300 total credit`. At $5/h, that's |
| **60 hours of MI300X uptime**. Plan accordingly: |
|
|
| - **Only run during demo/test/build sessions** — stop the instance when idle |
| - Keep one teammate's credit as **failover/buffer** for the final 24 hours |
| - Run end-to-end smoke tests early so a hot fix doesn't burn deadline-day credits |
|
|
| --- |
|
|
| ## 9. Plan B — local fallback if MI300X is unavailable |
|
|
| If the AMD credit doesn't arrive in time, or the MI300X instance has issues: |
|
|
| ```bash |
| # Switch the Streamlit app to Ollama profile |
| LLM_PROFILE=ollama OLLAMA_MODEL=qwen2.5:7b-instruct streamlit run app/main.py |
| ``` |
|
|
| Pull the model first: |
|
|
| ```bash |
| ollama pull qwen2.5:7b-instruct |
| ``` |
|
|
This runs on a laptop GPU (or CPU) and lets development continue. Quality
will be lower (7B vs. 14B/32B), but the demo flow stays alive.
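
Equivalently, set the profile in `.env` (variable names are assumed to mirror the vLLM profile in section 6; `OLLAMA_BASE_URL` is shown with Ollama's default port and is an assumption about the app's config):

```dotenv
LLM_PROFILE=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=qwen2.5:7b-instruct
```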
|
|
| --- |
|
|
| ## 10. Production hardening (post-hackathon) |
|
|
| For an actual production deployment, beyond the hackathon scope: |
|
|
- Put a real reverse proxy (Caddy / Nginx) with TLS in front of the raw vLLM port (see the sketch below)
- Rotate `VLLM_API_KEY` regularly
- Set up Prometheus + Grafana on vLLM's `/metrics` endpoint
- Use the `--quantization` flag (fp8/int8) to fit a larger model on smaller hardware
- Configure `--enable-prefix-caching` for repeated long system prompts
- Use SkyPilot for multi-GPU and multi-region scaling
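
For the reverse-proxy item, a minimal Caddyfile sketch (the hostname is a placeholder; Caddy provisions TLS certificates automatically for public DNS names):

```caddyfile
llm.example.com {
    # Terminate TLS here; forward plain HTTP to vLLM on localhost
    reverse_proxy localhost:8000
}
```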
|
|