# vLLM serving on AMD MI300X

This directory contains the infrastructure to serve **Qwen 2.5 Instruct** via [vLLM](https://github.com/vllm-project/vllm) on an **AMD Instinct MI300X** GPU through the AMD Developer Cloud.

The Streamlit app (`app/main.py`) and the LangGraph pipeline call this endpoint via the OpenAI-compatible REST API (`/v1/chat/completions`), using `langchain-openai`'s `ChatOpenAI` adapter with a custom `base_url`.

---

## 1. Prerequisites

- **AMD AI Developer Program** approval (`$100` cloud credit per team member)
  - Sign up: https://www.amd.com/en/developer/ai-dev-program.html
  - Approval typically takes 2 business days, up to 1 week
- **AMD Developer Cloud** account with an MI300X instance available
- **SSH access** to the MI300X instance
- (Optional) **Hugging Face token** if the model is gated (Qwen 2.5 is open, so this is **not required** for the default model)

---

## 2. Provision the MI300X instance

Follow the AMD Developer Cloud Getting Started guide:
https://www.amd.com/en/developer/resources/technical-articles/2025/how-to-get-started-on-the-amd-developer-cloud-.html

The default ROCm-enabled image already includes Docker and the AMD GPU driver. Verify GPU access:

```bash
rocm-smi
# Expected: 1 × AMD Instinct MI300X listed
```

---

## 3. Pull the vLLM ROCm image

```bash
docker pull rocm/vllm:latest
```

Image size: ~30 GB (ROCm runtime + PyTorch + vLLM + dependencies).

---

## 4. Start the vLLM server

### Option A — Docker (recommended)

```bash
docker run --rm \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --ipc=host \
  --shm-size 16g \
  -p 8000:8000 \
  -e VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \
  -e VLLM_API_KEY=$(openssl rand -hex 32) \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  rocm/vllm:latest \
  sh -c 'vllm serve $VLLM_MODEL \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 1 \
    --dtype auto \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --api-key $VLLM_API_KEY'
```

The HF cache mount avoids re-downloading the ~28 GB of Qwen 2.5 14B weights on container restart.

**Print the API key** that was generated (`echo $VLLM_API_KEY` from inside the container, or use a fixed string instead of `openssl rand`). You will paste this into the Streamlit app's `.env` as `VLLM_API_KEY`.

### Option B — `serve.sh` directly

If vLLM is pip-installed in a ROCm-enabled environment on the host:

```bash
chmod +x infra/vllm/serve.sh
VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \
VLLM_API_KEY=<VLLM_API_KEY> \
./infra/vllm/serve.sh
```

---

## 5. Verify the endpoint

From any machine with network access to the MI300X:

```bash
curl http://<MI300X_IP>:8000/v1/models \
  -H "Authorization: Bearer <VLLM_API_KEY>"
```

Expected response (truncated):

```json
{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen2.5-14B-Instruct",
      "object": "model",
      "owned_by": "vllm",
      ...
    }
  ]
}
```

A simple chat-completion smoke test:

```bash
curl http://<MI300X_IP>:8000/v1/chat/completions \
  -H "Authorization: Bearer <VLLM_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-14B-Instruct",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "temperature": 0.0
  }'
```

---

## 6. Connect the Streamlit app

In the project root `.env`:

```dotenv
LLM_PROFILE=vllm
VLLM_BASE_URL=http://<MI300X_IP>:8000/v1
VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct
VLLM_API_KEY=<VLLM_API_KEY>
```

Then start the Streamlit app:

```bash
docker compose up langgraph-app
```

Or directly:

```bash
streamlit run app/main.py
```
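For reference, a minimal sketch of how the app side is expected to reach this endpoint via `langchain-openai`; the actual wiring lives in `app/main.py`, and the environment variable names match the `.env` above:

```python
# Minimal app-side sketch (assumes the .env values from section 6 are set in the
# environment). The real wiring lives in app/main.py; this only shows the
# ChatOpenAI adapter pointed at the vLLM base_url instead of api.openai.com.
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url=os.environ["VLLM_BASE_URL"],   # e.g. http://<MI300X_IP>:8000/v1
    api_key=os.environ["VLLM_API_KEY"],      # same key passed to vllm serve --api-key
    model=os.environ.get("VLLM_MODEL", "Qwen/Qwen2.5-14B-Instruct"),
    temperature=0.0,
)

print(llm.invoke("What is 2+2?").content)    # should answer via the MI300X endpoint
```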
---

## 7. Performance benchmark (expected)

On a single AMD MI300X (192 GB HBM3, ROCm 6.2+, vLLM 0.6+):

| Metric | Qwen 2.5 14B | Qwen 2.5 32B |
|--------|--------------|--------------|
| Time-to-first-token | ~0.5 s | ~1.0 s |
| Throughput (single user) | 50-80 tok/s | 25-40 tok/s |
| Concurrent capacity (KV-cache) | ~50 sessions | ~20 sessions |
| Max context length | 32K (configured) | 32K (configured) |

These numbers depend on prompt length, batch size, and the exact ROCm/vLLM version. Run a benchmark with `vllm bench` after startup to get the actual numbers for your instance.

---

## 8. Cost monitoring

AMD Developer Cloud MI300X pricing (as of May 2026):

- ~$4-8/hour pay-as-you-go

`$100 / team-member × 3 team-members = $300 total credit`. At $5/h, that's **60 hours of MI300X uptime**. Plan accordingly:

- **Only run during demo/test/build sessions** — stop the instance when idle
- Keep one teammate's credit as **failover/buffer** for the final 24 hours
- Run end-to-end smoke tests early so a hot fix doesn't burn deadline-day credits

---

## 9. Plan B — local fallback if MI300X is unavailable

If the AMD credit doesn't arrive in time, or the MI300X instance has issues:

```bash
# Switch the Streamlit app to the Ollama profile
LLM_PROFILE=ollama OLLAMA_MODEL=qwen2.5:7b-instruct streamlit run app/main.py
```

Pull the model first:

```bash
ollama pull qwen2.5:7b-instruct
```

This runs on a laptop GPU (or CPU) and lets development continue. Quality will be lower (7B vs 14B/32B), but the demo flow stays alive.

---

## 10. Production hardening (post-hackathon)

For an actual production deployment, beyond the hackathon scope:

- Use a real reverse proxy (Caddy / Nginx) with TLS instead of the raw vLLM port
- Rotate `VLLM_API_KEY` regularly
- Set up Prometheus + Grafana scraping vLLM's `/metrics` endpoint (see the sketch below)
- Use the `--quantization` flag (fp8/int8) to fit a larger model on smaller hardware
- Configure `--enable-prefix-caching` for repeated long system prompts
- Use `vllm-deploy` (SkyPilot) for multi-GPU and multi-region scaling
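The `/metrics` bullet above can be smoke-tested before any Prometheus setup. A minimal sketch, assuming the standard vLLM metrics path on the same port as the API server; whether `/metrics` sits behind `--api-key` depends on the vLLM version, so the auth header may be unnecessary:

```python
# Hedged sketch: fetch vLLM's Prometheus metrics and print the vllm:* series.
# Assumes VLLM_BASE_URL / VLLM_API_KEY from the .env above; /metrics is served on
# the same port as the OpenAI-compatible API. Auth requirements vary by version.
import os
import urllib.request

base = os.environ["VLLM_BASE_URL"].removesuffix("/v1")  # e.g. http://<MI300X_IP>:8000
req = urllib.request.Request(
    f"{base}/metrics",
    headers={"Authorization": f"Bearer {os.environ['VLLM_API_KEY']}"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    body = resp.read().decode()

# Metric sample lines are prefixed with "vllm:", e.g. vllm:num_requests_running.
print("\n".join(line for line in body.splitlines() if line.startswith("vllm:")))
```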