# AMD MI300X Deployment

How we deployed Qwen 2.5 14B Instruct via vLLM on AMD Instinct MI300X using the AMD Developer Cloud (DigitalOcean-powered). End-to-end, with copy-paste commands and the costs we actually paid.

---

## What you get

- **AMD Instinct MI300X** – 192 GB HBM3 GPU, 20 vCPU, 240 GB RAM, 720 GB NVMe boot disk
- **vLLM 0.17.1 + ROCm 7.0** – pre-installed via the Quick Start image
- **OpenAI-compatible REST endpoint** at `http://<droplet-ip>:8000/v1`
- **Cost**: $1.99 / GPU / hour. The free $100 credit covers ~50 hours.

---

## Prerequisites

1. **AMD AI Developer Program signup** – <https://www.amd.com/en/developer/ai-dev-program.html>
   - Approval takes 1–2 business days; you receive a $100 cloud credit by email automatically
2. **lablab.ai event enrollment** (for hackathon participants) – <https://lablab.ai/event/amd-developer-hackathon>
3. **SSH key on your local machine** (we recommend a dedicated key, not your default GitHub key – see step 1 below)

---

## Step 1 – Generate a dedicated SSH key

The default `~/.ssh/id_ed25519` is often passphrase-protected and routed through a GNOME-keyring agent that interferes with non-interactive `ssh-add`. Sidestep it with a passphrase-less, dedicated key:

```bash
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_amd_paperhawk -N "" -C "you@paperhawk-amd"
cat ~/.ssh/id_ed25519_amd_paperhawk.pub
```

Copy the public key to clipboard for the next step.

---

## Step 2 – Create a GPU Droplet

Go to <https://cloud.amd.com/> (or <https://amd.digitalocean.com/>) and click **Create a GPU Droplet** on the homepage card.

**Caution**: the left-sidebar `GPU Droplets` link routes to the CPU Droplet flow as of May 2026 (a UI bug). Use the homepage card or the top-right `Create ▼` dropdown.

### Configuration

- **GPU Plan**: AMD MI300X (single-GPU, $1.99/hr) – **not** the 8-GPU variant
- **Region**: ATL1 (Atlanta) – NYC1 is often "out of capacity" for MI300X. If the Plan card is greyed out, the URL parameter `?region=atl1` switches you over.
- **Image**: Quick Start → vLLM (0.17.1, ROCm 7.0) – comes with Docker, JupyterLab, and a pre-built `rocm` container
- **SSH Key**: Add a new key, paste the public key from step 1, name it `paperhawk-amd-deploy`
- **Visibility**: doesn't matter; the droplet is private to your account

Click **Create GPU Droplet**. It takes 5–10 minutes to come up. Once `Active`, note the public IPv4 address.
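
If your account exposes a DigitalOcean API token, the same droplet can in principle be created from the `doctl` CLI. This is untested by us against the AMD portal, and the size/image slugs below are placeholders you would need to look up yourself (`doctl compute size list`, `doctl compute image list`):

```bash
# Untested sketch: the flags are standard doctl, the slugs are placeholders
doctl compute droplet create paperhawk-mi300x \
  --region atl1 \
  --size <mi300x-1gpu-size-slug> \
  --image <vllm-quickstart-image-slug> \
  --ssh-keys <your-ssh-key-fingerprint>
```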

---

## Step 3 – SSH in

```bash
ssh -i ~/.ssh/id_ed25519_amd_paperhawk -o IdentityAgent=none root@<DROPLET_IP>
```

The `-o IdentityAgent=none` flag bypasses the GNOME-keyring SSH agent if it's misbehaving on your local machine.

You'll see a welcome banner with two key facts:

```
Access the Jupyter Server: http://<IP>:80 (we don't use this)
docker exec -it rocm /bin/bash (we DO use this)
```
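
If you'll be reconnecting often, an `~/.ssh/config` entry saves retyping the flags. A minimal sketch; the `amd-mi300x` alias is our own naming, and `<DROPLET_IP>` is the public IPv4 from step 2:

```bash
cat >> ~/.ssh/config <<'EOF'
Host amd-mi300x
    HostName <DROPLET_IP>
    User root
    IdentityFile ~/.ssh/id_ed25519_amd_paperhawk
    IdentityAgent none
EOF
# From now on, `ssh amd-mi300x` works with no extra flags
```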

---

## Step 4 – Open port 8000 in the firewall

The Quick Start image ships with UFW enabled, allowing only SSH (22), HTTP (80), and HTTPS (443). vLLM runs on 8000, so we need to open it:

```bash
ufw allow 8000
ufw status | grep 8000
```

You should see `8000 ALLOW Anywhere` and the IPv6 equivalent.

The `--api-key` flag we pass to vLLM in step 6 prevents anyone scanning the public internet from using your endpoint, so opening port 8000 is safe with API-key auth.
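
If you'd rather not rely on the API key alone, you can optionally restrict the port to a single source address instead. A sketch, assuming `<YOUR_IP>` stands in for your local machine's public IP:

```bash
# Replace the blanket rule with a source-restricted one
ufw delete allow 8000
ufw allow from <YOUR_IP> to any port 8000 proto tcp
```

Skip this if clients you don't control (e.g., the Hugging Face Space from the cross-references below) need to reach the endpoint.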

---

## Step 5 – (Optional) System upgrade and reboot

The Quick Start image ships with ~120 outdated packages, including security updates. Upgrading is recommended before snapshotting:

```bash
apt-get update && DEBIAN_FRONTEND=noninteractive apt-get upgrade -y
reboot
```

Wait ~1.5–2 minutes, then SSH in again. **The `rocm` Docker container does not auto-restart after the reboot**, so:

```bash
docker start rocm
docker ps   # confirm `rocm` is Up
```
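
If you'd rather not remember that step after every future reboot, you can give the container a restart policy once (a standard Docker feature; the image ships the container without one):

```bash
# One-time: have Docker bring `rocm` back up automatically after reboots
docker update --restart unless-stopped rocm
```

Note this only revives the container itself; you still have to relaunch `vllm serve` inside it (step 6).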

---

## Step 6 – Start vLLM serving Qwen 2.5 14B

Enter the Docker container:

```bash
docker exec -it rocm /bin/bash
```

Run vLLM in one long line (line continuations with `\` sometimes break under paste – a single line is most reliable):

```bash
vllm serve Qwen/Qwen2.5-14B-Instruct --api-key sk-paperhawk-2026 --port 8000 --host 0.0.0.0 --enable-auto-tool-choice --tool-call-parser hermes --trust-remote-code
```

What this does:

| Flag | Why |
|---|---|
| `Qwen/Qwen2.5-14B-Instruct` | Model ID on Hugging Face Hub. vLLM auto-downloads on first run (~28 GB, ~6 sec from the ATL datacenter) |
| `--api-key sk-paperhawk-2026` | Bearer token required on every request. Anti-misuse for the public-internet endpoint. |
| `--port 8000` | OpenAI-compatible REST at `:8000/v1` |
| `--host 0.0.0.0` | Bind on all interfaces so the public IP is reachable |
| `--enable-auto-tool-choice` + `--tool-call-parser hermes` | Required for our 5-tool agentic chat. Qwen 2.5 uses Hermes-style tool calls. |
| `--trust-remote-code` | A no-op for Qwen 2.5 (its tokenizer ships no custom code) but kept for compatibility |

**What you'll see on first run** (~70 seconds total):

```
INFO 05-04 20:56:36 [utils.py:302] vLLM version 0.17.1
INFO 05-04 20:56:36 [utils.py:302] model Qwen/Qwen2.5-14B-Instruct
config.json: 100%|████████████████████| 663/663 [00:00<00:00, 8.25MB/s]
model-00001-of-00008.safetensors: 100%|██████| 3.89G/3.89G [00:05<00:00, 745MB/s]
... (8 shards, ~28 GB total in 5.9 sec)
INFO 05-04 20:57:08 [gpu_model_runner.py:4364] Model loading took 27.63 GiB memory and 17.358448 seconds
INFO 05-04 20:57:32 [gpu_worker.py:424] Available KV cache memory: 141.96 GiB
INFO 05-04 20:57:32 [kv_cache_utils.py:1314] GPU KV cache size: 775,280 tokens
INFO 05-04 20:57:32 [kv_cache_utils.py:1319] Maximum concurrency for 32,768 tokens per request: 23.66x
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

The vLLM server now serves OpenAI-compatible requests. **Don't close this SSH session** – closing it kills the server. Open a second SSH window for the smoke test.
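
Alternatively, if you want the server to survive a dropped connection, you can background it inside the container. A minimal `nohup` sketch (the log path is our own choice; `tmux` or `screen` also work if installed):

```bash
# Inside the rocm container: run vLLM detached, logging to a file
nohup vllm serve Qwen/Qwen2.5-14B-Instruct --api-key sk-paperhawk-2026 --port 8000 --host 0.0.0.0 --enable-auto-tool-choice --tool-call-parser hermes --trust-remote-code > /root/vllm.log 2>&1 &

# Follow startup progress; Ctrl-C stops the tail, not the server
tail -f /root/vllm.log
```

The process keeps running as long as the `rocm` container does, even after you exit the `docker exec` shell.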

---

## Step 7 – Smoke-test the endpoint

From your local machine:

```bash
# List models
curl http://<DROPLET_IP>:8000/v1/models -H "Authorization: Bearer sk-paperhawk-2026"

# Chat completion
curl http://<DROPLET_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-paperhawk-2026" \
  -d '{"model":"Qwen/Qwen2.5-14B-Instruct","messages":[{"role":"user","content":"Hello, who are you? Answer in one sentence."}],"max_tokens":50,"temperature":0}'
```

Expected response: `"I am Qwen, a large language model created by Alibaba Cloud."`

If you get `401 Unauthorized`, the Bearer token is wrong (it must match the `--api-key` value exactly). If you get `Connection refused`, port 8000 isn't open or the vLLM server didn't start – check the SSH window from step 6.
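
You can also smoke-test the tool-calling path that `--enable-auto-tool-choice` and `--tool-call-parser hermes` enable. The `get_weather` tool below is a throwaway example (not one of our app's five tools); a successful response comes back with `finish_reason` set to `tool_calls` and a populated `tool_calls` array:

```bash
curl http://<DROPLET_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-paperhawk-2026" \
  -d '{
    "model": "Qwen/Qwen2.5-14B-Instruct",
    "messages": [{"role": "user", "content": "What is the weather in Atlanta right now?"}],
    "tools": [{"type": "function", "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {"type": "object",
                     "properties": {"city": {"type": "string"}},
                     "required": ["city"]}}}]
  }'
```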

---

## Step 8 – Snapshot the droplet (cost optimization)

Once everything works, take a live snapshot. It captures the entire boot disk (~96 GB including the Docker container with the cached Qwen model), so a future restart takes **30 seconds** instead of a 70-second cold start.

In the AMD Cloud UI:

1. Droplet → **Backups & Snapshots** tab → **Take a Snapshot**
2. Name: `paperhawk-vllm-tested-YYYY-MM-DD`
3. Click **Take Live Snapshot** (live works fine – vLLM does only read-only inference)

The snapshot takes 10–15 minutes. Storage cost: $0.06 / GB / month × ~96 GB ≈ $5.76 / month, i.e. roughly **$0.19 / day**.

---

## Step 9 – Destroy the droplet (stop the meter)

When you're done with the dev session, **destroy** the droplet (do not just power it off – powered-off droplets still bill at $1.99/hr).

In the UI: Droplet → **Actions** ▼ → **Destroy** → type the droplet name to confirm.

**Important**: when the destroy dialog asks if you also want to destroy the snapshot, **leave it unchecked**. The snapshot survives the destroy and is what you'll use to recreate the droplet.
| ## Step 10 β Recreate from snapshot (Friday morning) | |
| When you need the endpoint live again (e.g., for a demo or judging window): | |
| 1. AMD Cloud β **Backups & Snapshots** β click `β¦` next to your snapshot β **Create GPU Droplet** | |
| 2. Configuration: same MI300X / ATL1 / SSH key | |
| 3. Wait 5β10 minutes for `Active`. Note the new public IP. | |
| Then SSH in (with the new IP) and: | |
| ```bash | |
| docker start rocm | |
| docker exec -it rocm /bin/bash | |
| vllm serve Qwen/Qwen2.5-14B-Instruct --api-key sk-paperhawk-2026 --port 8000 --host 0.0.0.0 --enable-auto-tool-choice --tool-call-parser hermes --trust-remote-code | |
| ``` | |
| Because the snapshot includes the cached model in the Docker container layer, **vLLM startup is ~30 seconds** instead of 70. | |
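
For demo scripts, it can help to poll until the endpoint actually answers rather than eyeballing the logs. A small sketch against the new IP:

```bash
# Block until vLLM responds; -s silences progress, -f treats HTTP errors as failure
until curl -sf -H "Authorization: Bearer sk-paperhawk-2026" \
    "http://<NEW_DROPLET_IP>:8000/v1/models" > /dev/null; do
  sleep 5
done
echo "vLLM endpoint is up"
```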

---

## Live performance numbers (measured)

From our end-to-end test on May 5, 2026:

| Metric | Value |
|---|---|
| HF Hub model download (8 safetensors, 28 GB) | 5.9 sec (700+ MB/s from the ATL datacenter) |
| Model load to MI300X VRAM | 17.4 sec |
| CUDA graph compile (51 size buckets) | 20.5 sec |
| **Total cold start** | **~70 sec** |
| **Warm restart from snapshot** | **~30 sec** |
| Available KV cache (192 GB − 27.6 GB model − 22 GB headroom) | 141.96 GiB |
| KV cache token capacity | 775,280 tokens |
| Max concurrency at 32k context | 23.66× parallel requests |
| Prompt throughput (live audit demo) | 307 tokens/sec |
| Generation throughput (live audit demo) | 252 tokens/sec |
| Prefix cache hit rate (multi-agent prompts) | 30.4% |
| End-to-end audit demo (3 PDFs from the HF Space) | 23.3 sec / 61.7× speedup vs manual |

---

## Cost breakdown (our actual hackathon spend)

| Item | Cost |
|---|---|
| Initial dev session (provisioning, vLLM setup, debugging) | ~$3 |
| Live validation session (30 minutes) | ~$1 |
| Snapshot storage (5 days from Tuesday to Friday) | ~$1 |
| Live judging window (estimated 24 hours) | ~$48 |
| **Total estimated** | **~$53** of the free $100 credit |

Plenty of buffer for a longer judging window or a second iteration.

---

## Common pitfalls

- **"Out of capacity in the selected region"**: Switch to ATL1. NYC1 frequently runs out of MI300X. Pass `?region=atl1` in the Create-Droplet URL.
- **`Permission denied (publickey)` on SSH**: Either `~/.ssh/id_ed25519` is passphrase-protected and the agent isn't unlocked, or you have the wrong key. Use a dedicated passphrase-less key (step 1) and `-o IdentityAgent=none` on the ssh command.
- **vLLM exits with a Triton FlashAttention error on first run**: Older vLLM 0.8.x builds had this issue; the 0.17.1 + ROCm 7.0 build we use has it fixed. If you're stuck on an older image, prefix the serve command with `VLLM_USE_TRITON_FLASH_ATTN=0`.
- **Docker container `rocm` not running after reboot**: Run `docker start rocm` manually; it is not auto-started by default (see the restart-policy tip in step 5).
- **Powered-off droplet still billing**: Power-off does **not** stop billing. Only **Destroy** does. Snapshot first if you want to keep the state.

---

## Cross-references

- [`docs/HUGGINGFACE_DEPLOYMENT.md`](HUGGINGFACE_DEPLOYMENT.md) – how the Streamlit Space talks to this vLLM endpoint
- [`docs/ARCHITECTURE.md`](ARCHITECTURE.md) – how the application uses the vLLM endpoint via the provider abstraction
- [`docs/AMD_DEPLOY_LESSONS_LEARNED.md`](AMD_DEPLOY_LESSONS_LEARNED.md) – extended history of every push iteration, error message, and workaround we hit