# AMD MI300X Deployment
How we deployed Qwen 2.5 14B Instruct via vLLM on AMD Instinct MI300X using the AMD Developer Cloud (DigitalOcean-powered). End-to-end, with copy-paste commands and the costs we actually paid.
---
## What you get
- **AMD Instinct MI300X** – 192 GB HBM3 GPU, 20 vCPU, 240 GB RAM, 720 GB NVMe boot disk
- **vLLM 0.17.1 + ROCm 7.0** – pre-installed via the Quick Start image
- **OpenAI-compatible REST endpoint** at `http://<droplet-ip>:8000/v1`
- **Cost**: $1.99 / GPU / hour. Free $100 credit covers ~50 hours.
---
## Prerequisites
1. **AMD AI Developer Program signup** – <https://www.amd.com/en/developer/ai-dev-program.html>
   - Approval takes 1–2 business days; the $100 cloud credit arrives by email automatically
2. **Enroll in the lablab.ai event** (for hackathon participants) – <https://lablab.ai/event/amd-developer-hackathon>
3. **SSH key on your local machine** (we recommend a dedicated key, not your default GitHub key – see step 1 below)
---
## Step 1 – Generate a dedicated SSH key
The default `~/.ssh/id_ed25519` is often passphrase-protected and routed through a GNOME-keyring agent that interferes with non-interactive `ssh-add`. Sidestep it with a passphrase-less, dedicated key:
```bash
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_amd_paperhawk -N "" -C "you@paperhawk-amd"
cat ~/.ssh/id_ed25519_amd_paperhawk.pub
```
Copy the public key to clipboard for the next step.
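Optionally, add a host alias to `~/.ssh/config` so later steps don't need the key path and agent override on every invocation. A minimal sketch; the `paperhawk-amd` alias is just our convention, and `<DROPLET_IP>` gets filled in after step 2:
```bash
# Optional: host alias so `ssh paperhawk-amd` works without extra flags
cat >> ~/.ssh/config <<'EOF'
# Fill in HostName with the droplet's public IP after step 2
Host paperhawk-amd
    HostName <DROPLET_IP>
    User root
    IdentityFile ~/.ssh/id_ed25519_amd_paperhawk
    IdentityAgent none
EOF
```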
---
## Step 2 – Create a GPU Droplet
Go to <https://cloud.amd.com/> (or <https://amd.digitalocean.com/>) and click **Create a GPU Droplet** on the homepage card.
**Caution**: the left-sidebar `GPU Droplets` link routes to the CPU Droplet flow as of May 2026 (a UI bug). Use the homepage card or the top-right `Create ▼` dropdown.
### Configuration
- **GPU Plan**: AMD MI300X (single-GPU, $1.99/hr) – **not** the 8-GPU variant
- **Region**: ATL1 (Atlanta) – NYC1 is often "out of capacity" for MI300X. If the Plan card is greyed out, the URL parameter `?region=atl1` switches you over.
- **Image**: Quick Start → vLLM (0.17.1, ROCm 7.0) – comes with Docker, JupyterLab, and a pre-built `rocm` container
- **SSH Key**: Add a new key, paste the public key from step 1, name it `paperhawk-amd-deploy`
- **Visibility**: doesn't matter; the droplet is private to your account
Click **Create GPU Droplet**. It takes 5–10 minutes to come up. Once `Active`, note the Public IPv4 address.
---
## Step 3 – SSH in
```bash
ssh -i ~/.ssh/id_ed25519_amd_paperhawk -o IdentityAgent=none root@<DROPLET_IP>
```
The `-o IdentityAgent=none` flag bypasses the GNOME-keyring SSH agent if it's misbehaving on your local machine.
You'll see a welcome banner with two key facts:
```
Access the Jupyter Server: http://<IP>:80 (we don't use this)
docker exec -it rocm /bin/bash (we DO use this)
```
---
## Step 4 – Open port 8000 in the firewall
The Quick Start image ships with UFW enabled, allowing only SSH (22), HTTP (80), and HTTPS (443). vLLM runs on 8000, so we need to open it:
```bash
ufw allow 8000
ufw status | grep 8000
```
You should see `8000 ALLOW Anywhere` and the IPv6 equivalent.
The `--api-key` flag we pass to vLLM in step 6 prevents anyone scanning the public internet from using your endpoint – opening port 8000 is safe with API-key auth.
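If you want belt-and-braces on top of the API key, UFW can also scope port 8000 to a single client address instead of `Anywhere`. A sketch, assuming you know the public IP your client calls from (Hugging Face Space egress IPs can change, which is why we kept the open rule plus API-key auth):
```bash
# Optional: restrict port 8000 to one known client IP instead of Anywhere
ufw delete allow 8000
ufw allow from <CLIENT_PUBLIC_IP> to any port 8000 proto tcp
ufw status numbered
```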
---
## Step 5 – (Optional) System upgrade and reboot
The Quick Start image ships with ~120 outdated packages, including security updates. Upgrading is recommended before snapshotting:
```bash
apt-get update && DEBIAN_FRONTEND=noninteractive apt-get upgrade -y
reboot
```
Wait ~1.5–2 minutes, then SSH in again. **The `rocm` Docker container does not auto-restart after the reboot**, so:
```bash
docker start rocm
docker ps # confirm `rocm` is Up
```
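If you'd rather not remember that manual start, a Docker restart policy makes the `rocm` container come back after every reboot. We didn't test this on the Quick Start image, but it's stock Docker behavior:
```bash
# Optional: auto-start the rocm container on future reboots
docker update --restart unless-stopped rocm
docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' rocm   # should print: unless-stopped
```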
---
## Step 6 – Start vLLM serving Qwen 2.5 14B
Enter the Docker container:
```bash
docker exec -it rocm /bin/bash
```
Run vLLM in one long line (line continuations with `\` sometimes break under paste – single-line is most reliable):
```bash
vllm serve Qwen/Qwen2.5-14B-Instruct --api-key sk-paperhawk-2026 --port 8000 --host 0.0.0.0 --enable-auto-tool-choice --tool-call-parser hermes --trust-remote-code
```
What this does:
| Flag | Why |
|---|---|
| `Qwen/Qwen2.5-14B-Instruct` | Model ID on Hugging Face Hub. vLLM auto-downloads on first run (~28 GB, ~6 sec from ATL DC) |
| `--api-key sk-paperhawk-2026` | Bearer token required by every request. Anti-misuse for the public-internet endpoint. |
| `--port 8000` | OpenAI-compat REST at `:8000/v1` |
| `--host 0.0.0.0` | Bind on all interfaces so the public IP is reachable |
| `--enable-auto-tool-choice` + `--tool-call-parser hermes` | Required for our 5-tool agentic chat. Qwen 2.5 uses Hermes-style tool calls. |
| `--trust-remote-code` | Effectively a no-op for Qwen 2.5, whose tokenizer needs no custom code; kept for compatibility with models that do ship custom code |
**What you'll see on first run** (~70 seconds total):
```
INFO 05-04 20:56:36 [utils.py:302] (vLLM ASCII banner) version 0.17.1
INFO 05-04 20:56:36 [utils.py:302] (vLLM ASCII banner) model Qwen/Qwen2.5-14B-Instruct
config.json: 100%|████████████████████| 663/663 [00:00<00:00, 8.25MB/s]
model-00001-of-00008.safetensors: 100%|██████| 3.89G/3.89G [00:05<00:00, 745MB/s]
... (8 shards, ~28 GB total in 5.9 sec)
INFO 05-04 20:57:08 [gpu_model_runner.py:4364] Model loading took 27.63 GiB memory and 17.358448 seconds
INFO 05-04 20:57:32 [gpu_worker.py:424] Available KV cache memory: 141.96 GiB
INFO 05-04 20:57:32 [kv_cache_utils.py:1314] GPU KV cache size: 775,280 tokens
INFO 05-04 20:57:32 [kv_cache_utils.py:1319] Maximum concurrency for 32,768 tokens per request: 23.66x
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
The vLLM server now serves OpenAI-compatible requests. **Don't close this SSH session** – closing it kills the server. Open a second SSH window for the smoke test.
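If you'd rather not hold an SSH session open, one option we didn't use but which is standard practice is to launch the same command through `tmux` on the droplet host (install it with `apt-get install -y tmux` if the image doesn't have it):
```bash
# Run the same vLLM command in a detached tmux session so it survives SSH disconnects
tmux new-session -d -s vllm \
  "docker exec rocm vllm serve Qwen/Qwen2.5-14B-Instruct --api-key sk-paperhawk-2026 --port 8000 --host 0.0.0.0 --enable-auto-tool-choice --tool-call-parser hermes --trust-remote-code"
tmux attach -t vllm   # watch the logs; detach again with Ctrl-b d
```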
---
## Step 7 – Smoke-test the endpoint
From your local machine:
```bash
# List models
curl http://<DROPLET_IP>:8000/v1/models -H "Authorization: Bearer sk-paperhawk-2026"
# Chat completion
curl http://<DROPLET_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-paperhawk-2026" \
-d '{"model":"Qwen/Qwen2.5-14B-Instruct","messages":[{"role":"user","content":"Hello, who are you? Answer in one sentence."}],"max_tokens":50,"temperature":0}'
```
Expected response: `"I am Qwen, a large language model created by Alibaba Cloud."`
If you get `401 Unauthorized`, the Bearer token is wrong (must match the `--api-key` value exactly). If you get `Connection refused`, port 8000 isn't open or the vLLM server didn't start – check the SSH window from step 6.
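Because the server runs with `--enable-auto-tool-choice` and the Hermes parser, it's worth one more check that tool calling works before wiring up the agent. A sketch with a made-up `get_invoice_total` tool (not one of our real five); a healthy response carries a `tool_calls` array instead of plain text:
```bash
# Tool-calling smoke test against the OpenAI-compatible endpoint
curl http://<DROPLET_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-paperhawk-2026" \
  -d '{
    "model": "Qwen/Qwen2.5-14B-Instruct",
    "messages": [{"role": "user", "content": "What is the total of invoice INV-42?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_invoice_total",
        "description": "Return the total amount of an invoice by ID",
        "parameters": {
          "type": "object",
          "properties": {"invoice_id": {"type": "string"}},
          "required": ["invoice_id"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 100
  }'
```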
---
## Step 8 – Snapshot the droplet (cost optimization)
Once everything works, take a live snapshot. It captures the entire boot disk (~96 GB including the Docker container with the cached Qwen model), so a future restart is **30 seconds** instead of a 70-second cold start.
In the AMD Cloud UI:
1. Droplet → **Backups & Snapshots** tab → **Take a Snapshot**
2. Name: `paperhawk-vllm-tested-YYYY-MM-DD`
3. Click **Take Live Snapshot** (live works fine – the vLLM workload is read-only inference)
The snapshot takes 10–15 minutes. Storage cost: $0.06 / GB / month × ~96 GB = **~$0.32 / day**.
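We used the UI for this; if the standard DigitalOcean CLI (`doctl`) happens to be authenticated against your AMD Cloud account (we have not verified that cloud.amd.com supports it), the rough equivalent would be:
```bash
# Untested on the AMD Developer Cloud: snapshot via the DigitalOcean CLI
doctl compute droplet list --format ID,Name,PublicIPv4
doctl compute droplet-action snapshot <DROPLET_ID> --snapshot-name "paperhawk-vllm-tested-$(date +%F)" --wait
doctl compute snapshot list
```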
---
## Step 9 – Destroy the droplet (stop the meter)
When you're done with the dev session, **destroy** the droplet (do not just power off – powered-off droplets still bill at $1.99/hr).
In the UI: Droplet → **Actions** ▼ → **Destroy** → type the droplet name to confirm.
**Important**: when the destroy dialog asks if you also want to destroy the snapshot, **leave it unchecked**. The snapshot survives the destroy and is what you'll use to recreate the droplet.
---
## Step 10 – Recreate from snapshot (Friday morning)
When you need the endpoint live again (e.g., for a demo or judging window):
1. AMD Cloud → **Backups & Snapshots** → click `⋯` next to your snapshot → **Create GPU Droplet**
2. Configuration: same MI300X / ATL1 / SSH key
3. Wait 5–10 minutes for `Active`. Note the new public IP.
Then SSH in (with the new IP) and:
```bash
docker start rocm
docker exec -it rocm /bin/bash
vllm serve Qwen/Qwen2.5-14B-Instruct --api-key sk-paperhawk-2026 --port 8000 --host 0.0.0.0 --enable-auto-tool-choice --tool-call-parser hermes --trust-remote-code
```
Because the snapshot includes the cached model in the Docker container layer, **vLLM startup is ~30 seconds** instead of 70.
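The only thing that changes between recreate cycles is the public IP, so we keep the endpoint details in environment variables rather than hard-coding them in the client. A sketch; the variable names below are the ones the official OpenAI SDKs read, but check your own client's configuration:
```bash
# Point an OpenAI-compatible client at the recreated droplet
export OPENAI_BASE_URL="http://<NEW_DROPLET_IP>:8000/v1"
export OPENAI_API_KEY="sk-paperhawk-2026"
curl "$OPENAI_BASE_URL/models" -H "Authorization: Bearer $OPENAI_API_KEY"
```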
---
## Live performance numbers (measured)
From our end-to-end test on May 5, 2026:
| Metric | Value |
|---|---|
| HF Hub model download (8 safetensors, 28 GB) | 5.9 sec (700+ MB/s from ATL DC) |
| Model load to MI300X VRAM | 17.4 sec |
| CUDA graph compile (51 size-buckets) | 20.5 sec |
| **Total cold-start** | **~70 sec** |
| **Warm restart from snapshot** | **~30 sec** |
| Available KV cache (192 GB − 27.6 GB model − 22 GB headroom) | 141.96 GiB |
| KV cache token capacity | 775,280 tokens |
| Max concurrency at 32k context | 23.66× parallel requests |
| Prompt throughput (live audit demo) | 307 tokens/sec |
| Generation throughput (live audit demo) | 252 tokens/sec |
| Prefix cache hit rate (multi-agent prompts) | 30.4% |
| End-to-end audit demo (3 PDFs from HF Space) | 23.3 sec / 61.7× speedup vs manual |
---
## Cost breakdown (our actual hackathon spend)
| Item | Cost |
|---|---|
| Initial dev session (provisioning, vLLM setup, debugging) | ~$3 |
| Live validation session (30 minutes) | ~$1 |
| Snapshot storage (5 days from Tuesday to Friday) | ~$1.60 |
| Live judging window (estimated 24 hours) | ~$48 |
| **Total estimated** | **~$54** of the free $100 credit |
Plenty of buffer for a longer judging window or a second iteration.
---
## Common pitfalls
- **"Out of capacity in the selected region"**: Switch to ATL1. NYC1 frequently runs out of MI300X. Pass `?region=atl1` in the Create-Droplet URL.
- **`Permission denied (publickey)` on SSH**: Either your default `~/.ssh/id_ed25519` is passphrase-protected and the agent isn't unlocked, or you're offering the wrong key. Use a dedicated passphrase-less key (step 1) and `-o IdentityAgent=none` on the ssh command.
- **vLLM exits with `Triton FlashAttention error` on first run**: Older vLLM 0.8.x builds had this issue. The 0.17.1 + ROCm 7.0 build we use has it fixed. If you're stuck on an older image, prefix with `VLLM_USE_TRITON_FLASH_ATTN=0`.
- **Docker container `rocm` not running after reboot**: Run `docker start rocm` manually; it is not auto-started by default (or set a restart policy as sketched in step 5).
- **Powered-off droplet still billing**: Power-off does **not** stop billing. Only **Destroy** does. Snapshot first if you want to keep the state.
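When it isn't obvious which layer is at fault, this quick checklist (run on the droplet, assuming the setup above) narrows it down from the inside out:
```bash
docker ps --filter name=rocm      # 1. Is the rocm container up?
ss -ltnp | grep 8000              # 2. Is anything listening on port 8000?
ufw status | grep 8000            # 3. Is the firewall letting 8000 through?
curl -s http://127.0.0.1:8000/v1/models \
  -H "Authorization: Bearer sk-paperhawk-2026"   # 4. Does the endpoint answer locally?
```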
---
## Cross-references
- [`docs/HUGGINGFACE_DEPLOYMENT.md`](HUGGINGFACE_DEPLOYMENT.md) – how the Streamlit Space talks to this vLLM endpoint
- [`docs/ARCHITECTURE.md`](ARCHITECTURE.md) – how the application uses the vLLM endpoint via the provider abstraction
- [`docs/AMD_DEPLOY_LESSONS_LEARNED.md`](AMD_DEPLOY_LESSONS_LEARNED.md) – extended history of every push iteration, error message, and workaround we hit