# vLLM serving on AMD MI300X
This directory contains the infrastructure to serve **Qwen 2.5 Instruct** via
[vLLM](https://github.com/vllm-project/vllm) on an **AMD Instinct MI300X**
GPU through the AMD Developer Cloud.
The Streamlit app (`app/main.py`) and the LangGraph pipeline call this
endpoint via the OpenAI-compatible REST API (`/v1/chat/completions`),
using `langchain-openai`'s `ChatOpenAI` adapter with a custom `base_url`.
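As a point of reference, this is roughly what that client-side wiring looks like; the environment variable names are the ones introduced in section 6, and the exact code in `app/main.py` may differ:
```python
# Minimal sketch: point langchain-openai's ChatOpenAI at the vLLM endpoint.
# Uses the env vars from section 6; the real app wiring may differ.
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model=os.environ["VLLM_MODEL"],        # e.g. Qwen/Qwen2.5-14B-Instruct
    base_url=os.environ["VLLM_BASE_URL"],  # e.g. http://<mi300x-public-ip>:8000/v1
    api_key=os.environ["VLLM_API_KEY"],
    temperature=0.0,
)

print(llm.invoke("What is 2+2?").content)
```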
---
## 1. Prerequisites
- **AMD AI Developer Program** approval ($100 cloud credit per team member)
  - Sign up: https://www.amd.com/en/developer/ai-dev-program.html
  - Approval typically takes 2 business days, occasionally up to 1 week
- **AMD Developer Cloud** account with an MI300X instance available
- **SSH access** to the MI300X instance
- (Optional) **Hugging Face token** if the model is gated (Qwen 2.5 is open,
  so this is **not required** for the default model)
---
## 2. Provision the MI300X instance
Follow the AMD Developer Cloud Getting Started guide:
https://www.amd.com/en/developer/resources/technical-articles/2025/how-to-get-started-on-the-amd-developer-cloud-.html
The default ROCm-enabled image already includes Docker and the AMD GPU
driver. Verify GPU access:
```bash
rocm-smi
# Expected: 1 × AMD Instinct MI300X listed
```
---
## 3. Pull the vLLM ROCm image
```bash
docker pull rocm/vllm:latest
```
Image size: ~30 GB (ROCm runtime + PyTorch + vLLM + dependencies).
---
## 4. Start the vLLM server
### Option A — Docker (recommended)
```bash
docker run --rm \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --ipc=host \
  --shm-size 16g \
  -p 8000:8000 \
  -e VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \
  -e VLLM_API_KEY=$(openssl rand -hex 32) \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  rocm/vllm:latest \
  sh -c 'vllm serve $VLLM_MODEL \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 1 \
    --dtype auto \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --api-key $VLLM_API_KEY'
```
The HF cache mount avoids re-downloading the ~28 GB Qwen 2.5 weights on
container restart.
**Print the API key** that was generated (`echo $VLLM_API_KEY` from inside
the container, or use a fixed string instead of `openssl rand`). You will
paste this into the Streamlit app's `.env` as `VLLM_API_KEY`.
### Option B — `serve.sh` directly
If vLLM is pip-installed in a ROCm-enabled environment on the host:
```bash
chmod +x infra/vllm/serve.sh
VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct \
VLLM_API_KEY=<your-key> \
./infra/vllm/serve.sh
```
---
## 5. Verify the endpoint
From any machine with network access to the MI300X:
```bash
curl http://<mi300x-public-ip>:8000/v1/models \
  -H "Authorization: Bearer <your-api-key>"
```
Expected response (truncated):
```json
{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen2.5-14B-Instruct",
      "object": "model",
      "owned_by": "vllm",
      ...
    }
  ]
}
```
A simple chat-completion smoke test:
```bash
curl http://<mi300x-public-ip>:8000/v1/chat/completions \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-14B-Instruct",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "temperature": 0.0
  }'
```
---
## 6. Connect the Streamlit app
In the project root `.env`:
```dotenv
LLM_PROFILE=vllm
VLLM_BASE_URL=http://<mi300x-public-ip>:8000/v1
VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct
VLLM_API_KEY=<your-key>
```
Then start the Streamlit app:
```bash
docker compose up langgraph-app
```
Or directly:
```bash
streamlit run app/main.py
```
---
## 7. Performance benchmark (expected)
On a single AMD MI300X (192 GB HBM3, ROCm 6.2+, vLLM 0.6+):
| Metric | Qwen 2.5 14B | Qwen 2.5 32B |
|--------|--------------|--------------|
| Time-to-first-token | ~0.5 s | ~1.0 s |
| Throughput (single user) | 50-80 tok/s | 25-40 tok/s |
| Concurrent capacity (KV-cache bound) | ~50 sessions | ~20 sessions |
| Max context length | 32K (configured) | 32K (configured) |
These numbers depend on prompt length, batch size, and the exact ROCm/vLLM
version. Run a benchmark with `vllm bench` after startup for the actual
numbers on your instance.
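If you want a quick sanity check before setting up the full benchmark harness, the rough sketch below uses the `openai` Python client against the endpoint from section 5; it counts streamed chunks as an approximation of tokens, so treat the output as a ballpark figure only:
```python
# Rough single-request latency check against the vLLM endpoint (ballpark only;
# use `vllm bench` for real numbers). Env vars are the ones from section 6.
import os
import time

from openai import OpenAI

client = OpenAI(base_url=os.environ["VLLM_BASE_URL"],
                api_key=os.environ["VLLM_API_KEY"])

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model=os.environ["VLLM_MODEL"],
    messages=[{"role": "user", "content": "Explain KV-cache paging in five sentences."}],
    max_tokens=256,
    temperature=0.0,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # roughly one token per streamed chunk

elapsed = time.perf_counter() - start
ttft = (first_token_at or start) - start
print(f"time-to-first-token: {ttft:.2f} s")
print(f"~{chunks / max(elapsed - ttft, 1e-6):.0f} chunks/s after the first token")
```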
---
## 8. Cost monitoring
AMD Developer Cloud MI300X pricing (as of May 2026):
- ~$4-8/hour pay-as-you-go
`$100 / team-member × 3 team-members = $300 total credit`. At $5/h, that's
**60 hours of MI300X uptime**. Plan accordingly:
- **Only run during demo/test/build sessions** — stop the instance when idle
- Keep one teammate's credit as **failover/buffer** for the final 24 hours
- Run end-to-end smoke tests early so a hot fix doesn't burn deadline-day credits
---
## 9. Plan B — local fallback if MI300X is unavailable
If the AMD credit doesn't arrive in time, or the MI300X instance has issues:
```bash
# Switch the Streamlit app to Ollama profile
LLM_PROFILE=ollama OLLAMA_MODEL=qwen2.5:7b-instruct streamlit run app/main.py
```
Pull the model first:
```bash
ollama pull qwen2.5:7b-instruct
```
This runs on a laptop GPU (or CPU) and lets development continue. Quality
will be lower (7B vs 14B/32B), but the demo flow stays alive.
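One way the `LLM_PROFILE` switch can be expressed in code is sketched below; the helper name and structure are illustrative rather than a description of how `app/main.py` actually does it, and `langchain-ollama` is assumed to be installed for the fallback path:
```python
# Illustrative sketch of an LLM_PROFILE switch between vLLM and Ollama.
# The helper name and structure are hypothetical; only the env var names
# match the rest of this README.
import os

from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI


def make_llm():
    if os.environ.get("LLM_PROFILE", "vllm") == "ollama":
        # Plan B: local Ollama model (e.g. qwen2.5:7b-instruct)
        return ChatOllama(model=os.environ.get("OLLAMA_MODEL", "qwen2.5:7b-instruct"))
    # Default: remote vLLM endpoint on the MI300X
    return ChatOpenAI(
        model=os.environ["VLLM_MODEL"],
        base_url=os.environ["VLLM_BASE_URL"],
        api_key=os.environ["VLLM_API_KEY"],
        temperature=0.0,
    )
```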
---
## 10. Production hardening (post-hackathon)
For an actual production deployment, beyond the hackathon scope:
- Put a real reverse proxy (Caddy / Nginx) with TLS in front instead of exposing the raw vLLM port
- Rotate `VLLM_API_KEY` regularly
- Set up Prometheus + Grafana on top of vLLM's `/metrics` endpoint (quick scrape sketch below)
- Use the `--quantization` flag (FP8/INT8) to fit a larger model on smaller hardware
- Configure `--enable-prefix-caching` for repeated long system prompts
- Use `vllm-deploy` (SkyPilot) for multi-GPU and multi-region scaling
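As a stopgap before a full Prometheus setup, the sketch below (referenced from the `/metrics` bullet above) just reads the endpoint directly. It assumes vLLM exposes `/metrics` on the serving port; whether the API key is enforced on non-`/v1` paths can vary by version, so adjust the auth header to your deployment:
```python
# Quick peek at vLLM's Prometheus metrics without a Prometheus server.
# Assumes the default /metrics route on the serving port; adjust URL/auth
# to your deployment.
import os

import requests

base = os.environ["VLLM_BASE_URL"].removesuffix("/v1")  # http://<mi300x-public-ip>:8000
resp = requests.get(
    f"{base}/metrics",
    headers={"Authorization": f"Bearer {os.environ['VLLM_API_KEY']}"},
    timeout=10,
)
resp.raise_for_status()

# Keep only the vLLM-specific counters/gauges (requests, KV-cache usage, ...).
for line in resp.text.splitlines():
    if line.startswith("vllm:"):
        print(line)
```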