---
title: PhysiX-Infer
emoji:
colorFrom: yellow
colorTo: red
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: Dual-model inference (Qwen 2.5 3B + physix-3b-rl)
suggested_hardware: l4x1
tags:
- inference
- vllm
- qwen2
- physix
---
<!--
Note: `hardware:` and `sleep_time:` are NOT readable from this frontmatter.
Only `suggested_hardware:` is, and even that is informational (it shows up
on the Space card but does not auto-upgrade). After the first push, run
`scripts/configure_space.py` once to:
1. Upgrade the Space to L4 (l4x1)
2. Set sleep_time to 300 seconds
See that script's docstring for details.
-->
# PhysiX-Infer — dual-model inference Space
OpenAI-compatible inference for the two 3B Qwen2 checkpoints used by the [PhysiX-Live](https://huggingface.co/spaces/Pratyush-01/physix-live) demo:
| Model id (use as `model` field) | Role |
| --- | --- |
| `Qwen/Qwen2.5-3B-Instruct` | Untrained baseline |
| `Pratyush-01/physix-3b-rl` | GRPO-trained variant |
## Why this Space exists
The HF Inference Router does not currently serve `Qwen/Qwen2.5-3B-Instruct` (no provider has it loaded), and won't serve the fine-tune unless its owner runs a paid Inference Endpoint. Both checkpoints are small enough to share a single L4 (24 GB) — `~6.2 GB` each in fp16, plus KV cache — so we just run two `vllm serve` processes side by side and dispatch on the `model` field.
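For a rough sense of that budget, here is the arithmetic behind those numbers, using only the figures quoted above (exact usage varies with vLLM version and dtype):

```python
# Back-of-the-envelope VRAM budget per model on the 24 GB L4 (approximate).
params = 3.1e9                  # Qwen2.5-3B parameter count, roughly
weights_gb = params * 2 / 1e9   # fp16 = 2 bytes/param -> ~6.2 GB of weights
budget_gb = 24 * 0.40           # --gpu-memory-utilization 0.40 -> ~9.6 GB per process
kv_gb = budget_gb - weights_gb  # ~3.4 GB left for KV cache at --max-model-len 4096
print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB per model")
```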
## Architecture
```
┌────────────────── Space (L4, 24 GB) ──────────────────┐
│ │
│ :8001 vllm serve Qwen/Qwen2.5-3B-Instruct │
│ :8002 vllm serve Pratyush-01/physix-3b-rl │
│ │
│ :7860 proxy.py (FastAPI) │
│ routes by JSON `model` field │
└───────────────────────────────────────────────────────┘
```
Each vLLM process gets `--gpu-memory-utilization 0.40` and `--max-model-len 4096`, and the two are booted **sequentially** (Qwen first, then PhysiX) so that the second process sees the VRAM actually left free after the first has reserved its share; booting them in parallel caused a "No available memory for the cache blocks" crash on the first deploy attempt. The proxy is `~150` lines of FastAPI + httpx; streaming bytes are forwarded verbatim so SSE framing survives intact.
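A minimal sketch of that sequential boot, assuming a Python launcher (the repo's real entrypoint may differ): start the first server, poll its `/health`, and only then start the second.

```python
# Sketch of a sequential boot launcher (illustrative; not the actual entrypoint file).
import subprocess
import time
import urllib.request

def start_vllm(model: str, port: int) -> subprocess.Popen:
    # One OpenAI-compatible vLLM server per model, each capped at 40% of the GPU.
    return subprocess.Popen([
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", "0.40",
        "--max-model-len", "4096",
    ])

def wait_ready(port: int, timeout_s: float = 600.0) -> None:
    # vLLM's OpenAI server answers GET /health with 200 once the model is loaded.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            urllib.request.urlopen(f"http://127.0.0.1:{port}/health", timeout=2)
            return
        except OSError:
            time.sleep(2)
    raise RuntimeError(f"vLLM on :{port} did not become healthy in time")

baseline = start_vllm("Qwen/Qwen2.5-3B-Instruct", 8001)
wait_ready(8001)  # only after this does the second process boot,
finetune = start_vllm("Pratyush-01/physix-3b-rl", 8002)  # so it sees the remaining VRAM
wait_ready(8002)
```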
## Sleep behavior
The sleep time is set to **300 seconds** via `scripts/configure_space.py` (it cannot be set from the frontmatter; see the note above), so the Space pauses after **5 minutes** idle and stops billing immediately. The first request after a sleep cold-boots both vLLMs, which takes **~90-120 s** on a warm Hub cache. The proxy's `/health` returns `503` while either upstream is still booting; the demo's frontend uses that to render a "warming up" badge.
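Other clients can use the same signal. A small caller-side warm-up helper (a sketch using `requests`, not part of this repo) could look like:

```python
# Client-side warm-up helper (sketch): poll /health until the proxy reports both
# upstreams healthy (200), treating 503 or connection errors as "still warming up".
import time
import requests

def wait_until_warm(base_url: str, timeout_s: float = 180.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # Space may still be cold-booting
        time.sleep(5)
    return False

# wait_until_warm("http://localhost:7860")
```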
## Endpoints
| Method | Path | Notes |
| --- | --- | --- |
| `POST` | `/v1/chat/completions` | OpenAI spec; `model` field selects upstream |
| `POST` | `/v1/completions` | same routing, kept for older clients |
| `GET` | `/v1/models` | lists both ids |
| `GET` | `/health` | 200 iff both vLLMs healthy |
| `GET` | `/` | plain HTML landing page |
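The dispatch itself is the simple part. A condensed sketch of the idea behind `proxy.py` (illustrative, not the actual file), assuming the FastAPI + httpx setup described above:

```python
# Condensed sketch of the routing idea in proxy.py (illustrative, not the real file).
import httpx
from fastapi import FastAPI, Request, Response
from fastapi.responses import StreamingResponse
from starlette.background import BackgroundTask

UPSTREAMS = {
    "Qwen/Qwen2.5-3B-Instruct": "http://127.0.0.1:8001",
    "Pratyush-01/physix-3b-rl": "http://127.0.0.1:8002",
}

app = FastAPI()
client = httpx.AsyncClient(timeout=None)

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    base = UPSTREAMS.get(body.get("model", ""))
    if base is None:
        return Response(status_code=404, content='{"error": "unknown model"}',
                        media_type="application/json")
    # Forward the request and stream the upstream bytes back verbatim,
    # so SSE framing for `stream: true` requests is preserved.
    upstream = client.build_request("POST", f"{base}/v1/chat/completions", json=body)
    resp = await client.send(upstream, stream=True)
    return StreamingResponse(resp.aiter_raw(), status_code=resp.status_code,
                             media_type=resp.headers.get("content-type"),
                             background=BackgroundTask(resp.aclose))
```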
## Auth
None. The Space is open access; the only cost bound is the 5-minute sleep window, so anyone can hit it, but the GPU stops billing one idle cycle after traffic stops.
## Local smoke test
You need a CUDA GPU with 16+ GB free.
```bash
docker build -t physix-infer .
docker run --rm --gpus all -p 7860:7860 physix-infer
# wait ~90s, then:
curl -sS http://localhost:7860/health
curl -sS -X POST http://localhost:7860/v1/chat/completions \
-H 'content-type: application/json' \
-d '{"model":"Qwen/Qwen2.5-3B-Instruct","messages":[{"role":"user","content":"hi"}]}'
```
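The same request through the `openai` Python client, which only needs a placeholder API key since the Space has no auth (sketch, assuming `pip install openai`):

```python
# Equivalent smoke test via the openai client; the key value is ignored by the proxy.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7860/v1", api_key="not-needed")
reply = client.chat.completions.create(
    model="Pratyush-01/physix-3b-rl",
    messages=[{"role": "user", "content": "hi"}],
)
print(reply.choices[0].message.content)
```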
## Wiring into the demo
In the [physix-live](https://github.com/openenv-hackathon/physix-live) frontend, this Space is exposed as the **PhysiX-Infer (GPU)** preset. Select it from the endpoint dropdown and choose either model id from the suggestions. No API key is required.