---
title: PhysiX-Infer
emoji:
colorFrom: yellow
colorTo: red
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: Dual-model inference (Qwen 2.5 3B + physix-3b-rl)
suggested_hardware: l4x1
tags:
- inference
- vllm
- qwen2
- physix
---
<!--
Note: `hardware:` and `sleep_time:` are NOT readable from this frontmatter.
Only `suggested_hardware:` is, and even that is informational (it shows up
on the Space card but does not auto-upgrade). After the first push, run
`scripts/configure_space.py` once to:
1. Upgrade the Space to L4 (l4x1)
2. Set sleep_time to 300 seconds
See that script's docstring for details.
-->
# PhysiX-Infer — dual-model inference Space
OpenAI-compatible inference for the two 3B Qwen2 checkpoints used by the [PhysiX-Live](https://huggingface.co/spaces/Pratyush-01/physix-live) demo:
| Model id (use as `model` field) | Role |
| --- | --- |
| `Qwen/Qwen2.5-3B-Instruct` | Untrained baseline |
| `Pratyush-01/physix-3b-rl` | GRPO-trained variant |
## Why this Space exists
The HF Inference Router does not currently serve `Qwen/Qwen2.5-3B-Instruct` (no provider has it loaded), and won't serve the fine-tune unless its owner runs a paid Inference Endpoint. Both checkpoints are small enough to share a single L4 (24 GB) — `~6.2 GB` each in fp16, plus KV cache — so we just run two `vllm serve` processes side by side and dispatch on the `model` field.
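For a rough sense of that budget, here is the arithmetic behind those numbers, using only the figures quoted above (exact usage varies with vLLM version and dtype):

```python
# Back-of-the-envelope VRAM budget per model on the 24 GB L4 (approximate).
params = 3.1e9                  # Qwen2.5-3B parameter count, roughly
weights_gb = params * 2 / 1e9   # fp16 = 2 bytes/param -> ~6.2 GB of weights
budget_gb = 24 * 0.40           # --gpu-memory-utilization 0.40 -> ~9.6 GB per process
kv_gb = budget_gb - weights_gb  # ~3.4 GB left for KV cache at --max-model-len 4096
print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB per model")
```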
## Architecture
```
┌────────────────── Space (L4, 24 GB) ──────────────────┐
│ │
│ :8001 vllm serve Qwen/Qwen2.5-3B-Instruct │
│ :8002 vllm serve Pratyush-01/physix-3b-rl │
│ │
│ :7860 proxy.py (FastAPI) │
│ routes by JSON `model` field │
└───────────────────────────────────────────────────────┘
```
Each vLLM process gets `--gpu-memory-utilization 0.40` and `--max-model-len 4096`, and the two are booted **sequentially** (Qwen first, then PhysiX) so that the second process sees the VRAM actually left free after the first has reserved its share; booting them in parallel caused a "No available memory for the cache blocks" crash on the first deploy attempt. The proxy is `~150` lines of FastAPI + httpx; streaming bytes are forwarded verbatim so SSE framing survives intact.
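A minimal sketch of that sequential boot, assuming a Python launcher (the repo's real entrypoint may differ): start the first server, poll its `/health`, and only then start the second.

```python
# Sketch of a sequential boot launcher (illustrative; not the actual entrypoint file).
import subprocess
import time
import urllib.request

def start_vllm(model: str, port: int) -> subprocess.Popen:
    # One OpenAI-compatible vLLM server per model, each capped at 40% of the GPU.
    return subprocess.Popen([
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", "0.40",
        "--max-model-len", "4096",
    ])

def wait_ready(port: int, timeout_s: float = 600.0) -> None:
    # vLLM's OpenAI server answers GET /health with 200 once the model is loaded.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            urllib.request.urlopen(f"http://127.0.0.1:{port}/health", timeout=2)
            return
        except OSError:
            time.sleep(2)
    raise RuntimeError(f"vLLM on :{port} did not become healthy in time")

baseline = start_vllm("Qwen/Qwen2.5-3B-Instruct", 8001)
wait_ready(8001)  # only after this does the second process boot,
finetune = start_vllm("Pratyush-01/physix-3b-rl", 8002)  # so it sees the remaining VRAM
wait_ready(8002)
```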
## Sleep behavior
The sleep time is set to **300 seconds** via `scripts/configure_space.py` (it cannot be set from the frontmatter; see the note above), so the Space pauses after **5 minutes** idle and stops billing immediately. The first request after a sleep cold-boots both vLLMs, which takes **~90-120 s** on a warm Hub cache. The proxy's `/health` returns `503` while either upstream is still booting; the demo's frontend uses that to render a "warming up" badge.
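Other clients can use the same signal. A small caller-side warm-up helper (a sketch using `requests`, not part of this repo) could look like:

```python
# Client-side warm-up helper (sketch): poll /health until the proxy reports both
# upstreams healthy (200), treating 503 or connection errors as "still warming up".
import time
import requests

def wait_until_warm(base_url: str, timeout_s: float = 180.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # Space may still be cold-booting
        time.sleep(5)
    return False

# wait_until_warm("http://localhost:7860")
```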
## Endpoints
| Method | Path | Notes |
| --- | --- | --- |
| `POST` | `/v1/chat/completions` | OpenAI spec; `model` field selects upstream |
| `POST` | `/v1/completions` | same routing, kept for older clients |
| `GET` | `/v1/models` | lists both ids |
| `GET` | `/health` | 200 iff both vLLMs healthy |
| `GET` | `/` | plain HTML landing page |
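The dispatch itself is the simple part. A condensed sketch of the idea behind `proxy.py` (illustrative, not the actual file), assuming the FastAPI + httpx setup described above:

```python
# Condensed sketch of the routing idea in proxy.py (illustrative, not the real file).
import httpx
from fastapi import FastAPI, Request, Response
from fastapi.responses import StreamingResponse
from starlette.background import BackgroundTask

UPSTREAMS = {
    "Qwen/Qwen2.5-3B-Instruct": "http://127.0.0.1:8001",
    "Pratyush-01/physix-3b-rl": "http://127.0.0.1:8002",
}

app = FastAPI()
client = httpx.AsyncClient(timeout=None)

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    base = UPSTREAMS.get(body.get("model", ""))
    if base is None:
        return Response(status_code=404, content='{"error": "unknown model"}',
                        media_type="application/json")
    # Forward the request and stream the upstream bytes back verbatim,
    # so SSE framing for `stream: true` requests is preserved.
    upstream = client.build_request("POST", f"{base}/v1/chat/completions", json=body)
    resp = await client.send(upstream, stream=True)
    return StreamingResponse(resp.aiter_raw(), status_code=resp.status_code,
                             media_type=resp.headers.get("content-type"),
                             background=BackgroundTask(resp.aclose))
```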
## Auth
None. The Space is open access; the only cost bound is the 5-minute sleep window, so anyone can hit it, but the GPU stops billing one idle cycle after traffic stops.
## Local smoke test
You need a CUDA GPU with 16+ GB free.
```bash
docker build -t physix-infer .
docker run --rm --gpus all -p 7860:7860 physix-infer
# wait ~90s, then:
curl -sS http://localhost:7860/health
curl -sS -X POST http://localhost:7860/v1/chat/completions \
-H 'content-type: application/json' \
-d '{"model":"Qwen/Qwen2.5-3B-Instruct","messages":[{"role":"user","content":"hi"}]}'
```
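The same request through the `openai` Python client, which only needs a placeholder API key since the Space has no auth (sketch, assuming `pip install openai`):

```python
# Equivalent smoke test via the openai client; the key value is ignored by the proxy.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7860/v1", api_key="not-needed")
reply = client.chat.completions.create(
    model="Pratyush-01/physix-3b-rl",
    messages=[{"role": "user", "content": "hi"}],
)
print(reply.choices[0].message.content)
```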
## Wiring into the demo
In the [physix-live](https://github.com/openenv-hackathon/physix-live) frontend, this Space is exposed as the **PhysiX-Infer (GPU)** preset. Select it from the endpoint dropdown and choose either model id from the suggestions. No API key is required.