---
title: PhysiX-Infer
emoji: ⚡
colorFrom: yellow
colorTo: red
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: Dual-model inference (Qwen 2.5 3B + physix-3b-rl)
suggested_hardware: l4x1
tags:
  - inference
  - vllm
  - qwen2
  - physix
---

<!--
Note: `hardware:` and `sleep_time:` are NOT readable from this frontmatter.
Only `suggested_hardware:` is, and even that is informational (it shows up
on the Space card but does not auto-upgrade). After the first push, run
`scripts/configure_space.py` once to:
  1. Upgrade the Space to L4 (l4x1)
  2. Set sleep_time to 300 seconds
See that script's docstring for details.
-->
# PhysiX-Infer — dual-model inference Space

OpenAI-compatible inference for the two 3B Qwen2 checkpoints used by the [PhysiX-Live](https://huggingface.co/spaces/Pratyush-01/physix-live) demo:
| Model id (use as `model` field) | Role |
| --- | --- |
| `Qwen/Qwen2.5-3B-Instruct` | Untrained baseline |
| `Pratyush-01/physix-3b-rl` | GRPO-trained variant |
## Why this Space exists

The HF Inference Router does not currently serve `Qwen/Qwen2.5-3B-Instruct` (no provider has it loaded), and won't serve the fine-tune unless its owner runs a paid Inference Endpoint. Both checkpoints are small enough to share a single L4 (24 GB) — `~6.2 GB` each in fp16, plus KV cache — so we just run two `vllm serve` processes side by side and dispatch on the `model` field.
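For intuition, the budget works out like this (a back-of-envelope sketch; the `3.09B` parameter count for Qwen2.5-3B is our assumption, and CUDA/runtime overhead is ignored):

```python
# Back-of-envelope VRAM budget for the shared L4 (24 GB).
# Assumption: a Qwen2.5-3B-class model has ~3.09B params; fp16 = 2 bytes/param.
params_per_model = 3.09e9
weights_gb = params_per_model * 2 / 1e9   # ~6.2 GB of weights per model
both_models_gb = 2 * weights_gb           # ~12.4 GB for both sets of weights
headroom_gb = 24 - both_models_gb         # ~11.6 GB left for KV cache + overhead
print(f"{weights_gb:.1f} GB/model, {headroom_gb:.1f} GB headroom")
```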
## Architecture

```
┌────────────────── Space (L4, 24 GB) ──────────────────┐
│                                                       │
│  :8001  vllm serve Qwen/Qwen2.5-3B-Instruct           │
│  :8002  vllm serve Pratyush-01/physix-3b-rl           │
│                                                       │
│  :7860  proxy.py (FastAPI)                            │
│         routes by JSON `model` field                  │
└───────────────────────────────────────────────────────┘
```
Each vLLM gets `--gpu-memory-utilization 0.40` and `--max-model-len 4096`, and they're booted **sequentially** (Qwen first, then PhysiX) so the second process correctly observes the VRAM left free after the first has allocated its share — booting in parallel caused a "No available memory for the cache blocks" crash on the first deploy attempt. The proxy is `~150` lines of FastAPI + httpx; streaming bytes are forwarded verbatim so SSE framing survives.
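A minimal sketch of that sequential boot, assuming a Python entrypoint (the actual startup script is not shown here, so names are illustrative; the flags are the ones quoted above):

```python
# start_sketch.py — illustrative entrypoint: boot the vLLMs one at a time,
# then hand off to the proxy. Not the Space's actual script.
import subprocess
import time
import urllib.request

def boot(model: str, port: int) -> subprocess.Popen:
    proc = subprocess.Popen([
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", "0.40",
        "--max-model-len", "4096",
    ])
    # Block until this server answers /health, so the *next* vLLM sees the
    # true post-allocation free VRAM (avoids the parallel-boot cache crash).
    while True:
        try:
            urllib.request.urlopen(f"http://127.0.0.1:{port}/health", timeout=2)
            return proc
        except OSError:
            time.sleep(2)

boot("Qwen/Qwen2.5-3B-Instruct", 8001)  # baseline first
boot("Pratyush-01/physix-3b-rl", 8002)  # then the fine-tune
subprocess.run(["python", "proxy.py"])  # proxy listens on :7860
```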
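And a sketch of the dispatch itself, again with illustrative names (the real `proxy.py` is described above only as ~150 lines of FastAPI + httpx routing on the `model` field and forwarding stream bytes verbatim):

```python
# proxy_sketch.py — illustrative dispatch-by-`model` proxy, not the real proxy.py.
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from starlette.background import BackgroundTask

UPSTREAMS = {
    "Qwen/Qwen2.5-3B-Instruct": "http://127.0.0.1:8001",
    "Pratyush-01/physix-3b-rl": "http://127.0.0.1:8002",
}
app = FastAPI()
client = httpx.AsyncClient(timeout=None)

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    base = UPSTREAMS[body["model"]]  # dispatch on the JSON `model` field
    req = client.build_request("POST", f"{base}/v1/chat/completions", json=body)
    resp = await client.send(req, stream=True)
    # Forward raw bytes so SSE framing survives for `"stream": true` requests.
    return StreamingResponse(
        resp.aiter_raw(),
        status_code=resp.status_code,
        media_type=resp.headers.get("content-type"),
        background=BackgroundTask(resp.aclose),  # close upstream when done
    )
```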
## Sleep behavior

`sleep_time` is set to 300 via `scripts/configure_space.py` (it is not readable from the frontmatter — see the note there), so the Space pauses after **5 minutes** idle and stops billing immediately. The first request after a sleep cold-boots both vLLMs, which takes **~90-120 s** on a warm Hub cache. The proxy's `/health` returns `503` while either upstream is still booting; the demo's frontend uses that to render a "warming up" badge.
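Clients can reuse that same signal before sending a first request; a minimal polling helper (hypothetical, and the base URL assumes the standard `*.hf.space` naming):

```python
# wait_warm.py — poll /health until both vLLMs report healthy (sketch).
import time
import httpx

SPACE = "https://pratyush-01-physix-infer.hf.space"  # assumed URL

def wait_until_warm(timeout_s: float = 300.0) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if httpx.get(f"{SPACE}/health", timeout=10).status_code == 200:
                return  # both upstreams are up
        except httpx.HTTPError:
            pass  # the Space itself may still be cold-booting
        time.sleep(5)
    raise TimeoutError("Space did not warm up in time")
```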
## Endpoints

| Method | Path | Notes |
| --- | --- | --- |
| `POST` | `/v1/chat/completions` | OpenAI spec; `model` field selects upstream |
| `POST` | `/v1/completions` | same routing, kept for older clients |
| `GET` | `/v1/models` | lists both ids |
| `GET` | `/health` | 200 iff both vLLMs healthy |
| `GET` | `/` | plain HTML landing page |
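Because the surface is OpenAI-compatible, the stock `openai` Python client works as-is; a sketch hitting both models (again, the base URL assumes the standard `*.hf.space` naming):

```python
from openai import OpenAI

# No auth is enforced, but the client library requires a non-empty key string.
client = OpenAI(
    base_url="https://pratyush-01-physix-infer.hf.space/v1",  # assumed URL
    api_key="unused",
)
for model in ("Qwen/Qwen2.5-3B-Instruct", "Pratyush-01/physix-3b-rl"):
    out = client.chat.completions.create(
        model=model,  # selects which vLLM the proxy forwards to
        messages=[{"role": "user", "content": "hi"}],
        max_tokens=32,
    )
    print(model, "->", out.choices[0].message.content)
```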
## Auth

None. The Space is open access, bounded by the 5-min sleep window — anyone can hit it, but they can't run it for free past one idle cycle.
## Local smoke test

You need a CUDA GPU with 16+ GB free.

```bash
docker build -t physix-infer .
docker run --rm --gpus all -p 7860:7860 physix-infer
# wait ~90s, then:
curl -sS http://localhost:7860/health
curl -sS -X POST http://localhost:7860/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"Qwen/Qwen2.5-3B-Instruct","messages":[{"role":"user","content":"hi"}]}'
```
## Wiring into the demo

In the [physix-live](https://github.com/openenv-hackathon/physix-live) frontend, this Space is exposed as the **PhysiX-Infer (GPU)** preset. Select it from the endpoint dropdown and pick either model id from the suggestions. No API key required.