---
title: PhysiX-Infer
emoji: ⚡
colorFrom: yellow
colorTo: red
sdk: docker
app_port: 7860
sleep_time: 300
pinned: false
license: apache-2.0
short_description: Dual-model inference (Qwen 2.5 3B + physix-3b-rl)
suggested_hardware: l4x1
tags:
  - inference
  - vllm
  - qwen2
  - physix
---

# PhysiX-Infer — dual-model inference Space

OpenAI-compatible inference for the two 3B Qwen2 checkpoints used by the [PhysiX-Live](https://huggingface.co/spaces/Pratyush-01/physix-live) demo:

| Model id (use as `model` field) | Role |
| --- | --- |
| `Qwen/Qwen2.5-3B-Instruct` | Untrained baseline |
| `Pratyush-01/physix-3b-rl` | GRPO-trained variant |

## Why this Space exists

The HF Inference Router does not currently serve `Qwen/Qwen2.5-3B-Instruct` (no provider has it loaded), and won't serve the fine-tune unless its owner runs a paid Inference Endpoint. Both checkpoints are small enough to share a single L4 (24 GB) — ~6.2 GB each in fp16 (≈3.1 B params × 2 bytes), plus KV cache — so we just run two `vllm serve` processes side by side and dispatch on the `model` field.

## Architecture

```
┌────────────────── Space (L4, 24 GB) ──────────────────┐
│                                                       │
│  :8001  vllm serve Qwen/Qwen2.5-3B-Instruct           │
│  :8002  vllm serve Pratyush-01/physix-3b-rl           │
│                                                       │
│  :7860  proxy.py (FastAPI)                            │
│         routes by JSON `model` field                  │
└───────────────────────────────────────────────────────┘
```

Each vLLM gets `--gpu-memory-utilization 0.40` and `--max-model-len 4096`, and they're booted **sequentially** (Qwen first, then PhysiX) so the second process sizes its KV cache against the VRAM actually left free by the first — booting both in parallel caused a "No available memory for the cache blocks" crash on the first deploy attempt. The proxy is ~150 lines of FastAPI + httpx; streaming bytes are forwarded verbatim so SSE framing survives. Both the boot sequence and the routing are sketched at the end of this README.

## Sleep behavior

`sleep_time: 300` in the frontmatter — the Space pauses after **5 minutes** idle and stops billing immediately. The first request after a sleep cold-boots both vLLMs, which takes **~90-120 s** on a warm Hub cache. The proxy's `/health` returns `503` while either upstream is still booting; the demo's frontend uses that to render a "warming up" badge.

## Endpoints

| Method | Path | Notes |
| --- | --- | --- |
| `POST` | `/v1/chat/completions` | OpenAI spec; `model` field selects upstream |
| `POST` | `/v1/completions` | same routing, kept for older clients |
| `GET` | `/v1/models` | lists both ids |
| `GET` | `/health` | 200 iff both vLLMs healthy |
| `GET` | `/` | plain HTML landing page |

## Auth

None. The Space is open access, bounded by the 5-min sleep window — anyone can hit it, but they can't run it for free past one idle cycle.

## Local smoke test

You need a CUDA GPU with 16+ GB free.

```bash
docker build -t physix-infer .
docker run --rm --gpus all -p 7860:7860 physix-infer
# wait ~90s, then:
curl -sS http://localhost:7860/health
curl -sS -X POST http://localhost:7860/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"Qwen/Qwen2.5-3B-Instruct","messages":[{"role":"user","content":"hi"}]}'
```

## Wiring into the demo

In the [physix-live](https://github.com/openenv-hackathon/physix-live) frontend, this Space is exposed as the **PhysiX-Infer (GPU)** preset. Select it from the endpoint dropdown and pick either model id from the suggestions. No API key required.
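
## Boot sequence sketch

The Space's real entrypoint isn't reproduced here; the following is a minimal sketch of the sequential boot described under **Architecture**, assuming a plain-Python launcher and vLLM's standard `GET /health` endpoint. The helper name `wait_healthy` is illustrative, not from the repo.

```python
# Sketch only: launch the first vLLM, block until its /health answers,
# then launch the second, so the second engine sizes its KV cache against
# the VRAM actually left free by the first.
import subprocess
import time

import httpx

MODELS = [
    ("Qwen/Qwen2.5-3B-Instruct", 8001),
    ("Pratyush-01/physix-3b-rl", 8002),
]

def wait_healthy(port: int, timeout: float = 600.0) -> None:
    """Poll vLLM's /health until it answers 200 or we give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if httpx.get(f"http://127.0.0.1:{port}/health").status_code == 200:
                return
        except httpx.TransportError:
            pass  # server not listening yet
        time.sleep(2)
    raise TimeoutError(f"vLLM on :{port} never became healthy")

for model, port in MODELS:
    subprocess.Popen([
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", "0.40",
        "--max-model-len", "4096",
    ])
    wait_healthy(port)  # gate the next launch on this one being fully up
```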
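
## Proxy routing sketch

`proxy.py` itself (~150 lines) isn't reproduced here either; the sketch below shows only its core idea: pick an upstream from the JSON `model` field and forward the response bytes untouched so SSE framing survives. The upstream map, handler names, and error shape are illustrative.

```python
# Minimal sketch of the routing idea in proxy.py — not the real file.
# Upstream ports follow the architecture diagram above.
import httpx
from fastapi import FastAPI, Request, Response
from fastapi.responses import JSONResponse, StreamingResponse
from starlette.background import BackgroundTask

UPSTREAMS = {
    "Qwen/Qwen2.5-3B-Instruct": "http://127.0.0.1:8001",
    "Pratyush-01/physix-3b-rl": "http://127.0.0.1:8002",
}

app = FastAPI()
client = httpx.AsyncClient(timeout=None)

@app.post("/v1/chat/completions")
@app.post("/v1/completions")
async def route(request: Request):
    body = await request.json()
    base = UPSTREAMS.get(body.get("model", ""))
    if base is None:
        return JSONResponse({"error": "unknown model"}, status_code=404)
    req = client.build_request("POST", base + request.url.path, json=body)
    resp = await client.send(req, stream=True)
    # Forward raw bytes verbatim so SSE framing survives streamed responses.
    return StreamingResponse(
        resp.aiter_raw(),
        status_code=resp.status_code,
        media_type=resp.headers.get("content-type"),
        background=BackgroundTask(resp.aclose),  # close upstream when done
    )

@app.get("/health")
async def health():
    # 200 only once both vLLMs answer /health; 503 while either is booting.
    for base in UPSTREAMS.values():
        try:
            (await client.get(base + "/health")).raise_for_status()
        except httpx.HTTPError:
            return Response(status_code=503)
    return Response(status_code=200)
```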
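
## Calling it from Python

Outside the demo frontend, any OpenAI-compatible client can talk to the proxy. A sketch with the official `openai` SDK follows; the base URL assumes the standard `<owner>-<space>.hf.space` pattern (an assumption, check your Space's actual URL), and the health poll accounts for the ~90-120 s cold boot.

```python
# Sketch: calling the Space with the openai SDK (pip install openai httpx).
import time

import httpx
from openai import OpenAI

BASE = "https://pratyush-01-physix-infer.hf.space"  # assumed URL, verify yours

# Cold boots take ~90-120 s; wait until /health says both vLLMs are up.
while httpx.get(BASE + "/health").status_code != 200:
    time.sleep(5)

client = OpenAI(base_url=BASE + "/v1", api_key="unused")  # no auth, but the SDK wants a string
resp = client.chat.completions.create(
    model="Pratyush-01/physix-3b-rl",  # or "Qwen/Qwen2.5-3B-Instruct"
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.choices[0].message.content)
```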