---
title: PhysiX-Infer
emoji: ⚡
colorFrom: yellow
colorTo: red
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: Dual-model inference (Qwen 2.5 3B + physix-3b-rl)
suggested_hardware: l4x1
sleep_time: 300
tags:
  - inference
  - vllm
  - qwen2
  - physix
---
# PhysiX-Infer — dual-model inference Space
OpenAI-compatible inference for the two 3B Qwen2 checkpoints used by the PhysiX-Live demo:
| Model id (use as `model` field) | Role |
|---|---|
| `Qwen/Qwen2.5-3B-Instruct` | Untrained baseline |
| `Pratyush-01/physix-3b-rl` | GRPO-trained variant |
## Why this Space exists
The HF Inference Router does not currently serve Qwen/Qwen2.5-3B-Instruct (no provider has it loaded), and won't serve the fine-tune unless its owner runs a paid Inference Endpoint. Both checkpoints are small enough to share a single L4 (24 GB) — ~6.2 GB each in fp16, plus KV cache — so we just run two vllm serve processes side by side and dispatch on the model field.
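The memory budget works out with a back-of-envelope check (a sketch; 3.1B parameters is an assumed count for a "3B" checkpoint, and fp16 is taken as 2 bytes per parameter):

```python
# Rough VRAM accounting for one 3B model on a 24 GB L4.
params = 3.1e9                        # assumed parameter count for a "3B" checkpoint
weights_gb = params * 2 / 1e9         # fp16 weights: ~6.2 GB
budget_gb = 24 * 0.40                 # 9.6 GB per process at --gpu-memory-utilization 0.40
kv_cache_gb = budget_gb - weights_gb  # ~3.4 GB of headroom for KV cache per process

print(f"{weights_gb:.1f} GB weights, {kv_cache_gb:.1f} GB headroom per process")
```

Two such processes stay under 24 GB with room to spare, which is why a single L4 suffices.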
## Architecture
```
┌────────────────── Space (L4, 24 GB) ──────────────────┐
│                                                       │
│  :8001  vllm serve Qwen/Qwen2.5-3B-Instruct           │
│  :8002  vllm serve Pratyush-01/physix-3b-rl           │
│                                                       │
│  :7860  proxy.py (FastAPI)                            │
│         routes by JSON `model` field                  │
└───────────────────────────────────────────────────────┘
```
Each vLLM gets `--gpu-memory-utilization 0.40` and `--max-model-len 4096`, and they're booted sequentially (Qwen first, then PhysiX) so the second process correctly observes the post-first-process free VRAM — booting in parallel caused a "No available memory for the cache blocks" crash on the first deploy attempt. The proxy is ~150 lines of FastAPI + httpx; streaming bytes are forwarded verbatim so SSE framing survives.
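The dispatch rule itself is tiny. A minimal sketch of it (the names and structure here are assumptions, not the actual proxy.py):

```python
# Hypothetical helper: map the request's `model` field to an upstream vLLM.
UPSTREAMS = {
    "Qwen/Qwen2.5-3B-Instruct": "http://127.0.0.1:8001",
    "Pratyush-01/physix-3b-rl": "http://127.0.0.1:8002",
}

def route(model_id: str) -> str:
    """Return the upstream base URL for a model id; unknown ids are a client error."""
    try:
        return UPSTREAMS[model_id]
    except KeyError:
        raise ValueError(f"unknown model id: {model_id!r}")
```

The proxy would then forward the request body unchanged to `route(body["model"])` and stream the response bytes back.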
## Sleep behavior
`sleep_time: 300` in the frontmatter — the Space pauses after 5 minutes idle and stops billing immediately. The first request after a sleep cold-boots both vLLMs, which takes ~90-120 s on a warm Hub cache. The proxy's `/health` returns 503 while either upstream is still booting; the demo's frontend uses that to render a "warming up" badge.
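The sequential boot and the frontend's warming-up badge both reduce to the same poll-until-healthy loop. A sketch with an injected check callable (an assumed helper, not the Space's actual code, written so it needs no running server):

```python
import time

def wait_healthy(check, timeout_s=180.0, interval_s=2.0):
    """Poll check() until it returns truthy or the timeout expires.

    In the Space, `check` would GET /health and return True on HTTP 200;
    here it is any callable, so the loop can be exercised standalone.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False
```

With a ~90-120 s cold boot, a 180 s timeout leaves some margin before a client gives up.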
## Endpoints
| Method | Path | Notes |
|---|---|---|
| POST | `/v1/chat/completions` | OpenAI spec; `model` field selects upstream |
| POST | `/v1/completions` | same routing, kept for older clients |
| GET | `/v1/models` | lists both ids |
| GET | `/health` | 200 iff both vLLMs healthy |
| GET | `/` | plain HTML landing page |
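For reference, `/v1/models` follows the OpenAI list shape, so a response would look roughly like this (the exact field set is an assumption; vLLM adds extra metadata):

```python
# Assumed shape of the GET /v1/models response body.
models_response = {
    "object": "list",
    "data": [
        {"id": "Qwen/Qwen2.5-3B-Instruct", "object": "model"},
        {"id": "Pratyush-01/physix-3b-rl", "object": "model"},
    ],
}
```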
## Auth
None. The Space is open access, bounded by the 5-min sleep window — anyone can hit it, but they can't run it for free past one idle cycle.
## Local smoke test
You need a CUDA GPU with 16+ GB free.
```shell
docker build -t physix-infer .
docker run --rm --gpus all -p 7860:7860 physix-infer
# wait ~90s, then:
curl -sS http://localhost:7860/health
curl -sS -X POST http://localhost:7860/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"Qwen/Qwen2.5-3B-Instruct","messages":[{"role":"user","content":"hi"}]}'
```
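The same request works from Python. A minimal sketch that builds the OpenAI-style body (the helper name is made up; against the deployed Space, the host would be the Space URL rather than localhost):

```python
import json

def chat_payload(model_id: str, user_msg: str) -> dict:
    # Same body the curl command sends; `model` selects the upstream vLLM.
    return {"model": model_id, "messages": [{"role": "user", "content": user_msg}]}

body = json.dumps(chat_payload("Qwen/Qwen2.5-3B-Instruct", "hi"))
# POST `body` to http://localhost:7860/v1/chat/completions
# with header content-type: application/json
```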
## Wiring into the demo
In the physix-live frontend, this Space is exposed as the PhysiX-Infer (GPU) preset. Pick it from the endpoint dropdown and choose either model id from the suggestions. No API key required.