---
title: PhysiX-Infer
emoji:
colorFrom: yellow
colorTo: red
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: Dual-model inference (Qwen 2.5 3B + physix-3b-rl)
suggested_hardware: l4x1
tags:
  - inference
  - vllm
  - qwen2
  - physix
---

# PhysiX-Infer — dual-model inference Space

OpenAI-compatible inference for the two 3B Qwen2 checkpoints used by the PhysiX-Live demo:

| Model id (use as `model` field) | Role |
|---|---|
| `Qwen/Qwen2.5-3B-Instruct` | Untrained baseline |
| `Pratyush-01/physix-3b-rl` | GRPO-trained variant |

## Why this Space exists

The HF Inference Router does not currently serve `Qwen/Qwen2.5-3B-Instruct` (no provider has it loaded), and won't serve the fine-tune unless its owner runs a paid Inference Endpoint. Both checkpoints are small enough to share a single L4 (24 GB) — ~6.2 GB each in fp16, plus KV cache — so we just run two `vllm serve` processes side by side and dispatch on the `model` field.

## Architecture

```
┌────────────────── Space (L4, 24 GB) ──────────────────┐
│                                                       │
│  :8001  vllm serve Qwen/Qwen2.5-3B-Instruct           │
│  :8002  vllm serve Pratyush-01/physix-3b-rl           │
│                                                       │
│  :7860  proxy.py (FastAPI)                            │
│         routes by JSON `model` field                  │
└───────────────────────────────────────────────────────┘
```

Each vLLM gets `--gpu-memory-utilization 0.40` and `--max-model-len 4096`, and they're booted sequentially (Qwen first, then PhysiX) so the second process correctly observes the post-first-process free VRAM — booting in parallel caused a "No available memory for the cache blocks" crash on the first deploy attempt. The proxy is ~150 lines of FastAPI + httpx; streaming bytes are forwarded verbatim so SSE framing survives.
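
For orientation, the dispatch described above can be sketched roughly as below. This is a minimal FastAPI + httpx illustration, not the actual `proxy.py`; the upstream ports match the diagram, while the handler name and error handling are assumptions.

```python
# Minimal sketch of the routing idea (illustrative, not the real proxy.py).
# Assumes the two vLLM servers listen on :8001 and :8002 as in the diagram.
import httpx
from fastapi import FastAPI, Request, Response
from fastapi.responses import StreamingResponse
from starlette.background import BackgroundTask

UPSTREAMS = {
    "Qwen/Qwen2.5-3B-Instruct": "http://127.0.0.1:8001",
    "Pratyush-01/physix-3b-rl": "http://127.0.0.1:8002",
}

app = FastAPI()
client = httpx.AsyncClient(timeout=None)

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    base = UPSTREAMS.get(body.get("model", ""))
    if base is None:
        return Response('{"error": "unknown model"}', status_code=400,
                        media_type="application/json")
    req = client.build_request("POST", f"{base}/v1/chat/completions", json=body)
    resp = await client.send(req, stream=True)
    # Forward bytes verbatim so SSE framing survives when "stream": true.
    return StreamingResponse(resp.aiter_raw(),
                             status_code=resp.status_code,
                             media_type=resp.headers.get("content-type"),
                             background=BackgroundTask(resp.aclose))
```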

## Sleep behavior

`sleep_time: 300` in the frontmatter — the Space pauses after 5 minutes idle and stops billing immediately. First request after a sleep cold-boots both vLLMs, which takes ~90-120 s on a warm Hub cache. The proxy's `/health` returns 503 while either upstream is still booting; the demo's frontend uses that to render a "warming up" badge.
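
A caller that wants to mask the cold boot can poll `/health` before sending traffic. A minimal sketch, assuming `BASE_URL` is the Space URL (placeholder below):

```python
# Hypothetical warm-up helper: poll /health until both vLLMs report ready.
import time
import httpx

BASE_URL = "https://<your-space-host>"  # placeholder, not the real URL

def wait_until_warm(timeout_s: float = 180.0, interval_s: float = 5.0) -> bool:
    """Return True once /health answers 200, False if the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if httpx.get(f"{BASE_URL}/health", timeout=10).status_code == 200:
                return True   # both upstreams healthy
        except httpx.HTTPError:
            pass              # Space may still be waking from sleep
        time.sleep(interval_s)  # /health returns 503 while either vLLM boots
    return False
```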

## Endpoints

| Method | Path | Notes |
|---|---|---|
| POST | `/v1/chat/completions` | OpenAI spec; `model` field selects upstream |
| POST | `/v1/completions` | Same routing, kept for older clients |
| GET | `/v1/models` | Lists both ids |
| GET | `/health` | 200 iff both vLLMs healthy |
| GET | `/` | Plain HTML landing page |
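
Because the proxy speaks the OpenAI spec, the official `openai` Python client works as-is: point `base_url` at the Space (the URL below is a placeholder) and pass any non-empty `api_key`, since the Space is unauthenticated.

```python
# Example client usage; base_url is a placeholder for the actual Space URL.
from openai import OpenAI

client = OpenAI(base_url="https://<your-space-host>/v1", api_key="not-needed")

print([m.id for m in client.models.list()])  # should list both model ids

reply = client.chat.completions.create(
    model="Pratyush-01/physix-3b-rl",
    messages=[{"role": "user", "content": "State Newton's second law."}],
)
print(reply.choices[0].message.content)
```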

## Auth

None. The Space is open access; cost exposure is bounded by the 5-minute sleep window — anyone can hit it, but they can't keep it running for free past one idle cycle.

## Local smoke test

You need a CUDA GPU with 16+ GB free.

```bash
docker build -t physix-infer .
docker run --rm --gpus all -p 7860:7860 physix-infer
# wait ~90s, then:
curl -sS http://localhost:7860/health
curl -sS -X POST http://localhost:7860/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"Qwen/Qwen2.5-3B-Instruct","messages":[{"role":"user","content":"hi"}]}'
```

## Wiring into the demo

In the `physix-live` frontend, this Space is exposed as the **PhysiX-Infer (GPU)** preset. Select it from the endpoint dropdown and choose either model id from the suggestions. No API key is required.