---
title: PhysiX-Infer
emoji: ⚡
colorFrom: yellow
colorTo: red
sdk: docker
app_port: 7860
sleep_time: 300
pinned: false
license: apache-2.0
short_description: Dual-model inference (Qwen 2.5 3B + physix-3b-rl)
suggested_hardware: l4x1
tags:
  - inference
  - vllm
  - qwen2
  - physix
---

# PhysiX-Infer — dual-model inference Space

OpenAI-compatible inference for the two 3B Qwen2 checkpoints used by the [PhysiX-Live](https://huggingface.co/spaces/Pratyush-01/physix-live) demo:

| Model id (use as `model` field) | Role |
| --- | --- |
| `Qwen/Qwen2.5-3B-Instruct` | Untrained baseline |
| `Pratyush-01/physix-3b-rl` | GRPO-trained variant |

## Why this Space exists

The HF Inference Router does not currently serve `Qwen/Qwen2.5-3B-Instruct` (no provider has it loaded), and won't serve the fine-tune unless its owner runs a paid Inference Endpoint. Both checkpoints are small enough to share a single L4 (24 GB) — ~6.2 GB each in fp16 (≈3.1 B params × 2 bytes), plus KV cache — so we just run two `vllm serve` processes side by side and dispatch on the `model` field.

## Architecture

```
┌────────────────── Space (L4, 24 GB) ──────────────────┐
│                                                       │
│  :8001  vllm serve Qwen/Qwen2.5-3B-Instruct           │
│  :8002  vllm serve Pratyush-01/physix-3b-rl           │
│                                                       │
│  :7860  proxy.py (FastAPI)                            │
│         routes by JSON `model` field                  │
└───────────────────────────────────────────────────────┘
```

Each vLLM gets `--gpu-memory-utilization 0.40` and `--max-model-len 4096`, and they're booted **sequentially** (Qwen first, then PhysiX) so the second process sizes its KV cache against the VRAM actually left free by the first — booting both in parallel caused a "No available memory for the cache blocks" crash on the first deploy attempt. The proxy is ~150 lines of FastAPI + httpx; streaming bytes are forwarded verbatim so SSE framing survives. Both the boot sequence and the routing are sketched at the end of this README.

## Sleep behavior

`sleep_time: 300` in the frontmatter — the Space pauses after **5 minutes** idle and stops billing immediately. The first request after a sleep cold-boots both vLLMs, which takes **~90-120 s** on a warm Hub cache. The proxy's `/health` returns `503` while either upstream is still booting; the demo's frontend uses that to render a "warming up" badge.

## Endpoints

| Method | Path | Notes |
| --- | --- | --- |
| `POST` | `/v1/chat/completions` | OpenAI spec; `model` field selects upstream |
| `POST` | `/v1/completions` | same routing, kept for older clients |
| `GET` | `/v1/models` | lists both ids |
| `GET` | `/health` | 200 iff both vLLMs healthy |
| `GET` | `/` | plain HTML landing page |

## Auth

None. The Space is open access, bounded by the 5-min sleep window — anyone can hit it, but they can't run it for free past one idle cycle.

## Local smoke test

You need a CUDA GPU with 16+ GB free.

```bash
docker build -t physix-infer .
docker run --rm --gpus all -p 7860:7860 physix-infer
# wait ~90s, then:
curl -sS http://localhost:7860/health
curl -sS -X POST http://localhost:7860/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"Qwen/Qwen2.5-3B-Instruct","messages":[{"role":"user","content":"hi"}]}'
```

## Wiring into the demo

In the [physix-live](https://github.com/openenv-hackathon/physix-live) frontend, this Space is exposed as the **PhysiX-Infer (GPU)** preset. Select it from the endpoint dropdown and pick either model id from the suggestions. No API key required.
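
## Boot sequence sketch

The Space's real entrypoint isn't reproduced here; the following is a minimal sketch of the sequential boot described under **Architecture**, assuming a plain-Python launcher and vLLM's standard `GET /health` endpoint. The helper name `wait_healthy` is illustrative, not from the repo.

```python
# Sketch only: launch the first vLLM, block until its /health answers,
# then launch the second, so the second engine sizes its KV cache against
# the VRAM actually left free by the first.
import subprocess
import time

import httpx

MODELS = [
    ("Qwen/Qwen2.5-3B-Instruct", 8001),
    ("Pratyush-01/physix-3b-rl", 8002),
]

def wait_healthy(port: int, timeout: float = 600.0) -> None:
    """Poll vLLM's /health until it answers 200 or we give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if httpx.get(f"http://127.0.0.1:{port}/health").status_code == 200:
                return
        except httpx.TransportError:
            pass  # server not listening yet
        time.sleep(2)
    raise TimeoutError(f"vLLM on :{port} never became healthy")

for model, port in MODELS:
    subprocess.Popen([
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", "0.40",
        "--max-model-len", "4096",
    ])
    wait_healthy(port)  # gate the next launch on this one being fully up
```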
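
## Proxy routing sketch

`proxy.py` itself (~150 lines) isn't reproduced here either; the sketch below shows only its core idea: pick an upstream from the JSON `model` field and forward the response bytes untouched so SSE framing survives. The upstream map, handler names, and error shape are illustrative.

```python
# Minimal sketch of the routing idea in proxy.py — not the real file.
# Upstream ports follow the architecture diagram above.
import httpx
from fastapi import FastAPI, Request, Response
from fastapi.responses import JSONResponse, StreamingResponse
from starlette.background import BackgroundTask

UPSTREAMS = {
    "Qwen/Qwen2.5-3B-Instruct": "http://127.0.0.1:8001",
    "Pratyush-01/physix-3b-rl": "http://127.0.0.1:8002",
}

app = FastAPI()
client = httpx.AsyncClient(timeout=None)

@app.post("/v1/chat/completions")
@app.post("/v1/completions")
async def route(request: Request):
    body = await request.json()
    base = UPSTREAMS.get(body.get("model", ""))
    if base is None:
        return JSONResponse({"error": "unknown model"}, status_code=404)
    req = client.build_request("POST", base + request.url.path, json=body)
    resp = await client.send(req, stream=True)
    # Forward raw bytes verbatim so SSE framing survives streamed responses.
    return StreamingResponse(
        resp.aiter_raw(),
        status_code=resp.status_code,
        media_type=resp.headers.get("content-type"),
        background=BackgroundTask(resp.aclose),  # close upstream when done
    )

@app.get("/health")
async def health():
    # 200 only once both vLLMs answer /health; 503 while either is booting.
    for base in UPSTREAMS.values():
        try:
            (await client.get(base + "/health")).raise_for_status()
        except httpx.HTTPError:
            return Response(status_code=503)
    return Response(status_code=200)
```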
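
## Calling it from Python

Outside the demo frontend, any OpenAI-compatible client can talk to the proxy. A sketch with the official `openai` SDK follows; the base URL assumes the standard `<owner>-<space>.hf.space` pattern (an assumption, check your Space's actual URL), and the health poll accounts for the ~90-120 s cold boot.

```python
# Sketch: calling the Space with the openai SDK (pip install openai httpx).
import time

import httpx
from openai import OpenAI

BASE = "https://pratyush-01-physix-infer.hf.space"  # assumed URL, verify yours

# Cold boots take ~90-120 s; wait until /health says both vLLMs are up.
while httpx.get(BASE + "/health").status_code != 200:
    time.sleep(5)

client = OpenAI(base_url=BASE + "/v1", api_key="unused")  # no auth, but the SDK wants a string
resp = client.chat.completions.create(
    model="Pratyush-01/physix-3b-rl",  # or "Qwen/Qwen2.5-3B-Instruct"
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.choices[0].message.content)
```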