Pratyush-01 committed
Commit 7959cdc · verified · 1 parent: 49ae6eb

Re-create physix-infer: sequential vLLM boot, gpu_mem 0.40 each, python3 fix

Files changed (6)
  1. .dockerignore +32 -0
  2. .gitattributes +5 -35
  3. Dockerfile +89 -0
  4. README.md +87 -5
  5. entrypoint.sh +123 -0
  6. proxy.py +260 -0
.dockerignore ADDED
@@ -0,0 +1,32 @@
+ # Patterns excluded from the Docker build context.
+ #
+ # Keeps anything heavy/host-specific out of BuildKit. The image only
+ # needs proxy.py + entrypoint.sh + the README; everything else is noise
+ # or actively harmful (e.g. a host venv landing under /app would shadow
+ # the image's own python install).
+
+ # Python venv / caches.
+ .venv
+ **/__pycache__
+ **/*.pyc
+ **/*.pyo
+ .pytest_cache
+ .ruff_cache
+ .mypy_cache
+
+ # Build / packaging artefacts.
+ *.egg-info
+ build
+ dist
+
+ # Editor / OS detritus.
+ .DS_Store
+ *.swp
+ .vscode
+ .idea
+ .git
+ .github
+
+ # scripts/ holds host-side deploy helpers (configure_space.py, deploy.py).
+ # They run from your laptop, never inside the image.
+ scripts
.gitattributes CHANGED
@@ -1,35 +1,5 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ # Make sure shell scripts and the Dockerfile aren't treated as binary by HF's
+ # diff viewer (the default heuristic occasionally trips on `set -e` lines).
+ *.sh text eol=lf
+ Dockerfile text eol=lf
+ *.py text eol=lf
Dockerfile ADDED
@@ -0,0 +1,89 @@
+ # PhysiX-Infer — dual-model OpenAI-compatible inference Space.
+ #
+ # Hosts BOTH:
+ #   * Qwen/Qwen2.5-3B-Instruct  (untrained baseline)
+ #   * Pratyush-01/physix-3b-rl  (GRPO-trained variant)
+ #
+ # Why this Space exists:
+ #   The HF Inference Router does not serve Qwen/Qwen2.5-3B-Instruct (no
+ #   provider has it loaded), and won't serve a private fine-tune unless
+ #   the owner pays for an Inference Endpoint. Both checkpoints we want
+ #   to compare are 3B Qwen2 fp16 models, and on a single 24 GB L4 we can
+ #   keep two vLLM processes resident at gpu_memory_utilization 0.40 each
+ #   and never pay router/endpoint fees.
+ #
+ # Architecture (one container, three processes):
+ #   :8001  vllm serve Qwen/Qwen2.5-3B-Instruct  --gpu-memory-utilization 0.40
+ #   :8002  vllm serve Pratyush-01/physix-3b-rl  --gpu-memory-utilization 0.40
+ #   :7860  uvicorn proxy:app — routes by the JSON `model` field
+ #
+ # Boot order matters: the vLLMs come up SEQUENTIALLY, not in parallel.
+ # Both read the GPU's currently free memory at startup; if they race, the
+ # second crashes with "No available memory for the cache blocks." See
+ # entrypoint.sh for the full reasoning.
+ #
+ # Why the official vllm/vllm-openai image:
+ #   vLLM ships pre-compiled CUDA kernels that target the CUDA toolkit
+ #   and PyTorch versions it was built against. Building from a generic
+ #   nvidia/cuda image means recompiling vLLM's C++ kernels (~20 min,
+ #   often fragile across CUDA minor versions). Starting from
+ #   vllm/vllm-openai:<tag> guarantees torch / cu / nccl / vllm are all
+ #   ABI-compatible. We just layer fastapi + httpx for the proxy on top.
+ #
+ # Cold start on a fresh HF Spaces L4 (no persistent /data):
+ #   * Image pull:      ~30 s
+ #   * Weight download: ~45 s for both models from the Hub CDN
+ #   * vLLM startup:    ~30 s once weights are local
+ #   ── total ~90-120 s before /health flips green ──
+
+ FROM vllm/vllm-openai:v0.7.3
+
+ # vllm/vllm-openai sets ENTRYPOINT to `python3 -m vllm.entrypoints.openai.api_server`.
+ # We launch our own multi-process entrypoint instead, so reset it.
+ ENTRYPOINT []
+
+ ENV PYTHONUNBUFFERED=1 \
+     PIP_NO_CACHE_DIR=1 \
+     PIP_DISABLE_PIP_VERSION_CHECK=1 \
+     HOME=/tmp/home \
+     HF_HOME=/tmp/hf_cache \
+     XDG_CACHE_HOME=/tmp/xdg-cache \
+     # vLLM's torch.compile cache must land somewhere writable. The image's
+     # default ($HOME/.cache/vllm) breaks on HF Spaces because the runtime
+     # user has no writable home.
+     VLLM_CACHE_ROOT=/tmp/vllm_cache \
+     TORCH_HOME=/tmp/torch_cache \
+     TRITON_CACHE_DIR=/tmp/triton_cache \
+     PORT=7860
+
+ # fastapi/uvicorn/httpx for the routing proxy. The image already has them
+ # transitively (vllm depends on fastapi), but pin minimums to be safe.
+ # `pip install --no-deps` would be tighter but trades safety for ~5 MB.
+ RUN pip install \
+     "fastapi>=0.110" \
+     "uvicorn[standard]>=0.29" \
+     "httpx>=0.27"
+
+ WORKDIR /app
+
+ COPY proxy.py entrypoint.sh ./
+ RUN chmod +x /app/entrypoint.sh
+
+ # HF Spaces runs containers as a non-root UID with no /etc/passwd entry,
+ # so any cache path under $HOME must exist and be world-writable BEFORE
+ # the runtime user shows up. Pre-creating /tmp subdirs (which Spaces
+ # always lets us write to) is the standard workaround.
+ RUN mkdir -p \
+     "$HOME" "$HF_HOME" "$XDG_CACHE_HOME" \
+     "$VLLM_CACHE_ROOT" "$TORCH_HOME" "$TRITON_CACHE_DIR" \
+     /tmp/logs \
+     && chmod -R 0777 /tmp
+
+ EXPOSE 7860
+
+ # /health is served by proxy.py and returns 200 only once BOTH vLLMs are up.
+ # The generous start-period covers the ~120 s cold boot.
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=180s --retries=3 \
+     CMD curl -fsS "http://127.0.0.1:${PORT}/health" || exit 1
+
+ CMD ["/app/entrypoint.sh"]
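For a quick host-side check of that healthcheck path, here is a minimal poll sketch (assuming only the `httpx` package; the base URL is a placeholder for your Space hostname). It treats the proxy's 503 JSON answer as "still warming up", mirroring what the demo frontend does:

```python
# Polls the proxy's /health until both vLLM upstreams report healthy.
# BASE is a placeholder; the JSON shape matches proxy.py's /health.
import time

import httpx

BASE = "http://localhost:7860"  # or https://<this-space>.hf.space

def wait_ready(timeout_s: float = 300.0) -> dict:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            r = httpx.get(f"{BASE}/health", timeout=10.0)
            body = r.json()  # {"status": "ok"|"starting", "upstreams": {...}}
            if r.status_code == 200:
                return body
            print("warming up:", body["upstreams"])
        except httpx.HTTPError:
            print("proxy not answering yet")
        time.sleep(5)
    raise TimeoutError(f"not healthy after {timeout_s:.0f}s")

if __name__ == "__main__":
    print(wait_ready())
```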
README.md CHANGED
@@ -1,10 +1,92 @@
  ---
- title: Physix Infer
- emoji: 🐨
- colorFrom: blue
- colorTo: yellow
+ title: PhysiX-Infer
+ emoji:
+ colorFrom: yellow
+ colorTo: red
  sdk: docker
+ app_port: 7860
  pinned: false
+ license: apache-2.0
+ short_description: Dual-model inference (Qwen 2.5 3B + physix-3b-rl)
+ suggested_hardware: l4x1
+ tags:
+   - inference
+   - vllm
+   - qwen2
+   - physix
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ <!--
+ Note: `hardware:` and `sleep_time:` can NOT be set from this frontmatter.
+ Only `suggested_hardware:` can, and even that is informational (it shows up
+ on the Space card but does not auto-upgrade). After the first push, run
+ `scripts/configure_space.py` once to:
+   1. Upgrade the Space to L4 (l4x1)
+   2. Set sleep_time to 300 seconds
+ See that script's docstring for details.
+ -->
+
+ # PhysiX-Infer — dual-model inference Space
+
+ OpenAI-compatible inference for the two 3B Qwen2 checkpoints used by the [PhysiX-Live](https://huggingface.co/spaces/Pratyush-01/physix-live) demo:
+
+ | Model id (use as `model` field) | Role |
+ | --- | --- |
+ | `Qwen/Qwen2.5-3B-Instruct` | Untrained baseline |
+ | `Pratyush-01/physix-3b-rl` | GRPO-trained variant |
+
+ ## Why this Space exists
+
+ The HF Inference Router does not currently serve `Qwen/Qwen2.5-3B-Instruct` (no provider has it loaded), and won't serve the fine-tune unless its owner runs a paid Inference Endpoint. Both checkpoints are small enough to share a single L4 (24 GB) — ~6.2 GB each in fp16, plus KV cache — so we just run two `vllm serve` processes side by side and dispatch on the `model` field.
+
+ ## Architecture
+
+ ```
+ ┌────────────────── Space (L4, 24 GB) ──────────────────┐
+ │                                                       │
+ │  :8001  vllm serve Qwen/Qwen2.5-3B-Instruct           │
+ │  :8002  vllm serve Pratyush-01/physix-3b-rl           │
+ │                                                       │
+ │  :7860  proxy.py (FastAPI)                            │
+ │         routes by JSON `model` field                  │
+ └───────────────────────────────────────────────────────┘
+ ```
+
+ Each vLLM gets `--gpu-memory-utilization 0.40` and `--max-model-len 4096`, and they boot **sequentially** (Qwen first, then PhysiX) so the second process correctly observes the free VRAM left after the first — booting them in parallel crashed the first deploy attempt with "No available memory for the cache blocks". The proxy (`proxy.py`, ~260 lines) is plain FastAPI + httpx; streaming bytes are forwarded verbatim so SSE framing survives.
+
+ ## Sleep behavior
+
+ `sleep_time: 300` (set via `scripts/configure_space.py`, per the note above) — the Space pauses after **5 minutes** idle and stops billing immediately. The first request after a sleep cold-boots both vLLMs, which takes **~90-120 s** on a warm Hub cache. The proxy's `/health` returns `503` while either upstream is still booting; the demo's frontend uses that to render a "warming up" badge.
+
+ ## Endpoints
+
+ | Method | Path | Notes |
+ | --- | --- | --- |
+ | `POST` | `/v1/chat/completions` | OpenAI spec; the `model` field selects the upstream |
+ | `POST` | `/v1/completions` | same routing, kept for older clients |
+ | `GET` | `/v1/models` | lists both ids |
+ | `GET` | `/health` | 200 iff both vLLMs are healthy |
+ | `GET` | `/` | plain HTML landing page |
+
+ ## Auth
+
+ None. The Space is open access, bounded by the 5-minute sleep window — anyone can hit it, but nobody can keep it running for free past one idle cycle.
+
+ ## Local smoke test
+
+ You need a CUDA GPU with ~24 GB free (the 0.40 fractions assume an L4; override `QWEN_GPU_FRAC` / `PHYSIX_GPU_FRAC` for other cards).
+
+ ```bash
+ docker build -t physix-infer .
+ docker run --rm --gpus all -p 7860:7860 physix-infer
+ # wait ~90 s, then:
+ curl -sS http://localhost:7860/health
+ curl -sS -X POST http://localhost:7860/v1/chat/completions \
+   -H 'content-type: application/json' \
+   -d '{"model":"Qwen/Qwen2.5-3B-Instruct","messages":[{"role":"user","content":"hi"}]}'
+ ```
+
+ ## Wiring into the demo
+
+ In the [physix-live](https://github.com/openenv-hackathon/physix-live) frontend, this Space is exposed as the **PhysiX-Infer (GPU)** preset. Pick it from the endpoint dropdown, then pick either model id from the suggestions. No API key required.
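As a complement to the curl smoke test, a minimal client sketch through the standard `openai` Python SDK (assuming the v1 SDK and a placeholder Space URL; the `api_key` value is arbitrary because the proxy does no auth):

```python
# Hits both checkpoints through the proxy. Assumes `pip install openai`
# (v1 SDK); the base_url host is a placeholder for your Space.
from openai import OpenAI

client = OpenAI(
    base_url="https://<this-space>.hf.space/v1",
    api_key="unused",  # proxy ignores auth, but the SDK requires a value
)

for model in ("Qwen/Qwen2.5-3B-Instruct", "Pratyush-01/physix-3b-rl"):
    resp = client.chat.completions.create(
        model=model,  # the proxy routes on this field
        messages=[{"role": "user", "content": "hi"}],
        max_tokens=32,
    )
    print(model, "->", resp.choices[0].message.content)
```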
entrypoint.sh ADDED
@@ -0,0 +1,123 @@
+ #!/usr/bin/env bash
+ # Boot two vLLM processes + the FastAPI proxy, all in one container.
+ #
+ # Lifecycle:
+ #   1. Launch vLLM(qwen) on :8001, wait until /health returns 200.
+ #   2. THEN launch vLLM(physix) on :8002, wait until /health returns 200.
+ #      (Sequential, not parallel — see below.)
+ #   3. Exec the uvicorn proxy in the foreground as PID 1 — when HF Spaces
+ #      sends SIGTERM at sleep time, uvicorn exits cleanly and the vLLM
+ #      children go down with the container.
+ #
+ # Why sequential, not parallel:
+ #   The first deploy attempt booted both vLLMs in parallel and the second
+ #   one died with `ValueError: No available memory for the cache blocks.
+ #   Try increasing gpu_memory_utilization`. Cause: vLLM reads the GPU's
+ #   *currently free* memory at startup and then reserves
+ #   `--gpu-memory-utilization * (free at this moment)`. When two processes
+ #   start simultaneously, both read "all 24 GB free" and both try to grab
+ #   ~10 GB; whichever one finalises last loses, because by then there's
+ #   only ~10-12 GB actually free.
+ #
+ #   Sequential boot makes the second vLLM observe the post-first-process
+ #   free memory, so its allocation is sized correctly.
+ #
+ # Why --gpu-memory-utilization 0.40 each (= 80% total reserved):
+ #   On an L4 (24 GB), 0.40 * 24 ≈ 9.6 GB per process. Qwen2.5-3B fp16
+ #   weights are ~6.2 GB; that leaves ~3.4 GB for KV cache + activations,
+ #   which sustains max_model_len=4096 with comfortable margin. The 20%
+ #   reserve covers the CUDA workspace + Python/uvicorn heap + the second
+ #   vLLM's own ~600 MB CUDA context overhead. We deliberately do NOT push
+ #   to 0.45 each — the previous deploy proved the residual headroom isn't
+ #   there once both contexts coexist.

+ set -euo pipefail
+
+ QWEN_MODEL="${QWEN_MODEL:-Qwen/Qwen2.5-3B-Instruct}"
+ PHYSIX_MODEL="${PHYSIX_MODEL:-Pratyush-01/physix-3b-rl}"
+
+ QWEN_GPU_FRAC="${QWEN_GPU_FRAC:-0.40}"
+ PHYSIX_GPU_FRAC="${PHYSIX_GPU_FRAC:-0.40}"
+
+ # 4096 is enough for the PhysiX prompt (~1500 tok) + completion (~512 tok)
+ # with comfortable headroom, and tightening it materially shrinks the KV
+ # cache footprint vs vLLM's default of model.max_position_embeddings
+ # (32k for Qwen2.5).
+ MAX_LEN="${MAX_LEN:-4096}"
+
+ LOG_DIR=/tmp/logs
+ mkdir -p "$LOG_DIR"
+
+ # Track child PIDs so the signal trap can terminate them all on
+ # SIGTERM/SIGINT during boot. HF Spaces sends SIGTERM when pausing the Space.
+ PIDS=()
+ cleanup() {
+     echo "[entrypoint] SIGTERM/SIGINT — killing children: ${PIDS[*]:-}" >&2
+     for pid in "${PIDS[@]:-}"; do
+         kill -TERM "$pid" 2>/dev/null || true
+     done
+     wait || true
+     exit 0
+ }
+ trap cleanup TERM INT
+
+ wait_healthy() {
+     local name="$1" port="$2" pid="$3" budget="${4:-480}"
+     local deadline=$((SECONDS + budget))
+     while (( SECONDS < deadline )); do
+         # If the child died, surface its log and bail out — silently
+         # waiting forever for a corpse is the worst failure mode.
+         if ! kill -0 "$pid" 2>/dev/null; then
+             echo "[entrypoint] FATAL: $name (pid $pid) died during boot. Tail of log:" >&2
+             tail -n 80 "$LOG_DIR/vllm-${name}.log" >&2 || true
+             return 1
+         fi
+         if curl -fsS "http://127.0.0.1:${port}/health" >/dev/null 2>&1; then
+             echo "[entrypoint] $name healthy on :$port (after ${SECONDS}s)"
+             return 0
+         fi
+         sleep 5
+     done
+     echo "[entrypoint] FATAL: $name failed to become healthy in ${budget}s" >&2
+     tail -n 80 "$LOG_DIR/vllm-${name}.log" >&2 || true
+     return 1
+ }
+
+ echo "[entrypoint] step 1/3 — booting vLLM(qwen) = $QWEN_MODEL on :8001 (gpu=${QWEN_GPU_FRAC})"
+ # The vllm/vllm-openai base image ships only `python3` (no `python`
+ # symlink), so invoke python3 explicitly. Using `python -m vllm...` here
+ # cost us a full failed deploy on the first try.
+ python3 -m vllm.entrypoints.openai.api_server \
+     --model "$QWEN_MODEL" \
+     --served-model-name "$QWEN_MODEL" \
+     --host 0.0.0.0 --port 8001 \
+     --gpu-memory-utilization "$QWEN_GPU_FRAC" \
+     --max-model-len "$MAX_LEN" \
+     --dtype auto \
+     --disable-log-requests \
+     > "$LOG_DIR/vllm-qwen.log" 2>&1 &
+ QWEN_PID=$!
+ PIDS+=("$QWEN_PID")
+ wait_healthy qwen 8001 "$QWEN_PID"
+
+ echo "[entrypoint] step 2/3 — booting vLLM(physix) = $PHYSIX_MODEL on :8002 (gpu=${PHYSIX_GPU_FRAC})"
+ python3 -m vllm.entrypoints.openai.api_server \
+     --model "$PHYSIX_MODEL" \
+     --served-model-name "$PHYSIX_MODEL" \
+     --host 0.0.0.0 --port 8002 \
+     --gpu-memory-utilization "$PHYSIX_GPU_FRAC" \
+     --max-model-len "$MAX_LEN" \
+     --dtype auto \
+     --disable-log-requests \
+     > "$LOG_DIR/vllm-physix.log" 2>&1 &
+ PHYSIX_PID=$!
+ PIDS+=("$PHYSIX_PID")
+ wait_healthy physix 8002 "$PHYSIX_PID"
+
+ echo "[entrypoint] step 3/3 — both vLLMs healthy; starting proxy on :${PORT}"
+ # `exec` replaces this shell with uvicorn, so HF Spaces' SIGTERM goes
+ # straight to uvicorn and the container teardown reaps the vLLM children.
+ # (The trap above only protects the boot sequence before this point.)
+ exec python3 -m uvicorn proxy:app \
+     --host 0.0.0.0 --port "${PORT}" \
+     --log-level info
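The memory budget described in the header comment reduces to simple arithmetic. A sketch using only the figures quoted there (24 GB L4, ~6.2 GB fp16 weights), not live measurements:

```python
# Budget check for --gpu-memory-utilization 0.40 on a 24 GB L4, using
# the figures quoted in entrypoint.sh's comments. Pure arithmetic.
GPU_GB = 24.0
FRAC = 0.40
WEIGHTS_GB = 6.2  # Qwen2.5-3B fp16 weights, per the comment above

per_process = FRAC * GPU_GB                    # 9.6 GB reserved by each vLLM
kv_budget = per_process - WEIGHTS_GB           # ~3.4 GB for KV cache + activations
reserved_total = 2 * per_process               # 19.2 GB once both are resident
headroom = GPU_GB - reserved_total             # 4.8 GB (20%) left for CUDA
                                               # contexts, workspace, Python heap

print(f"per process: {per_process:.1f} GB, KV budget: {kv_budget:.1f} GB")
print(f"reserved: {reserved_total:.1f} GB, headroom: {headroom:.1f} GB")
```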
proxy.py ADDED
@@ -0,0 +1,260 @@
+ """OpenAI-compatible proxy that fans out to two local vLLM servers.
+
+ Why a custom proxy and not nginx:
+   * nginx routing on a JSON body field requires lua-nginx or njs (build
+     pain on a CUDA base image), and we need to PEEK at the body without
+     consuming it for streaming requests.
+   * httpx async streaming + FastAPI is ~80 LoC, debuggable in plain Python,
+     and reuses the same connection pool across requests.
+
+ Endpoints exposed on :7860 (matching the OpenAI spec):
+   * GET  /v1/models            — lists both registered model ids
+   * GET  /v1/models/{model_id} — single model lookup
+   * POST /v1/chat/completions  — main route. Reads `model` from the body
+                                  and forwards to whichever vLLM owns it.
+   * POST /v1/completions       — same routing, kept for old clients.
+   * GET  /health               — 200 iff both upstreams are healthy.
+                                  HF's container monitor uses the Docker
+                                  HEALTHCHECK from the Dockerfile, but
+                                  we expose this for the demo's frontend
+                                  so it can show a "warming up..." badge
+                                  during cold starts.
+   * GET  /                     — friendly landing page so the bare
+                                  Space URL doesn't 404.
+
+ Streaming is forwarded byte-for-byte (StreamingResponse over the upstream's
+ chunks) so SSE `data: {...}\n\n` framing survives intact.
+ """
+
+ from __future__ import annotations
+
+ import json
+ import logging
+ import os
+ from contextlib import asynccontextmanager
+ from typing import AsyncIterator
+
+ import httpx
+ from fastapi import FastAPI, HTTPException, Request
+ from fastapi.responses import HTMLResponse, JSONResponse, StreamingResponse
+
+ logger = logging.getLogger("physix-infer-proxy")
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s %(message)s")
+
+ QWEN_MODEL = os.environ.get("QWEN_MODEL", "Qwen/Qwen2.5-3B-Instruct")
+ PHYSIX_MODEL = os.environ.get("PHYSIX_MODEL", "Pratyush-01/physix-3b-rl")
+
+ QWEN_UPSTREAM = "http://127.0.0.1:8001"
+ PHYSIX_UPSTREAM = "http://127.0.0.1:8002"
+
+ ROUTING: dict[str, str] = {
+     QWEN_MODEL: QWEN_UPSTREAM,
+     PHYSIX_MODEL: PHYSIX_UPSTREAM,
+ }
+
+ # Generous timeout — the first request after a cold start can sit on the
+ # upstream for ~30 s while CUDA graphs warm up. Streaming tokens come
+ # back fast once that's done.
+ TIMEOUT = httpx.Timeout(connect=10.0, read=600.0, write=60.0, pool=5.0)
+
+
+ @asynccontextmanager
+ async def lifespan(_app: FastAPI):
+     """Open one shared httpx client for the proxy's lifetime.
+
+     Keep-alive across requests matters: every chat completion otherwise
+     pays a TCP+HTTP/1.1 handshake (~1-2 ms on localhost, but it adds up
+     under autoplay loops that fire 8 turns/episode).
+     """
+     async with httpx.AsyncClient(timeout=TIMEOUT) as client:
+         _app.state.http = client
+         yield
+
+
+ app = FastAPI(
+     title="PhysiX-Infer",
+     description="Dual-model OpenAI-compatible inference (Qwen 2.5 3B + physix-3b-rl).",
+     lifespan=lifespan,
+ )
+
+
+ def _resolve_upstream(model: str | None) -> str:
+     if not model:
+         raise HTTPException(
+             status_code=400,
+             detail="Missing 'model' field. Pass either "
+             f"'{QWEN_MODEL}' or '{PHYSIX_MODEL}'.",
+         )
+     upstream = ROUTING.get(model)
+     if upstream is None:
+         raise HTTPException(
+             status_code=400,
+             detail=(
+                 f"Model '{model}' is not served by this Space. "
+                 f"Available: {list(ROUTING.keys())}."
+             ),
+         )
+     return upstream
+
+
+ async def _proxy_json(request: Request, path: str) -> JSONResponse | StreamingResponse:
+     """Read the body, route on `model`, forward, stream back if `stream=true`."""
+
+     raw = await request.body()
+     try:
+         payload = json.loads(raw) if raw else {}
+     except json.JSONDecodeError as exc:
+         raise HTTPException(status_code=400, detail=f"Invalid JSON: {exc}") from exc
+
+     upstream = _resolve_upstream(payload.get("model"))
+     is_stream = bool(payload.get("stream"))
+
+     # Forward only an allowlist of headers; anything hop-by-hop is dropped.
+     fwd_headers = {
+         k: v
+         for k, v in request.headers.items()
+         if k.lower() in {"content-type", "accept", "authorization", "x-request-id"}
+     }
+     fwd_headers.setdefault("content-type", "application/json")
+
+     client: httpx.AsyncClient = request.app.state.http
+     upstream_url = f"{upstream}{path}"
+
+     if not is_stream:
+         try:
+             resp = await client.post(upstream_url, content=raw, headers=fwd_headers)
+         except httpx.HTTPError as exc:
+             logger.exception("upstream %s failed", upstream_url)
+             raise HTTPException(status_code=502, detail=f"Upstream error: {exc}") from exc
+         # vLLM answers non-streaming requests with application/json; the
+         # streaming case was routed above, so anything non-JSON here is
+         # wrapped so the client still gets JSON back.
+         return JSONResponse(
+             status_code=resp.status_code,
+             content=resp.json()
+             if resp.headers.get("content-type", "").startswith("application/json")
+             else {"raw": resp.text},
+         )
+
+     # Streaming path: open the upstream as a streaming request and pump
+     # chunks straight to the client. Note the `async with` lives INSIDE
+     # the generator so it stays open until StreamingResponse is done.
+     async def _gen() -> AsyncIterator[bytes]:
+         try:
+             async with client.stream(
+                 "POST", upstream_url, content=raw, headers=fwd_headers
+             ) as upstream_resp:
+                 if upstream_resp.status_code >= 400:
+                     body = await upstream_resp.aread()
+                     yield body
+                     return
+                 async for chunk in upstream_resp.aiter_raw():
+                     if chunk:
+                         yield chunk
+         except httpx.HTTPError as exc:
+             logger.exception("upstream stream %s failed", upstream_url)
+             err = json.dumps({"error": {"message": str(exc), "type": "upstream_error"}})
+             yield f"data: {err}\n\n".encode()
+
+     return StreamingResponse(_gen(), media_type="text/event-stream")
+
+
+ @app.post("/v1/chat/completions")
+ async def chat_completions(request: Request):
+     return await _proxy_json(request, "/v1/chat/completions")
+
+
+ @app.post("/v1/completions")
+ async def completions(request: Request):
+     return await _proxy_json(request, "/v1/completions")
+
+
+ @app.get("/v1/models")
+ async def list_models():
+     """Static listing — vLLM exposes the same shape per upstream, but we
+     union them here so a single GET covers both. `created` and `owned_by`
+     are filled with placeholders since neither field is load-bearing for
+     any client we know of."""
+     return {
+         "object": "list",
+         "data": [
+             {
+                 "id": QWEN_MODEL,
+                 "object": "model",
+                 "created": 0,
+                 "owned_by": "Qwen",
+             },
+             {
+                 "id": PHYSIX_MODEL,
+                 "object": "model",
+                 "created": 0,
+                 "owned_by": "Pratyush-01",
+             },
+         ],
+     }
+
+
+ @app.get("/v1/models/{model_id:path}")
+ async def get_model(model_id: str):
+     if model_id not in ROUTING:
+         raise HTTPException(status_code=404, detail=f"Model '{model_id}' not found.")
+     owner = "Qwen" if model_id == QWEN_MODEL else "Pratyush-01"
+     return {"id": model_id, "object": "model", "created": 0, "owned_by": owner}
+
+
+ @app.get("/health")
+ async def health(request: Request):
+     """Both upstreams must answer /health — the demo frontend uses this
+     to decide whether to show a 'warming up' notice on cold start."""
+     client: httpx.AsyncClient = request.app.state.http
+     statuses = {}
+     overall_ok = True
+     for name, base in (("qwen", QWEN_UPSTREAM), ("physix", PHYSIX_UPSTREAM)):
+         try:
+             r = await client.get(f"{base}/health", timeout=5.0)
+             statuses[name] = "ok" if r.status_code == 200 else f"status={r.status_code}"
+             overall_ok = overall_ok and r.status_code == 200
+         except httpx.HTTPError as exc:
+             statuses[name] = f"unreachable: {exc.__class__.__name__}"
+             overall_ok = False
+     return JSONResponse(
+         status_code=200 if overall_ok else 503,
+         content={"status": "ok" if overall_ok else "starting", "upstreams": statuses},
+     )
+
+
+ @app.get("/", response_class=HTMLResponse)
+ async def root():
+     """Landing page so the bare Space URL doesn't 404. Plain HTML — no
+     framework, no static dir to manage."""
+     return f"""<!doctype html>
+ <html><head><meta charset="utf-8"><title>PhysiX-Infer</title>
+ <style>
+ body{{font-family:system-ui,sans-serif;max-width:680px;margin:3em auto;padding:0 1em;color:#222}}
+ code,pre{{background:#f4f4f4;padding:.2em .4em;border-radius:4px;font-size:.95em}}
+ pre{{padding:1em;overflow-x:auto}}
+ h1{{margin-bottom:.2em}}
+ .muted{{color:#777}}
+ </style>
+ </head><body>
+ <h1>PhysiX-Infer</h1>
+ <p class="muted">OpenAI-compatible inference proxy for two 3B Qwen2 checkpoints.</p>
+
+ <h3>Models served</h3>
+ <ul>
+ <li><code>{QWEN_MODEL}</code> — untrained baseline</li>
+ <li><code>{PHYSIX_MODEL}</code> — GRPO-trained variant</li>
+ </ul>
+
+ <h3>Endpoints</h3>
+ <ul>
+ <li><code>GET /v1/models</code></li>
+ <li><code>POST /v1/chat/completions</code> (set <code>model</code> to one of the ids above)</li>
+ <li><code>GET /health</code></li>
+ </ul>
+
+ <h3>Example</h3>
+ <pre>curl -X POST https://&lt;this-space&gt;.hf.space/v1/chat/completions \\
+   -H 'content-type: application/json' \\
+   -d '{{"model":"{PHYSIX_MODEL}","messages":[{{"role":"user","content":"hi"}}]}}'</pre>
+
+ <p class="muted">No auth, but the Space sleeps after a short idle window — first request after sleep takes ~90 s while both vLLMs warm up.</p>
+ </body></html>"""
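To see the verbatim SSE forwarding end to end, a minimal streaming client sketch (assuming `httpx`, a placeholder URL, and the OpenAI-style `data:` frames that vLLM emits):

```python
# Streams tokens from the proxy. SSE frames arrive exactly as vLLM
# emitted them (the proxy forwards bytes verbatim), so standard
# OpenAI-style "data: {...}" parsing applies. URL is a placeholder.
import json

import httpx

URL = "http://localhost:7860/v1/chat/completions"
payload = {
    "model": "Pratyush-01/physix-3b-rl",
    "messages": [{"role": "user", "content": "hi"}],
    "stream": True,
}

with httpx.stream("POST", URL, json=payload, timeout=600.0) as resp:
    for line in resp.iter_lines():
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":  # end-of-stream sentinel
            break
        delta = json.loads(data)["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
print()
```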