# HF Router verification before shipping the Space

The judges will overwhelmingly run the demo through the **Hugging Face Router** path. Before deploying any change that touches the connection panel, the provider abstraction, or the server's `/interactive/*` routes, walk this checklist.

## A. Automated (no token required)

```bash
cd physix-live
source .venv/bin/activate   # or however you activate
python -m pytest tests/test_providers.py tests/test_providers_hf.py -v
```

Expected: **20 passed**. These pin:

- The HF Router base URL (`https://router.huggingface.co/v1`).
- That the OpenAI SDK is constructed with the visitor's `api_key`, the canonical `base_url`, and a `User-Agent` header.
- That `HF_TOKEN` env-var fallback works when the panel field is empty, but only for HF URLs (no leakage to third-party providers).
- That a provider rejecting `response_format={"type":"json_object"}` with a 400 transparently retries without it.
- That `401`, `404`, connection errors, and timeouts each surface a hint that points at the right remediation step.
- That a full episode through `/interactive/sessions/{id}/llm-step` passes the visitor's `base_url + model + api_key` byte-for-byte to the OpenAI SDK call.

## B. Real-network probe (HF_TOKEN required)

```bash
# Terminal 1 — the demo backend.
python -m physix.server.app --host 127.0.0.1 --port 8000

# Terminal 2 — the verifier.
export HF_TOKEN=hf_...   # needs 'Make calls to Inference Providers' scope.
python scripts/verify_hf_router.py
```

What to look for:

1. **Step 1: HF_TOKEN check** must end with `✓ HF_TOKEN is valid and has Inference Providers scope.` If it fails, fix the token's scope before going further.
2. **Step 2: model probes.** Each of the four HF models we suggest in the connection panel gets a 4-token completion attempt. Any model reported as `NOT SERVED (404)` will appear broken in the demo unless you fix it on the model card (see "If the trained model 404s" below). A one-off manual probe you can run outside the verifier is sketched after this list.
3. **Step 3: live episode.** The verifier drives one real PhysiX episode. Per-turn output should look like:

   ```
   turn 1: match=0.42 format=1.00 total=0.46 (3.2s)  equation: 'd2y/dt2 = -9.81'
   turn 2: match=0.81 format=1.00 total=0.69 (2.8s)  equation: 'd2y/dt2 = -9.81 + 0.04 * vy**2'
   ```

   `format=1.00` on every turn means the prompt + parser + verifier pipeline is healthy. `format=0.00` means the model returned something unparseable — usually a hint that `response_format={"type":"json_object"}` was silently ignored *and* the model wasn't SFT-warmed. Acceptable for raw-Qwen baselines, not for the trained PhysiX model.
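If a probe result looks wrong and you want to reproduce it by hand, a single completion through the router is enough. This is a minimal sketch, not what `verify_hf_router.py` does internally: it assumes only the standard OpenAI Python SDK, the base URL pinned in section A, and `HF_TOKEN` in the environment; the "ping" prompt and the error handling are illustrative.

```python
import os

from openai import OpenAI, AuthenticationError, NotFoundError

client = OpenAI(
    base_url="https://router.huggingface.co/v1",   # canonical router URL from section A
    api_key=os.environ["HF_TOKEN"],
)

model = "Pratyush-01/physix-3b-rl"   # or any id from the connection-panel suggestions
try:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=4,                 # same size as the verifier's per-model probe
    )
    print(f"OK          {model}: {resp.choices[0].message.content!r}")
except NotFoundError:
    # Mirrors the verifier's NOT SERVED (404) case: no provider has the weights loaded.
    print(f"NOT SERVED  {model} (see 'If the trained model 404s' below)")
except AuthenticationError:
    # Mirrors the 401 case: token lacks the 'Make calls to Inference Providers' scope.
    print("401         fix the token scope before re-running")
```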
## C. Browser walkthrough

With both backend and frontend running (`make dev`):

1. **Panel renders four endpoints.** Open the page; both the A and B panels show the same four-option Endpoint dropdown: `Ollama (localhost:11434)` · `Hugging Face Router` · `OpenAI` · `Custom`. The default is `Hugging Face Router` on both sides.
2. **Model field adapts per endpoint.**
   - With **Hugging Face Router** selected, the Model field is a text input. Click in it: a datalist of four suggestions appears (PhysiX RL, PhysiX SFT, untuned Qwen 3B, larger Qwen 7B).
   - Switch to **Ollama**: the Model field becomes a hard select. If `ollama serve` is running, it lists installed tags; if not, it shows an amber-bordered fallback input with the canonical `qwen2.5:3b-instruct` placeholder.
   - Switch to **OpenAI**: text input + datalist of `gpt-4o-mini`, `gpt-4o`, `gpt-4.1-mini`.
   - Switch to **Custom**: text input with no suggestions, plus a new **Custom base URL** field appears.
3. **API key persistence.** Type a key into side A's panel, refresh the page — the key reappears (per-`base_url` `localStorage` key). Switch the endpoint to OpenAI: the key is *not* the HF one (each base URL has its own slot). Switch back to HF Router: the HF key reappears.
4. **One real run end-to-end.** Paste your HF token, leave A pointed at `Pratyush-01/physix-3b-rl` and B at `Qwen/Qwen2.5-3B-Instruct`, pick "Free Fall" from the system dropdown, hit `▶ Run side-by-side`. Expected within ~30s:
   - Both panels show a trajectory plot with predicted overlays.
   - Per-side reward strip ticks turn-by-turn.
   - Once both finish, the "Scoreboard" banner at the bottom shows `Winner: A` (the trained model should beat raw Qwen on Free Fall after a few turns; if it doesn't, that's a real signal worth investigating — not a UI bug).
5. **401 surfaces correctly.** Clear the API key on side A and re-run. The A column shows an amber error row containing "'Make calls to Inference Providers' fine-grained permission". Side B (with its own key still set) keeps running normally — the two sides are independent.
6. **404 surfaces correctly.** Type a clearly bogus model id like `does-not-exist/foo` and run. The error row points at the model card's "Deploy → Inference API" panel.

## If the trained model 404s

`Pratyush-01/physix-3b-rl` is public, but **HF Inference Providers only serves models that at least one provider has loaded**. If `verify_hf_router.py` reports it as `NOT SERVED`, fix it before shipping:

1. Open the `Pratyush-01/physix-3b-rl` model card on Hugging Face.
2. Click **Deploy → Inference Providers**. Pick at least one provider that supports custom Qwen2.5-3B fine-tunes — Featherless and Together are the most reliable for this model size.
3. Wait ~5 minutes for the provider to warm up the weights.
4. Re-run `python scripts/verify_hf_router.py` — the `NOT SERVED` line should now be `OK`.

If no provider will load it, the fall-back demo story is still strong: keep side A on the SFT checkpoint (`Pratyush-01/physix-3b-sft-merged`) or the untuned Qwen, and explain in the writeup that the trained checkpoint runs locally via Ollama / vLLM. The `:fastest` model-id suffix is also worth trying — HF will pick whichever provider can serve it first.

## Pre-deploy gate

Don't merge to `main` if any of these regress (a small pre-push helper is sketched after this list):

- [ ] `pytest tests/test_providers.py tests/test_providers_hf.py` is green
- [ ] `python scripts/verify_hf_router.py --skip-episode` reports ≥1 served model in the suggestion list
- [ ] At least one of `Pratyush-01/physix-3b-rl` or `Pratyush-01/physix-3b-sft-merged` is served (the comparison story collapses if neither trained variant works)
- [ ] Browser walkthrough section C above completes without surprises
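If you want the first two gate items in one command before pushing, something along these lines works. It is a hypothetical convenience script, not part of the repo, and it assumes both commands exit non-zero on failure and that `HF_TOKEN` is already exported:

```python
"""Hypothetical pre-push helper: run the two automated gate checks in order.

Assumes pytest and scripts/verify_hf_router.py exit non-zero when a check fails,
and that HF_TOKEN is set in the environment for the verifier.
"""
import subprocess
import sys

CHECKS = [
    ["python", "-m", "pytest", "tests/test_providers.py", "tests/test_providers_hf.py", "-q"],
    ["python", "scripts/verify_hf_router.py", "--skip-episode"],
]

for cmd in CHECKS:
    print("running:", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"gate check failed: {' '.join(cmd)}")

print("Automated gate checks passed. Finish with the section C browser walkthrough by hand.")
```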