# HF Router verification before shipping the Space
The judges will overwhelmingly run the demo through the **Hugging Face
Router** path. Before deploying any change that touches the connection
panel, the provider abstraction, or the server's `/interactive/*`
routes, walk this checklist.
## A. Automated (no token required)
```bash
cd physix-live
source .venv/bin/activate # or however you activate
python -m pytest tests/test_providers.py tests/test_providers_hf.py -v
```
Expected: **20 passed**. These pin:
- The HF Router base URL (`https://router.huggingface.co/v1`).
- That the OpenAI SDK is constructed with the visitor's `api_key`,
the canonical `base_url`, and a `User-Agent` header.
- That the `HF_TOKEN` env-var fallback applies when the panel field is
  empty, but only for HF URLs (no leakage to third-party providers); a
  minimal sketch of this rule follows this list.
- That when a provider rejects `response_format={"type":"json_object"}`
  with a 400, the call is transparently retried without it.
- That `401`, `404`, connection errors, and timeouts each surface a
hint that points at the right remediation step.
- That a full episode through `/interactive/sessions/{id}/llm-step`
passes the visitor's `base_url + model + api_key` byte-for-byte to
the OpenAI SDK call.
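For orientation, here is a minimal sketch of the fallback rule those
tests pin. The helper name `resolve_api_key` is hypothetical, not the
project's real function (`tests/test_providers_hf.py` is authoritative);
only the rule itself comes from the list above.

```python
# Minimal sketch of the env-var fallback rule pinned by the tests.
# `resolve_api_key` is a hypothetical name, not the project's real
# helper; it only illustrates the rule described above.
import os

HF_ROUTER_BASE_URL = "https://router.huggingface.co/v1"


def resolve_api_key(panel_key: str | None, base_url: str) -> str | None:
    """Pick the key handed to the OpenAI SDK for a given base_url."""
    if panel_key:  # a key typed into the panel always wins
        return panel_key
    # Fall back to HF_TOKEN only for the HF Router, so the env token
    # never leaks to OpenAI or to a custom third-party base URL.
    if base_url.startswith(HF_ROUTER_BASE_URL):
        return os.environ.get("HF_TOKEN")
    return None


def test_hf_token_fallback_is_hf_only(monkeypatch):
    monkeypatch.setenv("HF_TOKEN", "hf_dummy")
    assert resolve_api_key(None, HF_ROUTER_BASE_URL) == "hf_dummy"
    assert resolve_api_key(None, "https://api.openai.com/v1") is None
    assert resolve_api_key("sk-visitor", HF_ROUTER_BASE_URL) == "sk-visitor"
```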
## B. Real-network probe (HF_TOKEN required)
```bash
# Terminal 1 — the demo backend.
python -m physix.server.app --host 127.0.0.1 --port 8000
# Terminal 2 — the verifier.
export HF_TOKEN=hf_... # needs 'Make calls to Inference Providers' scope.
python scripts/verify_hf_router.py
```
What to look for:
1. **Step 1: HF_TOKEN check** must end with `✓ HF_TOKEN is valid and
has Inference Providers scope.` If it fails, fix the token scope at
<https://huggingface.co/settings/tokens> before going further.
2. **Step 2: model probes**. Each of the four HF models we suggest in
   the connection panel gets a 4-token completion attempt (a
   standalone version of this probe is sketched after this list). Any
   model reported as `NOT SERVED (404)` will appear broken in the demo
   unless you fix it on the model card (see "If the trained model
   404s" below).
3. **Step 3: live episode**. The verifier drives one real PhysiX
episode. Per-turn output should look like:
```
turn 1: match=0.42 format=1.00 total=0.46 (3.2s)
equation: 'd2y/dt2 = -9.81'
turn 2: match=0.81 format=1.00 total=0.69 (2.8s)
equation: 'd2y/dt2 = -9.81 + 0.04 * vy**2'
```
`format=1.00` on every turn means the prompt + parser + verifier
pipeline is healthy. `format=0.00` means the model returned
something unparseable — usually a hint that
`response_format={"type":"json_object"}` was silently ignored *and*
the model wasn't SFT-warmed. Acceptable for raw-Qwen baselines,
not for the trained PhysiX model.
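To re-run the step-2 probe by hand, a standalone version looks roughly
like this. It uses only the public OpenAI SDK; three of the model ids
appear elsewhere in this document, while the 7B id is an assumption
inferred from "larger Qwen 7B" in the panel description and may differ:

```python
# Standalone version of the step-2 probe: one 4-token completion per
# suggested model through the HF Router, using only the OpenAI SDK.
import os

from openai import OpenAI

MODELS = [
    "Pratyush-01/physix-3b-rl",
    "Pratyush-01/physix-3b-sft-merged",
    "Qwen/Qwen2.5-3B-Instruct",
    "Qwen/Qwen2.5-7B-Instruct",  # assumed id for "larger Qwen 7B"
]

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

for model in MODELS:
    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=4,
        )
        print(f"{model}: OK {resp.choices[0].message.content!r}")
    except Exception as exc:  # a 404 here means no provider serves it
        print(f"{model}: FAILED ({exc})")
```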
## C. Browser walkthrough
With both backend and frontend running (`make dev`):
1. **Panel renders four endpoints.** Open the page; both the A and B
panels show the same four-option Endpoint dropdown:
`Ollama (localhost:11434)` · `Hugging Face Router` · `OpenAI` ·
`Custom`. The default is `Hugging Face Router` on both sides.
2. **Model field adapts per endpoint.**
- With **Hugging Face Router** selected, the Model field is a
text input. Click in it: a datalist of four suggestions appears
(PhysiX RL, PhysiX SFT, untuned Qwen 3B, larger Qwen 7B).
- Switch to **Ollama**: the Model field becomes a hard select.
If `ollama serve` is running, it lists installed tags; if not,
it shows an amber-bordered fallback input with the canonical
`qwen2.5:3b-instruct` placeholder.
- Switch to **OpenAI**: text input + datalist of `gpt-4o-mini`,
`gpt-4o`, `gpt-4.1-mini`.
- Switch to **Custom**: text input with no suggestions, plus a new
**Custom base URL** field appears.
3. **API key persistence.** Type a key into side A's panel, refresh
the page — the key reappears (per-`base_url` `localStorage` key).
Switch the endpoint to OpenAI: the key is *not* the HF one (each
base URL has its own slot). Switch back to HF Router: HF key
reappears.
4. **One real run end-to-end.** Paste your HF token, leave A pointed
at `Pratyush-01/physix-3b-rl` and B at `Qwen/Qwen2.5-3B-Instruct`,
pick "Free Fall" from the system dropdown, hit `▶ Run side-by-side`.
Expected within ~30s:
- Both panels show a trajectory plot with predicted overlays.
- Per-side reward strip ticks turn-by-turn.
- Once both finish, the "Scoreboard" banner at the bottom shows
`Winner: A` (the trained model should beat raw Qwen on Free Fall
after a few turns; if it doesn't, that's a real signal worth
investigating — not a UI bug).
5. **401 surfaces correctly.** Clear the API key on side A and
re-run. The A column shows an amber error row containing
"'Make calls to Inference Providers' fine-grained permission".
Side B (with its own key still set) keeps running normally — the
two sides are independent.
6. **404 surfaces correctly.** Type a clearly bogus model id like
   `does-not-exist/foo` and run. The error row points at the model
   card's "Deploy → Inference Providers" panel.
## If the trained model 404s
`Pratyush-01/physix-3b-rl` is public, but **HF Inference Providers
only serves models that at least one provider has loaded**. If
`verify_hf_router.py` reports it as `NOT SERVED`, fix it before
shipping:
1. Open the model card at
<https://huggingface.co/Pratyush-01/physix-3b-rl>.
2. Click **Deploy → Inference Providers**. Pick at least one
provider that supports custom Qwen2.5-3B fine-tunes — Featherless
and Together are the most reliable for this model size.
3. Wait ~5 minutes for the provider to warm up the weights.
4. Re-run `python scripts/verify_hf_router.py` — the `NOT SERVED`
   line should now be `OK`. (The mapping check sketched below answers
   the same question without spending a completion.)
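To check serving status without a completion call, you can ask the Hub
API which providers have the model mapped. This sketch assumes the
Hub's `expand[]=inferenceProviderMapping` query parameter and its
response shape, neither of which is pinned by this repo's tests:

```python
# Ask the Hub which Inference Providers currently map this model.
# Assumes the Hub API's `expand[]=inferenceProviderMapping` field and
# its response shape; check the Hub API docs if this stops working.
import os

import requests

MODEL_ID = "Pratyush-01/physix-3b-rl"

token = os.environ.get("HF_TOKEN")
resp = requests.get(
    f"https://huggingface.co/api/models/{MODEL_ID}",
    params={"expand[]": "inferenceProviderMapping"},
    headers={"Authorization": f"Bearer {token}"} if token else {},
    timeout=30,
)
resp.raise_for_status()
mapping = resp.json().get("inferenceProviderMapping") or {}
if mapping:
    for provider, info in mapping.items():
        print(f"{provider}: {info.get('status', 'unknown')}")
else:
    print(f"No provider currently serves {MODEL_ID}; expect a 404 in the demo.")
```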
If no provider will load it, the fallback demo story is still
strong: keep side A on the SFT checkpoint
(`Pratyush-01/physix-3b-sft-merged`) or the untuned Qwen, and
explain in the writeup that the trained checkpoint runs locally
via Ollama / vLLM. The `:fastest` model-id suffix is also worth
trying — HF will pick whichever provider can serve it first.
## Pre-deploy gate
Don't merge to `main` if any of these regress:
- [ ] `pytest tests/test_providers.py tests/test_providers_hf.py` is green
- [ ] `python scripts/verify_hf_router.py --skip-episode` reports
≥1 served model in the suggestion list
- [ ] At least one of `Pratyush-01/physix-3b-rl` or
`Pratyush-01/physix-3b-sft-merged` is served (the comparison
story collapses if neither trained variant works)
- [ ] The section-C browser walkthrough above completes without
      surprises
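If you want the scripted half of this gate as one command, a wrapper
along these lines works. The file name `scripts/predeploy_gate.py` is
only a suggestion, and it assumes `verify_hf_router.py --skip-episode`
exits nonzero when no suggested model is served; if it always exits 0,
parse its output instead. The browser walkthrough stays manual.

```python
# scripts/predeploy_gate.py (hypothetical): chains the two scripted
# gate checks and exits nonzero on the first failure, so CI or a
# pre-push hook can block a regressing merge.
import subprocess
import sys

CHECKS = [
    ["python", "-m", "pytest",
     "tests/test_providers.py", "tests/test_providers_hf.py", "-q"],
    # Assumes the verifier exits nonzero when no suggested model is
    # served; if it always exits 0, parse its output instead.
    ["python", "scripts/verify_hf_router.py", "--skip-episode"],
]

for cmd in CHECKS:
    print("::", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"pre-deploy gate failed at: {' '.join(cmd)}")

print("Scripted checks passed. Run the section-C browser walkthrough by hand.")
```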