# HF Router verification before shipping the Space

The judges will overwhelmingly run the demo through the **Hugging Face Router** path. Before deploying any change that touches the connection panel, the provider abstraction, or the server's `/interactive/*` routes, walk this checklist.

## A. Automated (no token required)

```bash
cd physix-live
source .venv/bin/activate   # or however you activate
python -m pytest tests/test_providers.py tests/test_providers_hf.py -v
```

Expected: **20 passed**. These pin:

- The HF Router base URL (`https://router.huggingface.co/v1`).
- That the OpenAI SDK is constructed with the visitor's `api_key`, the canonical `base_url`, and a `User-Agent` header.
- That `HF_TOKEN` env-var fallback works when the panel field is empty, but only for HF URLs (no leakage to third-party providers).
- That a provider rejecting `response_format={"type":"json_object"}` with a 400 transparently retries without it.
- That `401`, `404`, connection errors, and timeouts each surface a hint that points at the right remediation step.
- That a full episode through `/interactive/sessions/{id}/llm-step` passes the visitor's `base_url + model + api_key` byte-for-byte to the OpenAI SDK call.

## B. Real-network probe (HF_TOKEN required)

```bash
# Terminal 1 — the demo backend.
python -m physix.server.app --host 127.0.0.1 --port 8000

# Terminal 2 — the verifier.
export HF_TOKEN=hf_...   # needs 'Make calls to Inference Providers' scope.
python scripts/verify_hf_router.py
```

What to look for:

1. **Step 1: HF_TOKEN check** must end with `✓ HF_TOKEN is valid and has Inference Providers scope.` If it fails, fix the token's scope before going further.
2. **Step 2: model probes.** Each of the four HF models we suggest in the connection panel gets a 4-token completion attempt. Any model reported as `NOT SERVED (404)` will appear broken in the demo unless you fix it on the model card (see "If the trained model 404s" below). A one-off manual probe you can run outside the verifier is sketched after this list.
3. **Step 3: live episode.** The verifier drives one real PhysiX episode. Per-turn output should look like:

   ```
   turn 1: match=0.42 format=1.00 total=0.46 (3.2s)  equation: 'd2y/dt2 = -9.81'
   turn 2: match=0.81 format=1.00 total=0.69 (2.8s)  equation: 'd2y/dt2 = -9.81 + 0.04 * vy**2'
   ```

   `format=1.00` on every turn means the prompt + parser + verifier pipeline is healthy. `format=0.00` means the model returned something unparseable — usually a hint that `response_format={"type":"json_object"}` was silently ignored *and* the model wasn't SFT-warmed. Acceptable for raw-Qwen baselines, not for the trained PhysiX model.
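If a probe result looks wrong and you want to reproduce it by hand, a single completion through the router is enough. This is a minimal sketch, not what `verify_hf_router.py` does internally: it assumes only the standard OpenAI Python SDK, the base URL pinned in section A, and `HF_TOKEN` in the environment; the "ping" prompt and the error handling are illustrative.

```python
import os

from openai import OpenAI, AuthenticationError, NotFoundError

client = OpenAI(
    base_url="https://router.huggingface.co/v1",   # canonical router URL from section A
    api_key=os.environ["HF_TOKEN"],
)

model = "Pratyush-01/physix-3b-rl"   # or any id from the connection-panel suggestions
try:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=4,                 # same size as the verifier's per-model probe
    )
    print(f"OK          {model}: {resp.choices[0].message.content!r}")
except NotFoundError:
    # Mirrors the verifier's NOT SERVED (404) case: no provider has the weights loaded.
    print(f"NOT SERVED  {model} (see 'If the trained model 404s' below)")
except AuthenticationError:
    # Mirrors the 401 case: token lacks the 'Make calls to Inference Providers' scope.
    print("401         fix the token scope before re-running")
```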
## C. Browser walkthrough

With both backend and frontend running (`make dev`):

1. **Panel renders four endpoints.** Open the page; both the A and B panels show the same four-option Endpoint dropdown: `Ollama (localhost:11434)` · `Hugging Face Router` · `OpenAI` · `Custom`. The default is `Hugging Face Router` on both sides.
2. **Model field adapts per endpoint.**
   - With **Hugging Face Router** selected, the Model field is a text input. Click in it: a datalist of four suggestions appears (PhysiX RL, PhysiX SFT, untuned Qwen 3B, larger Qwen 7B).
   - Switch to **Ollama**: the Model field becomes a hard select. If `ollama serve` is running, it lists installed tags; if not, it shows an amber-bordered fallback input with the canonical `qwen2.5:3b-instruct` placeholder.
   - Switch to **OpenAI**: text input + datalist of `gpt-4o-mini`, `gpt-4o`, `gpt-4.1-mini`.
   - Switch to **Custom**: text input with no suggestions, plus a new **Custom base URL** field appears.
3. **API key persistence.** Type a key into side A's panel, refresh the page — the key reappears (per-`base_url` `localStorage` key). Switch the endpoint to OpenAI: the key is *not* the HF one (each base URL has its own slot). Switch back to HF Router: the HF key reappears.
4. **One real run end-to-end.** Paste your HF token, leave A pointed at `Pratyush-01/physix-3b-rl` and B at `Qwen/Qwen2.5-3B-Instruct`, pick "Free Fall" from the system dropdown, hit `▶ Run side-by-side`. Expected within ~30s:
   - Both panels show a trajectory plot with predicted overlays.
   - Per-side reward strip ticks turn-by-turn.
   - Once both finish, the "Scoreboard" banner at the bottom shows `Winner: A` (the trained model should beat raw Qwen on Free Fall after a few turns; if it doesn't, that's a real signal worth investigating — not a UI bug).
5. **401 surfaces correctly.** Clear the API key on side A and re-run. The A column shows an amber error row containing "'Make calls to Inference Providers' fine-grained permission". Side B (with its own key still set) keeps running normally — the two sides are independent.
6. **404 surfaces correctly.** Type a clearly bogus model id like `does-not-exist/foo` and run. The error row points at the model card's "Deploy → Inference API" panel.

## If the trained model 404s

`Pratyush-01/physix-3b-rl` is public, but **HF Inference Providers only serves models that at least one provider has loaded**. If `verify_hf_router.py` reports it as `NOT SERVED`, fix it before shipping:

1. Open the `Pratyush-01/physix-3b-rl` model card on Hugging Face.
2. Click **Deploy → Inference Providers**. Pick at least one provider that supports custom Qwen2.5-3B fine-tunes — Featherless and Together are the most reliable for this model size.
3. Wait ~5 minutes for the provider to warm up the weights.
4. Re-run `python scripts/verify_hf_router.py` — the `NOT SERVED` line should now be `OK`.

If no provider will load it, the fall-back demo story is still strong: keep side A on the SFT checkpoint (`Pratyush-01/physix-3b-sft-merged`) or the untuned Qwen, and explain in the writeup that the trained checkpoint runs locally via Ollama / vLLM. The `:fastest` model-id suffix is also worth trying — HF will pick whichever provider can serve it first.

## Pre-deploy gate

Don't merge to `main` if any of these regress (a small pre-push helper is sketched after this list):

- [ ] `pytest tests/test_providers.py tests/test_providers_hf.py` is green
- [ ] `python scripts/verify_hf_router.py --skip-episode` reports ≥1 served model in the suggestion list
- [ ] At least one of `Pratyush-01/physix-3b-rl` or `Pratyush-01/physix-3b-sft-merged` is served (the comparison story collapses if neither trained variant works)
- [ ] Browser walkthrough section C above completes without surprises
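If you want the first two gate items in one command before pushing, something along these lines works. It is a hypothetical convenience script, not part of the repo, and it assumes both commands exit non-zero on failure and that `HF_TOKEN` is already exported:

```python
"""Hypothetical pre-push helper: run the two automated gate checks in order.

Assumes pytest and scripts/verify_hf_router.py exit non-zero when a check fails,
and that HF_TOKEN is set in the environment for the verifier.
"""
import subprocess
import sys

CHECKS = [
    ["python", "-m", "pytest", "tests/test_providers.py", "tests/test_providers_hf.py", "-q"],
    ["python", "scripts/verify_hf_router.py", "--skip-episode"],
]

for cmd in CHECKS:
    print("running:", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"gate check failed: {' '.join(cmd)}")

print("Automated gate checks passed. Finish with the section C browser walkthrough by hand.")
```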