# HF Router verification before shipping the Space

The judges will overwhelmingly run the demo through the **Hugging Face
Router** path. Before deploying any change that touches the connection
panel, the provider abstraction, or the server's `/interactive/*`
routes, walk this checklist.
## A. Automated (no token required)

```bash
cd physix-live
source .venv/bin/activate  # or however you activate
python -m pytest tests/test_providers.py tests/test_providers_hf.py -v
```
Expected: **20 passed**. These pin:

- The HF Router base URL (`https://router.huggingface.co/v1`).
- That the OpenAI SDK is constructed with the visitor's `api_key`,
  the canonical `base_url`, and a `User-Agent` header.
- That `HF_TOKEN` env-var fallback works when the panel field is empty,
  but only for HF URLs (no leakage to third-party providers).
- That a provider rejecting `response_format={"type": "json_object"}`
  with a 400 transparently retries without it.
- That `401`, `404`, connection errors, and timeouts each surface a
  hint that points at the right remediation step.
- That a full episode through `/interactive/sessions/{id}/llm-step`
  passes the visitor's `base_url + model + api_key` byte-for-byte to
  the OpenAI SDK call.
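
The fallback and retry pins are behavioral, so a compressed sketch helps
when reading test failures. This is *not* the physix-live implementation;
the helper names and exact shapes are assumptions, but it is the contract
the tests encode:

```python
# Hypothetical sketch of the pinned contract (names are illustrative,
# not the real physix-live helpers).
import os

import openai

HF_ROUTER = "https://router.huggingface.co/v1"


def resolve_api_key(panel_key: str, base_url: str) -> str | None:
    """The panel key wins; HF_TOKEN is a fallback for the HF router only."""
    if panel_key:
        return panel_key
    if base_url.rstrip("/") == HF_ROUTER:
        return os.environ.get("HF_TOKEN")
    return None  # never leak HF_TOKEN to third-party base URLs


def chat_with_json_fallback(client: openai.OpenAI, **kwargs):
    """Ask for strict JSON mode; on a 400, retry the same call without it."""
    try:
        return client.chat.completions.create(
            response_format={"type": "json_object"}, **kwargs
        )
    except openai.BadRequestError:
        # Provider rejected JSON mode -- rerun the identical request
        # without response_format and let the parser cope downstream.
        return client.chat.completions.create(**kwargs)
```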
## B. Real-network probe (HF_TOKEN required)

```bash
# Terminal 1: the demo backend.
python -m physix.server.app --host 127.0.0.1 --port 8000

# Terminal 2: the verifier.
export HF_TOKEN=hf_...  # needs 'Make calls to Inference Providers' scope.
python scripts/verify_hf_router.py
```
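
If you want to sanity-check the token in isolation before launching the
verifier, a quick call against the Hub works. A sketch only: `whoami`
proves the token is valid, while the Inference Providers scope check
itself happens inside `verify_hf_router.py` (Step 1 below):

```python
# Token sanity check only -- confirms a valid token, not the right scope.
import os

from huggingface_hub import whoami

print(whoami(token=os.environ["HF_TOKEN"])["name"])  # raises if invalid
```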
What to look for:

1. **Step 1: HF_TOKEN check** must end with `✓ HF_TOKEN is valid and
   has Inference Providers scope.` If it fails, fix the token scope at
   <https://huggingface.co/settings/tokens> before going further.
2. **Step 2: model probes.** Each of the four HF models we suggest in
   the connection panel gets a 4-token completion attempt (see the
   standalone sketch after this list). Any model reported as
   `NOT SERVED (404)` will appear broken in the demo unless you fix it
   on the model card (see "If the trained model 404s" below).
3. **Step 3: live episode.** The verifier drives one real PhysiX
   episode. Per-turn output should look like:

   ```
   turn 1: match=0.42 format=1.00 total=0.46 (3.2s)
     equation: 'd2y/dt2 = -9.81'
   turn 2: match=0.81 format=1.00 total=0.69 (2.8s)
     equation: 'd2y/dt2 = -9.81 + 0.04 * vy**2'
   ```

   `format=1.00` on every turn means the prompt + parser + verifier
   pipeline is healthy. `format=0.00` means the model returned
   something unparseable, usually a hint that
   `response_format={"type": "json_object"}` was silently ignored *and*
   the model wasn't SFT-warmed. Acceptable for raw-Qwen baselines,
   not for the trained PhysiX model.
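
When the verifier isn't handy, the Step 2 probe is easy to reproduce
standalone. A minimal sketch: the two PhysiX ids and the 3B Qwen appear
elsewhere in this doc, while `Qwen/Qwen2.5-7B-Instruct` is an assumed id
for the "larger Qwen 7B" suggestion:

```python
# Standalone re-creation of Step 2: one 4-token completion per model.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

MODELS = [
    "Pratyush-01/physix-3b-rl",
    "Pratyush-01/physix-3b-sft-merged",
    "Qwen/Qwen2.5-3B-Instruct",
    "Qwen/Qwen2.5-7B-Instruct",  # assumed id for the "larger Qwen 7B"
]

for model in MODELS:
    try:
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=4,
        )
        print(f"{model}: OK")
    except Exception as exc:  # a 404 here is the NOT SERVED case
        print(f"{model}: FAILED ({exc})")
```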
## C. Browser walkthrough

With both backend and frontend running (`make dev`):

1. **Panel renders four endpoints.** Open the page; both the A and B
   panels show the same four-option Endpoint dropdown:
   `Ollama (localhost:11434)` · `Hugging Face Router` · `OpenAI` ·
   `Custom`. The default is `Hugging Face Router` on both sides.
2. **Model field adapts per endpoint.**
   - With **Hugging Face Router** selected, the Model field is a
     text input. Click in it: a datalist of four suggestions appears
     (PhysiX RL, PhysiX SFT, untuned Qwen 3B, larger Qwen 7B).
   - Switch to **Ollama**: the Model field becomes a hard select.
     If `ollama serve` is running, it lists installed tags; if not,
     it shows an amber-bordered fallback input with the canonical
     `qwen2.5:3b-instruct` placeholder.
   - Switch to **OpenAI**: text input + datalist of `gpt-4o-mini`,
     `gpt-4o`, `gpt-4.1-mini`.
   - Switch to **Custom**: text input with no suggestions, plus a new
     **Custom base URL** field appears.
3. **API key persistence.** Type a key into side A's panel and refresh
   the page: the key reappears (per-`base_url` `localStorage` key).
   Switch the endpoint to OpenAI: the key shown is *not* the HF one
   (each base URL has its own slot). Switch back to HF Router: the HF
   key reappears.
4. **One real run end-to-end.** Paste your HF token, leave A pointed
   at `Pratyush-01/physix-3b-rl` and B at `Qwen/Qwen2.5-3B-Instruct`,
   pick "Free Fall" from the system dropdown, and hit
   `▶ Run side-by-side`. Expected within ~30s:
   - Both panels show a trajectory plot with predicted overlays.
   - The per-side reward strip ticks turn-by-turn.
   - Once both finish, the "Scoreboard" banner at the bottom shows
     `Winner: A` (the trained model should beat raw Qwen on Free Fall
     after a few turns; if it doesn't, that's a real signal worth
     investigating, not a UI bug).
5. **401 surfaces correctly.** Clear the API key on side A and
   re-run. The A column shows an amber error row containing
   "'Make calls to Inference Providers' fine-grained permission".
   Side B (with its own key still set) keeps running normally; the
   two sides are independent.
6. **404 surfaces correctly.** Type a clearly bogus model id like
   `does-not-exist/foo` and run. The error row points at the model
   card's "Deploy → Inference Providers" panel.
## If the trained model 404s

`Pratyush-01/physix-3b-rl` is public, but **HF Inference Providers
only serves models that at least one provider has loaded**. If
`verify_hf_router.py` reports it as `NOT SERVED`, fix it before
shipping:

1. Open the model card at
   <https://huggingface.co/Pratyush-01/physix-3b-rl>.
2. Click **Deploy → Inference Providers**. Pick at least one
   provider that supports custom Qwen2.5-3B fine-tunes; Featherless
   and Together are the most reliable for this model size.
3. Wait ~5 minutes for the provider to warm up the weights.
4. Re-run `python scripts/verify_hf_router.py`; the `NOT SERVED`
   line should now read `OK`.

If no provider will load it, the fallback demo story is still
strong: keep side A on the SFT checkpoint
(`Pratyush-01/physix-3b-sft-merged`) or the untuned Qwen, and
explain in the writeup that the trained checkpoint runs locally
via Ollama / vLLM. The `:fastest` model-id suffix is also worth
trying: HF will pick whichever provider can serve it first.
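
To check serving status without waiting on the full verifier, the Hub
API exposes the provider mapping for a model. A sketch, assuming a
recent `huggingface_hub` release; the `inferenceProviderMapping` expand
key and the attribute name are assumptions worth double-checking against
your installed version:

```python
# List which Inference Providers currently map the trained checkpoint.
from huggingface_hub import model_info

info = model_info(
    "Pratyush-01/physix-3b-rl",
    expand=["inferenceProviderMapping"],  # assumed expand key
)
# Empty/None here corresponds to the verifier's NOT SERVED state.
print(info.inference_provider_mapping or "NOT SERVED by any provider yet")
```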
## Pre-deploy gate

Don't merge to `main` if any of these regress:

- [ ] `pytest tests/test_providers.py tests/test_providers_hf.py` is green.
- [ ] `python scripts/verify_hf_router.py --skip-episode` reports
      ≥1 served model in the suggestion list.
- [ ] At least one of `Pratyush-01/physix-3b-rl` or
      `Pratyush-01/physix-3b-sft-merged` is served (the comparison
      story collapses if neither trained variant works).
- [ ] The browser walkthrough in section C above completes without
      surprises.