# HF Router verification before shipping the Space
The judges will overwhelmingly run the demo through the **Hugging Face
Router** path. Before deploying any change that touches the connection
panel, the provider abstraction, or the server's `/interactive/*`
routes, walk this checklist.
## A. Automated (no token required)
```bash
cd physix-live
source .venv/bin/activate # or however you activate
python -m pytest tests/test_providers.py tests/test_providers_hf.py -v
```
Expected: **20 passed**. These pin:
- The HF Router base URL (`https://router.huggingface.co/v1`).
- That the OpenAI SDK is constructed with the visitor's `api_key`,
the canonical `base_url`, and a `User-Agent` header.
- That `HF_TOKEN` env-var fallback works when the panel field is empty,
  but only for HF URLs (no leakage to third-party providers).
- That when a provider rejects `response_format={"type":"json_object"}`
  with a 400, the client transparently retries without it (both
  behaviors are sketched after this list).
- That `401`, `404`, connection errors, and timeouts each surface a
hint that points at the right remediation step.
- That a full episode through `/interactive/sessions/{id}/llm-step`
passes the visitor's `base_url + model + api_key` byte-for-byte to
the OpenAI SDK call.
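For orientation, here is a minimal sketch of the two middle behaviors
(the HF-only `HF_TOKEN` fallback and the 400-triggered retry). The
helper names (`resolve_api_key`, `chat_once`) and the `User-Agent`
string are illustrative, not the repo's actual code; only the canonical
base URL and the OpenAI SDK usage come from the tests above.
```python
import os

from openai import BadRequestError, OpenAI

HF_ROUTER_BASE = "https://router.huggingface.co/v1"

def resolve_api_key(panel_key: str | None, base_url: str) -> str | None:
    # Fall back to the HF_TOKEN env var only for the HF Router URL, so
    # an empty panel field never leaks our token to third-party hosts.
    if panel_key:
        return panel_key
    if base_url == HF_ROUTER_BASE:
        return os.environ.get("HF_TOKEN")
    return None

def chat_once(base_url: str, model: str, panel_key: str | None,
              messages: list[dict]):
    client = OpenAI(
        base_url=base_url,
        api_key=resolve_api_key(panel_key, base_url),
        default_headers={"User-Agent": "physix-live"},  # illustrative UA
    )
    try:
        # First attempt asks for strict JSON output.
        return client.chat.completions.create(
            model=model,
            messages=messages,
            response_format={"type": "json_object"},
        )
    except BadRequestError:
        # Some providers 400 on response_format; retry the identical
        # request without it.
        return client.chat.completions.create(model=model, messages=messages)
```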
## B. Real-network probe (HF_TOKEN required)
```bash
# Terminal 1 — the demo backend.
python -m physix.server.app --host 127.0.0.1 --port 8000
# Terminal 2 — the verifier.
export HF_TOKEN=hf_... # needs 'Make calls to Inference Providers' scope.
python scripts/verify_hf_router.py
```
What to look for:
1. **Step 1: HF_TOKEN check** must end with `✓ HF_TOKEN is valid and
has Inference Providers scope.` If it fails, fix the token scope at
<https://huggingface.co/settings/tokens> before going further.
2. **Step 2: model probes**. Each of the four HF models we suggest in
   the connection panel gets a 4-token completion attempt (a
   standalone version of this probe is sketched after this list).
   Any model reported as `NOT SERVED (404)` will appear broken in the
   demo unless you fix it on the model card (see "If the trained
   model 404s" below).
3. **Step 3: live episode**. The verifier drives one real PhysiX
episode. Per-turn output should look like:
```
turn 1: match=0.42 format=1.00 total=0.46 (3.2s)
equation: 'd2y/dt2 = -9.81'
turn 2: match=0.81 format=1.00 total=0.69 (2.8s)
equation: 'd2y/dt2 = -9.81 + 0.04 * vy**2'
```
`format=1.00` on every turn means the prompt + parser + verifier
pipeline is healthy. `format=0.00` means the model returned
something unparseable — usually a hint that
`response_format={"type":"json_object"}` was silently ignored *and*
the model wasn't SFT-warmed. Acceptable for raw-Qwen baselines,
not for the trained PhysiX model.
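To probe a single model by hand (outside the verifier), a 4-token
completion attempt is enough to separate "served" from the failure
modes above. A rough standalone equivalent of the step-2 probe; the
labels mimic the verifier's output but are not its exact strings:
```python
import os

from openai import AuthenticationError, NotFoundError, OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

def probe(model_id: str) -> str:
    """Attempt a tiny completion and classify the outcome."""
    try:
        client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=4,
        )
        return "OK"
    except NotFoundError:
        return "NOT SERVED (404)"  # no provider has loaded this model
    except AuthenticationError:
        return "BAD TOKEN (401)"   # missing Inference Providers scope

for mid in ("Pratyush-01/physix-3b-rl", "Qwen/Qwen2.5-3B-Instruct"):
    print(f"{mid}: {probe(mid)}")
```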
## C. Browser walkthrough
With both backend and frontend running (`make dev`):
1. **Panel renders four endpoints.** Open the page; both the A and B
panels show the same four-option Endpoint dropdown:
`Ollama (localhost:11434)` · `Hugging Face Router` · `OpenAI` ·
`Custom`. The default is `Hugging Face Router` on both sides.
2. **Model field adapts per endpoint.**
- With **Hugging Face Router** selected, the Model field is a
text input. Click in it: a datalist of four suggestions appears
(PhysiX RL, PhysiX SFT, untuned Qwen 3B, larger Qwen 7B).
- Switch to **Ollama**: the Model field becomes a hard select.
If `ollama serve` is running, it lists installed tags; if not,
it shows an amber-bordered fallback input with the canonical
`qwen2.5:3b-instruct` placeholder.
- Switch to **OpenAI**: text input + datalist of `gpt-4o-mini`,
`gpt-4o`, `gpt-4.1-mini`.
- Switch to **Custom**: text input with no suggestions, plus a new
**Custom base URL** field appears.
3. **API key persistence.** Type a key into side A's panel, refresh
the page — the key reappears (per-`base_url` `localStorage` key).
Switch the endpoint to OpenAI: the key is *not* the HF one (each
base URL has its own slot). Switch back to HF Router: HF key
reappears.
4. **One real run end-to-end.** Paste your HF token, leave A pointed
at `Pratyush-01/physix-3b-rl` and B at `Qwen/Qwen2.5-3B-Instruct`,
pick "Free Fall" from the system dropdown, hit `▶ Run side-by-side`.
Expected within ~30s:
- Both panels show a trajectory plot with predicted overlays.
- Per-side reward strip ticks turn-by-turn.
- Once both finish, the "Scoreboard" banner at the bottom shows
`Winner: A` (the trained model should beat raw Qwen on Free Fall
after a few turns; if it doesn't, that's a real signal worth
investigating — not a UI bug).
5. **401 surfaces correctly.** Clear the API key on side A and
re-run. The A column shows an amber error row containing
"'Make calls to Inference Providers' fine-grained permission".
Side B (with its own key still set) keeps running normally — the
two sides are independent.
6. **404 surfaces correctly.** Type a clearly bogus model id like
`does-not-exist/foo` and run. The error row points at the model
card "Deploy → Inference API" panel.
## If the trained model 404s
`Pratyush-01/physix-3b-rl` is public, but **HF Inference Providers
only serves models that at least one provider has loaded**. If
`verify_hf_router.py` reports it as `NOT SERVED`, fix it before
shipping:
1. Open the model card at
<https://huggingface.co/Pratyush-01/physix-3b-rl>.
2. Click **Deploy → Inference Providers**. Pick at least one
provider that supports custom Qwen2.5-3B fine-tunes — Featherless
and Together are the most reliable for this model size.
3. Wait ~5 minutes for the provider to warm up the weights.
4. Re-run `python scripts/verify_hf_router.py` — the `NOT SERVED`
line should now be `OK`.
If no provider will load it, the fallback demo story is still
strong: keep side A on the SFT checkpoint
(`Pratyush-01/physix-3b-sft-merged`) or the untuned Qwen, and
explain in the writeup that the trained checkpoint runs locally
via Ollama / vLLM. The `:fastest` model-id suffix is also worth
trying (sketched below) — HF will pick whichever provider can
serve it first.
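A quick way to test the suffix idea without touching the demo
(hypothetical snippet, not part of the verifier):
```python
import os

from openai import NotFoundError, OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

# Try the bare model id first, then the ':fastest' routing suffix,
# which asks HF to route to whichever provider can serve it first.
for candidate in ("Pratyush-01/physix-3b-rl",
                  "Pratyush-01/physix-3b-rl:fastest"):
    try:
        client.chat.completions.create(
            model=candidate,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=4,
        )
        print(f"served as {candidate!r}")
        break
    except NotFoundError:
        continue
else:
    print("neither variant is served")
```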
## Pre-deploy gate
Don't merge to `main` if any of these regress:
- [ ] `pytest tests/test_providers.py tests/test_providers_hf.py` is green
- [ ] `python scripts/verify_hf_router.py --skip-episode` reports
≥1 served model in the suggestion list
- [ ] At least one of `Pratyush-01/physix-3b-rl` or
`Pratyush-01/physix-3b-sft-merged` is served (the comparison
story collapses if neither trained variant works)
- [ ] Browser walkthrough section C above completes without
surprises
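The first two items can be scripted. A sketch (assumes
`verify_hf_router.py` exits nonzero when no suggested model is served;
the browser walkthrough and the trained-variant check still need eyes):
```python
#!/usr/bin/env python
"""Run the automated half of the pre-deploy gate."""
import subprocess
import sys

COMMANDS = [
    ["python", "-m", "pytest",
     "tests/test_providers.py", "tests/test_providers_hf.py"],
    ["python", "scripts/verify_hf_router.py", "--skip-episode"],
]

for cmd in COMMANDS:
    print("->", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"gate failed: {' '.join(cmd)}")

print("automated gate passed; section C walkthrough is still manual")
```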