# HF Router verification before shipping the Space

The judges will overwhelmingly run the demo through the **Hugging Face
Router** path. Before deploying any change that touches the connection
panel, the provider abstraction, or the server's `/interactive/*`
routes, walk this checklist.

## A. Automated (no token required)

```bash
cd physix-live
source .venv/bin/activate            # or however you activate
python -m pytest tests/test_providers.py tests/test_providers_hf.py -v
```

Expected: **20 passed**. These tests pin:

- That the HF Router base URL is exactly `https://router.huggingface.co/v1`.
- That the OpenAI SDK is constructed with the visitor's `api_key`,
  the canonical `base_url`, and a `User-Agent` header.
- That `HF_TOKEN` env-var fallback works when the panel field is empty
  but only for HF URLs (no leakage to third-party providers).
- That when a provider rejects `response_format={"type":"json_object"}`
  with a 400, the client transparently retries without it.
- That `401`, `404`, connection errors, and timeouts each surface a
  hint that points at the right remediation step.
- That a full episode through `/interactive/sessions/{id}/llm-step`
  passes the visitor's `base_url + model + api_key` byte-for-byte to
  the OpenAI SDK call.
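
For orientation, here is a sketch of the 400-retry behavior those
tests pin. The helper name `chat_completion` is hypothetical and the
error handling simplified; the real logic lives behind the provider
abstraction:

```python
# Sketch only: the first attempt asks for strict JSON; if the
# provider 400s on response_format, retry the same call without it.
# `chat_completion` is a hypothetical name, not the repo's API.
from openai import OpenAI, BadRequestError


def chat_completion(base_url: str, api_key: str, model: str,
                    messages: list[dict]) -> str:
    client = OpenAI(base_url=base_url, api_key=api_key)
    try:
        resp = client.chat.completions.create(
            model=model,
            messages=messages,
            response_format={"type": "json_object"},
        )
    except BadRequestError:
        # Some providers reject response_format outright; the retry
        # drops it and lets the downstream parser cope.
        resp = client.chat.completions.create(
            model=model, messages=messages)
    return resp.choices[0].message.content
```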

## B. Real-network probe (HF_TOKEN required)

```bash
# Terminal 1 — the demo backend.
python -m physix.server.app --host 127.0.0.1 --port 8000

# Terminal 2 — the verifier.
export HF_TOKEN=hf_...   # needs 'Make calls to Inference Providers' scope.
python scripts/verify_hf_router.py
```

What to look for:

1. **Step 1: HF_TOKEN check** must end with `✓ HF_TOKEN is valid and
   has Inference Providers scope.` If it fails, fix the token scope at
   <https://huggingface.co/settings/tokens> before going further.
2. **Step 2: model probes**. Each of the four HF models we suggest in
   the connection panel gets a 4-token completion attempt. Any model
   reported as `NOT SERVED (404)` will appear broken in the demo
   unless you fix it on the model card (see "If the trained model
   404s" below).
3. **Step 3: live episode**. The verifier drives one real PhysiX
   episode. Per-turn output should look like:

   ```
   turn 1: match=0.42  format=1.00  total=0.46  (3.2s)
     equation: 'd2y/dt2 = -9.81'
   turn 2: match=0.81  format=1.00  total=0.69  (2.8s)
     equation: 'd2y/dt2 = -9.81 + 0.04 * vy**2'
   ```

   `format=1.00` on every turn means the prompt + parser + verifier
   pipeline is healthy. `format=0.00` means the model returned
   something unparseable — usually a hint that
   `response_format={"type":"json_object"}` was silently ignored *and*
   the model wasn't SFT-warmed. Acceptable for raw-Qwen baselines,
   not for the trained PhysiX model.
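
As a mental model for that `format` score (the JSON shape here is an
assumption; the real verifier lives in the repo), a reply that parses
as a JSON object with an `equation` string scores 1.00 and anything
else scores 0.00:

```python
import json


def format_score(raw_reply: str) -> float:
    """Assumed rule of thumb, not the repo's verifier: 1.0 iff the
    reply is a JSON object with a string 'equation' field."""
    try:
        payload = json.loads(raw_reply)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(payload, dict):
        return 0.0
    return 1.0 if isinstance(payload.get("equation"), str) else 0.0
```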

## C. Browser walkthrough

With both backend and frontend running (`make dev`):

1. **Panel renders four endpoints.** Open the page; both the A and B
   panels show the same four-option Endpoint dropdown:
   `Ollama (localhost:11434)` · `Hugging Face Router` · `OpenAI` ·
   `Custom`. The default is `Hugging Face Router` on both sides.

2. **Model field adapts per endpoint.**
   - With **Hugging Face Router** selected, the Model field is a
     text input. Click in it: a datalist of four suggestions appears
     (PhysiX RL, PhysiX SFT, untuned Qwen 3B, larger Qwen 7B).
   - Switch to **Ollama**: the Model field becomes a hard select.
     If `ollama serve` is running, it lists installed tags; if not,
     it shows an amber-bordered fallback input with the canonical
     `qwen2.5:3b-instruct` placeholder.
   - Switch to **OpenAI**: text input + datalist of `gpt-4o-mini`,
     `gpt-4o`, `gpt-4.1-mini`.
   - Switch to **Custom**: text input with no suggestions, plus a new
     **Custom base URL** field appears.

3. **API key persistence.** Type a key into side A's panel, refresh
   the page — the key reappears (per-`base_url` `localStorage` key).
   Switch the endpoint to OpenAI: the key is *not* the HF one (each
   base URL has its own slot). Switch back to HF Router: HF key
   reappears.

4. **One real run end-to-end.** Paste your HF token, leave A pointed
   at `Pratyush-01/physix-3b-rl` and B at `Qwen/Qwen2.5-3B-Instruct`,
   pick "Free Fall" from the system dropdown, hit `▶ Run side-by-side`.

   Expected within ~30s:
   - Both panels show a trajectory plot with predicted overlays.
   - Per-side reward strip ticks turn-by-turn.
   - Once both finish, the "Scoreboard" banner at the bottom shows
     `Winner: A` (the trained model should beat raw Qwen on Free Fall
     after a few turns; if it doesn't, that's a real signal worth
     investigating — not a UI bug).

5. **401 surfaces correctly.** Clear the API key on side A and
   re-run. The A column shows an amber error row containing
   "'Make calls to Inference Providers' fine-grained permission".
   Side B (with its own key still set) keeps running normally — the
   two sides are independent.

6. **404 surfaces correctly.** Type a clearly bogus model id like
   `does-not-exist/foo` and run. The error row points at the model
   card "Deploy → Inference API" panel.

## If the trained model 404s

`Pratyush-01/physix-3b-rl` is public, but **HF Inference Providers
only serves models that at least one provider has loaded**. If
`verify_hf_router.py` reports it as `NOT SERVED`, fix it before
shipping:

1. Open the model card at
   <https://huggingface.co/Pratyush-01/physix-3b-rl>.
2. Click **Deploy → Inference Providers**. Pick at least one
   provider that supports custom Qwen2.5-3B fine-tunes — Featherless
   and Together are the most reliable for this model size.
3. Wait ~5 minutes for the provider to warm up the weights.
4. Re-run `python scripts/verify_hf_router.py` — the `NOT SERVED`
   line should now be `OK`.

If no provider will load it, the fallback demo story is still
strong: keep side A on the SFT checkpoint
(`Pratyush-01/physix-3b-sft-merged`) or the untuned Qwen, and
explain in the writeup that the trained checkpoint runs locally
via Ollama / vLLM. The `:fastest` model-id suffix is also worth
trying — HF will pick whichever provider can serve it first.
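
For a one-off probe outside the verifier script, the served-or-not
check is just a tiny completion against the router. This mirrors
Step 2 of `verify_hf_router.py`; only the standard OpenAI SDK is
assumed:

```python
# Quick served-or-not probe; a 404 here means no provider has the
# weights loaded.
import os

from openai import OpenAI, NotFoundError

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)
try:
    client.chat.completions.create(
        model="Pratyush-01/physix-3b-rl",
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=4,
    )
    print("OK: model is served")
except NotFoundError:
    print("NOT SERVED (404): fix via Deploy -> Inference Providers")
```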

## Pre-deploy gate

Don't merge to `main` if any of these regress:

- [ ] `pytest tests/test_providers.py tests/test_providers_hf.py` is green
- [ ] `python scripts/verify_hf_router.py --skip-episode` reports
  ≥1 served model in the suggestion list
- [ ] At least one of `Pratyush-01/physix-3b-rl` or
  `Pratyush-01/physix-3b-sft-merged` is served (the comparison
  story collapses if neither trained variant works)
- [ ] Browser walkthrough section C above completes without
  surprises
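
If you'd rather make the automated half of this gate mechanical, a
small runner along these lines works. It is a hypothetical helper,
not a script in the repo, and the browser walkthrough stays manual:

```python
#!/usr/bin/env python3
"""Hypothetical pre-deploy gate runner: exits non-zero if either
automated check regresses. Not part of the repo."""
import subprocess
import sys

CHECKS = [
    ["python", "-m", "pytest", "tests/test_providers.py",
     "tests/test_providers_hf.py", "-q"],
    ["python", "scripts/verify_hf_router.py", "--skip-episode"],
]

for cmd in CHECKS:
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"gate failed: {' '.join(cmd)}")
print("gate passed; run the browser walkthrough before merging.")
```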