# HF Router verification before shipping the Space
The judges will overwhelmingly run the demo through the **Hugging Face
Router** path. Before deploying any change that touches the connection
panel, the provider abstraction, or the server's `/interactive/*`
routes, walk this checklist.
## A. Automated (no token required)
```bash
cd physix-live
source .venv/bin/activate # or however you activate
python -m pytest tests/test_providers.py tests/test_providers_hf.py -v
```
Expected: **20 passed**. These pin:
- The HF Router base URL (`https://router.huggingface.co/v1`).
- That the OpenAI SDK is constructed with the visitor's `api_key`,
the canonical `base_url`, and a `User-Agent` header.
- That the `HF_TOKEN` env-var fallback applies when the panel field is
  empty, but only for HF URLs (no leakage to third-party providers); a
  minimal sketch of this rule follows this list.
- That when a provider rejects `response_format={"type":"json_object"}`
  with a 400, the call is transparently retried without it.
- That `401`, `404`, connection errors, and timeouts each surface a
hint that points at the right remediation step.
- That a full episode through `/interactive/sessions/{id}/llm-step`
passes the visitor's `base_url + model + api_key` byte-for-byte to
the OpenAI SDK call.
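For orientation, here is a minimal sketch of the fallback rule those
tests pin. The helper name `resolve_api_key` is hypothetical, not the
project's real function (`tests/test_providers_hf.py` is authoritative);
only the rule itself comes from the list above.

```python
# Minimal sketch of the env-var fallback rule pinned by the tests.
# `resolve_api_key` is a hypothetical name, not the project's real
# helper; it only illustrates the rule described above.
import os

HF_ROUTER_BASE_URL = "https://router.huggingface.co/v1"


def resolve_api_key(panel_key: str | None, base_url: str) -> str | None:
    """Pick the key handed to the OpenAI SDK for a given base_url."""
    if panel_key:  # a key typed into the panel always wins
        return panel_key
    # Fall back to HF_TOKEN only for the HF Router, so the env token
    # never leaks to OpenAI or to a custom third-party base URL.
    if base_url.startswith(HF_ROUTER_BASE_URL):
        return os.environ.get("HF_TOKEN")
    return None


def test_hf_token_fallback_is_hf_only(monkeypatch):
    monkeypatch.setenv("HF_TOKEN", "hf_dummy")
    assert resolve_api_key(None, HF_ROUTER_BASE_URL) == "hf_dummy"
    assert resolve_api_key(None, "https://api.openai.com/v1") is None
    assert resolve_api_key("sk-visitor", HF_ROUTER_BASE_URL) == "sk-visitor"
```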
## B. Real-network probe (HF_TOKEN required)
```bash
# Terminal 1 — the demo backend.
python -m physix.server.app --host 127.0.0.1 --port 8000
# Terminal 2 — the verifier.
export HF_TOKEN=hf_... # needs 'Make calls to Inference Providers' scope.
python scripts/verify_hf_router.py
```
What to look for:
1. **Step 1: HF_TOKEN check** must end with `✓ HF_TOKEN is valid and
has Inference Providers scope.` If it fails, fix the token scope at
<https://huggingface.co/settings/tokens> before going further.
2. **Step 2: model probes**. Each of the four HF models we suggest in
   the connection panel gets a 4-token completion attempt (a
   standalone version of this probe is sketched after this list). Any
   model reported as `NOT SERVED (404)` will appear broken in the demo
   unless you fix it on the model card (see "If the trained model
   404s" below).
3. **Step 3: live episode**. The verifier drives one real PhysiX
episode. Per-turn output should look like:
```
turn 1: match=0.42 format=1.00 total=0.46 (3.2s)
equation: 'd2y/dt2 = -9.81'
turn 2: match=0.81 format=1.00 total=0.69 (2.8s)
equation: 'd2y/dt2 = -9.81 + 0.04 * vy**2'
```
`format=1.00` on every turn means the prompt + parser + verifier
pipeline is healthy. `format=0.00` means the model returned
something unparseable — usually a hint that
`response_format={"type":"json_object"}` was silently ignored *and*
the model wasn't SFT-warmed. Acceptable for raw-Qwen baselines,
not for the trained PhysiX model.
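To re-run the step-2 probe by hand, a standalone version looks roughly
like this. It uses only the public OpenAI SDK; three of the model ids
appear elsewhere in this document, while the 7B id is an assumption
inferred from "larger Qwen 7B" in the panel description and may differ:

```python
# Standalone version of the step-2 probe: one 4-token completion per
# suggested model through the HF Router, using only the OpenAI SDK.
import os

from openai import OpenAI

MODELS = [
    "Pratyush-01/physix-3b-rl",
    "Pratyush-01/physix-3b-sft-merged",
    "Qwen/Qwen2.5-3B-Instruct",
    "Qwen/Qwen2.5-7B-Instruct",  # assumed id for "larger Qwen 7B"
]

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

for model in MODELS:
    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=4,
        )
        print(f"{model}: OK {resp.choices[0].message.content!r}")
    except Exception as exc:  # a 404 here means no provider serves it
        print(f"{model}: FAILED ({exc})")
```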
## C. Browser walkthrough
With both backend and frontend running (`make dev`):
1. **Panel renders four endpoints.** Open the page; both the A and B
panels show the same four-option Endpoint dropdown:
`Ollama (localhost:11434)` · `Hugging Face Router` · `OpenAI` ·
`Custom`. The default is `Hugging Face Router` on both sides.
2. **Model field adapts per endpoint.**
- With **Hugging Face Router** selected, the Model field is a
text input. Click in it: a datalist of four suggestions appears
(PhysiX RL, PhysiX SFT, untuned Qwen 3B, larger Qwen 7B).
- Switch to **Ollama**: the Model field becomes a hard select.
If `ollama serve` is running, it lists installed tags; if not,
it shows an amber-bordered fallback input with the canonical
`qwen2.5:3b-instruct` placeholder.
- Switch to **OpenAI**: text input + datalist of `gpt-4o-mini`,
`gpt-4o`, `gpt-4.1-mini`.
- Switch to **Custom**: text input with no suggestions, plus a new
**Custom base URL** field appears.
3. **API key persistence.** Type a key into side A's panel, refresh
the page — the key reappears (per-`base_url` `localStorage` key).
Switch the endpoint to OpenAI: the key is *not* the HF one (each
base URL has its own slot). Switch back to HF Router: HF key
reappears.
4. **One real run end-to-end.** Paste your HF token, leave A pointed
at `Pratyush-01/physix-3b-rl` and B at `Qwen/Qwen2.5-3B-Instruct`,
pick "Free Fall" from the system dropdown, hit `▶ Run side-by-side`.
Expected within ~30s:
- Both panels show a trajectory plot with predicted overlays.
- Per-side reward strip ticks turn-by-turn.
- Once both finish, the "Scoreboard" banner at the bottom shows
`Winner: A` (the trained model should beat raw Qwen on Free Fall
after a few turns; if it doesn't, that's a real signal worth
investigating — not a UI bug).
5. **401 surfaces correctly.** Clear the API key on side A and
re-run. The A column shows an amber error row containing
"'Make calls to Inference Providers' fine-grained permission".
Side B (with its own key still set) keeps running normally — the
two sides are independent.
6. **404 surfaces correctly.** Type a clearly bogus model id like
   `does-not-exist/foo` and run. The error row points at the model
   card's "Deploy → Inference Providers" panel.
## If the trained model 404s
`Pratyush-01/physix-3b-rl` is public, but **HF Inference Providers
only serves models that at least one provider has loaded**. If
`verify_hf_router.py` reports it as `NOT SERVED`, fix it before
shipping:
1. Open the model card at
<https://huggingface.co/Pratyush-01/physix-3b-rl>.
2. Click **Deploy → Inference Providers**. Pick at least one
provider that supports custom Qwen2.5-3B fine-tunes — Featherless
and Together are the most reliable for this model size.
3. Wait ~5 minutes for the provider to warm up the weights.
4. Re-run `python scripts/verify_hf_router.py` — the `NOT SERVED`
   line should now be `OK`. (The mapping check sketched below answers
   the same question without spending a completion.)
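To check serving status without a completion call, you can ask the Hub
API which providers have the model mapped. This sketch assumes the
Hub's `expand[]=inferenceProviderMapping` query parameter and its
response shape, neither of which is pinned by this repo's tests:

```python
# Ask the Hub which Inference Providers currently map this model.
# Assumes the Hub API's `expand[]=inferenceProviderMapping` field and
# its response shape; check the Hub API docs if this stops working.
import os

import requests

MODEL_ID = "Pratyush-01/physix-3b-rl"

token = os.environ.get("HF_TOKEN")
resp = requests.get(
    f"https://huggingface.co/api/models/{MODEL_ID}",
    params={"expand[]": "inferenceProviderMapping"},
    headers={"Authorization": f"Bearer {token}"} if token else {},
    timeout=30,
)
resp.raise_for_status()
mapping = resp.json().get("inferenceProviderMapping") or {}
if mapping:
    for provider, info in mapping.items():
        print(f"{provider}: {info.get('status', 'unknown')}")
else:
    print(f"No provider currently serves {MODEL_ID}; expect a 404 in the demo.")
```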
If no provider will load it, the fallback demo story is still
strong: keep side A on the SFT checkpoint
(`Pratyush-01/physix-3b-sft-merged`) or the untuned Qwen, and
explain in the writeup that the trained checkpoint runs locally
via Ollama / vLLM. The `:fastest` model-id suffix is also worth
trying — HF will pick whichever provider can serve it first.
## Pre-deploy gate
Don't merge to `main` if any of these regress:
- [ ] `pytest tests/test_providers.py tests/test_providers_hf.py` is green
- [ ] `python scripts/verify_hf_router.py --skip-episode` reports
≥1 served model in the suggestion list
- [ ] At least one of `Pratyush-01/physix-3b-rl` or
`Pratyush-01/physix-3b-sft-merged` is served (the comparison
story collapses if neither trained variant works)
- [ ] The section-C browser walkthrough above completes without
      surprises
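If you want the scripted half of this gate as one command, a wrapper
along these lines works. The file name `scripts/predeploy_gate.py` is
only a suggestion, and it assumes `verify_hf_router.py --skip-episode`
exits nonzero when no suggested model is served; if it always exits 0,
parse its output instead. The browser walkthrough stays manual.

```python
# scripts/predeploy_gate.py (hypothetical): chains the two scripted
# gate checks and exits nonzero on the first failure, so CI or a
# pre-push hook can block a regressing merge.
import subprocess
import sys

CHECKS = [
    ["python", "-m", "pytest",
     "tests/test_providers.py", "tests/test_providers_hf.py", "-q"],
    # Assumes the verifier exits nonzero when no suggested model is
    # served; if it always exits 0, parse its output instead.
    ["python", "scripts/verify_hf_router.py", "--skip-episode"],
]

for cmd in CHECKS:
    print("::", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"pre-deploy gate failed at: {' '.join(cmd)}")

print("Scripted checks passed. Run the section-C browser walkthrough by hand.")
```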