HF Router verification before shipping the Space

The judges will overwhelmingly run the demo through the Hugging Face Router path. Before deploying any change that touches the connection panel, the provider abstraction, or the server's /interactive/* routes, walk this checklist.

A. Automated (no token required)

cd physix-live
source .venv/bin/activate            # or however you activate
python -m pytest tests/test_providers.py tests/test_providers_hf.py -v

Expected: 20 passed. These tests pin:

  • The HF Router base URL (https://router.huggingface.co/v1).
  • That the OpenAI SDK is constructed with the visitor's api_key, the canonical base_url, and a User-Agent header.
  • That HF_TOKEN env-var fallback works when the panel field is empty, but only for HF URLs (no leakage to third-party providers).
  • That a provider rejecting response_format={"type":"json_object"} with a 400 transparently retries without it.
  • That 401, 404, connection errors, and timeouts each surface a hint that points at the right remediation step.
  • That a full episode through /interactive/sessions/{id}/llm-step passes the visitor's base_url + model + api_key byte-for-byte to the OpenAI SDK call.
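The JSON-mode fallback pinned above can be sketched as follows. The function and stub names are illustrative, not the actual provider-abstraction API:

```python
def chat_with_json_fallback(client, **kwargs):
    """Ask for JSON mode first; if the provider rejects response_format
    with a 400, transparently retry the identical request without it.
    (Sketch only; the real provider abstraction may differ.)"""
    try:
        return client.chat.completions.create(
            response_format={"type": "json_object"}, **kwargs
        )
    except Exception as exc:
        if getattr(exc, "status_code", None) == 400:
            # Provider rejected JSON mode; same request, minus the knob.
            return client.chat.completions.create(**kwargs)
        raise


class _StubCompletions:
    """Fake endpoint that 400s on JSON mode, for demonstration only."""

    def create(self, response_format=None, **kwargs):
        if response_format is not None:
            err = Exception("response_format not supported")
            err.status_code = 400
            raise err
        return {"model": kwargs.get("model"), "json_mode": False}


class _StubClient:
    class chat:
        completions = _StubCompletions()


# The retry is invisible to the caller: one call, one completion back.
resp = chat_with_json_fallback(_StubClient(), model="demo", messages=[])
```

Retrying without the knob is safe because the prompt itself already asks for JSON; response_format is only a belt-and-braces constraint.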

B. Real-network probe (HF_TOKEN required)

# Terminal 1 — the demo backend.
python -m physix.server.app --host 127.0.0.1 --port 8000

# Terminal 2 — the verifier.
export HF_TOKEN=hf_...   # needs 'Make calls to Inference Providers' scope.
python scripts/verify_hf_router.py

What to look for:

  1. Step 1: HF_TOKEN check must end with ✓ HF_TOKEN is valid and has Inference Providers scope. If it fails, fix the token scope at https://huggingface.co/settings/tokens before going further.

  2. Step 2: model probes. Each of the four HF models we suggest in the connection panel gets a 4-token completion attempt. Any model reported as NOT SERVED (404) will appear broken in the demo unless you fix it on the model card (see "If the trained model 404s" below).

  3. Step 3: live episode. The verifier drives one real PhysiX episode. Per-turn output should look like:

    turn 1: match=0.42  format=1.00  total=0.46  (3.2s)
      equation: 'd2y/dt2 = -9.81'
    turn 2: match=0.81  format=1.00  total=0.69  (2.8s)
      equation: 'd2y/dt2 = -9.81 + 0.04 * vy**2'
    

    format=1.00 on every turn means the prompt + parser + verifier pipeline is healthy. format=0.00 means the model returned something unparseable — usually a hint that response_format={"type":"json_object"} was silently ignored and the model wasn't SFT-warmed. Acceptable for raw-Qwen baselines, not for the trained PhysiX model.
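A minimal reading of the format score, assuming the reply is expected to be a JSON object carrying an 'equation' string (the real parser in the PhysiX pipeline is richer than this sketch):

```python
import json


def format_score(reply_text: str) -> float:
    """Return 1.0 if the reply parses as a JSON object with an
    'equation' string, else 0.0. Illustrative reduction of the real
    parser; the expected schema is an assumption here."""
    try:
        payload = json.loads(reply_text)
    except (TypeError, json.JSONDecodeError):
        return 0.0
    if not isinstance(payload, dict):
        return 0.0
    return 1.0 if isinstance(payload.get("equation"), str) else 0.0
```

A model that ignores JSON mode and answers in prose scores 0.0, which is exactly the format=0.00 symptom described above.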

C. Browser walkthrough

With both backend and frontend running (make dev):

  1. Panel renders four endpoints. Open the page; both the A and B panels show the same four-option Endpoint dropdown: Ollama (localhost:11434) · Hugging Face Router · OpenAI · Custom. The default is Hugging Face Router on both sides.

  2. Model field adapts per endpoint.

    • With Hugging Face Router selected, the Model field is a text input. Click in it: a datalist of four suggestions appears (PhysiX RL, PhysiX SFT, untuned Qwen 3B, larger Qwen 7B).
    • Switch to Ollama: the Model field becomes a hard select. If ollama serve is running, it lists installed tags; if not, it shows an amber-bordered fallback input with the canonical qwen2.5:3b-instruct placeholder.
    • Switch to OpenAI: text input + datalist of gpt-4o-mini, gpt-4o, gpt-4.1-mini.
    • Switch to Custom: text input with no suggestions, plus a new Custom base URL field appears.
  3. API key persistence. Type a key into side A's panel, refresh the page — the key reappears (per-base_url localStorage key). Switch the endpoint to OpenAI: the key is not the HF one (each base URL has its own slot). Switch back to HF Router: HF key reappears.

  4. One real run end-to-end. Paste your HF token, leave A pointed at Pratyush-01/physix-3b-rl and B at Qwen/Qwen2.5-3B-Instruct, pick "Free Fall" from the system dropdown, hit ▶ Run side-by-side.

    Expected within ~30s:

    • Both panels show a trajectory plot with predicted overlays.
    • Per-side reward strip ticks turn-by-turn.
    • Once both finish, the "Scoreboard" banner at the bottom shows Winner: A (the trained model should beat raw Qwen on Free Fall after a few turns; if it doesn't, that's a real signal worth investigating — not a UI bug).
  5. 401 surfaces correctly. Clear the API key on side A and re-run. The A column shows an amber error row containing "'Make calls to Inference Providers' fine-grained permission". Side B (with its own key still set) keeps running normally — the two sides are independent.

  6. 404 surfaces correctly. Type a clearly bogus model id like does-not-exist/foo and run. The error row points at the model card "Deploy → Inference API" panel.

If the trained model 404s

Pratyush-01/physix-3b-rl is public, but HF Inference Providers only serves models that at least one provider has loaded. If verify_hf_router.py reports it as NOT SERVED, fix it before shipping:

  1. Open the model card at https://huggingface.co/Pratyush-01/physix-3b-rl.
  2. Click Deploy → Inference Providers. Pick at least one provider that supports custom Qwen2.5-3B fine-tunes — Featherless and Together are the most reliable for this model size.
  3. Wait ~5 minutes for the provider to warm up the weights.
  4. Re-run python scripts/verify_hf_router.py — the NOT SERVED line should now be OK.

If no provider will load it, the fallback demo story is still strong: keep side A on the SFT checkpoint (Pratyush-01/physix-3b-sft-merged) or the untuned Qwen, and explain in the writeup that the trained checkpoint runs locally via Ollama / vLLM. The :fastest model-id suffix is also worth trying — HF will pick whichever provider can serve it first.
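For spot-checking a single model id without running the full verifier, the per-model probe can be sketched like this. The classification strings mirror the verifier's reported statuses, but the function itself is illustrative:

```python
def probe_model(client, model_id: str) -> str:
    """Attempt a tiny 4-token completion and classify the outcome.
    Sketch only; verify_hf_router.py's real logic may differ."""
    try:
        client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=4,
        )
        return "OK"
    except Exception as exc:
        status = getattr(exc, "status_code", None)
        if status == 404:
            return "NOT SERVED"  # no Inference Provider has loaded it
        if status == 401:
            return "BAD TOKEN"   # missing key or wrong scope
        raise
```

Against the real Router, client would be OpenAI(base_url="https://router.huggingface.co/v1", api_key=os.environ["HF_TOKEN"]) from the openai SDK.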

Pre-deploy gate

Don't merge to main unless all of these hold:

  • pytest tests/test_providers.py tests/test_providers_hf.py is green
  • python scripts/verify_hf_router.py --skip-episode reports ≥1 served model in the suggestion list
  • At least one of Pratyush-01/physix-3b-rl or Pratyush-01/physix-3b-sft-merged is served (the comparison story collapses if neither trained variant works)
  • Browser walkthrough section C above completes without surprises