# PolyGuard Space UI — demo recording script (shot-by-shot)

Use this document while screen-recording the Hugging Face Space (or local Docker). Target length: **8–14 minutes** for a full pass, or **3–5 minutes** for a highlights reel.

---

## Before you hit record

1. **Open the Space** in a clean browser profile or incognito (fewer extensions → fewer glitches).
2. **Set resolution**: 1920×1080 or 1440×900; browser zoom **100%**.
3. **Fullscreen** the Space iframe or use HF “Open in new tab” so the URL bar shows the Space domain.
4. **Wait for cold start**: first load may download the model bundle (several minutes). The **Event Log** and **Model Truth** panel will tell you if the policy failed to load (heuristic fallback is still usable for env steps).
5. **Optional**: hide mouse cursor in OBS if you prefer; otherwise move slowly and pause **2 seconds** on each panel after major clicks.

**Primary Space (product):** `https://huggingface.co/spaces/TheJackBright/polyguard-openenv-workbench`
Runtime: nginx fronts the **product API** (default `8200`) and **OpenEnv service** (`8100`); see `docker/space/entrypoint.sh`.

---

## Where the model lives (Qwen and artifacts)

This matters for what you say on camera.

| Location | What it is |
| --- | --- |
| **On the Space container** | Working directory `/app` (see `entrypoint.sh`: `cd /app`). |
| **Downloaded bundle** | If `checkpoints/active/grpo_adapter/adapter_config.json` is missing at boot, `scripts/install_hf_active_bundle.py` pulls the **HF usable model bundle** into `checkpoints/active/`. |
| **Typical layout after install** | `checkpoints/active/active_model_manifest.json` — which artifact is active (often **GRPO adapter** on top of base). |
| **Weights** | `checkpoints/active/grpo_adapter/` (LoRA/PEFT), optionally `checkpoints/active/merged/` (full merged weights), `checkpoints/active/sft_adapter/`. |
| **Base model name** | Usually **`Qwen/Qwen2.5-0.5B-Instruct`** as the Transformers base for adapters (set via env e.g. `POLYGUARD_HF_MODEL`). |

**What the UI proves:** the **Model Truth** panel calls **`GET /policy/model_status`** (product API). It shows `model_id` / `base_model`, `run_id`, `preferred_artifact` / `loaded_source`, and availability flags. Say on camera: *“This is live from the API, not hard-coded in the frontend.”*
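To make the "live from the API" claim concrete, here is a minimal sketch of the strict check the panel applies before labeling "Qwen 0.5B active". The response shape mirrors the fields listed above; the exact schema (and the example `run_id`) is hypothetical.

```python
import re

def qwen_verified(status: dict) -> bool:
    """Sketch of the strict 'Qwen 0.5B active' check: enabled + active +
    every availability stage true + model id matching Qwen2.5-0.5B-Instruct."""
    model_id = status.get("model_id") or status.get("base_model") or ""
    return (
        bool(status.get("enabled"))
        and bool(status.get("active"))
        and all(status.get("availability", {}).values())
        and re.search(r"Qwen2\.5-0\.5B-Instruct", model_id) is not None
    )

# Hypothetical response shaped like the fields the Model Truth panel displays.
status = {
    "enabled": True,
    "active": True,
    "model_id": "Qwen/Qwen2.5-0.5B-Instruct",
    "run_id": "run_demo",            # illustrative value only
    "loaded_source": "grpo_adapter",
    "availability": {"base": True, "adapter": True},
}
print(qwen_verified(status))  # True
```

Any failed availability stage (or a non-Qwen base id) flips the label to "Qwen not verified".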

---

## UI map (what appears on screen)

| Region | Purpose |
| --- | --- |
| **Hero** (“PolyGuard neural safety cockpit”) | Marketing copy + quick stats. |
| **Top bar** | **Agent Workbench** vs **Env Explorer**, **Task** dropdown, **Reset Episode**, **Q Tips**. |
| **Status chips** | “Live” / model line; in Env mode one chip reads **ws env** (WebSocket to OpenEnv). |
| **Model Truth** | Qwen / artifact / run / availability. |
| **Advanced strip** | Only if Task = **Advanced** — pick raw `difficulty` + `sub_environment`. |
| **Episode Overview** | Mode, task, difficulty, environment, step budget, last reward, patient id, **Patient Summary**, **Risk Delta**. |
| **Candidate Actions** | Legal moves: `candidate_id`, action type, target/replacement, estimated safety delta (or **Blocked**). |
| **Action Console** | Confidence, rationale, **Submit** vs **Run Agent** (Agent mode only for Run Agent). |
| **Reward Channels** | Bars for total + primary + component scores (see below). |
| **Current Medications** | Cards from observation. |
| **Action History / Warnings** | Step trace and env warnings. |
| **Decision / Explanation / Evidence** | **Agent mode only** (filled after API steps that return those fields). |
| **Event Log** | Human-readable trace of resets, steps, rewards, errors. |

---

## Feature encyclopedia — every panel, branch, and agent

Use this section as a **script appendix** or **judge handout**. It mirrors the React workbench in `app/ui/frontend/src/App.tsx`, the API in `app/api/`, and the orchestrator in `app/agents/orchestrator.py`.

### A. How the Space is wired (under the hood)

| Piece | Role |
| --- | --- |
| **Browser → nginx** | HF Space exposes one origin; nginx routes paths. |
| **Product API** | Vite uses `API_BASE` (default **`/api`**). FastAPI serves catalog, reset, step_candidate, orchestrate, model_status, reward_breakdown, etc. |
| **OpenEnv HTTP/WS** | `ENV_BASE` defaults to **same origin** on Spaces (not localhost). Web UI opens **`ws(s)://<origin>/ws`** for Env Explorer. |
| **Two Python processes** | `entrypoint.sh` starts **uvicorn** for `app.env.fastapi_app` (env, port **8100**) and **uvicorn** for `app.api` (product API, port **8200**). Agent mode reset/step still use the **API’s** in-process `PolyGuardEnv`; Env mode uses the **separate** env service over WebSocket. |
| **Important** | Agent and Env UIs maintain **separate React state** (`agentObservation` vs `envObservation`). Toggling mode **clears the Event Log** and clears the inactive branch’s episode state so you always know which backend path you are exercising. |

### B. Hero (“PolyGuard neural safety cockpit”)

| Stat | Source | What to say on camera |
| --- | --- | --- |
| **Runtime** | `mode === "agent"` → “Agent Workbench”; else “Env Explorer”. | “This is which transport I am using right now.” |
| **Scenario** | Human label for current `taskId` from catalog presets or Advanced. | “Which curriculum preset is bound to difficulty + sub-environment.” |
| **Candidates** | `candidate_action_set.length` from the **active** observation. | “How many legal moves the env is offering after the last reset/step.” |
| **Reward** | Last scalar reward for the active branch (`null` → shown as `-`). | “Verifier scalar after the last step in this mode only.” |

### C. Top bar — every control

| Control | Behavior |
| --- | --- |
| **Agent Workbench** | Sets `mode` to `agent`. Clears env state, event log, error; clears agent panels if switching from env (see `handleModeChange`). |
| **Env Explorer** | Sets `mode` to `env`. Clears agent-specific observation/reward/decision/evidence. |
| **Task** `<select>` | Options: each **task preset** from `GET /env/catalog` (`task_presets`), plus **Advanced**. Changing a preset updates internal `difficulty` + `sub_environment` to match the preset. |
| **Reset Episode** | **Agent:** `POST /env/reset` with body from preset (`{ task_id }`) or `{ difficulty, sub_environment }`. Refreshes **Model Truth** first. Clears reward breakdown, decision, explanation, evidence, sets default candidate. **Env:** WebSocket `reset` with `{ difficulty, sub_environment }` only (no `task_id` in WS path—preset is flattened to those two fields). **Always** clears `events` at start of reset handler, then appends one “Reset … in agent/env” line. |
| **Q Tips** | Opens modal walkthrough; highlights DOM nodes with `[data-guide="…"]`. **Skip** stores `polyguard.qtips.v2.seen` in localStorage so first visit auto-opens tips. |
| **Status chips** | First chip: **Live** if observation loaded and not done, else **Complete** / **Ready**. Second chip: in Agent mode, derived from **`modelSignal()`** (Qwen verified or not); in Env mode shows **`ws env`**. |

### D. Model Truth panel — field by field

Data from **`GET /policy/model_status`** (`PolicyProviderRouter` / `active_model_status`).

| Field in UI | Typical meaning |
| --- | --- |
| **Heading label** | “Qwen 0.5B active” only when Space config matches a strict check (enabled + active + availability + model id regex for **Qwen2.5-0.5B-Instruct**); else “Qwen not verified” or Ollama-specific text if Ollama wins locally. |
| **Detail paragraph** | Human sentence: model name, artifact, `run_id`, optional **load_error**. |
| **Model** | `model_id` or `base_model` — HF id of the loaded or configured base. |
| **Run** | `run_id` from manifest / sweep activation (which training bundle). |
| **Artifact** | `loaded_source` or `preferred_artifact` — e.g. **`grpo_adapter`**, **`merged`**, **`sft_adapter`**. |
| **Availability** | Key/value pairs from `availability` dict (which load stages succeeded). |

**Ollama branch (local dev):** If `status.ollama.enabled && available`, the UI labels **Ollama Qwen active** and mentions `POLYGUARD_PROVIDER_PREFERENCE` order. Spaces Dockerfile sets **`POLYGUARD_ENABLE_OLLAMA=false`** by default.

### E. Advanced strip (Task = Advanced)

Only rendered when `taskId === "advanced"`. Two selects:

1. **Difficulty:** `easy` \| `medium` \| `hard` — passed to reset as `difficulty`.
2. **Environment:** every string in `catalog.sub_environments` (DDI, BANDIT_MINING, REGIMEN_RISK, PRECISION_DOSING, LONGITUDINAL_DEPRESCRIBING, WEB_SEARCH_MISSING_DATA, ALTERNATIVE_SUGGESTION, NEW_DRUG_DECOMPOSITION).

**What each sub-environment stresses (one line each):**

| Sub-environment | What the episode emphasizes |
| --- | --- |
| **DDI** | Drug–drug interaction exposure and pair risk. |
| **BANDIT_MINING** | Policy / bandit exploration style scenario (see preset “Bandit Mining”). |
| **REGIMEN_RISK** | Overall regimen burden and safety tradeoffs. |
| **PRECISION_DOSING** | Dose buckets, organ-sensitive flags in observation. |
| **LONGITUDINAL_DEPRESCRIBING** | Multi-step taper / stop sequences over time. |
| **WEB_SEARCH_MISSING_DATA** | Rewards process fidelity for evidence-fetch actions. |
| **ALTERNATIVE_SUGGESTION** | Substitution / alternative action types rewarded more. |
| **NEW_DRUG_DECOMPOSITION** | Hard track: decompose novel drug string into components. |

### F. Episode Overview — every KPI and subsection

**KPI grid (always eight rows):**

| KPI | Source |
| --- | --- |
| **Mode** | Literal “Agent Workbench” or “Env Explorer”. |
| **Task** | Preset label or “Advanced”. |
| **Difficulty** | `observation.deterministic_contract.difficulty` or `-`. |
| **Environment** | `deterministic_contract.sub_environment` or `observation.sub_environment`. |
| **Step Budget** | `observation.step_budget_remaining`. |
| **Last Reward** | Active branch’s last reward (after reset, Agent clears to `-` until first step). |
| **Patient** | `patient_summary.patient_id` or `patient_summary.id`. |
| **Status** | Complete if `done`, else Live if observation exists, else Ready. |

**Patient Summary `<dl>`:** First **8** keys of `observation.patient_summary` (keys humanized: underscores → spaces, title case). Typical keys include demographics, allergies, high-level clinical flags—whatever the backend puts on `PolyGuardObservation`.

**Risk Delta `<dl>`:** First **8** entries of `observation.burden_score_summary` — burden-related scalars the env uses for reward deltas.
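The "first 8 keys, humanized" rule used by both `<dl>` blocks above is easy to restate in code. This is a sketch of the rule as described, not the actual `App.tsx` helper; the example keys are hypothetical.

```python
def humanize(key: str) -> str:
    # Underscores become spaces, then title case (the rule described above).
    return key.replace("_", " ").title()

def first_eight(summary: dict) -> list[tuple[str, object]]:
    # Take the first 8 entries in insertion order, with humanized labels.
    return [(humanize(k), v) for k, v in list(summary.items())[:8]]

# Hypothetical patient summary — real keys come from PolyGuardObservation.
summary = {"patient_id": "p_001", "age_years": 82, "renal_function_flag": "reduced"}
print(first_eight(summary))
```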

### G. Candidate Actions list — each column

Each row is one **`CandidateAction`** from `candidate_action_set`.

| Column / concept | Meaning |
| --- | --- |
| **`candidate_id`** | Stable id (e.g. `cand_…`) — must match when submitting. |
| **Action label** | Humanized `action_type` (STOP_DRUG, SUBSTITUTE_WITHIN_CLASS, …). |
| **Third column** | `target_drug` **or** `replacement_drug` **or** `mode` — whichever is most informative. |
| **Right column** | `estimated_safety_delta` formatted to 3 decimals, or **Blocked** if `legality_precheck === false`. |
| **Disabled rows** | You cannot select illegal candidates; click does nothing. |
| **Default selection** | **Agent:** first candidate in list. **Env:** first **legal** candidate that is not `KEEP_REGIMEN` and not `REQUEST_*`, else first legal non–KEEP_REGIMEN, else first in list (`defaultCandidateForMode`). |

**Hidden fields you can mention if showing JSON elsewhere:** `dose_bucket`, `taper_days`, `monitoring_plan`, `evidence_query`, `new_drug_name`, `candidate_components`, `uncertainty_score`, `rationale_tags`, `required_monitoring`, `burden_delta`, `disease_stability_estimate`.

### H. Action Console — every input and button

| UI element | Effect |
| --- | --- |
| **Type / Mode / Target / Replacement / Dose / Uncertainty** | Read-only snapshot of the **currently selected** candidate. |
| **Confidence** | Number input **0.001–0.999** step 0.001; sent as `confidence` on **Submit Candidate** (Agent) or embedded in WS payload (Env). |
| **Rationale** | Free text → `rationale_brief` / rationale on the action. |
| **Submit Candidate** (Agent) | Calls `POST /env/step_candidate` with `{ candidate_id, confidence, rationale_brief }`. API finds matching legal action and calls `env.step`. |
| **Submit Env Step** (Env) | Same confidence/rationale + full action payload built by `buildActionPayload` → WS `step`. |
| **Run Agent** | **Only when** `mode === "agent"` **and** observation exists **and** not `done`. Calls `POST /agents/orchestrate` with empty JSON body. **Disabled** in Env mode. |
| **Done notice** | If `done`, shows which mode completed and `termination_reason` from `info` if present. Primary button becomes **Reset Episode** (shortcut). |

### I. Reward Channels — every bar (exact keys)

The UI renders **exactly these keys** in order (`REWARD_KEYS` in `App.tsx`, **14** rows):

| # | Key | Role |
| --- | --- | --- |
| 1 | `total_reward` | Weighted aggregate of component scores (`aggregate_rewards` in `reward_scaling.py`). |
| 2 | `primary_safety_legality` | Roll-up: legality, candidate alignment, anti-cheat, uncertainty calibration (`reward_router.compute_primary_reward_channels`). |
| 3 | `primary_clinical_improvement` | Roll-up: safety delta, burden improvement, disease stability. |
| 4 | `primary_dosing_quality` | Roll-up: dosing quality + abstention quality. |
| 5 | `primary_process_integrity` | Roll-up: format compliance, efficiency, process fidelity, explanation grounding. |
| 6 | `legality_score` | Action legal per safety verifier. |
| 7 | `safety_delta_score` | Movement on severe DDI / risk proxy vs pre-step state. |
| 8 | `burden_improvement_score` | Medication burden before vs after. |
| 9 | `disease_stability_score` | Stability heuristic vs disruptive action types. |
| 10 | `dosing_quality_score` | Dose-mode and bucket appropriateness. |
| 11 | `process_fidelity_score` | Follows intended workflow for sub-environment (e.g. fetch evidence when required). |
| 12 | `explanation_grounding_score` | Rationale present / grounded. |
| 13 | `anti_cheat_score` | Collapses when anti-cheat triggers. |
| 14 | `uncertainty_calibration_score` | Confidence vs uncertainty alignment. |

**Note:** `total_reward` is row 1; rows 2–5 are **primary** channels; rows 6–14 are **exposed component** scores. Other components (`format_compliance_score`, `efficiency_score`, `candidate_alignment_score`, `abstention_quality_score`) still exist **in the backend** `RewardBreakdown` and feed primaries + total, but this UI **does not** give them their own bar rows.

Bars show **`-`** when the value is missing (no step yet or breakdown not returned). Bar width = value × 100% with value clamped to `[0.001, 0.999]`.
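The clamp-and-scale rule for bar widths, as a one-function sketch of what is described above:

```python
def bar_width_pct(value):
    """Sketch of bar rendering: missing value renders as '-'; otherwise
    clamp to [0.001, 0.999] and use value * 100 as the width percent."""
    if value is None:
        return None  # shown as "-"
    clamped = min(max(value, 0.001), 0.999)
    return round(clamped * 100, 1)

print(bar_width_pct(0.742))  # 74.2
print(bar_width_pct(1.3))    # 99.9 (clamped)
print(bar_width_pct(None))   # None → rendered "-"
```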

**Breakdown source (Agent vs Env):** **Agent:** after a step, the UI prefers `info.reward_breakdown` and may also call **`GET /env/reward_breakdown`**. **Env:** uses `info.reward_breakdown` from the WebSocket step packet; if it is empty, the UI clears the reward panel.

### J. Current Medications cards

Built from `observation.medication_table[]`. Each card:

- **Title:** `drug` / `drug_id` / `name`.  
- **High-risk ribbon:** if `high_risk` or `is_high_risk_elderly` or Beers / warning flags.  
- **Body:** `indication` or `class_name` or `atc_class`.  
- **Meta row:** dose bucket or mg dose; taper vs `monitoring` or `route`.

### K. Action History vs Warnings

| Panel | Source |
| --- | --- |
| **Action History** | `observation.action_history` — each item shows step index and `action_type` / `candidate_id` / reward snippet. |
| **Warnings** | `observation.warning_summary` — list of human-readable env warnings (DDIs, constraints, etc.). |

### L. Decision / Explanation / Evidence (Agent only)

Rendered as JSON `<pre>` blocks:

| Title | When populated | Content origin |
| --- | --- | --- |
| **Decision** | Agent mode only. | **`final_action`** on the packet. For **`step_candidate`**, the API returns the standard **step** payload — **typically no `final_action` field**, so this panel may stay **empty after manual submit**. For **`orchestrate`**, **`final_action`** is the **`PolyGuardAction`** after critic (what actually hit `env.step`). |
| **Explanation** | Agent mode only. | **`explanation`** — output of **`ExplainerAgent`** after the step (`orchestrate` returns it). Usually **empty** after raw `step_candidate` unless API adds it. |
| **Evidence** | Agent mode only. | **`evidence`** key on packet. **`orchestrate`** returns **`evidence_out`** from **`EvidenceAgent.run(state)`** (retrieval / web-fallback bundle). **`step_candidate`** does not attach orchestrator evidence — panel often **empty** on manual clicks. |

**Demo takeaway:** Tell viewers: *“To populate Decision / Explanation / Evidence in the UI, use **Run Agent** (orchestrate). Manual **Submit Candidate** updates the env and rewards but does not replay the full multi-agent JSON into those three panels.”*

### M. Event Log vs Q Tips

| Feature | Behavior |
| --- | --- |
| **Event Log** | Prepends timestamped strings: resets, each step’s reward line, errors. **Capped** at 24 lines. Cleared when you click **Reset Episode** (handler starts with `setEvents([])` then appends) — *not* the same as mode switch clearing. |
| **Q Tips** | 10-step overlay; does not mutate env. |
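The Event Log's prepend-and-cap behavior can be sketched in a few lines (assuming the cap of 24 described above):

```python
MAX_EVENTS = 24

def push_event(events: list[str], line: str) -> list[str]:
    """Sketch of the Event Log update: newest line first, capped at 24 entries."""
    return ([line] + events)[:MAX_EVENTS]

events: list[str] = []
for i in range(30):
    events = push_event(events, f"step {i}")
print(len(events), events[0])  # 24 step 29
```

After 30 pushes only the newest 24 lines survive, newest at the top.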

### N. Orchestrator — every agent in order (`Run Agent`)

When **`POST /agents/orchestrate`** runs, `Orchestrator.run_step` executes:

| Step | Agent class | What it does (operator language) |
| --- | --- | --- |
| 1 | **`MedRecAgent`** | Summarizes current medication list / reconciliation view for downstream modules. Output key: `medrec`. |
| 2 | **`EvidenceAgent`** | Retrieves **local evidence** (and optional web fallback) for missing or thin context. Shown in UI **`evidence`** when orchestrating. |
| 3 | **`GraphSafetyAgent`** | Graph-style **DDI / duplicate therapy** style signals. Output: `graph`. |
| 4 | **`DosingAgent`** | Flags **dose-sensitive** windows and dosing opportunities. Feeds **`dosing_active`** into supervisor. |
| 5 | **`CandidateAgent`** | Wraps env **candidate builder** — produces the legal `CandidateAction` list. |
| 6 | **`SupervisorAgent`** | Chooses planner **mode**: regimen vs dose vs **REVIEW** (conservative routing). |
| 7 | **Contextual bandit** | **`ContextualBanditPolicy`** (LinUCB or Thompson sampling via `POLYGUARD_BANDIT_ALGO`) proposes **top-k** (`POLYGUARD_BANDIT_TOP_K`) candidates for the planner to consider. |
| 8 | **`PlannerAgent`** | Calls **`PolicyProviderRouter.select_candidate`** — this is where **Transformers + Qwen + PEFT** (or Ollama, or **safety ranker fallback**) picks a **`candidate_id`** and rationale. |
| 9 | **`CriticAgent`** | Safety veto / repair. May replace proposed action with a safer **`final_action`**. |
| 10 | **Replan / debate** (optional) | If `coordination_mode` is `replan_on_veto` or `lightweight_debate` and critic rejects, planner may rerun on **review** candidates; `debate_rounds` increments. |
| 11 | **`PolyGuardEnv.step`** | Commits **`final_action`**, returns `observation`, `reward`, `done`, `info`. |
| 12 | **Bandit `update`** | If the chosen candidate was in the bandit pool, **updates bandit statistics with the reward** (learning signal for next orchestrate). |
| 13 | **`ExplainerAgent`** | Builds **`explanation`** object for audit / UI. |

**Environment variables (mention for power users):**

| Variable | Effect |
| --- | --- |
| **`POLYGUARD_POLICY_STACK`** | `llm+bandit` (default): planner sees **bandit-shortlisted** candidates. `llm-only`: all supervisor-filtered candidates. `bandit-only`: **no LLM** — first bandit pick with fixed rationale. |
| **`POLYGUARD_BANDIT_*`** | Algorithm, alpha, epsilon, seed, top-k. |
| **`POLYGUARD_PROVIDER_PREFERENCE`** | e.g. `transformers` vs `ollama` order. |
| **`POLYGUARD_ENABLE_ACTIVE_MODEL`** | Must be true on Space for bundle path; **`POLYGUARD_HF_MODEL`** sets base id for adapters. |

### O. Qwen and fallbacks (planner path)

`PolicyProviderRouter` (`app/models/policy/provider_runtime.py`):

1. Builds a **JSON instruction** listing candidates and asks for `candidate_id=…; rationale=…`.  
2. Tries providers in **`POLYGUARD_PROVIDER_PREFERENCE`** (default **Transformers** on Space).  
3. Parses model text for a legal `candidate_id`; on failure uses **`safety_ranker`** deterministic ordering.
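The parse-then-fallback step can be sketched as follows. The `candidate_id=…; rationale=…` format is from the doc; the regex and the stand-in fallback (first legal id instead of the real safety ranker ordering) are illustrative assumptions.

```python
import re

def parse_planner_reply(text: str, legal_ids: list[str]):
    """Sketch: extract 'candidate_id=...; rationale=...' from model text;
    if no legal id is found, fall back to a deterministic choice (here the
    first legal id, standing in for the safety_ranker ordering)."""
    m = re.search(r"candidate_id=(\S+?);\s*rationale=(.*)", text)
    if m and m.group(1) in legal_ids:
        return m.group(1), m.group(2).strip(), "llm"
    return legal_ids[0], "safety_ranker fallback", "safety_ranker"

legal = ["cand_0", "cand_3"]
print(parse_planner_reply("candidate_id=cand_3; rationale=lower DDI exposure", legal))
print(parse_planner_reply("sorry, I cannot decide", legal))
```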

**So:** Even without Qwen load, **Run Agent** still completes using **ranker / bandit** — mention that if Model Truth is red.

### P. Full observation contract (API / types)

The TypeScript type `EnvObservation` (`lib/types.ts`) lists fields the backend **may** send. The main workbench **highlights** patient summary, medication table, candidates, burden summary, action history, warnings, step budget, and sub-environment. **Not all fields get their own panel** — if you open browser DevTools → Network → `reset` / `step` response, you can narrate extras:

| Field | Typical use |
| --- | --- |
| `comorbidity_summary` | Comorbidity list for the patient. |
| `organ_function_summary` | eGFR / hepatic flags for dosing scenarios. |
| `labs_vitals_summary` | Labs relevant to risk scoring. |
| `graph_safety_summary` | Aggregated graph / DDI context. |
| `precision_dosing_flags` | Tags when sub-environment is dosing-heavy. |
| `unresolved_conflicts` | Specialist conflict strings. |
| `abstention_indicators` | When the env suggests review / abstain. |
| `deterministic_contract` | Difficulty + sub-environment + scenario id contract for reproducibility. |

### Q. Q Tips — copy for each slide (matches `GUIDE_STEPS`)

| # | Title | Body (read aloud or paraphrase) |
| --- | --- | --- |
| 1 | Start here | PolyGuard is an interactive OpenEnv workbench; top bar picks runtime, scenario, reset. |
| 2 | Choose the runtime | Agent Workbench = REST API + reward breakdown + Qwen path; Env Explorer = WebSocket to OpenEnv. |
| 3 | Pick a scenario | Presets load real patient/regimen state from backend. |
| 4 | Check the model truth | `/policy/model_status`; Qwen only “verified” when API says adapters live. |
| 5 | Read the episode state | Task, patient, step budget, reward, risk delta from latest env response. |
| 6 | Review legal actions | Candidate rows = legal moves; inspect safety delta and mode. |
| 7 | Submit or ask the agent | Submit Candidate vs Run Agent; check model panel before claiming LLM. |
| 8 | Inspect reward channels | Real scorer output per channel; empty = no step yet. |
| 9 | Track regimen changes | Medication cards + history + warnings = not canned. |
| 10 | Follow the run | Event log shows resets, steps, rewards, errors plainly. |

---

## Agent Workbench vs Env Explorer (say this exactly on camera)

| | **Agent Workbench** | **Env Explorer** |
| --- | --- | --- |
| **Reset** | `POST /env/reset` with task preset (e.g. `{ "task_id": "easy_screening" }`) via product API. | WebSocket `reset` message to OpenEnv **`/ws`** with `{ difficulty, sub_environment }`. |
| **Submit** | `POST /env/step_candidate` — product API resolves `candidate_id` + your confidence + rationale into a full action and steps the **same** in-process `PolyGuardEnv`. | WebSocket `step` — payload built from selected candidate; talks **directly** to OpenEnv service. |
| **Run Agent** | **`POST /agents/orchestrate`** — runs the full **orchestrator** (med rec, evidence, graph, dosing, candidates, supervisor, bandit, **planner/LLM**, critic, env step, explainer). | Button **disabled** — there is no orchestrator path over raw WS-only mode in this UI. |
| **Decision / Explanation / Evidence panels** | **Populated** after orchestrate or after steps that echo `final_action` / `explanation` / `evidence` (orchestrate returns rich `evidence` from `EvidenceAgent` pipeline). | **Always empty** in the UI by design — those panels are `null` in Env mode (`App.tsx` only passes agent-mode state to DetailPanels). |
| **Reward breakdown** | From step `info.reward_breakdown` or fallback `GET /env/reward_breakdown`. | From WS step packet `info.reward_breakdown` when present. |
| **Switching mode** | Clears the **Event Log** and resets the other mode’s transient state — mention that so viewers don’t think it’s a bug. | Same. |

**One-liner for judges:** *“Agent Workbench is the full product API plus optional LLM-orchestrated policy; Env Explorer is the raw OpenEnv WebSocket contract for the same underlying environment.”*
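The two reset paths in the table above can be sketched side by side. The payload shapes follow the doc (REST `task_id` vs WS `difficulty` + `sub_environment`); the wrapper dict keys here are illustrative, not the wire format.

```python
def build_reset(mode: str, preset: dict) -> dict:
    """Sketch of the two reset paths: Agent sends the preset's task_id over
    REST; Env flattens the preset to difficulty + sub_environment over WS."""
    if mode == "agent":
        return {"method": "POST", "path": "/env/reset",
                "body": {"task_id": preset["task_id"]}}
    return {"ws": "reset",
            "payload": {"difficulty": preset["difficulty"],
                        "sub_environment": preset["sub_environment"]}}

preset = {"task_id": "easy_screening", "difficulty": "easy", "sub_environment": "DDI"}
print(build_reset("agent", preset))
print(build_reset("env", preset))
```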

---

## Reward channels — what they mean and how they’re computed (talk track)

Rewards are **verifier-backed**, **bounded** to roughly **`[0.001, 0.999]`** (3 decimal places in UI).

### Four primary channels (high level)

These are **averages of component groups** (`compute_primary_reward_channels` in `app/env/reward_router.py`):

1. **`primary_safety_legality`** — legality, candidate id alignment, anti-cheat, uncertainty calibration.  
2. **`primary_clinical_improvement`** — safety delta vs severe pairs, burden improvement, disease stability.  
3. **`primary_dosing_quality`** — dosing quality + abstention (e.g. appropriate review requests under uncertainty).  
4. **`primary_process_integrity`** — format compliance, efficiency (step budget), process fidelity, explanation grounding.

### Components (examples — `compute_reward_breakdown`)

The environment builds scores such as:

- **`legality_score`**: high if the action is legal per safety report.  
- **`safety_delta_score` / `burden_improvement_score`**: from **before/after** burden and severe DDI pair counts (`_delta_to_reward`).  
- **`anti_cheat_score`**: collapses if anti-cheat flags the trajectory.  
- **`uncertainty_calibration_score`**: penalizes overconfidence vs modeled uncertainty.  
- **Sub-environment tweaks**: e.g. `WEB_SEARCH_MISSING_DATA` boosts process fidelity when using `FETCH_EXTERNAL_EVIDENCE`; `NEW_DRUG_DECOMPOSITION` rewards decomposition actions with components.

Then components are **scaled/clamped**, **primary channels** recomputed, and **`total_reward`** = weighted aggregate (`aggregate_rewards`).

**Demo line:** *“Bars update only after a real step — empty fields mean we haven’t stepped yet, not fake filler.”*

---

## Built-in **Q Tips** (on-screen tour)

Click **Q Tips** in the top bar. The app cycles **10 slides** (`GUIDE_STEPS` in `App.tsx`):

1. Start here — top bar, scenarios, reset.  
2. Choose the runtime — Agent vs Env.  
3. Pick a scenario — presets load real patient/regimen state.  
4. Check the model truth — `/policy/model_status`.  
5. Read episode state — overview + patient summary.  
6. Review legal actions — candidates.  
7. Submit or ask the agent — Submit vs Run Agent.  
8. Inspect reward channels.  
9. Medications + history/warnings.  
10. Event log — errors and connectivity.

**Recording tip:** Record **Q Tips** once in full voiceover (“I’ll use the in-app tour…”) then dismiss and do the live walkthrough below.

---

## Shot-by-shot recording script

### Scene 0 — Intro (30–45 s)

**Action:** Scroll slightly so hero + top bar are visible.  
**Say:** *“This is PolyGuard on Hugging Face Spaces: an OpenEnv workbench for polypharmacy safety. The backend runs a real `PolyGuardEnv` with verifiable rewards; the UI can drive it through the product API or raw OpenEnv WebSockets.”*

---

### Scene 1 — Model Truth (45–60 s)

**Action:** Stay on **Agent Workbench**. Click nothing yet; point at **Model Truth**.  
**Say:** *“Model Truth is live from `/policy/model_status`. Here we see the base model—typically Qwen 2.5 0.5B Instruct—which artifact is loaded—often the GRPO adapter—and the run id. On Spaces, weights are under `/app/checkpoints/active` after the bundle installer runs.”*

**If panel shows unavailable:** *“Cold start or CPU load can delay the bundle; the environment still works for manual candidate submission; Run Agent may fall back to non-LLM routing depending on config.”*

---

### Scene 2 — Easy task, manual submit (Agent) (90–120 s)

**Action:** Task → **Easy Screening** (DDI, easy). **Reset Episode.**  
**Say:** *“Easy Screening fixes difficulty easy and sub-environment DDI—drug–drug interaction screening.”*

**Action:** Pan **Episode Overview** — read **Patient Summary** and **Risk Delta** aloud briefly.  
**Say:** *“This patient block and risk delta come straight from the observation object.”*

**Action:** **Candidate Actions** — click 2–3 rows; show **Blocked** vs legal. Select a **legal** row.  
**Say:** *“Candidates are legal moves from the env; illegal rows are disabled.”*

**Action:** **Action Console** — tweak **Confidence** and **Rationale** slightly. Click **Submit Candidate**.  
**Say:** *“Submit Candidate hits `/env/step_candidate` with my chosen legal action, confidence, and rationale.”*

**Action:** After response, pause on **Reward Channels** and **Last Reward** in overview.  
**Say:** *“These bars are the verifier breakdown; total reward is the scalar GRPO-style signal we train on.”*

**Action:** **Action History** — show one new line. **Event Log** — show the new reward line.  
**Say:** *“History and event log give an audit trail—not a canned animation.”*

---

### Scene 3 — Run Agent (orchestrator + LLM path) (90–120 s)

**Prerequisite:** Prefer recording when Model Truth shows **enabled** and **active** with Qwen artifacts.

**Action:** **Reset Episode** again (same or different task). Click **Run Agent**. Wait for completion.  
**Say:** *“Run Agent calls `/agents/orchestrate`. That runs med reconciliation, evidence retrieval, graph safety, dosing hints, candidate generation, supervisor mode, a contextual bandit shortlist, then the planner—here that’s where the loaded Qwen policy can choose among candidates—the critic veto, environment step, and explainer.”*
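The stage order in the voiceover can be sketched as a simple sequential runner. The stage names mirror the narration; the callables and the veto convention are stand-ins, not the orchestrator's real interface.

```python
# Sketch only: stage names follow the voiceover; the runner and the
# "vetoed" flag are assumed conventions, not the real orchestrator API.
PIPELINE = [
    "med_reconciliation", "evidence_retrieval", "graph_safety",
    "dosing_hints", "candidate_generation", "supervisor",
    "bandit_shortlist", "planner", "critic_veto", "env_step", "explainer",
]

def run_pipeline(state: dict, stages: dict) -> dict:
    """Thread state through each stage in order; a critic veto aborts early."""
    for name in PIPELINE:
        state = stages.get(name, lambda s: s)(state)
        if state.get("vetoed"):
            break
    return state

# Example: a critic that vetoes low-confidence plans before env_step runs.
stages = {"critic_veto": lambda s: {**s, "vetoed": s.get("confidence", 1.0) < 0.5}}
result = run_pipeline({"confidence": 0.9}, stages)
```

The planner slot is where the loaded Qwen policy chooses among candidates; everything downstream of `critic_veto` is skipped when the veto fires.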

**Action:** Scroll to **Decision**, **Explanation**, **Evidence** JSON panels.  
**Say:** *“These three panels are populated only in Agent Workbench mode. Env Explorer deliberately hides them because the raw WebSocket client never invokes the orchestrator, so there is no response bundle to render.”*

**Action:** Point at **Evidence** — note that it shows structured retriever output, or an empty object if the task didn’t fetch any.  
**Say:** *“Evidence is whatever the evidence agent produced for this state—grounding for clinician trust.”*

---

### Scene 4 — Env Explorer contrast (60–90 s)

**Action:** Click **Env Explorer**. **Reset Episode** (same task: Easy Screening).  
**Say:** *“Now the UI resets over WebSocket `reset` to the OpenEnv service on port 8100—same scenarios, different transport.”*

**Action:** Select a candidate, **Submit Env Step**.  
**Say:** *“Submit Env Step sends a WebSocket `step` with the action payload—no `/agents/orchestrate`.”*
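The two WebSocket messages narrated here (`reset` and `step`) can be sketched as plain JSON frames. Only the op names come from the script; the envelope shape and the `action_id` field are assumptions about the OpenEnv message format.

```python
import json

# Sketch only: "reset"/"step" op names come from the script; the
# surrounding message shape is an assumed convention.
def ws_reset(task: str) -> str:
    """Frame sent by Reset Episode over the raw WebSocket."""
    return json.dumps({"op": "reset", "task": task})

def ws_step(action_id: str) -> str:
    """Frame sent by Submit Env Step — no /agents/orchestrate involved."""
    return json.dumps({"op": "step", "action": {"action_id": action_id}})

# Sent to the OpenEnv service, e.g. ws://<host>:8100 in local runs.
reset_frame = ws_reset("easy_screening")
step_frame = ws_step("swap_warfarin_apixaban")  # hypothetical legal action
```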

**Action:** Scroll to **Decision / Explanation / Evidence** — show they stay **empty** or “No data.”  
**Say:** *“This is intentional: I’m proving the low-level env API, not the full agent stack.”*

**Action:** **Event Log** — note the new lines tagged as env-step events.

---

### Scene 5 — Task variety (2–3 minutes, optional montage)

For each preset, do **Reset** + **one** legal **Submit** (Agent mode is enough):

| Task | Difficulty | Sub-environment | What to say |
| --- | --- | --- | --- |
| **Easy Screening** | easy | DDI | “Fast DDI-focused episode.” |
| **Budgeted Screening** | medium | REGIMEN_RISK | “More steps, regimen-risk tradeoffs.” |
| **Complex Tradeoff** | hard | REGIMEN_RISK | “Harder patient draw, tighter budgets.” |
| **Bandit Mining** | hard | BANDIT_MINING | “Bandit-style policy mining scenario.” |

**Action:** Switch Task to **Advanced**. Set e.g. **hard** + **PRECISION_DOSING**. Reset.  
**Say:** *“Advanced exposes every sub-environment enum the backend supports—precision dosing, deprescribing, web-search missing data, alternatives, new-drug decomposition.”*
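The sub-environment list read out here can be captured as an enum for reference. The member names follow the voiceover and the task table above; the exact backend spellings are assumptions.

```python
from enum import Enum

# Sketch only: member spellings are inferred from the voiceover and the
# task table; the backend's actual enum may differ.
class SubEnvironment(Enum):
    DDI = "ddi"
    REGIMEN_RISK = "regimen_risk"
    BANDIT_MINING = "bandit_mining"
    PRECISION_DOSING = "precision_dosing"
    DEPRESCRIBING = "deprescribing"
    WEB_SEARCH_MISSING_DATA = "web_search_missing_data"
    ALTERNATIVES = "alternatives"
    NEW_DRUG_DECOMPOSITION = "new_drug_decomposition"

# The Advanced form pairs any difficulty with any sub-environment:
advanced_pick = ("hard", SubEnvironment.PRECISION_DOSING)
```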

---

### Scene 6 — Medications + warnings (45 s)

**Action:** After any step with regimen change, show **Current Medications** cards (high-risk styling).  
**Say:** *“Cards mirror `medication_table` from the observation; warnings list is explicit env output.”*

---

### Scene 7 — Closing (30 s)

**Say:** *“That’s the full loop: HF Space hosts OpenEnv + API, Qwen adapters live under checkpoints/active, Agent Workbench demonstrates orchestrated LLM decisions with evidence and explanations, and Env Explorer proves the same environment over raw WebSockets for OpenEnv compatibility.”*

---

## OBS / QuickTime checklist

- [ ] Capture **system audio** if you’ll add voiceover in post, or record the mic directly in OBS.  
- [ ] **1920×1080**, 30 fps (or 60 fps for smoother cursor motion).  
- [ ] **2 s pause** after each button click before scrolling away.  
- [ ] If Space sleeps, **mouse jiggle** or refresh before recording.  
- [ ] Export **MP4 H.264** for YouTube / HF dataset card.

---

## Quick troubleshooting on camera (if something breaks)

| Symptom | What to say / do |
| --- | --- |
| WebSocket errors in Event Log | “Env service reconnect—refresh page; WS URL is derived from the Space origin.” |
| Run Agent fails | “Check Model Truth—model may still be downloading or Ollama disabled on Space.” |
| Reward bars all show dashes | “No step yet—reset and submit once.” |
| Candidates empty | “Reset episode—env didn’t initialize.” |
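The first troubleshooting row says the WS URL is derived from the Space origin. A minimal sketch of that derivation, assuming an `http(s)` → `ws(s)` scheme swap; the `/ws` path is a hypothetical placeholder, not the app's confirmed route.

```python
from urllib.parse import urlparse

# Sketch only: the scheme swap follows the troubleshooting note; the
# "/ws" path is an assumed placeholder.
def ws_url_from_origin(origin: str) -> str:
    """Derive the WebSocket URL from the page origin."""
    p = urlparse(origin)
    scheme = "wss" if p.scheme == "https" else "ws"
    return f"{scheme}://{p.netloc}/ws"

print(ws_url_from_origin("https://user-space.hf.space"))  # wss://user-space.hf.space/ws
```

This is why a page refresh fixes most WS errors: the client rebuilds the URL from the current origin and reconnects.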

---

## Related docs

- [UI overview](ui.md)  
- [Deployment](deployment.md)  
- [Environment design](environment_design.md)  
- [Reward design](reward_design.md)  
- [Architecture](architecture.md)