# PolyGuard Space UI — demo recording script (shot-by-shot)

Use this document while screen-recording the Hugging Face Space (or local Docker). Target length: **8–14 minutes** for a full pass, or **3–5 minutes** for a highlights reel.

---

## Before you hit record

1. **Open the Space** in a clean browser profile or incognito (fewer extensions → fewer glitches).
2. **Set resolution**: 1920×1080 or 1440×900; browser zoom **100%**.
3. **Fullscreen** the Space iframe or use HF “Open in new tab” so the URL bar shows the Space domain.
4. **Wait for cold start**: first load may download the model bundle (several minutes). The **Event Log** and **Model Truth** panel will tell you if the policy failed to load (heuristic fallback is still usable for env steps).
5. **Optional**: hide mouse cursor in OBS if you prefer; otherwise move slowly and pause **2 seconds** on each panel after major clicks.

**Primary Space (product):** `https://huggingface.co/spaces/TheJackBright/polyguard-openenv-workbench`
Runtime: nginx fronts the **product API** (default `8200`) and **OpenEnv service** (`8100`); see `docker/space/entrypoint.sh`.

---

## Where the model lives (Qwen and artifacts)

This matters for what you say on camera.

| Location | What it is |
| --- | --- |
| **On the Space container** | Working directory `/app` (see `entrypoint.sh`: `cd /app`). |
| **Downloaded bundle** | If `checkpoints/active/grpo_adapter/adapter_config.json` is missing at boot, `scripts/install_hf_active_bundle.py` pulls the **HF usable model bundle** into `checkpoints/active/`. |
| **Typical layout after install** | `checkpoints/active/active_model_manifest.json` — which artifact is active (often **GRPO adapter** on top of base). |
| **Weights** | `checkpoints/active/grpo_adapter/` (LoRA/PEFT), optionally `checkpoints/active/merged/` (full merged weights), `checkpoints/active/sft_adapter/`. |
| **Base model name** | Usually **`Qwen/Qwen2.5-0.5B-Instruct`** as the Transformers base for adapters (set via env e.g. `POLYGUARD_HF_MODEL`). |

**What the UI proves:** the **Model Truth** panel calls **`GET /policy/model_status`** (product API). It shows `model_id` / `base_model`, `run_id`, `preferred_artifact` / `loaded_source`, and availability flags. Say on camera: *“This is live from the API, not hard-coded in the frontend.”*
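To make the "live from the API" claim concrete, here is a minimal sketch of the strict check the panel applies before labeling "Qwen 0.5B active". The response shape mirrors the fields listed above; the exact schema (and the example `run_id`) is hypothetical.

```python
import re

def qwen_verified(status: dict) -> bool:
    """Sketch of the strict 'Qwen 0.5B active' check: enabled + active +
    every availability stage true + model id matching Qwen2.5-0.5B-Instruct."""
    model_id = status.get("model_id") or status.get("base_model") or ""
    return (
        bool(status.get("enabled"))
        and bool(status.get("active"))
        and all(status.get("availability", {}).values())
        and re.search(r"Qwen2\.5-0\.5B-Instruct", model_id) is not None
    )

# Hypothetical response shaped like the fields the Model Truth panel displays.
status = {
    "enabled": True,
    "active": True,
    "model_id": "Qwen/Qwen2.5-0.5B-Instruct",
    "run_id": "run_demo",            # illustrative value only
    "loaded_source": "grpo_adapter",
    "availability": {"base": True, "adapter": True},
}
print(qwen_verified(status))  # True
```

Any failed availability stage (or a non-Qwen base id) flips the label to "Qwen not verified".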

---

## UI map (what appears on screen)

| Region | Purpose |
| --- | --- |
| **Hero** (“PolyGuard neural safety cockpit”) | Marketing copy + quick stats. |
| **Top bar** | **Agent Workbench** vs **Env Explorer**, **Task** dropdown, **Reset Episode**, **Q Tips**. |
| **Status chips** | “Live” / model line; in Env mode one chip reads **ws env** (WebSocket to OpenEnv). |
| **Model Truth** | Qwen / artifact / run / availability. |
| **Advanced strip** | Only if Task = **Advanced** — pick raw `difficulty` + `sub_environment`. |
| **Episode Overview** | Mode, task, difficulty, environment, step budget, last reward, patient id, **Patient Summary**, **Risk Delta**. |
| **Candidate Actions** | Legal moves: `candidate_id`, action type, target/replacement, estimated safety delta (or **Blocked**). |
| **Action Console** | Confidence, rationale, **Submit** vs **Run Agent** (Agent mode only for Run Agent). |
| **Reward Channels** | Bars for total + primary + component scores (see below). |
| **Current Medications** | Cards from observation. |
| **Action History / Warnings** | Step trace and env warnings. |
| **Decision / Explanation / Evidence** | **Agent mode only** (filled after API steps that return those fields). |
| **Event Log** | Human-readable trace of resets, steps, rewards, errors. |

---

## Feature encyclopedia — every panel, branch, and agent

Use this section as a **script appendix** or **judge handout**. It mirrors the React workbench in `app/ui/frontend/src/App.tsx`, the API in `app/api/`, and the orchestrator in `app/agents/orchestrator.py`.

### A. How the Space is wired (under the hood)

| Piece | Role |
| --- | --- |
| **Browser → nginx** | HF Space exposes one origin; nginx routes paths. |
| **Product API** | Vite uses `API_BASE` (default **`/api`**). FastAPI serves catalog, reset, step_candidate, orchestrate, model_status, reward_breakdown, etc. |
| **OpenEnv HTTP/WS** | `ENV_BASE` defaults to **same origin** on Spaces (not localhost). Web UI opens **`ws(s)://<origin>/ws`** for Env Explorer. |
| **Two Python processes** | `entrypoint.sh` starts **uvicorn** for `app.env.fastapi_app` (env, port **8100**) and **uvicorn** for `app.api` (product API, port **8200**). Agent mode reset/step still use the **API’s** in-process `PolyGuardEnv`; Env mode uses the **separate** env service over WebSocket. |
| **Important** | Agent and Env UIs maintain **separate React state** (`agentObservation` vs `envObservation`). Toggling mode **clears the Event Log** and clears the inactive branch’s episode state so you always know which backend path you are exercising. |

### B. Hero (“PolyGuard neural safety cockpit”)

| Stat | Source | What to say on camera |
| --- | --- | --- |
| **Runtime** | `mode === "agent"` → “Agent Workbench”; else “Env Explorer”. | “This is which transport I am using right now.” |
| **Scenario** | Human label for current `taskId` from catalog presets or Advanced. | “Which curriculum preset is bound to difficulty + sub-environment.” |
| **Candidates** | `candidate_action_set.length` from the **active** observation. | “How many legal moves the env is offering after the last reset/step.” |
| **Reward** | Last scalar reward for the active branch (`null` → shown as `-`). | “Verifier scalar after the last step in this mode only.” |

### C. Top bar — every control

| Control | Behavior |
| --- | --- |
| **Agent Workbench** | Sets `mode` to `agent`. Clears env state, event log, error; clears agent panels if switching from env (see `handleModeChange`). |
| **Env Explorer** | Sets `mode` to `env`. Clears agent-specific observation/reward/decision/evidence. |
| **Task** `<select>` | Options: each **task preset** from `GET /env/catalog` (`task_presets`), plus **Advanced**. Changing a preset updates internal `difficulty` + `sub_environment` to match the preset. |
| **Reset Episode** | **Agent:** `POST /env/reset` with body from preset (`{ task_id }`) or `{ difficulty, sub_environment }`. Refreshes **Model Truth** first. Clears reward breakdown, decision, explanation, evidence, sets default candidate. **Env:** WebSocket `reset` with `{ difficulty, sub_environment }` only (no `task_id` in WS path—preset is flattened to those two fields). **Always** clears `events` at start of reset handler, then appends one “Reset … in agent/env” line. |
| **Q Tips** | Opens modal walkthrough; highlights DOM nodes with `[data-guide="…"]`. **Skip** stores `polyguard.qtips.v2.seen` in localStorage so first visit auto-opens tips. |
| **Status chips** | First chip: **Live** if observation loaded and not done, else **Complete** / **Ready**. Second chip: in Agent mode, derived from **`modelSignal()`** (Qwen verified or not); in Env mode shows **`ws env`**. |

### D. Model Truth panel — field by field

Data from **`GET /policy/model_status`** (`PolicyProviderRouter` / `active_model_status`).

| Field in UI | Typical meaning |
| --- | --- |
| **Heading label** | “Qwen 0.5B active” only when Space config matches a strict check (enabled + active + availability + model id regex for **Qwen2.5-0.5B-Instruct**); else “Qwen not verified” or Ollama-specific text if Ollama wins locally. |
| **Detail paragraph** | Human sentence: model name, artifact, `run_id`, optional **load_error**. |
| **Model** | `model_id` or `base_model` — HF id of the loaded or configured base. |
| **Run** | `run_id` from manifest / sweep activation (which training bundle). |
| **Artifact** | `loaded_source` or `preferred_artifact` — e.g. **`grpo_adapter`**, **`merged`**, **`sft_adapter`**. |
| **Availability** | Key/value pairs from `availability` dict (which load stages succeeded). |

**Ollama branch (local dev):** If `status.ollama.enabled && available`, the UI labels **Ollama Qwen active** and mentions `POLYGUARD_PROVIDER_PREFERENCE` order. Spaces Dockerfile sets **`POLYGUARD_ENABLE_OLLAMA=false`** by default.

### E. Advanced strip (Task = Advanced)

Only rendered when `taskId === "advanced"`. Two selects:

1. **Difficulty:** `easy` \| `medium` \| `hard` — passed to reset as `difficulty`.
2. **Environment:** every string in `catalog.sub_environments` (DDI, BANDIT_MINING, REGIMEN_RISK, PRECISION_DOSING, LONGITUDINAL_DEPRESCRIBING, WEB_SEARCH_MISSING_DATA, ALTERNATIVE_SUGGESTION, NEW_DRUG_DECOMPOSITION).

**What each sub-environment stresses (one line each):**

| Sub-environment | What the episode emphasizes |
| --- | --- |
| **DDI** | Drug–drug interaction exposure and pair risk. |
| **BANDIT_MINING** | Policy / bandit exploration style scenario (see preset “Bandit Mining”). |
| **REGIMEN_RISK** | Overall regimen burden and safety tradeoffs. |
| **PRECISION_DOSING** | Dose buckets, organ-sensitive flags in observation. |
| **LONGITUDINAL_DEPRESCRIBING** | Multi-step taper / stop sequences over time. |
| **WEB_SEARCH_MISSING_DATA** | Rewards process fidelity for evidence-fetch actions. |
| **ALTERNATIVE_SUGGESTION** | Substitution / alternative action types rewarded more. |
| **NEW_DRUG_DECOMPOSITION** | Hard track: decompose novel drug string into components. |

### F. Episode Overview — every KPI and subsection

**KPI grid (always eight rows):**

| KPI | Source |
| --- | --- |
| **Mode** | Literal “Agent Workbench” or “Env Explorer”. |
| **Task** | Preset label or “Advanced”. |
| **Difficulty** | `observation.deterministic_contract.difficulty` or `-`. |
| **Environment** | `deterministic_contract.sub_environment` or `observation.sub_environment`. |
| **Step Budget** | `observation.step_budget_remaining`. |
| **Last Reward** | Active branch’s last reward (after reset, Agent clears to `-` until first step). |
| **Patient** | `patient_summary.patient_id` or `patient_summary.id`. |
| **Status** | Complete if `done`, else Live if observation exists, else Ready. |

**Patient Summary `<dl>`:** First **8** keys of `observation.patient_summary` (keys humanized: underscores → spaces, title case). Typical keys include demographics, allergies, high-level clinical flags—whatever the backend puts on `PolyGuardObservation`.

**Risk Delta `<dl>`:** First **8** entries of `observation.burden_score_summary` — burden-related scalars the env uses for reward deltas.
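The "first 8 keys, humanized" rule used by both `<dl>` blocks above is easy to restate in code. This is a sketch of the rule as described, not the actual `App.tsx` helper; the example keys are hypothetical.

```python
def humanize(key: str) -> str:
    # Underscores become spaces, then title case (the rule described above).
    return key.replace("_", " ").title()

def first_eight(summary: dict) -> list[tuple[str, object]]:
    # Take the first 8 entries in insertion order, with humanized labels.
    return [(humanize(k), v) for k, v in list(summary.items())[:8]]

# Hypothetical patient summary — real keys come from PolyGuardObservation.
summary = {"patient_id": "p_001", "age_years": 82, "renal_function_flag": "reduced"}
print(first_eight(summary))
```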

### G. Candidate Actions list — each column

Each row is one **`CandidateAction`** from `candidate_action_set`.

| Column / concept | Meaning |
| --- | --- |
| **`candidate_id`** | Stable id (e.g. `cand_…`) — must match when submitting. |
| **Action label** | Humanized `action_type` (STOP_DRUG, SUBSTITUTE_WITHIN_CLASS, …). |
| **Third column** | `target_drug` **or** `replacement_drug` **or** `mode` — whichever is most informative. |
| **Right column** | `estimated_safety_delta` formatted to 3 decimals, or **Blocked** if `legality_precheck === false`. |
| **Disabled rows** | You cannot select illegal candidates; click does nothing. |
| **Default selection** | **Agent:** first candidate in list. **Env:** first **legal** candidate that is not `KEEP_REGIMEN` and not `REQUEST_*`, else first legal non–KEEP_REGIMEN, else first in list (`defaultCandidateForMode`). |

**Hidden fields you can mention if showing JSON elsewhere:** `dose_bucket`, `taper_days`, `monitoring_plan`, `evidence_query`, `new_drug_name`, `candidate_components`, `uncertainty_score`, `rationale_tags`, `required_monitoring`, `burden_delta`, `disease_stability_estimate`.

### H. Action Console — every input and button

| UI element | Effect |
| --- | --- |
| **Type / Mode / Target / Replacement / Dose / Uncertainty** | Read-only snapshot of the **currently selected** candidate. |
| **Confidence** | Number input **0.001–0.999** step 0.001; sent as `confidence` on **Submit Candidate** (Agent) or embedded in WS payload (Env). |
| **Rationale** | Free text → `rationale_brief` / rationale on the action. |
| **Submit Candidate** (Agent) | Calls `POST /env/step_candidate` with `{ candidate_id, confidence, rationale_brief }`. API finds matching legal action and calls `env.step`. |
| **Submit Env Step** (Env) | Same confidence/rationale + full action payload built by `buildActionPayload` → WS `step`. |
| **Run Agent** | **Only when** `mode === "agent"` **and** observation exists **and** not `done`. Calls `POST /agents/orchestrate` with empty JSON body. **Disabled** in Env mode. |
| **Done notice** | If `done`, shows which mode completed and `termination_reason` from `info` if present. Primary button becomes **Reset Episode** (shortcut). |

### I. Reward Channels — every bar (exact keys)

The UI renders **exactly these keys** in order (`REWARD_KEYS` in `App.tsx`, **14** rows):

| # | Key | Role |
| --- | --- | --- |
| 1 | `total_reward` | Weighted aggregate of component scores (`aggregate_rewards` in `reward_scaling.py`). |
| 2 | `primary_safety_legality` | Roll-up: legality, candidate alignment, anti-cheat, uncertainty calibration (`reward_router.compute_primary_reward_channels`). |
| 3 | `primary_clinical_improvement` | Roll-up: safety delta, burden improvement, disease stability. |
| 4 | `primary_dosing_quality` | Roll-up: dosing quality + abstention quality. |
| 5 | `primary_process_integrity` | Roll-up: format compliance, efficiency, process fidelity, explanation grounding. |
| 6 | `legality_score` | Action legal per safety verifier. |
| 7 | `safety_delta_score` | Movement on severe DDI / risk proxy vs pre-step state. |
| 8 | `burden_improvement_score` | Medication burden before vs after. |
| 9 | `disease_stability_score` | Stability heuristic vs disruptive action types. |
| 10 | `dosing_quality_score` | Dose-mode and bucket appropriateness. |
| 11 | `process_fidelity_score` | Follows intended workflow for sub-environment (e.g. fetch evidence when required). |
| 12 | `explanation_grounding_score` | Rationale present / grounded. |
| 13 | `anti_cheat_score` | Collapses when anti-cheat triggers. |
| 14 | `uncertainty_calibration_score` | Confidence vs uncertainty alignment. |

**Note:** `total_reward` is row 1; rows 2–5 are **primary** channels; rows 6–14 are **exposed component** scores. Other components (`format_compliance_score`, `efficiency_score`, `candidate_alignment_score`, `abstention_quality_score`) still exist **in the backend** `RewardBreakdown` and feed primaries + total, but this UI **does not** give them their own bar rows.

Bars show **`-`** when the value is missing (no step yet or breakdown not returned). Bar width = value × 100% with value clamped to `[0.001, 0.999]`.
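The clamp-and-scale rule for bar widths, as a one-function sketch of what is described above:

```python
def bar_width_pct(value):
    """Sketch of bar rendering: missing value renders as '-'; otherwise
    clamp to [0.001, 0.999] and use value * 100 as the width percent."""
    if value is None:
        return None  # shown as "-"
    clamped = min(max(value, 0.001), 0.999)
    return round(clamped * 100, 1)

print(bar_width_pct(0.742))  # 74.2
print(bar_width_pct(1.3))    # 99.9 (clamped)
print(bar_width_pct(None))   # None → rendered "-"
```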

**Breakdown source (Agent vs Env):** **Agent:** after a step, the UI prefers `info.reward_breakdown` and may also call **`GET /env/reward_breakdown`**. **Env:** uses `info.reward_breakdown` from the WebSocket step packet; if it is empty, the UI clears the reward panel.

### J. Current Medications cards

Built from `observation.medication_table[]`. Each card:

- **Title:** `drug` / `drug_id` / `name`.  
- **High-risk ribbon:** if `high_risk` or `is_high_risk_elderly` or Beers / warning flags.  
- **Body:** `indication` or `class_name` or `atc_class`.  
- **Meta row:** dose bucket or mg dose; taper vs `monitoring` or `route`.

### K. Action History vs Warnings

| Panel | Source |
| --- | --- |
| **Action History** | `observation.action_history` — each item shows step index and `action_type` / `candidate_id` / reward snippet. |
| **Warnings** | `observation.warning_summary` — list of human-readable env warnings (DDIs, constraints, etc.). |

### L. Decision / Explanation / Evidence (Agent only)

Rendered as JSON `<pre>` blocks:

| Title | When populated | Content origin |
| --- | --- | --- |
| **Decision** | Agent mode only. | **`final_action`** on the packet. For **`step_candidate`**, the API returns the standard **step** payload — **typically no `final_action` field**, so this panel may stay **empty after manual submit**. For **`orchestrate`**, **`final_action`** is the **`PolyGuardAction`** after critic (what actually hit `env.step`). |
| **Explanation** | Agent mode only. | **`explanation`** — output of **`ExplainerAgent`** after the step (`orchestrate` returns it). Usually **empty** after raw `step_candidate` unless API adds it. |
| **Evidence** | Agent mode only. | **`evidence`** key on packet. **`orchestrate`** returns **`evidence_out`** from **`EvidenceAgent.run(state)`** (retrieval / web-fallback bundle). **`step_candidate`** does not attach orchestrator evidence — panel often **empty** on manual clicks. |

**Demo takeaway:** Tell viewers: *“To populate Decision / Explanation / Evidence in the UI, use **Run Agent** (orchestrate). Manual **Submit Candidate** updates the env and rewards but does not replay the full multi-agent JSON into those three panels.”*

### M. Event Log vs Q Tips

| Feature | Behavior |
| --- | --- |
| **Event Log** | Prepends timestamped strings: resets, each step’s reward line, errors. **Capped** at 24 lines. Cleared when you click **Reset Episode** (handler starts with `setEvents([])` then appends) — *not* the same as mode switch clearing. |
| **Q Tips** | 10-step overlay; does not mutate env. |
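The Event Log's prepend-and-cap behavior can be sketched in a few lines (assuming the cap of 24 described above):

```python
MAX_EVENTS = 24

def push_event(events: list[str], line: str) -> list[str]:
    """Sketch of the Event Log update: newest line first, capped at 24 entries."""
    return ([line] + events)[:MAX_EVENTS]

events: list[str] = []
for i in range(30):
    events = push_event(events, f"step {i}")
print(len(events), events[0])  # 24 step 29
```

After 30 pushes only the newest 24 lines survive, newest at the top.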

### N. Orchestrator — every agent in order (`Run Agent`)

When **`POST /agents/orchestrate`** runs, `Orchestrator.run_step` executes:

| Step | Agent class | What it does (operator language) |
| --- | --- | --- |
| 1 | **`MedRecAgent`** | Summarizes current medication list / reconciliation view for downstream modules. Output key: `medrec`. |
| 2 | **`EvidenceAgent`** | Retrieves **local evidence** (and optional web fallback) for missing or thin context. Shown in UI **`evidence`** when orchestrating. |
| 3 | **`GraphSafetyAgent`** | Graph-style **DDI / duplicate therapy** style signals. Output: `graph`. |
| 4 | **`DosingAgent`** | Flags **dose-sensitive** windows and dosing opportunities. Feeds **`dosing_active`** into supervisor. |
| 5 | **`CandidateAgent`** | Wraps env **candidate builder** — produces the legal `CandidateAction` list. |
| 6 | **`SupervisorAgent`** | Chooses planner **mode**: regimen vs dose vs **REVIEW** (conservative routing). |
| 7 | **Contextual bandit** | **`ContextualBanditPolicy`** (LinUCB or Thompson sampling via `POLYGUARD_BANDIT_ALGO`) proposes **top-k** (`POLYGUARD_BANDIT_TOP_K`) candidates for the planner to consider. |
| 8 | **`PlannerAgent`** | Calls **`PolicyProviderRouter.select_candidate`** — this is where **Transformers + Qwen + PEFT** (or Ollama, or **safety ranker fallback**) picks a **`candidate_id`** and rationale. |
| 9 | **`CriticAgent`** | Safety veto / repair. May replace proposed action with a safer **`final_action`**. |
| 10 | **Replan / debate** (optional) | If `coordination_mode` is `replan_on_veto` or `lightweight_debate` and critic rejects, planner may rerun on **review** candidates; `debate_rounds` increments. |
| 11 | **`PolyGuardEnv.step`** | Commits **`final_action`**, returns `observation`, `reward`, `done`, `info`. |
| 12 | **Bandit `update`** | If the chosen candidate was in the bandit pool, **updates bandit statistics with the reward** (learning signal for next orchestrate). |
| 13 | **`ExplainerAgent`** | Builds **`explanation`** object for audit / UI. |

**Environment variables (mention for power users):**

| Variable | Effect |
| --- | --- |
| **`POLYGUARD_POLICY_STACK`** | `llm+bandit` (default): planner sees **bandit-shortlisted** candidates. `llm-only`: all supervisor-filtered candidates. `bandit-only`: **no LLM** — first bandit pick with fixed rationale. |
| **`POLYGUARD_BANDIT_*`** | Algorithm, alpha, epsilon, seed, top-k. |
| **`POLYGUARD_PROVIDER_PREFERENCE`** | e.g. `transformers` vs `ollama` order. |
| **`POLYGUARD_ENABLE_ACTIVE_MODEL`** | Must be true on Space for bundle path; **`POLYGUARD_HF_MODEL`** sets base id for adapters. |

### O. Qwen and fallbacks (planner path)

`PolicyProviderRouter` (`app/models/policy/provider_runtime.py`):

1. Builds a **JSON instruction** listing candidates and asks for `candidate_id=…; rationale=…`.  
2. Tries providers in **`POLYGUARD_PROVIDER_PREFERENCE`** (default **Transformers** on Space).  
3. Parses model text for a legal `candidate_id`; on failure uses **`safety_ranker`** deterministic ordering.
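The parse-then-fallback step can be sketched as follows. The `candidate_id=…; rationale=…` format is from the doc; the regex and the stand-in fallback (first legal id instead of the real safety ranker ordering) are illustrative assumptions.

```python
import re

def parse_planner_reply(text: str, legal_ids: list[str]):
    """Sketch: extract 'candidate_id=...; rationale=...' from model text;
    if no legal id is found, fall back to a deterministic choice (here the
    first legal id, standing in for the safety_ranker ordering)."""
    m = re.search(r"candidate_id=(\S+?);\s*rationale=(.*)", text)
    if m and m.group(1) in legal_ids:
        return m.group(1), m.group(2).strip(), "llm"
    return legal_ids[0], "safety_ranker fallback", "safety_ranker"

legal = ["cand_0", "cand_3"]
print(parse_planner_reply("candidate_id=cand_3; rationale=lower DDI exposure", legal))
print(parse_planner_reply("sorry, I cannot decide", legal))
```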

**So:** Even without Qwen load, **Run Agent** still completes using **ranker / bandit** — mention that if Model Truth is red.

### P. Full observation contract (API / types)

The TypeScript type `EnvObservation` (`lib/types.ts`) lists fields the backend **may** send. The main workbench **highlights** patient summary, medication table, candidates, burden summary, action history, warnings, step budget, and sub-environment. **Not all fields get their own panel** — if you open browser DevTools → Network → `reset` / `step` response, you can narrate extras:

| Field | Typical use |
| --- | --- |
| `comorbidity_summary` | Comorbidity list for the patient. |
| `organ_function_summary` | eGFR / hepatic flags for dosing scenarios. |
| `labs_vitals_summary` | Labs relevant to risk scoring. |
| `graph_safety_summary` | Aggregated graph / DDI context. |
| `precision_dosing_flags` | Tags when sub-environment is dosing-heavy. |
| `unresolved_conflicts` | Specialist conflict strings. |
| `abstention_indicators` | When the env suggests review / abstain. |
| `deterministic_contract` | Difficulty + sub-environment + scenario id contract for reproducibility. |

### Q. Q Tips — copy for each slide (matches `GUIDE_STEPS`)

| # | Title | Body (read aloud or paraphrase) |
| --- | --- | --- |
| 1 | Start here | PolyGuard is an interactive OpenEnv workbench; top bar picks runtime, scenario, reset. |
| 2 | Choose the runtime | Agent Workbench = REST API + reward breakdown + Qwen path; Env Explorer = WebSocket to OpenEnv. |
| 3 | Pick a scenario | Presets load real patient/regimen state from backend. |
| 4 | Check the model truth | `/policy/model_status`; Qwen only “verified” when API says adapters live. |
| 5 | Read the episode state | Task, patient, step budget, reward, risk delta from latest env response. |
| 6 | Review legal actions | Candidate rows = legal moves; inspect safety delta and mode. |
| 7 | Submit or ask the agent | Submit Candidate vs Run Agent; check model panel before claiming LLM. |
| 8 | Inspect reward channels | Real scorer output per channel; empty = no step yet. |
| 9 | Track regimen changes | Medication cards + history + warnings = not canned. |
| 10 | Follow the run | Event log shows resets, steps, rewards, errors plainly. |

---

## Agent Workbench vs Env Explorer (say this exactly on camera)

| | **Agent Workbench** | **Env Explorer** |
| --- | --- | --- |
| **Reset** | `POST /env/reset` with task preset (e.g. `{ "task_id": "easy_screening" }`) via product API. | WebSocket `reset` message to OpenEnv **`/ws`** with `{ difficulty, sub_environment }`. |
| **Submit** | `POST /env/step_candidate` — product API resolves `candidate_id` + your confidence + rationale into a full action and steps the **same** in-process `PolyGuardEnv`. | WebSocket `step` — payload built from selected candidate; talks **directly** to OpenEnv service. |
| **Run Agent** | **`POST /agents/orchestrate`** — runs the full **orchestrator** (med rec, evidence, graph, dosing, candidates, supervisor, bandit, **planner/LLM**, critic, env step, explainer). | Button **disabled** — there is no orchestrator path over raw WS-only mode in this UI. |
| **Decision / Explanation / Evidence panels** | **Populated** after orchestrate or after steps that echo `final_action` / `explanation` / `evidence` (orchestrate returns rich `evidence` from `EvidenceAgent` pipeline). | **Always empty** in the UI by design — those panels are `null` in Env mode (`App.tsx` only passes agent-mode state to DetailPanels). |
| **Reward breakdown** | From step `info.reward_breakdown` or fallback `GET /env/reward_breakdown`. | From WS step packet `info.reward_breakdown` when present. |
| **Switching mode** | Clears the **Event Log** and resets the other mode’s transient state — mention that so viewers don’t think it’s a bug. | Same. |

**One-liner for judges:** *“Agent Workbench is the full product API plus optional LLM-orchestrated policy; Env Explorer is the raw OpenEnv WebSocket contract for the same underlying environment.”*
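The two reset paths in the table above can be sketched side by side. The payload shapes follow the doc (REST `task_id` vs WS `difficulty` + `sub_environment`); the wrapper dict keys here are illustrative, not the wire format.

```python
def build_reset(mode: str, preset: dict) -> dict:
    """Sketch of the two reset paths: Agent sends the preset's task_id over
    REST; Env flattens the preset to difficulty + sub_environment over WS."""
    if mode == "agent":
        return {"method": "POST", "path": "/env/reset",
                "body": {"task_id": preset["task_id"]}}
    return {"ws": "reset",
            "payload": {"difficulty": preset["difficulty"],
                        "sub_environment": preset["sub_environment"]}}

preset = {"task_id": "easy_screening", "difficulty": "easy", "sub_environment": "DDI"}
print(build_reset("agent", preset))
print(build_reset("env", preset))
```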

---

## Reward channels — what they mean and how they’re computed (talk track)

Rewards are **verifier-backed**, **bounded** to roughly **`[0.001, 0.999]`** (3 decimal places in UI).

### Four primary channels (high level)

These are **averages of component groups** (`compute_primary_reward_channels` in `app/env/reward_router.py`):

1. **`primary_safety_legality`** — legality, candidate id alignment, anti-cheat, uncertainty calibration.  
2. **`primary_clinical_improvement`** — safety delta vs severe pairs, burden improvement, disease stability.  
3. **`primary_dosing_quality`** — dosing quality + abstention (e.g. appropriate review requests under uncertainty).  
4. **`primary_process_integrity`** — format compliance, efficiency (step budget), process fidelity, explanation grounding.

### Components (examples — `compute_reward_breakdown`)

The environment builds scores such as:

- **`legality_score`**: high if the action is legal per safety report.  
- **`safety_delta_score` / `burden_improvement_score`**: from **before/after** burden and severe DDI pair counts (`_delta_to_reward`).  
- **`anti_cheat_score`**: collapses if anti-cheat flags the trajectory.  
- **`uncertainty_calibration_score`**: penalizes overconfidence vs modeled uncertainty.  
- **Sub-environment tweaks**: e.g. `WEB_SEARCH_MISSING_DATA` boosts process fidelity when using `FETCH_EXTERNAL_EVIDENCE`; `NEW_DRUG_DECOMPOSITION` rewards decomposition actions with components.

Then components are **scaled/clamped**, **primary channels** recomputed, and **`total_reward`** = weighted aggregate (`aggregate_rewards`).

**Demo line:** *“Bars update only after a real step — empty fields mean we haven’t stepped yet, not fake filler.”*

---

## Built-in **Q Tips** (on-screen tour)

Click **Q Tips** in the top bar. The app cycles **10 slides** (`GUIDE_STEPS` in `App.tsx`):

1. Start here — top bar, scenarios, reset.  
2. Choose the runtime — Agent vs Env.  
3. Pick a scenario — presets load real patient/regimen state.  
4. Check the model truth — `/policy/model_status`.  
5. Read episode state — overview + patient summary.  
6. Review legal actions — candidates.  
7. Submit or ask the agent — Submit vs Run Agent.  
8. Inspect reward channels.  
9. Medications + history/warnings.  
10. Event log — errors and connectivity.

**Recording tip:** Record **Q Tips** once in full voiceover (“I’ll use the in-app tour…”) then dismiss and do the live walkthrough below.

---

## Shot-by-shot recording script

### Scene 0 — Intro (30–45 s)

**Action:** Scroll slightly so hero + top bar are visible.  
**Say:** *“This is PolyGuard on Hugging Face Spaces: an OpenEnv workbench for polypharmacy safety. The backend runs a real `PolyGuardEnv` with verifiable rewards; the UI can drive it through the product API or raw OpenEnv WebSockets.”*

---

### Scene 1 — Model Truth (45–60 s)

**Action:** Stay on **Agent Workbench**. Click nothing yet; point at **Model Truth**.  
**Say:** *“Model Truth is live from `/policy/model_status`. Here we see the base model—typically Qwen 2.5 0.5B Instruct—which artifact is loaded—often the GRPO adapter—and the run id. On Spaces, weights are under `/app/checkpoints/active` after the bundle installer runs.”*

**If panel shows unavailable:** *“Cold start or CPU load can delay the bundle; the environment still works for manual candidate submission; Run Agent may fall back to non-LLM routing depending on config.”*

---

### Scene 2 — Easy task, manual submit (Agent) (90–120 s)

**Action:** Task → **Easy Screening** (DDI, easy). **Reset Episode.**  
**Say:** *“Easy Screening fixes difficulty easy and sub-environment DDI—drug–drug interaction screening.”*

**Action:** Pan **Episode Overview** — read **Patient Summary** and **Risk Delta** aloud briefly.  
**Say:** *“This patient block and risk delta come straight from the observation object.”*

**Action:** **Candidate Actions** — click 2–3 rows; show **Blocked** vs legal. Select a **legal** row.  
**Say:** *“Candidates are legal moves from the env; illegal rows are disabled.”*

**Action:** **Action Console** — tweak **Confidence** and **Rationale** slightly. Click **Submit Candidate**.  
**Say:** *“Submit Candidate hits `/env/step_candidate` with my chosen legal action, confidence, and rationale.”*

**Action:** After response, pause on **Reward Channels** and **Last Reward** in overview.  
**Say:** *“These bars are the verifier breakdown; total reward is the scalar GRPO-style signal we train on.”*

**Action:** **Action History** — show one new line. **Event Log** — show the new reward line.  
**Say:** *“History and event log give an audit trail—not a canned animation.”*

---

### Scene 3 — Run Agent (orchestrator + LLM path) (90–120 s)

**Prerequisite:** Prefer recording when Model Truth shows **enabled** and **active** with Qwen artifacts.

**Action:** **Reset Episode** again (same or different task). Click **Run Agent**. Wait for completion.  
**Say:** *“Run Agent calls `/agents/orchestrate`. That runs med reconciliation, evidence retrieval, graph safety, dosing hints, candidate generation, supervisor mode, a contextual bandit shortlist, then the planner—here that’s where the loaded Qwen policy can choose among candidates—the critic veto, environment step, and explainer.”*
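The stage order in the voiceover can be sketched as a simple sequential runner. The stage names mirror the narration; the callables and the veto convention are stand-ins, not the orchestrator's real interface.

```python
# Sketch only: stage names follow the voiceover; the runner and the
# "vetoed" flag are assumed conventions, not the real orchestrator API.
PIPELINE = [
    "med_reconciliation", "evidence_retrieval", "graph_safety",
    "dosing_hints", "candidate_generation", "supervisor",
    "bandit_shortlist", "planner", "critic_veto", "env_step", "explainer",
]

def run_pipeline(state: dict, stages: dict) -> dict:
    """Thread state through each stage in order; a critic veto aborts early."""
    for name in PIPELINE:
        state = stages.get(name, lambda s: s)(state)
        if state.get("vetoed"):
            break
    return state

# Example: a critic that vetoes low-confidence plans before env_step runs.
stages = {"critic_veto": lambda s: {**s, "vetoed": s.get("confidence", 1.0) < 0.5}}
result = run_pipeline({"confidence": 0.9}, stages)
```

The planner slot is where the loaded Qwen policy chooses among candidates; everything downstream of `critic_veto` is skipped when the veto fires.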

**Action:** Scroll to **Decision**, **Explanation**, **Evidence** JSON panels.  
**Say:** *“These three panels are populated only in Agent Workbench mode. Env Explorer deliberately hides them because the raw WebSocket client never invokes the orchestrator, so there is no response bundle to render.”*

**Action:** Point at **Evidence** — note that it shows structured retriever output, or an empty object if the task didn’t fetch any.  
**Say:** *“Evidence is whatever the evidence agent produced for this state—grounding for clinician trust.”*

---

### Scene 4 — Env Explorer contrast (60–90 s)

**Action:** Click **Env Explorer**. **Reset Episode** (same task: Easy Screening).  
**Say:** *“Now the UI resets over WebSocket `reset` to the OpenEnv service on port 8100—same scenarios, different transport.”*

**Action:** Select a candidate, **Submit Env Step**.  
**Say:** *“Submit Env Step sends a WebSocket `step` with the action payload—no `/agents/orchestrate`.”*
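The two WebSocket messages narrated here (`reset` and `step`) can be sketched as plain JSON frames. Only the op names come from the script; the envelope shape and the `action_id` field are assumptions about the OpenEnv message format.

```python
import json

# Sketch only: "reset"/"step" op names come from the script; the
# surrounding message shape is an assumed convention.
def ws_reset(task: str) -> str:
    """Frame sent by Reset Episode over the raw WebSocket."""
    return json.dumps({"op": "reset", "task": task})

def ws_step(action_id: str) -> str:
    """Frame sent by Submit Env Step — no /agents/orchestrate involved."""
    return json.dumps({"op": "step", "action": {"action_id": action_id}})

# Sent to the OpenEnv service, e.g. ws://<host>:8100 in local runs.
reset_frame = ws_reset("easy_screening")
step_frame = ws_step("swap_warfarin_apixaban")  # hypothetical legal action
```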

**Action:** Scroll to **Decision / Explanation / Evidence** — show they stay **empty** or “No data.”  
**Say:** *“This is intentional: I’m proving the low-level env API, not the full agent stack.”*

**Action:** **Event Log** — note the new lines tagged as env-step events.

---

### Scene 5 — Task variety (2–3 minutes, optional montage)

For each preset, do **Reset** + **one** legal **Submit** (Agent mode is enough):

| Task | Difficulty | Sub-environment | What to say |
| --- | --- | --- | --- |
| **Easy Screening** | easy | DDI | “Fast DDI-focused episode.” |
| **Budgeted Screening** | medium | REGIMEN_RISK | “More steps, regimen-risk tradeoffs.” |
| **Complex Tradeoff** | hard | REGIMEN_RISK | “Harder patient draw, tighter budgets.” |
| **Bandit Mining** | hard | BANDIT_MINING | “Bandit-style policy mining scenario.” |

**Action:** Switch Task to **Advanced**. Set e.g. **hard** + **PRECISION_DOSING**. Reset.  
**Say:** *“Advanced exposes every sub-environment enum the backend supports—precision dosing, deprescribing, web-search missing data, alternatives, new-drug decomposition.”*
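The sub-environment list read out here can be captured as an enum for reference. The member names follow the voiceover and the task table above; the exact backend spellings are assumptions.

```python
from enum import Enum

# Sketch only: member spellings are inferred from the voiceover and the
# task table; the backend's actual enum may differ.
class SubEnvironment(Enum):
    DDI = "ddi"
    REGIMEN_RISK = "regimen_risk"
    BANDIT_MINING = "bandit_mining"
    PRECISION_DOSING = "precision_dosing"
    DEPRESCRIBING = "deprescribing"
    WEB_SEARCH_MISSING_DATA = "web_search_missing_data"
    ALTERNATIVES = "alternatives"
    NEW_DRUG_DECOMPOSITION = "new_drug_decomposition"

# The Advanced form pairs any difficulty with any sub-environment:
advanced_pick = ("hard", SubEnvironment.PRECISION_DOSING)
```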

---

### Scene 6 — Medications + warnings (45 s)

**Action:** After any step with regimen change, show **Current Medications** cards (high-risk styling).  
**Say:** *“Cards mirror `medication_table` from the observation; warnings list is explicit env output.”*

---

### Scene 7 — Closing (30 s)

**Say:** *“That’s the full loop: HF Space hosts OpenEnv + API, Qwen adapters live under checkpoints/active, Agent Workbench demonstrates orchestrated LLM decisions with evidence and explanations, and Env Explorer proves the same environment over raw WebSockets for OpenEnv compatibility.”*

---

## OBS / QuickTime checklist

- [ ] Capture **system audio** if you’ll add voiceover in post, or record the mic directly in OBS.  
- [ ] **1920×1080**, 30 fps (or 60 fps for smoother cursor motion).  
- [ ] **2 s pause** after each button click before scrolling away.  
- [ ] If Space sleeps, **mouse jiggle** or refresh before recording.  
- [ ] Export **MP4 H.264** for YouTube / HF dataset card.

---

## Quick troubleshooting on camera (if something breaks)

| Symptom | What to say / do |
| --- | --- |
| WebSocket errors in Event Log | “Env service reconnect—refresh page; WS URL is derived from the Space origin.” |
| Run Agent fails | “Check Model Truth—model may still be downloading or Ollama disabled on Space.” |
| Reward bars all show dashes | “No step yet—reset and submit once.” |
| Candidates empty | “Reset episode—env didn’t initialize.” |
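The first troubleshooting row says the WS URL is derived from the Space origin. A minimal sketch of that derivation, assuming an `http(s)` → `ws(s)` scheme swap; the `/ws` path is a hypothetical placeholder, not the app's confirmed route.

```python
from urllib.parse import urlparse

# Sketch only: the scheme swap follows the troubleshooting note; the
# "/ws" path is an assumed placeholder.
def ws_url_from_origin(origin: str) -> str:
    """Derive the WebSocket URL from the page origin."""
    p = urlparse(origin)
    scheme = "wss" if p.scheme == "https" else "ws"
    return f"{scheme}://{p.netloc}/ws"

print(ws_url_from_origin("https://user-space.hf.space"))  # wss://user-space.hf.space/ws
```

This is why a page refresh fixes most WS errors: the client rebuilds the URL from the current origin and reconnects.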

---

## Related docs

- [UI overview](ui.md)  
- [Deployment](deployment.md)  
- [Environment design](environment_design.md)  
- [Reward design](reward_design.md)  
- [Architecture](architecture.md)