Commit 0ea9615 by TheJackBright (parent: 3e0e65c): Write public PolyGuard project blog (adds blog.md).
# PolyGuard OpenEnv: Training Medication-Safety Agents Inside a Verifier-Backed World

Someone does not experience an unsafe medication regimen as "polypharmacy."
They experience it as dizziness after a new sleep medication, bleeding after a
painkiller is added to a blood thinner, confusion from a sedative-opioid
combination, or a preventable emergency visit because five prescribers each saw
one slice of the medication list.

The dangerous part is often not a single drug. It is the combination: the wrong
pair, the wrong dose in the wrong organ-function context, the missing lab, the
duplicated class, the abrupt stop that should have been a taper, or the model
that confidently says "looks fine" because it was never forced to act inside a
safety-checked environment.

That is the problem PolyGuard was built for.

The [CDC medication-safety data page](https://www.cdc.gov/medication-safety/data-research/facts-stats/index.html)
reports that adverse drug events send more than 1.5 million people to emergency
departments in the United States every year, with almost 500,000
hospitalizations. Adults 65 and older account for more than 600,000 of those
visits. A CDC-authored [JAMA surveillance study](https://jamanetwork.com/journals/jama/fullarticle/2585977)
found that older adults made up 34.5 percent of outpatient adverse-drug-event
ED visits and had the highest hospitalization rate, 43.6 percent. Globally, the
[WHO Medication Without Harm challenge](https://www.who.int/initiatives/medication-without-harm)
estimates the cost associated with medication errors at USD 42 billion
annually. AHRQ's deprescribing safety review summarizes estimates that
[45 percent of older adults are exposed to polypharmacy and 58 percent to
potentially inappropriate medications](https://www.ncbi.nlm.nih.gov/books/NBK600387/).

Not every adverse drug event is caused by an incorrect drug combination. But
these numbers describe the harm surface PolyGuard targets: medication decisions
where combination risk, monitoring gaps, frailty, organ function, uncertainty,
and action sequencing all matter at once.

PolyGuard turns that problem into an OpenEnv-compatible reinforcement-learning
environment for polypharmacy safety, medication optimization, deprescribing,
safe substitution, missing-evidence recovery, and precision dosing. A language
model policy observes a constrained patient/regimen state, chooses one legal
candidate action, receives verifier-backed reward, and improves through SFT
plus GRPO-style post-training.

It is not medical software and it is not clinical advice. It is a controlled
research environment for studying how language-model policies can be trained,
audited, and stress-tested on safety-critical medication action selection.

## What To Open First

- GitHub repository:
  [Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK](https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK)
- Live product Space:
  [TheJackBright/polyguard-openenv-workbench](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-workbench)
- One-run Colab/HF notebook:
  [PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb)
- Final evidence index:
  [polyguard-rl/docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md)
- Artifact and traceability guide:
  [polyguard-rl/docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md)
- Final artifact/evidence Space:
  [adithya9903/polyguard-openenv-final-artifacts](https://huggingface.co/spaces/adithya9903/polyguard-openenv-final-artifacts)

The final artifact/evidence Space hosts the Qwen 3B artifact bundle. The Qwen
0.5B and 1.5B runs were trained using a second Hugging Face account, so their
model artifacts could not be hosted in the same final Space. Their report
mirrors are checked into this repo:
[0.5B reports](polyguard-rl/docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-0-5b-instruct)
and
[1.5B reports](polyguard-rl/docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-1-5b-instruct).

## The Research Bet

Medication safety is combinatorial, partially observable, and high stakes. A
useful policy has to do more than generate a plausible answer. It has to notice
drug-drug interaction risk, reason about comorbidities and organ function,
respect taper and monitoring requirements, choose safe substitutions, abstain
or ask for review when uncertainty is high, and expose why it acted.

The machine-learning pressure is just as real. If a medication vocabulary has
500 drugs, the number of possible five-drug combinations is:

```text
C(500, 5) = 255,244,687,600
```
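The count above can be reproduced with the Python standard library:

```python
import math

# Distinct five-drug combinations from a 500-drug vocabulary.
n_combinations = math.comb(500, 5)
print(f"{n_combinations:,}")  # 255,244,687,600
```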

Exhaustive search is not a serious option. The paper that inspired this
project, [Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy](https://arxiv.org/abs/2212.05190),
frames dangerous polypharmacy discovery as a bandit search problem over a huge
combination space. It benchmarks neural bandit search over simulated
polypharmacy datasets with 500 drugs and 100,000 distinct combinations, and
reports detection of up to 72 percent of potentially inappropriate
polypharmacies with 99 percent average precision after 30,000 time steps.

PolyGuard borrows that search instinct, then moves the problem from offline
combination mining into an agentic environment. The policy sees a patient
state, chooses among legal clinical action candidates, and is judged by a
deterministic verifier and reward router rather than by free-form text
preference alone.

The research question is narrow and concrete:

> Can environment-backed feedback make a small open model better at safe
> medication action selection than prompt-only, first-legal, rule-only, or
> single-agent baselines?

The answer in this repository is an inspectable system:

1. A finite-horizon OpenEnv simulation for medication decisions.
2. A constrained action space, so the model chooses candidate actions instead
   of inventing arbitrary clinical instructions.
3. A legality verifier that prevents unsafe state mutation.
4. Thirteen reward components rolled into four primary reward channels.
5. A multi-agent policy stack with supervisor routing, contextual bandit
   reranking, planner selection, critic veto, and explanation logging.
6. SFT for format and clinical-prior warm start.
7. GRPO with environment-backed reward, not an opaque LLM judge.
8. Agentic evaluation with baseline comparison, policy ablations, post-save
   inference, robustness checks, action traces, and failure mining.

![PolyGuard system architecture](polyguard-rl/docs/assets/diagrams/system_architecture.png)

## A Failure Trace That Motivated the Design

In the final matched-seed traces, the failure mode is not abstract. On seeds
`8000` and `8004`, the basic prompt-style proxy repeatedly chose `cand_01`,
the first legal candidate. In those cases, `cand_01` meant `KEEP_REGIMEN` while
a hidden `warfarin_like` + `nsaid_like` interaction remained unresolved. The
verifier recorded `holdout_ddi_not_addressed`.

The full PolyGuard pipeline selected `cand_03`, a safer intervention candidate,
and avoided those failure reasons.

That is the core argument of the project: medication AI should be judged inside
a stateful safety environment, not only by whether its answer sounds clinically
plausible.

Internal evidence:
[basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json)
and
[action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl).

## Safety Contract

PolyGuard does not let a model directly mutate a medication list from free
text. Every decision is candidate-based, verifier-checked, reward-decomposed,
and traced. Illegal actions can be scored, penalized, and logged, but they do
not change patient state.

The repo evidence for this contract is spread across the environment, rules,
and final reports:

| Claim | Repo evidence |
| --- | --- |
| Hard contraindication examples are represented | [app/knowledge/ddi_knowledge.py](polyguard-rl/app/knowledge/ddi_knowledge.py) |
| Safer alternatives are explicit | [app/knowledge/substitution_rules.py](polyguard-rl/app/knowledge/substitution_rules.py) |
| Unsafe substitutions and dose escalations are blocked before state mutation | [app/env/verifier.py](polyguard-rl/app/env/verifier.py) |
| Reward hacking and loop-like behavior are surfaced | [app/env/anti_cheat.py](polyguard-rl/app/env/anti_cheat.py), [docs/reward_design.md](polyguard-rl/docs/reward_design.md) |
| Baseline failure is traceable by seed and candidate | [basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json), [action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl) |
| Final claims are separated from older smoke artifacts | [final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md) |

## Project Map

The implementation lives under [polyguard-rl/](polyguard-rl/).

| Area | Key paths |
| --- | --- |
| OpenEnv runtime | [openenv.yaml](polyguard-rl/openenv.yaml), [app/env/env_core.py](polyguard-rl/app/env/env_core.py), [app/env/fastapi_app.py](polyguard-rl/app/env/fastapi_app.py), [server/app.py](polyguard-rl/server/app.py) |
| Action/state contracts | [app/common/types.py](polyguard-rl/app/common/types.py), [app/common/enums.py](polyguard-rl/app/common/enums.py) |
| Candidate generation and verifier | [app/models/policy/candidate_builder.py](polyguard-rl/app/models/policy/candidate_builder.py), [app/env/verifier.py](polyguard-rl/app/env/verifier.py) |
| Reward and anti-cheat | [app/env/reward_router.py](polyguard-rl/app/env/reward_router.py), [app/env/reward_scaling.py](polyguard-rl/app/env/reward_scaling.py), [app/env/anti_cheat.py](polyguard-rl/app/env/anti_cheat.py), [configs/rewards.yaml](polyguard-rl/configs/rewards.yaml) |
| Multi-agent policy | [app/agents/](polyguard-rl/app/agents/), [docs/agents.md](polyguard-rl/docs/agents.md) |
| Bandits and baselines | [app/models/baselines/contextual_bandit.py](polyguard-rl/app/models/baselines/contextual_bandit.py), [app/models/baselines/contextual_bandit_policy.py](polyguard-rl/app/models/baselines/contextual_bandit_policy.py), [app/models/baselines/](polyguard-rl/app/models/baselines/) |
| Training | [app/training/](polyguard-rl/app/training/), [scripts/train_sft_trl.py](polyguard-rl/scripts/train_sft_trl.py), [scripts/train_grpo_trl.py](polyguard-rl/scripts/train_grpo_trl.py), [docs/training.md](polyguard-rl/docs/training.md) |
| Data | [data/raw/knowledge/drug_knowledge.json](polyguard-rl/data/raw/knowledge/drug_knowledge.json), [data/scenarios/](polyguard-rl/data/scenarios/), [docs/datasets.md](polyguard-rl/docs/datasets.md) |
| Evaluation | [app/evaluation/](polyguard-rl/app/evaluation/), [scripts/evaluate_all.py](polyguard-rl/scripts/evaluate_all.py), [docs/evaluation.md](polyguard-rl/docs/evaluation.md) |
| Product API/UI | [app/api/](polyguard-rl/app/api/), [app/ui/frontend/](polyguard-rl/app/ui/frontend/), [docs/ui.md](polyguard-rl/docs/ui.md) |
| Math | [docs/math.md](polyguard-rl/docs/math.md), [docs/mathematics.md](polyguard-rl/docs/mathematics.md) |
| Results | [docs/results/final_submission_evidence/](polyguard-rl/docs/results/final_submission_evidence/) |

Supporting docs include [architecture.md](polyguard-rl/docs/architecture.md),
[environment_design.md](polyguard-rl/docs/environment_design.md),
[reward_design.md](polyguard-rl/docs/reward_design.md),
[safety.md](polyguard-rl/docs/safety.md),
[precision_dosing.md](polyguard-rl/docs/precision_dosing.md),
[graph_models.md](polyguard-rl/docs/graph_models.md),
[ablations.md](polyguard-rl/docs/ablations.md),
[api.md](polyguard-rl/docs/api.md),
[deployment.md](polyguard-rl/docs/deployment.md),
[ui.md](polyguard-rl/docs/ui.md),
[DEMO_RECORDING_SCRIPT.md](polyguard-rl/docs/DEMO_RECORDING_SCRIPT.md), and
[submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md).

## The OpenEnv Environment

At the center is `PolyGuardEnv`, implemented in
[app/env/env_core.py](polyguard-rl/app/env/env_core.py). It follows the
OpenEnv/Gym shape:

```text
reset(seed, difficulty, sub_environment, scenario_id, patient_id)
  -> PolyGuardObservation

step(PolyGuardAction)
  -> (PolyGuardObservation, reward, done, info)
```

At reset, the environment loads or generates a patient scenario, selects a
difficulty and sub-environment, computes a risk summary, builds candidate
actions, estimates uncertainty, and emits a strict observation. At step time,
the environment parses the action, checks legality, evaluates anti-cheat rules,
mutates state only if the action is safe, computes decomposed reward, appends a
trace, and returns detailed `info` fields such as failure reasons, transition
delta, primary reward channels, invalid-action count, and timeout checks.
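For intuition, the contract can be mimicked with a toy stand-in. This is an
illustrative sketch only: the real `PolyGuardEnv` lives in the repo, and its
observations, rewards, and termination logic are far richer than the
placeholders used here.

```python
from dataclasses import dataclass

# Toy stand-in for the reset/step contract described above. NOT the real
# PolyGuardEnv: candidate lists, reward values, and termination rules are
# invented placeholders that only mirror the interface shape.
@dataclass
class ToyPolyGuardEnv:
    max_steps: int = 4
    step_count: int = 0

    def reset(self, seed=0, difficulty="easy", sub_environment="DDI"):
        self.step_count = 0
        # A real observation also carries patient summary, warnings, budgets, etc.
        return {"candidates": ["cand_01", "cand_02", "cand_03"], "warnings": []}

    def step(self, action):
        self.step_count += 1
        legal = action.get("candidate_id") in ("cand_01", "cand_02", "cand_03")
        reward = 0.7 if legal else 0.001          # placeholder scalar reward
        done = (not legal) or self.step_count >= self.max_steps
        info = {"failure_reasons": [] if legal else ["illegal_candidate"]}
        obs = {"candidates": ["cand_01", "cand_02", "cand_03"], "warnings": []}
        return obs, reward, done, info

env = ToyPolyGuardEnv()
obs = env.reset(seed=8000)
obs, reward, done, info = env.step({"candidate_id": "cand_03"})
```

The point of the shape is that reward and `info` come back from the same call
that advances state, so a trainer can treat each step as a scored transition.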

![Runtime step flow](polyguard-rl/docs/assets/diagrams/runtime_step_flow.png)

PolyGuard is not one task. It cycles through specialized sub-environments:

| Sub-environment | What it stresses |
| --- | --- |
| `DDI` | High-risk drug-drug interaction recognition and resolution |
| `BANDIT_MINING` | Candidate exploration and shortlist/ranking behavior inspired by bandit search |
| `REGIMEN_RISK` | General medication burden and regimen optimization |
| `PRECISION_DOSING` | Dose-hold, dose reduction, renal/hepatic guardrails, monitoring decisions |
| `LONGITUDINAL_DEPRESCRIBING` | Multi-step taper/deprescribing behavior over a longer horizon |
| `WEB_SEARCH_MISSING_DATA` | Evidence fetch or review when critical data is missing |
| `ALTERNATIVE_SUGGESTION` | Safe alternatives and within-class substitution |
| `NEW_DRUG_DECOMPOSITION` | First-pass reasoning over an unknown or combination medication |

The curriculum in [app/env/curriculum.py](polyguard-rl/app/env/curriculum.py)
starts with short easy DDI/regimen-risk episodes, then adds bandit and
alternative-selection tasks, and finally hard cases with precision dosing,
longitudinal deprescribing, missing data, and new-drug decomposition.

### State and Observation

The latent state is represented by `PolyGuardState` and includes patient
demographics, active decision mode, step budget, medications, dose buckets,
comorbidities, labs, vitals, frailty, adherence, monitoring gaps, prior adverse
event history, burden score, severe-pair count, precision dosing flags,
unresolved conflicts, action history, cumulative reward, and done state.

The agent does not get every simulator internal. It receives a controlled
`PolyGuardObservation` with a patient summary, medication table, comorbidity
summary, organ function, labs/vitals, graph safety summary, burden summary,
precision dosing flags, unresolved conflicts, candidate actions, step budget,
action history, warnings, abstention indicators, seed, scenario, difficulty,
and sub-environment.

This split matters. PolyGuard is a partially observable environment. Missing
labs and unresolved conflicts are visible as uncertainty signals, not as hidden
reward traps.

## Action Space and Safety Constraints

PolyGuard deliberately avoids unconstrained text actions. The policy chooses a
strict `PolyGuardAction` with fields such as `mode`, `action_type`,
`target_drug`, `replacement_drug`, `dose_bucket`, `taper_days`,
`monitoring_plan`, `evidence_query`, `new_drug_name`, `candidate_components`,
`candidate_id`, `confidence`, and `rationale_brief`.
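For concreteness, a filled-in action might serialize to something like the
dictionary below. Only the field names come from the schema above; every value,
including the `mode` label, is invented for illustration.

```python
# Illustrative PolyGuardAction payload. Field names follow the schema listed
# above; the values (and the "REGIMEN_OPT" mode label) are hypothetical.
action_payload = {
    "mode": "REGIMEN_OPT",
    "action_type": "SUBSTITUTE_WITHIN_CLASS",
    "target_drug": "nsaid_like",
    "replacement_drug": "acetaminophen_like",
    "dose_bucket": None,
    "taper_days": None,
    "monitoring_plan": "recheck pain control in 2 weeks",
    "evidence_query": None,
    "new_drug_name": None,
    "candidate_components": None,
    "candidate_id": "cand_03",
    "confidence": 0.62,
    "rationale_brief": "Resolve warfarin_like + nsaid_like interaction risk.",
}
```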

The action types are compact:

| Family | Action types |
| --- | --- |
| Regimen | `KEEP_REGIMEN`, `STOP_DRUG`, `SUBSTITUTE_WITHIN_CLASS`, `RECOMMEND_ALTERNATIVE` |
| Dosing | `REDUCE_DOSE_BUCKET`, `INCREASE_DOSE_BUCKET`, `DOSE_HOLD`, `ORDER_MONITORING_AND_WAIT` |
| Deprescribing | `TAPER_INITIATE`, `TAPER_CONTINUE` |
| Evidence and uncertainty | `FETCH_EXTERNAL_EVIDENCE`, `DECOMPOSE_NEW_DRUG`, `REQUEST_SPECIALIST_REVIEW`, `REQUEST_PHARMACIST_REVIEW` |

The candidate builder in
[app/models/policy/candidate_builder.py](polyguard-rl/app/models/policy/candidate_builder.py)
generates a bounded candidate set:

```text
3 <= |C_t| <= 10
```

Each candidate carries estimated safety delta, burden delta, disease stability,
uncertainty score, rationale tags, required monitoring, and a legality
precheck. Policy selection is candidate selection:

```text
a_t = to_action(c_t), where c_t is in C_t
```

The verifier in [app/env/verifier.py](polyguard-rl/app/env/verifier.py)
enforces hard safety constraints before state mutation. It checks that target
drugs exist when required, substitutions and alternatives are allowed, evidence
domains are allowlisted, new-drug decomposition includes required components,
taper-required drugs are not stopped abruptly, renal/hepatic unsafe dose
escalation is blocked, duplicate therapy and contraindicated replacement pairs
are blocked, and monitoring/hold actions include a monitoring plan.

Illegal actions can receive reward penalties and become visible in traces, but
they do not mutate patient state.

## Multi-Agent Policy Stack

The "agents" in PolyGuard are an auditable policy factorization rather than
free-form independent chatbots. A step flows through:

```text
MedRec -> Evidence -> GraphSafety -> Dosing -> Candidate
  -> Supervisor -> Planner -> Critic -> Env -> Explainer
```

![Multi-agent orchestration](polyguard-rl/docs/assets/diagrams/multi_agent_orchestration.png)

| Agent/module | Role |
| --- | --- |
| `MedRecAgent` | Summarizes current regimen and medication burden |
| `EvidenceAgent` | Retrieves local or fallback evidence when missing data is present |
| `GraphSafetyAgent` | Scores risky pairs, side-effect load, duplicate therapy, and graph safety patterns |
| `DosingAgent` | Detects dose-sensitive cases and dose-hold opportunities |
| `CandidateAgent` | Exposes legal candidate actions from the environment candidate builder |
| `SupervisorAgent` | Routes to regimen optimization, dose optimization, or review mode |
| `PlannerAgent` | Selects an action from candidates through the policy provider |
| `CriticAgent` | Vetoes illegal or unsafe proposed actions and can force review fallback |
| `ExplainerAgent` | Records grounded rationale for demo, replay, and audit |

The orchestration modes are `sequential_pipeline`, `supervisor_routed`,
`replan_on_veto`, and `lightweight_debate`. Policy-stack ablations compare
`bandit-only`, `llm-only`, and `llm+bandit`.

## Contextual Bandits

PolyGuard uses contextual bandits as an inspectable candidate-reranking layer.
This is where the project most directly echoes the arXiv bandit inspiration:
unsafe polypharmacy search is combinatorial, so the system should learn which
regions of the candidate/action space are worth exploring rather than enumerate
everything.

Each candidate becomes an 8-dimensional feature vector:

```text
x(c) = [
  1,
  I[legality_precheck],
  estimated_safety_delta,
  burden_delta,
  disease_stability_estimate,
  1 - uncertainty_score,
  I[mode = DOSE_OPT],
  I[mode = REVIEW]
]
```

An arm is keyed by macro mode and action type:

```text
arm(c) = mode(c) || ":" || action_type(c)
```

The LinUCB variant maintains, for each arm `a`:

```text
A_a = I + sum x x^T
b_a = sum r x
theta_a = A_a^{-1} b_a

score_a(x) = theta_a^T x + alpha * sqrt(x^T A_a^{-1} x)
```

There is also a Thompson-style variant:

```text
score_a(x) = theta_a^T x + Normal(0, alpha)
```

This layer can shortlist candidates before the planner emits the final action.
It is deliberately kept inside the candidate space: the bandit can improve
ordering and exploration, but it cannot invent an unsafe action outside the
environment contract.
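The LinUCB bookkeeping above fits in a short NumPy sketch. This is not the
repo's implementation; the `alpha` value and the example feature values are
arbitrary, and only the update/score formulas mirror the text.

```python
import numpy as np

# Minimal LinUCB arm mirroring the formulas above (illustrative, not repo code).
class LinUCBArm:
    def __init__(self, dim=8, alpha=0.5):
        self.A = np.eye(dim)      # A_a = I + sum x x^T
        self.b = np.zeros(dim)    # b_a = sum r x
        self.alpha = alpha

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b    # theta_a = A_a^{-1} b_a
        # Exploitation term plus optimism-under-uncertainty bonus.
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, r):
        self.A += np.outer(x, x)
        self.b += r * x

# One candidate feature vector in the x(c) layout defined above.
x = np.array([1.0, 1.0, 0.3, -0.1, 0.8, 0.6, 0.0, 0.0])
arm = LinUCBArm()
arm.update(x, r=0.7)
ucb = arm.score(x)
```

A reranker would keep one such arm per `mode:action_type` key and sort
candidates by their arm scores before the planner chooses.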

## Reward Model

The reward model is decomposed on purpose. A single scalar reward is needed for
RL, but safety-critical RL needs more than one opaque number. PolyGuard logs 13
component columns and four primary channels on every step.

![Reward decomposition](polyguard-rl/docs/assets/diagrams/reward_decomposition.png)

All reward values are clamped and quantized:

```text
q(x) = round(clip(x, 0.001, 0.999), 3)
```

The 13 reward components are:

| Component | Weight | Meaning |
| --- | ---: | --- |
| `format_compliance_score` | 0.08 | Action payload conforms to the schema |
| `candidate_alignment_score` | 0.08 | The model selected a valid candidate-style id |
| `legality_score` | 0.12 | The verifier accepted the action |
| `safety_delta_score` | 0.15 | Severe-pair and burden risk decreased |
| `burden_improvement_score` | 0.08 | Dose-weighted medication burden improved |
| `disease_stability_score` | 0.10 | The action did not destabilize underlying disease management |
| `dosing_quality_score` | 0.08 | Dose-sensitive routing/action quality |
| `abstention_quality_score` | 0.06 | Review/abstention is appropriate under uncertainty |
| `efficiency_score` | 0.06 | The action uses the finite step budget well |
| `process_fidelity_score` | 0.06 | The action follows task-specific process expectations |
| `explanation_grounding_score` | 0.03 | The rationale is present and grounded |
| `anti_cheat_score` | 0.06 | Reward-hacking checks did not fire |
| `uncertainty_calibration_score` | 0.04 | Confidence matches observable uncertainty |

The scalar reward is a weighted average:

```text
R_env(s_t, a_t, s_{t+1}) = q( sum_i w_i c_i / sum_i w_i )
```

Safety-heavy terms dominate the total weight:

```text
legality + safety_delta + burden + disease_stability + anti_cheat
  = 0.12 + 0.15 + 0.08 + 0.10 + 0.06
  = 0.51
```
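Putting the table and formulas together, a minimal sketch of the scalar reward
(weights copied from the table above; the component values fed in at the end
are illustrative):

```python
# Weights from the 13-component table above; they sum to 1.0.
WEIGHTS = {
    "format_compliance": 0.08, "candidate_alignment": 0.08,
    "legality": 0.12, "safety_delta": 0.15, "burden_improvement": 0.08,
    "disease_stability": 0.10, "dosing_quality": 0.08,
    "abstention_quality": 0.06, "efficiency": 0.06,
    "process_fidelity": 0.06, "explanation_grounding": 0.03,
    "anti_cheat": 0.06, "uncertainty_calibration": 0.04,
}

def q(x: float) -> float:
    """Clamp to [0.001, 0.999], then quantize to 3 decimals."""
    return round(min(max(x, 0.001), 0.999), 3)

def scalar_reward(components: dict) -> float:
    total_w = sum(WEIGHTS.values())
    return q(sum(WEIGHTS[k] * components[k] for k in WEIGHTS) / total_w)

# Illustrative: every component at 0.8 yields a scalar reward of 0.8.
reward = scalar_reward({k: 0.8 for k in WEIGHTS})
```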

The four primary reward channels are:

| Channel | Component family |
| --- | --- |
| `safety_legality` | legality, candidate alignment, anti-cheat, uncertainty calibration |
| `clinical_improvement` | safety delta, burden improvement, disease stability |
| `dosing_quality` | dosing quality and abstention quality |
| `process_integrity` | format compliance, efficiency, process fidelity, explanation grounding |

These channels are emitted in `info.primary_reward_channels`, GRPO reward logs,
reports, plots, and ablation summaries.

## Anti-Cheat and Failure Visibility

RL policies exploit reward functions. PolyGuard makes common shortcut failures
explicit: repeated action loops, excessive keep-regimen behavior, excessive
review/abstention behavior, candidate ID mismatch, candidate outside the legal
set, hidden high-risk DDI no-op behavior, parser exploit patterns in rationales,
and retries of failed no-op actions.

If an exploit is detected:

```text
anti_cheat_score = 0.001
done = true
termination_reason = "exploit_detection"
```

Episodes can also terminate on step budget exhaustion, repeated invalid
actions, safety-veto threshold, patient destabilization, safe resolution,
wall-clock timeout, or per-step timeout.

![Episode state machine](polyguard-rl/docs/assets/diagrams/episode_state_machine.png)

## Mathematics

PolyGuard can be read as a finite-horizon constrained partially observable
Markov decision process:

```text
M = (S, A, O, T, R, H, C)
```

where `S` is latent patient/regimen state, `A` is the constrained medication
action set, `O` is the controlled observation, `T(s' | s, a)` is the transition
function, `R(s, a, s')` is verifier-backed reward, `H` is the episode horizon,
and `C(s, a)` is the hard safety/legality constraint predicate.

The objective is:

```text
maximize_pi E_pi [ sum_{t=0}^{H-1} R(s_t, a_t, s_{t+1}) ]
subject to C(s_t, a_t) = 1 whenever possible
```

There is no explicit discount factor in the runtime. Time preference enters
through finite horizons and the efficiency reward:

```text
efficiency_t = q(1 - step_count_t / (max_steps + 1))
```

State transition is two-gated:

```text
if verifier(s_t, a_t).legal and not anti_cheat(s_t, a_t):
    s_{t+1} = T(s_t, a_t)
else:
    s_{t+1} = rollback_state_with_failed_action_record(s_t, a_t)
```

Risk-like deltas become reward through:

```text
delta_reward(pre, post) = q(0.5 + 0.6 * (pre - post))
```

For burden and contraindicated-pair improvement:

```text
burden_reward = delta_reward(pre_burden, post_burden)
pair_reward = delta_reward(pre_pairs, post_pairs)

safety_delta_score =
  q(0.65 * pair_reward + 0.35 * burden_reward) if legal
  0.001 otherwise
```
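These shaping rules are easy to check numerically. The sketch below re-defines
the `q` helper from the reward section; the pre/post risk values are
illustrative.

```python
def q(x: float) -> float:
    # Clamp to [0.001, 0.999], then quantize to 3 decimals.
    return round(min(max(x, 0.001), 0.999), 3)

def delta_reward(pre: float, post: float) -> float:
    # Improvement (pre > post) pushes reward above the 0.5 midpoint;
    # no change lands exactly at 0.5.
    return q(0.5 + 0.6 * (pre - post))

def safety_delta_score(pre_pairs, post_pairs, pre_burden, post_burden, legal):
    if not legal:
        return 0.001
    pair_reward = delta_reward(pre_pairs, post_pairs)
    burden_reward = delta_reward(pre_burden, post_burden)
    return q(0.65 * pair_reward + 0.35 * burden_reward)

# Illustrative: resolving a severe pair (1 -> 0) while slightly lowering burden.
score = safety_delta_score(1.0, 0.0, 0.6, 0.5, legal=True)
```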

GRPO uses environment execution as the reward function. For each prompt, the
model emits candidate completions; PolyGuard parses the candidate id, resets a
deterministic environment using the recorded seed and scenario fields, executes
one step, and returns reward. The training reward combines environment reward
with a legality bonus:

```text
legal_bonus = 0.95 if action is legal else 0.05

R_GRPO = q(0.80 * R_env + 0.20 * legal_bonus)
```

Conceptually, GRPO forms a within-prompt advantage:

```text
A_i = (R_i - mean_j R_j) / (std_j R_j + epsilon)
```

and optimizes a clipped policy-ratio objective with KL regularization. The
optimizer mechanics are TRL's; PolyGuard's contribution is the verifier-backed
reward function and the controlled action/state environment. The expanded
derivation is in
[polyguard-rl/docs/mathematics.md](polyguard-rl/docs/mathematics.md).
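The combined training reward and group-normalized advantage can be sketched
with the standard library. The four environment rewards below are made up, and
the choice of population standard deviation is an assumption; the formulas
otherwise follow the text.

```python
import statistics

def q(x: float) -> float:
    return round(min(max(x, 0.001), 0.999), 3)

def grpo_reward(r_env: float, legal: bool) -> float:
    # R_GRPO = q(0.80 * R_env + 0.20 * legal_bonus)
    legal_bonus = 0.95 if legal else 0.05
    return q(0.80 * r_env + 0.20 * legal_bonus)

def group_advantages(rewards, epsilon=1e-6):
    # A_i = (R_i - mean) / (std + epsilon), normalized within one prompt group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std over the group (assumption)
    return [(r - mean) / (std + epsilon) for r in rewards]

# Four completions for one prompt: three legal, one illegal.
rewards = [grpo_reward(0.7, True), grpo_reward(0.6, True),
           grpo_reward(0.8, True), grpo_reward(0.2, False)]
advs = group_advantages(rewards)
```

The illegal completion ends up with both the lowest reward and the lowest
advantage, which is exactly the gradient pressure the legality bonus is meant
to create.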

## Data and Dataset Pipeline

The data pipeline builds a compact medication-safety substrate from local drug
knowledge, synthetic patients, scenario files, retrieval text, and optional
external augmentation.

![Data and training pipeline](polyguard-rl/docs/assets/diagrams/data_training_pipeline.png)

The dataset design is documented in
[docs/datasets.md](polyguard-rl/docs/datasets.md). The locally generated
pipeline produces these processed artifacts and counts:

| Artifact | Count | Path |
| --- | ---: | --- |
| Normalized drug rows | 10 | `data/processed/normalized_drugs.parquet` |
| Drug class rows | 10 | `data/processed/drug_classes.parquet` |
| Interaction rows | 2 | `data/processed/interactions.parquet` |
| Graph edges | 18 | `data/processed/graph_edges.parquet` |
| Synthetic patients | 20 | `data/processed/patients_synthetic.parquet` |
| Retrieval documents | 8 | `data/processed/retrieval_corpus.jsonl` |
| Easy scenarios | 100 | [data/scenarios/easy/](polyguard-rl/data/scenarios/easy/) |
| Medium scenarios | 200 | [data/scenarios/medium/](polyguard-rl/data/scenarios/medium/) |
| Hard scenarios | 200 | [data/scenarios/hard/](polyguard-rl/data/scenarios/hard/) |
| Local small SFT rows | 80 | `data/processed/training_corpus_sft.jsonl` |
| Local small GRPO prompts | 80 | `data/processed/training_corpus_grpo_prompts.jsonl` |

The provenance manifest generated by the local pipeline records source policy
and counts at `data/processed/provenance_manifest.json`.
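
The exact manifest schema is not reproduced in this post, so the following is a
hedged sketch of how a reviewer might pull the recorded counts out of
`data/processed/provenance_manifest.json`; the generic walk makes no
assumptions about key names:

```python
import json
from pathlib import Path

def collect_counts(node, prefix=""):
    """Recursively collect integer leaves from a parsed manifest dict."""
    counts = {}
    if isinstance(node, dict):
        for key, value in node.items():
            counts.update(collect_counts(value, f"{prefix}{key}."))
    elif isinstance(node, bool):
        pass  # bool is an int subclass; skip true/false flags
    elif isinstance(node, int):
        counts[prefix.rstrip(".")] = node
    return counts

def load_provenance_counts(path="data/processed/provenance_manifest.json"):
    """Load the manifest from disk and return its integer counts by key path."""
    return collect_counts(json.loads(Path(path).read_text()))
```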

Additional data-governance and rule artifacts are produced or consumed by the
pipeline:

| Artifact | Why it matters |
| --- | --- |
| `data/processed/ingested_sources.json` | Source ingestion ledger used by the local build |
| `data/processed/feature_dictionary.json` | Names and meanings of structured model features |
| `data/processed/burden_rules.yaml` | Medication-burden and duplicate-therapy rules |
| `data/processed/substitution_rules.yaml` | Data-level safer-substitution rules |
| `data/processed/taper_rules.yaml` | Deprescribing and taper requirements |
| `data/retrieval_index/index.json` | Retrieval index over local evidence chunks |

The local knowledge seed is
[data/raw/knowledge/drug_knowledge.json](polyguard-rl/data/raw/knowledge/drug_knowledge.json).
It contains drug classes, example high-risk pairs, renal and hepatic flags,
side-effect tags, substitution rules, and taper requirements. The processed
tables then feed graph modeling, candidate generation, environment scenarios,
retrieval, SFT rows, and GRPO prompts.
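
To make the seed's role concrete, here is a hypothetical miniature of the
high-risk-pair screening that such a seed enables. The two pairs are the ones
named later in this post, but the data structure and function names are
illustrative, not the repo's schema:

```python
from itertools import combinations

# Illustrative miniature of the seed's high-risk pairs; not the actual file.
HIGH_RISK_PAIRS = {
    frozenset({"warfarin_like", "nsaid_like"}),
    frozenset({"benzodiazepine_like", "opioid_like"}),
}

def screen_regimen(drug_classes):
    """Return every known high-risk class pair present in a regimen."""
    return [
        tuple(sorted(pair))
        for pair in (frozenset(p) for p in combinations(set(drug_classes), 2))
        if pair in HIGH_RISK_PAIRS
    ]
```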

The full training/evidence runs used 2,000 examples per Qwen model, recorded in
the final reports under
[docs/results/final_submission_evidence/reports/](polyguard-rl/docs/results/final_submission_evidence/reports/).

## Models Inside the Environment

PolyGuard combines learned and rule-backed components:

- Graph safety model:
  [app/models/graph/](polyguard-rl/app/models/graph/) produces regimen
  embeddings, pairwise DDI severity, severe-alert probability, and side-effect
  tag probabilities.
- Tabular risk model:
  [app/models/tabular/](polyguard-rl/app/models/tabular/) supports calibrated
  patient/regimen risk heads and evaluation.
- Dosing model:
  [app/models/dosing/](polyguard-rl/app/models/dosing/) models dose-sensitive
  states with target attainment, toxicity, underdose risk, organ stress,
  interaction load, and monitoring need.
- Retrieval:
  [app/models/retrieval/](polyguard-rl/app/models/retrieval/) and
  [app/knowledge/](polyguard-rl/app/knowledge/) provide local evidence chunks,
  drug rules, renal/hepatic guardrails, duplicate-therapy rules, substitution
  rules, taper rules, burden scoring, and a side-effect ontology.
- Active model runtime:
  [app/models/policy/active_model.py](polyguard-rl/app/models/policy/active_model.py)
  discovers activated artifacts from `checkpoints/active/active_model_manifest.json`;
  the tracked evidence mirror includes
  [docs/results/active_model_manifest.json](polyguard-rl/docs/results/active_model_manifest.json).
  The provider load order prefers a GRPO adapter, then a merged model, then an
  SFT adapter.
- Provider runtime:
  [app/models/policy/provider_runtime.py](polyguard-rl/app/models/policy/provider_runtime.py)
  is Transformers-first, with optional Ollama when enabled. If model loading is
  unavailable, the runtime falls back to deterministic safety ranking.
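
The load-order preference can be illustrated with a small resolver. The
manifest key names and the function below are assumptions for illustration;
only the preference order and the deterministic fallback are documented
behavior:

```python
# Documented preference: GRPO adapter, then merged model, then SFT adapter.
PREFERENCE = ("grpo_adapter", "merged_model", "sft_adapter")

def resolve_artifact(manifest):
    """Return the first available artifact kind, or the documented fallback."""
    for kind in PREFERENCE:
        path = manifest.get(kind)
        if path:
            return kind, path
    # No loadable model artifact: deterministic safety ranking takes over.
    return "deterministic_safety_ranking", None
```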

Tracked support-model reports show that the environment is not only an LLM
wrapper:

| Component | Report | Current tracked result |
| --- | --- | --- |
| Graph model | [docs/results/graph_train.json](polyguard-rl/docs/results/graph_train.json) | `status: trained`, `num_samples: 180`, artifact path `outputs/models/graph_model.pkl` |
| Tabular risk model | [docs/results/risk_train.json](polyguard-rl/docs/results/risk_train.json) | `status: trained`, `dataset_size: 180`, `train_mae: 0.0033`, artifact path `outputs/models/tabular_risk.pkl` |
| Dose surrogate model | [docs/results/dose_train.json](polyguard-rl/docs/results/dose_train.json) | `status: trained`, `dataset_size: 120`, `train_mae: 0.0025`, artifact path `outputs/models/dose_model.pkl` |

The hard-coded contraindicated seed pairs in
[app/knowledge/ddi_knowledge.py](polyguard-rl/app/knowledge/ddi_knowledge.py)
include `warfarin_like` + `nsaid_like` and `benzodiazepine_like` +
`opioid_like`. Substitution rules in
[app/knowledge/substitution_rules.py](polyguard-rl/app/knowledge/substitution_rules.py)
include safer alternatives such as `nsaid_like -> acetaminophen_like`,
`nsaid_like -> topical_nsaid_like`,
`benzodiazepine_like -> non_benzo_sleep_support`, and
`opioid_like -> non_opioid_analgesic`.
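
The quoted substitution pairs can be read as a class-to-alternatives table. The
sketch below applies one such rule to a regimen; the dictionary form and
function name are illustrative rather than the repo's actual representation:

```python
# Safer-substitution pairs quoted above; the table form is illustrative.
SUBSTITUTIONS = {
    "nsaid_like": ["acetaminophen_like", "topical_nsaid_like"],
    "benzodiazepine_like": ["non_benzo_sleep_support"],
    "opioid_like": ["non_opioid_analgesic"],
}

def propose_substitutions(regimen, flagged_class):
    """Return candidate regimens with the flagged class swapped for safer ones."""
    return [
        [alt if drug == flagged_class else drug for drug in regimen]
        for alt in SUBSTITUTIONS.get(flagged_class, [])
    ]
```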

### Precision Dosing

Precision dosing uses sensitive classes such as anticoagulants, sedatives, and
glucose-lowering drugs. The dosing agent and surrogate model are implemented in
[app/agents/dosing_agent.py](polyguard-rl/app/agents/dosing_agent.py) and
[app/models/dosing/](polyguard-rl/app/models/dosing/).

The surrogate PK/PD transition in
[app/models/dosing/surrogate_pkpd.py](polyguard-rl/app/models/dosing/surrogate_pkpd.py)
uses effect, toxicity, underdose, organ stress, and interaction load:

```text
effective_delta = dose_delta * (1 - min(0.6, organ_factor * 0.4))

effect' =
  clip(effect + 0.28 * effective_delta - 0.05 * interaction_factor, 0, 1)

toxicity_gain =
  max(0, dose_delta) * (0.35 + 0.25 * organ_factor + 0.20 * interaction_factor)

toxicity' =
  clip(0.85 * toxicity + toxicity_gain, 0, 1)

underdose' =
  clip(1 - effect' + 0.15 * max(0, -dose_delta), 0, 1)
```

The higher-level dosing metrics use target attainment, toxicity avoidance,
underdose risk, and monitoring need:

```text
target_attainment = 1 - abs(effect_level - 0.62)
toxicity_proxy = toxicity_level + 0.20 * organ_stress + 0.12 * interaction_load
measurement_need = max(toxicity_proxy, underdose_proxy)
```
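
The two pseudocode blocks above translate directly into Python. This is a
transcription of the published update rules with my own function signatures,
meant for inspection rather than as the repo's implementation:

```python
def clip(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))

def surrogate_step(effect, toxicity, dose_delta, organ_factor, interaction_factor):
    """One surrogate PK/PD transition, following the update rules above."""
    effective_delta = dose_delta * (1 - min(0.6, organ_factor * 0.4))
    effect_next = clip(effect + 0.28 * effective_delta - 0.05 * interaction_factor)
    toxicity_gain = max(0.0, dose_delta) * (
        0.35 + 0.25 * organ_factor + 0.20 * interaction_factor
    )
    toxicity_next = clip(0.85 * toxicity + toxicity_gain)
    underdose_next = clip(1 - effect_next + 0.15 * max(0.0, -dose_delta))
    return effect_next, toxicity_next, underdose_next

def dosing_metrics(effect_level, toxicity_level, organ_stress,
                   interaction_load, underdose_proxy):
    """Higher-level dosing metrics from the second block above."""
    target_attainment = 1 - abs(effect_level - 0.62)
    toxicity_proxy = toxicity_level + 0.20 * organ_stress + 0.12 * interaction_load
    measurement_need = max(toxicity_proxy, underdose_proxy)
    return target_attainment, toxicity_proxy, measurement_need
```

Note the asymmetry: only positive dose deltas add toxicity, while the organ
factor both damps the effective dose change and amplifies the toxicity gain.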

## Training and Post-Training

The training stack is deliberately staged:

1. Build structured data, scenarios, retrieval records, SFT examples, and GRPO
   prompts.
2. Run SFT with TRL to teach the model the candidate-id format and obvious
   clinical priors.
3. Run GRPO with environment-backed reward, where sampled candidate completions
   are executed in PolyGuardEnv and scored by the verifier/reward router.
4. Track sampled generations, reward components, primary reward channels,
   legality, anti-cheat events, and training curves.
5. Run policy-stack ablations and baseline comparisons.
6. Merge or export adapters safely.
7. Validate post-save inference from the saved artifact, not from an in-memory
   training object.
8. Generate reports, charts, action traces, and final artifact manifests.

The relevant training source files are
[scripts/train_sft_trl.py](polyguard-rl/scripts/train_sft_trl.py),
[scripts/train_grpo_trl.py](polyguard-rl/scripts/train_grpo_trl.py),
[app/training/sft_trl.py](polyguard-rl/app/training/sft_trl.py),
[app/training/grpo_trl.py](polyguard-rl/app/training/grpo_trl.py),
[app/training/reward_functions.py](polyguard-rl/app/training/reward_functions.py),
[app/training/openenv_wrapper.py](polyguard-rl/app/training/openenv_wrapper.py),
and [app/hf_space/training_runner.py](polyguard-rl/app/hf_space/training_runner.py).

The one-run notebook is
[polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb).
It is the accessible Colab/HF workflow for building data, running checks,
launching training, pulling reports, generating charts, validating inference,
activating a model, deploying the product Space, and running acceptance checks.

The modular notebook series is:

- [01_data_building.ipynb](polyguard-rl/notebooks/01_data_building.ipynb)
- [02_knowledge_graph.ipynb](polyguard-rl/notebooks/02_knowledge_graph.ipynb)
- [03_risk_models.ipynb](polyguard-rl/notebooks/03_risk_models.ipynb)
- [04_environment_validation.ipynb](polyguard-rl/notebooks/04_environment_validation.ipynb)
- [05_sft_debug.ipynb](polyguard-rl/notebooks/05_sft_debug.ipynb)
- [06_grpo_debug.ipynb](polyguard-rl/notebooks/06_grpo_debug.ipynb)
- [07_policy_analysis.ipynb](polyguard-rl/notebooks/07_policy_analysis.ipynb)
- [08_dosing_analysis.ipynb](polyguard-rl/notebooks/08_dosing_analysis.ipynb)
- [09_training_loop.ipynb](polyguard-rl/notebooks/09_training_loop.ipynb)

For exact local and remote execution details, use
[docs/training.md](polyguard-rl/docs/training.md) and
[docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md).
This blog focuses on architecture, data, evaluation, and evidence rather than
private or environment-specific commands.

## Training Curves and Model Results

The final curated evidence lives in
[polyguard-rl/docs/results/final_submission_evidence/](polyguard-rl/docs/results/final_submission_evidence/).
It replaces earlier smoke-run charts and older 0.5B/1.5B-only views.

### SFT Loss Across Qwen Runs

![SFT loss curves across Qwen runs](polyguard-rl/docs/results/final_submission_evidence/charts/curated/training/sft_loss_curves_all_models.png)

The SFT curves, post-save valid rates, and token-accuracy histories show that
the models learned the candidate-id output contract rather than only producing
unconstrained prose. The visible curves drop from roughly `3.0-3.6` initial
loss to low final loss across all three Qwen sizes.

![Qwen 3B SFT training loss](polyguard-rl/docs/results/final_submission_evidence/charts/curated/training/qwen_3b_sft_training_loss.png)

The tracked per-model summaries are:

| Run | Model | Epochs | Final step | Runtime | Key SFT metrics |
| --- | --- | ---: | ---: | ---: | --- |
| `qwen-qwen2-5-0-5b-instruct` | `Qwen/Qwen2.5-0.5B-Instruct` | 2 | 2,000 | `234.6302s` | loss `3.0856 -> 0.0626`, best `0.0057`, train loss `0.1923`, token accuracy `0.9717`, valid rate `1.0`, avg env reward `0.726`, latency `1.839s` |
| `qwen-qwen2-5-1-5b-instruct` | `Qwen/Qwen2.5-1.5B-Instruct` | 2 | 4,000 | `483.7085s` | loss `2.9686 -> 0.0681`, best `0.0009`, train loss `0.1152`, token accuracy `0.9726`, valid rate `1.0`, avg env reward `0.726`, latency `2.158s` |
| `qwen-qwen2-5-3b-instruct` | `Qwen/Qwen2.5-3B-Instruct` | 2 | 2,000 | `715.2908s` | loss `3.5687 -> 0.054`, best `0.0022`, train loss `0.1569`, token accuracy `0.9750`, SFT avg env reward `0.781`, SFT latency `2.863s` |

Each SFT run used 2,000 examples. The 0.5B and 3B runs recorded 2,001 history
rows including the final trainer summary; the 1.5B run recorded 4,001 history
rows because its batch configuration produced 4,000 final steps.

### GRPO Reward Curve

![Qwen 3B GRPO reward curve](polyguard-rl/docs/results/final_submission_evidence/charts/curated/training/qwen_3b_grpo_reward_curve.png)

![Qwen 3B GRPO training loss](polyguard-rl/docs/results/final_submission_evidence/charts/curated/training/qwen_3b_grpo_loss_curve.png)

The complete GRPO evidence is available for Qwen 3B:

- Backend: `trl_transformers`
- Model: `Qwen/Qwen2.5-3B-Instruct`
- Records: `2000`
- Epochs: `1.0`
- Final step: `2000`
- Runtime: `6873.9375s` (`1.91h`)
- Reward samples: `4000`
- GRPO average reward: `0.767`
- GRPO reward history: min `0.376`, max `0.880`, last `0.812`, average `0.76685`
- GRPO train loss: `0.000002665`
- Post-save GRPO valid rate: `1.0`
- Post-save GRPO average environment reward: `0.726`
- Post-save GRPO average latency: `3.681s`
- Artifact path recorded in the report: `checkpoints/sweeps/qwen-qwen2-5-3b-instruct/grpo_adapter`

Source reports:
[grpo_trl_run.json](polyguard-rl/docs/results/final_submission_evidence/reports/grpo_trl_run.json),
[postsave_inference_grpo.json](polyguard-rl/docs/results/final_submission_evidence/reports/postsave_inference_grpo.json),
and
[submission_summary.json](polyguard-rl/docs/results/final_submission_evidence/reports/submission_summary.json).

### SFT vs GRPO by Model

![SFT vs GRPO verifier reward by model](polyguard-rl/docs/results/final_submission_evidence/charts/curated/model_comparison/sft_vs_grpo_reward_by_model.png)

This chart is intentionally transparent about artifact availability. Qwen 0.5B
and 1.5B have SFT reports/histories and post-save SFT evidence in the repo, but
their adapter directories were not present in the local/final artifact mirrors
at packaging time. Qwen 3B has the complete SFT-plus-GRPO artifact set.

The packaged manifest records Qwen 3B as complete with 125 checkpoint files
(`433,208,536` bytes), 11 SFT adapter files (`30,655,905` bytes), 11 GRPO
adapter files (`30,656,841` bytes), and 9 report files (`5,930,214` bytes).
Qwen 0.5B and 1.5B are retained as report/post-save evidence only.

Manifest:
[docs/results/final_submission_evidence/manifest.json](polyguard-rl/docs/results/final_submission_evidence/manifest.json).

### Product Pipeline vs Basic LLM Proxy

![Basic LLM vs full PolyGuard pipeline](polyguard-rl/docs/results/final_submission_evidence/charts/curated/product_over_basic_llm/basic_llm_vs_full_pipeline_reward.png)

Matched-seed evaluation compares a basic LLM-style first-legal proxy, an
SFT-style safety ranker, and the full PolyGuard orchestrated pipeline. The same
PolyGuard verifier/reward system judges all three.

| Policy | Episodes | Avg reward | Legality rate | Failure/exploit rate | Candidate diversity |
| --- | ---: | ---: | ---: | ---: | ---: |
| Basic LLM proxy | 8 | `0.762` | `1.0` | `0.25` | 1 |
| SFT policy proxy | 8 | `0.818` | `1.0` | `0.0` | 2 |
| Full PolyGuard pipeline | 8 | `0.805` | `1.0` | `0.0` | 2 |

The full pipeline improves average verifier reward over the basic LLM proxy by
`+0.043` while reducing the visible failure/exploit rate from `0.25` to `0.0`.

![Reward delta by matched seed](polyguard-rl/docs/results/final_submission_evidence/charts/curated/product_over_basic_llm/reward_delta_by_seed.png)

Two matched seeds expose the core failure mode: the basic policy repeatedly
kept a regimen despite the hidden `warfarin_like` + `nsaid_like` DDI holdout,
triggering `holdout_ddi_not_addressed`. The full pipeline selected safer dose
or hold candidates and avoided those failure reasons.

Source:
[basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json).

### Reward Components and Channels

![Reward component bars](polyguard-rl/docs/results/final_submission_evidence/charts/curated/reward_and_safety/reward_component_bars.png)

![Primary reward channel bars](polyguard-rl/docs/results/final_submission_evidence/charts/curated/reward_and_safety/primary_reward_channel_bars.png)

The reward charts are as important as the scalar reward curve. They show
whether the model is improving by becoming safer and more process-faithful or
merely exploiting one easy component. The reports log the full 13-component
reward vector and the four primary channels for GRPO and evaluation runs.

For Qwen 3B GRPO, the tracked average primary channels are:

| Channel | Average |
| --- | ---: |
| `safety_legality` | `0.816` |
| `clinical_improvement` | `0.609` |
| `dosing_quality` | `0.543` |
| `process_integrity` | `0.875` |

### Post-Save Inference

![Inference validity and reward](polyguard-rl/docs/results/final_submission_evidence/charts/curated/inference/inference_validity_reward.png)

Post-save inference is separate from training. The exported/activated artifact
is loaded and asked to choose candidate ids on held-out prompt samples. The
Qwen 3B GRPO adapter path produced:

- `model_source: adapter`
- `samples: 5`
- `valid_rate: 1.0`
- `avg_env_reward: 0.726`
- `avg_latency_seconds: 3.681`

The caveat matters: `valid_rate: 1.0` means the output was parseable and
executable as a candidate selection. In the five-sample Qwen 3B post-save GRPO
report, four valid samples still terminated with `exploit_detection`. That is
retained as safety evidence, because PolyGuard's job is to expose suspicious or
loop-like behavior instead of hiding it behind a clean parse metric.

## Agentic Evaluation

Evaluation is not one benchmark number. The evaluation stack under
[app/evaluation/](polyguard-rl/app/evaluation/) includes offline policy
evaluation, safety evaluation, dosing evaluation, robustness under missing labs
and noisy inputs, calibration and abstention evaluation, process fidelity,
subgroup summaries, explainability grounding, baseline comparison, policy
ablations, failure mining, and action traces.

The tracked benchmark report records:

| Metric family | Result |
| --- | --- |
| Offline avg reward | `0.772833` |
| Offline legal rate | `1.0` |
| Severe violation rate | `0.0` |
| Illegal step rate | `0.0` |
| Dosing target attainment | `0.75` |
| Dosing toxicity avoidance | `1.0` |
| Missing-labs safety rate | `0.666667` |
| Noisy-dose, conflicting-meds, alias-noise, hidden-duplicate, wrong-candidate-id, stale-evidence, delayed-ADE safety/resilience | `1.0` |
| Calibration ECE proxy | `0.08625` |
| Process fidelity | `0.92` |
| Explainability grounding | `0.8` |

Source:
[docs/results/benchmark_report.json](polyguard-rl/docs/results/benchmark_report.json).

The improvement gate compares baseline and candidate reports:

| Gate dimension | Delta |
| --- | ---: |
| Average reward | `+0.025833` |
| Legality rate | `0.0` non-regression |
| Success rate | `0.0` non-regression |
| Process fidelity | `+0.92` |
| Timeout rate | `0.0` non-regression |
| Failure visibility | `0.0` non-regression |

Source:
[docs/results/improvement_report.json](polyguard-rl/docs/results/improvement_report.json).

### Policy Ablation Results

| Stack | Avg reward | Legality | Visible failure rate | Exploit detections | Interpretation |
| --- | ---: | ---: | ---: | ---: | --- |
| `bandit_only` | `0.779625` | `1.0` | `0.0625` | 2 | Strong deterministic shortlist behavior with low failure visibility |
| `llm_only` | `0.772391` | `1.0` | `0.3043` | 7 | Legal, but more loop-like failure behavior |
| `llm+bandit` | `0.764739` | `1.0` | `0.3043` | 7 | Current combined stack needs tighter exploration/control in these ablation settings |

![Policy ablation reward](polyguard-rl/docs/results/final_submission_evidence/charts/curated/policy_ablation/policy_ablation_reward.png)

The point of these ablations is not to claim every combined policy is always
better. The point is that PolyGuard can localize behavior: legality remains
high, while failure mining shows whether a stack is looping, over-reviewing,
or selecting non-improving candidates.

Source:
[policy_ablation_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/policy_ablation_report.json).

## OpenEnv and Product Surfaces

The OpenEnv package is compact:

```yaml
spec_version: 1
name: polyguard-openenv
runtime: fastapi
app: app.env.fastapi_app:app
port: 8100
```

The OpenEnv runtime exposes `POST /reset`, `POST /step`, `GET /state`,
`GET /metadata`, `GET /schema`, `POST /mcp`, `GET /health`, `GET /ws`, and
backward-compatible `/env/*` routes.
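
As a sketch of how a client might target these routes, the helpers below
assemble JSON request bodies for `POST /reset` and `POST /step`. The port comes
from the spec above, but the payload field names are assumptions about the
schema, which the environment publishes at `GET /schema`:

```python
import json

BASE_URL = "http://localhost:8100"  # port from the OpenEnv spec above

def build_reset(seed=None):
    """Request tuple for POST /reset; the `seed` field name is an assumption."""
    return f"{BASE_URL}/reset", json.dumps({"seed": seed})

def build_step(candidate_id):
    """Request tuple for POST /step; the payload shape is an assumption."""
    return f"{BASE_URL}/step", json.dumps({"action": {"candidate_id": candidate_id}})
```

Any HTTP client (for example `urllib.request` or `httpx`) can POST these bodies
with a `Content-Type: application/json` header; the real field names should be
taken from the `GET /schema` response.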

The product API in [app/api/routes.py](polyguard-rl/app/api/routes.py) wraps
the environment, orchestrator, policy runtime, evaluation, evidence search,
cases, metrics, and medication-alternative tooling. Product-facing endpoints
include `/env/reset`, `/env/step_candidate`, `/agents/orchestrate`,
`/policy/infer`, `/policy/model_status`, `/eval/run_policy`,
`/metrics/training`, `/evidence/query`, and `/tools/medication_alternatives`.

![Deployment topology](polyguard-rl/docs/assets/diagrams/deployment_topology.png)

## Operations and Deployment

The repository keeps deployment and artifact operations explicit:

| Surface | Files |
| --- | --- |
| Local/container runtime | [Dockerfile](polyguard-rl/Dockerfile), [Dockerfile.space](polyguard-rl/Dockerfile.space), [docker-compose.yml](polyguard-rl/docker-compose.yml), [requirements.txt](polyguard-rl/requirements.txt), [requirements-space.txt](polyguard-rl/requirements-space.txt) |
| Product Space/API deployment | [scripts/deploy_space.sh](polyguard-rl/scripts/deploy_space.sh), [scripts/deploy_space_api.py](polyguard-rl/scripts/deploy_space_api.py), [docs/deployment.md](polyguard-rl/docs/deployment.md) |
| Training and evidence Spaces | [scripts/deploy_training_space.py](polyguard-rl/scripts/deploy_training_space.py), [scripts/monitor_training_space_status.py](polyguard-rl/scripts/monitor_training_space_status.py), [app/hf_space/training_runner.py](polyguard-rl/app/hf_space/training_runner.py), [app/hf_space/evidence_runner.py](polyguard-rl/app/hf_space/evidence_runner.py) |
| Artifact packaging and activation | [scripts/deploy_final_artifact_space.py](polyguard-rl/scripts/deploy_final_artifact_space.py), [scripts/package_active_model_bundle.py](polyguard-rl/scripts/package_active_model_bundle.py), [scripts/install_hf_active_bundle.py](polyguard-rl/scripts/install_hf_active_bundle.py), [docs/results/active_model_manifest.json](polyguard-rl/docs/results/active_model_manifest.json) |
| Submission validation | [scripts/acceptance_gate.py](polyguard-rl/scripts/acceptance_gate.py), [scripts/validate_submission_links.py](polyguard-rl/scripts/validate_submission_links.py), [docs/submission_checklist.md](polyguard-rl/docs/submission_checklist.md), [docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md) |

The important operational distinction is that local smoke artifacts, remote
training-space logs, final artifact packaging, and active-model installation
are separate stages. Final claims are tied to the curated evidence bundle, not
to whichever intermediate output directory happens to exist in a checkout.

## The Workbench UI

The UI is a React 18 + Vite + TypeScript workbench under
[app/ui/frontend/](polyguard-rl/app/ui/frontend/). It is not the environment
itself; it is an operator surface over the API and OpenEnv runtime.

[Live workbench Space](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-workbench)

![Frontend runtime surface](polyguard-rl/docs/assets/diagrams/frontend_runtime_surface.png)

The main views cover the patient workbench, episode replay, policy comparison
and policy lab, precision dosing, training monitor, safety inspector, candidate
actions, reward panel, episode trace, and alternative-medication search through
`/tools/medication_alternatives`.

The Patient Workbench shows the active model chip, current scenario, candidate
set, agent-vs-environment flow, reward breakdown, and action trace without
requiring the reader to inspect raw JSON. The UI is intentionally a workbench,
not a polished clinical application.

### UI Sequence

The five UI screenshots are checked in under `polyguard-rl/docs/UI Images/`.

1. The workbench opens with model truth, live episode context, scenario status,
   candidate count, and reward state.

![PolyGuard workbench overview](polyguard-rl/docs/UI%20Images/1.jpeg)

2. The episode panel makes the patient, task, difficulty, sub-environment, risk
   delta, and candidate-action console visible without reading raw JSON.

![Episode overview and candidate console](polyguard-rl/docs/UI%20Images/2.jpeg)

3. Candidate selection is paired with reward-channel feedback, current
   medications, and blocked/available action visibility.

![Candidate actions and reward channels](polyguard-rl/docs/UI%20Images/3.jpeg)

4. After an action, the workbench exposes history, warnings, decision payload,
   grounded facts, explanation, evidence, and event logs.

![Action history, decision payload, and evidence](polyguard-rl/docs/UI%20Images/4.jpeg)

5. The alternatives tool surfaces medication substitutions from the current
   regimen and links out to source labels.

![Medication alternatives tool](polyguard-rl/docs/UI%20Images/5.jpeg)

## Demo Videos

### [UI Walkthrough Video](https://drive.google.com/file/d/1YOzad5gvx-tSmGzJNuBgokBF4-dX2T2H/view?usp=sharing)

This walkthrough shows the deployed workbench surface, including the live model
chip, episode context, candidate actions, reward panels, and evidence-oriented
patient review flow.

### [Agent In Action: Action Button Demo](https://drive.google.com/file/d/1eHk1v0OYJRrLWVO97ZclN05MYHxmNnmc/view?usp=sharing)

This demo focuses on what the action button does: selecting a candidate,
submitting it through the environment, producing a verifier-scored transition,
and exposing the resulting reward, action history, warnings, and explanation.

### [World Model Tool: Tavily and OpenFDA Alternative Suggestions](https://drive.google.com/file/d/1GaUyyaXaBCHjhHFbpkprojNt5pLNAoYi/view?usp=sharing)

This tool demo shows the world-model support path for alternative-medication
suggestions, using Tavily and the OpenFDA government database to retrieve
candidate alternatives and side-effect evidence for safer review.

## How a Reviewer Should Read the Repository

For a fresh reviewer, the intended path is:

1. Read the artifact index:
   [polyguard-rl/docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md).
2. Inspect the final curated evidence:
   [polyguard-rl/docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md).
3. Open the one-run notebook:
   [PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb).
4. For local smoke work, follow [docs/training.md](polyguard-rl/docs/training.md)
   and the local scripts
   [scripts/run_env_local.sh](polyguard-rl/scripts/run_env_local.sh),
   [scripts/run_api_local.sh](polyguard-rl/scripts/run_api_local.sh), and
   [scripts/run_ui_local.sh](polyguard-rl/scripts/run_ui_local.sh).
5. For full training/reproduction, use the notebook or training docs rather
   than copying private artifact commands out of old drafts.
6. For final public artifacts, use the final artifact Space:
   [adithya9903/polyguard-openenv-final-artifacts](https://huggingface.co/spaces/adithya9903/polyguard-openenv-final-artifacts).

1054
+ ## Evidence and Artifact Inventory
1055
+
1056
+ Important evidence paths:
1057
+
1058
+ - Final overview:
1059
+ [docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md)
1060
+ - Artifact manifest:
1061
+ [docs/results/final_submission_evidence/manifest.json](polyguard-rl/docs/results/final_submission_evidence/manifest.json)
1062
+ - Three-model summary:
1063
+ [docs/results/final_submission_evidence/reports/submission_summary.json](polyguard-rl/docs/results/final_submission_evidence/reports/submission_summary.json)
1064
+ - Qwen 3B GRPO report:
1065
+ [docs/results/final_submission_evidence/reports/grpo_trl_run.json](polyguard-rl/docs/results/final_submission_evidence/reports/grpo_trl_run.json)
1066
+ - Post-save GRPO inference:
1067
+ [docs/results/final_submission_evidence/reports/postsave_inference_grpo.json](polyguard-rl/docs/results/final_submission_evidence/reports/postsave_inference_grpo.json)
1068
+ - Basic LLM vs PolyGuard:
1069
+ [docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json)
1070
+ - Policy ablation:
1071
+ [docs/results/final_submission_evidence/reports/policy_ablation_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/policy_ablation_report.json)
1072
+ - Action traces:
1073
+ [docs/results/final_submission_evidence/reports/action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl)
1074
+ - Curated charts:
1075
+ [docs/results/final_submission_evidence/charts/curated/README.md](polyguard-rl/docs/results/final_submission_evidence/charts/curated/README.md)

Important tests:

| Category | Tests |
| --- | --- |
| Environment contract | [tests/test_openenv_contract.py](polyguard-rl/tests/test_openenv_contract.py), [tests/test_env_reset.py](polyguard-rl/tests/test_env_reset.py), [tests/test_env_step.py](polyguard-rl/tests/test_env_step.py), [tests/test_env_step_flow.py](polyguard-rl/tests/test_env_step_flow.py), [tests/test_future_subenvs.py](polyguard-rl/tests/test_future_subenvs.py) |
| Reward and safety | [tests/test_reward_functions.py](polyguard-rl/tests/test_reward_functions.py), [tests/test_reward_range.py](polyguard-rl/tests/test_reward_range.py), [tests/test_reward_channels.py](polyguard-rl/tests/test_reward_channels.py), [tests/test_anti_cheat.py](polyguard-rl/tests/test_anti_cheat.py), [tests/test_constraints.py](polyguard-rl/tests/test_constraints.py), [tests/test_timeout_logic.py](polyguard-rl/tests/test_timeout_logic.py) |
| Policy and runtime | [tests/test_agents.py](polyguard-rl/tests/test_agents.py), [tests/test_contextual_bandit.py](polyguard-rl/tests/test_contextual_bandit.py), [tests/test_policy_schema.py](polyguard-rl/tests/test_policy_schema.py), [tests/test_provider_runtime.py](polyguard-rl/tests/test_provider_runtime.py), [tests/test_postsave_inference.py](polyguard-rl/tests/test_postsave_inference.py), [tests/test_checkpoint_integrity.py](polyguard-rl/tests/test_checkpoint_integrity.py) |
| API and product tooling | [tests/test_api.py](polyguard-rl/tests/test_api.py), [tests/test_medication_alternatives.py](polyguard-rl/tests/test_medication_alternatives.py), [tests/test_remote_env.py](polyguard-rl/tests/test_remote_env.py) |
| Data and evidence | [tests/test_parser.py](polyguard-rl/tests/test_parser.py), [tests/test_dataops_parser.py](polyguard-rl/tests/test_dataops_parser.py), [tests/test_graph_infer.py](polyguard-rl/tests/test_graph_infer.py), [tests/test_submission_evidence.py](polyguard-rl/tests/test_submission_evidence.py) |
| Submission, notebook, and HF flow | [tests/test_acceptance_gate.py](polyguard-rl/tests/test_acceptance_gate.py), [tests/test_runner_notebook.py](polyguard-rl/tests/test_runner_notebook.py), [tests/test_hf_training_sweep.py](polyguard-rl/tests/test_hf_training_sweep.py) |
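
To give a flavor of what the reward-range category enforces, here is a minimal, self-contained property-style check. The `step_reward` helper and the [-1, 1] bound are illustrative assumptions for this sketch, not PolyGuard's actual reward implementation:

```python
def clamp(value, low=-1.0, high=1.0):
    """Clamp a raw reward into the declared range."""
    return max(low, min(high, value))

def step_reward(safety_score, constraint_penalty):
    """Illustrative composite reward: safety gain minus constraint penalty."""
    return clamp(safety_score - constraint_penalty)

# Property-style check: the reward stays in range over a grid of inputs,
# including extreme values that would otherwise escape the bound.
for s in (-5.0, -1.0, 0.0, 0.5, 1.0, 5.0):
    for p in (0.0, 0.3, 2.0):
        r = step_reward(s, p)
        assert -1.0 <= r <= 1.0, (s, p, r)
```

Real range tests in the suite presumably sweep the actual reward channels the same way; the point is that the bound is asserted, not assumed.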

Additional architecture diagrams:

- [System architecture](polyguard-rl/docs/assets/diagrams/system_architecture.png)
- [Runtime step flow](polyguard-rl/docs/assets/diagrams/runtime_step_flow.png)
- [Data and training pipeline](polyguard-rl/docs/assets/diagrams/data_training_pipeline.png)
- [Multi-agent orchestration](polyguard-rl/docs/assets/diagrams/multi_agent_orchestration.png)
- [Reward decomposition](polyguard-rl/docs/assets/diagrams/reward_decomposition.png)
- [Episode state machine](polyguard-rl/docs/assets/diagrams/episode_state_machine.png)
- [Evidence generation flow](polyguard-rl/docs/assets/diagrams/evidence_generation_flow.png)
- [Deployment topology](polyguard-rl/docs/assets/diagrams/deployment_topology.png)
- [Frontend runtime surface](polyguard-rl/docs/assets/diagrams/frontend_runtime_surface.png)

## Limitations

PolyGuard is a simulator and research environment. Its current data substrate
is compact and intentionally inspectable, not a production clinical knowledge
base. The final evidence set is strongest for Qwen 3B because that run has
complete SFT, GRPO, post-save GRPO, policy-ablation, adapter, and checkpoint
evidence. Qwen 0.5B and 1.5B have SFT reports/histories and post-save SFT
evidence, but their adapter directories are marked `reports_only_or_partial` in
the final manifest.

The reward model is hand-designed and auditable. That is a feature for this
OpenEnv setting, but it also means reward-channel design should be
stress-tested as the data grows. The current ablations show that contextual
bandits are useful and inspectable, while the `llm+bandit` combined stack needs
more tuning to avoid loop-like failure behavior in some settings.
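
For readers unfamiliar with the technique, a contextual bandit in this setting keeps a per-context value estimate for each action and trades exploration off against exploitation. The sketch below is a generic epsilon-greedy implementation under assumed names (`EpsilonGreedyBandit`, string contexts and arms); it illustrates why such policies are inspectable, and it is not PolyGuard's actual bandit code:

```python
import random

class EpsilonGreedyBandit:
    """Minimal contextual bandit: one running-mean value per (context, arm)."""

    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.values = {}   # (context, arm) -> running mean reward
        self.counts = {}   # (context, arm) -> number of pulls

    def select(self, context, rng=random):
        # Explore with probability epsilon, otherwise pick the greedy arm.
        if rng.random() < self.epsilon:
            return rng.choice(self.arms)
        return max(self.arms, key=lambda a: self.values.get((context, a), 0.0))

    def update(self, context, arm, reward):
        # Incremental running-mean update; every estimate stays inspectable.
        key = (context, arm)
        n = self.counts.get(key, 0) + 1
        self.counts[key] = n
        mean = self.values.get(key, 0.0)
        self.values[key] = mean + (reward - mean) / n

# Illustrative usage with hypothetical context and arm names.
bandit = EpsilonGreedyBandit(["keep_regimen", "flag_interaction"], epsilon=0.0)
bandit.update("elderly+warfarin", "flag_interaction", 1.0)
print(bandit.select("elderly+warfarin"))  # flag_interaction
```

Because the policy is just a table of `(context, arm)` means, a failure can be traced to a specific estimate, which is exactly the kind of auditability the ablations favor.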

The right conclusion is not "this is a clinical decision system." The right
conclusion is that constrained environment feedback, verifier-backed rewards,
agentic evaluation, and explicit failure mining are a better substrate for
safety-critical medication-policy learning than free-form prompt responses.
1120
+
1121
+ ## References
1122
+
1123
+ - Alexandre Larouche, Audrey Durand, Richard Khoury, Caroline Sirois.
1124
+ [Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy](https://arxiv.org/abs/2212.05190).
1125
+ arXiv:2212.05190.
1126
+ - World Health Organization.
1127
+ [Medication Without Harm](https://www.who.int/initiatives/medication-without-harm).
1128
+ - CDC.
1129
+ [FastStats: Medication Safety Data](https://www.cdc.gov/medication-safety/data-research/facts-stats/index.html).
1130
+ - Shehab N, Lovegrove MC, Geller AI, Rose KO, Weidle NJ, Budnitz DS.
1131
+ [US Emergency Department Visits for Outpatient Adverse Drug Events, 2013-2014](https://jamanetwork.com/journals/jama/fullarticle/2585977).
1132
+ JAMA. 2016;316(20):2115-2125.
1133
+ - AHRQ / NCBI Bookshelf.
1134
+ [Deprescribing To Reduce Medication Harms in Older Adults](https://www.ncbi.nlm.nih.gov/books/NBK600387/).
1135
+ - American Geriatrics Society.
1136
+ [2023 updated AGS Beers Criteria for potentially inappropriate medication use in older adults](https://pmc.ncbi.nlm.nih.gov/articles/PMC12478568/).
1137
+ - O'Mahony et al.
1138
+ [STOPP/START criteria for potentially inappropriate prescribing in older people: version 3](https://pmc.ncbi.nlm.nih.gov/articles/PMC10447584/).
1139
+

## License

The project package declares an MIT license in
[polyguard-rl/pyproject.toml](polyguard-rl/pyproject.toml). See
[polyguard-rl/LICENSE](polyguard-rl/LICENSE) for the license text.