| # PolyGuard OpenEnv: Training Medication-Safety Agents Inside a Verifier-Backed World |
|
|
| Patients do not experience an unsafe medication regimen as "polypharmacy." |
| They experience it as dizziness after a new sleep medication, bleeding after a |
| painkiller is added to a blood thinner, confusion from a sedative-opioid |
| combination, or a preventable emergency visit because five prescribers each saw |
| one slice of the medication list. |
|
|
| The dangerous part is often not a single drug. It is the combination: the wrong |
| pair, the wrong dose in the wrong organ-function context, the missing lab, the |
| duplicated class, the abrupt stop that should have been a taper, or the model |
| that confidently says "looks fine" because it was never forced to act inside a |
| safety-checked environment. |
|
|
| That is the problem PolyGuard was built for. |
|
|
| The [CDC medication-safety data page](https://www.cdc.gov/medication-safety/data-research/facts-stats/index.html) |
| reports that adverse drug events send more than 1.5 million people to emergency |
| departments in the United States every year, with almost 500,000 |
| hospitalizations. Adults 65 and older account for more than 600,000 of those |
| visits. A CDC-authored [JAMA surveillance study](https://jamanetwork.com/journals/jama/fullarticle/2585977) |
| found that older adults made up 34.5 percent of outpatient adverse-drug-event |
| ED visits and had the highest hospitalization rate, 43.6 percent. Globally, the |
| [WHO Medication Without Harm challenge](https://www.who.int/initiatives/medication-without-harm) |
| estimates the cost associated with medication errors at USD 42 billion |
| annually. AHRQ's deprescribing safety review summarizes estimates that |
| [45 percent of older adults are exposed to polypharmacy and 58 percent to |
| potentially inappropriate medications](https://www.ncbi.nlm.nih.gov/books/NBK600387/). |
|
|
| Not every adverse drug event is caused by an incorrect drug combination. But |
| these numbers describe the harm surface PolyGuard targets: medication decisions |
| where combination risk, monitoring gaps, frailty, organ function, uncertainty, |
| and action sequencing all matter at once. |
|
|
| PolyGuard turns that problem into an OpenEnv-compatible reinforcement-learning |
| environment for polypharmacy safety, medication optimization, deprescribing, |
| safe substitution, missing-evidence recovery, and precision dosing. A language |
| model policy observes a constrained patient/regimen state, chooses one legal |
| candidate action, receives verifier-backed reward, and improves through SFT |
| plus GRPO-style post-training. |
|
|
| It is not medical software and it is not clinical advice. It is a controlled |
| research environment for studying how language-model policies can be trained, |
| audited, and stress-tested on safety-critical medication action selection. |
|
|
| ## What To Open First |
|
|
| - GitHub repository: |
| [Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK](https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK) |
| - Live product Space: |
| [TheJackBright/polyguard-openenv-workbench](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-workbench) |
| - One-run Colab/HF notebook: |
| [PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb) |
| - Final evidence index: |
| [polyguard-rl/docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md) |
| - Artifact and traceability guide: |
| [polyguard-rl/docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md) |
| - Final artifact/evidence Space: |
| [adithya9903/polyguard-openenv-final-artifacts](https://huggingface.co/spaces/adithya9903/polyguard-openenv-final-artifacts) |
|
|
| The final artifact/evidence Space hosts the Qwen 3B artifact bundle. The Qwen |
| 0.5B and 1.5B runs were trained using a second Hugging Face account, so their |
| model artifacts could not be hosted in the same final Space. Their report |
| mirrors are checked into this repo: |
| [0.5B reports](polyguard-rl/docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-0-5b-instruct) |
| and |
| [1.5B reports](polyguard-rl/docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-1-5b-instruct). |
|
|
| ## The Research Bet |
|
|
| Medication safety is combinatorial, partially observable, and high stakes. A |
| useful policy has to do more than generate a plausible answer. It has to notice |
| drug-drug interaction risk, reason about comorbidities and organ function, |
| respect taper and monitoring requirements, choose safe substitutions, abstain |
| or ask for review when uncertainty is high, and expose why it acted. |
|
|
| The machine-learning pressure is just as real. If a medication vocabulary has |
| 500 drugs, the number of possible five-drug combinations is: |
|
|
| ```text |
| C(500, 5) = 255,244,687,600 |
| ``` |
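The count can be checked with Python's standard library, since `math.comb` computes binomial coefficients exactly. This is only a sanity check on the arithmetic above, not code from the repository.

```python
import math

# Number of distinct five-drug combinations drawn from a 500-drug vocabulary.
n_combinations = math.comb(500, 5)

print(f"{n_combinations:,}")  # 255,244,687,600
```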
|
|
| Exhaustive search is not a serious option. The paper that inspired this |
| project, [Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy](https://arxiv.org/abs/2212.05190), |
| frames dangerous polypharmacy discovery as a bandit search problem over a huge |
| combination space. It benchmarks neural bandit search over simulated |
| polypharmacy datasets with 500 drugs and 100,000 distinct combinations, and |
| reports detection of up to 72 percent of potentially inappropriate |
| polypharmacies with 99 percent average precision after 30,000 time steps. |
|
|
| PolyGuard borrows that search instinct, then moves the problem from offline |
| combination mining into an agentic environment. The policy sees a patient |
| state, chooses among legal clinical action candidates, and is judged by a |
| deterministic verifier and reward router rather than by free-form text |
| preference alone. |
|
|
| The research question is narrow and concrete: |
|
|
| Can environment-backed feedback make a small open model better at safe |
| medication action selection than prompt-only, first-legal, rule-only, or |
| single-agent baselines? |
|
|
| The answer in this repository is an inspectable system: |
|
|
| 1. A finite-horizon OpenEnv simulation for medication decisions. |
| 2. A constrained action space, so the model chooses candidate actions instead |
| of inventing arbitrary clinical instructions. |
| 3. A legality verifier that prevents unsafe state mutation. |
| 4. Thirteen reward components rolled into four primary reward channels. |
| 5. A multi-agent policy stack with supervisor routing, contextual bandit |
| reranking, planner selection, critic veto, and explanation logging. |
| 6. SFT for format and clinical-prior warm start. |
| 7. GRPO with environment-backed reward, not an opaque LLM judge. |
| 8. Agentic evaluation with baseline comparison, policy ablations, post-save |
| inference, robustness checks, action traces, and failure mining. |
|
|
|  |
|
|
| ## A Failure Trace That Motivated the Design |
|
|
| In the final matched-seed traces, the failure mode is not abstract. On seeds |
| `8000` and `8004`, the basic prompt-style proxy repeatedly chose `cand_01`, |
| the first legal candidate. In those cases, `cand_01` meant `KEEP_REGIMEN` while |
| a hidden `warfarin_like` + `nsaid_like` interaction remained unresolved. The |
| verifier recorded `holdout_ddi_not_addressed`. |
|
|
| The full PolyGuard pipeline selected `cand_03`, a safer intervention candidate, |
| and avoided those failure reasons. |
|
|
| That is the core argument of the project: medication AI should be judged inside |
| a stateful safety environment, not only by whether its answer sounds clinically |
| plausible. |
|
|
| Internal evidence: |
| [basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json) |
| and |
| [action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl). |
|
|
| ## Safety Contract |
|
|
| PolyGuard does not let a model directly mutate a medication list from free |
| text. Every decision is candidate-based, verifier-checked, reward-decomposed, |
| and traced. Illegal actions can be scored, penalized, and logged, but they do |
| not change patient state. |
|
|
| The repo evidence for this contract is spread across the environment, rules, |
| and final reports: |
|
|
| | Claim | Repo evidence | |
| | --- | --- | |
| | Hard contraindication examples are represented | [app/knowledge/ddi_knowledge.py](polyguard-rl/app/knowledge/ddi_knowledge.py) | |
| | Safer alternatives are explicit | [app/knowledge/substitution_rules.py](polyguard-rl/app/knowledge/substitution_rules.py) | |
| | Unsafe substitutions and dose escalations are blocked before state mutation | [app/env/verifier.py](polyguard-rl/app/env/verifier.py) | |
| | Reward hacking and loop-like behavior are surfaced | [app/env/anti_cheat.py](polyguard-rl/app/env/anti_cheat.py), [docs/reward_design.md](polyguard-rl/docs/reward_design.md) | |
| | Baseline failure is traceable by seed and candidate | [basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json), [action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl) | |
| | Final claims are separated from older smoke artifacts | [final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md) | |
|
|
| ## Project Map |
|
|
| The implementation lives under [polyguard-rl/](polyguard-rl/). |
|
|
| | Area | Key paths | |
| | --- | --- | |
| | OpenEnv runtime | [openenv.yaml](polyguard-rl/openenv.yaml), [app/env/env_core.py](polyguard-rl/app/env/env_core.py), [app/env/fastapi_app.py](polyguard-rl/app/env/fastapi_app.py), [server/app.py](polyguard-rl/server/app.py) | |
| | Action/state contracts | [app/common/types.py](polyguard-rl/app/common/types.py), [app/common/enums.py](polyguard-rl/app/common/enums.py) | |
| | Candidate generation and verifier | [app/models/policy/candidate_builder.py](polyguard-rl/app/models/policy/candidate_builder.py), [app/env/verifier.py](polyguard-rl/app/env/verifier.py) | |
| | Reward and anti-cheat | [app/env/reward_router.py](polyguard-rl/app/env/reward_router.py), [app/env/reward_scaling.py](polyguard-rl/app/env/reward_scaling.py), [app/env/anti_cheat.py](polyguard-rl/app/env/anti_cheat.py), [configs/rewards.yaml](polyguard-rl/configs/rewards.yaml) | |
| | Multi-agent policy | [app/agents/](polyguard-rl/app/agents/), [docs/agents.md](polyguard-rl/docs/agents.md) | |
| | Bandits and baselines | [app/models/baselines/contextual_bandit.py](polyguard-rl/app/models/baselines/contextual_bandit.py), [app/models/baselines/contextual_bandit_policy.py](polyguard-rl/app/models/baselines/contextual_bandit_policy.py), [app/models/baselines/](polyguard-rl/app/models/baselines/) | |
| | Training | [app/training/](polyguard-rl/app/training/), [scripts/train_sft_trl.py](polyguard-rl/scripts/train_sft_trl.py), [scripts/train_grpo_trl.py](polyguard-rl/scripts/train_grpo_trl.py), [docs/training.md](polyguard-rl/docs/training.md) | |
| | Data | [data/raw/knowledge/drug_knowledge.json](polyguard-rl/data/raw/knowledge/drug_knowledge.json), [data/scenarios/](polyguard-rl/data/scenarios/), [docs/datasets.md](polyguard-rl/docs/datasets.md) | |
| | Evaluation | [app/evaluation/](polyguard-rl/app/evaluation/), [scripts/evaluate_all.py](polyguard-rl/scripts/evaluate_all.py), [docs/evaluation.md](polyguard-rl/docs/evaluation.md) | |
| | Product API/UI | [app/api/](polyguard-rl/app/api/), [app/ui/frontend/](polyguard-rl/app/ui/frontend/), [docs/ui.md](polyguard-rl/docs/ui.md) | |
| | Math | [docs/math.md](polyguard-rl/docs/math.md), [docs/mathematics.md](polyguard-rl/docs/mathematics.md) | |
| | Results | [docs/results/final_submission_evidence/](polyguard-rl/docs/results/final_submission_evidence/) | |
|
|
| Supporting docs include [architecture.md](polyguard-rl/docs/architecture.md), |
| [environment_design.md](polyguard-rl/docs/environment_design.md), |
| [reward_design.md](polyguard-rl/docs/reward_design.md), |
| [safety.md](polyguard-rl/docs/safety.md), |
| [precision_dosing.md](polyguard-rl/docs/precision_dosing.md), |
| [graph_models.md](polyguard-rl/docs/graph_models.md), |
| [ablations.md](polyguard-rl/docs/ablations.md), |
| [api.md](polyguard-rl/docs/api.md), |
| [deployment.md](polyguard-rl/docs/deployment.md), |
| [ui.md](polyguard-rl/docs/ui.md), |
| [DEMO_RECORDING_SCRIPT.md](polyguard-rl/docs/DEMO_RECORDING_SCRIPT.md), and |
| [submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md). |
|
|
| ## The OpenEnv Environment |
|
|
| At the center is `PolyGuardEnv`, implemented in |
| [app/env/env_core.py](polyguard-rl/app/env/env_core.py). It follows the |
| OpenEnv/Gym shape: |
|
|
| ```text |
| reset(seed, difficulty, sub_environment, scenario_id, patient_id) |
| -> PolyGuardObservation |
| |
| step(PolyGuardAction) |
| -> (PolyGuardObservation, reward, done, info) |
| ``` |
|
|
| At reset, the environment loads or generates a patient scenario, selects a |
| difficulty and sub-environment, computes a risk summary, builds candidate |
| actions, estimates uncertainty, and emits a strict observation. At step time, |
| the environment parses the action, checks legality, evaluates anti-cheat rules, |
| mutates state only if the action is safe, computes decomposed reward, appends a |
| trace, and returns detailed `info` fields such as failure reasons, transition |
| delta, primary reward channels, invalid-action count, and timeout checks. |
|
|
|  |
|
|
| PolyGuard is not one task. It cycles through specialized sub-environments: |
|
|
| | Sub-environment | What it stresses | |
| | --- | --- | |
| | `DDI` | High-risk drug-drug interaction recognition and resolution | |
| | `BANDIT_MINING` | Candidate exploration and shortlist/ranking behavior inspired by bandit search | |
| | `REGIMEN_RISK` | General medication burden and regimen optimization | |
| | `PRECISION_DOSING` | Dose-hold, dose reduction, renal/hepatic guardrails, monitoring decisions | |
| | `LONGITUDINAL_DEPRESCRIBING` | Multi-step taper/deprescribing behavior over a longer horizon | |
| | `WEB_SEARCH_MISSING_DATA` | Evidence fetch or review when critical data is missing | |
| | `ALTERNATIVE_SUGGESTION` | Safe alternatives and within-class substitution | |
| | `NEW_DRUG_DECOMPOSITION` | First-pass reasoning over an unknown or combination medication | |
|
|
| The curriculum in [app/env/curriculum.py](polyguard-rl/app/env/curriculum.py) |
| starts with short easy DDI/regimen-risk episodes, then adds bandit and |
| alternative-selection tasks, and finally hard cases with precision dosing, |
| longitudinal deprescribing, missing data, and new-drug decomposition. |
|
|
| ### State and Observation |
|
|
| The latent state is represented by `PolyGuardState` and includes patient |
| demographics, active decision mode, step budget, medications, dose buckets, |
| comorbidities, labs, vitals, frailty, adherence, monitoring gaps, prior adverse |
| event history, burden score, severe-pair count, precision dosing flags, |
| unresolved conflicts, action history, cumulative reward, and done state. |
|
|
| The agent does not get every simulator internal. It receives a controlled |
| `PolyGuardObservation` with a patient summary, medication table, comorbidity |
| summary, organ function, labs/vitals, graph safety summary, burden summary, |
| precision dosing flags, unresolved conflicts, candidate actions, step budget, |
| action history, warnings, abstention indicators, seed, scenario, difficulty, |
| and sub-environment. |
|
|
| This split matters. PolyGuard is a partially observable environment. Missing |
| labs and unresolved conflicts are visible as uncertainty signals, not as hidden |
| reward traps. |
|
|
| ## Action Space and Safety Constraints |
|
|
| PolyGuard deliberately avoids unconstrained text actions. The policy chooses a |
| strict `PolyGuardAction` with fields such as `mode`, `action_type`, |
| `target_drug`, `replacement_drug`, `dose_bucket`, `taper_days`, |
| `monitoring_plan`, `evidence_query`, `new_drug_name`, `candidate_components`, |
| `candidate_id`, `confidence`, and `rationale_brief`. |
|
|
| The action types are compact: |
|
|
| | Family | Action types | |
| | --- | --- | |
| | Regimen | `KEEP_REGIMEN`, `STOP_DRUG`, `SUBSTITUTE_WITHIN_CLASS`, `RECOMMEND_ALTERNATIVE` | |
| | Dosing | `REDUCE_DOSE_BUCKET`, `INCREASE_DOSE_BUCKET`, `DOSE_HOLD`, `ORDER_MONITORING_AND_WAIT` | |
| | Deprescribing | `TAPER_INITIATE`, `TAPER_CONTINUE` | |
| | Evidence and uncertainty | `FETCH_EXTERNAL_EVIDENCE`, `DECOMPOSE_NEW_DRUG`, `REQUEST_SPECIALIST_REVIEW`, `REQUEST_PHARMACIST_REVIEW` | |
|
|
| The candidate builder in |
| [app/models/policy/candidate_builder.py](polyguard-rl/app/models/policy/candidate_builder.py) |
| generates a bounded candidate set: |
|
|
| ```text |
| 3 <= |C_t| <= 10 |
| ``` |
|
|
| Each candidate carries estimated safety delta, burden delta, disease stability, |
| uncertainty score, rationale tags, required monitoring, and a legality |
| precheck. Policy selection is candidate selection: |
|
|
| ```text |
| a_t = to_action(c_t), where c_t is in C_t |
| ``` |
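The selection rule above can be sketched in a few lines. This is a minimal illustration, not the repository's `candidate_builder.py`: the `Candidate` dataclass, the `select_action` helper, and the `REGIMEN_OPT` mode label are hypothetical, and only a subset of the documented candidate and action fields is modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate record; field names follow the prose above."""
    candidate_id: str
    mode: str
    action_type: str
    target_drug: Optional[str]
    estimated_safety_delta: float
    legality_precheck: bool


def select_action(candidates: List[Candidate], candidate_id: str,
                  confidence: float, rationale_brief: str) -> Dict[str, object]:
    """a_t = to_action(c_t): the policy may only pick an id that is in C_t."""
    by_id = {c.candidate_id: c for c in candidates}
    if candidate_id not in by_id:
        raise ValueError(f"{candidate_id!r} is not in the candidate set")
    c = by_id[candidate_id]
    return {
        "candidate_id": c.candidate_id,
        "mode": c.mode,
        "action_type": c.action_type,
        "target_drug": c.target_drug,
        "confidence": confidence,
        "rationale_brief": rationale_brief,
    }


candidates = [
    Candidate("cand_01", "REGIMEN_OPT", "KEEP_REGIMEN", None, 0.0, True),
    Candidate("cand_02", "REVIEW", "REQUEST_PHARMACIST_REVIEW", None, 0.1, True),
    Candidate("cand_03", "REGIMEN_OPT", "RECOMMEND_ALTERNATIVE", "nsaid_like", 0.4, True),
]
action = select_action(candidates, "cand_03", 0.8, "resolve warfarin/NSAID pair")
```

Because the action is built from the chosen candidate rather than from model text, an id outside the candidate set fails loudly instead of producing an arbitrary instruction.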
|
|
| The verifier in [app/env/verifier.py](polyguard-rl/app/env/verifier.py) |
| enforces hard safety constraints before state mutation. It checks that target |
| drugs exist when required, substitutions and alternatives are allowed, evidence |
| domains are allowlisted, new-drug decomposition includes required components, |
| taper-required drugs are not stopped abruptly, renal/hepatic unsafe dose |
| escalation is blocked, duplicate therapy and contraindicated replacement pairs |
| are blocked, and monitoring/hold actions include a monitoring plan. |
|
|
| Illegal actions can receive reward penalties and become visible in traces, but |
| they do not mutate patient state. |
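A few of these checks can be sketched as a pure function that returns a legality flag plus failure reasons. This is an illustrative sketch only: the `TAPER_REQUIRED` set, the `NEEDS_TARGET` set, and the reason strings are assumptions, and the real rules live in `app/env/verifier.py`.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rule tables for illustration; the repository defines its own.
TAPER_REQUIRED: Set[str] = {"benzodiazepine_like", "opioid_like"}
MONITORING_ACTIONS: Set[str] = {"DOSE_HOLD", "ORDER_MONITORING_AND_WAIT"}
NEEDS_TARGET: Set[str] = {
    "STOP_DRUG", "REDUCE_DOSE_BUCKET", "INCREASE_DOSE_BUCKET",
    "DOSE_HOLD", "TAPER_INITIATE", "TAPER_CONTINUE",
}


def check_legality(action: Dict[str, object], regimen: Set[str]) -> Tuple[bool, List[str]]:
    """Return (legal, reasons). An empty reason list means the action is legal."""
    reasons: List[str] = []
    action_type = action["action_type"]
    target = action.get("target_drug")

    # Target drug must exist in the regimen when the action requires one.
    if action_type in NEEDS_TARGET and target not in regimen:
        reasons.append("target_drug_missing")
    # Taper-required drugs must not be stopped abruptly.
    if action_type == "STOP_DRUG" and target in TAPER_REQUIRED:
        reasons.append("abrupt_stop_of_taper_required_drug")
    # Monitoring and hold actions need a monitoring plan.
    if action_type in MONITORING_ACTIONS and not action.get("monitoring_plan"):
        reasons.append("monitoring_plan_missing")

    return (not reasons, reasons)


regimen = {"warfarin_like", "opioid_like"}
abrupt_stop = check_legality({"action_type": "STOP_DRUG", "target_drug": "opioid_like"}, regimen)
keep = check_legality({"action_type": "KEEP_REGIMEN"}, regimen)
```

Returning reasons alongside the flag is what lets a failed action show up in traces without touching patient state.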
|
|
| ## Multi-Agent Policy Stack |
|
|
| The "agents" in PolyGuard are an auditable policy factorization rather than |
| free-form independent chatbots. A step flows through: |
|
|
| ```text |
| MedRec -> Evidence -> GraphSafety -> Dosing -> Candidate |
| -> Supervisor -> Planner -> Critic -> Env -> Explainer |
| ``` |
|
|
|  |
|
|
| | Agent/module | Role | |
| | --- | --- | |
| | `MedRecAgent` | Summarizes current regimen and medication burden | |
| | `EvidenceAgent` | Retrieves local or fallback evidence when missing data is present | |
| | `GraphSafetyAgent` | Scores risky pairs, side-effect load, duplicate therapy, and graph safety patterns | |
| | `DosingAgent` | Detects dose-sensitive cases and dose-hold opportunities | |
| | `CandidateAgent` | Exposes legal candidate actions from the environment candidate builder | |
| | `SupervisorAgent` | Routes to regimen optimization, dose optimization, or review mode | |
| | `PlannerAgent` | Selects an action from candidates through the policy provider | |
| | `CriticAgent` | Vetoes illegal or unsafe proposed actions and can force review fallback | |
| | `ExplainerAgent` | Records grounded rationale for demo, replay, and audit | |
|
|
| The orchestration modes are `sequential_pipeline`, `supervisor_routed`, |
| `replan_on_veto`, and `lightweight_debate`. Policy-stack ablations compare |
| `bandit-only`, `llm-only`, and `llm+bandit`. |
|
|
| ## Contextual Bandits |
|
|
| PolyGuard uses contextual bandits as an inspectable candidate-reranking layer. |
| This is where the project most directly echoes the arXiv bandit inspiration: |
| unsafe polypharmacy search is combinatorial, so the system should learn which |
| regions of the candidate/action space are worth exploring rather than enumerate |
| everything. |
|
|
| Each candidate becomes an 8-dimensional feature vector: |
|
|
| ```text |
| x(c) = [ |
| 1, |
| I[legality_precheck], |
| estimated_safety_delta, |
| burden_delta, |
| disease_stability_estimate, |
| 1 - uncertainty_score, |
| I[mode = DOSE_OPT], |
| I[mode = REVIEW] |
| ] |
| ``` |
|
|
| An arm is keyed by macro mode and action type: |
|
|
| ```text |
| arm(c) = mode(c) || ":" || action_type(c) |
| ``` |
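The feature map and arm key can be written out directly. The sketch below assumes a candidate is a plain dictionary keyed by the names used above; the `REGIMEN_OPT` mode label in the example is a hypothetical placeholder for the non-dosing, non-review mode.

```python
from typing import Dict, List


def featurize(c: Dict[str, object]) -> List[float]:
    """Build the 8-dimensional context vector x(c) described above."""
    return [
        1.0,                                        # bias term
        1.0 if c["legality_precheck"] else 0.0,     # I[legality_precheck]
        float(c["estimated_safety_delta"]),
        float(c["burden_delta"]),
        float(c["disease_stability_estimate"]),
        1.0 - float(c["uncertainty_score"]),
        1.0 if c["mode"] == "DOSE_OPT" else 0.0,    # I[mode = DOSE_OPT]
        1.0 if c["mode"] == "REVIEW" else 0.0,      # I[mode = REVIEW]
    ]


def arm_key(c: Dict[str, object]) -> str:
    """Arms are keyed by macro mode and action type."""
    return f'{c["mode"]}:{c["action_type"]}'


example = {
    "mode": "REGIMEN_OPT",
    "action_type": "STOP_DRUG",
    "legality_precheck": True,
    "estimated_safety_delta": 0.5,
    "burden_delta": 0.25,
    "disease_stability_estimate": 0.75,
    "uncertainty_score": 0.25,
}
x = featurize(example)
```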
|
|
| The LinUCB variant maintains, for each arm `a`: |
|
|
| ```text |
| A_a = I + sum x x^T |
| b_a = sum r x |
| theta_a = A_a^{-1} b_a |
| |
| score_a(x) = theta_a^T x + alpha * sqrt(x^T A_a^{-1} x) |
| ``` |
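A dependency-free sketch of one LinUCB arm follows. It keeps `A_a^{-1}` directly and updates it with the Sherman-Morrison identity instead of re-inverting `A_a`; that update strategy and the `LinUCBArm` class name are choices made for this example, not a description of `contextual_bandit.py`.

```python
import math
from typing import List


class LinUCBArm:
    """One arm: A = I + sum x x^T, b = sum r x, theta = A^{-1} b."""

    def __init__(self, dim: int, alpha: float = 1.0) -> None:
        self.alpha = alpha
        # A starts as the identity, so its inverse is the identity as well.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, v: List[float]) -> List[float]:
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.A_inv]

    def score(self, x: List[float]) -> float:
        theta = self._a_inv_times(self.b)
        a_inv_x = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        bonus = self.alpha * math.sqrt(sum(xi * v for xi, v in zip(x, a_inv_x)))
        return mean + bonus

    def update(self, x: List[float], reward: float) -> None:
        # Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - (A^{-1}x)(A^{-1}x)^T / (1 + x^T A^{-1} x)
        # (valid here because A^{-1} is symmetric).
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.A_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


arm = LinUCBArm(dim=2, alpha=1.0)
score_before = arm.score([1.0, 0.0])  # no data yet: pure exploration bonus
arm.update([1.0, 0.0], reward=1.0)
score_after = arm.score([1.0, 0.0])   # mean estimate rises, bonus shrinks
```

After one update the exploration bonus for the observed direction drops from 1 to `sqrt(0.5)`, which is the behavior that steers the reranker toward less-explored arms.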
|
|
| There is also a Thompson-style variant: |
|
|
| ```text |
| score_a(x) = theta_a^T x + Normal(0, alpha) |
| ``` |
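The Thompson-style score can be sketched as the same linear mean plus Gaussian noise. The formula above does not say whether `alpha` is a variance or a standard deviation; this sketch assumes a standard deviation, and the function name is hypothetical. Note that this perturbs the point estimate rather than sampling `theta_a` from a full posterior.

```python
import random
from typing import List


def thompson_style_score(theta: List[float], x: List[float],
                         alpha: float, rng: random.Random) -> float:
    """theta^T x plus zero-mean Gaussian noise (alpha assumed to be the std)."""
    mean = sum(t * xi for t, xi in zip(theta, x))
    return mean + rng.gauss(0.0, alpha)


theta = [0.5, 0.25]
x = [1.0, 1.0]
# A seeded generator keeps exploration reproducible across matched-seed runs.
sample_a = thompson_style_score(theta, x, alpha=0.1, rng=random.Random(7))
sample_b = thompson_style_score(theta, x, alpha=0.1, rng=random.Random(7))
no_noise = thompson_style_score(theta, x, alpha=0.0, rng=random.Random(7))
```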
|
|
| This layer can shortlist candidates before the planner emits the final action. |
| It is deliberately kept inside the candidate space: the bandit can improve |
| ordering and exploration, but it cannot invent an unsafe action outside the |
| environment contract. |
|
|
| ## Reward Model |
|
|
| The reward model is decomposed on purpose. A single scalar reward is needed for |
| RL, but safety-critical RL needs more than one opaque number. PolyGuard logs 13 |
| component columns and four primary channels on every step. |
|
|
|  |
|
|
| All reward values are clamped and quantized: |
|
|
| ```text |
| q(x) = round(clip(x, 0.001, 0.999), 3) |
| ``` |
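As a minimal sketch, the clamp-and-quantize step is a one-liner in Python. The function name `q` mirrors the notation above; the rounding behavior is that of Python's built-in `round`, which may differ in edge cases from whatever the repository uses.

```python
def q(x: float) -> float:
    """Clamp to [0.001, 0.999], then round to three decimals."""
    return round(min(max(x, 0.001), 0.999), 3)


print(q(-5.0))     # 0.001
print(q(2.0))      # 0.999
print(q(0.12345))  # 0.123
```

Keeping rewards strictly inside the open interval (0, 1) means no component ever reports an exact 0 or 1.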
|
|
| The 13 reward components are: |
|
|
| | Component | Weight | Meaning | |
| | --- | ---: | --- | |
| | `format_compliance_score` | 0.08 | Action payload conforms to the schema | |
| | `candidate_alignment_score` | 0.08 | The model selected a valid candidate-style id | |
| | `legality_score` | 0.12 | The verifier accepted the action | |
| | `safety_delta_score` | 0.15 | Severe-pair and burden risk decreased | |
| | `burden_improvement_score` | 0.08 | Dose-weighted medication burden improved | |
| | `disease_stability_score` | 0.10 | The action did not destabilize underlying disease management | |
| | `dosing_quality_score` | 0.08 | Dose-sensitive routing/action quality | |
| | `abstention_quality_score` | 0.06 | Review/abstention is appropriate under uncertainty | |
| | `efficiency_score` | 0.06 | The action uses the finite step budget well | |
| | `process_fidelity_score` | 0.06 | The action follows task-specific process expectations | |
| | `explanation_grounding_score` | 0.03 | The rationale is present and grounded | |
| | `anti_cheat_score` | 0.06 | Reward-hacking checks did not fire | |
| | `uncertainty_calibration_score` | 0.04 | Confidence matches observable uncertainty | |
|
|
| The scalar reward is a weighted average: |
|
|
| ```text |
| R_env(s_t, a_t, s_{t+1}) = q( sum_i w_i c_i / sum_i w_i ) |
| ``` |
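The weighted average can be reproduced from the table above. The sketch below copies the documented weights into a dictionary; the `scalar_reward` function name is an assumption, and the real routing lives in `app/env/reward_router.py`.

```python
from typing import Dict

# Weights copied from the component table above.
WEIGHTS: Dict[str, float] = {
    "format_compliance_score": 0.08,
    "candidate_alignment_score": 0.08,
    "legality_score": 0.12,
    "safety_delta_score": 0.15,
    "burden_improvement_score": 0.08,
    "disease_stability_score": 0.10,
    "dosing_quality_score": 0.08,
    "abstention_quality_score": 0.06,
    "efficiency_score": 0.06,
    "process_fidelity_score": 0.06,
    "explanation_grounding_score": 0.03,
    "anti_cheat_score": 0.06,
    "uncertainty_calibration_score": 0.04,
}


def q(x: float) -> float:
    return round(min(max(x, 0.001), 0.999), 3)


def scalar_reward(components: Dict[str, float]) -> float:
    """R_env = q(sum_i w_i c_i / sum_i w_i)."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return q(weighted / total_weight)


uniform_half = scalar_reward({name: 0.5 for name in WEIGHTS})
all_max = scalar_reward({name: 1.0 for name in WEIGHTS})
```

Dividing by the weight sum keeps the result a true average even if the weights are later retuned and no longer add up to 1.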
|
|
| Safety-heavy terms carry just over half of the total weight: |
|
|
| ```text |
| legality + safety_delta + burden + disease_stability + anti_cheat |
| = 0.12 + 0.15 + 0.08 + 0.10 + 0.06 |
| = 0.51 |
| ``` |
|
|
| The four primary reward channels are: |
|
|
| | Channel | Component family | |
| | --- | --- | |
| | `safety_legality` | legality, candidate alignment, anti-cheat, uncertainty calibration | |
| | `clinical_improvement` | safety delta, burden improvement, disease stability | |
| | `dosing_quality` | dosing quality and abstention quality | |
| | `process_integrity` | format compliance, efficiency, process fidelity, explanation grounding | |
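The grouping can be checked mechanically: every component should belong to exactly one channel. The sketch below also totals the documented component weights per channel, which is simple arithmetic on the two tables. How component scores are aggregated into a channel value is not stated here, so the sketch does not compute channel scores.

```python
from typing import Dict, List

WEIGHTS: Dict[str, float] = {
    "format_compliance_score": 0.08, "candidate_alignment_score": 0.08,
    "legality_score": 0.12, "safety_delta_score": 0.15,
    "burden_improvement_score": 0.08, "disease_stability_score": 0.10,
    "dosing_quality_score": 0.08, "abstention_quality_score": 0.06,
    "efficiency_score": 0.06, "process_fidelity_score": 0.06,
    "explanation_grounding_score": 0.03, "anti_cheat_score": 0.06,
    "uncertainty_calibration_score": 0.04,
}

# Channel membership as listed in the table above.
CHANNELS: Dict[str, List[str]] = {
    "safety_legality": ["legality_score", "candidate_alignment_score",
                        "anti_cheat_score", "uncertainty_calibration_score"],
    "clinical_improvement": ["safety_delta_score", "burden_improvement_score",
                             "disease_stability_score"],
    "dosing_quality": ["dosing_quality_score", "abstention_quality_score"],
    "process_integrity": ["format_compliance_score", "efficiency_score",
                          "process_fidelity_score", "explanation_grounding_score"],
}

# Total component weight carried by each channel.
channel_weight = {
    name: round(sum(WEIGHTS[c] for c in members), 2)
    for name, members in CHANNELS.items()
}
assigned = [c for members in CHANNELS.values() for c in members]
```

Under these weights the channel totals come out to 0.30, 0.33, 0.14, and 0.23.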
|
|
| These channels are emitted in `info.primary_reward_channels`, GRPO reward logs, |
| reports, plots, and ablation summaries. |
|
|
| ## Anti-Cheat and Failure Visibility |
|
|
| RL policies exploit reward functions. PolyGuard makes common shortcut failures |
| explicit: repeated action loops, excessive keep-regimen behavior, excessive |
| review/abstention behavior, candidate ID mismatch, candidate outside the legal |
| set, hidden high-risk DDI no-op behavior, parser exploit patterns in rationales, |
| and retries of failed no-op actions. |
|
|
| If an exploit is detected: |
|
|
| ```text |
| anti_cheat_score = 0.001 |
| done = true |
| termination_reason = "exploit_detection" |
| ``` |
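One of the listed checks, the repeated-action loop, can be sketched as follows. The `max_repeats` threshold of 3 and both function names are assumptions for illustration; the actual rules and thresholds are in `app/env/anti_cheat.py`.

```python
from typing import Dict, List, Tuple

ActionKey = Tuple[str, str]  # (action_type, target_drug)


def detect_repeat_loop(action_history: List[ActionKey], max_repeats: int = 3) -> bool:
    """True when the last `max_repeats` actions are identical (hypothetical threshold)."""
    if len(action_history) < max_repeats:
        return False
    tail = action_history[-max_repeats:]
    return all(a == tail[0] for a in tail)


def apply_anti_cheat(info: Dict[str, object], action_history: List[ActionKey]) -> Dict[str, object]:
    """Overlay the exploit outcome described above when a loop is detected."""
    if detect_repeat_loop(action_history):
        return {
            **info,
            "anti_cheat_score": 0.001,
            "done": True,
            "termination_reason": "exploit_detection",
        }
    return info


looping = [("KEEP_REGIMEN", "")] * 3
varied = [("KEEP_REGIMEN", ""), ("STOP_DRUG", "nsaid_like"), ("KEEP_REGIMEN", "")]
flagged = apply_anti_cheat({"done": False}, looping)
clean = apply_anti_cheat({"done": False}, varied)
```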
|
|
| Episodes can also terminate on step budget exhaustion, repeated invalid |
| actions, safety-veto threshold, patient destabilization, safe resolution, |
| wall-clock timeout, or per-step timeout. |
|
|
|  |
|
|
| ## Mathematics |
|
|
| PolyGuard can be read as a finite-horizon constrained partially observable |
| Markov decision process: |
|
|
| ```text |
| M = (S, A, O, T, R, H, C) |
| ``` |
|
|
| where `S` is latent patient/regimen state, `A` is the constrained medication |
| action set, `O` is the controlled observation, `T(s' | s, a)` is the transition |
| function, `R(s, a, s')` is verifier-backed reward, `H` is the episode horizon, |
| and `C(s, a)` is the hard safety/legality constraint predicate. |
|
|
| The objective is: |
|
|
| ```text |
| maximize_pi E_pi [ sum_{t=0}^{H-1} R(s_t, a_t, s_{t+1}) ] |
| subject to C(s_t, a_t) = 1 whenever possible |
| ``` |
|
|
| There is no explicit discount factor in the runtime. Time preference enters |
| through finite horizons and the efficiency reward: |
|
|
| ```text |
| efficiency_t = q(1 - step_count_t / (max_steps + 1)) |
| ``` |
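A short worked example makes the time preference concrete. The sketch assumes `step_count` counts steps already taken and reuses the `q` quantizer defined earlier in this document.

```python
def q(x: float) -> float:
    return round(min(max(x, 0.001), 0.999), 3)


def efficiency(step_count: int, max_steps: int) -> float:
    """efficiency_t = q(1 - step_count_t / (max_steps + 1))."""
    return q(1 - step_count / (max_steps + 1))


# With a 9-step budget the score decays linearly as steps are consumed.
scores = [efficiency(t, max_steps=9) for t in (0, 5, 9)]
print(scores)  # [0.999, 0.5, 0.1]
```

The `+ 1` in the denominator keeps the score above the lower clamp even on the final allowed step.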
|
|
| State transitions pass through two gates, the verifier and the anti-cheat check: |
|
|
| ```text |
| if verifier(s_t, a_t).legal and not anti_cheat(s_t, a_t): |
|     s_{t+1} = T(s_t, a_t) |
| else: |
|     s_{t+1} = rollback_state_with_failed_action_record(s_t, a_t) |
| ``` |
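The gate can be sketched as a small function that either applies the transition to a copy of the state or returns the unchanged state with the failed action recorded. The dictionary-based state, the `failed_actions` key, and the toy gate functions are all assumptions for illustration; they stand in for the verifier, anti-cheat module, and transition in `app/env/`.

```python
import copy
from typing import Any, Callable, Dict

State = Dict[str, Any]
Action = Dict[str, Any]


def gated_step(state: State, action: Action,
               is_legal: Callable[[State, Action], bool],
               is_exploit: Callable[[State, Action], bool],
               transition: Callable[[State, Action], State]) -> State:
    """Apply `transition` only when both gates pass; otherwise record the failure."""
    nxt = copy.deepcopy(state)  # never mutate the caller's state in place
    if is_legal(state, action) and not is_exploit(state, action):
        return transition(nxt, action)
    nxt.setdefault("failed_actions", []).append(action)
    return nxt


# Toy gates and transition, purely illustrative.
def toy_is_legal(state: State, action: Action) -> bool:
    return action.get("target_drug") in state["medications"]


def toy_is_exploit(state: State, action: Action) -> bool:
    return False


def toy_transition(state: State, action: Action) -> State:
    state["medications"].remove(action["target_drug"])
    return state


s0 = {"medications": ["warfarin_like", "nsaid_like"]}
legal = {"action_type": "STOP_DRUG", "target_drug": "nsaid_like"}
illegal = {"action_type": "STOP_DRUG", "target_drug": "opioid_like"}
s_ok = gated_step(s0, legal, toy_is_legal, toy_is_exploit, toy_transition)
s_bad = gated_step(s0, illegal, toy_is_legal, toy_is_exploit, toy_transition)
```

In the failed branch the medication list is identical to the input, which is the property the safety contract depends on.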
|
|
| Risk-like deltas become reward through: |
|
|
| ```text |
| delta_reward(pre, post) = q(0.5 + 0.6 * (pre - post)) |
| ``` |
|
|
| For burden and contraindicated-pair improvement: |
|
|
| ```text |
| burden_reward = delta_reward(pre_burden, post_burden) |
| pair_reward = delta_reward(pre_pairs, post_pairs) |
| |
| safety_delta_score = |
|     q(0.65 * pair_reward + 0.35 * burden_reward)   if legal |
|     0.001                                          otherwise |
| ``` |
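These formulas translate directly into code. The sketch below assumes the pre/post burden and pair values are already on the scale the environment uses (the text does not say whether pair counts are normalized), and the function signatures are illustrative.

```python
def q(x: float) -> float:
    return round(min(max(x, 0.001), 0.999), 3)


def delta_reward(pre: float, post: float) -> float:
    """0.5 means no change; values above 0.5 mean risk went down."""
    return q(0.5 + 0.6 * (pre - post))


def safety_delta_score(pre_pairs: float, post_pairs: float,
                       pre_burden: float, post_burden: float,
                       legal: bool) -> float:
    if not legal:
        return 0.001  # illegal actions get the floor value
    pair_reward = delta_reward(pre_pairs, post_pairs)
    burden_reward = delta_reward(pre_burden, post_burden)
    return q(0.65 * pair_reward + 0.35 * burden_reward)


unchanged = delta_reward(0.5, 0.5)
# One severe pair resolved, burden unchanged: 0.65 * 0.999 + 0.35 * 0.5.
resolved = safety_delta_score(1, 0, 0.5, 0.5, legal=True)
blocked = safety_delta_score(1, 0, 0.5, 0.5, legal=False)
```

A no-op therefore scores a neutral 0.5 on this component rather than zero, so the policy is not punished for stability alone but gains nothing from it either.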
|
|
| GRPO uses environment execution as the reward function. For each prompt, the |
| model emits candidate completions; PolyGuard parses the candidate id, resets a |
| deterministic environment using the recorded seed and scenario fields, executes |
| one step, and returns reward. The training reward combines environment reward |
| with a legality bonus: |
|
|
| ```text |
| legal_bonus = 0.95 if action is legal else 0.05 |
| |
| R_GRPO = q(0.80 * R_env + 0.20 * legal_bonus) |
| ``` |
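As a worked example of the blend, the sketch below evaluates the training reward for a legal and an illegal action with the same environment reward. The function name `grpo_reward` is an assumption.

```python
def q(x: float) -> float:
    return round(min(max(x, 0.001), 0.999), 3)


def grpo_reward(r_env: float, legal: bool) -> float:
    """R_GRPO = q(0.80 * R_env + 0.20 * legal_bonus)."""
    legal_bonus = 0.95 if legal else 0.05
    return q(0.80 * r_env + 0.20 * legal_bonus)


legal_case = grpo_reward(0.5, legal=True)     # 0.40 + 0.19
illegal_case = grpo_reward(0.5, legal=False)  # 0.40 + 0.01
```

With equal environment reward, legality alone moves the training signal by 0.18, which gives GRPO a consistent gradient toward verifier-accepted actions.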
|
|
| Conceptually, GRPO forms a within-prompt advantage: |
|
|
| ```text |
| A_i = (R_i - mean_j R_j) / (std_j R_j + epsilon) |
| ``` |
|
|
| and optimizes a clipped policy-ratio objective with KL regularization. The |
| optimizer mechanics are TRL's; PolyGuard's contribution is the verifier-backed |
| reward function and the controlled action/state environment. The expanded |
| derivation is in |
| [polyguard-rl/docs/mathematics.md](polyguard-rl/docs/mathematics.md). |
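The within-prompt advantage is easy to compute by hand for a small group. This sketch uses the population standard deviation and `epsilon = 1e-4`; both are assumptions, and TRL's exact normalization convention may differ.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], epsilon: float = 1e-4) -> List[float]:
    """A_i = (R_i - mean_j R_j) / (std_j R_j + epsilon) within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption of this sketch
    return [(r - mean) / (std + epsilon) for r in rewards]


# Four completions for one prompt: two legal and useful, two not.
advantages = group_advantages([0.82, 0.79, 0.41, 0.38])
# A group with identical rewards carries no learning signal.
flat = group_advantages([0.5, 0.5, 0.5])
```

The second case shows why reward diversity within a group matters: if every completion picks the same candidate, all advantages are zero and the step contributes nothing.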
|
|
| ## Data and Dataset Pipeline |
|
|
| The data pipeline builds a compact medication-safety substrate from local drug |
| knowledge, synthetic patients, scenario files, retrieval text, and optional |
| external augmentation. |
|
|
|  |
|
|
| The dataset design is documented in |
| [docs/datasets.md](polyguard-rl/docs/datasets.md). The local generated |
| pipeline produces these processed artifacts and counts: |
|
|
| | Artifact | Count | Path | |
| | --- | ---: | --- | |
| | Normalized drug rows | 10 | `data/processed/normalized_drugs.parquet` | |
| | Drug class rows | 10 | `data/processed/drug_classes.parquet` | |
| | Interaction rows | 2 | `data/processed/interactions.parquet` | |
| | Graph edges | 18 | `data/processed/graph_edges.parquet` | |
| | Synthetic patients | 20 | `data/processed/patients_synthetic.parquet` | |
| | Retrieval documents | 8 | `data/processed/retrieval_corpus.jsonl` | |
| | Easy scenarios | 100 | [data/scenarios/easy/](polyguard-rl/data/scenarios/easy/) | |
| | Medium scenarios | 200 | [data/scenarios/medium/](polyguard-rl/data/scenarios/medium/) | |
| | Hard scenarios | 200 | [data/scenarios/hard/](polyguard-rl/data/scenarios/hard/) | |
| | Local small SFT rows | 80 | `data/processed/training_corpus_sft.jsonl` | |
| | Local small GRPO prompts | 80 | `data/processed/training_corpus_grpo_prompts.jsonl` | |
|
|
| The provenance manifest generated by the local pipeline records source policy |
| and counts at `data/processed/provenance_manifest.json`. |
|
|
| Additional data-governance and rule artifacts are produced or consumed by the |
| pipeline: |
|
|
| | Artifact | Why it matters | |
| | --- | --- | |
| | `data/processed/ingested_sources.json` | Source ingestion ledger used by the local build | |
| | `data/processed/feature_dictionary.json` | Names and meanings of structured model features | |
| | `data/processed/burden_rules.yaml` | Medication-burden and duplicate-therapy rules | |
| | `data/processed/substitution_rules.yaml` | Data-level safer-substitution rules | |
| | `data/processed/taper_rules.yaml` | Deprescribing and taper requirements | |
| | `data/retrieval_index/index.json` | Retrieval index over local evidence chunks | |
|
|
| The local knowledge seed is |
| [data/raw/knowledge/drug_knowledge.json](polyguard-rl/data/raw/knowledge/drug_knowledge.json). |
| It contains drug classes, example high-risk pairs, renal and hepatic flags, |
| side-effect tags, substitution rules, and taper requirements. The processed |
| tables then feed graph modeling, candidate generation, environment scenarios, |
| retrieval, SFT rows, and GRPO prompts. |
|
|
| The full training/evidence runs used 2,000 examples per Qwen model, recorded in |
| the final reports under |
| [docs/results/final_submission_evidence/reports/](polyguard-rl/docs/results/final_submission_evidence/reports/). |
|
|
| ## Models Inside the Environment |
|
|
| PolyGuard combines learned and rule-backed components: |
|
|
| - Graph safety model: |
| [app/models/graph/](polyguard-rl/app/models/graph/) produces regimen |
| embeddings, pairwise DDI severity, severe-alert probability, and side-effect |
| tag probabilities. |
| - Tabular risk model: |
| [app/models/tabular/](polyguard-rl/app/models/tabular/) supports calibrated |
| patient/regimen risk heads and evaluation. |
| - Dosing model: |
| [app/models/dosing/](polyguard-rl/app/models/dosing/) models dose-sensitive |
| states with target attainment, toxicity, underdose risk, organ stress, |
| interaction load, and monitoring need. |
| - Retrieval: |
| [app/models/retrieval/](polyguard-rl/app/models/retrieval/) and |
| [app/knowledge/](polyguard-rl/app/knowledge/) provide local evidence chunks, |
| drug rules, renal/hepatic guardrails, duplicate therapy rules, substitution |
| rules, taper rules, burden scoring, and side-effect ontology. |
| - Active model runtime: |
| [app/models/policy/active_model.py](polyguard-rl/app/models/policy/active_model.py) |
| discovers activated artifacts from `checkpoints/active/active_model_manifest.json`; |
| the tracked evidence mirror includes |
| [docs/results/active_model_manifest.json](polyguard-rl/docs/results/active_model_manifest.json). |
| The provider load order prefers a GRPO adapter, then merged model, then SFT |
| adapter. |
| - Provider runtime: |
| [app/models/policy/provider_runtime.py](polyguard-rl/app/models/policy/provider_runtime.py) |
| is Transformers-first, with optional Ollama when enabled. If model loading is |
| unavailable, the runtime falls back to deterministic safety ranking. |
|
|
| Tracked support-model reports show that the environment is not only an LLM |
| wrapper: |
|
|
| | Component | Report | Current tracked result | |
| | --- | --- | --- | |
| | Graph model | [docs/results/graph_train.json](polyguard-rl/docs/results/graph_train.json) | `status: trained`, `num_samples: 180`, artifact path `outputs/models/graph_model.pkl` | |
| | Tabular risk model | [docs/results/risk_train.json](polyguard-rl/docs/results/risk_train.json) | `status: trained`, `dataset_size: 180`, `train_mae: 0.0033`, artifact path `outputs/models/tabular_risk.pkl` | |
| | Dose surrogate model | [docs/results/dose_train.json](polyguard-rl/docs/results/dose_train.json) | `status: trained`, `dataset_size: 120`, `train_mae: 0.0025`, artifact path `outputs/models/dose_model.pkl` | |
|
|
| The hard-coded contraindicated seed pairs in |
| [app/knowledge/ddi_knowledge.py](polyguard-rl/app/knowledge/ddi_knowledge.py) |
| include `warfarin_like` + `nsaid_like` and `benzodiazepine_like` + |
| `opioid_like`. Substitution rules in |
| [app/knowledge/substitution_rules.py](polyguard-rl/app/knowledge/substitution_rules.py) |
| include safer alternatives such as `nsaid_like -> acetaminophen_like`, |
| `nsaid_like -> topical_nsaid_like`, `benzodiazepine_like -> |
| non_benzo_sleep_support`, and `opioid_like -> non_opioid_analgesic`. |
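
A minimal sketch of how such seed pairs and substitution rules could be queried is shown below. The class names are the ones listed above; the container shapes, the helper functions, and the `statin_like` class in the usage line are assumptions made for illustration, not the layout of `ddi_knowledge.py` or `substitution_rules.py`.

```python
from itertools import combinations

# Pairs and substitutions copied from the text; the data structures are assumptions.
CONTRAINDICATED_PAIRS = {
    frozenset({"warfarin_like", "nsaid_like"}),
    frozenset({"benzodiazepine_like", "opioid_like"}),
}

SUBSTITUTIONS = {
    "nsaid_like": ["acetaminophen_like", "topical_nsaid_like"],
    "benzodiazepine_like": ["non_benzo_sleep_support"],
    "opioid_like": ["non_opioid_analgesic"],
}


def find_conflicts(regimen):
    """Return the contraindicated class pairs present in a regimen, in sorted order."""
    classes = sorted(set(regimen))
    return [
        pair
        for pair in combinations(classes, 2)
        if frozenset(pair) in CONTRAINDICATED_PAIRS
    ]


def suggest_substitutions(regimen):
    """Map each conflicting class that has listed alternatives to those alternatives."""
    suggestions = {}
    for pair in find_conflicts(regimen):
        for drug_class in pair:
            if drug_class in SUBSTITUTIONS:
                suggestions[drug_class] = SUBSTITUTIONS[drug_class]
    return suggestions


# "statin_like" is a hypothetical extra class used only to show a non-conflicting entry.
conflicts = find_conflicts(["warfarin_like", "nsaid_like", "statin_like"])
```

Using `frozenset` keys makes the pair lookup order-independent, so `warfarin_like` + `nsaid_like` matches regardless of how the regimen is listed.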
| |
| ### Precision Dosing |
| |
| Precision dosing focuses on dose-sensitive classes such as anticoagulants,
| sedatives, and glucose-lowering drugs. The dosing agent and surrogate model
| are implemented in
| [app/agents/dosing_agent.py](polyguard-rl/app/agents/dosing_agent.py) and
| [app/models/dosing/](polyguard-rl/app/models/dosing/).
| |
| The surrogate PK/PD transition in
| [app/models/dosing/surrogate_pkpd.py](polyguard-rl/app/models/dosing/surrogate_pkpd.py)
| updates effect, toxicity, and underdose levels from the dose change, an organ
| factor, and an interaction factor (primes mark next-step values):
| |
| ```text
| effective_delta = dose_delta * (1 - min(0.6, organ_factor * 0.4))
| 
| effect' =
|     clip(effect + 0.28 * effective_delta - 0.05 * interaction_factor, 0, 1)
| 
| toxicity_gain =
|     max(0, dose_delta) * (0.35 + 0.25 * organ_factor + 0.20 * interaction_factor)
| 
| toxicity' =
|     clip(0.85 * toxicity + toxicity_gain, 0, 1)
| 
| underdose' =
|     clip(1 - effect' + 0.15 * max(0, -dose_delta), 0, 1)
| ```
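
The transition above translates directly into a few lines of Python. The sketch below copies the coefficients from the formulas; the `DoseState` container and the `step` name are hypothetical, and it assumes `organ_factor` and `interaction_factor` are supplied by the caller, since their own updates are not shown here.

```python
from dataclasses import dataclass


def clip(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Clamp value into [low, high]."""
    return max(low, min(high, value))


@dataclass(frozen=True)
class DoseState:
    # Hypothetical container for the three levels the transition updates.
    effect: float
    toxicity: float
    underdose: float


def step(state: DoseState, dose_delta: float, organ_factor: float,
         interaction_factor: float) -> DoseState:
    """Apply one surrogate PK/PD transition using the coefficients from the text."""
    # Organ impairment damps the dose change, by at most 60 percent.
    effective_delta = dose_delta * (1 - min(0.6, organ_factor * 0.4))
    effect = clip(state.effect + 0.28 * effective_delta - 0.05 * interaction_factor)
    # Only dose increases add toxicity; existing toxicity decays by 15 percent per step.
    toxicity_gain = max(0.0, dose_delta) * (
        0.35 + 0.25 * organ_factor + 0.20 * interaction_factor
    )
    toxicity = clip(0.85 * state.toxicity + toxicity_gain)
    # Dose reductions add an extra underdose penalty on top of lost effect.
    underdose = clip(1 - effect + 0.15 * max(0.0, -dose_delta))
    return DoseState(effect, toxicity, underdose)


start = DoseState(effect=0.5, toxicity=0.2, underdose=0.5)
after_increase = step(start, dose_delta=0.5, organ_factor=0.5, interaction_factor=0.0)
after_decrease = step(start, dose_delta=-0.5, organ_factor=0.0, interaction_factor=0.0)
```

In this example the dose increase raises effect to about `0.612` and toxicity to about `0.4075`, while the dose decrease lowers toxicity to `0.17` but pushes underdose risk to about `0.715`.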
| |
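| The higher-level dosing metrics cover target attainment, toxicity avoidance,
| underdose risk, and monitoring need. The excerpt below shows three of the
| underlying quantities; target attainment peaks at an effect level of `0.62`: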
| |
| ```text |
| target_attainment = 1 - abs(effect_level - 0.62) |
| toxicity_proxy = toxicity_level + 0.20 * organ_stress + 0.12 * interaction_load |
| measurement_need = max(toxicity_proxy, underdose_proxy) |
| ``` |
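
As a quick worked example, the three quantities can be computed for one state. This is a sketch under stated assumptions: `underdose_proxy` is not defined in the excerpt above, so it is taken as an input, and the `dosing_metrics` function name is hypothetical.

```python
def dosing_metrics(effect_level: float, toxicity_level: float, organ_stress: float,
                   interaction_load: float, underdose_proxy: float) -> dict:
    """Compute the three quantities from the text for a single dosing state."""
    # Attainment is highest (1.0) when the effect level sits at the 0.62 target.
    target_attainment = 1 - abs(effect_level - 0.62)
    toxicity_proxy = toxicity_level + 0.20 * organ_stress + 0.12 * interaction_load
    # Monitoring is driven by whichever risk is larger.
    measurement_need = max(toxicity_proxy, underdose_proxy)
    return {
        "target_attainment": target_attainment,
        "toxicity_proxy": toxicity_proxy,
        "measurement_need": measurement_need,
    }


metrics = dosing_metrics(
    effect_level=0.62, toxicity_level=0.1, organ_stress=0.5,
    interaction_load=0.5, underdose_proxy=0.3,
)
```

Here the effect level is on target, the toxicity proxy is about `0.26`, and the underdose proxy of `0.3` is the larger risk, so it sets the measurement need.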
| |
| ## Training and Post-Training |
| |
| The training stack is deliberately staged: |
| |
| 1. Build structured data, scenarios, retrieval records, SFT examples, and GRPO |
| prompts. |
| 2. Run SFT with TRL to teach the model the candidate-id format and obvious |
| clinical priors. |
| 3. Run GRPO with environment-backed reward, where sampled candidate completions |
| are executed in PolyGuardEnv and scored by the verifier/reward router. |
| 4. Track sampled generations, reward components, primary reward channels, |
| legality, anti-cheat events, and training curves. |
| 5. Run policy-stack ablations and baseline comparisons. |
| 6. Merge or export adapters safely. |
| 7. Validate post-save inference from the saved artifact, not from an in-memory |
| training object. |
| 8. Generate reports, charts, action traces, and final artifact manifests. |
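
Step 3 is the part that differs most from ordinary fine-tuning, so a rough sketch may help. The function below has the `(prompts, completions, **kwargs) -> list[float]` shape that TRL's `GRPOTrainer` expects from a custom reward function, but everything else is an assumption: `StubEnv`, `step_candidate`, the `cand_<n>` id format, and the reward numbers are hypothetical stand-ins, not the PolyGuardEnv API or the code in `reward_functions.py`.

```python
import re
from typing import Callable

# Hypothetical candidate-id format used only for this sketch.
CANDIDATE_ID = re.compile(r"\bcand_\d+\b")


class StubEnv:
    """Stand-in for the real environment; the rewards are made-up numbers."""

    def __init__(self) -> None:
        self.rewards = {"cand_0": 0.2, "cand_1": 0.8}

    def step_candidate(self, candidate_id: str) -> float:
        if candidate_id not in self.rewards:
            raise KeyError(candidate_id)  # unknown id: not an executable action
        return self.rewards[candidate_id]


def make_env_reward(env_factory: Callable[[], StubEnv]):
    """Build a reward function that scores completions by executing them in an env."""

    def reward_fn(prompts, completions, **kwargs):
        rewards = []
        for completion in completions:
            match = CANDIDATE_ID.search(completion)
            if match is None:
                rewards.append(0.0)  # no parseable candidate id
                continue
            env = env_factory()  # fresh environment per sampled completion
            try:
                rewards.append(float(env.step_candidate(match.group(0))))
            except KeyError:
                rewards.append(0.0)  # parseable but not a legal candidate
        return rewards

    return reward_fn


reward_fn = make_env_reward(StubEnv)
scores = reward_fn(["p", "p", "p"], ["I pick cand_1", "cand_9", "looks fine"])
```

The design point is that free-form text such as "looks fine" earns nothing: a completion is only rewarded after it resolves to a candidate that the environment can execute and score.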
| |
| The relevant training source files are |
| [scripts/train_sft_trl.py](polyguard-rl/scripts/train_sft_trl.py), |
| [scripts/train_grpo_trl.py](polyguard-rl/scripts/train_grpo_trl.py), |
| [app/training/sft_trl.py](polyguard-rl/app/training/sft_trl.py), |
| [app/training/grpo_trl.py](polyguard-rl/app/training/grpo_trl.py), |
| [app/training/reward_functions.py](polyguard-rl/app/training/reward_functions.py), |
| [app/training/openenv_wrapper.py](polyguard-rl/app/training/openenv_wrapper.py), |
| and [app/hf_space/training_runner.py](polyguard-rl/app/hf_space/training_runner.py). |
| |
| The one-run notebook is |
| [polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb). |
| It is the accessible Colab/HF workflow for building data, running checks, |
| launching training, pulling reports, generating charts, validating inference, |
| activating a model, deploying the product Space, and running acceptance checks. |
| |
| The modular notebook series is: |
| |
| - [01_data_building.ipynb](polyguard-rl/notebooks/01_data_building.ipynb) |
| - [02_knowledge_graph.ipynb](polyguard-rl/notebooks/02_knowledge_graph.ipynb) |
| - [03_risk_models.ipynb](polyguard-rl/notebooks/03_risk_models.ipynb) |
| - [04_environment_validation.ipynb](polyguard-rl/notebooks/04_environment_validation.ipynb) |
| - [05_sft_debug.ipynb](polyguard-rl/notebooks/05_sft_debug.ipynb) |
| - [06_grpo_debug.ipynb](polyguard-rl/notebooks/06_grpo_debug.ipynb) |
| - [07_policy_analysis.ipynb](polyguard-rl/notebooks/07_policy_analysis.ipynb) |
| - [08_dosing_analysis.ipynb](polyguard-rl/notebooks/08_dosing_analysis.ipynb) |
| - [09_training_loop.ipynb](polyguard-rl/notebooks/09_training_loop.ipynb) |
| |
| For exact local and remote execution details, use |
| [docs/training.md](polyguard-rl/docs/training.md) and |
| [docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md). |
| This blog focuses on architecture, data, evaluation, and evidence rather than |
| private or environment-specific commands. |
| |
| ## Training Curves and Model Results |
| |
| The final curated evidence lives in |
| [polyguard-rl/docs/results/final_submission_evidence/](polyguard-rl/docs/results/final_submission_evidence/). |
| It replaces earlier smoke-run charts and older 0.5B/1.5B-only views. |
| |
| ### SFT Loss Across Qwen Runs |
| |
|  |
| |
| The SFT curves, post-save valid rates, and token-accuracy histories show that
| the models learned the candidate-id output contract rather than only producing
| unconstrained prose. Loss drops from roughly `3.0-3.6` at the start to about
| `0.05-0.07` at the final step across all three Qwen sizes.
| |
|  |
| |
| The tracked per-model summaries are: |
| |
| | Run | Model | Epochs | Final step | Runtime | Key SFT metrics | |
| | --- | --- | ---: | ---: | ---: | --- | |
| | `qwen-qwen2-5-0-5b-instruct` | `Qwen/Qwen2.5-0.5B-Instruct` | 2 | 2,000 | `234.6302s` | loss `3.0856 -> 0.0626`, best `0.0057`, train loss `0.1923`, token accuracy `0.9717`, valid rate `1.0`, avg env reward `0.726`, latency `1.839s` | |
| | `qwen-qwen2-5-1-5b-instruct` | `Qwen/Qwen2.5-1.5B-Instruct` | 2 | 4,000 | `483.7085s` | loss `2.9686 -> 0.0681`, best `0.0009`, train loss `0.1152`, token accuracy `0.9726`, valid rate `1.0`, avg env reward `0.726`, latency `2.158s` | |
| | `qwen-qwen2-5-3b-instruct` | `Qwen/Qwen2.5-3B-Instruct` | 2 | 2,000 | `715.2908s` | loss `3.5687 -> 0.054`, best `0.0022`, train loss `0.1569`, token accuracy `0.9750`, avg env reward `0.781`, latency `2.863s` |
| |
| Each SFT run used 2,000 examples. The 0.5B and 3B runs recorded 2,001 history |
| rows including the final trainer summary; the 1.5B run recorded 4,001 history |
| rows because its batch configuration produced 4,000 final steps. |
| |
| ### GRPO Reward Curve |
| |
|  |
| |
|  |
| |
| Complete GRPO evidence is available only for Qwen 3B:
| |
| - Backend: `trl_transformers` |
| - Model: `Qwen/Qwen2.5-3B-Instruct` |
| - Records: `2000` |
| - Epochs: `1.0` |
| - Final step: `2000` |
| - Runtime: `6873.9375s` (`1.91h`) |
| - Reward samples: `4000` |
| - GRPO average reward: `0.767` |
| - GRPO reward history: min `0.376`, max `0.880`, last `0.812`, average `0.76685` |
| - GRPO train loss: `0.000002665` |
| - Post-save GRPO valid rate: `1.0` |
| - Post-save GRPO average environment reward: `0.726` |
| - Post-save GRPO average latency: `3.681s` |
| - Artifact path recorded in the report: `checkpoints/sweeps/qwen-qwen2-5-3b-instruct/grpo_adapter` |
|
|
| Source reports: |
| [grpo_trl_run.json](polyguard-rl/docs/results/final_submission_evidence/reports/grpo_trl_run.json), |
| [postsave_inference_grpo.json](polyguard-rl/docs/results/final_submission_evidence/reports/postsave_inference_grpo.json), |
| and |
| [submission_summary.json](polyguard-rl/docs/results/final_submission_evidence/reports/submission_summary.json). |
|
|
| ### SFT vs GRPO by Model |
|
|
|  |
|
|
| This chart is intentionally transparent about artifact availability. Qwen 0.5B |
| and 1.5B have SFT reports/histories and post-save SFT evidence in the repo, but |
| their adapter directories were not present in the local/final artifact mirrors |
| at packaging time. Qwen 3B has the complete SFT plus GRPO artifact set. |
|
|
| The packaged manifest records Qwen 3B as complete with 125 checkpoint files |
| (`433,208,536` bytes), 11 SFT adapter files (`30,655,905` bytes), 11 GRPO |
| adapter files (`30,656,841` bytes), and 9 report files (`5,930,214` bytes). |
| Qwen 0.5B and 1.5B are retained as report/post-save evidence only. |
|
|
| Manifest: |
| [docs/results/final_submission_evidence/manifest.json](polyguard-rl/docs/results/final_submission_evidence/manifest.json). |
|
|
| ### Product Pipeline vs Basic LLM Proxy |
|
|
|  |
|
|
| Matched-seed evaluation compares a basic LLM-style first-legal proxy, an |
| SFT-style safety ranker, and the full PolyGuard orchestrated pipeline. The same |
| PolyGuard verifier/reward system judges all three. |
|
|
| | Policy | Episodes | Avg reward | Legality rate | Failure/exploit rate | Candidate diversity | |
| | --- | ---: | ---: | ---: | ---: | ---: | |
| | Basic LLM proxy | 8 | `0.762` | `1.0` | `0.25` | 1 | |
| | SFT policy proxy | 8 | `0.818` | `1.0` | `0.0` | 2 | |
| | Full PolyGuard pipeline | 8 | `0.805` | `1.0` | `0.0` | 2 | |
|
|
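| The full pipeline improves average verifier reward over the basic LLM proxy by
| `+0.043` (`0.762` to `0.805`) and reduces the visible failure/exploit rate from
| `0.25` to `0.0`. The SFT policy proxy scores slightly higher on average reward
| (`0.818`) at the same `0.0` failure rate, so with eight episodes per policy the
| two safety-aware policies are not clearly separated.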
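
The deltas can be recomputed from the table with a few lines of Python. The numbers are the ones in the table above; the dictionary keys and the `compare` helper are hypothetical names for this sketch.

```python
# Values copied from the comparison table; key names are assumptions.
RESULTS = {
    "basic_llm_proxy": {"avg_reward": 0.762, "failure_rate": 0.25},
    "sft_policy_proxy": {"avg_reward": 0.818, "failure_rate": 0.0},
    "full_polyguard": {"avg_reward": 0.805, "failure_rate": 0.0},
}
EPISODES = 8  # episodes per policy in the table


def compare(baseline: str, candidate: str) -> dict:
    """Return reward and failure-rate deltas of candidate relative to baseline."""
    base, cand = RESULTS[baseline], RESULTS[candidate]
    return {
        "reward_delta": round(cand["avg_reward"] - base["avg_reward"], 3),
        "failure_rate_delta": round(cand["failure_rate"] - base["failure_rate"], 3),
        # Failure rate times episode count gives the number of failing episodes.
        "baseline_failed_episodes": round(base["failure_rate"] * EPISODES),
    }


full_vs_basic = compare("basic_llm_proxy", "full_polyguard")
```

A failure rate of `0.25` over 8 episodes corresponds to 2 failing episodes for the basic proxy.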
|
|
|  |
|
|
| Two matched seeds expose the core failure mode: the basic policy repeatedly |
| kept a regimen despite the hidden `warfarin_like` + `nsaid_like` DDI holdout, |
| triggering `holdout_ddi_not_addressed`. The full pipeline selected safer dose |
| or hold candidates and avoided those failure reasons. |
|
|
| Source: |
| [basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json). |
|
|
| ### Reward Components and Channels |
|
|
|  |
|
|
|  |
|
|
| The reward charts are as important as the scalar reward curve. They show |
| whether the model is improving by becoming safer and more process-faithful or |
| merely exploiting one easy component. The reports log the full 13-component |
| reward vector and the four primary channels for GRPO and evaluation runs. |
|
|
| For Qwen 3B GRPO, the tracked average primary channels are: |
|
|
| | Channel | Average | |
| | --- | ---: | |
| | `safety_legality` | `0.816` | |
| | `clinical_improvement` | `0.609` | |
| | `dosing_quality` | `0.543` | |
| | `process_integrity` | `0.875` | |
|
|
| ### Post-Save Inference |
|
|
|  |
|
|
| Post-save inference is separate from training. The exported/activated artifact
| is loaded from disk and asked to choose candidate ids on held-out prompt
| samples. The Qwen 3B GRPO adapter path produced:
|
|
| - `model_source: adapter` |
| - `samples: 5` |
| - `valid_rate: 1.0` |
| - `avg_env_reward: 0.726` |
| - `avg_latency_seconds: 3.681` |
|
|
| The caveat matters: `valid_rate: 1.0` only means that every output was
| parseable and executable as a candidate selection. In the five-sample Qwen 3B
| post-save GRPO report, four of the five valid samples still terminated with
| `exploit_detection`. That is retained as safety evidence, because PolyGuard's
| job is to expose suspicious or loop-like behavior instead of hiding it behind
| a clean parse metric.
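
The gap between the two metrics is easy to see in a small summary function. The counts below match the report described above (five valid samples, four ending in `exploit_detection`); the record fields, the `summarize` helper, and the `other` label for the fifth sample are hypothetical, since the report schema is not shown here.

```python
from collections import Counter

# Five samples, all parseable; four end in exploit_detection, as in the text.
# Field names and the "other" label are assumptions made for this sketch.
samples = [
    {"valid": True, "termination_reason": "exploit_detection"},
    {"valid": True, "termination_reason": "exploit_detection"},
    {"valid": True, "termination_reason": "exploit_detection"},
    {"valid": True, "termination_reason": "exploit_detection"},
    {"valid": True, "termination_reason": "other"},
]


def summarize(records: list) -> dict:
    """Report the parse-level valid rate next to the exploit rate among valid samples."""
    total = len(records)
    valid = [r for r in records if r["valid"]]
    reasons = Counter(r["termination_reason"] for r in valid)
    return {
        "valid_rate": len(valid) / total if total else 0.0,
        "exploit_rate_among_valid": (
            reasons["exploit_detection"] / len(valid) if valid else 0.0
        ),
    }


summary = summarize(samples)
```

Reporting both numbers side by side keeps a perfect `valid_rate` from masking the `0.8` exploit-detection rate.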
|
|
| ## Agentic Evaluation |
|
|
| Evaluation is not one benchmark number. The evaluation stack under |
| [app/evaluation/](polyguard-rl/app/evaluation/) includes offline policy |
| evaluation, safety evaluation, dosing evaluation, robustness under missing labs |
| and noisy inputs, calibration and abstention evaluation, process fidelity, |
| subgroup summaries, explainability grounding, baseline comparison, policy |
| ablations, failure mining, and action traces. |
|
|
| The tracked benchmark report records: |
|
|
| | Metric family | Result | |
| | --- | --- | |
| | Offline avg reward | `0.772833` | |
| | Offline legal rate | `1.0` | |
| | Severe violation rate | `0.0` | |
| | Illegal step rate | `0.0` | |
| | Dosing target attainment | `0.75` | |
| | Dosing toxicity avoidance | `1.0` | |
| | Missing-labs safety rate | `0.666667` | |
| | Noisy-dose, conflicting-meds, alias-noise, hidden-duplicate, wrong-candidate-id, stale-evidence, delayed-ADE safety/resilience | `1.0` | |
| | Calibration ECE proxy | `0.08625` | |
| | Process fidelity | `0.92` | |
| | Explainability grounding | `0.8` | |
|
|
| Source: |
| [docs/results/benchmark_report.json](polyguard-rl/docs/results/benchmark_report.json). |
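
For readers unfamiliar with the calibration row, the sketch below shows the standard binned expected calibration error (ECE). It is an illustration of the general technique only: how the report's "ECE proxy" is actually computed is not stated here, and the function name, bin count, and example inputs are assumptions.

```python
def expected_calibration_error(confidences, outcomes, n_bins: int = 10) -> float:
    """Standard binned ECE: weighted mean of |avg confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for confidence, outcome in zip(confidences, outcomes):
        # Confidence 1.0 falls into the last bin instead of overflowing.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, outcome))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(o for _, o in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_confidence - accuracy)
    return ece


# Made-up example: four predictions at 0.9 confidence, three of them correct.
example_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

In the example, confidence `0.9` against accuracy `0.75` gives an ECE of about `0.15`; lower is better, and `0` means stated confidence matches observed accuracy in every bin.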
|
|
| The improvement gate compares baseline and candidate reports: |
|
|
| | Gate dimension | Delta | |
| | --- | ---: | |
| | Average reward | `+0.025833` | |
| | Legality rate | `0.0` non-regression | |
| | Success rate | `0.0` non-regression | |
| | Process fidelity | `+0.92` | |
| | Timeout rate | `0.0` non-regression | |
| | Failure visibility | `0.0` non-regression | |
|
|
| Source: |
| [docs/results/improvement_report.json](polyguard-rl/docs/results/improvement_report.json). |
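
A non-regression gate of this kind can be sketched as a per-metric delta check. This is a hypothetical illustration, not the logic in `scripts/acceptance_gate.py`: the metric keys, the assumed direction of each metric (including treating failure visibility as higher-is-better), and the placeholder numbers in the usage lines are all assumptions and do not come from the tracked reports.

```python
# Assumed direction per metric: True means higher is better.
HIGHER_IS_BETTER = {
    "avg_reward": True,
    "legality_rate": True,
    "success_rate": True,
    "process_fidelity": True,
    "timeout_rate": False,
    "failure_visibility": True,
}


def improvement_gate(baseline: dict, candidate: dict, tolerance: float = 1e-9) -> dict:
    """Compute per-metric deltas and fail the gate if any metric regresses."""
    deltas, regressions = {}, []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        delta = candidate[metric] - baseline[metric]
        deltas[metric] = round(delta, 6)
        regressed = delta < -tolerance if higher_is_better else delta > tolerance
        if regressed:
            regressions.append(metric)
    return {"passed": not regressions, "deltas": deltas, "regressions": regressions}


# Placeholder numbers for illustration only; they are not the tracked report values.
baseline = {"avg_reward": 0.70, "legality_rate": 1.0, "success_rate": 0.5,
            "process_fidelity": 0.0, "timeout_rate": 0.1, "failure_visibility": 1.0}
candidate = dict(baseline, avg_reward=0.75, process_fidelity=0.9)
report = improvement_gate(baseline, candidate)
```

A delta of `0.0` passes as non-regression, which is how the legality, success, timeout, and failure-visibility rows in the table above should be read.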
|
|
| ### Policy Ablation Results |
|
|
| | Stack | Avg reward | Legality | Visible failure rate | Exploit detections | Interpretation | |
| | --- | ---: | ---: | ---: | ---: | --- | |
| | `bandit_only` | `0.779625` | `1.0` | `0.0625` | 2 | Strong deterministic shortlist behavior with a low visible failure rate |
| | `llm_only` | `0.772391` | `1.0` | `0.3043` | 7 | Legal, but more loop-like failure behavior | |
| | `llm+bandit` | `0.764739` | `1.0` | `0.3043` | 7 | Current combined stack needs tighter exploration/control in these ablation settings | |
|
|
|  |
|
|
| The point of these ablations is not to claim every combined policy is always
| better. The point is that PolyGuard can localize behavior: legality stays at
| `1.0` in all three stacks, while failure mining shows whether a stack is
| looping, over-reviewing, or selecting non-improving candidates.
|
|
| Source: |
| [policy_ablation_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/policy_ablation_report.json). |
|
|
| ## OpenEnv and Product Surfaces |
|
|
| The OpenEnv package is compact: |
|
|
| ```yaml |
| spec_version: 1 |
| name: polyguard-openenv |
| runtime: fastapi |
| app: app.env.fastapi_app:app |
| port: 8100 |
| ``` |
|
|
| The OpenEnv runtime exposes `POST /reset`, `POST /step`, `GET /state`, |
| `GET /metadata`, `GET /schema`, `POST /mcp`, `GET /health`, `GET /ws`, and |
| backward-compatible `/env/*` routes. |
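
A client for these routes only needs the Python standard library. The sketch below builds the requests without sending them; the port comes from the manifest above, while the host, the JSON payload fields (`seed`, `action`, `candidate_id`), and the `build_request` helper are assumptions, since the request schemas are served by `GET /schema` rather than shown here.

```python
import json
from typing import Optional
from urllib.request import Request

# Port 8100 is from the manifest; localhost is an assumption for a local run.
BASE_URL = "http://localhost:8100"


def build_request(method: str, path: str, payload: Optional[dict] = None) -> Request:
    """Build (but do not send) a JSON request for one of the runtime routes."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(f"{BASE_URL}{path}", data=data, headers=headers, method=method)


# Payload fields are hypothetical; check GET /schema for the real contract.
reset_request = build_request("POST", "/reset", {"seed": 7})
step_request = build_request("POST", "/step", {"action": {"candidate_id": "cand_1"}})
state_request = build_request("GET", "/state")
```

Against a running server, each request could be sent with `urllib.request.urlopen(request)` and the response body decoded with `json.loads`.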
|
|
| The product API in [app/api/routes.py](polyguard-rl/app/api/routes.py) wraps |
| the environment, orchestrator, policy runtime, evaluation, evidence search, |
| cases, metrics, and medication-alternative tooling. Product-facing endpoints |
| include `/env/reset`, `/env/step_candidate`, `/agents/orchestrate`, |
| `/policy/infer`, `/policy/model_status`, `/eval/run_policy`, |
| `/metrics/training`, `/evidence/query`, and `/tools/medication_alternatives`. |
|
|
|  |
|
|
| ## Operations and Deployment |
|
|
| The repository keeps deployment and artifact operations explicit: |
|
|
| | Surface | Files | |
| | --- | --- | |
| | Local/container runtime | [Dockerfile](polyguard-rl/Dockerfile), [Dockerfile.space](polyguard-rl/Dockerfile.space), [docker-compose.yml](polyguard-rl/docker-compose.yml), [requirements.txt](polyguard-rl/requirements.txt), [requirements-space.txt](polyguard-rl/requirements-space.txt) | |
| | Product Space/API deployment | [scripts/deploy_space.sh](polyguard-rl/scripts/deploy_space.sh), [scripts/deploy_space_api.py](polyguard-rl/scripts/deploy_space_api.py), [docs/deployment.md](polyguard-rl/docs/deployment.md) | |
| | Training and evidence Spaces | [scripts/deploy_training_space.py](polyguard-rl/scripts/deploy_training_space.py), [scripts/monitor_training_space_status.py](polyguard-rl/scripts/monitor_training_space_status.py), [app/hf_space/training_runner.py](polyguard-rl/app/hf_space/training_runner.py), [app/hf_space/evidence_runner.py](polyguard-rl/app/hf_space/evidence_runner.py) | |
| | Artifact packaging and activation | [scripts/deploy_final_artifact_space.py](polyguard-rl/scripts/deploy_final_artifact_space.py), [scripts/package_active_model_bundle.py](polyguard-rl/scripts/package_active_model_bundle.py), [scripts/install_hf_active_bundle.py](polyguard-rl/scripts/install_hf_active_bundle.py), [docs/results/active_model_manifest.json](polyguard-rl/docs/results/active_model_manifest.json) | |
| | Submission validation | [scripts/acceptance_gate.py](polyguard-rl/scripts/acceptance_gate.py), [scripts/validate_submission_links.py](polyguard-rl/scripts/validate_submission_links.py), [docs/submission_checklist.md](polyguard-rl/docs/submission_checklist.md), [docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md) | |
|
|
| The important operational distinction is that local smoke artifacts, remote |
| training-space logs, final artifact packaging, and active-model installation |
| are separate stages. Final claims are tied to the curated evidence bundle, not |
| to whichever intermediate output directory happens to exist in a checkout. |
|
|
| ## The Workbench UI |
|
|
| The UI is a React 18 + Vite + TypeScript workbench under |
| [app/ui/frontend/](polyguard-rl/app/ui/frontend/). It is not the environment |
| itself; it is an operator surface over the API and OpenEnv runtime. |
|
|
| [Live workbench Space](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-workbench) |
|
|
|  |
|
|
| The main views cover patient workbench, episode replay, policy comparison and |
| policy lab, precision dosing, training monitor, safety inspector, candidate |
| actions, reward panel, episode trace, and alternative medication search through |
| `/tools/medication_alternatives`. |
|
|
| The Patient Workbench shows the active model chip, current scenario, candidate |
| set, agent-vs-environment flow, reward breakdown, and action trace without |
| requiring the reader to inspect raw JSON. The UI is intentionally a workbench, |
| not a polished clinical application. |
|
|
| ### UI Sequence |
|
|
| The five UI screenshots are checked in under `polyguard-rl/docs/UI Images/`. |
|
|
| 1. The workbench opens with model truth, live episode context, scenario status, |
| candidate count, and reward state. |
|
|
|  |
|
|
| 2. The episode panel makes the patient, task, difficulty, sub-environment, risk |
| delta, and candidate-action console visible without reading raw JSON. |
|
|
|  |
|
|
| 3. Candidate selection is paired with reward-channel feedback, current |
| medications, and blocked/available action visibility. |
|
|
|  |
|
|
| 4. After an action, the workbench exposes history, warnings, decision payload, |
| grounded facts, explanation, evidence, and event logs. |
|
|
|  |
|
|
| 5. The alternatives tool surfaces medication substitutions from the current |
| regimen and links out to source labels. |
|
|
|  |
|
|
| ## Demo Videos |
|
|
| ### [UI Walkthrough Video](https://drive.google.com/file/d/1YOzad5gvx-tSmGzJNuBgokBF4-dX2T2H/view?usp=sharing) |
|
|
| This walkthrough shows the deployed workbench surface, including the live model |
| chip, episode context, candidate actions, reward panels, and evidence-oriented |
| patient review flow. |
|
|
| ### [Agent In Action: Action Button Demo](https://drive.google.com/file/d/1eHk1v0OYJRrLWVO97ZclN05MYHxmNnmc/view?usp=sharing) |
|
|
| This demo focuses on what the action button does: selecting a candidate, |
| submitting it through the environment, producing a verifier-scored transition, |
| and exposing the resulting reward, action history, warnings, and explanation. |
|
|
| ### [World Model Tool: Tavily and OpenFDA Alternative Suggestions](https://drive.google.com/file/d/1GaUyyaXaBCHjhHFbpkprojNt5pLNAoYi/view?usp=sharing) |
|
|
| This tool demo shows the world-model support path for alternative medication |
| suggestions, using Tavily and the OpenFDA government database to retrieve |
| candidate alternatives and side-effect evidence for safer review. |
|
|
| ## How a Reviewer Should Read the Repository |
|
|
| For a fresh reviewer, the intended path is: |
|
|
| 1. Read the artifact index: |
| [polyguard-rl/docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md). |
| 2. Inspect the final curated evidence: |
| [polyguard-rl/docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md). |
| 3. Open the one-run notebook: |
| [PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb). |
| 4. For local smoke work, follow [docs/training.md](polyguard-rl/docs/training.md) |
| and the local scripts |
| [scripts/run_env_local.sh](polyguard-rl/scripts/run_env_local.sh), |
| [scripts/run_api_local.sh](polyguard-rl/scripts/run_api_local.sh), and |
| [scripts/run_ui_local.sh](polyguard-rl/scripts/run_ui_local.sh). |
| 5. For full training/reproduction, use the notebook or training docs rather |
| than copying private artifact commands out of old drafts. |
| 6. For final public artifacts, use the final artifact Space: |
| [adithya9903/polyguard-openenv-final-artifacts](https://huggingface.co/spaces/adithya9903/polyguard-openenv-final-artifacts). |
|
|
| ## Evidence and Artifact Inventory |
|
|
| Important evidence paths: |
|
|
| - Final overview: |
| [docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md) |
| - Artifact manifest: |
| [docs/results/final_submission_evidence/manifest.json](polyguard-rl/docs/results/final_submission_evidence/manifest.json) |
| - Three-model summary: |
| [docs/results/final_submission_evidence/reports/submission_summary.json](polyguard-rl/docs/results/final_submission_evidence/reports/submission_summary.json) |
| - Qwen 3B GRPO report: |
| [docs/results/final_submission_evidence/reports/grpo_trl_run.json](polyguard-rl/docs/results/final_submission_evidence/reports/grpo_trl_run.json) |
| - Post-save GRPO inference: |
| [docs/results/final_submission_evidence/reports/postsave_inference_grpo.json](polyguard-rl/docs/results/final_submission_evidence/reports/postsave_inference_grpo.json) |
| - Basic LLM vs PolyGuard: |
| [docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json) |
| - Policy ablation: |
| [docs/results/final_submission_evidence/reports/policy_ablation_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/policy_ablation_report.json) |
| - Action traces: |
| [docs/results/final_submission_evidence/reports/action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl) |
| - Curated charts: |
| [docs/results/final_submission_evidence/charts/curated/README.md](polyguard-rl/docs/results/final_submission_evidence/charts/curated/README.md) |
|
|
| Important tests: |
|
|
| | Category | Tests | |
| | --- | --- | |
| | Environment contract | [tests/test_openenv_contract.py](polyguard-rl/tests/test_openenv_contract.py), [tests/test_env_reset.py](polyguard-rl/tests/test_env_reset.py), [tests/test_env_step.py](polyguard-rl/tests/test_env_step.py), [tests/test_env_step_flow.py](polyguard-rl/tests/test_env_step_flow.py), [tests/test_future_subenvs.py](polyguard-rl/tests/test_future_subenvs.py) | |
| | Reward and safety | [tests/test_reward_functions.py](polyguard-rl/tests/test_reward_functions.py), [tests/test_reward_range.py](polyguard-rl/tests/test_reward_range.py), [tests/test_reward_channels.py](polyguard-rl/tests/test_reward_channels.py), [tests/test_anti_cheat.py](polyguard-rl/tests/test_anti_cheat.py), [tests/test_constraints.py](polyguard-rl/tests/test_constraints.py), [tests/test_timeout_logic.py](polyguard-rl/tests/test_timeout_logic.py) | |
| | Policy and runtime | [tests/test_agents.py](polyguard-rl/tests/test_agents.py), [tests/test_contextual_bandit.py](polyguard-rl/tests/test_contextual_bandit.py), [tests/test_policy_schema.py](polyguard-rl/tests/test_policy_schema.py), [tests/test_provider_runtime.py](polyguard-rl/tests/test_provider_runtime.py), [tests/test_postsave_inference.py](polyguard-rl/tests/test_postsave_inference.py), [tests/test_checkpoint_integrity.py](polyguard-rl/tests/test_checkpoint_integrity.py) | |
| | API and product tooling | [tests/test_api.py](polyguard-rl/tests/test_api.py), [tests/test_medication_alternatives.py](polyguard-rl/tests/test_medication_alternatives.py), [tests/test_remote_env.py](polyguard-rl/tests/test_remote_env.py) | |
| | Data and evidence | [tests/test_parser.py](polyguard-rl/tests/test_parser.py), [tests/test_dataops_parser.py](polyguard-rl/tests/test_dataops_parser.py), [tests/test_graph_infer.py](polyguard-rl/tests/test_graph_infer.py), [tests/test_submission_evidence.py](polyguard-rl/tests/test_submission_evidence.py) | |
| | Submission, notebook, and HF flow | [tests/test_acceptance_gate.py](polyguard-rl/tests/test_acceptance_gate.py), [tests/test_runner_notebook.py](polyguard-rl/tests/test_runner_notebook.py), [tests/test_hf_training_sweep.py](polyguard-rl/tests/test_hf_training_sweep.py) | |
|
|
| Additional architecture diagrams: |
|
|
| - [System architecture](polyguard-rl/docs/assets/diagrams/system_architecture.png) |
| - [Runtime step flow](polyguard-rl/docs/assets/diagrams/runtime_step_flow.png) |
| - [Data and training pipeline](polyguard-rl/docs/assets/diagrams/data_training_pipeline.png) |
| - [Multi-agent orchestration](polyguard-rl/docs/assets/diagrams/multi_agent_orchestration.png) |
| - [Reward decomposition](polyguard-rl/docs/assets/diagrams/reward_decomposition.png) |
| - [Episode state machine](polyguard-rl/docs/assets/diagrams/episode_state_machine.png) |
| - [Evidence generation flow](polyguard-rl/docs/assets/diagrams/evidence_generation_flow.png) |
| - [Deployment topology](polyguard-rl/docs/assets/diagrams/deployment_topology.png) |
| - [Frontend runtime surface](polyguard-rl/docs/assets/diagrams/frontend_runtime_surface.png) |
|
|
| ## Limitations |
|
|
| PolyGuard is a simulator and research environment. Its current data substrate |
| is compact and intentionally inspectable, not a production clinical knowledge |
| base. The final evidence set is strongest for Qwen 3B because that run has |
| complete SFT, GRPO, post-save GRPO, policy-ablation, adapter, and checkpoint |
| evidence. Qwen 0.5B and 1.5B have SFT reports/histories and post-save SFT |
| evidence, but their adapter directories are marked `reports_only_or_partial` in |
| the final manifest. |
|
|
| The reward is hand-designed and auditable rather than learned. That is a
| feature for this OpenEnv setting, but it also means reward-channel design
| should be stress-tested as the data grows. The current ablations show that
| contextual bandits are useful and inspectable, while the `llm+bandit` combined
| stack needs more tuning to avoid loop-like failure behavior in some settings.
|
|
| The right conclusion is not "this is a clinical decision system." The right |
| conclusion is that constrained environment feedback, verifier-backed rewards, |
| agentic evaluation, and explicit failure mining are a better substrate for |
| safety-critical medication-policy learning than free-form prompt responses. |
|
|
| ## References |
|
|
| - Alexandre Larouche, Audrey Durand, Richard Khoury, Caroline Sirois. |
| [Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy](https://arxiv.org/abs/2212.05190). |
| arXiv:2212.05190. |
| - World Health Organization. |
| [Medication Without Harm](https://www.who.int/initiatives/medication-without-harm). |
| - CDC. |
| [FastStats: Medication Safety Data](https://www.cdc.gov/medication-safety/data-research/facts-stats/index.html). |
| - Shehab N, Lovegrove MC, Geller AI, Rose KO, Weidle NJ, Budnitz DS. |
| [US Emergency Department Visits for Outpatient Adverse Drug Events, 2013-2014](https://jamanetwork.com/journals/jama/fullarticle/2585977). |
| JAMA. 2016;316(20):2115-2125. |
| - AHRQ / NCBI Bookshelf. |
| [Deprescribing To Reduce Medication Harms in Older Adults](https://www.ncbi.nlm.nih.gov/books/NBK600387/). |
| - American Geriatrics Society. |
| [2023 updated AGS Beers Criteria for potentially inappropriate medication use in older adults](https://pmc.ncbi.nlm.nih.gov/articles/PMC12478568/). |
| - O'Mahony et al. |
| [STOPP/START criteria for potentially inappropriate prescribing in older people: version 3](https://pmc.ncbi.nlm.nih.gov/articles/PMC10447584/). |
|
|
| ## License |
|
|
| The project package declares an MIT license in |
| [polyguard-rl/pyproject.toml](polyguard-rl/pyproject.toml). See |
| [polyguard-rl/LICENSE](polyguard-rl/LICENSE) for the license text. |
|
|