| --- |
| title: PolyGuard OpenEnv Workbench |
| colorFrom: blue |
| colorTo: green |
| sdk: docker |
| app_port: 7860 |
| pinned: false |
| license: mit |
| --- |
| |
| # POLYGUARD-OPENENV |
|
|
| Someone does not experience an unsafe medication regimen as "polypharmacy." |
| They experience it as dizziness after a new sleep medication, bleeding after a |
| painkiller is added to a blood thinner, confusion from a sedative-opioid |
| combination, or a preventable emergency visit because five prescribers each saw |
| one slice of the medication list. The dangerous part is often not a single |
| drug. It is the combination: the wrong pair, the wrong dose in the wrong organ |
| function context, the missing lab, the duplicated class, the abrupt stop that |
| should have been a taper, or the model that confidently says "looks fine" |
| because it was never forced to act inside a safety-checked environment. |
|
|
| That is the problem PolyGuard was built for. The |
| [CDC](https://www.cdc.gov/medication-safety/data-research/facts-stats/index.html) |
| reports that adverse drug events send more than 1.5 million people to US |
| emergency departments every year, with almost 500,000 hospitalizations; adults |
| 65 and older account for more than 600,000 of those emergency visits. A |
| CDC-authored [JAMA surveillance study](https://jamanetwork.com/journals/jama/fullarticle/2585977) |
| found that older adults made up 34.5 percent of ED visits for outpatient adverse |
| drug events and had the highest hospitalization rate, 43.6 percent; among older |
| adults, anticoagulants, diabetes agents, and opioid analgesics were implicated |
| in about 59.9 percent of ADE ED visits. Globally, the |
| [WHO](https://www.who.int/initiatives/medication-without-harm) estimates |
| medication errors cost USD 42 billion annually. And AHRQ's deprescribing safety |
| review summarizes estimates that |
| [45 percent of older adults are exposed to polypharmacy and 58 percent to |
| potentially inappropriate medications](https://www.ncbi.nlm.nih.gov/books/NBK600387/). |
|
|
| Not every adverse drug event is caused by an incorrect drug combination, but |
| these numbers describe the harm surface this project targets: medication |
| decisions where combination risk, monitoring gaps, frailty, organ function, |
| uncertainty, and action sequencing all matter at once. |
|
|
| PolyGuard turns that problem into an OpenEnv-compatible reinforcement-learning |
| environment for polypharmacy safety, medication optimization, deprescribing, |
| safe substitution, missing-evidence recovery, and precision dosing. An LLM |
| policy observes a constrained patient/regimen state, chooses a legal candidate |
| action, receives verifier-backed reward, and improves through SFT plus |
| GRPO-style post-training. |
|
|
| This repository is both a research artifact and a product prototype. It contains |
| the OpenEnv server, a multi-agent policy stack, synthetic and structured |
| medication datasets, TRL training scripts, verifier-backed reward functions, |
| agentic evaluation, curated result charts, final artifacts, and a React |
| operator workbench. |
|
|
| PolyGuard is not medical software and is not clinical advice. It is a controlled |
| research environment for studying how language-model policies can be trained and |
| audited on safety-critical medication action selection. |
|
|
| ## Safety Contract |
|
|
| PolyGuard does not let a model directly mutate a medication list from free text. |
| Every decision is candidate-based, verifier-checked, reward-decomposed, and |
| traced. Illegal actions can be scored, penalized, and logged, but they do not |
| change patient state. The system is designed for research on safety-critical |
| action selection, not for clinical ordering or patient-specific treatment |
| advice. |
|
|
| ## Try, Read, And Review |
|
|
| - GitHub repository: |
| [Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK](https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK) |
| - Product Hugging Face Space: |
| [TheJackBright/polyguard-openenv-workbench](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-workbench) |
| - One-run Colab/HF notebook: |
| [PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb) |
| - Final evidence index: |
| [polyguard-rl/docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md) |
| - Shared environment, logs, scripts, and notebooks: |
| [polyguard-rl/docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md) |
| - Final artifact/evidence Space: |
| [adithya9903/polyguard-openenv-final-artifacts](https://huggingface.co/spaces/adithya9903/polyguard-openenv-final-artifacts) |
|
|
| Note: the final artifact Space linked above hosts the Qwen 3B artifact bundle. |
| The Qwen 0.5B and 1.5B runs were trained using a second Hugging Face account, |
| so their model artifacts could not be hosted in the same Space. Their report |
| mirrors are checked into this repo: |
| [0.5B reports](polyguard-rl/docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-0-5b-instruct) |
| and |
| [1.5B reports](polyguard-rl/docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-1-5b-instruct). |
|
|
| ## Why This Problem Matters |
|
|
| Medication safety is a combinatorial, partially observable, and high-stakes |
| decision problem. A useful policy has to do more than generate a plausible |
| sentence. It has to notice drug-drug interaction risk, reason about |
| comorbidities and organ function, respect taper and monitoring requirements, |
| choose safe substitutions, abstain or ask for review when uncertainty is high, |
| and expose why it acted. The AGS Beers Criteria and STOPP/START criteria exist |
| because many unsafe medication choices are systematic, recognizable, and |
| evaluable, but still hard to operationalize across fragmented medication lists |
| and incomplete context. |
|
|
| The machine-learning pressure is equally real. If a medication vocabulary has |
| 500 drugs, the number of possible five-drug combinations is: |
|
|
| ```text |
| C(500, 5) = 255,244,687,600 |
| ``` |
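The count can be confirmed with the Python standard library. This is a quick check for the reader, not code from the repository:

```python
import math

# Number of unordered five-drug subsets of a 500-drug vocabulary.
n_combinations = math.comb(500, 5)
print(f"{n_combinations:,}")  # 255,244,687,600
```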
|
|
| Exhaustively evaluating every combination is impossible in realistic data |
| settings. The paper that inspired this project, [Neural Bandits for Data Mining: |
| Searching for Dangerous Polypharmacy](https://arxiv.org/abs/2212.05190), frames |
| dangerous polypharmacy discovery as a bandit search problem over a massive |
| combination space. It benchmarks neural bandit search over simulated |
| polypharmacy datasets with 500 drugs and 100,000 distinct combinations, and |
| reports detection of up to 72 percent of potentially inappropriate |
| polypharmacies with 99 percent average precision after 30,000 time steps. |
|
|
| PolyGuard takes inspiration from that search framing, but moves the problem from |
| offline combination mining into an agentic environment: the policy sees a |
| patient state, chooses among legal clinical action candidates, and is judged by |
| a deterministic verifier and reward router rather than by free-form text |
| preference alone. |
|
|
| ## A Concrete Failure Trace |
|
|
| In the final matched-seed traces, the failure mode is not abstract. On seeds |
| `8000` and `8004`, the basic prompt-style proxy repeatedly chose `cand_01`, |
| the first legal candidate, which meant `KEEP_REGIMEN` while a hidden |
| `warfarin_like` + `nsaid_like` interaction remained unresolved. The verifier |
| recorded `holdout_ddi_not_addressed`. The full PolyGuard pipeline selected |
| `cand_03`, a safer intervention candidate, and avoided those failure reasons. |
|
|
| That is the core research bet of this repo: medication AI should be judged |
| inside a stateful safety environment, not only by whether its answer sounds |
| clinically plausible. |
|
|
| Internal evidence: |
| [basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json) |
| and |
| [action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl). |
|
|
| ## Core Idea |
|
|
| PolyGuard asks a narrow but important research question: |
|
|
| Can environment-backed feedback make a small open model better at safe |
| medication action selection than prompt-only, first-legal, rule-only, or |
| single-agent baselines? |
|
|
| The project answers that question with an inspectable stack: |
|
|
| 1. A finite-horizon OpenEnv simulation for medication decisions. |
| 2. A constrained action space, so the model chooses candidate actions instead |
| of inventing arbitrary clinical instructions. |
| 3. A legality verifier that prevents unsafe state mutation. |
| 4. Thirteen reward components rolled into four primary reward channels. |
| 5. A multi-agent policy stack with supervisor routing, contextual bandit |
| reranking, planner selection, critic veto, and explanation logging. |
| 6. SFT for format and clinical-prior warm start. |
| 7. GRPO with environment-backed reward, not an opaque LLM judge. |
| 8. Agentic evaluation with baseline comparison, policy ablations, post-save |
| inference, robustness checks, action traces, and failure mining. |
|
|
|  |
|
|
| ## Internal Evidence At A Glance |
|
|
| | Claim | Repo evidence | |
| | --- | --- | |
| | Hard contraindication examples are represented | [app/knowledge/ddi_knowledge.py](polyguard-rl/app/knowledge/ddi_knowledge.py) | |
| | Safer alternatives are explicit | [app/knowledge/substitution_rules.py](polyguard-rl/app/knowledge/substitution_rules.py) | |
| | Unsafe substitutions and dose escalations are blocked before state mutation | [app/env/verifier.py](polyguard-rl/app/env/verifier.py) | |
| | Reward hacking and loop-like behavior are surfaced | [app/env/anti_cheat.py](polyguard-rl/app/env/anti_cheat.py), [docs/reward_design.md](polyguard-rl/docs/reward_design.md) | |
| | Baseline failure is traceable by seed and candidate | [docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json), [docs/results/final_submission_evidence/reports/action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl) | |
| | Final evidence is curated separately from older smoke artifacts | [docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md) | |
|
|
| ## Project Map |
|
|
| The implementation lives under [polyguard-rl/](polyguard-rl/). |
|
|
| | Area | Key paths | |
| | --- | --- | |
| | OpenEnv runtime | [openenv.yaml](polyguard-rl/openenv.yaml), [app/env/env_core.py](polyguard-rl/app/env/env_core.py), [app/env/fastapi_app.py](polyguard-rl/app/env/fastapi_app.py), [server/app.py](polyguard-rl/server/app.py) | |
| | Action/state contracts | [app/common/types.py](polyguard-rl/app/common/types.py), [app/common/enums.py](polyguard-rl/app/common/enums.py) | |
| | Candidate generation and verifier | [app/models/policy/candidate_builder.py](polyguard-rl/app/models/policy/candidate_builder.py), [app/env/verifier.py](polyguard-rl/app/env/verifier.py) | |
| | Reward and anti-cheat | [app/env/reward_router.py](polyguard-rl/app/env/reward_router.py), [app/env/reward_scaling.py](polyguard-rl/app/env/reward_scaling.py), [app/env/anti_cheat.py](polyguard-rl/app/env/anti_cheat.py), [configs/rewards.yaml](polyguard-rl/configs/rewards.yaml) | |
| | Multi-agent policy | [app/agents/](polyguard-rl/app/agents/), [docs/agents.md](polyguard-rl/docs/agents.md) | |
| | Bandits and baselines | [app/models/baselines/contextual_bandit.py](polyguard-rl/app/models/baselines/contextual_bandit.py), [app/models/baselines/contextual_bandit_policy.py](polyguard-rl/app/models/baselines/contextual_bandit_policy.py), [app/models/baselines/](polyguard-rl/app/models/baselines/) | |
| | Training | [app/training/](polyguard-rl/app/training/), [scripts/train_sft_trl.py](polyguard-rl/scripts/train_sft_trl.py), [scripts/train_grpo_trl.py](polyguard-rl/scripts/train_grpo_trl.py), [docs/training.md](polyguard-rl/docs/training.md) | |
| | Data | [data/raw/knowledge/drug_knowledge.json](polyguard-rl/data/raw/knowledge/drug_knowledge.json), [data/processed/](polyguard-rl/data/processed/), [data/scenarios/](polyguard-rl/data/scenarios/), [docs/datasets.md](polyguard-rl/docs/datasets.md) | |
| | Evaluation | [app/evaluation/](polyguard-rl/app/evaluation/), [scripts/evaluate_all.py](polyguard-rl/scripts/evaluate_all.py), [docs/evaluation.md](polyguard-rl/docs/evaluation.md) | |
| | Product API/UI | [app/api/](polyguard-rl/app/api/), [app/ui/frontend/](polyguard-rl/app/ui/frontend/), [docs/ui.md](polyguard-rl/docs/ui.md) | |
| | Math | [docs/math.md](polyguard-rl/docs/math.md), [docs/mathematics.md](polyguard-rl/docs/mathematics.md) | |
| | Results | [docs/results/final_submission_evidence/](polyguard-rl/docs/results/final_submission_evidence/) | |
|
|
| This README is the canonical narrative and evidence map. The docs under |
| [polyguard-rl/docs/](polyguard-rl/docs/) are supporting references: |
| [architecture.md](polyguard-rl/docs/architecture.md) for system design, |
| [environment_design.md](polyguard-rl/docs/environment_design.md) for |
| state/action mechanics, [reward_design.md](polyguard-rl/docs/reward_design.md) |
| for reward channels, [safety.md](polyguard-rl/docs/safety.md) for guardrails, |
| [precision_dosing.md](polyguard-rl/docs/precision_dosing.md) for dosing details, |
| [graph_models.md](polyguard-rl/docs/graph_models.md) for graph/risk modeling, |
| [ablations.md](polyguard-rl/docs/ablations.md) for policy-slice analysis, |
| [api.md](polyguard-rl/docs/api.md) for service routes, |
| [deployment.md](polyguard-rl/docs/deployment.md) for deployment surfaces, |
| [ui.md](polyguard-rl/docs/ui.md) and |
| [DEMO_RECORDING_SCRIPT.md](polyguard-rl/docs/DEMO_RECORDING_SCRIPT.md) for the |
| operator demo, and [submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md) |
| for artifact traceability. |
|
|
| Older smoke-run mirrors are retained for auditability. Final claims in this |
| README use the curated evidence bundle under |
| [docs/results/final_submission_evidence/](polyguard-rl/docs/results/final_submission_evidence/). |
|
|
| ## Environment Design |
|
|
| At the center is `PolyGuardEnv`, implemented in |
| [app/env/env_core.py](polyguard-rl/app/env/env_core.py). It follows the familiar |
| OpenEnv/Gym shape: |
|
|
| ```text |
| reset(seed, difficulty, sub_environment, scenario_id, patient_id) |
| -> PolyGuardObservation |
| |
| step(PolyGuardAction) |
| -> (PolyGuardObservation, reward, done, info) |
| ``` |
|
|
| At reset, the environment loads or generates a patient scenario, selects a |
| difficulty and sub-environment, computes a risk summary, builds candidate |
| actions, estimates uncertainty, and emits a strict observation. At step time, |
| the environment parses the action, checks legality, evaluates anti-cheat rules, |
| mutates state only if the action is safe, computes decomposed reward, appends a |
| trace, and returns detailed `info` fields such as failure reasons, transition |
| delta, primary reward channels, invalid-action count, and timeout checks. |
|
|
|  |
|
|
| ### Sub-Environments |
|
|
| PolyGuard is not a single task. It cycles through eight specialized sub-environments: |
|
|
| | Sub-environment | What it stresses | |
| | --- | --- | |
| | `DDI` | High-risk drug-drug interaction recognition and resolution | |
| | `BANDIT_MINING` | Candidate exploration and shortlist/ranking behavior inspired by bandit search | |
| | `REGIMEN_RISK` | General medication burden and regimen optimization | |
| | `PRECISION_DOSING` | Dose-hold, dose reduction, renal/hepatic guardrails, monitoring decisions | |
| | `LONGITUDINAL_DEPRESCRIBING` | Multi-step taper/deprescribing behavior over a longer horizon | |
| | `WEB_SEARCH_MISSING_DATA` | Evidence fetch or review when critical data is missing | |
| | `ALTERNATIVE_SUGGESTION` | Safe alternatives and within-class substitution | |
| | `NEW_DRUG_DECOMPOSITION` | First-pass reasoning over an unknown or combination medication | |
|
|
| The curriculum in [app/env/curriculum.py](polyguard-rl/app/env/curriculum.py) |
| starts with short easy DDI/regimen-risk episodes, then adds bandit and |
| alternative-selection tasks, and finally hard cases with precision dosing, |
| longitudinal deprescribing, missing data, and new-drug decomposition. |
|
|
| ### State And Observation |
|
|
| The latent state is represented by `PolyGuardState` and includes: |
|
|
| - Patient demographics and identifiers. |
| - Active decision mode. |
| - Step count and max step budget. |
| - Medications, dose buckets, comorbidities, labs, vitals, frailty, adherence, |
| monitoring gaps, and prior adverse event history. |
| - Burden score, severe-pair count, precision dosing flags, unresolved conflicts, |
| action history, cumulative reward, and done state. |
|
|
| The agent does not get all simulator internals. It receives a controlled |
| `PolyGuardObservation`: |
|
|
| - Patient summary. |
| - Medication table. |
| - Comorbidity summary. |
| - Organ function and labs/vitals. |
| - Graph safety summary. |
| - Burden score summary. |
| - Precision dosing flags. |
| - Unresolved conflicts. |
| - Candidate action set. |
| - Step budget remaining. |
| - Action history. |
| - Warning summary. |
| - Abstention indicators. |
| - Deterministic contract with seed, scenario, difficulty, and sub-environment. |
|
|
| This split matters: PolyGuard is a partially observable environment. Missing |
| labs and unresolved conflicts are visible as uncertainty signals, not as hidden |
| reward traps. |
|
|
| ## Action Space And Safety Constraints |
|
|
| PolyGuard deliberately avoids unconstrained text actions. The policy chooses a |
| strict `PolyGuardAction` with fields such as: |
|
|
| - `mode`: `REGIMEN_OPT`, `DOSE_OPT`, `REVIEW`, or `ABSTAIN_REVIEW`. |
| - `action_type`: one of the constrained clinical action types. |
| - `target_drug`, `replacement_drug`, `dose_bucket`, `taper_days`, |
| `monitoring_plan`, `evidence_query`, `new_drug_name`, and |
| `candidate_components`. |
| - `candidate_id`, `confidence`, and `rationale_brief`. |
|
|
| The action types are intentionally compact: |
|
|
| | Family | Action types | |
| | --- | --- | |
| | Regimen | `KEEP_REGIMEN`, `STOP_DRUG`, `SUBSTITUTE_WITHIN_CLASS`, `RECOMMEND_ALTERNATIVE` | |
| | Dosing | `REDUCE_DOSE_BUCKET`, `INCREASE_DOSE_BUCKET`, `DOSE_HOLD`, `ORDER_MONITORING_AND_WAIT` | |
| | Deprescribing | `TAPER_INITIATE`, `TAPER_CONTINUE` | |
| | Evidence and uncertainty | `FETCH_EXTERNAL_EVIDENCE`, `DECOMPOSE_NEW_DRUG`, `REQUEST_SPECIALIST_REVIEW`, `REQUEST_PHARMACIST_REVIEW` | |
|
|
| The candidate builder in |
| [app/models/policy/candidate_builder.py](polyguard-rl/app/models/policy/candidate_builder.py) |
| generates a bounded candidate set: |
|
|
| ```text |
| 3 <= |C_t| <= 10 |
| ``` |
|
|
| Each candidate carries estimated safety delta, burden delta, disease stability, |
| uncertainty score, rationale tags, required monitoring, and a legality precheck. |
| Policy selection is candidate selection: |
|
|
| ```text |
| a_t = to_action(c_t), where c_t is in C_t |
| ``` |
|
|
| The verifier in [app/env/verifier.py](polyguard-rl/app/env/verifier.py) enforces |
| hard safety constraints before state mutation. It checks, among other things: |
|
|
| - The target drug exists in the regimen when required. |
| - Substitutions and alternatives are drawn from allowed substitution rules. |
| - External evidence domains are allowlisted. |
| - New-drug decomposition includes a new drug and components. |
| - Abrupt stopping is blocked when tapering is required. |
| - Renal/hepatic unsafe dose escalation is blocked. |
| - Duplicate therapy and contraindicated replacement pairs are blocked. |
| - Monitoring and hold actions include a monitoring plan. |
| - Destabilizing deprescribing patterns are blocked. |
|
|
| Illegal actions can receive reward penalties and become visible in traces, but |
| they do not mutate patient state. |
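A minimal sketch of how a few of the checks listed above could be expressed as a pure function. The `Action` and `Verdict` classes, the rule sets, the `benzodiazepine_like` drug name, and the failure-reason strings are all hypothetical and chosen for illustration; the actual rules live in `app/env/verifier.py` and may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Action:
    action_type: str
    target_drug: Optional[str] = None
    monitoring_plan: Optional[str] = None


@dataclass
class Verdict:
    legal: bool
    failure_reasons: list = field(default_factory=list)


# Hypothetical rule tables, not copied from the repository.
TAPER_REQUIRED = {"benzodiazepine_like"}
NEEDS_TARGET = {"STOP_DRUG", "REDUCE_DOSE_BUCKET", "INCREASE_DOSE_BUCKET",
                "DOSE_HOLD", "TAPER_INITIATE"}
NEEDS_MONITORING = {"DOSE_HOLD", "ORDER_MONITORING_AND_WAIT"}


def verify(regimen: set, action: Action) -> Verdict:
    """Collect every violated constraint; the action is legal only if none fire."""
    reasons = []
    if action.action_type in NEEDS_TARGET and action.target_drug not in regimen:
        reasons.append("target_drug_not_in_regimen")
    if action.action_type == "STOP_DRUG" and action.target_drug in TAPER_REQUIRED:
        reasons.append("abrupt_stop_requires_taper")
    if action.action_type in NEEDS_MONITORING and not action.monitoring_plan:
        reasons.append("missing_monitoring_plan")
    return Verdict(legal=not reasons, failure_reasons=reasons)


regimen = {"warfarin_like", "benzodiazepine_like"}
print(verify(regimen, Action("STOP_DRUG", target_drug="benzodiazepine_like")))
```

Returning the full list of reasons, rather than stopping at the first violation, keeps failed actions informative when they show up in traces.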
|
|
| ## Multi-Agent Policy Stack |
|
|
| The "agents" in PolyGuard are an auditable policy factorization rather than |
| independent chatbots. A step flows through: |
|
|
| ```text |
| MedRec -> Evidence -> GraphSafety -> Dosing -> Candidate |
| -> Supervisor -> Planner -> Critic -> Env -> Explainer |
| ``` |
|
|
|  |
|
|
| | Agent/module | Role | |
| | --- | --- | |
| | `MedRecAgent` | Summarizes current regimen and medication burden | |
| | `EvidenceAgent` | Retrieves local or fallback evidence when missing data is present | |
| | `GraphSafetyAgent` | Scores risky pairs, side-effect load, duplicate therapy, and graph safety patterns | |
| | `DosingAgent` | Detects dose-sensitive cases and dose-hold opportunities | |
| | `CandidateAgent` | Exposes legal candidate actions from the environment candidate builder | |
| | `SupervisorAgent` | Routes to regimen optimization, dose optimization, or review mode | |
| | `PlannerAgent` | Selects an action from candidates through the policy provider | |
| | `CriticAgent` | Vetoes illegal or unsafe proposed actions and can force review fallback | |
| | `ExplainerAgent` | Records grounded rationale for demo, replay, and audit | |
|
|
| The orchestration modes are: |
|
|
| - `sequential_pipeline` |
| - `supervisor_routed` |
| - `replan_on_veto` |
| - `lightweight_debate` |
|
|
| Policy-stack ablations compare: |
|
|
| - `bandit-only` |
| - `llm-only` |
| - `llm+bandit` |
|
|
| ## Contextual Bandits |
|
|
| PolyGuard uses contextual bandits as an inspectable candidate-reranking layer. |
| This is where the project most directly echoes the arXiv bandit inspiration: |
| unsafe polypharmacy search is combinatorial, so the system should learn which |
| regions of the candidate/action space are worth exploring rather than enumerate |
| everything. |
|
|
| Each candidate becomes an 8-dimensional feature vector: |
|
|
| ```text |
| x(c) = [ |
| 1, |
| I[legality_precheck], |
| estimated_safety_delta, |
| burden_delta, |
| disease_stability_estimate, |
| 1 - uncertainty_score, |
| I[mode = DOSE_OPT], |
| I[mode = REVIEW] |
| ] |
| ``` |
|
|
| An arm is keyed by macro mode and action type: |
|
|
| ```text |
| arm(c) = mode(c) || ":" || action_type(c) |
| ``` |
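The feature map and arm key above translate almost directly into code. The sketch below assumes a hypothetical `Candidate` dataclass whose field names mirror the terms in the feature vector; the repository's own candidate type may differ.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    mode: str
    action_type: str
    legality_precheck: bool
    estimated_safety_delta: float
    burden_delta: float
    disease_stability_estimate: float
    uncertainty_score: float


def features(c: Candidate) -> list:
    """Build the 8-dimensional vector x(c) in the order given above."""
    return [
        1.0,                             # bias term
        float(c.legality_precheck),      # I[legality_precheck]
        c.estimated_safety_delta,
        c.burden_delta,
        c.disease_stability_estimate,
        1.0 - c.uncertainty_score,
        float(c.mode == "DOSE_OPT"),     # I[mode = DOSE_OPT]
        float(c.mode == "REVIEW"),       # I[mode = REVIEW]
    ]


def arm_key(c: Candidate) -> str:
    """Arms are keyed by macro mode and action type."""
    return f"{c.mode}:{c.action_type}"


example = Candidate("DOSE_OPT", "REDUCE_DOSE_BUCKET", True, 0.4, -0.1, 0.8, 0.25)
print(arm_key(example), features(example))
```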
|
|
| The LinUCB variant maintains, for each arm `a`: |
|
|
| ```text |
| A_a = I + sum x x^T |
| b_a = sum r x |
| theta_a = A_a^{-1} b_a |
| |
| score_a(x) = theta_a^T x + alpha * sqrt(x^T A_a^{-1} x) |
| ``` |
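As an illustration of these update rules, here is a dependency-free sketch of a single LinUCB arm. The class name `LinUCBArm` is hypothetical, and maintaining `A_a^{-1}` incrementally with the Sherman-Morrison identity is a choice made for this example; it is not necessarily how `contextual_bandit.py` computes the inverse.

```python
import math


class LinUCBArm:
    """Statistics for one arm: A^{-1} (kept directly) and b."""

    def __init__(self, dim: int) -> None:
        # A starts as the identity, so its inverse is the identity too.
        self.a_inv = [[float(i == j) for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.a_inv]

    def score(self, x, alpha: float = 1.0) -> float:
        theta = self._a_inv_times(self.b)            # theta = A^{-1} b
        mean = sum(t * xi for t, xi in zip(theta, x))
        a_inv_x = self._a_inv_times(x)
        width = math.sqrt(sum(xi * v for xi, v in zip(x, a_inv_x)))
        return mean + alpha * width                  # estimate plus exploration bonus

    def update(self, x, reward: float) -> None:
        # Sherman-Morrison: (A + x x^T)^{-1}
        #   = A^{-1} - (A^{-1} x)(A^{-1} x)^T / (1 + x^T A^{-1} x)
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


arm = LinUCBArm(dim=2)
before = arm.score([1.0, 0.0], alpha=0.5)   # no data yet: pure exploration bonus
arm.update([1.0, 0.0], reward=1.0)
after = arm.score([1.0, 0.0], alpha=0.5)    # estimate rises, bonus shrinks
print(before, after)
```

With no observations the score is the exploration bonus alone; each update raises the reward estimate along the observed direction while shrinking the bonus there.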
|
|
| There is also a Thompson-style variant: |
|
|
| ```text |
| score_a(x) = theta_a^T x + Normal(0, alpha) |
| ``` |
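A small sketch of the Thompson-style score. The formula does not say whether `alpha` is a variance or a standard deviation; this example assumes a standard deviation, and the function name is hypothetical.

```python
import random


def thompson_style_score(theta, x, alpha: float, rng: random.Random) -> float:
    """Point estimate plus zero-mean Gaussian noise."""
    mean = sum(t * xi for t, xi in zip(theta, x))
    # Assumption: alpha is treated as the standard deviation of the noise.
    return mean + rng.gauss(0.0, alpha)


rng = random.Random(0)  # seeded so repeated runs draw the same noise
theta = [0.2, 0.5]
x = [1.0, 1.0]
samples = [thompson_style_score(theta, x, alpha=0.1, rng=rng) for _ in range(5)]
print(samples)
```

Unlike the LinUCB bonus, this noise does not shrink as an arm accumulates data, so exploration stays roughly constant unless `alpha` is scheduled.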
|
|
| This layer can shortlist candidates before the planner emits the final action. |
| It is deliberately kept inside the candidate space: the bandit can improve |
| ordering and exploration, but it cannot invent an unsafe action outside the |
| environment contract. |
|
|
| ## Reward Model |
|
|
| The reward model is decomposed on purpose. A single scalar reward is needed for |
| RL, but safety-critical RL needs more than one opaque number. PolyGuard logs 13 |
| component columns and four primary channels on every step. |
|
|
|  |
|
|
| All reward values are clamped and quantized: |
|
|
| ```text |
| q(x) = round(clip(x, 0.001, 0.999), 3) |
| ``` |
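In Python the same clamp-and-quantize step is a one-liner. This is a sketch of the formula above, not a copy of `reward_scaling.py`:

```python
def q(x: float) -> float:
    """Clip to [0.001, 0.999], then round to three decimals."""
    return round(min(max(x, 0.001), 0.999), 3)


print(q(1.2), q(-0.3), q(0.5))  # 0.999 0.001 0.5
```

Keeping rewards strictly inside the open interval (0, 1) means no component ever reports an exact 0 or 1.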
|
|
| The 13 reward components are: |
|
|
| | Component | Weight | Meaning | |
| | --- | ---: | --- | |
| | `format_compliance_score` | 0.08 | Action payload conforms to the schema | |
| | `candidate_alignment_score` | 0.08 | The model selected a valid candidate-style id | |
| | `legality_score` | 0.12 | The verifier accepted the action | |
| | `safety_delta_score` | 0.15 | Severe-pair and burden risk decreased | |
| | `burden_improvement_score` | 0.08 | Dose-weighted medication burden improved | |
| | `disease_stability_score` | 0.10 | The action did not destabilize underlying disease management | |
| | `dosing_quality_score` | 0.08 | Dose-sensitive routing/action quality | |
| | `abstention_quality_score` | 0.06 | Review/abstention is appropriate under uncertainty | |
| | `efficiency_score` | 0.06 | The action uses the finite step budget well | |
| | `process_fidelity_score` | 0.06 | The action follows task-specific process expectations | |
| | `explanation_grounding_score` | 0.03 | The rationale is present and grounded | |
| | `anti_cheat_score` | 0.06 | Reward-hacking checks did not fire | |
| | `uncertainty_calibration_score` | 0.04 | Confidence matches observable uncertainty | |
|
|
| The scalar reward is a weighted average: |
|
|
| ```text |
| R_env(s_t, a_t, s_{t+1}) = q( sum_i w_i c_i / sum_i w_i ) |
| ``` |
|
|
| The weights sum to 1.00, and the safety-heavy terms account for just over half |
| of that total: |
|
|
| ```text |
| legality + safety_delta + burden + disease_stability + anti_cheat |
| = 0.12 + 0.15 + 0.08 + 0.10 + 0.06 |
| = 0.51 |
| ``` |
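The weighted average and the safety share can be reproduced from the table. The sketch below hard-codes the published weights; in the repository they are configured in `configs/rewards.yaml`, and the helper names here are illustrative.

```python
WEIGHTS = {
    "format_compliance_score": 0.08,
    "candidate_alignment_score": 0.08,
    "legality_score": 0.12,
    "safety_delta_score": 0.15,
    "burden_improvement_score": 0.08,
    "disease_stability_score": 0.10,
    "dosing_quality_score": 0.08,
    "abstention_quality_score": 0.06,
    "efficiency_score": 0.06,
    "process_fidelity_score": 0.06,
    "explanation_grounding_score": 0.03,
    "anti_cheat_score": 0.06,
    "uncertainty_calibration_score": 0.04,
}

SAFETY_HEAVY = (
    "legality_score",
    "safety_delta_score",
    "burden_improvement_score",
    "disease_stability_score",
    "anti_cheat_score",
)


def q(x: float) -> float:
    return round(min(max(x, 0.001), 0.999), 3)


def env_reward(components: dict) -> float:
    """Weighted average of the 13 component scores, then clamp and quantize."""
    total = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return q(weighted / total)


total_weight = sum(WEIGHTS.values())
safety_share = sum(WEIGHTS[name] for name in SAFETY_HEAVY) / total_weight
print(round(total_weight, 2), round(safety_share, 2))  # 1.0 0.51
```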
|
|
| The four primary reward channels are: |
|
|
| | Channel | Component family | |
| | --- | --- | |
| | `safety_legality` | legality, candidate alignment, anti-cheat, uncertainty calibration | |
| | `clinical_improvement` | safety delta, burden improvement, disease stability | |
| | `dosing_quality` | dosing quality and abstention quality | |
| | `process_integrity` | format compliance, efficiency, process fidelity, explanation grounding | |
|
|
| These channels are emitted in `info.primary_reward_channels`, GRPO reward logs, |
| reports, plots, and ablation summaries. |
|
|
| ## Anti-Cheat And Failure Visibility |
|
|
| RL policies tend to exploit weaknesses in their reward functions. PolyGuard |
| makes common shortcut failures explicit: |
|
|
| - Repeated action loops. |
| - Excessive keep-regimen behavior. |
| - Excessive review/abstention behavior. |
| - Candidate ID mismatch. |
| - Candidate outside the legal set. |
| - Hidden high-risk DDI no-op behavior. |
| - Parser exploit patterns in rationales. |
| - Retrying a failed no-op action. |
|
|
| If an exploit is detected: |
|
|
| ```text |
| anti_cheat_score = 0.001 |
| done = true |
| termination_reason = "exploit_detection" |
| ``` |
|
|
| Episodes can also terminate on step budget exhaustion, repeated invalid actions, |
| safety-veto threshold, patient destabilization, safe resolution, wall-clock |
| timeout, or per-step timeout. |
|
|
|  |
|
|
| ## Mathematics |
|
|
| PolyGuard can be read as a finite-horizon constrained partially observable |
| Markov decision process: |
|
|
| ```text |
| M = (S, A, O, T, R, H, C) |
| ``` |
|
|
| where: |
|
|
| - `S` is latent patient/regimen state. |
| - `A` is the constrained medication action set. |
| - `O` is the controlled observation. |
| - `T(s' | s, a)` is the transition function. |
| - `R(s, a, s')` is verifier-backed reward. |
| - `H` is the episode horizon. |
| - `C(s, a)` is the hard safety/legality constraint predicate. |
|
|
| The objective is: |
|
|
| ```text |
| maximize_pi E_pi [ sum_{t=0}^{H-1} R(s_t, a_t, s_{t+1}) ] |
| subject to C(s_t, a_t) = 1 whenever possible |
| ``` |
|
|
| There is no explicit discount factor in the runtime. Time preference enters |
| through finite horizons and the efficiency reward: |
|
|
| ```text |
| efficiency_t = q(1 - step_count_t / (max_steps + 1)) |
| ``` |
|
|
| State transition is two-gated: |
|
|
| ```text |
| if verifier(s_t, a_t).legal and not anti_cheat(s_t, a_t): |
| s_{t+1} = T(s_t, a_t) |
| else: |
| s_{t+1} = rollback_state_with_failed_action_record(s_t, a_t) |
| ``` |
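The two-gate rule can be sketched with plain dictionaries. Everything below (the state layout, the toy gates, and the `failed_actions` field) is a hypothetical stand-in for the repository's typed state and real verifier.

```python
import copy


def gated_step(state, action, verifier, anti_cheat, transition):
    """Apply `transition` only when both gates pass; otherwise log the failure."""
    next_state = copy.deepcopy(state)
    if verifier(state, action) and not anti_cheat(state, action):
        return transition(next_state, action)
    # Rollback path: clinical fields are untouched, the failed action is recorded.
    next_state["failed_actions"].append(action)
    return next_state


# Toy gates and transition, purely illustrative.
def toy_verifier(state, action):
    return (action["action_type"] != "STOP_DRUG"
            or action["target_drug"] in state["medications"])


def toy_anti_cheat(state, action):
    # Flags a retry of an action that has already failed.
    return action in state["failed_actions"]


def toy_transition(state, action):
    if action["action_type"] == "STOP_DRUG":
        state["medications"].remove(action["target_drug"])
    return state


state = {"medications": ["warfarin_like", "nsaid_like"], "failed_actions": []}
bad = {"action_type": "STOP_DRUG", "target_drug": "statin_like"}
good = {"action_type": "STOP_DRUG", "target_drug": "nsaid_like"}

after_bad = gated_step(state, bad, toy_verifier, toy_anti_cheat, toy_transition)
after_good = gated_step(after_bad, good, toy_verifier, toy_anti_cheat, toy_transition)
print(after_bad["medications"], after_good["medications"])
```

The illegal action leaves the medication list unchanged and appears only in the failure record, which matches the contract that illegal actions are scored and logged but never mutate patient state.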
|
|
| Risk-like deltas become reward through: |
|
|
| ```text |
| delta_reward(pre, post) = q(0.5 + 0.6 * (pre - post)) |
| ``` |
|
|
| For burden and contraindicated-pair improvement: |
|
|
| ```text |
| burden_reward = delta_reward(pre_burden, post_burden) |
| pair_reward = delta_reward(pre_pairs, post_pairs) |
| |
| safety_delta_score = |
| q(0.65 * pair_reward + 0.35 * burden_reward) if legal |
| 0.001 otherwise |
| ``` |
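A worked example of these formulas. It assumes the pre and post values are normalized risk scores in `[0, 1]`; the README does not state their scale, so treat the inputs as illustrative.

```python
def q(x: float) -> float:
    return round(min(max(x, 0.001), 0.999), 3)


def delta_reward(pre: float, post: float) -> float:
    """0.5 means no change; a reduction in risk pushes the score above 0.5."""
    return q(0.5 + 0.6 * (pre - post))


def safety_delta_score(pre_pairs, post_pairs, pre_burden, post_burden, legal):
    if not legal:
        return 0.001  # illegal actions get the floor value
    pair_reward = delta_reward(pre_pairs, post_pairs)
    burden_reward = delta_reward(pre_burden, post_burden)
    return q(0.65 * pair_reward + 0.35 * burden_reward)


# Pair risk falls from 0.5 to 0.0, burden falls from 0.6 to 0.4.
score = safety_delta_score(0.5, 0.0, 0.6, 0.4, legal=True)
print(score)  # 0.737
```

Here `pair_reward` is 0.8 and `burden_reward` is 0.62, so the blended score is `0.65 * 0.8 + 0.35 * 0.62 = 0.737`. The same improvement under an illegal action scores 0.001.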
|
|
| GRPO uses environment execution as the reward function. For each prompt, the |
| model emits candidate completions; PolyGuard parses the candidate id, resets a |
| deterministic environment using the recorded seed and scenario fields, executes |
| one step, and returns reward. The training reward combines environment reward |
| with a legality bonus: |
|
|
| ```text |
| legal_bonus = 0.95 if action is legal else 0.05 |
| |
| R_GRPO = q(0.80 * R_env + 0.20 * legal_bonus) |
| ``` |
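A short sketch of the training-reward blend, using the constants from the formula above. The function name is hypothetical.

```python
def q(x: float) -> float:
    return round(min(max(x, 0.001), 0.999), 3)


def grpo_reward(r_env: float, legal: bool) -> float:
    """Blend the environment reward with a legality bonus."""
    legal_bonus = 0.95 if legal else 0.05
    return q(0.80 * r_env + 0.20 * legal_bonus)


print(grpo_reward(0.7, legal=True))   # 0.75
print(grpo_reward(0.7, legal=False))  # 0.57
```

At a fixed `R_env`, legality alone moves the training reward by `0.20 * (0.95 - 0.05) = 0.18`, on top of whatever legality already contributes inside `R_env`.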
|
|
| Conceptually, GRPO forms a within-prompt advantage: |
|
|
| ```text |
| A_i = (R_i - mean_j R_j) / (std_j R_j + epsilon) |
| ``` |
|
|
| and optimizes a clipped policy-ratio objective with KL regularization. The |
| optimizer mechanics are TRL's; PolyGuard's contribution is the verifier-backed |
| reward function and the controlled action/state environment. |
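The within-prompt advantage can be illustrated without any training library. This sketch uses the population standard deviation and an arbitrary `epsilon`; TRL's own estimator and constant may differ, so read it as the concept rather than the optimizer's code.

```python
import statistics


def group_advantages(rewards, epsilon: float = 1e-4):
    """Standardize rewards across the completions sampled for one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std for simplicity
    return [(r - mean) / (std + epsilon) for r in rewards]


# Four completions for the same prompt: two legal (0.75), two illegal (0.57).
advantages = group_advantages([0.75, 0.57, 0.57, 0.75])
print(advantages)

# If every completion earns the same reward, all advantages are zero.
flat = group_advantages([0.6, 0.6, 0.6, 0.6])
print(flat)
```

The flat case shows why reward spread within a group matters: when all sampled completions score the same, the group contributes no learning signal.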
|
|
| The expanded derivation is in |
| [polyguard-rl/docs/mathematics.md](polyguard-rl/docs/mathematics.md). |
|
|
| ## Data And Dataset Pipeline |
|
|
| The data pipeline builds a compact medication-safety substrate from local drug |
| knowledge, synthetic patients, scenario files, retrieval text, and optional |
| external augmentation. |
|
|
|  |
|
|
| Tracked local processed data currently includes: |
|
|
| | Artifact | Count | Path | |
| | --- | ---: | --- | |
| | Normalized drug rows | 10 | [data/processed/normalized_drugs.parquet](polyguard-rl/data/processed/normalized_drugs.parquet) | |
| | Drug class rows | 10 | [data/processed/drug_classes.parquet](polyguard-rl/data/processed/drug_classes.parquet) | |
| | Interaction rows | 2 | [data/processed/interactions.parquet](polyguard-rl/data/processed/interactions.parquet) | |
| | Graph edges | 18 | [data/processed/graph_edges.parquet](polyguard-rl/data/processed/graph_edges.parquet) | |
| | Synthetic patients | 20 | [data/processed/patients_synthetic.parquet](polyguard-rl/data/processed/patients_synthetic.parquet) | |
| | Retrieval documents | 8 | [data/processed/retrieval_corpus.jsonl](polyguard-rl/data/processed/retrieval_corpus.jsonl) | |
| | Easy scenarios | 100 | [data/scenarios/scenarios_easy.jsonl](polyguard-rl/data/scenarios/scenarios_easy.jsonl) | |
| | Medium scenarios | 200 | [data/scenarios/scenarios_medium.jsonl](polyguard-rl/data/scenarios/scenarios_medium.jsonl) | |
| | Hard scenarios | 200 | [data/scenarios/scenarios_hard.jsonl](polyguard-rl/data/scenarios/scenarios_hard.jsonl) | |
| | Local small SFT rows | 80 | [data/processed/training_corpus_sft.jsonl](polyguard-rl/data/processed/training_corpus_sft.jsonl) | |
| | Local small GRPO prompts | 80 | [data/processed/training_corpus_grpo_prompts.jsonl](polyguard-rl/data/processed/training_corpus_grpo_prompts.jsonl) | |
|
|
| The provenance manifest records the source policy and counts: |
| [data/processed/provenance_manifest.json](polyguard-rl/data/processed/provenance_manifest.json). |
|
|
| Additional data-governance and rule artifacts are intentionally checked in: |
|
|
| | Artifact | Why it matters | |
| | --- | --- | |
| | [data/processed/ingested_sources.json](polyguard-rl/data/processed/ingested_sources.json) | Source ingestion ledger used by the local build | |
| | [data/processed/feature_dictionary.json](polyguard-rl/data/processed/feature_dictionary.json) | Names and meanings of structured model features | |
| | [data/processed/burden_rules.yaml](polyguard-rl/data/processed/burden_rules.yaml) | Medication-burden and duplicate-therapy rules | |
| | [data/processed/substitution_rules.yaml](polyguard-rl/data/processed/substitution_rules.yaml) | Data-level safer-substitution rules | |
| | [data/processed/taper_rules.yaml](polyguard-rl/data/processed/taper_rules.yaml) | Deprescribing and taper requirements | |
| | [data/retrieval_index/index.json](polyguard-rl/data/retrieval_index/index.json) | Retrieval index over local evidence chunks | |
|
|
| The local knowledge seed is |
| [data/raw/knowledge/drug_knowledge.json](polyguard-rl/data/raw/knowledge/drug_knowledge.json). |
| It contains drug classes, example high-risk pairs, renal and hepatic flags, |
| side-effect tags, substitution rules, and taper requirements. The processed |
| tables then feed graph modeling, candidate generation, environment scenarios, |
| retrieval, SFT rows, and GRPO prompts. |
|
|
| The full training/evidence runs used 2,000 examples per Qwen model, recorded in |
| the final reports under |
| [docs/results/final_submission_evidence/reports/](polyguard-rl/docs/results/final_submission_evidence/reports/). |
|
|
| ## Models Inside The Environment |
|
|
| PolyGuard combines learned and rule-backed components: |
|
|
| - Graph safety model: |
| [app/models/graph/](polyguard-rl/app/models/graph/) produces regimen |
| embeddings, pairwise DDI severity, severe-alert probability, and side-effect |
| tag probabilities. Fallback graph features include drug identity hashes, |
| class counts, side-effect load, medication count, contraindicated-pair count, |
| and class flags. |
| - Tabular risk model: |
| [app/models/tabular/](polyguard-rl/app/models/tabular/) supports calibrated |
| patient/regimen risk heads and evaluation. |
| - Dosing model: |
| [app/models/dosing/](polyguard-rl/app/models/dosing/) models dose-sensitive |
| states with target attainment, toxicity, underdose risk, organ stress, |
| interaction load, and monitoring need. |
| - Retrieval: |
| [app/models/retrieval/](polyguard-rl/app/models/retrieval/) and |
| [app/knowledge/](polyguard-rl/app/knowledge/) provide local evidence chunks, |
| drug rules, renal/hepatic guardrails, duplicate therapy rules, substitution |
| rules, taper rules, burden scoring, and side-effect ontology. |
| - Active model runtime: |
| [app/models/policy/active_model.py](polyguard-rl/app/models/policy/active_model.py) |
| discovers activated artifacts from `checkpoints/active/active_model_manifest.json`. |
| The provider load order prefers a GRPO adapter, then a merged model, then an
| SFT adapter.
| - Provider runtime: |
| [app/models/policy/provider_runtime.py](polyguard-rl/app/models/policy/provider_runtime.py) |
| is Transformers-first, with optional Ollama when enabled. If model loading is |
| unavailable, the runtime falls back to deterministic safety ranking. |
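
The load-order preference described above can be sketched as a simple priority scan. This is a minimal illustration, not the code in `active_model.py`: the manifest keys (`grpo_adapter`, `merged_model`, `sft_adapter`) and the `pick_artifact` helper are hypothetical names chosen for the example, and the real manifest schema may differ.

```python
from typing import Optional, Tuple

# Preference order stated above: GRPO adapter, then merged model, then SFT adapter.
# The key names are hypothetical; the real manifest schema may differ.
LOAD_ORDER = ("grpo_adapter", "merged_model", "sft_adapter")


def pick_artifact(manifest: dict) -> Optional[Tuple[str, str]]:
    """Return (kind, path) for the first available artifact, or None."""
    for kind in LOAD_ORDER:
        path = manifest.get(kind)
        if path:  # skip missing or empty entries
            return kind, path
    # Nothing loadable: the caller would fall back to deterministic safety ranking.
    return None


if __name__ == "__main__":
    example = {"sft_adapter": "checkpoints/active/sft", "grpo_adapter": ""}
    print(pick_artifact(example))
```

Returning `None` instead of raising keeps the fallback decision with the caller, which matches the "falls back to deterministic safety ranking" behavior described for the provider runtime.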
|
|
| Tracked support-model reports show that the environment is not only an LLM |
| wrapper: |
|
|
| | Component | Report | Current tracked result | |
| | --- | --- | --- | |
| | Graph model | [docs/results/graph_train.json](polyguard-rl/docs/results/graph_train.json) | `status: trained`, `num_samples: 180`, artifact path `outputs/models/graph_model.pkl` | |
| | Tabular risk model | [docs/results/risk_train.json](polyguard-rl/docs/results/risk_train.json) | `status: trained`, `dataset_size: 180`, `train_mae: 0.0033`, artifact path `outputs/models/tabular_risk.pkl` | |
| | Dose surrogate model | [docs/results/dose_train.json](polyguard-rl/docs/results/dose_train.json) | `status: trained`, `dataset_size: 120`, `train_mae: 0.0025`, artifact path `outputs/models/dose_model.pkl` | |
|
|
| The hard-coded contraindicated seed pairs in |
| [app/knowledge/ddi_knowledge.py](polyguard-rl/app/knowledge/ddi_knowledge.py) |
| include `warfarin_like` + `nsaid_like` and `benzodiazepine_like` + |
| `opioid_like`. Substitution rules in |
| [app/knowledge/substitution_rules.py](polyguard-rl/app/knowledge/substitution_rules.py) |
| include safer alternatives such as `nsaid_like -> acetaminophen_like`, |
| `nsaid_like -> topical_nsaid_like`, `benzodiazepine_like -> |
| non_benzo_sleep_support`, and `opioid_like -> non_opioid_analgesic`. |
| |
| ### Precision Dosing Details |
| |
| Precision dosing targets dose-sensitive classes such as anticoagulants,
| sedatives, and glucose-lowering drugs. The dosing agent and surrogate model are
| implemented in
| [app/agents/dosing_agent.py](polyguard-rl/app/agents/dosing_agent.py) and |
| [app/models/dosing/](polyguard-rl/app/models/dosing/). |
| |
| The surrogate PK/PD transition in |
| [app/models/dosing/surrogate_pkpd.py](polyguard-rl/app/models/dosing/surrogate_pkpd.py) |
| uses effect, toxicity, underdose, organ stress, and interaction load: |
| |
| ```text |
| effective_delta = dose_delta * (1 - min(0.6, organ_factor * 0.4))
|
| effect' =
|   clip(effect + 0.28 * effective_delta - 0.05 * interaction_factor, 0, 1)
|
| toxicity_gain =
|   max(0, dose_delta) * (0.35 + 0.25 * organ_factor + 0.20 * interaction_factor)
|
| toxicity' =
|   clip(0.85 * toxicity + toxicity_gain, 0, 1)
|
| underdose' =
|   clip(1 - effect' + 0.15 * max(0, -dose_delta), 0, 1)
| ``` |
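
The transition equations above translate almost line for line into Python. The sketch below is an illustration under stated assumptions, not the contents of `surrogate_pkpd.py`: the `DoseState` container and the `step` function are hypothetical names, and `organ_factor` and `interaction_factor` are assumed to lie in `[0, 1]`.

```python
from dataclasses import dataclass


def clip(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Clamp value into [low, high]."""
    return max(low, min(high, value))


@dataclass
class DoseState:  # hypothetical container for the three tracked levels
    effect: float
    toxicity: float
    underdose: float


def step(state: DoseState, dose_delta: float,
         organ_factor: float, interaction_factor: float) -> DoseState:
    """Apply one surrogate transition using the coefficients quoted above."""
    # Impaired organ function blunts the delivered dose change (capped at 60%).
    effective_delta = dose_delta * (1 - min(0.6, organ_factor * 0.4))
    effect = clip(state.effect + 0.28 * effective_delta - 0.05 * interaction_factor)
    # Only dose increases add toxicity; existing toxicity decays by 15% per step.
    toxicity_gain = max(0.0, dose_delta) * (
        0.35 + 0.25 * organ_factor + 0.20 * interaction_factor
    )
    toxicity = clip(0.85 * state.toxicity + toxicity_gain)
    # Dose reductions add an extra underdose penalty.
    underdose = clip(1 - effect + 0.15 * max(0.0, -dose_delta))
    return DoseState(effect, toxicity, underdose)


if __name__ == "__main__":
    start = DoseState(effect=0.5, toxicity=0.2, underdose=0.5)
    print(step(start, dose_delta=1.0, organ_factor=0.5, interaction_factor=0.0))
```

Note the asymmetry in the equations: a dose increase raises both effect and toxicity, while a dose decrease lowers effect and adds to underdose but never removes toxicity faster than the fixed decay.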
| |
| The higher-level dosing metrics use target attainment, toxicity avoidance, |
| underdose risk, and monitoring need: |
| |
| ```text |
| target_attainment = 1 - abs(effect_level - 0.62) |
| toxicity_proxy = toxicity_level + 0.20 * organ_stress + 0.12 * interaction_load |
| measurement_need = max(toxicity_proxy, underdose_proxy) |
| ``` |
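
The metric formulas above can be sketched the same way. Because `underdose_proxy` is not defined in the snippet above, this illustration takes it as an input rather than guessing its formula; the `dosing_metrics` function name and the returned dictionary keys are hypothetical.

```python
def dosing_metrics(effect_level: float, toxicity_level: float,
                   organ_stress: float, interaction_load: float,
                   underdose_proxy: float) -> dict:
    """Compute the three higher-level dosing metrics quoted above."""
    # 0.62 is the target effect level used in the formula above.
    target_attainment = 1 - abs(effect_level - 0.62)
    toxicity_proxy = toxicity_level + 0.20 * organ_stress + 0.12 * interaction_load
    # Monitoring is driven by whichever risk is larger.
    measurement_need = max(toxicity_proxy, underdose_proxy)
    return {
        "target_attainment": target_attainment,
        "toxicity_proxy": toxicity_proxy,
        "measurement_need": measurement_need,
    }


if __name__ == "__main__":
    print(dosing_metrics(0.62, 0.1, 0.5, 0.5, underdose_proxy=0.3))
```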
| |
| ## Training And Post-Training |
| |
| The training stack is deliberately staged: |
| |
| 1. Build structured data, scenarios, retrieval records, SFT examples, and GRPO |
| prompts. |
| 2. Run SFT with TRL to teach the model the candidate-id format and obvious |
| clinical priors. |
| 3. Run GRPO with environment-backed reward, where sampled candidate completions |
| are executed in PolyGuardEnv and scored by the verifier/reward router. |
| 4. Track sampled generations, reward components, primary reward channels, |
| legality, anti-cheat events, and training curves. |
| 5. Run policy-stack ablations and baseline comparisons. |
| 6. Merge or export adapters safely. |
| 7. Validate post-save inference from the saved artifact, not from an in-memory |
| training object. |
| 8. Generate reports, charts, action traces, and final artifact manifests. |
| |
| The relevant training source files are: |
| |
| - [scripts/train_sft_trl.py](polyguard-rl/scripts/train_sft_trl.py) |
| - [scripts/train_grpo_trl.py](polyguard-rl/scripts/train_grpo_trl.py) |
| - [app/training/sft_trl.py](polyguard-rl/app/training/sft_trl.py) |
| - [app/training/grpo_trl.py](polyguard-rl/app/training/grpo_trl.py) |
| - [app/training/reward_functions.py](polyguard-rl/app/training/reward_functions.py) |
| - [app/training/openenv_wrapper.py](polyguard-rl/app/training/openenv_wrapper.py) |
| - [app/hf_space/training_runner.py](polyguard-rl/app/hf_space/training_runner.py) |
| |
| The one-run notebook is |
| [polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb). |
| It is the accessible Colab/HF workflow for building data, running checks, |
| launching training, pulling reports, generating charts, validating inference, |
| activating a model, deploying the product Space, and running acceptance checks. |
| |
| The modular notebook series is: |
| |
| - [01_data_building.ipynb](polyguard-rl/notebooks/01_data_building.ipynb) |
| - [02_knowledge_graph.ipynb](polyguard-rl/notebooks/02_knowledge_graph.ipynb) |
| - [03_risk_models.ipynb](polyguard-rl/notebooks/03_risk_models.ipynb) |
| - [04_environment_validation.ipynb](polyguard-rl/notebooks/04_environment_validation.ipynb) |
| - [05_sft_debug.ipynb](polyguard-rl/notebooks/05_sft_debug.ipynb) |
| - [06_grpo_debug.ipynb](polyguard-rl/notebooks/06_grpo_debug.ipynb) |
| - [07_policy_analysis.ipynb](polyguard-rl/notebooks/07_policy_analysis.ipynb) |
| - [08_dosing_analysis.ipynb](polyguard-rl/notebooks/08_dosing_analysis.ipynb) |
| - [09_training_loop.ipynb](polyguard-rl/notebooks/09_training_loop.ipynb) |
| |
| For exact local and remote execution details, use |
| [docs/training.md](polyguard-rl/docs/training.md) and |
| [docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md). |
| Those docs contain operational notes; this README keeps the blog story focused |
| on architecture, data, evaluation, and evidence. |
| |
| ## Training Curves And Model Results |
| |
| The final curated evidence lives in |
| [polyguard-rl/docs/results/final_submission_evidence/](polyguard-rl/docs/results/final_submission_evidence/). |
| It replaces earlier smoke-run charts and older 0.5B/1.5B-only views. |
| |
| ### SFT Loss Across Qwen Runs |
| |
|  |
| |
| The SFT curves, post-save valid rates, and token-accuracy histories together
| show that the models learned the candidate-id output contract rather than only
| producing unconstrained prose. Across all three Qwen sizes, loss drops from
| roughly `3.0-3.6` at the start to roughly `0.05-0.07` at the final step.
| |
|  |
| |
| The tracked per-model summaries are: |
| |
| | Run | Model | Epochs | Final step | Runtime | Key SFT metrics | |
| | --- | --- | ---: | ---: | ---: | --- | |
| | `qwen-qwen2-5-0-5b-instruct` | `Qwen/Qwen2.5-0.5B-Instruct` | 2 | 2,000 | `234.6302s` | loss `3.0856 -> 0.0626`, best `0.0057`, train loss `0.1923`, token accuracy `0.9717`, valid rate `1.0`, avg env reward `0.726`, latency `1.839s` | |
| | `qwen-qwen2-5-1-5b-instruct` | `Qwen/Qwen2.5-1.5B-Instruct` | 2 | 4,000 | `483.7085s` | loss `2.9686 -> 0.0681`, best `0.0009`, train loss `0.1152`, token accuracy `0.9726`, valid rate `1.0`, avg env reward `0.726`, latency `2.158s` | |
| | `qwen-qwen2-5-3b-instruct` | `Qwen/Qwen2.5-3B-Instruct` | 2 | 2,000 | `715.2908s` | loss `3.5687 -> 0.054`, best `0.0022`, train loss `0.1569`, token accuracy `0.9750`, SFT avg env reward `0.781`, SFT latency `2.863s` | |
| |
| Each SFT run used `2,000` examples. The 0.5B and 3B runs recorded `2,001` |
| history rows including the final trainer summary; the 1.5B run recorded `4,001` |
| history rows because its batch configuration produced `4,000` final steps. |
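
The step counts above are consistent with the usual relationship between examples, epochs, and effective batch size. The sketch below is an inference, not a reading of the training configs: the effective batch sizes of `2` and `1` are assumptions backed out from the reported step counts, and the helper names are hypothetical.

```python
import math


def expected_steps(num_examples: int, epochs: int, effective_batch_size: int) -> int:
    """Optimizer steps when each epoch covers every example once."""
    return math.ceil(num_examples / effective_batch_size) * epochs


def expected_history_rows(steps: int) -> int:
    """One logged row per step plus the final trainer summary row."""
    return steps + 1


if __name__ == "__main__":
    # Assumed effective batch sizes, inferred from the reported final steps.
    for label, batch in (("0.5B / 3B", 2), ("1.5B", 1)):
        steps = expected_steps(2000, epochs=2, effective_batch_size=batch)
        print(label, steps, expected_history_rows(steps))
```

In practice the effective batch size combines per-device batch size, gradient accumulation, and device count, so different configurations can produce the same step count.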
| |
| ### GRPO Reward Curve |
| |
|  |
| |
|  |
| |
| The complete GRPO evidence is available for Qwen 3B: |
| |
| - Backend: `trl_transformers` |
| - Model: `Qwen/Qwen2.5-3B-Instruct` |
| - Records: `2000` |
| - Epochs: `1.0` |
| - Final step: `2000` |
| - Runtime: `6873.9375s` (`1.91h`) |
| - Reward samples: `4000` |
| - GRPO average reward: `0.767` |
| - GRPO reward history: min `0.376`, max `0.880`, last `0.812`, average `0.76685` |
| - GRPO train loss: `0.000002665` |
| - Post-save GRPO valid rate: `1.0` |
| - Post-save GRPO average environment reward: `0.726` |
| - Post-save GRPO average latency: `3.681s` |
| - Artifact path recorded in the report: `checkpoints/sweeps/qwen-qwen2-5-3b-instruct/grpo_adapter` |
|
|
| The source reports are: |
|
|
| - [reports/grpo_trl_run.json](polyguard-rl/docs/results/final_submission_evidence/reports/grpo_trl_run.json) |
| - [reports/postsave_inference_grpo.json](polyguard-rl/docs/results/final_submission_evidence/reports/postsave_inference_grpo.json) |
| - [reports/submission_summary.json](polyguard-rl/docs/results/final_submission_evidence/reports/submission_summary.json) |
|
|
| ### SFT vs GRPO By Model |
|
|
|  |
|
|
| This chart is intentionally transparent about artifact availability. Qwen 0.5B |
| and 1.5B have SFT reports/histories and post-save SFT evidence in the repo, but |
| their adapter directories were not present in the local/final artifact mirrors |
| at packaging time. Qwen 3B has the complete SFT plus GRPO artifact set. |
|
|
| The packaged manifest records Qwen 3B as complete with `125` checkpoint files |
| (`433,208,536` bytes), `11` SFT adapter files (`30,655,905` bytes), `11` GRPO |
| adapter files (`30,656,841` bytes), and `9` report files (`5,930,214` bytes). |
| Qwen 0.5B and 1.5B are retained as report/post-save evidence only. |
|
|
| The manifest records this explicitly: |
| [docs/results/final_submission_evidence/manifest.json](polyguard-rl/docs/results/final_submission_evidence/manifest.json). |
|
|
| ### Product Pipeline vs Basic LLM Proxy |
|
|
|  |
|
|
| Matched-seed evaluation compares a basic LLM-style first-legal proxy, an |
| SFT-style safety ranker, and the full PolyGuard orchestrated pipeline. The same |
| PolyGuard verifier/reward system judges all three. |
|
|
| | Policy | Episodes | Avg reward | Legality rate | Failure/exploit rate | Candidate diversity | |
| | --- | ---: | ---: | ---: | ---: | ---: | |
| | Basic LLM proxy | 8 | `0.762` | `1.0` | `0.25` | 1 | |
| | SFT policy proxy | 8 | `0.818` | `1.0` | `0.0` | 2 | |
| | Full PolyGuard pipeline | 8 | `0.805` | `1.0` | `0.0` | 2 | |
|
|
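| The full pipeline improves average verifier reward over the basic LLM proxy by
| `+0.043` (`0.805` vs `0.762`) while reducing the visible failure/exploit rate
| from `0.25` to `0.0`. The SFT policy proxy scores slightly higher on average
| reward (`0.818`) at the same `0.0` failure rate, so with 8 episodes per policy
| the two are best read as comparable rather than ranked.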
|
|
|  |
|
|
| Two matched seeds expose the core failure mode: the basic policy repeatedly |
| kept a regimen despite the hidden `warfarin_like` + `nsaid_like` DDI holdout, |
| triggering `holdout_ddi_not_addressed`. The full pipeline selected safer dose |
| or hold candidates and avoided those failure reasons. |
|
|
| Source: |
| [reports/basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json). |
|
|
| ### Reward Components And Channels |
|
|
|  |
|
|
|  |
|
|
| The reward charts are as important as the scalar reward curve. They show whether |
| the model is improving by becoming safer and more process-faithful or merely |
| exploiting one easy component. The reports log the full 13-component reward |
| vector and the four primary channels for GRPO and evaluation runs. |
|
|
| For Qwen 3B GRPO, the tracked average primary channels are: |
|
|
| | Channel | Average | |
| | --- | ---: | |
| | `safety_legality` | `0.816` | |
| | `clinical_improvement` | `0.609` | |
| | `dosing_quality` | `0.543` | |
| | `process_integrity` | `0.875` | |
|
|
| ### Post-Save Inference |
|
|
|  |
|
|
| Post-save inference is a separate check from training. The exported/activated
| artifact is loaded and asked to choose candidate ids on held-out prompt samples.
| The Qwen 3B GRPO adapter path produced:
|
|
| - `model_source: adapter` |
| - `samples: 5` |
| - `valid_rate: 1.0` |
| - `avg_env_reward: 0.726` |
| - `avg_latency_seconds: 3.681` |
|
|
| This is why the README treats post-training as more than a training log: the |
| saved artifact must still produce parseable candidate ids and executable |
| environment actions. |
|
|
| One caveat matters: `valid_rate: 1.0` means the output was parseable and |
| executable as a candidate selection. In the five-sample Qwen 3B post-save GRPO |
| report, four valid samples still terminated with `exploit_detection`. That is |
| retained as safety evidence, because PolyGuard's job is to expose suspicious or |
| loop-like behavior instead of hiding it behind a clean parse metric. |
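
The gap between a clean parse metric and the actual episode outcome is easy to see in a small summary function. This is a sketch with hypothetical field names (`valid`, `termination_reason`), not the schema of `postsave_inference_grpo.json`; the fifth sample's termination reason is a placeholder, since only the four `exploit_detection` terminations are reported above.

```python
from collections import Counter


def summarize(samples: list) -> dict:
    """Report valid rate alongside why the valid samples terminated."""
    valid = [s for s in samples if s["valid"]]
    reasons = Counter(s["termination_reason"] for s in valid)
    return {
        "valid_rate": len(valid) / len(samples),
        "termination_reasons": dict(reasons),
    }


if __name__ == "__main__":
    # Illustrative records shaped like the five-sample case described above.
    samples = [{"valid": True, "termination_reason": "exploit_detection"}] * 4
    samples.append({"valid": True, "termination_reason": "other"})  # placeholder
    print(summarize(samples))
```

Reporting both numbers side by side keeps `valid_rate: 1.0` from being read as "every episode ended well."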
|
|
| ## Agentic Evaluation |
|
|
| Evaluation is not just one benchmark number. The evaluation stack under |
| [app/evaluation/](polyguard-rl/app/evaluation/) includes: |
|
|
| - Offline policy evaluation. |
| - Safety evaluation. |
| - Dosing evaluation. |
| - Robustness under missing labs, noisy dose info, conflicting medications, |
| alias noise, hidden duplicate therapy, wrong candidate ids, stale evidence, |
| and delayed adverse event manifestation. |
| - Calibration and abstention evaluation. |
| - Process fidelity and invalid-action tracking. |
| - Subgroup summaries for renal compromise, hepatic compromise, and frailty. |
| - Explainability grounding. |
| - Baseline comparison. |
| - Policy ablations. |
| - Failure mining and action traces. |
|
|
| The tracked benchmark report records: |
|
|
| | Metric family | Result | |
| | --- | --- | |
| | Offline avg reward | `0.772833` | |
| | Offline legal rate | `1.0` | |
| | Severe violation rate | `0.0` | |
| | Illegal step rate | `0.0` | |
| | Dosing target attainment | `0.75` | |
| | Dosing toxicity avoidance | `1.0` | |
| | Missing-labs safety rate | `0.666667` | |
| | Noisy-dose, conflicting-meds, alias-noise, hidden-duplicate, wrong-candidate-id, stale-evidence, delayed-ADE safety/resilience | `1.0` | |
| | Calibration ECE proxy | `0.08625` | |
| | Process fidelity | `0.92` | |
| | Explainability grounding | `0.8` | |
|
|
| Source: |
| [docs/results/benchmark_report.json](polyguard-rl/docs/results/benchmark_report.json). |
|
|
| The improvement gate compares baseline and candidate reports: |
|
|
| | Gate dimension | Delta | |
| | --- | ---: | |
| | Average reward | `+0.025833` | |
| | Legality rate | `0.0` non-regression | |
| | Success rate | `0.0` non-regression | |
| | Process fidelity | `+0.92` | |
| | Timeout rate | `0.0` non-regression | |
| | Failure visibility | `0.0` non-regression | |
|
|
| Source: |
| [docs/results/improvement_report.json](polyguard-rl/docs/results/improvement_report.json). |
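
A non-regression gate of the kind summarized above can be sketched as a per-metric delta check. This is an assumption-laden illustration, not the logic behind `improvement_report.json`: the metric keys, the direction map, and the `tolerance` parameter are hypothetical, and the numbers in the usage example are made up.

```python
# Hypothetical direction map: True means higher is better.
HIGHER_IS_BETTER = {
    "avg_reward": True,
    "legality_rate": True,
    "success_rate": True,
    "process_fidelity": True,
    "timeout_rate": False,
}


def improvement_gate(baseline: dict, candidate: dict, tolerance: float = 0.0):
    """Return (passed, deltas); the gate fails if any metric regresses."""
    deltas = {}
    passed = True
    for metric, higher in HIGHER_IS_BETTER.items():
        delta = candidate[metric] - baseline[metric]
        deltas[metric] = delta
        regressed = delta < -tolerance if higher else delta > tolerance
        passed = passed and not regressed
    return passed, deltas


if __name__ == "__main__":
    # Made-up values for illustration only.
    baseline = {"avg_reward": 0.70, "legality_rate": 1.0, "success_rate": 0.5,
                "process_fidelity": 0.8, "timeout_rate": 0.1}
    candidate = dict(baseline, avg_reward=0.75)
    print(improvement_gate(baseline, candidate))
```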
|
|
| ### Policy Ablation Results |
|
|
| | Stack | Avg reward | Legality | Visible failure rate | Exploit detections | Interpretation | |
| | --- | ---: | ---: | ---: | ---: | --- | |
| | `bandit_only` | `0.779625` | `1.0` | `0.0625` | 2 | Strong deterministic shortlist behavior with the lowest visible failure rate |
| | `llm_only` | `0.772391` | `1.0` | `0.3043` | 7 | Legal, but more loop-like failure behavior | |
| | `llm+bandit` | `0.764739` | `1.0` | `0.3043` | 7 | Current combined stack needs tighter exploration/control in these ablation settings | |
|
|
|  |
|
|
| The point of these ablations is not to claim every combined policy is always |
| better. The point is that PolyGuard can localize behavior: legality remains |
| high, while failure mining shows whether a stack is looping, over-reviewing, |
| or selecting non-improving candidates. |
|
|
| Source: |
| [reports/policy_ablation_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/policy_ablation_report.json). |
|
|
| ## OpenEnv And Product Surfaces |
|
|
| The OpenEnv package is compact: |
|
|
| ```yaml |
| spec_version: 1 |
| name: polyguard-openenv |
| runtime: fastapi |
| app: app.env.fastapi_app:app |
| port: 8100 |
| ``` |
|
|
| The OpenEnv runtime exposes: |
|
|
| - `POST /reset` |
| - `POST /step` |
| - `GET /state` |
| - `GET /metadata` |
| - `GET /schema` |
| - `POST /mcp` |
| - `GET /health` |
| - `GET /ws` |
| - Backward-compatible `/env/*` routes |
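
A client for the routes listed above needs little more than the standard library. The sketch below is a minimal example under stated assumptions: it assumes a local runtime on `http://localhost:8100` (the port from the YAML above), and the request bodies, including the `candidate_id` field, are hypothetical placeholders because the real payloads are defined by `GET /schema`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8100"  # assumption: local runtime on the configured port


def build_request(method: str, path: str, payload=None) -> urllib.request.Request:
    """Build (but do not send) a JSON request for one of the listed routes."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Payload shapes are hypothetical; check GET /schema for the real contract.
    reset_req = build_request("POST", "/reset", {})
    step_req = build_request("POST", "/step", {"action": {"candidate_id": "cand_0"}})
    for req in (reset_req, step_req):
        print(req.get_method(), req.full_url)
```

Building requests separately from sending them makes the example safe to run without a live server; call `send(reset_req)` once the runtime is up.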
|
|
| The product API in [app/api/routes.py](polyguard-rl/app/api/routes.py) wraps the |
| environment, orchestrator, policy runtime, evaluation, evidence search, cases, |
| metrics, and medication-alternative tooling. Useful product-facing endpoints |
| include `/env/reset`, `/env/step_candidate`, `/agents/orchestrate`, |
| `/policy/infer`, `/policy/model_status`, `/eval/run_policy`, |
| `/metrics/training`, `/evidence/query`, and `/tools/medication_alternatives`. |
|
|
|  |
|
|
| ## Operations And Deployment |
|
|
| The repository keeps deployment and artifact operations explicit: |
|
|
| | Surface | Files | |
| | --- | --- | |
| | Local/container runtime | [Dockerfile](polyguard-rl/Dockerfile), [Dockerfile.space](polyguard-rl/Dockerfile.space), [docker-compose.yml](polyguard-rl/docker-compose.yml), [requirements.txt](polyguard-rl/requirements.txt), [requirements-space.txt](polyguard-rl/requirements-space.txt) | |
| | Product Space/API deployment | [scripts/deploy_space.sh](polyguard-rl/scripts/deploy_space.sh), [scripts/deploy_space_api.py](polyguard-rl/scripts/deploy_space_api.py), [docs/deployment.md](polyguard-rl/docs/deployment.md) | |
| | Training and evidence Spaces | [scripts/deploy_training_space.py](polyguard-rl/scripts/deploy_training_space.py), [scripts/monitor_training_space_status.py](polyguard-rl/scripts/monitor_training_space_status.py), [app/hf_space/training_runner.py](polyguard-rl/app/hf_space/training_runner.py), [app/hf_space/evidence_runner.py](polyguard-rl/app/hf_space/evidence_runner.py) | |
| | Artifact packaging and activation | [scripts/deploy_final_artifact_space.py](polyguard-rl/scripts/deploy_final_artifact_space.py), [scripts/package_active_model_bundle.py](polyguard-rl/scripts/package_active_model_bundle.py), [scripts/install_hf_active_bundle.py](polyguard-rl/scripts/install_hf_active_bundle.py), [checkpoints/active/active_model_manifest.json](polyguard-rl/checkpoints/active/active_model_manifest.json) | |
| | Submission validation | [scripts/acceptance_gate.py](polyguard-rl/scripts/acceptance_gate.py), [scripts/validate_submission_links.py](polyguard-rl/scripts/validate_submission_links.py), [docs/submission_checklist.md](polyguard-rl/docs/submission_checklist.md), [docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md) | |
|
|
| The important operational distinction is that local smoke artifacts, remote |
| training-space logs, final artifact packaging, and active-model installation are |
| separate stages. The final README claims are tied to the curated evidence |
| bundle, not to whichever intermediate output directory happens to exist in a |
| developer checkout. |
|
|
| ## UI Workbench |
|
|
| The UI is a React 18 + Vite + TypeScript workbench under |
| [app/ui/frontend/](polyguard-rl/app/ui/frontend/). It is not the environment |
| itself; it is an operator surface over the API and OpenEnv runtime. |
|
|
| [Live workbench Space](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-workbench) |
|
|
|  |
|
|
| The main views cover: |
|
|
| - Patient workbench. |
| - Episode replay. |
| - Policy comparison and policy lab. |
| - Precision dosing. |
| - Training monitor. |
| - Safety inspector. |
| - Candidate actions. |
| - Reward panel. |
| - Episode trace. |
| - Alternative medication search through `/tools/medication_alternatives`. |
|
|
| The Patient Workbench shows the active model chip, current scenario, candidate |
| set, agent-vs-environment flow, reward breakdown, and action trace without |
| requiring the reader to inspect raw JSON. The UI is intentionally a workbench, |
| not a polished clinical application. |
|
|
| ### UI Sequence |
|
|
| These screenshots are included in the repo under `polyguard-rl/docs/UI Images/`. |
| The image links below use URL-encoded paths so they render correctly when the |
| README is viewed on GitHub or inside the Hugging Face Space. |
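
The URL encoding mentioned above matters because the screenshot directory name contains a space. A small sketch using Python's standard library shows the transformation; the `encode_repo_path` helper name is hypothetical.

```python
from urllib.parse import quote


def encode_repo_path(path: str) -> str:
    """Percent-encode a repo-relative path while keeping the slashes."""
    return quote(path, safe="/")


if __name__ == "__main__":
    print(encode_repo_path("polyguard-rl/docs/UI Images/01-workbench-overview.png"))
```

Only the space changes (to `%20`); hyphens, dots, and slashes pass through untouched.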
|
|
| 1. The workbench opens with model truth, live episode context, scenario status, |
| candidate count, and reward state. |
|
|
|  |
|
|
| 2. The episode panel makes the patient, task, difficulty, sub-environment, risk |
| delta, and candidate-action console visible without reading raw JSON. |
|
|
|  |
|
|
| 3. Candidate selection is paired with reward-channel feedback, current |
| medications, and blocked/available action visibility. |
|
|
|  |
|
|
| 4. After an action, the workbench exposes history, warnings, decision payload, |
| grounded facts, explanation, evidence, and event logs. |
|
|
|  |
|
|
| 5. The alternatives tool surfaces medication substitutions from the current |
| regimen and links out to source labels. |
|
|
|  |
|
|
| ## [UI Walkthrough Video](https://drive.google.com/file/d/1YOzad5gvx-tSmGzJNuBgokBF4-dX2T2H/view?usp=sharing) |
|
|
| This walkthrough shows the deployed workbench surface, including the live model |
| chip, episode context, candidate actions, reward panels, and evidence-oriented |
| patient review flow. |
|
|
| ## [Agent In Action: Action Button Demo](https://drive.google.com/file/d/1eHk1v0OYJRrLWVO97ZclN05MYHxmNnmc/view?usp=sharing) |
|
|
| This demo focuses on what the action button does: selecting a candidate, |
| submitting it through the environment, producing a verifier-scored transition, |
| and exposing the resulting reward, action history, warnings, and explanation. |
|
|
| ## [World Model Tool: Tavily And OpenFDA Alternative Suggestions](https://drive.google.com/file/d/1GaUyyaXaBCHjhHFbpkprojNt5pLNAoYi/view?usp=sharing) |
|
|
| This tool demo shows the world-model support path for alternative medication |
| suggestions, using Tavily and the OpenFDA government database to retrieve |
| candidate alternatives and side-effect evidence for safer review. |
|
|
| ## Execution Path For Readers |
|
|
| For a fresh reviewer, the intended path is: |
|
|
| 1. Read the artifact index: |
| [polyguard-rl/docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md). |
| 2. Inspect the final curated evidence: |
| [polyguard-rl/docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md). |
| 3. Open the one-run notebook: |
| [PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb). |
| 4. For local smoke work, follow [docs/training.md](polyguard-rl/docs/training.md) |
| and the local scripts: |
| [scripts/run_env_local.sh](polyguard-rl/scripts/run_env_local.sh), |
| [scripts/run_api_local.sh](polyguard-rl/scripts/run_api_local.sh), and |
| [scripts/run_ui_local.sh](polyguard-rl/scripts/run_ui_local.sh). |
| 5. For full training/reproduction, use the notebook or training docs rather |
| than copying private artifact commands out of old drafts. |
| 6. For final public artifacts, use the final artifact Space: |
| [adithya9903/polyguard-openenv-final-artifacts](https://huggingface.co/spaces/adithya9903/polyguard-openenv-final-artifacts). |
|
|
| ## Evidence And Artifact Inventory |
|
|
| Important evidence paths: |
|
|
| - Final overview: |
| [docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md) |
| - Artifact manifest: |
| [docs/results/final_submission_evidence/manifest.json](polyguard-rl/docs/results/final_submission_evidence/manifest.json) |
| - Three-model summary: |
| [docs/results/final_submission_evidence/reports/submission_summary.json](polyguard-rl/docs/results/final_submission_evidence/reports/submission_summary.json) |
| - Qwen 3B GRPO report: |
| [docs/results/final_submission_evidence/reports/grpo_trl_run.json](polyguard-rl/docs/results/final_submission_evidence/reports/grpo_trl_run.json) |
| - Post-save GRPO inference: |
| [docs/results/final_submission_evidence/reports/postsave_inference_grpo.json](polyguard-rl/docs/results/final_submission_evidence/reports/postsave_inference_grpo.json) |
| - Basic LLM vs PolyGuard: |
| [docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json) |
| - Policy ablation: |
| [docs/results/final_submission_evidence/reports/policy_ablation_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/policy_ablation_report.json) |
| - Action traces: |
| [docs/results/final_submission_evidence/reports/action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl) |
| - Curated charts: |
| [docs/results/final_submission_evidence/charts/curated/README.md](polyguard-rl/docs/results/final_submission_evidence/charts/curated/README.md) |
|
|
| Important tests: |
|
|
| | Category | Tests | |
| | --- | --- | |
| | Environment contract | [tests/test_openenv_contract.py](polyguard-rl/tests/test_openenv_contract.py), [tests/test_env_reset.py](polyguard-rl/tests/test_env_reset.py), [tests/test_env_step.py](polyguard-rl/tests/test_env_step.py), [tests/test_env_step_flow.py](polyguard-rl/tests/test_env_step_flow.py), [tests/test_future_subenvs.py](polyguard-rl/tests/test_future_subenvs.py) | |
| | Reward and safety | [tests/test_reward_functions.py](polyguard-rl/tests/test_reward_functions.py), [tests/test_reward_range.py](polyguard-rl/tests/test_reward_range.py), [tests/test_reward_channels.py](polyguard-rl/tests/test_reward_channels.py), [tests/test_anti_cheat.py](polyguard-rl/tests/test_anti_cheat.py), [tests/test_constraints.py](polyguard-rl/tests/test_constraints.py), [tests/test_timeout_logic.py](polyguard-rl/tests/test_timeout_logic.py) | |
| | Policy and runtime | [tests/test_agents.py](polyguard-rl/tests/test_agents.py), [tests/test_contextual_bandit.py](polyguard-rl/tests/test_contextual_bandit.py), [tests/test_policy_schema.py](polyguard-rl/tests/test_policy_schema.py), [tests/test_provider_runtime.py](polyguard-rl/tests/test_provider_runtime.py), [tests/test_postsave_inference.py](polyguard-rl/tests/test_postsave_inference.py), [tests/test_checkpoint_integrity.py](polyguard-rl/tests/test_checkpoint_integrity.py) | |
| | API and product tooling | [tests/test_api.py](polyguard-rl/tests/test_api.py), [tests/test_medication_alternatives.py](polyguard-rl/tests/test_medication_alternatives.py), [tests/test_remote_env.py](polyguard-rl/tests/test_remote_env.py) | |
| | Data and evidence | [tests/test_parser.py](polyguard-rl/tests/test_parser.py), [tests/test_dataops_parser.py](polyguard-rl/tests/test_dataops_parser.py), [tests/test_graph_infer.py](polyguard-rl/tests/test_graph_infer.py), [tests/test_submission_evidence.py](polyguard-rl/tests/test_submission_evidence.py) | |
| | Submission, notebook, and HF flow | [tests/test_acceptance_gate.py](polyguard-rl/tests/test_acceptance_gate.py), [tests/test_runner_notebook.py](polyguard-rl/tests/test_runner_notebook.py), [tests/test_hf_training_sweep.py](polyguard-rl/tests/test_hf_training_sweep.py) | |
|
|
| Additional architecture diagrams: |
|
|
| - [System architecture](polyguard-rl/docs/assets/diagrams/system_architecture.png) |
| - [Runtime step flow](polyguard-rl/docs/assets/diagrams/runtime_step_flow.png) |
| - [Data and training pipeline](polyguard-rl/docs/assets/diagrams/data_training_pipeline.png) |
| - [Multi-agent orchestration](polyguard-rl/docs/assets/diagrams/multi_agent_orchestration.png) |
| - [Reward decomposition](polyguard-rl/docs/assets/diagrams/reward_decomposition.png) |
| - [Episode state machine](polyguard-rl/docs/assets/diagrams/episode_state_machine.png) |
| - [Evidence generation flow](polyguard-rl/docs/assets/diagrams/evidence_generation_flow.png) |
| - [Deployment topology](polyguard-rl/docs/assets/diagrams/deployment_topology.png) |
| - [Frontend runtime surface](polyguard-rl/docs/assets/diagrams/frontend_runtime_surface.png) |
|
|
| ## Limitations |
|
|
| PolyGuard is a simulator and research environment. Its current data substrate is |
| compact and intentionally inspectable, not a production clinical knowledge base. |
| The final evidence set is strongest for Qwen 3B because that run has complete |
| SFT, GRPO, post-save GRPO, policy-ablation, adapter, and checkpoint evidence. |
| Qwen 0.5B and 1.5B have SFT reports/histories and post-save SFT evidence, but |
| their adapter directories are marked `reports_only_or_partial` in the final |
| manifest. |
|
|
| The reward model is hand-designed and auditable; that is a feature for this |
| OpenEnv setting, but it also means reward-channel design should be stress-tested |
| as the data grows. The current ablations show that contextual bandits are useful |
| and inspectable, while the `llm+bandit` combined stack needs more tuning to |
| avoid loop-like failure behavior in some settings. |
|
|
| The right conclusion is not "this is a clinical decision system." The right |
| conclusion is that constrained environment feedback, verifier-backed rewards, |
| agentic evaluation, and explicit failure mining are a better substrate for |
| safety-critical medication-policy learning than free-form prompt responses. |
|
|
| ## References |
|
|
| - Alexandre Larouche, Audrey Durand, Richard Khoury, Caroline Sirois. |
| [Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy](https://arxiv.org/abs/2212.05190). |
| arXiv:2212.05190. |
| - World Health Organization. |
| [Medication Without Harm](https://www.who.int/initiatives/medication-without-harm). |
| - CDC. |
| [FastStats: Medication Safety Data](https://www.cdc.gov/medication-safety/data-research/facts-stats/index.html). |
| - Shehab N, Lovegrove MC, Geller AI, et al. |
| [US Emergency Department Visits for Outpatient Adverse Drug Events, 2013-2014](https://jamanetwork.com/journals/jama/fullarticle/2585977). |
| JAMA. 2016;316(20):2115-2125. |
| - AHRQ / NCBI Bookshelf. |
| [Deprescribing To Reduce Medication Harms in Older Adults](https://www.ncbi.nlm.nih.gov/books/NBK600387/). |
| - American Geriatrics Society. |
| [2023 updated AGS Beers Criteria for potentially inappropriate medication use in older adults](https://pmc.ncbi.nlm.nih.gov/articles/PMC12478568/). |
| - O'Mahony et al. |
| [STOPP/START criteria for potentially inappropriate prescribing in older people: version 3](https://pmc.ncbi.nlm.nih.gov/articles/PMC10447584/). |
|
|
| ## License |
|
|
| The project package declares an MIT license in |
| [polyguard-rl/pyproject.toml](polyguard-rl/pyproject.toml). See |
| [polyguard-rl/LICENSE](polyguard-rl/LICENSE) for the license text. |
|
|