Commit 0ea9615 by TheJackBright (parent: 3e0e65c): Write public PolyGuard project blog (adds blog.md).
# PolyGuard OpenEnv: Training Medication-Safety Agents Inside a Verifier-Backed World

Someone does not experience an unsafe medication regimen as "polypharmacy."
They experience it as dizziness after a new sleep medication, bleeding after a
painkiller is added to a blood thinner, confusion from a sedative-opioid
combination, or a preventable emergency visit because five prescribers each saw
one slice of the medication list.

The dangerous part is often not a single drug. It is the combination: the wrong
pair, the wrong dose in the wrong organ-function context, the missing lab, the
duplicated class, the abrupt stop that should have been a taper, or the model
that confidently says "looks fine" because it was never forced to act inside a
safety-checked environment.

That is the problem PolyGuard was built for.

The [CDC medication-safety data page](https://www.cdc.gov/medication-safety/data-research/facts-stats/index.html)
reports that adverse drug events send more than 1.5 million people to emergency
departments in the United States every year, with almost 500,000
hospitalizations. Adults 65 and older account for more than 600,000 of those
visits. A CDC-authored [JAMA surveillance study](https://jamanetwork.com/journals/jama/fullarticle/2585977)
found that older adults made up 34.5 percent of outpatient adverse-drug-event
ED visits and had the highest hospitalization rate, 43.6 percent. Globally, the
[WHO Medication Without Harm challenge](https://www.who.int/initiatives/medication-without-harm)
estimates the cost associated with medication errors at USD 42 billion
annually. AHRQ's deprescribing safety review summarizes estimates that
[45 percent of older adults are exposed to polypharmacy and 58 percent to
potentially inappropriate medications](https://www.ncbi.nlm.nih.gov/books/NBK600387/).

Not every adverse drug event is caused by an incorrect drug combination. But
these numbers describe the harm surface PolyGuard targets: medication decisions
where combination risk, monitoring gaps, frailty, organ function, uncertainty,
and action sequencing all matter at once.

PolyGuard turns that problem into an OpenEnv-compatible reinforcement-learning
environment for polypharmacy safety, medication optimization, deprescribing,
safe substitution, missing-evidence recovery, and precision dosing. A language
model policy observes a constrained patient/regimen state, chooses one legal
candidate action, receives verifier-backed reward, and improves through SFT
plus GRPO-style post-training.

It is not medical software and it is not clinical advice. It is a controlled
research environment for studying how language-model policies can be trained,
audited, and stress-tested on safety-critical medication action selection.

## What To Open First

- GitHub repository:
  [Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK](https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK)
- Live product Space:
  [TheJackBright/polyguard-openenv-workbench](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-workbench)
- One-run Colab/HF notebook:
  [PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb)
- Final evidence index:
  [polyguard-rl/docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md)
- Artifact and traceability guide:
  [polyguard-rl/docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md)
- Final artifact/evidence Space:
  [adithya9903/polyguard-openenv-final-artifacts](https://huggingface.co/spaces/adithya9903/polyguard-openenv-final-artifacts)

The final artifact/evidence Space hosts the Qwen 3B artifact bundle. The Qwen
0.5B and 1.5B runs were trained using a second Hugging Face account, so their
model artifacts could not be hosted in the same final Space. Their report
mirrors are checked into this repo:
[0.5B reports](polyguard-rl/docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-0-5b-instruct)
and
[1.5B reports](polyguard-rl/docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-1-5b-instruct).

## The Research Bet

Medication safety is combinatorial, partially observable, and high stakes. A
useful policy has to do more than generate a plausible answer. It has to notice
drug-drug interaction risk, reason about comorbidities and organ function,
respect taper and monitoring requirements, choose safe substitutions, abstain
or ask for review when uncertainty is high, and expose why it acted.

The machine-learning pressure is just as real. If a medication vocabulary has
500 drugs, the number of possible five-drug combinations is:

```text
C(500, 5) = 255,244,687,600
```
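The count above can be reproduced with the Python standard library:

```python
import math

# Distinct five-drug combinations from a 500-drug vocabulary.
n_combinations = math.comb(500, 5)
print(f"{n_combinations:,}")  # 255,244,687,600
```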

Exhaustive search is not a serious option. The paper that inspired this
project, [Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy](https://arxiv.org/abs/2212.05190),
frames dangerous polypharmacy discovery as a bandit search problem over a huge
combination space. It benchmarks neural bandit search over simulated
polypharmacy datasets with 500 drugs and 100,000 distinct combinations, and
reports detection of up to 72 percent of potentially inappropriate
polypharmacies with 99 percent average precision after 30,000 time steps.

PolyGuard borrows that search instinct, then moves the problem from offline
combination mining into an agentic environment. The policy sees a patient
state, chooses among legal clinical action candidates, and is judged by a
deterministic verifier and reward router rather than by free-form text
preference alone.

The research question is narrow and concrete:

> Can environment-backed feedback make a small open model better at safe
> medication action selection than prompt-only, first-legal, rule-only, or
> single-agent baselines?

The answer in this repository is an inspectable system:

1. A finite-horizon OpenEnv simulation for medication decisions.
2. A constrained action space, so the model chooses candidate actions instead
   of inventing arbitrary clinical instructions.
3. A legality verifier that prevents unsafe state mutation.
4. Thirteen reward components rolled into four primary reward channels.
5. A multi-agent policy stack with supervisor routing, contextual bandit
   reranking, planner selection, critic veto, and explanation logging.
6. SFT for format and clinical-prior warm start.
7. GRPO with environment-backed reward, not an opaque LLM judge.
8. Agentic evaluation with baseline comparison, policy ablations, post-save
   inference, robustness checks, action traces, and failure mining.

![PolyGuard system architecture](polyguard-rl/docs/assets/diagrams/system_architecture.png)

## A Failure Trace That Motivated the Design

In the final matched-seed traces, the failure mode is not abstract. On seeds
`8000` and `8004`, the basic prompt-style proxy repeatedly chose `cand_01`,
the first legal candidate. In those cases, `cand_01` meant `KEEP_REGIMEN` while
a hidden `warfarin_like` + `nsaid_like` interaction remained unresolved. The
verifier recorded `holdout_ddi_not_addressed`.

The full PolyGuard pipeline selected `cand_03`, a safer intervention candidate,
and avoided those failure reasons.

That is the core argument of the project: medication AI should be judged inside
a stateful safety environment, not only by whether its answer sounds clinically
plausible.

Internal evidence:
[basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json)
and
[action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl).

## Safety Contract

PolyGuard does not let a model directly mutate a medication list from free
text. Every decision is candidate-based, verifier-checked, reward-decomposed,
and traced. Illegal actions can be scored, penalized, and logged, but they do
not change patient state.

The repo evidence for this contract is spread across the environment, rules,
and final reports:

| Claim | Repo evidence |
| --- | --- |
| Hard contraindication examples are represented | [app/knowledge/ddi_knowledge.py](polyguard-rl/app/knowledge/ddi_knowledge.py) |
| Safer alternatives are explicit | [app/knowledge/substitution_rules.py](polyguard-rl/app/knowledge/substitution_rules.py) |
| Unsafe substitutions and dose escalations are blocked before state mutation | [app/env/verifier.py](polyguard-rl/app/env/verifier.py) |
| Reward hacking and loop-like behavior are surfaced | [app/env/anti_cheat.py](polyguard-rl/app/env/anti_cheat.py), [docs/reward_design.md](polyguard-rl/docs/reward_design.md) |
| Baseline failure is traceable by seed and candidate | [basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json), [action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl) |
| Final claims are separated from older smoke artifacts | [final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md) |

## Project Map

The implementation lives under [polyguard-rl/](polyguard-rl/).

| Area | Key paths |
| --- | --- |
| OpenEnv runtime | [openenv.yaml](polyguard-rl/openenv.yaml), [app/env/env_core.py](polyguard-rl/app/env/env_core.py), [app/env/fastapi_app.py](polyguard-rl/app/env/fastapi_app.py), [server/app.py](polyguard-rl/server/app.py) |
| Action/state contracts | [app/common/types.py](polyguard-rl/app/common/types.py), [app/common/enums.py](polyguard-rl/app/common/enums.py) |
| Candidate generation and verifier | [app/models/policy/candidate_builder.py](polyguard-rl/app/models/policy/candidate_builder.py), [app/env/verifier.py](polyguard-rl/app/env/verifier.py) |
| Reward and anti-cheat | [app/env/reward_router.py](polyguard-rl/app/env/reward_router.py), [app/env/reward_scaling.py](polyguard-rl/app/env/reward_scaling.py), [app/env/anti_cheat.py](polyguard-rl/app/env/anti_cheat.py), [configs/rewards.yaml](polyguard-rl/configs/rewards.yaml) |
| Multi-agent policy | [app/agents/](polyguard-rl/app/agents/), [docs/agents.md](polyguard-rl/docs/agents.md) |
| Bandits and baselines | [app/models/baselines/contextual_bandit.py](polyguard-rl/app/models/baselines/contextual_bandit.py), [app/models/baselines/contextual_bandit_policy.py](polyguard-rl/app/models/baselines/contextual_bandit_policy.py), [app/models/baselines/](polyguard-rl/app/models/baselines/) |
| Training | [app/training/](polyguard-rl/app/training/), [scripts/train_sft_trl.py](polyguard-rl/scripts/train_sft_trl.py), [scripts/train_grpo_trl.py](polyguard-rl/scripts/train_grpo_trl.py), [docs/training.md](polyguard-rl/docs/training.md) |
| Data | [data/raw/knowledge/drug_knowledge.json](polyguard-rl/data/raw/knowledge/drug_knowledge.json), [data/scenarios/](polyguard-rl/data/scenarios/), [docs/datasets.md](polyguard-rl/docs/datasets.md) |
| Evaluation | [app/evaluation/](polyguard-rl/app/evaluation/), [scripts/evaluate_all.py](polyguard-rl/scripts/evaluate_all.py), [docs/evaluation.md](polyguard-rl/docs/evaluation.md) |
| Product API/UI | [app/api/](polyguard-rl/app/api/), [app/ui/frontend/](polyguard-rl/app/ui/frontend/), [docs/ui.md](polyguard-rl/docs/ui.md) |
| Math | [docs/math.md](polyguard-rl/docs/math.md), [docs/mathematics.md](polyguard-rl/docs/mathematics.md) |
| Results | [docs/results/final_submission_evidence/](polyguard-rl/docs/results/final_submission_evidence/) |

Supporting docs include [architecture.md](polyguard-rl/docs/architecture.md),
[environment_design.md](polyguard-rl/docs/environment_design.md),
[reward_design.md](polyguard-rl/docs/reward_design.md),
[safety.md](polyguard-rl/docs/safety.md),
[precision_dosing.md](polyguard-rl/docs/precision_dosing.md),
[graph_models.md](polyguard-rl/docs/graph_models.md),
[ablations.md](polyguard-rl/docs/ablations.md),
[api.md](polyguard-rl/docs/api.md),
[deployment.md](polyguard-rl/docs/deployment.md),
[ui.md](polyguard-rl/docs/ui.md),
[DEMO_RECORDING_SCRIPT.md](polyguard-rl/docs/DEMO_RECORDING_SCRIPT.md), and
[submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md).

## The OpenEnv Environment

At the center is `PolyGuardEnv`, implemented in
[app/env/env_core.py](polyguard-rl/app/env/env_core.py). It follows the
OpenEnv/Gym shape:

```text
reset(seed, difficulty, sub_environment, scenario_id, patient_id)
  -> PolyGuardObservation

step(PolyGuardAction)
  -> (PolyGuardObservation, reward, done, info)
```

At reset, the environment loads or generates a patient scenario, selects a
difficulty and sub-environment, computes a risk summary, builds candidate
actions, estimates uncertainty, and emits a strict observation. At step time,
the environment parses the action, checks legality, evaluates anti-cheat rules,
mutates state only if the action is safe, computes decomposed reward, appends a
trace, and returns detailed `info` fields such as failure reasons, transition
delta, primary reward channels, invalid-action count, and timeout checks.
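For intuition, the contract can be mimicked with a toy stand-in. This is an
illustrative sketch only: the real `PolyGuardEnv` lives in the repo, and its
observations, rewards, and termination logic are far richer than the
placeholders used here.

```python
from dataclasses import dataclass

# Toy stand-in for the reset/step contract described above. NOT the real
# PolyGuardEnv: candidate lists, reward values, and termination rules are
# invented placeholders that only mirror the interface shape.
@dataclass
class ToyPolyGuardEnv:
    max_steps: int = 4
    step_count: int = 0

    def reset(self, seed=0, difficulty="easy", sub_environment="DDI"):
        self.step_count = 0
        # A real observation also carries patient summary, warnings, budgets, etc.
        return {"candidates": ["cand_01", "cand_02", "cand_03"], "warnings": []}

    def step(self, action):
        self.step_count += 1
        legal = action.get("candidate_id") in ("cand_01", "cand_02", "cand_03")
        reward = 0.7 if legal else 0.001          # placeholder scalar reward
        done = (not legal) or self.step_count >= self.max_steps
        info = {"failure_reasons": [] if legal else ["illegal_candidate"]}
        obs = {"candidates": ["cand_01", "cand_02", "cand_03"], "warnings": []}
        return obs, reward, done, info

env = ToyPolyGuardEnv()
obs = env.reset(seed=8000)
obs, reward, done, info = env.step({"candidate_id": "cand_03"})
```

The point of the shape is that reward and `info` come back from the same call
that advances state, so a trainer can treat each step as a scored transition.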

![Runtime step flow](polyguard-rl/docs/assets/diagrams/runtime_step_flow.png)

PolyGuard is not one task. It cycles through specialized sub-environments:

| Sub-environment | What it stresses |
| --- | --- |
| `DDI` | High-risk drug-drug interaction recognition and resolution |
| `BANDIT_MINING` | Candidate exploration and shortlist/ranking behavior inspired by bandit search |
| `REGIMEN_RISK` | General medication burden and regimen optimization |
| `PRECISION_DOSING` | Dose-hold, dose reduction, renal/hepatic guardrails, monitoring decisions |
| `LONGITUDINAL_DEPRESCRIBING` | Multi-step taper/deprescribing behavior over a longer horizon |
| `WEB_SEARCH_MISSING_DATA` | Evidence fetch or review when critical data is missing |
| `ALTERNATIVE_SUGGESTION` | Safe alternatives and within-class substitution |
| `NEW_DRUG_DECOMPOSITION` | First-pass reasoning over an unknown or combination medication |

The curriculum in [app/env/curriculum.py](polyguard-rl/app/env/curriculum.py)
starts with short easy DDI/regimen-risk episodes, then adds bandit and
alternative-selection tasks, and finally hard cases with precision dosing,
longitudinal deprescribing, missing data, and new-drug decomposition.

### State and Observation

The latent state is represented by `PolyGuardState` and includes patient
demographics, active decision mode, step budget, medications, dose buckets,
comorbidities, labs, vitals, frailty, adherence, monitoring gaps, prior adverse
event history, burden score, severe-pair count, precision dosing flags,
unresolved conflicts, action history, cumulative reward, and done state.

The agent does not get every simulator internal. It receives a controlled
`PolyGuardObservation` with a patient summary, medication table, comorbidity
summary, organ function, labs/vitals, graph safety summary, burden summary,
precision dosing flags, unresolved conflicts, candidate actions, step budget,
action history, warnings, abstention indicators, seed, scenario, difficulty,
and sub-environment.

This split matters. PolyGuard is a partially observable environment. Missing
labs and unresolved conflicts are visible as uncertainty signals, not as hidden
reward traps.

## Action Space and Safety Constraints

PolyGuard deliberately avoids unconstrained text actions. The policy chooses a
strict `PolyGuardAction` with fields such as `mode`, `action_type`,
`target_drug`, `replacement_drug`, `dose_bucket`, `taper_days`,
`monitoring_plan`, `evidence_query`, `new_drug_name`, `candidate_components`,
`candidate_id`, `confidence`, and `rationale_brief`.
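For concreteness, a filled-in action might serialize to something like the
dictionary below. Only the field names come from the schema above; every value,
including the `mode` label, is invented for illustration.

```python
# Illustrative PolyGuardAction payload. Field names follow the schema listed
# above; the values (and the "REGIMEN_OPT" mode label) are hypothetical.
action_payload = {
    "mode": "REGIMEN_OPT",
    "action_type": "SUBSTITUTE_WITHIN_CLASS",
    "target_drug": "nsaid_like",
    "replacement_drug": "acetaminophen_like",
    "dose_bucket": None,
    "taper_days": None,
    "monitoring_plan": "recheck pain control in 2 weeks",
    "evidence_query": None,
    "new_drug_name": None,
    "candidate_components": None,
    "candidate_id": "cand_03",
    "confidence": 0.62,
    "rationale_brief": "Resolve warfarin_like + nsaid_like interaction risk.",
}
```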

The action types are compact:

| Family | Action types |
| --- | --- |
| Regimen | `KEEP_REGIMEN`, `STOP_DRUG`, `SUBSTITUTE_WITHIN_CLASS`, `RECOMMEND_ALTERNATIVE` |
| Dosing | `REDUCE_DOSE_BUCKET`, `INCREASE_DOSE_BUCKET`, `DOSE_HOLD`, `ORDER_MONITORING_AND_WAIT` |
| Deprescribing | `TAPER_INITIATE`, `TAPER_CONTINUE` |
| Evidence and uncertainty | `FETCH_EXTERNAL_EVIDENCE`, `DECOMPOSE_NEW_DRUG`, `REQUEST_SPECIALIST_REVIEW`, `REQUEST_PHARMACIST_REVIEW` |

The candidate builder in
[app/models/policy/candidate_builder.py](polyguard-rl/app/models/policy/candidate_builder.py)
generates a bounded candidate set:

```text
3 <= |C_t| <= 10
```

Each candidate carries estimated safety delta, burden delta, disease stability,
uncertainty score, rationale tags, required monitoring, and a legality
precheck. Policy selection is candidate selection:

```text
a_t = to_action(c_t), where c_t is in C_t
```

The verifier in [app/env/verifier.py](polyguard-rl/app/env/verifier.py)
enforces hard safety constraints before state mutation. It checks that target
drugs exist when required, substitutions and alternatives are allowed, evidence
domains are allowlisted, new-drug decomposition includes required components,
taper-required drugs are not stopped abruptly, renal/hepatic unsafe dose
escalation is blocked, duplicate therapy and contraindicated replacement pairs
are blocked, and monitoring/hold actions include a monitoring plan.

Illegal actions can receive reward penalties and become visible in traces, but
they do not mutate patient state.

## Multi-Agent Policy Stack

The "agents" in PolyGuard are an auditable policy factorization rather than
free-form independent chatbots. A step flows through:

```text
MedRec -> Evidence -> GraphSafety -> Dosing -> Candidate
  -> Supervisor -> Planner -> Critic -> Env -> Explainer
```

![Multi-agent orchestration](polyguard-rl/docs/assets/diagrams/multi_agent_orchestration.png)

| Agent/module | Role |
| --- | --- |
| `MedRecAgent` | Summarizes current regimen and medication burden |
| `EvidenceAgent` | Retrieves local or fallback evidence when missing data is present |
| `GraphSafetyAgent` | Scores risky pairs, side-effect load, duplicate therapy, and graph safety patterns |
| `DosingAgent` | Detects dose-sensitive cases and dose-hold opportunities |
| `CandidateAgent` | Exposes legal candidate actions from the environment candidate builder |
| `SupervisorAgent` | Routes to regimen optimization, dose optimization, or review mode |
| `PlannerAgent` | Selects an action from candidates through the policy provider |
| `CriticAgent` | Vetoes illegal or unsafe proposed actions and can force review fallback |
| `ExplainerAgent` | Records grounded rationale for demo, replay, and audit |

The orchestration modes are `sequential_pipeline`, `supervisor_routed`,
`replan_on_veto`, and `lightweight_debate`. Policy-stack ablations compare
`bandit-only`, `llm-only`, and `llm+bandit`.

## Contextual Bandits

PolyGuard uses contextual bandits as an inspectable candidate-reranking layer.
This is where the project most directly echoes the arXiv bandit inspiration:
unsafe polypharmacy search is combinatorial, so the system should learn which
regions of the candidate/action space are worth exploring rather than enumerate
everything.

Each candidate becomes an 8-dimensional feature vector:

```text
x(c) = [
  1,
  I[legality_precheck],
  estimated_safety_delta,
  burden_delta,
  disease_stability_estimate,
  1 - uncertainty_score,
  I[mode = DOSE_OPT],
  I[mode = REVIEW]
]
```

An arm is keyed by macro mode and action type:

```text
arm(c) = mode(c) || ":" || action_type(c)
```

The LinUCB variant maintains, for each arm `a`:

```text
A_a = I + sum x x^T
b_a = sum r x
theta_a = A_a^{-1} b_a

score_a(x) = theta_a^T x + alpha * sqrt(x^T A_a^{-1} x)
```

There is also a Thompson-style variant:

```text
score_a(x) = theta_a^T x + Normal(0, alpha)
```

This layer can shortlist candidates before the planner emits the final action.
It is deliberately kept inside the candidate space: the bandit can improve
ordering and exploration, but it cannot invent an unsafe action outside the
environment contract.
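The LinUCB bookkeeping above fits in a short NumPy sketch. This is not the
repo's implementation; the `alpha` value and the example feature values are
arbitrary, and only the update/score formulas mirror the text.

```python
import numpy as np

# Minimal LinUCB arm mirroring the formulas above (illustrative, not repo code).
class LinUCBArm:
    def __init__(self, dim=8, alpha=0.5):
        self.A = np.eye(dim)      # A_a = I + sum x x^T
        self.b = np.zeros(dim)    # b_a = sum r x
        self.alpha = alpha

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b    # theta_a = A_a^{-1} b_a
        # Exploitation term plus optimism-under-uncertainty bonus.
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, r):
        self.A += np.outer(x, x)
        self.b += r * x

# One candidate feature vector in the x(c) layout defined above.
x = np.array([1.0, 1.0, 0.3, -0.1, 0.8, 0.6, 0.0, 0.0])
arm = LinUCBArm()
arm.update(x, r=0.7)
ucb = arm.score(x)
```

A reranker would keep one such arm per `mode:action_type` key and sort
candidates by their arm scores before the planner chooses.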

## Reward Model

The reward model is decomposed on purpose. A single scalar reward is needed for
RL, but safety-critical RL needs more than one opaque number. PolyGuard logs 13
component columns and four primary channels on every step.

![Reward decomposition](polyguard-rl/docs/assets/diagrams/reward_decomposition.png)

All reward values are clamped and quantized:

```text
q(x) = round(clip(x, 0.001, 0.999), 3)
```

The 13 reward components are:

| Component | Weight | Meaning |
| --- | ---: | --- |
| `format_compliance_score` | 0.08 | Action payload conforms to the schema |
| `candidate_alignment_score` | 0.08 | The model selected a valid candidate-style id |
| `legality_score` | 0.12 | The verifier accepted the action |
| `safety_delta_score` | 0.15 | Severe-pair and burden risk decreased |
| `burden_improvement_score` | 0.08 | Dose-weighted medication burden improved |
| `disease_stability_score` | 0.10 | The action did not destabilize underlying disease management |
| `dosing_quality_score` | 0.08 | Dose-sensitive routing/action quality |
| `abstention_quality_score` | 0.06 | Review/abstention is appropriate under uncertainty |
| `efficiency_score` | 0.06 | The action uses the finite step budget well |
| `process_fidelity_score` | 0.06 | The action follows task-specific process expectations |
| `explanation_grounding_score` | 0.03 | The rationale is present and grounded |
| `anti_cheat_score` | 0.06 | Reward-hacking checks did not fire |
| `uncertainty_calibration_score` | 0.04 | Confidence matches observable uncertainty |

The scalar reward is a weighted average:

```text
R_env(s_t, a_t, s_{t+1}) = q( sum_i w_i c_i / sum_i w_i )
```

Safety-heavy terms dominate the total weight:

```text
legality + safety_delta + burden + disease_stability + anti_cheat
  = 0.12 + 0.15 + 0.08 + 0.10 + 0.06
  = 0.51
```
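Putting the table and formulas together, a minimal sketch of the scalar reward
(weights copied from the table above; the component values fed in at the end
are illustrative):

```python
# Weights from the 13-component table above; they sum to 1.0.
WEIGHTS = {
    "format_compliance": 0.08, "candidate_alignment": 0.08,
    "legality": 0.12, "safety_delta": 0.15, "burden_improvement": 0.08,
    "disease_stability": 0.10, "dosing_quality": 0.08,
    "abstention_quality": 0.06, "efficiency": 0.06,
    "process_fidelity": 0.06, "explanation_grounding": 0.03,
    "anti_cheat": 0.06, "uncertainty_calibration": 0.04,
}

def q(x: float) -> float:
    """Clamp to [0.001, 0.999], then quantize to 3 decimals."""
    return round(min(max(x, 0.001), 0.999), 3)

def scalar_reward(components: dict) -> float:
    total_w = sum(WEIGHTS.values())
    return q(sum(WEIGHTS[k] * components[k] for k in WEIGHTS) / total_w)

# Illustrative: every component at 0.8 yields a scalar reward of 0.8.
reward = scalar_reward({k: 0.8 for k in WEIGHTS})
```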

The four primary reward channels are:

| Channel | Component family |
| --- | --- |
| `safety_legality` | legality, candidate alignment, anti-cheat, uncertainty calibration |
| `clinical_improvement` | safety delta, burden improvement, disease stability |
| `dosing_quality` | dosing quality and abstention quality |
| `process_integrity` | format compliance, efficiency, process fidelity, explanation grounding |

These channels are emitted in `info.primary_reward_channels`, GRPO reward logs,
reports, plots, and ablation summaries.

## Anti-Cheat and Failure Visibility

RL policies exploit reward functions. PolyGuard makes common shortcut failures
explicit: repeated action loops, excessive keep-regimen behavior, excessive
review/abstention behavior, candidate ID mismatch, candidate outside the legal
set, hidden high-risk DDI no-op behavior, parser exploit patterns in rationales,
and retries of failed no-op actions.

If an exploit is detected:

```text
anti_cheat_score = 0.001
done = true
termination_reason = "exploit_detection"
```

Episodes can also terminate on step budget exhaustion, repeated invalid
actions, safety-veto threshold, patient destabilization, safe resolution,
wall-clock timeout, or per-step timeout.

![Episode state machine](polyguard-rl/docs/assets/diagrams/episode_state_machine.png)

## Mathematics

PolyGuard can be read as a finite-horizon constrained partially observable
Markov decision process:

```text
M = (S, A, O, T, R, H, C)
```

where `S` is latent patient/regimen state, `A` is the constrained medication
action set, `O` is the controlled observation, `T(s' | s, a)` is the transition
function, `R(s, a, s')` is verifier-backed reward, `H` is the episode horizon,
and `C(s, a)` is the hard safety/legality constraint predicate.

The objective is:

```text
maximize_pi E_pi [ sum_{t=0}^{H-1} R(s_t, a_t, s_{t+1}) ]
subject to C(s_t, a_t) = 1 whenever possible
```

There is no explicit discount factor in the runtime. Time preference enters
through finite horizons and the efficiency reward:

```text
efficiency_t = q(1 - step_count_t / (max_steps + 1))
```

State transition is two-gated:

```text
if verifier(s_t, a_t).legal and not anti_cheat(s_t, a_t):
    s_{t+1} = T(s_t, a_t)
else:
    s_{t+1} = rollback_state_with_failed_action_record(s_t, a_t)
```

Risk-like deltas become reward through:

```text
delta_reward(pre, post) = q(0.5 + 0.6 * (pre - post))
```

For burden and contraindicated-pair improvement:

```text
burden_reward = delta_reward(pre_burden, post_burden)
pair_reward = delta_reward(pre_pairs, post_pairs)

safety_delta_score =
  q(0.65 * pair_reward + 0.35 * burden_reward) if legal
  0.001 otherwise
```
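These shaping rules are easy to check numerically. The sketch below re-defines
the `q` helper from the reward section; the pre/post risk values are
illustrative.

```python
def q(x: float) -> float:
    # Clamp to [0.001, 0.999], then quantize to 3 decimals.
    return round(min(max(x, 0.001), 0.999), 3)

def delta_reward(pre: float, post: float) -> float:
    # Improvement (pre > post) pushes reward above the 0.5 midpoint;
    # no change lands exactly at 0.5.
    return q(0.5 + 0.6 * (pre - post))

def safety_delta_score(pre_pairs, post_pairs, pre_burden, post_burden, legal):
    if not legal:
        return 0.001
    pair_reward = delta_reward(pre_pairs, post_pairs)
    burden_reward = delta_reward(pre_burden, post_burden)
    return q(0.65 * pair_reward + 0.35 * burden_reward)

# Illustrative: resolving a severe pair (1 -> 0) while slightly lowering burden.
score = safety_delta_score(1.0, 0.0, 0.6, 0.5, legal=True)
```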

GRPO uses environment execution as the reward function. For each prompt, the
model emits candidate completions; PolyGuard parses the candidate id, resets a
deterministic environment using the recorded seed and scenario fields, executes
one step, and returns reward. The training reward combines environment reward
with a legality bonus:

```text
legal_bonus = 0.95 if action is legal else 0.05

R_GRPO = q(0.80 * R_env + 0.20 * legal_bonus)
```

Conceptually, GRPO forms a within-prompt advantage:

```text
A_i = (R_i - mean_j R_j) / (std_j R_j + epsilon)
```

and optimizes a clipped policy-ratio objective with KL regularization. The
optimizer mechanics are TRL's; PolyGuard's contribution is the verifier-backed
reward function and the controlled action/state environment. The expanded
derivation is in
[polyguard-rl/docs/mathematics.md](polyguard-rl/docs/mathematics.md).
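The combined training reward and group-normalized advantage can be sketched
with the standard library. The four environment rewards below are made up, and
the choice of population standard deviation is an assumption; the formulas
otherwise follow the text.

```python
import statistics

def q(x: float) -> float:
    return round(min(max(x, 0.001), 0.999), 3)

def grpo_reward(r_env: float, legal: bool) -> float:
    # R_GRPO = q(0.80 * R_env + 0.20 * legal_bonus)
    legal_bonus = 0.95 if legal else 0.05
    return q(0.80 * r_env + 0.20 * legal_bonus)

def group_advantages(rewards, epsilon=1e-6):
    # A_i = (R_i - mean) / (std + epsilon), normalized within one prompt group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std over the group (assumption)
    return [(r - mean) / (std + epsilon) for r in rewards]

# Four completions for one prompt: three legal, one illegal.
rewards = [grpo_reward(0.7, True), grpo_reward(0.6, True),
           grpo_reward(0.8, True), grpo_reward(0.2, False)]
advs = group_advantages(rewards)
```

The illegal completion ends up with both the lowest reward and the lowest
advantage, which is exactly the gradient pressure the legality bonus is meant
to create.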

## Data and Dataset Pipeline

The data pipeline builds a compact medication-safety substrate from local drug
knowledge, synthetic patients, scenario files, retrieval text, and optional
external augmentation.

![Data and training pipeline](polyguard-rl/docs/assets/diagrams/data_training_pipeline.png)

The dataset design is documented in
[docs/datasets.md](polyguard-rl/docs/datasets.md). The locally generated
pipeline produces these processed artifacts and counts:

| Artifact | Count | Path |
| --- | ---: | --- |
| Normalized drug rows | 10 | `data/processed/normalized_drugs.parquet` |
| Drug class rows | 10 | `data/processed/drug_classes.parquet` |
| Interaction rows | 2 | `data/processed/interactions.parquet` |
| Graph edges | 18 | `data/processed/graph_edges.parquet` |
| Synthetic patients | 20 | `data/processed/patients_synthetic.parquet` |
| Retrieval documents | 8 | `data/processed/retrieval_corpus.jsonl` |
| Easy scenarios | 100 | [data/scenarios/easy/](polyguard-rl/data/scenarios/easy/) |
| Medium scenarios | 200 | [data/scenarios/medium/](polyguard-rl/data/scenarios/medium/) |
| Hard scenarios | 200 | [data/scenarios/hard/](polyguard-rl/data/scenarios/hard/) |
| Local small SFT rows | 80 | `data/processed/training_corpus_sft.jsonl` |
| Local small GRPO prompts | 80 | `data/processed/training_corpus_grpo_prompts.jsonl` |

The provenance manifest generated by the local pipeline records source policy
and counts at `data/processed/provenance_manifest.json`.
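
The exact manifest schema is not reproduced in this post, so the following is a
hedged sketch of how a reviewer might pull the recorded counts out of
`data/processed/provenance_manifest.json`; the generic walk makes no
assumptions about key names:

```python
import json
from pathlib import Path

def collect_counts(node, prefix=""):
    """Recursively collect integer leaves from a parsed manifest dict."""
    counts = {}
    if isinstance(node, dict):
        for key, value in node.items():
            counts.update(collect_counts(value, f"{prefix}{key}."))
    elif isinstance(node, bool):
        pass  # bool is an int subclass; skip true/false flags
    elif isinstance(node, int):
        counts[prefix.rstrip(".")] = node
    return counts

def load_provenance_counts(path="data/processed/provenance_manifest.json"):
    """Load the manifest from disk and return its integer counts by key path."""
    return collect_counts(json.loads(Path(path).read_text()))
```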

Additional data-governance and rule artifacts are produced or consumed by the
pipeline:

| Artifact | Why it matters |
| --- | --- |
| `data/processed/ingested_sources.json` | Source ingestion ledger used by the local build |
| `data/processed/feature_dictionary.json` | Names and meanings of structured model features |
| `data/processed/burden_rules.yaml` | Medication-burden and duplicate-therapy rules |
| `data/processed/substitution_rules.yaml` | Data-level safer-substitution rules |
| `data/processed/taper_rules.yaml` | Deprescribing and taper requirements |
| `data/retrieval_index/index.json` | Retrieval index over local evidence chunks |

The local knowledge seed is
[data/raw/knowledge/drug_knowledge.json](polyguard-rl/data/raw/knowledge/drug_knowledge.json).
It contains drug classes, example high-risk pairs, renal and hepatic flags,
side-effect tags, substitution rules, and taper requirements. The processed
tables then feed graph modeling, candidate generation, environment scenarios,
retrieval, SFT rows, and GRPO prompts.
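
To make the seed's role concrete, here is a hypothetical miniature of the
high-risk-pair screening that such a seed enables. The two pairs are the ones
named later in this post, but the data structure and function names are
illustrative, not the repo's schema:

```python
from itertools import combinations

# Illustrative miniature of the seed's high-risk pairs; not the actual file.
HIGH_RISK_PAIRS = {
    frozenset({"warfarin_like", "nsaid_like"}),
    frozenset({"benzodiazepine_like", "opioid_like"}),
}

def screen_regimen(drug_classes):
    """Return every known high-risk class pair present in a regimen."""
    return [
        tuple(sorted(pair))
        for pair in (frozenset(p) for p in combinations(set(drug_classes), 2))
        if pair in HIGH_RISK_PAIRS
    ]
```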

The full training/evidence runs used 2,000 examples per Qwen model, recorded in
the final reports under
[docs/results/final_submission_evidence/reports/](polyguard-rl/docs/results/final_submission_evidence/reports/).

## Models Inside the Environment

PolyGuard combines learned and rule-backed components:

- Graph safety model:
  [app/models/graph/](polyguard-rl/app/models/graph/) produces regimen
  embeddings, pairwise DDI severity, severe-alert probability, and side-effect
  tag probabilities.
- Tabular risk model:
  [app/models/tabular/](polyguard-rl/app/models/tabular/) supports calibrated
  patient/regimen risk heads and evaluation.
- Dosing model:
  [app/models/dosing/](polyguard-rl/app/models/dosing/) models dose-sensitive
  states with target attainment, toxicity, underdose risk, organ stress,
  interaction load, and monitoring need.
- Retrieval:
  [app/models/retrieval/](polyguard-rl/app/models/retrieval/) and
  [app/knowledge/](polyguard-rl/app/knowledge/) provide local evidence chunks,
  drug rules, renal/hepatic guardrails, duplicate-therapy rules, substitution
  rules, taper rules, burden scoring, and a side-effect ontology.
- Active model runtime:
  [app/models/policy/active_model.py](polyguard-rl/app/models/policy/active_model.py)
  discovers activated artifacts from `checkpoints/active/active_model_manifest.json`;
  the tracked evidence mirror includes
  [docs/results/active_model_manifest.json](polyguard-rl/docs/results/active_model_manifest.json).
  The provider load order prefers a GRPO adapter, then a merged model, then an
  SFT adapter.
- Provider runtime:
  [app/models/policy/provider_runtime.py](polyguard-rl/app/models/policy/provider_runtime.py)
  is Transformers-first, with optional Ollama when enabled. If model loading is
  unavailable, the runtime falls back to deterministic safety ranking.
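
The load-order preference can be illustrated with a small resolver. The
manifest key names and the function below are assumptions for illustration;
only the preference order and the deterministic fallback are documented
behavior:

```python
# Documented preference: GRPO adapter, then merged model, then SFT adapter.
PREFERENCE = ("grpo_adapter", "merged_model", "sft_adapter")

def resolve_artifact(manifest):
    """Return the first available artifact kind, or the documented fallback."""
    for kind in PREFERENCE:
        path = manifest.get(kind)
        if path:
            return kind, path
    # No loadable model artifact: deterministic safety ranking takes over.
    return "deterministic_safety_ranking", None
```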

Tracked support-model reports show that the environment is not only an LLM
wrapper:

| Component | Report | Current tracked result |
| --- | --- | --- |
| Graph model | [docs/results/graph_train.json](polyguard-rl/docs/results/graph_train.json) | `status: trained`, `num_samples: 180`, artifact path `outputs/models/graph_model.pkl` |
| Tabular risk model | [docs/results/risk_train.json](polyguard-rl/docs/results/risk_train.json) | `status: trained`, `dataset_size: 180`, `train_mae: 0.0033`, artifact path `outputs/models/tabular_risk.pkl` |
| Dose surrogate model | [docs/results/dose_train.json](polyguard-rl/docs/results/dose_train.json) | `status: trained`, `dataset_size: 120`, `train_mae: 0.0025`, artifact path `outputs/models/dose_model.pkl` |

The hard-coded contraindicated seed pairs in
[app/knowledge/ddi_knowledge.py](polyguard-rl/app/knowledge/ddi_knowledge.py)
include `warfarin_like` + `nsaid_like` and `benzodiazepine_like` +
`opioid_like`. Substitution rules in
[app/knowledge/substitution_rules.py](polyguard-rl/app/knowledge/substitution_rules.py)
include safer alternatives such as `nsaid_like -> acetaminophen_like`,
`nsaid_like -> topical_nsaid_like`,
`benzodiazepine_like -> non_benzo_sleep_support`, and
`opioid_like -> non_opioid_analgesic`.
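
The quoted substitution pairs can be read as a class-to-alternatives table. The
sketch below applies one such rule to a regimen; the dictionary form and
function name are illustrative rather than the repo's actual representation:

```python
# Safer-substitution pairs quoted above; the table form is illustrative.
SUBSTITUTIONS = {
    "nsaid_like": ["acetaminophen_like", "topical_nsaid_like"],
    "benzodiazepine_like": ["non_benzo_sleep_support"],
    "opioid_like": ["non_opioid_analgesic"],
}

def propose_substitutions(regimen, flagged_class):
    """Return candidate regimens with the flagged class swapped for safer ones."""
    return [
        [alt if drug == flagged_class else drug for drug in regimen]
        for alt in SUBSTITUTIONS.get(flagged_class, [])
    ]
```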

### Precision Dosing

Precision dosing uses sensitive classes such as anticoagulants, sedatives, and
glucose-lowering drugs. The dosing agent and surrogate model are implemented in
[app/agents/dosing_agent.py](polyguard-rl/app/agents/dosing_agent.py) and
[app/models/dosing/](polyguard-rl/app/models/dosing/).

The surrogate PK/PD transition in
[app/models/dosing/surrogate_pkpd.py](polyguard-rl/app/models/dosing/surrogate_pkpd.py)
uses effect, toxicity, underdose, organ stress, and interaction load:

```text
effective_delta = dose_delta * (1 - min(0.6, organ_factor * 0.4))

effect' =
  clip(effect + 0.28 * effective_delta - 0.05 * interaction_factor, 0, 1)

toxicity_gain =
  max(0, dose_delta) * (0.35 + 0.25 * organ_factor + 0.20 * interaction_factor)

toxicity' =
  clip(0.85 * toxicity + toxicity_gain, 0, 1)

underdose' =
  clip(1 - effect' + 0.15 * max(0, -dose_delta), 0, 1)
```

The higher-level dosing metrics use target attainment, toxicity avoidance,
underdose risk, and monitoring need:

```text
target_attainment = 1 - abs(effect_level - 0.62)
toxicity_proxy = toxicity_level + 0.20 * organ_stress + 0.12 * interaction_load
measurement_need = max(toxicity_proxy, underdose_proxy)
```
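
The two pseudocode blocks above translate directly into Python. This is a
transcription of the published update rules with my own function signatures,
meant for inspection rather than as the repo's implementation:

```python
def clip(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))

def surrogate_step(effect, toxicity, dose_delta, organ_factor, interaction_factor):
    """One surrogate PK/PD transition, following the update rules above."""
    effective_delta = dose_delta * (1 - min(0.6, organ_factor * 0.4))
    effect_next = clip(effect + 0.28 * effective_delta - 0.05 * interaction_factor)
    toxicity_gain = max(0.0, dose_delta) * (
        0.35 + 0.25 * organ_factor + 0.20 * interaction_factor
    )
    toxicity_next = clip(0.85 * toxicity + toxicity_gain)
    underdose_next = clip(1 - effect_next + 0.15 * max(0.0, -dose_delta))
    return effect_next, toxicity_next, underdose_next

def dosing_metrics(effect_level, toxicity_level, organ_stress,
                   interaction_load, underdose_proxy):
    """Higher-level dosing metrics from the second block above."""
    target_attainment = 1 - abs(effect_level - 0.62)
    toxicity_proxy = toxicity_level + 0.20 * organ_stress + 0.12 * interaction_load
    measurement_need = max(toxicity_proxy, underdose_proxy)
    return target_attainment, toxicity_proxy, measurement_need
```

Note the asymmetry: only positive dose deltas add toxicity, while the organ
factor both damps the effective dose change and amplifies the toxicity gain.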

## Training and Post-Training

The training stack is deliberately staged:

1. Build structured data, scenarios, retrieval records, SFT examples, and GRPO
   prompts.
2. Run SFT with TRL to teach the model the candidate-id format and obvious
   clinical priors.
3. Run GRPO with environment-backed reward, where sampled candidate completions
   are executed in PolyGuardEnv and scored by the verifier/reward router.
4. Track sampled generations, reward components, primary reward channels,
   legality, anti-cheat events, and training curves.
5. Run policy-stack ablations and baseline comparisons.
6. Merge or export adapters safely.
7. Validate post-save inference from the saved artifact, not from an in-memory
   training object.
8. Generate reports, charts, action traces, and final artifact manifests.

The relevant training source files are
[scripts/train_sft_trl.py](polyguard-rl/scripts/train_sft_trl.py),
[scripts/train_grpo_trl.py](polyguard-rl/scripts/train_grpo_trl.py),
[app/training/sft_trl.py](polyguard-rl/app/training/sft_trl.py),
[app/training/grpo_trl.py](polyguard-rl/app/training/grpo_trl.py),
[app/training/reward_functions.py](polyguard-rl/app/training/reward_functions.py),
[app/training/openenv_wrapper.py](polyguard-rl/app/training/openenv_wrapper.py),
and [app/hf_space/training_runner.py](polyguard-rl/app/hf_space/training_runner.py).

The one-run notebook is
[polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb).
It is the accessible Colab/HF workflow for building data, running checks,
launching training, pulling reports, generating charts, validating inference,
activating a model, deploying the product Space, and running acceptance checks.

The modular notebook series is:

- [01_data_building.ipynb](polyguard-rl/notebooks/01_data_building.ipynb)
- [02_knowledge_graph.ipynb](polyguard-rl/notebooks/02_knowledge_graph.ipynb)
- [03_risk_models.ipynb](polyguard-rl/notebooks/03_risk_models.ipynb)
- [04_environment_validation.ipynb](polyguard-rl/notebooks/04_environment_validation.ipynb)
- [05_sft_debug.ipynb](polyguard-rl/notebooks/05_sft_debug.ipynb)
- [06_grpo_debug.ipynb](polyguard-rl/notebooks/06_grpo_debug.ipynb)
- [07_policy_analysis.ipynb](polyguard-rl/notebooks/07_policy_analysis.ipynb)
- [08_dosing_analysis.ipynb](polyguard-rl/notebooks/08_dosing_analysis.ipynb)
- [09_training_loop.ipynb](polyguard-rl/notebooks/09_training_loop.ipynb)

For exact local and remote execution details, use
[docs/training.md](polyguard-rl/docs/training.md) and
[docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md).
This blog focuses on architecture, data, evaluation, and evidence rather than
private or environment-specific commands.

## Training Curves and Model Results

The final curated evidence lives in
[polyguard-rl/docs/results/final_submission_evidence/](polyguard-rl/docs/results/final_submission_evidence/).
It replaces earlier smoke-run charts and older 0.5B/1.5B-only views.

### SFT Loss Across Qwen Runs

![SFT loss curves across Qwen runs](polyguard-rl/docs/results/final_submission_evidence/charts/curated/training/sft_loss_curves_all_models.png)

The SFT curves, post-save valid rates, and token-accuracy histories show that
the models learned the candidate-id output contract rather than only producing
unconstrained prose. The visible curves drop from roughly `3.0-3.6` initial
loss to low final loss across all three Qwen sizes.

![Qwen 3B SFT training loss](polyguard-rl/docs/results/final_submission_evidence/charts/curated/training/qwen_3b_sft_training_loss.png)

The tracked per-model summaries are:

| Run | Model | Epochs | Final step | Runtime | Key SFT metrics |
| --- | --- | ---: | ---: | ---: | --- |
| `qwen-qwen2-5-0-5b-instruct` | `Qwen/Qwen2.5-0.5B-Instruct` | 2 | 2,000 | `234.6302s` | loss `3.0856 -> 0.0626`, best `0.0057`, train loss `0.1923`, token accuracy `0.9717`, valid rate `1.0`, avg env reward `0.726`, latency `1.839s` |
| `qwen-qwen2-5-1-5b-instruct` | `Qwen/Qwen2.5-1.5B-Instruct` | 2 | 4,000 | `483.7085s` | loss `2.9686 -> 0.0681`, best `0.0009`, train loss `0.1152`, token accuracy `0.9726`, valid rate `1.0`, avg env reward `0.726`, latency `2.158s` |
| `qwen-qwen2-5-3b-instruct` | `Qwen/Qwen2.5-3B-Instruct` | 2 | 2,000 | `715.2908s` | loss `3.5687 -> 0.054`, best `0.0022`, train loss `0.1569`, token accuracy `0.9750`, SFT avg env reward `0.781`, SFT latency `2.863s` |

Each SFT run used 2,000 examples. The 0.5B and 3B runs recorded 2,001 history
rows including the final trainer summary; the 1.5B run recorded 4,001 history
rows because its batch configuration produced 4,000 final steps.

### GRPO Reward Curve

![Qwen 3B GRPO reward curve](polyguard-rl/docs/results/final_submission_evidence/charts/curated/training/qwen_3b_grpo_reward_curve.png)

![Qwen 3B GRPO training loss](polyguard-rl/docs/results/final_submission_evidence/charts/curated/training/qwen_3b_grpo_loss_curve.png)

The complete GRPO evidence is available for Qwen 3B:

- Backend: `trl_transformers`
- Model: `Qwen/Qwen2.5-3B-Instruct`
- Records: `2000`
- Epochs: `1.0`
- Final step: `2000`
- Runtime: `6873.9375s` (`1.91h`)
- Reward samples: `4000`
- GRPO average reward: `0.767`
- GRPO reward history: min `0.376`, max `0.880`, last `0.812`, average `0.76685`
- GRPO train loss: `0.000002665`
- Post-save GRPO valid rate: `1.0`
- Post-save GRPO average environment reward: `0.726`
- Post-save GRPO average latency: `3.681s`
- Artifact path recorded in the report: `checkpoints/sweeps/qwen-qwen2-5-3b-instruct/grpo_adapter`

Source reports:
[grpo_trl_run.json](polyguard-rl/docs/results/final_submission_evidence/reports/grpo_trl_run.json),
[postsave_inference_grpo.json](polyguard-rl/docs/results/final_submission_evidence/reports/postsave_inference_grpo.json),
and
[submission_summary.json](polyguard-rl/docs/results/final_submission_evidence/reports/submission_summary.json).

### SFT vs GRPO by Model

![SFT vs GRPO verifier reward by model](polyguard-rl/docs/results/final_submission_evidence/charts/curated/model_comparison/sft_vs_grpo_reward_by_model.png)

This chart is intentionally transparent about artifact availability. Qwen 0.5B
and 1.5B have SFT reports/histories and post-save SFT evidence in the repo, but
their adapter directories were not present in the local/final artifact mirrors
at packaging time. Qwen 3B has the complete SFT-plus-GRPO artifact set.

The packaged manifest records Qwen 3B as complete with 125 checkpoint files
(`433,208,536` bytes), 11 SFT adapter files (`30,655,905` bytes), 11 GRPO
adapter files (`30,656,841` bytes), and 9 report files (`5,930,214` bytes).
Qwen 0.5B and 1.5B are retained as report/post-save evidence only.

Manifest:
[docs/results/final_submission_evidence/manifest.json](polyguard-rl/docs/results/final_submission_evidence/manifest.json).

### Product Pipeline vs Basic LLM Proxy

![Basic LLM vs full PolyGuard pipeline](polyguard-rl/docs/results/final_submission_evidence/charts/curated/product_over_basic_llm/basic_llm_vs_full_pipeline_reward.png)

Matched-seed evaluation compares a basic LLM-style first-legal proxy, an
SFT-style safety ranker, and the full PolyGuard orchestrated pipeline. The same
PolyGuard verifier/reward system judges all three.

| Policy | Episodes | Avg reward | Legality rate | Failure/exploit rate | Candidate diversity |
| --- | ---: | ---: | ---: | ---: | ---: |
| Basic LLM proxy | 8 | `0.762` | `1.0` | `0.25` | 1 |
| SFT policy proxy | 8 | `0.818` | `1.0` | `0.0` | 2 |
| Full PolyGuard pipeline | 8 | `0.805` | `1.0` | `0.0` | 2 |

The full pipeline improves average verifier reward over the basic LLM proxy by
`+0.043` while reducing the visible failure/exploit rate from `0.25` to `0.0`.

![Reward delta by matched seed](polyguard-rl/docs/results/final_submission_evidence/charts/curated/product_over_basic_llm/reward_delta_by_seed.png)

Two matched seeds expose the core failure mode: the basic policy repeatedly
kept a regimen despite the hidden `warfarin_like` + `nsaid_like` DDI holdout,
triggering `holdout_ddi_not_addressed`. The full pipeline selected safer dose
or hold candidates and avoided those failure reasons.

Source:
[basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json).

### Reward Components and Channels

![Reward component bars](polyguard-rl/docs/results/final_submission_evidence/charts/curated/reward_and_safety/reward_component_bars.png)

![Primary reward channel bars](polyguard-rl/docs/results/final_submission_evidence/charts/curated/reward_and_safety/primary_reward_channel_bars.png)

The reward charts are as important as the scalar reward curve. They show
whether the model is improving by becoming safer and more process-faithful or
merely exploiting one easy component. The reports log the full 13-component
reward vector and the four primary channels for GRPO and evaluation runs.

For Qwen 3B GRPO, the tracked average primary channels are:

| Channel | Average |
| --- | ---: |
| `safety_legality` | `0.816` |
| `clinical_improvement` | `0.609` |
| `dosing_quality` | `0.543` |
| `process_integrity` | `0.875` |

### Post-Save Inference

![Inference validity and reward](polyguard-rl/docs/results/final_submission_evidence/charts/curated/inference/inference_validity_reward.png)

Post-save inference is separate from training. The exported/activated artifact
is loaded and asked to choose candidate ids on held-out prompt samples. The
Qwen 3B GRPO adapter path produced:

- `model_source: adapter`
- `samples: 5`
- `valid_rate: 1.0`
- `avg_env_reward: 0.726`
- `avg_latency_seconds: 3.681`

The caveat matters: `valid_rate: 1.0` means the output was parseable and
executable as a candidate selection. In the five-sample Qwen 3B post-save GRPO
report, four valid samples still terminated with `exploit_detection`. That is
retained as safety evidence, because PolyGuard's job is to expose suspicious or
loop-like behavior instead of hiding it behind a clean parse metric.

## Agentic Evaluation

Evaluation is not one benchmark number. The evaluation stack under
[app/evaluation/](polyguard-rl/app/evaluation/) includes offline policy
evaluation, safety evaluation, dosing evaluation, robustness under missing labs
and noisy inputs, calibration and abstention evaluation, process fidelity,
subgroup summaries, explainability grounding, baseline comparison, policy
ablations, failure mining, and action traces.

The tracked benchmark report records:

| Metric family | Result |
| --- | --- |
| Offline avg reward | `0.772833` |
| Offline legal rate | `1.0` |
| Severe violation rate | `0.0` |
| Illegal step rate | `0.0` |
| Dosing target attainment | `0.75` |
| Dosing toxicity avoidance | `1.0` |
| Missing-labs safety rate | `0.666667` |
| Noisy-dose, conflicting-meds, alias-noise, hidden-duplicate, wrong-candidate-id, stale-evidence, delayed-ADE safety/resilience | `1.0` |
| Calibration ECE proxy | `0.08625` |
| Process fidelity | `0.92` |
| Explainability grounding | `0.8` |

Source:
[docs/results/benchmark_report.json](polyguard-rl/docs/results/benchmark_report.json).

The improvement gate compares baseline and candidate reports:

| Gate dimension | Delta |
| --- | ---: |
| Average reward | `+0.025833` |
| Legality rate | `0.0` non-regression |
| Success rate | `0.0` non-regression |
| Process fidelity | `+0.92` |
| Timeout rate | `0.0` non-regression |
| Failure visibility | `0.0` non-regression |

Source:
[docs/results/improvement_report.json](polyguard-rl/docs/results/improvement_report.json).

### Policy Ablation Results

| Stack | Avg reward | Legality | Visible failure rate | Exploit detections | Interpretation |
| --- | ---: | ---: | ---: | ---: | --- |
| `bandit_only` | `0.779625` | `1.0` | `0.0625` | 2 | Strong deterministic shortlist behavior with low failure visibility |
| `llm_only` | `0.772391` | `1.0` | `0.3043` | 7 | Legal, but more loop-like failure behavior |
| `llm+bandit` | `0.764739` | `1.0` | `0.3043` | 7 | Current combined stack needs tighter exploration/control in these ablation settings |

![Policy ablation reward](polyguard-rl/docs/results/final_submission_evidence/charts/curated/policy_ablation/policy_ablation_reward.png)

The point of these ablations is not to claim every combined policy is always
better. The point is that PolyGuard can localize behavior: legality remains
high, while failure mining shows whether a stack is looping, over-reviewing,
or selecting non-improving candidates.

Source:
[policy_ablation_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/policy_ablation_report.json).

## OpenEnv and Product Surfaces

The OpenEnv package is compact:

```yaml
spec_version: 1
name: polyguard-openenv
runtime: fastapi
app: app.env.fastapi_app:app
port: 8100
```

The OpenEnv runtime exposes `POST /reset`, `POST /step`, `GET /state`,
`GET /metadata`, `GET /schema`, `POST /mcp`, `GET /health`, `GET /ws`, and
backward-compatible `/env/*` routes.
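
As a sketch of how a client might target these routes, the helpers below
assemble JSON request bodies for `POST /reset` and `POST /step`. The port comes
from the spec above, but the payload field names are assumptions about the
schema, which the environment publishes at `GET /schema`:

```python
import json

BASE_URL = "http://localhost:8100"  # port from the OpenEnv spec above

def build_reset(seed=None):
    """Request tuple for POST /reset; the `seed` field name is an assumption."""
    return f"{BASE_URL}/reset", json.dumps({"seed": seed})

def build_step(candidate_id):
    """Request tuple for POST /step; the payload shape is an assumption."""
    return f"{BASE_URL}/step", json.dumps({"action": {"candidate_id": candidate_id}})
```

Any HTTP client (for example `urllib.request` or `httpx`) can POST these bodies
with a `Content-Type: application/json` header; the real field names should be
taken from the `GET /schema` response.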

The product API in [app/api/routes.py](polyguard-rl/app/api/routes.py) wraps
the environment, orchestrator, policy runtime, evaluation, evidence search,
cases, metrics, and medication-alternative tooling. Product-facing endpoints
include `/env/reset`, `/env/step_candidate`, `/agents/orchestrate`,
`/policy/infer`, `/policy/model_status`, `/eval/run_policy`,
`/metrics/training`, `/evidence/query`, and `/tools/medication_alternatives`.

![Deployment topology](polyguard-rl/docs/assets/diagrams/deployment_topology.png)

## Operations and Deployment

The repository keeps deployment and artifact operations explicit:

| Surface | Files |
| --- | --- |
| Local/container runtime | [Dockerfile](polyguard-rl/Dockerfile), [Dockerfile.space](polyguard-rl/Dockerfile.space), [docker-compose.yml](polyguard-rl/docker-compose.yml), [requirements.txt](polyguard-rl/requirements.txt), [requirements-space.txt](polyguard-rl/requirements-space.txt) |
| Product Space/API deployment | [scripts/deploy_space.sh](polyguard-rl/scripts/deploy_space.sh), [scripts/deploy_space_api.py](polyguard-rl/scripts/deploy_space_api.py), [docs/deployment.md](polyguard-rl/docs/deployment.md) |
| Training and evidence Spaces | [scripts/deploy_training_space.py](polyguard-rl/scripts/deploy_training_space.py), [scripts/monitor_training_space_status.py](polyguard-rl/scripts/monitor_training_space_status.py), [app/hf_space/training_runner.py](polyguard-rl/app/hf_space/training_runner.py), [app/hf_space/evidence_runner.py](polyguard-rl/app/hf_space/evidence_runner.py) |
| Artifact packaging and activation | [scripts/deploy_final_artifact_space.py](polyguard-rl/scripts/deploy_final_artifact_space.py), [scripts/package_active_model_bundle.py](polyguard-rl/scripts/package_active_model_bundle.py), [scripts/install_hf_active_bundle.py](polyguard-rl/scripts/install_hf_active_bundle.py), [docs/results/active_model_manifest.json](polyguard-rl/docs/results/active_model_manifest.json) |
| Submission validation | [scripts/acceptance_gate.py](polyguard-rl/scripts/acceptance_gate.py), [scripts/validate_submission_links.py](polyguard-rl/scripts/validate_submission_links.py), [docs/submission_checklist.md](polyguard-rl/docs/submission_checklist.md), [docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md) |

The important operational distinction is that local smoke artifacts, remote
training-space logs, final artifact packaging, and active-model installation
are separate stages. Final claims are tied to the curated evidence bundle, not
to whichever intermediate output directory happens to exist in a checkout.

## The Workbench UI

The UI is a React 18 + Vite + TypeScript workbench under
[app/ui/frontend/](polyguard-rl/app/ui/frontend/). It is not the environment
itself; it is an operator surface over the API and OpenEnv runtime.

[Live workbench Space](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-workbench)

![Frontend runtime surface](polyguard-rl/docs/assets/diagrams/frontend_runtime_surface.png)

The main views cover the patient workbench, episode replay, policy comparison
and policy lab, precision dosing, training monitor, safety inspector, candidate
actions, reward panel, episode trace, and alternative-medication search through
`/tools/medication_alternatives`.

The Patient Workbench shows the active model chip, current scenario, candidate
set, agent-vs-environment flow, reward breakdown, and action trace without
requiring the reader to inspect raw JSON. The UI is intentionally a workbench,
not a polished clinical application.

### UI Sequence

The five UI screenshots are checked in under `polyguard-rl/docs/UI Images/`.

1. The workbench opens with model truth, live episode context, scenario status,
   candidate count, and reward state.

![PolyGuard workbench overview](polyguard-rl/docs/UI%20Images/1.jpeg)

2. The episode panel makes the patient, task, difficulty, sub-environment, risk
   delta, and candidate-action console visible without reading raw JSON.

![Episode overview and candidate console](polyguard-rl/docs/UI%20Images/2.jpeg)

3. Candidate selection is paired with reward-channel feedback, current
   medications, and blocked/available action visibility.

![Candidate actions and reward channels](polyguard-rl/docs/UI%20Images/3.jpeg)

4. After an action, the workbench exposes history, warnings, decision payload,
   grounded facts, explanation, evidence, and event logs.

![Action history, decision payload, and evidence](polyguard-rl/docs/UI%20Images/4.jpeg)

5. The alternatives tool surfaces medication substitutions from the current
   regimen and links out to source labels.

![Medication alternatives tool](polyguard-rl/docs/UI%20Images/5.jpeg)

## Demo Videos

### [UI Walkthrough Video](https://drive.google.com/file/d/1YOzad5gvx-tSmGzJNuBgokBF4-dX2T2H/view?usp=sharing)

This walkthrough shows the deployed workbench surface, including the live model
chip, episode context, candidate actions, reward panels, and evidence-oriented
patient review flow.

### [Agent In Action: Action Button Demo](https://drive.google.com/file/d/1eHk1v0OYJRrLWVO97ZclN05MYHxmNnmc/view?usp=sharing)

This demo focuses on what the action button does: selecting a candidate,
submitting it through the environment, producing a verifier-scored transition,
and exposing the resulting reward, action history, warnings, and explanation.

### [World Model Tool: Tavily and OpenFDA Alternative Suggestions](https://drive.google.com/file/d/1GaUyyaXaBCHjhHFbpkprojNt5pLNAoYi/view?usp=sharing)

This tool demo shows the world-model support path for alternative-medication
suggestions, using Tavily and the OpenFDA government database to retrieve
candidate alternatives and side-effect evidence for safer review.

## How a Reviewer Should Read the Repository

For a fresh reviewer, the intended path is:

1. Read the artifact index:
   [polyguard-rl/docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md).
2. Inspect the final curated evidence:
   [polyguard-rl/docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md).
3. Open the one-run notebook:
   [PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb).
4. For local smoke work, follow [docs/training.md](polyguard-rl/docs/training.md)
   and the local scripts
   [scripts/run_env_local.sh](polyguard-rl/scripts/run_env_local.sh),
   [scripts/run_api_local.sh](polyguard-rl/scripts/run_api_local.sh), and
   [scripts/run_ui_local.sh](polyguard-rl/scripts/run_ui_local.sh).
5. For full training/reproduction, use the notebook or training docs rather
   than copying private artifact commands out of old drafts.
6. For final public artifacts, use the final artifact Space:
   [adithya9903/polyguard-openenv-final-artifacts](https://huggingface.co/spaces/adithya9903/polyguard-openenv-final-artifacts).

1054
+ ## Evidence and Artifact Inventory
1055
+
1056
+ Important evidence paths:
1057
+
1058
+ - Final overview:
1059
+ [docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md)
1060
+ - Artifact manifest:
1061
+ [docs/results/final_submission_evidence/manifest.json](polyguard-rl/docs/results/final_submission_evidence/manifest.json)
1062
+ - Three-model summary:
1063
+ [docs/results/final_submission_evidence/reports/submission_summary.json](polyguard-rl/docs/results/final_submission_evidence/reports/submission_summary.json)
1064
+ - Qwen 3B GRPO report:
1065
+ [docs/results/final_submission_evidence/reports/grpo_trl_run.json](polyguard-rl/docs/results/final_submission_evidence/reports/grpo_trl_run.json)
1066
+ - Post-save GRPO inference:
1067
+ [docs/results/final_submission_evidence/reports/postsave_inference_grpo.json](polyguard-rl/docs/results/final_submission_evidence/reports/postsave_inference_grpo.json)
1068
+ - Basic LLM vs PolyGuard:
1069
+ [docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json)
1070
+ - Policy ablation:
1071
+ [docs/results/final_submission_evidence/reports/policy_ablation_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/policy_ablation_report.json)
1072
+ - Action traces:
1073
+ [docs/results/final_submission_evidence/reports/action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl)
1074
+ - Curated charts:
1075
+ [docs/results/final_submission_evidence/charts/curated/README.md](polyguard-rl/docs/results/final_submission_evidence/charts/curated/README.md)

Important tests:

| Category | Tests |
| --- | --- |
| Environment contract | [tests/test_openenv_contract.py](polyguard-rl/tests/test_openenv_contract.py), [tests/test_env_reset.py](polyguard-rl/tests/test_env_reset.py), [tests/test_env_step.py](polyguard-rl/tests/test_env_step.py), [tests/test_env_step_flow.py](polyguard-rl/tests/test_env_step_flow.py), [tests/test_future_subenvs.py](polyguard-rl/tests/test_future_subenvs.py) |
| Reward and safety | [tests/test_reward_functions.py](polyguard-rl/tests/test_reward_functions.py), [tests/test_reward_range.py](polyguard-rl/tests/test_reward_range.py), [tests/test_reward_channels.py](polyguard-rl/tests/test_reward_channels.py), [tests/test_anti_cheat.py](polyguard-rl/tests/test_anti_cheat.py), [tests/test_constraints.py](polyguard-rl/tests/test_constraints.py), [tests/test_timeout_logic.py](polyguard-rl/tests/test_timeout_logic.py) |
| Policy and runtime | [tests/test_agents.py](polyguard-rl/tests/test_agents.py), [tests/test_contextual_bandit.py](polyguard-rl/tests/test_contextual_bandit.py), [tests/test_policy_schema.py](polyguard-rl/tests/test_policy_schema.py), [tests/test_provider_runtime.py](polyguard-rl/tests/test_provider_runtime.py), [tests/test_postsave_inference.py](polyguard-rl/tests/test_postsave_inference.py), [tests/test_checkpoint_integrity.py](polyguard-rl/tests/test_checkpoint_integrity.py) |
| API and product tooling | [tests/test_api.py](polyguard-rl/tests/test_api.py), [tests/test_medication_alternatives.py](polyguard-rl/tests/test_medication_alternatives.py), [tests/test_remote_env.py](polyguard-rl/tests/test_remote_env.py) |
| Data and evidence | [tests/test_parser.py](polyguard-rl/tests/test_parser.py), [tests/test_dataops_parser.py](polyguard-rl/tests/test_dataops_parser.py), [tests/test_graph_infer.py](polyguard-rl/tests/test_graph_infer.py), [tests/test_submission_evidence.py](polyguard-rl/tests/test_submission_evidence.py) |
| Submission, notebook, and HF flow | [tests/test_acceptance_gate.py](polyguard-rl/tests/test_acceptance_gate.py), [tests/test_runner_notebook.py](polyguard-rl/tests/test_runner_notebook.py), [tests/test_hf_training_sweep.py](polyguard-rl/tests/test_hf_training_sweep.py) |
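
To give a flavor of what the reward-range category enforces, here is a minimal, self-contained property-style check. The `step_reward` helper and the [-1, 1] bound are illustrative assumptions for this sketch, not PolyGuard's actual reward implementation:

```python
def clamp(value, low=-1.0, high=1.0):
    """Clamp a raw reward into the declared range."""
    return max(low, min(high, value))

def step_reward(safety_score, constraint_penalty):
    """Illustrative composite reward: safety gain minus constraint penalty."""
    return clamp(safety_score - constraint_penalty)

# Property-style check: the reward stays in range over a grid of inputs,
# including extreme values that would otherwise escape the bound.
for s in (-5.0, -1.0, 0.0, 0.5, 1.0, 5.0):
    for p in (0.0, 0.3, 2.0):
        r = step_reward(s, p)
        assert -1.0 <= r <= 1.0, (s, p, r)
```

Real range tests in the suite presumably sweep the actual reward channels the same way; the point is that the bound is asserted, not assumed.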

Additional architecture diagrams:

- [System architecture](polyguard-rl/docs/assets/diagrams/system_architecture.png)
- [Runtime step flow](polyguard-rl/docs/assets/diagrams/runtime_step_flow.png)
- [Data and training pipeline](polyguard-rl/docs/assets/diagrams/data_training_pipeline.png)
- [Multi-agent orchestration](polyguard-rl/docs/assets/diagrams/multi_agent_orchestration.png)
- [Reward decomposition](polyguard-rl/docs/assets/diagrams/reward_decomposition.png)
- [Episode state machine](polyguard-rl/docs/assets/diagrams/episode_state_machine.png)
- [Evidence generation flow](polyguard-rl/docs/assets/diagrams/evidence_generation_flow.png)
- [Deployment topology](polyguard-rl/docs/assets/diagrams/deployment_topology.png)
- [Frontend runtime surface](polyguard-rl/docs/assets/diagrams/frontend_runtime_surface.png)

## Limitations

PolyGuard is a simulator and research environment. Its current data substrate
is compact and intentionally inspectable, not a production clinical knowledge
base. The final evidence set is strongest for Qwen 3B because that run has
complete SFT, GRPO, post-save GRPO, policy-ablation, adapter, and checkpoint
evidence. Qwen 0.5B and 1.5B have SFT reports/histories and post-save SFT
evidence, but their adapter directories are marked `reports_only_or_partial` in
the final manifest.

The reward model is hand-designed and auditable. That is a feature for this
OpenEnv setting, but it also means reward-channel design should be
stress-tested as the data grows. The current ablations show that contextual
bandits are useful and inspectable, while the `llm+bandit` combined stack needs
more tuning to avoid loop-like failure behavior in some settings.
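
For readers unfamiliar with the technique, a contextual bandit in this setting keeps a per-context value estimate for each action and trades exploration off against exploitation. The sketch below is a generic epsilon-greedy implementation under assumed names (`EpsilonGreedyBandit`, string contexts and arms); it illustrates why such policies are inspectable, and it is not PolyGuard's actual bandit code:

```python
import random

class EpsilonGreedyBandit:
    """Minimal contextual bandit: one running-mean value per (context, arm)."""

    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.values = {}   # (context, arm) -> running mean reward
        self.counts = {}   # (context, arm) -> number of pulls

    def select(self, context, rng=random):
        # Explore with probability epsilon, otherwise pick the greedy arm.
        if rng.random() < self.epsilon:
            return rng.choice(self.arms)
        return max(self.arms, key=lambda a: self.values.get((context, a), 0.0))

    def update(self, context, arm, reward):
        # Incremental running-mean update; every estimate stays inspectable.
        key = (context, arm)
        n = self.counts.get(key, 0) + 1
        self.counts[key] = n
        mean = self.values.get(key, 0.0)
        self.values[key] = mean + (reward - mean) / n

# Illustrative usage with hypothetical context and arm names.
bandit = EpsilonGreedyBandit(["keep_regimen", "flag_interaction"], epsilon=0.0)
bandit.update("elderly+warfarin", "flag_interaction", 1.0)
print(bandit.select("elderly+warfarin"))  # flag_interaction
```

Because the policy is just a table of `(context, arm)` means, a failure can be traced to a specific estimate, which is exactly the kind of auditability the ablations favor.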

The right conclusion is not "this is a clinical decision system." The right
conclusion is that constrained environment feedback, verifier-backed rewards,
agentic evaluation, and explicit failure mining are a better substrate for
safety-critical medication-policy learning than free-form prompt responses.
1120
+
1121
+ ## References
1122
+
1123
+ - Alexandre Larouche, Audrey Durand, Richard Khoury, Caroline Sirois.
1124
+ [Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy](https://arxiv.org/abs/2212.05190).
1125
+ arXiv:2212.05190.
1126
+ - World Health Organization.
1127
+ [Medication Without Harm](https://www.who.int/initiatives/medication-without-harm).
1128
+ - CDC.
1129
+ [FastStats: Medication Safety Data](https://www.cdc.gov/medication-safety/data-research/facts-stats/index.html).
1130
+ - Shehab N, Lovegrove MC, Geller AI, Rose KO, Weidle NJ, Budnitz DS.
1131
+ [US Emergency Department Visits for Outpatient Adverse Drug Events, 2013-2014](https://jamanetwork.com/journals/jama/fullarticle/2585977).
1132
+ JAMA. 2016;316(20):2115-2125.
1133
+ - AHRQ / NCBI Bookshelf.
1134
+ [Deprescribing To Reduce Medication Harms in Older Adults](https://www.ncbi.nlm.nih.gov/books/NBK600387/).
1135
+ - American Geriatrics Society.
1136
+ [2023 updated AGS Beers Criteria for potentially inappropriate medication use in older adults](https://pmc.ncbi.nlm.nih.gov/articles/PMC12478568/).
1137
+ - O'Mahony et al.
1138
+ [STOPP/START criteria for potentially inappropriate prescribing in older people: version 3](https://pmc.ncbi.nlm.nih.gov/articles/PMC10447584/).
1139
+

## License

The project package declares an MIT license in
[polyguard-rl/pyproject.toml](polyguard-rl/pyproject.toml). See
[polyguard-rl/LICENSE](polyguard-rl/LICENSE) for the license text.