Commit 0ea9615 (parent 3e0e65c), blog.md: Write public PolyGuard project blog

# PolyGuard OpenEnv: Training Medication-Safety Agents Inside a Verifier-Backed World

Someone does not experience an unsafe medication regimen as "polypharmacy."
They experience it as dizziness after a new sleep medication, bleeding after a
painkiller is added to a blood thinner, confusion from a sedative-opioid
combination, or a preventable emergency visit because five prescribers each saw
one slice of the medication list.

The dangerous part is often not a single drug. It is the combination: the wrong
pair, the wrong dose in the wrong organ-function context, the missing lab, the
duplicated class, the abrupt stop that should have been a taper, or the model
that confidently says "looks fine" because it was never forced to act inside a
safety-checked environment.

That is the problem PolyGuard was built for.

The [CDC medication-safety data page](https://www.cdc.gov/medication-safety/data-research/facts-stats/index.html)
reports that adverse drug events send more than 1.5 million people to emergency
departments in the United States every year, with almost 500,000
hospitalizations. Adults 65 and older account for more than 600,000 of those
visits. A CDC-authored [JAMA surveillance study](https://jamanetwork.com/journals/jama/fullarticle/2585977)
found that older adults made up 34.5 percent of outpatient adverse-drug-event
ED visits and had the highest hospitalization rate, 43.6 percent. Globally, the
[WHO Medication Without Harm challenge](https://www.who.int/initiatives/medication-without-harm)
estimates the cost associated with medication errors at USD 42 billion
annually. AHRQ's deprescribing safety review summarizes estimates that
[45 percent of older adults are exposed to polypharmacy and 58 percent to
potentially inappropriate medications](https://www.ncbi.nlm.nih.gov/books/NBK600387/).

Not every adverse drug event is caused by an incorrect drug combination. But
these numbers describe the harm surface PolyGuard targets: medication decisions
where combination risk, monitoring gaps, frailty, organ function, uncertainty,
and action sequencing all matter at once.

PolyGuard turns that problem into an OpenEnv-compatible reinforcement-learning
environment for polypharmacy safety, medication optimization, deprescribing,
safe substitution, missing-evidence recovery, and precision dosing. A language
model policy observes a constrained patient/regimen state, chooses one legal
candidate action, receives verifier-backed reward, and improves through SFT
plus GRPO-style post-training.

It is not medical software and it is not clinical advice. It is a controlled
research environment for studying how language-model policies can be trained,
audited, and stress-tested on safety-critical medication action selection.

## What To Open First

- GitHub repository:
  [Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK](https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK)
- Live product Space:
  [TheJackBright/polyguard-openenv-workbench](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-workbench)
- One-run Colab/HF notebook:
  [PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb)
- Final evidence index:
  [polyguard-rl/docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md)
- Artifact and traceability guide:
  [polyguard-rl/docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md)
- Final artifact/evidence Space:
  [adithya9903/polyguard-openenv-final-artifacts](https://huggingface.co/spaces/adithya9903/polyguard-openenv-final-artifacts)

The final artifact/evidence Space hosts the Qwen 3B artifact bundle. The Qwen
0.5B and 1.5B runs were trained using a second Hugging Face account, so their
model artifacts could not be hosted in the same final Space. Their report
mirrors are checked into this repo:
[0.5B reports](polyguard-rl/docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-0-5b-instruct)
and
[1.5B reports](polyguard-rl/docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-1-5b-instruct).

## The Research Bet

Medication safety is combinatorial, partially observable, and high stakes. A
useful policy has to do more than generate a plausible answer. It has to notice
drug-drug interaction risk, reason about comorbidities and organ function,
respect taper and monitoring requirements, choose safe substitutions, abstain
or ask for review when uncertainty is high, and expose why it acted.

The machine-learning pressure is just as real. If a medication vocabulary has
500 drugs, the number of possible five-drug combinations is:

```text
C(500, 5) = 255,244,687,600
```
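That count is easy to verify with Python's standard library:

```python
from math import comb

# Number of distinct five-drug combinations from a 500-drug vocabulary,
# matching the figure above.
n_combos = comb(500, 5)
print(n_combos)  # 255244687600
```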

Exhaustive search is not a serious option. The paper that inspired this
project, [Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy](https://arxiv.org/abs/2212.05190),
frames dangerous polypharmacy discovery as a bandit search problem over a huge
combination space. It benchmarks neural bandit search over simulated
polypharmacy datasets with 500 drugs and 100,000 distinct combinations, and
reports detection of up to 72 percent of potentially inappropriate
polypharmacies with 99 percent average precision after 30,000 time steps.

PolyGuard borrows that search instinct, then moves the problem from offline
combination mining into an agentic environment. The policy sees a patient
state, chooses among legal clinical action candidates, and is judged by a
deterministic verifier and reward router rather than by free-form text
preference alone.

The research question is narrow and concrete:

Can environment-backed feedback make a small open model better at safe
medication action selection than prompt-only, first-legal, rule-only, or
single-agent baselines?

The answer in this repository is an inspectable system:

1. A finite-horizon OpenEnv simulation for medication decisions.
2. A constrained action space, so the model chooses candidate actions instead
   of inventing arbitrary clinical instructions.
3. A legality verifier that prevents unsafe state mutation.
4. Thirteen reward components rolled into four primary reward channels.
5. A multi-agent policy stack with supervisor routing, contextual bandit
   reranking, planner selection, critic veto, and explanation logging.
6. SFT for format and clinical-prior warm start.
7. GRPO with environment-backed reward, not an opaque LLM judge.
8. Agentic evaluation with baseline comparison, policy ablations, post-save
   inference, robustness checks, action traces, and failure mining.

## A Failure Trace That Motivated the Design

In the final matched-seed traces, the failure mode is not abstract. On seeds
`8000` and `8004`, the basic prompt-style proxy repeatedly chose `cand_01`,
the first legal candidate. In those cases, `cand_01` meant `KEEP_REGIMEN` while
a hidden `warfarin_like` + `nsaid_like` interaction remained unresolved. The
verifier recorded `holdout_ddi_not_addressed`.

The full PolyGuard pipeline selected `cand_03`, a safer intervention candidate,
and avoided those failure reasons.

That is the core argument of the project: medication AI should be judged inside
a stateful safety environment, not only by whether its answer sounds clinically
plausible.

Internal evidence:
[basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json)
and
[action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl).

## Safety Contract

PolyGuard does not let a model directly mutate a medication list from free
text. Every decision is candidate-based, verifier-checked, reward-decomposed,
and traced. Illegal actions can be scored, penalized, and logged, but they do
not change patient state.

The repo evidence for this contract is spread across the environment, rules,
and final reports:

| Claim | Repo evidence |
| --- | --- |
| Hard contraindication examples are represented | [app/knowledge/ddi_knowledge.py](polyguard-rl/app/knowledge/ddi_knowledge.py) |
| Safer alternatives are explicit | [app/knowledge/substitution_rules.py](polyguard-rl/app/knowledge/substitution_rules.py) |
| Unsafe substitutions and dose escalations are blocked before state mutation | [app/env/verifier.py](polyguard-rl/app/env/verifier.py) |
| Reward hacking and loop-like behavior are surfaced | [app/env/anti_cheat.py](polyguard-rl/app/env/anti_cheat.py), [docs/reward_design.md](polyguard-rl/docs/reward_design.md) |
| Baseline failure is traceable by seed and candidate | [basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json), [action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl) |
| Final claims are separated from older smoke artifacts | [final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md) |

## Project Map

The implementation lives under [polyguard-rl/](polyguard-rl/).

| Area | Key paths |
| --- | --- |
| OpenEnv runtime | [openenv.yaml](polyguard-rl/openenv.yaml), [app/env/env_core.py](polyguard-rl/app/env/env_core.py), [app/env/fastapi_app.py](polyguard-rl/app/env/fastapi_app.py), [server/app.py](polyguard-rl/server/app.py) |
| Action/state contracts | [app/common/types.py](polyguard-rl/app/common/types.py), [app/common/enums.py](polyguard-rl/app/common/enums.py) |
| Candidate generation and verifier | [app/models/policy/candidate_builder.py](polyguard-rl/app/models/policy/candidate_builder.py), [app/env/verifier.py](polyguard-rl/app/env/verifier.py) |
| Reward and anti-cheat | [app/env/reward_router.py](polyguard-rl/app/env/reward_router.py), [app/env/reward_scaling.py](polyguard-rl/app/env/reward_scaling.py), [app/env/anti_cheat.py](polyguard-rl/app/env/anti_cheat.py), [configs/rewards.yaml](polyguard-rl/configs/rewards.yaml) |
| Multi-agent policy | [app/agents/](polyguard-rl/app/agents/), [docs/agents.md](polyguard-rl/docs/agents.md) |
| Bandits and baselines | [app/models/baselines/contextual_bandit.py](polyguard-rl/app/models/baselines/contextual_bandit.py), [app/models/baselines/contextual_bandit_policy.py](polyguard-rl/app/models/baselines/contextual_bandit_policy.py), [app/models/baselines/](polyguard-rl/app/models/baselines/) |
| Training | [app/training/](polyguard-rl/app/training/), [scripts/train_sft_trl.py](polyguard-rl/scripts/train_sft_trl.py), [scripts/train_grpo_trl.py](polyguard-rl/scripts/train_grpo_trl.py), [docs/training.md](polyguard-rl/docs/training.md) |
| Data | [data/raw/knowledge/drug_knowledge.json](polyguard-rl/data/raw/knowledge/drug_knowledge.json), [data/scenarios/](polyguard-rl/data/scenarios/), [docs/datasets.md](polyguard-rl/docs/datasets.md) |
| Evaluation | [app/evaluation/](polyguard-rl/app/evaluation/), [scripts/evaluate_all.py](polyguard-rl/scripts/evaluate_all.py), [docs/evaluation.md](polyguard-rl/docs/evaluation.md) |
| Product API/UI | [app/api/](polyguard-rl/app/api/), [app/ui/frontend/](polyguard-rl/app/ui/frontend/), [docs/ui.md](polyguard-rl/docs/ui.md) |
| Math | [docs/math.md](polyguard-rl/docs/math.md), [docs/mathematics.md](polyguard-rl/docs/mathematics.md) |
| Results | [docs/results/final_submission_evidence/](polyguard-rl/docs/results/final_submission_evidence/) |

Supporting docs include [architecture.md](polyguard-rl/docs/architecture.md),
[environment_design.md](polyguard-rl/docs/environment_design.md),
[reward_design.md](polyguard-rl/docs/reward_design.md),
[safety.md](polyguard-rl/docs/safety.md),
[precision_dosing.md](polyguard-rl/docs/precision_dosing.md),
[graph_models.md](polyguard-rl/docs/graph_models.md),
[ablations.md](polyguard-rl/docs/ablations.md),
[api.md](polyguard-rl/docs/api.md),
[deployment.md](polyguard-rl/docs/deployment.md),
[ui.md](polyguard-rl/docs/ui.md),
[DEMO_RECORDING_SCRIPT.md](polyguard-rl/docs/DEMO_RECORDING_SCRIPT.md), and
[submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md).

## The OpenEnv Environment

At the center is `PolyGuardEnv`, implemented in
[app/env/env_core.py](polyguard-rl/app/env/env_core.py). It follows the
OpenEnv/Gym shape:

```text
reset(seed, difficulty, sub_environment, scenario_id, patient_id)
  -> PolyGuardObservation

step(PolyGuardAction)
  -> (PolyGuardObservation, reward, done, info)
```
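To make that shape concrete, here is a toy stand-in that follows the same reset/step contract. Everything in it, including the candidate ids, rewards, and the three-step budget, is invented for illustration; the real environment lives in `app/env/env_core.py`.

```python
from dataclasses import dataclass

# Toy stand-in for PolyGuardEnv that only illustrates the reset/step contract
# above. Candidate ids, rewards, and the step budget are made up.
@dataclass
class ToyPolyGuardEnv:
    step_budget: int = 3
    _t: int = 0

    def reset(self, seed=0, difficulty="easy", sub_environment="DDI"):
        self._t = 0
        # The observation exposes legal candidate actions, as in the real env.
        return {"candidates": ["cand_01", "cand_02", "cand_03"], "step": 0}

    def step(self, action):
        self._t += 1
        legal = action in ("cand_01", "cand_02", "cand_03")
        reward = 0.7 if legal else 0.001     # illegal actions are penalized...
        done = self._t >= self.step_budget   # ...and the horizon is finite
        obs = {"candidates": ["cand_01", "cand_02", "cand_03"], "step": self._t}
        return obs, reward, done, {"legal": legal}

env = ToyPolyGuardEnv()
obs = env.reset(seed=8000)
total, done, info = 0.0, False, {}
while not done:
    obs, reward, done, info = env.step(obs["candidates"][-1])
    total += reward
print(round(total, 2))  # 2.1
```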

At reset, the environment loads or generates a patient scenario, selects a
difficulty and sub-environment, computes a risk summary, builds candidate
actions, estimates uncertainty, and emits a strict observation. At step time,
the environment parses the action, checks legality, evaluates anti-cheat rules,
mutates state only if the action is safe, computes decomposed reward, appends a
trace, and returns detailed `info` fields such as failure reasons, transition
delta, primary reward channels, invalid-action count, and timeout checks.

PolyGuard is not one task. It cycles through specialized sub-environments:

| Sub-environment | What it stresses |
| --- | --- |
| `DDI` | High-risk drug-drug interaction recognition and resolution |
| `BANDIT_MINING` | Candidate exploration and shortlist/ranking behavior inspired by bandit search |
| `REGIMEN_RISK` | General medication burden and regimen optimization |
| `PRECISION_DOSING` | Dose-hold, dose reduction, renal/hepatic guardrails, monitoring decisions |
| `LONGITUDINAL_DEPRESCRIBING` | Multi-step taper/deprescribing behavior over a longer horizon |
| `WEB_SEARCH_MISSING_DATA` | Evidence fetch or review when critical data is missing |
| `ALTERNATIVE_SUGGESTION` | Safe alternatives and within-class substitution |
| `NEW_DRUG_DECOMPOSITION` | First-pass reasoning over an unknown or combination medication |

The curriculum in [app/env/curriculum.py](polyguard-rl/app/env/curriculum.py)
starts with short easy DDI/regimen-risk episodes, then adds bandit and
alternative-selection tasks, and finally hard cases with precision dosing,
longitudinal deprescribing, missing data, and new-drug decomposition.

### State and Observation

The latent state is represented by `PolyGuardState` and includes patient
demographics, active decision mode, step budget, medications, dose buckets,
comorbidities, labs, vitals, frailty, adherence, monitoring gaps, prior adverse
event history, burden score, severe-pair count, precision dosing flags,
unresolved conflicts, action history, cumulative reward, and done state.

The agent does not get every simulator internal. It receives a controlled
`PolyGuardObservation` with a patient summary, medication table, comorbidity
summary, organ function, labs/vitals, graph safety summary, burden summary,
precision dosing flags, unresolved conflicts, candidate actions, step budget,
action history, warnings, abstention indicators, seed, scenario, difficulty,
and sub-environment.

This split matters. PolyGuard is a partially observable environment. Missing
labs and unresolved conflicts are visible as uncertainty signals, not as hidden
reward traps.

## Action Space and Safety Constraints

PolyGuard deliberately avoids unconstrained text actions. The policy chooses a
strict `PolyGuardAction` with fields such as `mode`, `action_type`,
`target_drug`, `replacement_drug`, `dose_bucket`, `taper_days`,
`monitoring_plan`, `evidence_query`, `new_drug_name`, `candidate_components`,
`candidate_id`, `confidence`, and `rationale_brief`.

The action types are compact:

| Family | Action types |
| --- | --- |
| Regimen | `KEEP_REGIMEN`, `STOP_DRUG`, `SUBSTITUTE_WITHIN_CLASS`, `RECOMMEND_ALTERNATIVE` |
| Dosing | `REDUCE_DOSE_BUCKET`, `INCREASE_DOSE_BUCKET`, `DOSE_HOLD`, `ORDER_MONITORING_AND_WAIT` |
| Deprescribing | `TAPER_INITIATE`, `TAPER_CONTINUE` |
| Evidence and uncertainty | `FETCH_EXTERNAL_EVIDENCE`, `DECOMPOSE_NEW_DRUG`, `REQUEST_SPECIALIST_REVIEW`, `REQUEST_PHARMACIST_REVIEW` |

The candidate builder in
[app/models/policy/candidate_builder.py](polyguard-rl/app/models/policy/candidate_builder.py)
generates a bounded candidate set:

```text
3 <= |C_t| <= 10
```

Each candidate carries estimated safety delta, burden delta, disease stability,
uncertainty score, rationale tags, required monitoring, and a legality
precheck. Policy selection is candidate selection:

```text
a_t = to_action(c_t), where c_t is in C_t
```
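A minimal sketch of that selection step, assuming dict-shaped candidates. The field names mirror the contract above, but the selection rule (prefer legal candidates with the best estimated safety delta) is a simplified stand-in for the real planner:

```python
# Two illustrative candidates; in the environment these come from the
# candidate builder, with the legality precheck already attached.
candidates = [
    {"candidate_id": "cand_01", "action_type": "KEEP_REGIMEN",
     "legality_precheck": True, "estimated_safety_delta": 0.0},
    {"candidate_id": "cand_03", "action_type": "SUBSTITUTE_WITHIN_CLASS",
     "legality_precheck": True, "estimated_safety_delta": 0.4},
]

def to_action(c):
    # The emitted action is a strict payload, not free text.
    return {"candidate_id": c["candidate_id"],
            "action_type": c["action_type"],
            "confidence": 0.8,
            "rationale_brief": "selected by estimated safety delta"}

best = max((c for c in candidates if c["legality_precheck"]),
           key=lambda c: c["estimated_safety_delta"])
action = to_action(best)
print(action["candidate_id"])  # cand_03
```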

The verifier in [app/env/verifier.py](polyguard-rl/app/env/verifier.py)
enforces hard safety constraints before state mutation. It checks that target
drugs exist when required, substitutions and alternatives are allowed, evidence
domains are allowlisted, new-drug decomposition includes required components,
taper-required drugs are not stopped abruptly, renal/hepatic unsafe dose
escalation is blocked, duplicate therapy and contraindicated replacement pairs
are blocked, and monitoring/hold actions include a monitoring plan.

Illegal actions can receive reward penalties and become visible in traces, but
they do not mutate patient state.
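Two of those checks can be sketched as pure predicates over the action payload. The drug-class set and reason strings below are illustrative stand-ins, not the repository's actual rule tables:

```python
# Toy verifier: hard checks run before any state mutation. Only two of the
# rules listed above are shown (no abrupt stop of taper-required drugs;
# hold/monitoring actions need a monitoring plan).
TAPER_REQUIRED = {"benzo_like", "opioid_like"}  # illustrative class names

def verify(action):
    reasons = []
    if (action["action_type"] == "STOP_DRUG"
            and action.get("target_drug") in TAPER_REQUIRED):
        reasons.append("abrupt_stop_requires_taper")
    if (action["action_type"] in ("DOSE_HOLD", "ORDER_MONITORING_AND_WAIT")
            and not action.get("monitoring_plan")):
        reasons.append("missing_monitoring_plan")
    return (not reasons, reasons)

ok, why = verify({"action_type": "STOP_DRUG", "target_drug": "benzo_like"})
print(ok, why)  # False ['abrupt_stop_requires_taper']
```

A failed check returns the reasons instead of raising, so the environment can penalize and trace the attempt while leaving patient state untouched.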

## Multi-Agent Policy Stack

The "agents" in PolyGuard are an auditable policy factorization rather than
free-form independent chatbots. A step flows through:

```text
MedRec -> Evidence -> GraphSafety -> Dosing -> Candidate
  -> Supervisor -> Planner -> Critic -> Env -> Explainer
```

| Agent/module | Role |
| --- | --- |
| `MedRecAgent` | Summarizes current regimen and medication burden |
| `EvidenceAgent` | Retrieves local or fallback evidence when missing data is present |
| `GraphSafetyAgent` | Scores risky pairs, side-effect load, duplicate therapy, and graph safety patterns |
| `DosingAgent` | Detects dose-sensitive cases and dose-hold opportunities |
| `CandidateAgent` | Exposes legal candidate actions from the environment candidate builder |
| `SupervisorAgent` | Routes to regimen optimization, dose optimization, or review mode |
| `PlannerAgent` | Selects an action from candidates through the policy provider |
| `CriticAgent` | Vetoes illegal or unsafe proposed actions and can force review fallback |
| `ExplainerAgent` | Records grounded rationale for demo, replay, and audit |

The orchestration modes are `sequential_pipeline`, `supervisor_routed`,
`replan_on_veto`, and `lightweight_debate`. Policy-stack ablations compare
`bandit-only`, `llm-only`, and `llm+bandit`.

## Contextual Bandits

PolyGuard uses contextual bandits as an inspectable candidate-reranking layer.
This is where the project most directly echoes the arXiv bandit inspiration:
unsafe polypharmacy search is combinatorial, so the system should learn which
regions of the candidate/action space are worth exploring rather than enumerate
everything.

Each candidate becomes an 8-dimensional feature vector:

```text
x(c) = [
  1,
  I[legality_precheck],
  estimated_safety_delta,
  burden_delta,
  disease_stability_estimate,
  1 - uncertainty_score,
  I[mode = DOSE_OPT],
  I[mode = REVIEW]
]
```
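Transcribed directly, assuming dict-shaped candidates whose keys follow the feature names above:

```python
# Build the 8-dimensional feature vector exactly as listed above.
def features(c, mode):
    return [
        1.0,                                   # bias term
        1.0 if c["legality_precheck"] else 0.0,
        c["estimated_safety_delta"],
        c["burden_delta"],
        c["disease_stability_estimate"],
        1.0 - c["uncertainty_score"],
        1.0 if mode == "DOSE_OPT" else 0.0,
        1.0 if mode == "REVIEW" else 0.0,
    ]

x = features({"legality_precheck": True, "estimated_safety_delta": 0.4,
              "burden_delta": -0.1, "disease_stability_estimate": 0.9,
              "uncertainty_score": 0.2}, mode="DOSE_OPT")
print(len(x), x[6])  # 8 1.0
```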

An arm is keyed by macro mode and action type:

```text
arm(c) = mode(c) || ":" || action_type(c)
```

The LinUCB variant maintains, for each arm `a`:

```text
A_a = I + sum x x^T
b_a = sum r x
theta_a = A_a^{-1} b_a

score_a(x) = theta_a^T x + alpha * sqrt(x^T A_a^{-1} x)
```
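A minimal single-arm LinUCB in plain Python, transcribing the update and score formulas above. It is restricted to 2-dimensional features so the matrix inverse stays explicit; the real implementation is in `app/models/baselines/contextual_bandit.py`, and this sketch is of the formulas, not that code:

```python
from math import sqrt

class LinUCBArm:
    """One arm: A_a = I + sum x x^T, b_a = sum r x, d fixed at 2."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.A = [[1.0, 0.0], [0.0, 1.0]]   # A_a starts as the identity
        self.b = [0.0, 0.0]                 # b_a starts at zero

    def _inv(self):
        # Explicit 2x2 inverse of A_a (A_a is always invertible: I + PSD).
        (a, b), (c, d) = self.A
        det = a * d - b * c
        return [[d / det, -b / det], [-c / det, a / det]]

    def update(self, x, r):
        for i in range(2):
            for j in range(2):
                self.A[i][j] += x[i] * x[j]  # A_a += x x^T
            self.b[i] += r * x[i]            # b_a += r x

    def score(self, x):
        Ainv = self._inv()
        theta = [sum(Ainv[i][j] * self.b[j] for j in range(2))
                 for i in range(2)]          # theta_a = A_a^{-1} b_a
        mean = sum(theta[i] * x[i] for i in range(2))
        bonus = sqrt(sum(x[i] * Ainv[i][j] * x[j]
                         for i in range(2) for j in range(2)))
        return mean + self.alpha * bonus     # exploitation + exploration

arm = LinUCBArm(alpha=0.5)
arm.update([1.0, 0.4], r=0.9)
print(round(arm.score([1.0, 0.4]), 3))  # about 0.85
```

The Thompson-style variant below swaps the deterministic exploration bonus for Gaussian noise on the mean score; everything else is unchanged.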

There is also a Thompson-style variant:

```text
score_a(x) = theta_a^T x + Normal(0, alpha)
```

This layer can shortlist candidates before the planner emits the final action.
It is deliberately kept inside the candidate space: the bandit can improve
ordering and exploration, but it cannot invent an unsafe action outside the
environment contract.

## Reward Model

The reward model is decomposed on purpose. A single scalar reward is needed for
RL, but safety-critical RL needs more than one opaque number. PolyGuard logs 13
component columns and four primary channels on every step.

All reward values are clamped and quantized:

```text
q(x) = round(clip(x, 0.001, 0.999), 3)
```

The 13 reward components are:

| Component | Weight | Meaning |
| --- | ---: | --- |
| `format_compliance_score` | 0.08 | Action payload conforms to the schema |
| `candidate_alignment_score` | 0.08 | The model selected a valid candidate-style id |
| `legality_score` | 0.12 | The verifier accepted the action |
| `safety_delta_score` | 0.15 | Severe-pair and burden risk decreased |
| `burden_improvement_score` | 0.08 | Dose-weighted medication burden improved |
| `disease_stability_score` | 0.10 | The action did not destabilize underlying disease management |
| `dosing_quality_score` | 0.08 | Dose-sensitive routing/action quality |
| `abstention_quality_score` | 0.06 | Review/abstention is appropriate under uncertainty |
| `efficiency_score` | 0.06 | The action uses the finite step budget well |
| `process_fidelity_score` | 0.06 | The action follows task-specific process expectations |
| `explanation_grounding_score` | 0.03 | The rationale is present and grounded |
| `anti_cheat_score` | 0.06 | Reward-hacking checks did not fire |
| `uncertainty_calibration_score` | 0.04 | Confidence matches observable uncertainty |

The scalar reward is a weighted average:

```text
R_env(s_t, a_t, s_{t+1}) = q( sum_i w_i c_i / sum_i w_i )
```
|
| 411 |
+
|
| 412 |
+
Safety-heavy terms dominate the total weight:
|
| 413 |
+
|
| 414 |
+
```text
|
| 415 |
+
legality + safety_delta + burden + disease_stability + anti_cheat
|
| 416 |
+
= 0.12 + 0.15 + 0.08 + 0.10 + 0.06
|
| 417 |
+
= 0.51
|
| 418 |
+
```
|
| 419 |
+
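Putting the table and the aggregation formula together, a small sketch of the scalar reward (weights copied from the table above; function names are illustrative, not the repository's API) could look like:

```python
# Component weights from the table above; they sum to 1.0.
WEIGHTS = {
    "format_compliance_score": 0.08,
    "candidate_alignment_score": 0.08,
    "legality_score": 0.12,
    "safety_delta_score": 0.15,
    "burden_improvement_score": 0.08,
    "disease_stability_score": 0.10,
    "dosing_quality_score": 0.08,
    "abstention_quality_score": 0.06,
    "efficiency_score": 0.06,
    "process_fidelity_score": 0.06,
    "explanation_grounding_score": 0.03,
    "anti_cheat_score": 0.06,
    "uncertainty_calibration_score": 0.04,
}

def q(x):
    """Clamp to [0.001, 0.999], then round to 3 decimals."""
    return round(min(max(x, 0.001), 0.999), 3)

def scalar_reward(components):
    """Weighted average of the 13 component scores, quantized by q."""
    total = sum(WEIGHTS.values())
    return q(sum(w * components[name] for name, w in WEIGHTS.items()) / total)
```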
The four primary reward channels are:

| Channel | Component family |
| --- | --- |
| `safety_legality` | legality, candidate alignment, anti-cheat, uncertainty calibration |
| `clinical_improvement` | safety delta, burden improvement, disease stability |
| `dosing_quality` | dosing quality and abstention quality |
| `process_integrity` | format compliance, efficiency, process fidelity, explanation grounding |

These channels are emitted in `info.primary_reward_channels`, GRPO reward logs,
reports, plots, and ablation summaries.

## Anti-Cheat and Failure Visibility

RL policies exploit reward functions. PolyGuard makes common shortcut failures
explicit: repeated action loops, excessive keep-regimen behavior, excessive
review/abstention behavior, candidate ID mismatch, candidate outside the legal
set, hidden high-risk DDI no-op behavior, parser exploit patterns in rationales,
and retries of failed no-op actions.

If an exploit is detected:

```text
anti_cheat_score = 0.001
done = true
termination_reason = "exploit_detection"
```

Episodes can also terminate on step budget exhaustion, repeated invalid
actions, safety-veto threshold, patient destabilization, safe resolution,
wall-clock timeout, or per-step timeout.

![Termination reasons](polyguard-rl/docs/blog_assets/termination_reasons.svg)
## Mathematics

PolyGuard can be read as a finite-horizon constrained partially observable
Markov decision process:

```text
M = (S, A, O, T, R, H, C)
```

where `S` is latent patient/regimen state, `A` is the constrained medication
action set, `O` is the controlled observation, `T(s' | s, a)` is the transition
function, `R(s, a, s')` is verifier-backed reward, `H` is the episode horizon,
and `C(s, a)` is the hard safety/legality constraint predicate.

The objective is:

```text
maximize_pi E_pi [ sum_{t=0}^{H-1} R(s_t, a_t, s_{t+1}) ]
subject to C(s_t, a_t) = 1 whenever possible
```

There is no explicit discount factor in the runtime. Time preference enters
through finite horizons and the efficiency reward:

```text
efficiency_t = q(1 - step_count_t / (max_steps + 1))
```
State transition is two-gated:

```text
if verifier(s_t, a_t).legal and not anti_cheat(s_t, a_t):
    s_{t+1} = T(s_t, a_t)
else:
    s_{t+1} = rollback_state_with_failed_action_record(s_t, a_t)
```

Risk-like deltas become reward through:

```text
delta_reward(pre, post) = q(0.5 + 0.6 * (pre - post))
```

For burden and contraindicated-pair improvement:

```text
burden_reward = delta_reward(pre_burden, post_burden)
pair_reward = delta_reward(pre_pairs, post_pairs)

safety_delta_score =
    q(0.65 * pair_reward + 0.35 * burden_reward)   if legal
    0.001                                          otherwise
```
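As a runnable sketch of these two formulas (a direct transcription, not the repository's implementation):

```python
def q(x):
    """Clamp to [0.001, 0.999], then round to 3 decimals."""
    return round(min(max(x, 0.001), 0.999), 3)

def delta_reward(pre, post):
    """Map a risk decrease (pre - post > 0) above the 0.5 neutral point."""
    return q(0.5 + 0.6 * (pre - post))

def safety_delta_score(pre_pairs, post_pairs, pre_burden, post_burden, legal):
    """Blend contraindicated-pair and burden improvements, weighted 65/35."""
    if not legal:
        return 0.001
    pair_reward = delta_reward(pre_pairs, post_pairs)
    burden_reward = delta_reward(pre_burden, post_burden)
    return q(0.65 * pair_reward + 0.35 * burden_reward)
```

Note the neutral point: an action that changes nothing scores 0.5, so only genuine risk reduction pushes this component above the midpoint.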
GRPO uses environment execution as the reward function. For each prompt, the
model emits candidate completions; PolyGuard parses the candidate id, resets a
deterministic environment using the recorded seed and scenario fields, executes
one step, and returns reward. The training reward combines environment reward
with a legality bonus:

```text
legal_bonus = 0.95 if action is legal else 0.05

R_GRPO = q(0.80 * R_env + 0.20 * legal_bonus)
```
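The same combination as a one-function sketch (again a transcription of the formula, not TRL or PolyGuard code):

```python
def q(x):
    return round(min(max(x, 0.001), 0.999), 3)

def grpo_training_reward(r_env, legal):
    """Blend environment reward with the legality bonus, weighted 80/20."""
    legal_bonus = 0.95 if legal else 0.05
    return q(0.80 * r_env + 0.20 * legal_bonus)
```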
Conceptually, GRPO forms a within-prompt advantage:

```text
A_i = (R_i - mean_j R_j) / (std_j R_j + epsilon)
```

and optimizes a clipped policy-ratio objective with KL regularization. The
optimizer mechanics are TRL's; PolyGuard's contribution is the verifier-backed
reward function and the controlled action/state environment. The expanded
derivation is in
[polyguard-rl/docs/mathematics.md](polyguard-rl/docs/mathematics.md).
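The within-prompt normalization is simple enough to sketch directly (NumPy version; population standard deviation is an assumption here):

```python
import numpy as np

def group_advantages(rewards, epsilon=1e-6):
    """Standardize each completion's reward within its prompt group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + epsilon)
```

Because the baseline is the group mean, completions are only rewarded for beating their siblings on the same prompt, which is what makes the verifier-backed reward comparisons meaningful.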
## Data and Dataset Pipeline

The data pipeline builds a compact medication-safety substrate from local drug
knowledge, synthetic patients, scenario files, retrieval text, and optional
external augmentation.

![Data pipeline](polyguard-rl/docs/blog_assets/data_pipeline.svg)

The dataset design is documented in
[docs/datasets.md](polyguard-rl/docs/datasets.md). The local generated
pipeline produces these processed artifacts and counts:

| Artifact | Count | Path |
| --- | ---: | --- |
| Normalized drug rows | 10 | `data/processed/normalized_drugs.parquet` |
| Drug class rows | 10 | `data/processed/drug_classes.parquet` |
| Interaction rows | 2 | `data/processed/interactions.parquet` |
| Graph edges | 18 | `data/processed/graph_edges.parquet` |
| Synthetic patients | 20 | `data/processed/patients_synthetic.parquet` |
| Retrieval documents | 8 | `data/processed/retrieval_corpus.jsonl` |
| Easy scenarios | 100 | [data/scenarios/easy/](polyguard-rl/data/scenarios/easy/) |
| Medium scenarios | 200 | [data/scenarios/medium/](polyguard-rl/data/scenarios/medium/) |
| Hard scenarios | 200 | [data/scenarios/hard/](polyguard-rl/data/scenarios/hard/) |
| Local small SFT rows | 80 | `data/processed/training_corpus_sft.jsonl` |
| Local small GRPO prompts | 80 | `data/processed/training_corpus_grpo_prompts.jsonl` |

The provenance manifest generated by the local pipeline records source policy
and counts at `data/processed/provenance_manifest.json`.

Additional data-governance and rule artifacts are produced or consumed by the
pipeline:

| Artifact | Why it matters |
| --- | --- |
| `data/processed/ingested_sources.json` | Source ingestion ledger used by the local build |
| `data/processed/feature_dictionary.json` | Names and meanings of structured model features |
| `data/processed/burden_rules.yaml` | Medication-burden and duplicate-therapy rules |
| `data/processed/substitution_rules.yaml` | Data-level safer-substitution rules |
| `data/processed/taper_rules.yaml` | Deprescribing and taper requirements |
| `data/retrieval_index/index.json` | Retrieval index over local evidence chunks |

The local knowledge seed is
[data/raw/knowledge/drug_knowledge.json](polyguard-rl/data/raw/knowledge/drug_knowledge.json).
It contains drug classes, example high-risk pairs, renal and hepatic flags,
side-effect tags, substitution rules, and taper requirements. The processed
tables then feed graph modeling, candidate generation, environment scenarios,
retrieval, SFT rows, and GRPO prompts.

The full training/evidence runs used 2,000 examples per Qwen model, recorded in
the final reports under
[docs/results/final_submission_evidence/reports/](polyguard-rl/docs/results/final_submission_evidence/reports/).
## Models Inside the Environment

PolyGuard combines learned and rule-backed components:

- Graph safety model:
  [app/models/graph/](polyguard-rl/app/models/graph/) produces regimen
  embeddings, pairwise DDI severity, severe-alert probability, and side-effect
  tag probabilities.
- Tabular risk model:
  [app/models/tabular/](polyguard-rl/app/models/tabular/) supports calibrated
  patient/regimen risk heads and evaluation.
- Dosing model:
  [app/models/dosing/](polyguard-rl/app/models/dosing/) models dose-sensitive
  states with target attainment, toxicity, underdose risk, organ stress,
  interaction load, and monitoring need.
- Retrieval:
  [app/models/retrieval/](polyguard-rl/app/models/retrieval/) and
  [app/knowledge/](polyguard-rl/app/knowledge/) provide local evidence chunks,
  drug rules, renal/hepatic guardrails, duplicate therapy rules, substitution
  rules, taper rules, burden scoring, and side-effect ontology.
- Active model runtime:
  [app/models/policy/active_model.py](polyguard-rl/app/models/policy/active_model.py)
  discovers activated artifacts from `checkpoints/active/active_model_manifest.json`;
  the tracked evidence mirror includes
  [docs/results/active_model_manifest.json](polyguard-rl/docs/results/active_model_manifest.json).
  The provider load order prefers a GRPO adapter, then merged model, then SFT
  adapter.
- Provider runtime:
  [app/models/policy/provider_runtime.py](polyguard-rl/app/models/policy/provider_runtime.py)
  is Transformers-first, with optional Ollama when enabled. If model loading is
  unavailable, the runtime falls back to deterministic safety ranking.

Tracked support-model reports show that the environment is not only an LLM
wrapper:

| Component | Report | Current tracked result |
| --- | --- | --- |
| Graph model | [docs/results/graph_train.json](polyguard-rl/docs/results/graph_train.json) | `status: trained`, `num_samples: 180`, artifact path `outputs/models/graph_model.pkl` |
| Tabular risk model | [docs/results/risk_train.json](polyguard-rl/docs/results/risk_train.json) | `status: trained`, `dataset_size: 180`, `train_mae: 0.0033`, artifact path `outputs/models/tabular_risk.pkl` |
| Dose surrogate model | [docs/results/dose_train.json](polyguard-rl/docs/results/dose_train.json) | `status: trained`, `dataset_size: 120`, `train_mae: 0.0025`, artifact path `outputs/models/dose_model.pkl` |

The hard-coded contraindicated seed pairs in
[app/knowledge/ddi_knowledge.py](polyguard-rl/app/knowledge/ddi_knowledge.py)
include `warfarin_like` + `nsaid_like` and `benzodiazepine_like` +
`opioid_like`. Substitution rules in
[app/knowledge/substitution_rules.py](polyguard-rl/app/knowledge/substitution_rules.py)
include safer alternatives such as `nsaid_like -> acetaminophen_like`,
`nsaid_like -> topical_nsaid_like`, `benzodiazepine_like -> non_benzo_sleep_support`,
and `opioid_like -> non_opioid_analgesic`.
### Precision Dosing

Precision dosing uses sensitive classes such as anticoagulants, sedatives, and
glucose-lowering drugs. The dosing agent and surrogate model are implemented in
[app/agents/dosing_agent.py](polyguard-rl/app/agents/dosing_agent.py) and
[app/models/dosing/](polyguard-rl/app/models/dosing/).

The surrogate PK/PD transition in
[app/models/dosing/surrogate_pkpd.py](polyguard-rl/app/models/dosing/surrogate_pkpd.py)
uses effect, toxicity, underdose, organ stress, and interaction load:

```text
effective_delta = dose_delta * (1 - min(0.6, organ_factor * 0.4))

effect' =
    clip(effect + 0.28 * effective_delta - 0.05 * interaction_factor, 0, 1)

toxicity_gain =
    max(0, dose_delta) * (0.35 + 0.25 * organ_factor + 0.20 * interaction_factor)

toxicity' =
    clip(0.85 * toxicity + toxicity_gain, 0, 1)

underdose' =
    clip(1 - effect' + 0.15 * max(0, -dose_delta), 0, 1)
```
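Written as a single transition function, the update rules above read as follows (a transcription of the equations, not the module's actual signature):

```python
def clip01(x):
    return min(max(x, 0.0), 1.0)

def pkpd_step(effect, toxicity, dose_delta, organ_factor, interaction_factor):
    """Apply one surrogate PK/PD transition and return the next latent levels."""
    effective_delta = dose_delta * (1 - min(0.6, organ_factor * 0.4))
    effect_next = clip01(effect + 0.28 * effective_delta - 0.05 * interaction_factor)
    toxicity_gain = max(0.0, dose_delta) * (
        0.35 + 0.25 * organ_factor + 0.20 * interaction_factor
    )
    toxicity_next = clip01(0.85 * toxicity + toxicity_gain)
    underdose_next = clip01(1 - effect_next + 0.15 * max(0.0, -dose_delta))
    return effect_next, toxicity_next, underdose_next
```

Toxicity decays geometrically (`0.85 *`) unless the dose is raised, while underdose risk is driven by the complement of achieved effect, so dose increases and decreases have asymmetric consequences.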
The higher-level dosing metrics use target attainment, toxicity avoidance,
underdose risk, and monitoring need:

```text
target_attainment = 1 - abs(effect_level - 0.62)
toxicity_proxy = toxicity_level + 0.20 * organ_stress + 0.12 * interaction_load
measurement_need = max(toxicity_proxy, underdose_proxy)
```
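These proxies are likewise easy to express directly (a sketch; the target level `0.62` is taken from the formula above):

```python
def dosing_metrics(effect_level, toxicity_level, organ_stress,
                   interaction_load, underdose_proxy):
    """Compute the higher-level dosing proxies from the latent levels."""
    target_attainment = 1 - abs(effect_level - 0.62)
    toxicity_proxy = toxicity_level + 0.20 * organ_stress + 0.12 * interaction_load
    measurement_need = max(toxicity_proxy, underdose_proxy)
    return target_attainment, toxicity_proxy, measurement_need
```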
## Training and Post-Training

The training stack is deliberately staged:

1. Build structured data, scenarios, retrieval records, SFT examples, and GRPO
   prompts.
2. Run SFT with TRL to teach the model the candidate-id format and obvious
   clinical priors.
3. Run GRPO with environment-backed reward, where sampled candidate completions
   are executed in PolyGuardEnv and scored by the verifier/reward router.
4. Track sampled generations, reward components, primary reward channels,
   legality, anti-cheat events, and training curves.
5. Run policy-stack ablations and baseline comparisons.
6. Merge or export adapters safely.
7. Validate post-save inference from the saved artifact, not from an in-memory
   training object.
8. Generate reports, charts, action traces, and final artifact manifests.

The relevant training source files are
[scripts/train_sft_trl.py](polyguard-rl/scripts/train_sft_trl.py),
[scripts/train_grpo_trl.py](polyguard-rl/scripts/train_grpo_trl.py),
[app/training/sft_trl.py](polyguard-rl/app/training/sft_trl.py),
[app/training/grpo_trl.py](polyguard-rl/app/training/grpo_trl.py),
[app/training/reward_functions.py](polyguard-rl/app/training/reward_functions.py),
[app/training/openenv_wrapper.py](polyguard-rl/app/training/openenv_wrapper.py),
and [app/hf_space/training_runner.py](polyguard-rl/app/hf_space/training_runner.py).

The one-run notebook is
[polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb).
It is the accessible Colab/HF workflow for building data, running checks,
launching training, pulling reports, generating charts, validating inference,
activating a model, deploying the product Space, and running acceptance checks.

The modular notebook series is:

- [01_data_building.ipynb](polyguard-rl/notebooks/01_data_building.ipynb)
- [02_knowledge_graph.ipynb](polyguard-rl/notebooks/02_knowledge_graph.ipynb)
- [03_risk_models.ipynb](polyguard-rl/notebooks/03_risk_models.ipynb)
- [04_environment_validation.ipynb](polyguard-rl/notebooks/04_environment_validation.ipynb)
- [05_sft_debug.ipynb](polyguard-rl/notebooks/05_sft_debug.ipynb)
- [06_grpo_debug.ipynb](polyguard-rl/notebooks/06_grpo_debug.ipynb)
- [07_policy_analysis.ipynb](polyguard-rl/notebooks/07_policy_analysis.ipynb)
- [08_dosing_analysis.ipynb](polyguard-rl/notebooks/08_dosing_analysis.ipynb)
- [09_training_loop.ipynb](polyguard-rl/notebooks/09_training_loop.ipynb)

For exact local and remote execution details, use
[docs/training.md](polyguard-rl/docs/training.md) and
[docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md).
This blog focuses on architecture, data, evaluation, and evidence rather than
private or environment-specific commands.
## Training Curves and Model Results

The final curated evidence lives in
[polyguard-rl/docs/results/final_submission_evidence/](polyguard-rl/docs/results/final_submission_evidence/).
It replaces earlier smoke-run charts and older 0.5B/1.5B-only views.

### SFT Loss Across Qwen Runs

![SFT loss all models](polyguard-rl/docs/results/final_submission_evidence/plots/sft_loss_all_models.png)

The SFT curves, post-save valid rates, and token-accuracy histories show that
the models learned the candidate-id output contract rather than only producing
unconstrained prose. The visible curves drop from roughly `3.0-3.6` initial
loss to low final loss across all three Qwen sizes.

![SFT token accuracy](polyguard-rl/docs/results/final_submission_evidence/plots/sft_token_accuracy.png)

The tracked per-model summaries are:

| Run | Model | Epochs | Final step | Runtime | Key SFT metrics |
| --- | --- | ---: | ---: | ---: | --- |
| `qwen-qwen2-5-0-5b-instruct` | `Qwen/Qwen2.5-0.5B-Instruct` | 2 | 2,000 | `234.6302s` | loss `3.0856 -> 0.0626`, best `0.0057`, train loss `0.1923`, token accuracy `0.9717`, valid rate `1.0`, avg env reward `0.726`, latency `1.839s` |
| `qwen-qwen2-5-1-5b-instruct` | `Qwen/Qwen2.5-1.5B-Instruct` | 2 | 4,000 | `483.7085s` | loss `2.9686 -> 0.0681`, best `0.0009`, train loss `0.1152`, token accuracy `0.9726`, valid rate `1.0`, avg env reward `0.726`, latency `2.158s` |
| `qwen-qwen2-5-3b-instruct` | `Qwen/Qwen2.5-3B-Instruct` | 2 | 2,000 | `715.2908s` | loss `3.5687 -> 0.054`, best `0.0022`, train loss `0.1569`, token accuracy `0.9750`, SFT avg env reward `0.781`, SFT latency `2.863s` |

Each SFT run used 2,000 examples. The 0.5B and 3B runs recorded 2,001 history
rows including the final trainer summary; the 1.5B run recorded 4,001 history
rows because its batch configuration produced 4,000 final steps.

### GRPO Reward Curve

![GRPO reward curve](polyguard-rl/docs/results/final_submission_evidence/plots/grpo_reward_curve.png)

![GRPO reward distribution](polyguard-rl/docs/results/final_submission_evidence/plots/grpo_reward_distribution.png)

The complete GRPO evidence is available for Qwen 3B:

- Backend: `trl_transformers`
- Model: `Qwen/Qwen2.5-3B-Instruct`
- Records: `2000`
- Epochs: `1.0`
- Final step: `2000`
- Runtime: `6873.9375s` (`1.91h`)
- Reward samples: `4000`
- GRPO average reward: `0.767`
- GRPO reward history: min `0.376`, max `0.880`, last `0.812`, average `0.76685`
- GRPO train loss: `0.000002665`
- Post-save GRPO valid rate: `1.0`
- Post-save GRPO average environment reward: `0.726`
- Post-save GRPO average latency: `3.681s`
- Artifact path recorded in the report: `checkpoints/sweeps/qwen-qwen2-5-3b-instruct/grpo_adapter`

Source reports:
[grpo_trl_run.json](polyguard-rl/docs/results/final_submission_evidence/reports/grpo_trl_run.json),
[postsave_inference_grpo.json](polyguard-rl/docs/results/final_submission_evidence/reports/postsave_inference_grpo.json),
and
[submission_summary.json](polyguard-rl/docs/results/final_submission_evidence/reports/submission_summary.json).
### SFT vs GRPO by Model

![SFT vs GRPO](polyguard-rl/docs/results/final_submission_evidence/plots/sft_vs_grpo_by_model.png)

This chart is intentionally transparent about artifact availability. Qwen 0.5B
and 1.5B have SFT reports/histories and post-save SFT evidence in the repo, but
their adapter directories were not present in the local/final artifact mirrors
at packaging time. Qwen 3B has the complete SFT plus GRPO artifact set.

The packaged manifest records Qwen 3B as complete with 125 checkpoint files
(`433,208,536` bytes), 11 SFT adapter files (`30,655,905` bytes), 11 GRPO
adapter files (`30,656,841` bytes), and 9 report files (`5,930,214` bytes).
Qwen 0.5B and 1.5B are retained as report/post-save evidence only.

Manifest:
[docs/results/final_submission_evidence/manifest.json](polyguard-rl/docs/results/final_submission_evidence/manifest.json).
### Product Pipeline vs Basic LLM Proxy

![Pipeline comparison](polyguard-rl/docs/results/final_submission_evidence/plots/pipeline_vs_basic_llm.png)

Matched-seed evaluation compares a basic LLM-style first-legal proxy, an
SFT-style safety ranker, and the full PolyGuard orchestrated pipeline. The same
PolyGuard verifier/reward system judges all three.

| Policy | Episodes | Avg reward | Legality rate | Failure/exploit rate | Candidate diversity |
| --- | ---: | ---: | ---: | ---: | ---: |
| Basic LLM proxy | 8 | `0.762` | `1.0` | `0.25` | 1 |
| SFT policy proxy | 8 | `0.818` | `1.0` | `0.0` | 2 |
| Full PolyGuard pipeline | 8 | `0.805` | `1.0` | `0.0` | 2 |

The full pipeline improves average verifier reward over the basic LLM proxy by
`+0.043` while reducing visible failure/exploit rate from `0.25` to `0.0`.

![Failure comparison](polyguard-rl/docs/results/final_submission_evidence/plots/failure_comparison.png)

Two matched seeds expose the core failure mode: the basic policy repeatedly
kept a regimen despite the hidden `warfarin_like` + `nsaid_like` DDI holdout,
triggering `holdout_ddi_not_addressed`. The full pipeline selected safer dose
or hold candidates and avoided those failure reasons.

Source:
[basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json).

### Reward Components and Channels

![Reward components](polyguard-rl/docs/results/final_submission_evidence/plots/reward_components.png)

![Reward channels](polyguard-rl/docs/results/final_submission_evidence/plots/reward_channels.png)

The reward charts are as important as the scalar reward curve. They show
whether the model is improving by becoming safer and more process-faithful or
merely exploiting one easy component. The reports log the full 13-component
reward vector and the four primary channels for GRPO and evaluation runs.

For Qwen 3B GRPO, the tracked average primary channels are:

| Channel | Average |
| --- | ---: |
| `safety_legality` | `0.816` |
| `clinical_improvement` | `0.609` |
| `dosing_quality` | `0.543` |
| `process_integrity` | `0.875` |
### Post-Save Inference

![Post-save inference](polyguard-rl/docs/results/final_submission_evidence/plots/postsave_inference.png)

Post-save inference is separate from training. The exported/activated artifact
is loaded and asked to choose candidate ids on held prompt samples. The Qwen 3B
GRPO adapter path produced:

- `model_source: adapter`
- `samples: 5`
- `valid_rate: 1.0`
- `avg_env_reward: 0.726`
- `avg_latency_seconds: 3.681`

The caveat matters: `valid_rate: 1.0` means the output was parseable and
executable as a candidate selection. In the five-sample Qwen 3B post-save GRPO
report, four valid samples still terminated with `exploit_detection`. That is
retained as safety evidence, because PolyGuard's job is to expose suspicious or
loop-like behavior instead of hiding it behind a clean parse metric.
## Agentic Evaluation

Evaluation is not one benchmark number. The evaluation stack under
[app/evaluation/](polyguard-rl/app/evaluation/) includes offline policy
evaluation, safety evaluation, dosing evaluation, robustness under missing labs
and noisy inputs, calibration and abstention evaluation, process fidelity,
subgroup summaries, explainability grounding, baseline comparison, policy
ablations, failure mining, and action traces.

The tracked benchmark report records:

| Metric family | Result |
| --- | --- |
| Offline avg reward | `0.772833` |
| Offline legal rate | `1.0` |
| Severe violation rate | `0.0` |
| Illegal step rate | `0.0` |
| Dosing target attainment | `0.75` |
| Dosing toxicity avoidance | `1.0` |
| Missing-labs safety rate | `0.666667` |
| Noisy-dose, conflicting-meds, alias-noise, hidden-duplicate, wrong-candidate-id, stale-evidence, delayed-ADE safety/resilience | `1.0` |
| Calibration ECE proxy | `0.08625` |
| Process fidelity | `0.92` |
| Explainability grounding | `0.8` |

Source:
[docs/results/benchmark_report.json](polyguard-rl/docs/results/benchmark_report.json).

The improvement gate compares baseline and candidate reports:

| Gate dimension | Delta |
| --- | ---: |
| Average reward | `+0.025833` |
| Legality rate | `0.0` non-regression |
| Success rate | `0.0` non-regression |
| Process fidelity | `+0.92` |
| Timeout rate | `0.0` non-regression |
| Failure visibility | `0.0` non-regression |

Source:
[docs/results/improvement_report.json](polyguard-rl/docs/results/improvement_report.json).

### Policy Ablation Results

| Stack | Avg reward | Legality | Visible failure rate | Exploit detections | Interpretation |
| --- | ---: | ---: | ---: | ---: | --- |
| `bandit_only` | `0.779625` | `1.0` | `0.0625` | 2 | Strong deterministic shortlist behavior with low failure visibility |
| `llm_only` | `0.772391` | `1.0` | `0.3043` | 7 | Legal, but more loop-like failure behavior |
| `llm+bandit` | `0.764739` | `1.0` | `0.3043` | 7 | Current combined stack needs tighter exploration/control in these ablation settings |

![Policy ablation](polyguard-rl/docs/results/final_submission_evidence/plots/policy_ablation.png)

The point of these ablations is not to claim every combined policy is always
better. The point is that PolyGuard can localize behavior: legality remains
high, while failure mining shows whether a stack is looping, over-reviewing,
or selecting non-improving candidates.

Source:
[policy_ablation_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/policy_ablation_report.json).
## OpenEnv and Product Surfaces

The OpenEnv package is compact:

```yaml
spec_version: 1
name: polyguard-openenv
runtime: fastapi
app: app.env.fastapi_app:app
port: 8100
```

The OpenEnv runtime exposes `POST /reset`, `POST /step`, `GET /state`,
`GET /metadata`, `GET /schema`, `POST /mcp`, `GET /health`, `GET /ws`, and
backward-compatible `/env/*` routes.
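A minimal stdlib-only client against those routes could look like the following. The helper itself is generic; the request and response field names in the commented episode loop are illustrative guesses, not the actual PolyGuard schema:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8100"  # port taken from the OpenEnv spec above

def call(path, payload=None):
    """POST a JSON payload (or GET when payload is None) and decode the JSON reply."""
    data = None if payload is None else json.dumps(payload).encode()
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Hypothetical episode loop (field names are NOT the real schema):
# obs = call("/reset", {"seed": 7})
# result = call("/step", {"candidate_id": obs["candidates"][0]["id"]})
```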
The product API in [app/api/routes.py](polyguard-rl/app/api/routes.py) wraps
the environment, orchestrator, policy runtime, evaluation, evidence search,
cases, metrics, and medication-alternative tooling. Product-facing endpoints
include `/env/reset`, `/env/step_candidate`, `/agents/orchestrate`,
`/policy/infer`, `/policy/model_status`, `/eval/run_policy`,
`/metrics/training`, `/evidence/query`, and `/tools/medication_alternatives`.

![API surfaces](polyguard-rl/docs/blog_assets/api_surfaces.svg)

## Operations and Deployment

The repository keeps deployment and artifact operations explicit:

| Surface | Files |
| --- | --- |
| Local/container runtime | [Dockerfile](polyguard-rl/Dockerfile), [Dockerfile.space](polyguard-rl/Dockerfile.space), [docker-compose.yml](polyguard-rl/docker-compose.yml), [requirements.txt](polyguard-rl/requirements.txt), [requirements-space.txt](polyguard-rl/requirements-space.txt) |
| Product Space/API deployment | [scripts/deploy_space.sh](polyguard-rl/scripts/deploy_space.sh), [scripts/deploy_space_api.py](polyguard-rl/scripts/deploy_space_api.py), [docs/deployment.md](polyguard-rl/docs/deployment.md) |
| Training and evidence Spaces | [scripts/deploy_training_space.py](polyguard-rl/scripts/deploy_training_space.py), [scripts/monitor_training_space_status.py](polyguard-rl/scripts/monitor_training_space_status.py), [app/hf_space/training_runner.py](polyguard-rl/app/hf_space/training_runner.py), [app/hf_space/evidence_runner.py](polyguard-rl/app/hf_space/evidence_runner.py) |
| Artifact packaging and activation | [scripts/deploy_final_artifact_space.py](polyguard-rl/scripts/deploy_final_artifact_space.py), [scripts/package_active_model_bundle.py](polyguard-rl/scripts/package_active_model_bundle.py), [scripts/install_hf_active_bundle.py](polyguard-rl/scripts/install_hf_active_bundle.py), [docs/results/active_model_manifest.json](polyguard-rl/docs/results/active_model_manifest.json) |
| Submission validation | [scripts/acceptance_gate.py](polyguard-rl/scripts/acceptance_gate.py), [scripts/validate_submission_links.py](polyguard-rl/scripts/validate_submission_links.py), [docs/submission_checklist.md](polyguard-rl/docs/submission_checklist.md), [docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md) |

The important operational distinction is that local smoke artifacts, remote
training-space logs, final artifact packaging, and active-model installation
are separate stages. Final claims are tied to the curated evidence bundle, not
to whichever intermediate output directory happens to exist in a checkout.
## The Workbench UI
|
| 966 |
+
|
| 967 |
+
The UI is a React 18 + Vite + TypeScript workbench under
|
| 968 |
+
[app/ui/frontend/](polyguard-rl/app/ui/frontend/). It is not the environment
|
| 969 |
+
itself; it is an operator surface over the API and OpenEnv runtime.
|
| 970 |
+
|
| 971 |
+
[Live workbench Space](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-workbench)
|
| 972 |
+
|
| 973 |
+

|
| 974 |
+
|
| 975 |
+
The main views cover patient workbench, episode replay, policy comparison and
|
| 976 |
+
policy lab, precision dosing, training monitor, safety inspector, candidate
|
| 977 |
+
actions, reward panel, episode trace, and alternative medication search through
|
| 978 |
+
`/tools/medication_alternatives`.
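
The alternatives tool returns structured JSON that the UI renders. The exact schema belongs to the API; the field names below (`medication`, `alternatives`, `source_label_url`) are illustrative assumptions to show the shape of a client-side summary, not the real contract.

```python
def summarize_alternatives(payload: dict) -> list[str]:
    """Flatten a hypothetical /tools/medication_alternatives payload into readable lines.

    Assumed shape (illustrative only, check the API for the real schema):
    {"medication": "...", "alternatives": [{"name": "...", "source_label_url": "..."}]}
    """
    med = payload.get("medication", "?")
    lines = []
    for alt in payload.get("alternatives", []):
        url = alt.get("source_label_url", "no source")  # link out to the label, if present
        lines.append(f"{med} -> {alt['name']} ({url})")
    return lines
```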

The Patient Workbench shows the active model chip, current scenario, candidate
set, agent-vs-environment flow, reward breakdown, and action trace without
requiring the reader to inspect raw JSON. The UI is intentionally a workbench,
not a polished clinical application.

### UI Sequence

The five UI screenshots are checked in under `polyguard-rl/docs/UI Images/`.

1. The workbench opens with model truth, live episode context, scenario status,
   candidate count, and reward state.

   

2. The episode panel makes the patient, task, difficulty, sub-environment, risk
   delta, and candidate-action console visible without reading raw JSON.

   

3. Candidate selection is paired with reward-channel feedback, current
   medications, and blocked/available action visibility.

   

4. After an action, the workbench exposes history, warnings, decision payload,
   grounded facts, explanation, evidence, and event logs.

   

5. The alternatives tool surfaces medication substitutions from the current
   regimen and links out to source labels.

   

## Demo Videos

### [UI Walkthrough Video](https://drive.google.com/file/d/1YOzad5gvx-tSmGzJNuBgokBF4-dX2T2H/view?usp=sharing)

This walkthrough shows the deployed workbench surface, including the live model
chip, episode context, candidate actions, reward panels, and evidence-oriented
patient review flow.

### [Agent In Action: Action Button Demo](https://drive.google.com/file/d/1eHk1v0OYJRrLWVO97ZclN05MYHxmNnmc/view?usp=sharing)

This demo focuses on what the action button does: selecting a candidate,
submitting it through the environment, producing a verifier-scored transition,
and exposing the resulting reward, action history, warnings, and explanation.
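
That path follows the usual reset/step shape: reset yields an observation with candidate actions, stepping with a chosen candidate yields a scored transition. The toy environment below is a stand-in to make the loop concrete; PolyGuard's real class names, payloads, and verifier-backed reward channels differ, and the hard-coded reward table here is purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    observation: dict
    reward: float
    done: bool

class ToyMedEnv:
    """Stand-in environment with the reset/step contract the workbench drives.

    A sketch only: the real environment scores actions with its verifier and
    reward channels instead of this fixed lookup table.
    """
    SAFE = {"review_interactions": 0.5, "deprescribe_duplicate": 1.0}

    def reset(self) -> dict:
        self.t = 0
        return {"candidates": ["review_interactions", "deprescribe_duplicate", "add_random_drug"]}

    def step(self, action: str) -> StepResult:
        self.t += 1
        reward = self.SAFE.get(action, -1.0)  # unsafe candidates are penalized
        return StepResult({"candidates": list(self.SAFE)}, reward, self.t >= 3)

env = ToyMedEnv()
obs = env.reset()
result = env.step(obs["candidates"][1])  # "press the action button" on a candidate
```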

### [World Model Tool: Tavily and OpenFDA Alternative Suggestions](https://drive.google.com/file/d/1GaUyyaXaBCHjhHFbpkprojNt5pLNAoYi/view?usp=sharing)

This tool demo shows the world-model support path for alternative medication
suggestions, using Tavily and the OpenFDA government database to retrieve
candidate alternatives and side-effect evidence for safer review.

## How a Reviewer Should Read the Repository

For a fresh reviewer, the intended path is:

1. Read the artifact index:
   [polyguard-rl/docs/submission_artifacts.md](polyguard-rl/docs/submission_artifacts.md).
2. Inspect the final curated evidence:
   [polyguard-rl/docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md).
3. Open the one-run notebook:
   [PolyGuard_SFT_GRPO_One_Run_Runner.ipynb](polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb).
4. For local smoke work, follow [docs/training.md](polyguard-rl/docs/training.md)
   and the local scripts
   [scripts/run_env_local.sh](polyguard-rl/scripts/run_env_local.sh),
   [scripts/run_api_local.sh](polyguard-rl/scripts/run_api_local.sh), and
   [scripts/run_ui_local.sh](polyguard-rl/scripts/run_ui_local.sh).
5. For full training/reproduction, use the notebook or training docs rather
   than copying private artifact commands out of old drafts.
6. For final public artifacts, use the final artifact Space:
   [adithya9903/polyguard-openenv-final-artifacts](https://huggingface.co/spaces/adithya9903/polyguard-openenv-final-artifacts).

## Evidence and Artifact Inventory

Important evidence paths:

- Final overview:
  [docs/results/final_submission_evidence/README.md](polyguard-rl/docs/results/final_submission_evidence/README.md)
- Artifact manifest:
  [docs/results/final_submission_evidence/manifest.json](polyguard-rl/docs/results/final_submission_evidence/manifest.json)
- Three-model summary:
  [docs/results/final_submission_evidence/reports/submission_summary.json](polyguard-rl/docs/results/final_submission_evidence/reports/submission_summary.json)
- Qwen 3B GRPO report:
  [docs/results/final_submission_evidence/reports/grpo_trl_run.json](polyguard-rl/docs/results/final_submission_evidence/reports/grpo_trl_run.json)
- Post-save GRPO inference:
  [docs/results/final_submission_evidence/reports/postsave_inference_grpo.json](polyguard-rl/docs/results/final_submission_evidence/reports/postsave_inference_grpo.json)
- Basic LLM vs PolyGuard:
  [docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json)
- Policy ablation:
  [docs/results/final_submission_evidence/reports/policy_ablation_report.json](polyguard-rl/docs/results/final_submission_evidence/reports/policy_ablation_report.json)
- Action traces:
  [docs/results/final_submission_evidence/reports/action_traces.jsonl](polyguard-rl/docs/results/final_submission_evidence/reports/action_traces.jsonl)
- Curated charts:
  [docs/results/final_submission_evidence/charts/curated/README.md](polyguard-rl/docs/results/final_submission_evidence/charts/curated/README.md)
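
The manifest can be inspected programmatically. The helper below assumes a `models` mapping with a per-model `status` field, which matches the spirit of the final manifest but not necessarily its exact keys; check `manifest.json` itself before relying on this layout.

```python
import json
from pathlib import Path

def models_with_full_evidence(manifest: dict) -> list[str]:
    """Return model names whose evidence is not flagged as partial.

    Assumes a {"models": {name: {"status": ...}}} layout (an assumption,
    not the documented schema of the real manifest).
    """
    return [
        name
        for name, entry in manifest.get("models", {}).items()
        if entry.get("status") != "reports_only_or_partial"
    ]

# Example against the checked-in manifest (path from the evidence list above):
# manifest = json.loads(Path("polyguard-rl/docs/results/final_submission_evidence/manifest.json").read_text())
# print(models_with_full_evidence(manifest))
```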

Important tests:

| Category | Tests |
| --- | --- |
| Environment contract | [tests/test_openenv_contract.py](polyguard-rl/tests/test_openenv_contract.py), [tests/test_env_reset.py](polyguard-rl/tests/test_env_reset.py), [tests/test_env_step.py](polyguard-rl/tests/test_env_step.py), [tests/test_env_step_flow.py](polyguard-rl/tests/test_env_step_flow.py), [tests/test_future_subenvs.py](polyguard-rl/tests/test_future_subenvs.py) |
| Reward and safety | [tests/test_reward_functions.py](polyguard-rl/tests/test_reward_functions.py), [tests/test_reward_range.py](polyguard-rl/tests/test_reward_range.py), [tests/test_reward_channels.py](polyguard-rl/tests/test_reward_channels.py), [tests/test_anti_cheat.py](polyguard-rl/tests/test_anti_cheat.py), [tests/test_constraints.py](polyguard-rl/tests/test_constraints.py), [tests/test_timeout_logic.py](polyguard-rl/tests/test_timeout_logic.py) |
| Policy and runtime | [tests/test_agents.py](polyguard-rl/tests/test_agents.py), [tests/test_contextual_bandit.py](polyguard-rl/tests/test_contextual_bandit.py), [tests/test_policy_schema.py](polyguard-rl/tests/test_policy_schema.py), [tests/test_provider_runtime.py](polyguard-rl/tests/test_provider_runtime.py), [tests/test_postsave_inference.py](polyguard-rl/tests/test_postsave_inference.py), [tests/test_checkpoint_integrity.py](polyguard-rl/tests/test_checkpoint_integrity.py) |
| API and product tooling | [tests/test_api.py](polyguard-rl/tests/test_api.py), [tests/test_medication_alternatives.py](polyguard-rl/tests/test_medication_alternatives.py), [tests/test_remote_env.py](polyguard-rl/tests/test_remote_env.py) |
| Data and evidence | [tests/test_parser.py](polyguard-rl/tests/test_parser.py), [tests/test_dataops_parser.py](polyguard-rl/tests/test_dataops_parser.py), [tests/test_graph_infer.py](polyguard-rl/tests/test_graph_infer.py), [tests/test_submission_evidence.py](polyguard-rl/tests/test_submission_evidence.py) |
| Submission, notebook, and HF flow | [tests/test_acceptance_gate.py](polyguard-rl/tests/test_acceptance_gate.py), [tests/test_runner_notebook.py](polyguard-rl/tests/test_runner_notebook.py), [tests/test_hf_training_sweep.py](polyguard-rl/tests/test_hf_training_sweep.py) |

Additional architecture diagrams:

- [System architecture](polyguard-rl/docs/assets/diagrams/system_architecture.png)
- [Runtime step flow](polyguard-rl/docs/assets/diagrams/runtime_step_flow.png)
- [Data and training pipeline](polyguard-rl/docs/assets/diagrams/data_training_pipeline.png)
- [Multi-agent orchestration](polyguard-rl/docs/assets/diagrams/multi_agent_orchestration.png)
- [Reward decomposition](polyguard-rl/docs/assets/diagrams/reward_decomposition.png)
- [Episode state machine](polyguard-rl/docs/assets/diagrams/episode_state_machine.png)
- [Evidence generation flow](polyguard-rl/docs/assets/diagrams/evidence_generation_flow.png)
- [Deployment topology](polyguard-rl/docs/assets/diagrams/deployment_topology.png)
- [Frontend runtime surface](polyguard-rl/docs/assets/diagrams/frontend_runtime_surface.png)

## Limitations

PolyGuard is a simulator and research environment. Its current data substrate
is compact and intentionally inspectable, not a production clinical knowledge
base. The final evidence set is strongest for Qwen 3B because that run has
complete SFT, GRPO, post-save GRPO, policy-ablation, adapter, and checkpoint
evidence. Qwen 0.5B and 1.5B have SFT reports/histories and post-save SFT
evidence, but their adapter directories are marked `reports_only_or_partial` in
the final manifest.

The reward model is hand-designed and auditable. That is a feature for this
OpenEnv setting, but it also means reward-channel design should be
stress-tested as the data grows. The current ablations show that contextual
bandits are useful and inspectable, while the `llm+bandit` combined stack needs
more tuning to avoid loop-like failure behavior in some settings.

The right conclusion is not "this is a clinical decision system." The right
conclusion is that constrained environment feedback, verifier-backed rewards,
agentic evaluation, and explicit failure mining are a better substrate for
safety-critical medication-policy learning than free-form prompt responses.

## References

- Alexandre Larouche, Audrey Durand, Richard Khoury, Caroline Sirois.
  [Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy](https://arxiv.org/abs/2212.05190).
  arXiv:2212.05190.
- World Health Organization.
  [Medication Without Harm](https://www.who.int/initiatives/medication-without-harm).
- CDC.
  [FastStats: Medication Safety Data](https://www.cdc.gov/medication-safety/data-research/facts-stats/index.html).
- Shehab N, Lovegrove MC, Geller AI, Rose KO, Weidle NJ, Budnitz DS.
  [US Emergency Department Visits for Outpatient Adverse Drug Events, 2013-2014](https://jamanetwork.com/journals/jama/fullarticle/2585977).
  JAMA. 2016;316(20):2115-2125.
- AHRQ / NCBI Bookshelf.
  [Deprescribing To Reduce Medication Harms in Older Adults](https://www.ncbi.nlm.nih.gov/books/NBK600387/).
- American Geriatrics Society.
  [2023 updated AGS Beers Criteria for potentially inappropriate medication use in older adults](https://pmc.ncbi.nlm.nih.gov/articles/PMC12478568/).
- O'Mahony et al.
  [STOPP/START criteria for potentially inappropriate prescribing in older people: version 3](https://pmc.ncbi.nlm.nih.gov/articles/PMC10447584/).

## License

The project package declares an MIT license in
[polyguard-rl/pyproject.toml](polyguard-rl/pyproject.toml). See
[polyguard-rl/LICENSE](polyguard-rl/LICENSE) for the license text.