| # Mathematics |
|
|
| ## POMDP Framing |
|
|
| PolyGuard can be viewed as a partially observable Markov decision process: |
|
|
| ```text |
| M = (S, A, O, T, R, gamma) |
| ``` |
|
|
| - `S`: latent patient/regimen state, including risks and unresolved conflicts. |
| - `A`: constrained medication actions emitted through `PolyGuardAction`. |
| - `O`: observable patient summary, medications, labs, warnings, candidate set, and uncertainty indicators. |
| - `T`: transition dynamics that apply medication changes, evidence updates, dosing holds, taper actions, and review escalation. |
| - `R`: decomposed reward over safety, clinical improvement, dosing quality, process integrity, and anti-cheat penalties. |
| - `gamma`: discount factor. It is not set explicitly; finite step budgets and an efficiency reward act as an implicit finite-horizon discount. |
|
|
| ## Action Selection |
|
|
| The policy chooses a candidate action from the legal candidate set: |
|
|
| ```text |
| a_t = pi(o_t, C_t) |
| ``` |
|
|
| where `o_t` is the current observation and `C_t` is the candidate set, generated from rule-based clinical candidates and filtered by legality checks. A contextual bandit can rerank the candidates before the planner selects one. |
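
The selection step can be sketched as follows. This is a minimal illustration, not the project's implementation: `Candidate`, `select_action`, `policy_score`, and `rerank` are hypothetical names, the legality check is reduced to a boolean flag, and the observation `o_t` is assumed to be captured inside the scoring function.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate action; the real PolyGuardAction schema may differ."""
    name: str
    legal: bool  # stand-in for the outcome of the legality checks


def select_action(
    candidates: Sequence[Candidate],
    policy_score: Callable[[Candidate], float],
    rerank: Optional[Callable[[Sequence[Candidate]], Sequence[Candidate]]] = None,
) -> Optional[Candidate]:
    # Build C_t: keep only candidates that pass the legality check.
    legal = [c for c in candidates if c.legal]
    # Optional reranking step (a stand-in for the contextual bandit).
    if rerank is not None:
        legal = list(rerank(legal))
    if not legal:
        return None  # nothing legal to choose from
    # a_t = pi(o_t, C_t): pick the highest-scoring legal candidate.
    # max() returns the first maximal item, so rerank order breaks ties.
    return max(legal, key=policy_score)


candidates = [
    Candidate("stop_drug_a", legal=True),
    Candidate("double_dose_b", legal=False),
    Candidate("request_review", legal=True),
]
scores = {"stop_drug_a": 0.4, "double_dose_b": 0.9, "request_review": 0.6}
chosen = select_action(candidates, lambda c: scores[c.name])
```

In this sketch the illegal candidate is never scored against the others, even though it has the highest raw score, which is the property the legality filter is meant to guarantee.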
|
|
| ## Reward Aggregation |
|
|
| Reward components `c_i` are scaled by weights `w_i`, summed, and the total is clamped to `[0.001, 0.999]`: |
|
|
| ```text |
| r_t = clamp(sum_i w_i c_i, 0.001, 0.999) |
| ``` |
|
|
| Each primary channel is the average of semantically related component scores. Keeping the channels separate makes it possible to trace which channel is responsible when total reward rises for the wrong reason. |
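
A minimal sketch of the aggregation formula above, assuming components and weights are stored in dictionaries keyed by component name. The function names, the example component names, and the weights are illustrative assumptions, not values taken from PolyGuard.

```python
from statistics import mean
from typing import Mapping, Sequence


def clamp(x: float, lo: float = 0.001, hi: float = 0.999) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def channel_score(component_scores: Sequence[float]) -> float:
    """Average a group of semantically related component scores."""
    return mean(component_scores)


def aggregate_reward(
    components: Mapping[str, float], weights: Mapping[str, float]
) -> float:
    """r_t = clamp(sum_i w_i * c_i, 0.001, 0.999)."""
    total = sum(weights[name] * score for name, score in components.items())
    return clamp(total)


# Hypothetical component names and weights, for illustration only.
components = {"safety": 0.9, "dosing_quality": 0.5}
weights = {"safety": 0.5, "dosing_quality": 0.5}
reward = aggregate_reward(components, weights)  # weighted sum is about 0.7
```

Because the clamp bounds are strictly inside `(0, 1)`, the aggregated reward never reaches exactly 0 or 1 regardless of the weights.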
|
|
| ## Anti-Cheat Penalty |
|
|
| If the anti-cheat detector flags an exploit, `anti_cheat_score` drops to its floor of `0.001` and the episode can terminate with reason `exploit_detection`. Otherwise the score stays at its ceiling of `0.999`. |
|
|
| ```text |
| anti_cheat_score = 0.001 if exploit else 0.999 |
| ``` |
|
|
| ## Uncertainty And Abstention |
|
|
| Uncertainty is estimated from missing data, conflicts, candidate risk, and environment state. Review escalation is rewarded when uncertainty is high and penalized when used as a repeated escape hatch. |
|
|
| ```text |
| calibration = clamp(1 - |confidence - (1 - uncertainty)|) |
| ``` |
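
The calibration formula can be worked through in a few lines. The source does not state the clamp bounds for this score, so the sketch below assumes `[0, 1]` and exposes them as parameters; the function name is likewise an assumption.

```python
def calibration(
    confidence: float, uncertainty: float, lo: float = 0.0, hi: float = 1.0
) -> float:
    """calibration = clamp(1 - |confidence - (1 - uncertainty)|).

    The bounds lo and hi are assumed; the formula in the text leaves them implicit.
    """
    raw = 1.0 - abs(confidence - (1.0 - uncertainty))
    return max(lo, min(hi, raw))


well_calibrated = calibration(confidence=0.8, uncertainty=0.2)  # confidence matches 1 - uncertainty
overconfident = calibration(confidence=0.9, uncertainty=0.9)    # high confidence despite high uncertainty
```

The score is highest when stated confidence equals `1 - uncertainty` and falls linearly as the two diverge, so a confident action under high uncertainty scores poorly.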
|
|
| ## Curriculum |
|
|
| Difficulty progresses from short-horizon easy cases to medium and hard cases with more medications, conflicts, missing data, and specialized sub-environments. Starting with easy cases gives the policy a real chance of earning more than floor-level reward early in training, so it receives a learning signal from the start. |
|
|