# Mathematics

## POMDP Framing

PolyGuard can be viewed as a partially observable Markov decision process:

```text
M = (S, A, O, T, R, gamma)
```

- `S`: latent patient/regimen state, including risks and unresolved conflicts.
- `A`: constrained medication actions emitted through `PolyGuardAction`.
- `O`: observable patient summary, medications, labs, warnings, candidate set, and uncertainty indicators.
- `T`: transition dynamics that apply medication changes, evidence updates, dosing holds, taper actions, and review escalation.
- `R`: decomposed reward over safety, clinical improvement, dosing quality, process integrity, and anti-cheat penalties.
- `gamma`: discount factor. It is not set explicitly; step budgets and the efficiency reward impose a finite horizon and play its role.

## Action Selection

The policy chooses a candidate action from the legal candidate set:

```text
a_t = pi(o_t, C_t)
```

where `C_t` is generated from rule-based clinical candidates and filtered by legality checks. The contextual bandit can rerank candidates before planner selection.
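The selection step above can be sketched as follows. This is a minimal illustration, not PolyGuard's implementation: the `Candidate` class, its `bandit_score` field, the `planner_score` callable, and the `top_k` cutoff are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for one entry of the candidate set C_t."""
    name: str
    legal: bool          # outcome of the legality checks
    bandit_score: float  # hypothetical reranking score from the bandit


def select_action(
    candidates: Sequence[Candidate],
    planner_score: Callable[[Candidate], float],
    top_k: int = 3,
) -> Candidate:
    """Filter by legality, rerank by bandit score, let the planner pick."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    legal = [c for c in candidates if c.legal]
    if not legal:
        raise ValueError("no legal candidate available")
    # Reranking: the planner only sees the top_k candidates by bandit score.
    reranked = sorted(legal, key=lambda c: c.bandit_score, reverse=True)
    return max(reranked[:top_k], key=planner_score)
```

In this sketch the bandit narrows the field and the planner makes the final choice, so an illegal candidate can never be selected no matter how it is scored.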

## Reward Aggregation

Reward components are scaled, clamped, and aggregated:

```text
r_t = clamp(sum_i w_i c_i, 0.001, 0.999)
```

Primary channels are averages over semantically related component scores. Keeping the channels separate makes it possible to debug cases where total reward rises for the wrong reason.
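A minimal sketch of the aggregation formula, assuming components and weights are keyed by name. The function names and the example component names are hypothetical; only the clamp bounds `0.001` and `0.999` come from the formula above.

```python
from typing import Mapping, Sequence


def clamp(x: float, lo: float = 0.001, hi: float = 0.999) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def channel_average(scores: Sequence[float]) -> float:
    """Average related component scores into one primary channel."""
    if not scores:
        raise ValueError("a channel needs at least one component score")
    return sum(scores) / len(scores)


def aggregate_reward(
    components: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Compute r_t = clamp(sum_i w_i * c_i, 0.001, 0.999)."""
    total = sum(weights[name] * score for name, score in components.items())
    return clamp(total)


# Example with made-up component names and weights.
reward = aggregate_reward(
    {"safety": 0.8, "dosing": 0.6},
    {"safety": 0.5, "dosing": 0.5},
)
```

Because of the clamp, the aggregated reward never reaches exactly 0 or 1, even when the weighted sum falls outside that range.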

## Anti-Cheat Penalty

If the anti-cheat detector flags an exploit, `anti_cheat_score` drops to its floor of `0.001`, and the episode can terminate with `exploit_detection`.

```text
anti_cheat_score = 0.001 if exploit else 0.999
```
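The rule can be written as a small function. The sketch below also shows one way the termination reason might be surfaced; the `step_outcome` helper and its `terminate_on_exploit` flag are hypothetical and only illustrate that termination is optional.

```python
from typing import Optional, Tuple


def anti_cheat_score(exploit: bool) -> float:
    """Floor score when an exploit is flagged, ceiling score otherwise."""
    return 0.001 if exploit else 0.999


def step_outcome(
    exploit: bool,
    terminate_on_exploit: bool = True,  # hypothetical switch
) -> Tuple[float, Optional[str]]:
    """Return the anti-cheat score and a termination reason, if any."""
    score = anti_cheat_score(exploit)
    reason = "exploit_detection" if exploit and terminate_on_exploit else None
    return score, reason
```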

## Uncertainty And Abstention

Uncertainty is estimated from missing data, conflicts, candidate risk, and environment state. Review escalation is rewarded when uncertainty is high and penalized when used as a repeated escape hatch.

```text
calibration = clamp(1 - |confidence - (1 - uncertainty)|)
```
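A short sketch of the calibration score. The formula above does not state the clamp bounds, so this example assumes the same `0.001` and `0.999` bounds used for reward aggregation; the function name is also hypothetical.

```python
def calibration(
    confidence: float,
    uncertainty: float,
    lo: float = 0.001,  # assumed bound, not stated in the formula
    hi: float = 0.999,  # assumed bound, not stated in the formula
) -> float:
    """Score how closely stated confidence matches 1 - uncertainty."""
    raw = 1.0 - abs(confidence - (1.0 - uncertainty))
    return max(lo, min(hi, raw))


# Well calibrated: confidence 0.3 matches 1 - 0.7, so the score hits the cap.
well_calibrated = calibration(confidence=0.3, uncertainty=0.7)

# Overconfident: confidence 0.9 against high uncertainty is scored lower.
overconfident = calibration(confidence=0.9, uncertainty=0.7)
```

The score is highest when confidence equals `1 - uncertainty` and falls linearly with the gap between them.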

## Curriculum

Difficulty progresses from short-horizon easy cases to medium and hard cases with more medications, conflicts, missing data, and specialized sub-environments. Starting with easy cases gives the policy a realistic chance of earning above-floor reward during early training.
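One simple way to stage such a curriculum is to map training progress to a difficulty tier. The sketch below is an assumption for illustration only: the `difficulty_for` function and the one-third and two-thirds thresholds are hypothetical, and the actual schedule may instead depend on reward or success rate.

```python
def difficulty_for(progress: float) -> str:
    """Map training progress in [0, 1] to a difficulty tier.

    The equal-thirds split is a hypothetical choice for this sketch.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be within [0, 1]")
    if progress < 1 / 3:
        return "easy"
    if progress < 2 / 3:
        return "medium"
    return "hard"
```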