	# Mathematics

	## POMDP Framing

	PolyGuard can be viewed as a partially observable Markov decision process:

	```text
	M = (S, A, O, T, R, gamma)
	```

	- `S`: latent patient/regimen state, including risks and unresolved conflicts.
	- `A`: constrained medication actions emitted through `PolyGuardAction`.
	- `O`: observable patient summary, medications, labs, warnings, candidate set, and uncertainty indicators.
	- `T`: transition dynamics that apply medication changes, evidence updates, dosing holds, taper actions, and review escalation.
	- `R`: decomposed reward over safety, clinical improvement, dosing quality, process integrity, and anti-cheat penalties.
	- `gamma`: discount factor. Discounting is implicit: the finite horizon is enforced through step budgets and an efficiency reward.
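The finite-horizon reading of `gamma` can be illustrated with a generic rollout loop that stops either on termination or when a step budget is exhausted. This is a minimal sketch: `rollout`, `make_toy_env`, `always_review`, and the type aliases are hypothetical names for illustration, not part of PolyGuard.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; not the PolyGuard API.
Observation = Dict[str, int]
Action = str
StepFn = Callable[[Action], Tuple[Observation, float, bool]]
Policy = Callable[[Observation], Action]


def rollout(first_obs: Observation, policy: Policy, step_fn: StepFn,
            step_budget: int) -> List[float]:
    """Run one episode, stopping at termination or when the budget runs out."""
    rewards: List[float] = []
    obs = first_obs
    for _ in range(step_budget):
        action = policy(obs)
        obs, reward, done = step_fn(action)
        rewards.append(reward)
        if done:
            break
    return rewards


def make_toy_env(done_after: int) -> StepFn:
    """Toy transition function: constant reward, done after `done_after` steps."""
    state = {"t": 0}

    def step(action: Action) -> Tuple[Observation, float, bool]:
        state["t"] += 1
        return {"t": state["t"]}, 0.5, state["t"] >= done_after

    return step


def always_review(obs: Observation) -> Action:
    """Toy policy that ignores the observation."""
    return "request_review"


# The budget caps the first episode; the second terminates on its own.
capped = rollout({"t": 0}, always_review, make_toy_env(done_after=10), step_budget=3)
finished = rollout({"t": 0}, always_review, make_toy_env(done_after=2), step_budget=5)
print(len(capped), len(finished))  # 3 2
```

Because the budget bounds the episode length, the undiscounted sum of rewards stays finite without an explicit `gamma`.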

	## Action Selection

	The policy chooses a candidate action from the legal candidate set:

	```text
	a_t = pi(o_t, C_t)
	```

	where `o_t` is the observation at step `t` and `C_t` is the candidate set, generated from rule-based clinical candidates and then filtered by legality checks. The contextual bandit can rerank the candidates before the planner makes its selection.
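The filter, rerank, and select pipeline described above might look roughly like the following. This is a sketch under assumptions: `select_action`, the callback signatures, the toy legality rule, and the choice to raise on an empty legal set are all hypothetical, not taken from the PolyGuard code.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical types for illustration; not the PolyGuard API.
Observation = Dict[str, bool]
Action = str


def select_action(
    obs: Observation,
    raw_candidates: List[Action],
    is_legal: Callable[[Observation, Action], bool],
    policy: Callable[[Observation, List[Action]], Action],
    rerank: Optional[Callable[[Observation, List[Action]], List[Action]]] = None,
) -> Action:
    """Compute a_t = pi(o_t, C_t), where C_t is the legal (optionally reranked) set."""
    legal = [c for c in raw_candidates if is_legal(obs, c)]
    if not legal:
        # Assumption: an empty legal set is treated as an error.
        raise ValueError("no legal candidate actions")
    if rerank is not None:
        legal = rerank(obs, legal)
    return policy(obs, legal)


# Toy components, for illustration only.
def toy_is_legal(obs: Observation, action: Action) -> bool:
    return not (obs["dose_increase_blocked"] and action == "increase_dose")


def pick_first(obs: Observation, candidates: List[Action]) -> Action:
    return candidates[0]


def reverse_rerank(obs: Observation, candidates: List[Action]) -> List[Action]:
    return list(reversed(candidates))


obs = {"dose_increase_blocked": True}
raw = ["increase_dose", "hold_dose", "request_review"]
plain = select_action(obs, raw, toy_is_legal, pick_first)
reranked = select_action(obs, raw, toy_is_legal, pick_first, rerank=reverse_rerank)
print(plain, reranked)  # hold_dose request_review
```

Running the legality filter before the reranker means the policy never sees an illegal action, whatever the reranker does with the order.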

	## Reward Aggregation

	Reward components `c_i` are scaled by weights `w_i`, summed, and the total is clamped:

	```text
	r_t = clamp(sum_i w_i c_i, 0.001, 0.999)
	```

	Each primary channel is the average of a group of semantically related component scores. Keeping the channels separate makes it possible to debug the reward when the total rises for the wrong reason.
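As a rough illustration of the aggregation formula, the sketch below averages component scores into channels and then clamps the weighted sum. The channel names, weights, and helper names (`clamp`, `channel_score`, `aggregate_reward`) are assumptions chosen for the example, not values from the project.

```python
from typing import Dict, Sequence


def clamp(x: float, lo: float = 0.001, hi: float = 0.999) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def channel_score(components: Sequence[float]) -> float:
    """Average a group of related component scores into one channel."""
    return sum(components) / len(components)


def aggregate_reward(channels: Dict[str, float], weights: Dict[str, float]) -> float:
    """Compute r_t = clamp(sum_i w_i * c_i, 0.001, 0.999)."""
    total = sum(weights[name] * score for name, score in channels.items())
    return clamp(total)


# Hypothetical channels and weights, for illustration only.
channels = {
    "safety": channel_score([0.9, 0.7]),
    "process": channel_score([0.4]),
}
weights = {"safety": 0.5, "process": 0.5}

reward = aggregate_reward(channels, weights)
print(round(reward, 3))  # 0.6
```

Logging `channels` alongside `reward` is what makes the total traceable: if the total goes up while `safety` goes down, the per-channel values show it.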

	## Anti-Cheat Penalty

	If the anti-cheat detector flags an exploit, `anti_cheat_score` drops to its floor value (otherwise it stays at its ceiling), and the episode can terminate with `exploit_detection`.

	```text
	anti_cheat_score = 0.001 if exploit else 0.999
	```
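A minimal sketch of how the score and the optional termination could be combined. The `apply_anti_cheat` helper and its `terminate_on_exploit` switch are assumptions; the text above only says the episode *can* terminate.

```python
from typing import Optional, Tuple


def anti_cheat_score(exploit: bool) -> float:
    """Floor value when an exploit is flagged, ceiling value otherwise."""
    return 0.001 if exploit else 0.999


def apply_anti_cheat(
    exploit: bool, terminate_on_exploit: bool = True
) -> Tuple[float, bool, Optional[str]]:
    """Return (score, done, termination_reason) for one step.

    `terminate_on_exploit` is a hypothetical switch, not a documented option.
    """
    score = anti_cheat_score(exploit)
    if exploit and terminate_on_exploit:
        return score, True, "exploit_detection"
    return score, False, None


print(apply_anti_cheat(exploit=True))   # (0.001, True, 'exploit_detection')
print(apply_anti_cheat(exploit=False))  # (0.999, False, None)
```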

	## Uncertainty And Abstention

	Uncertainty is estimated from missing data, conflicts, candidate risk, and environment state. Review escalation is rewarded when uncertainty is high and penalized when used as a repeated escape hatch.

	```text
	calibration = clamp(1 - |confidence - (1 - uncertainty)|)
	```
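The calibration term can be sketched as a small function. The clamp bounds are not stated for this formula, so the sketch assumes the same `[0.001, 0.999]` range used for the aggregated reward; that choice is an assumption.

```python
def clamp(x: float, lo: float = 0.001, hi: float = 0.999) -> float:
    """Restrict x to [lo, hi]. The default bounds are an assumption here."""
    return max(lo, min(hi, x))


def calibration(confidence: float, uncertainty: float) -> float:
    """Score how closely stated confidence tracks (1 - uncertainty)."""
    target = 1.0 - uncertainty
    return clamp(1.0 - abs(confidence - target))


# Low confidence under high uncertainty is well calibrated.
well_calibrated = calibration(confidence=0.2, uncertainty=0.8)
# Full confidence under full uncertainty is maximally miscalibrated.
overconfident = calibration(confidence=1.0, uncertainty=1.0)
print(well_calibrated, overconfident)  # 0.999 0.001
```

The score peaks when `confidence` equals `1 - uncertainty` and falls off linearly with the gap, so overconfidence and underconfidence are penalized symmetrically.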

	## Curriculum

	Difficulty progresses from short-horizon easy cases to medium and hard cases with more medications, conflicts, missing data, and specialized sub-environments. Starting with easy cases keeps the probability of earning a non-trivial reward above zero during early training, so there is a learning signal from the start.
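One simple way to realize such a progression is a schedule keyed on training progress. The sketch below is an assumption throughout: the text does not say how stage transitions are triggered, and `stage_for` with its `boundaries` values is hypothetical.

```python
from typing import Tuple

STAGES: Tuple[str, str, str] = ("easy", "medium", "hard")


def stage_for(
    episode: int,
    total_episodes: int,
    boundaries: Tuple[float, float] = (0.3, 0.7),
) -> str:
    """Map training progress to a difficulty stage.

    `boundaries` are hypothetical progress fractions at which the
    curriculum moves from easy to medium and from medium to hard.
    """
    progress = episode / total_episodes
    if progress < boundaries[0]:
        return STAGES[0]
    if progress < boundaries[1]:
        return STAGES[1]
    return STAGES[2]


schedule = [stage_for(e, 100) for e in (0, 50, 90)]
print(schedule)  # ['easy', 'medium', 'hard']
```

A schedule driven by recent success rate instead of episode count would fit the same description; the fixed-fraction version is just the shortest one to show.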