Commit: 1bb11d9
Parent(s): c314a65
Fix API reward clamp to (0.001, 0.999) and update README
- app.py: align API-layer reward clamp to 0.001/0.999
- README: update reward table to reflect all-positive clamped rewards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- README.md +12 -12
- backend/src/polypharmacy_env/api/app.py +4 -1
README.md CHANGED

```diff
@@ -38,7 +38,7 @@ PolypharmacyEnv frames medication review as a **Markov Decision Process (MDP)**:
 
 - **State**: Patient profile (age, conditions, organ function) + current medication list + interaction history
 - **Action space**: `query_ddi(drug_i, drug_j)` | `propose_intervention(target, type)` | `finish_review`
-- **Reward**: Shaped, dense signal at every step (not sparse end-of-episode)
+- **Reward**: Shaped, dense signal at every step (not sparse end-of-episode), strictly in the range (0.001, 0.999). Queries have a small cost, but discovering severe DDIs earns a larger bonus. Successful interventions earn proportional risk reduction. Invalid actions and timeouts are penalized but all values are clamped to positive. `finish_review` triggers a grader returning a terminal score in (0.001, 0.999).
 - **Constraint**: Finite query and intervention budgets, creating a resource-allocation optimization problem.
 
 This MDP is what makes the problem fundamentally different from static risk scoring: the agent must **decide what information to acquire** (which drug pairs to query) and **which interventions to prioritize**, all under budget constraints — a sequential decision problem that RL is designed to solve.
@@ -171,23 +171,23 @@ Each observation contains the full patient context:
 - **Medium**: 50% risk reduction + 30% intervention precision + 20% query efficiency
 - **Hard**: Risk reduction minus penalties for excessive drug changes and stopping critical medications without substitution
 
-All graders are deterministic, producing scores in `
+All graders are deterministic, producing scores strictly in `(0.001, 0.999)`.
 
 ---
 
 ## Reward Function Design
 
-The shaped reward provides signal at every step (not just episode end):
+The shaped reward provides signal at every step (not just episode end). All rewards are strictly positive, clamped to the range **(0.001, 0.999)**:
 
-| Event |
-|---|---|
-| DDI query (
-| Discovering a severe DDI | +
-| Discovering a moderate DDI | +
-| Successful intervention |
-| Invalid action |
-| Episode timeout |
-| Finish review |
+| Event | Raw Signal | Clamped Output |
+|---|---|---|
+| DDI query (no finding) | small cost | 0.001 (floor) |
+| Discovering a severe DDI | cost + bonus | ~0.035 |
+| Discovering a moderate DDI | cost + bonus | ~0.005 |
+| Successful intervention | risk_reduction - cost | proportional to risk improvement |
+| Invalid action | penalty | 0.001 (floor) |
+| Episode timeout | penalty | 0.001 (floor) |
+| Finish review | grader_score | 0.001–0.999 |
 
 **Regimen risk** aggregates DDI pairwise scores, Beers-criteria violation weights, and high-risk elderly drug penalties, normalized by regimen size and clipped to `[0.0, 1.0]`.
 
```

(Several removed lines above were truncated in the rendered page and are kept as-is.)
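The clamping scheme in the new reward table can be sketched as a small helper. This is illustrative only: the `RAW_SIGNALS` values for costs and penalties are hypothetical placeholders (the commit does not show them); only the ~0.035 and ~0.005 clamped outputs come from the table itself.

```python
# Sketch of the clamped shaped-reward scheme described in the README diff.
# RAW_SIGNALS values below are invented for illustration, not from the repo.

REWARD_FLOOR = 0.001
REWARD_CEIL = 0.999

RAW_SIGNALS = {
    "query_no_finding": -0.01,    # small cost -> clamped to floor
    "severe_ddi_found": 0.035,    # cost + bonus (per the table)
    "moderate_ddi_found": 0.005,  # cost + bonus (per the table)
    "invalid_action": -0.05,      # penalty -> clamped to floor
    "timeout": -0.10,             # penalty -> clamped to floor
}

def shaped_reward(event: str) -> float:
    """Clamp a raw per-event signal into the positive (0.001, 0.999) band."""
    raw = RAW_SIGNALS[event]
    return max(REWARD_FLOOR, min(REWARD_CEIL, raw))
```

Every negative raw signal (query costs, penalties) collapses to the 0.001 floor, which is how the table's "all-positive" property falls out of a single clamp rather than per-event special cases.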
backend/src/polypharmacy_env/api/app.py CHANGED

```diff
@@ -103,9 +103,12 @@ def create_polypharmacy_app():
         obs_data = _serialize_obs(obs)
         # Extract metadata for top-level info
         metadata = obs_data.get("metadata", {}) or {}
+        raw_reward = obs_data.get("shaped_reward", 0.001)
+        # Clamp reward to strict (0.001, 0.999) bounds
+        clamped_reward = max(0.001, min(0.999, float(raw_reward)))
         return {
             "observation": obs_data,
-            "reward":
+            "reward": clamped_reward,
             "done": obs_data.get("done", False),
             "info": metadata,
         }
```

(The removed `"reward":` line was truncated in the rendered page and is kept as-is.)
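The API-layer clamp added in this commit can be exercised in isolation; a minimal sketch, where the `clamp_reward` wrapper name is mine but the expression matches the diff:

```python
def clamp_reward(raw_reward) -> float:
    # Same expression as the app.py diff: bound the reward to [0.001, 0.999].
    return max(0.001, min(0.999, float(raw_reward)))
```

One nuance worth noting: `max`/`min` with these constants yields a *closed* [0.001, 0.999] interval. A raw value of exactly 0.001 or 0.999 passes through unchanged, so "strict (0.001, 0.999)" in the comment and README is really inclusive at the endpoints.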