TheJackBright and Claude Opus 4.6 committed
Commit 1bb11d9 · 1 Parent(s): c314a65

Fix API reward clamp to (0.001, 0.999) and update README


- app.py: align API-layer reward clamp to 0.001/0.999
- README: update reward table to reflect all-positive clamped rewards

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

README.md CHANGED
@@ -38,7 +38,7 @@ PolypharmacyEnv frames medication review as a **Markov Decision Process (MDP)**:
  - **State**: Patient profile (age, conditions, organ function) + current medication list + interaction history
  - **Action space**: `query_ddi(drug_i, drug_j)` | `propose_intervention(target, type)` | `finish_review`
- - **Reward**: Shaped, dense signal at every step (not sparse end-of-episode). Queries cost budget (-0.015), discovering severe DDIs earns bonus (+0.05), successful interventions earn proportional risk reduction minus cost, invalid actions are penalized (-0.15), and `finish_review` triggers a grader that returns a terminal score in [0.0, 1.0].
+ - **Reward**: Shaped, dense signal at every step (not sparse end-of-episode), strictly in the range (0.001, 0.999). Queries have a small cost, but discovering severe DDIs earns a larger bonus. Successful interventions earn proportional risk reduction. Invalid actions and timeouts are penalized, but all values are clamped to remain positive. `finish_review` triggers a grader returning a terminal score in (0.001, 0.999).
  - **Constraint**: Finite query and intervention budgets, creating a resource-allocation optimization problem.
 
  This MDP is what makes the problem fundamentally different from static risk scoring: the agent must **decide what information to acquire** (which drug pairs to query) and **which interventions to prioritize**, all under budget constraints — a sequential decision problem that RL is designed to solve.
@@ -171,23 +171,23 @@ Each observation contains the full patient context:
  - **Medium**: 50% risk reduction + 30% intervention precision + 20% query efficiency
  - **Hard**: Risk reduction minus penalties for excessive drug changes and stopping critical medications without substitution
 
- All graders are deterministic, producing scores in `[0.0, 1.0]`.
+ All graders are deterministic, producing scores strictly in `(0.001, 0.999)`.
 
  ---
 
  ## Reward Function Design
 
- The shaped reward provides signal at every step (not just episode end):
+ The shaped reward provides signal at every step (not just episode end). All rewards are strictly positive, clamped to the range **(0.001, 0.999)**:
 
- | Event | Reward |
- |---|---|
- | DDI query (any) | -0.015 (budget cost) |
- | Discovering a severe DDI | +0.05 bonus |
- | Discovering a moderate DDI | +0.02 bonus |
- | Successful intervention | +(risk_reduction) - 0.025 cost |
- | Invalid action | -0.15 penalty |
- | Episode timeout | -0.25 penalty |
- | Finish review | +grader_score (0.0–1.0) |
+ | Event | Raw Signal | Clamped Output |
+ |---|---|---|
+ | DDI query (no finding) | small cost | 0.001 (floor) |
+ | Discovering a severe DDI | cost + bonus | ~0.035 |
+ | Discovering a moderate DDI | cost + bonus | ~0.005 |
+ | Successful intervention | risk_reduction - cost | proportional to risk improvement |
+ | Invalid action | penalty | 0.001 (floor) |
+ | Episode timeout | penalty | 0.001 (floor) |
+ | Finish review | grader_score | 0.001–0.999 |
 
  **Regimen risk** aggregates DDI pairwise scores, Beers-criteria violation weights, and high-risk elderly drug penalties, normalized by regimen size and clipped to `[0.0, 1.0]`.
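The new "Clamped Output" column follows mechanically from applying the API-layer clamp to the raw per-step signals of the removed table (query cost -0.015, severe-DDI bonus +0.05, moderate-DDI bonus +0.02). A minimal sketch of that arithmetic, using a hypothetical `clamp_reward` helper that mirrors the inline expression added to `app.py` below:

```python
# Sketch of the (0.001, 0.999) clamp applied to example raw signals.
# Raw values come from the pre-change reward table; clamp_reward is a
# hypothetical helper mirroring the inline expression in app.py.

def clamp_reward(raw: float) -> float:
    """Clamp a raw shaped reward to the 0.001/0.999 API bounds."""
    return max(0.001, min(0.999, float(raw)))

examples = {
    "DDI query, no finding": -0.015,           # floors at 0.001
    "severe DDI discovered": -0.015 + 0.05,    # 0.035, passes through
    "moderate DDI discovered": -0.015 + 0.02,  # 0.005, passes through
    "invalid action": -0.15,                   # floors at 0.001
    "episode timeout": -0.25,                  # floors at 0.001
}

for event, raw in examples.items():
    print(f"{event}: raw={raw:+.3f} -> clamped={clamp_reward(raw):.3f}")
```

Note that `max`/`min` make both endpoints attainable, so the interval is closed in practice; the table's `0.001 (floor)` entries reflect this.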
 
 
 
backend/src/polypharmacy_env/api/app.py CHANGED
@@ -103,9 +103,12 @@ def create_polypharmacy_app():
      obs_data = _serialize_obs(obs)
      # Extract metadata for top-level info
      metadata = obs_data.get("metadata", {}) or {}
+     raw_reward = obs_data.get("shaped_reward", 0.001)
+     # Clamp reward to strict (0.001, 0.999) bounds
+     clamped_reward = max(0.001, min(0.999, float(raw_reward)))
      return {
          "observation": obs_data,
-         "reward": obs_data.get("shaped_reward", 0.0),
+         "reward": clamped_reward,
          "done": obs_data.get("done", False),
          "info": metadata,
      }
 
  }