You are an expert Python backend, ML, and infrastructure engineer.
Your task is to implement a complete, production-ready OpenEnv environment called **PolypharmacyEnv** for training and evaluating agentic RL policies that act as an "elderly polypharmacy safety agent" (clinical pharmacist assistant).
The deliverable MUST satisfy all of the following:
- Fully compliant with the OpenEnv spec (typed models, `step()` / `reset()` / `state()`, `openenv.yaml`, HTTP server, Dockerfile).
- Simulates a realistic healthcare workflow around elderly polypharmacy and dangerous drug combinations.
- Defines at least **3 tasks** (easy → medium → hard) with deterministic agent graders producing scores in [0.0, 1.0].
- Provides shaped rewards over the trajectory (not just sparse terminal rewards).
- Includes a baseline LLM-based inference script `inference.py` in the repo root, following the evaluation requirements:
- Uses the OpenAI Python client.
- Reads `OPENAI_API_KEY`, `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from the environment.
- Emits structured stdout logs in the exact `[START]`, `[STEP]`, `[END]` format from the OpenEnv sample inference script.
- Is containerized and deployable as a **Hugging Face Space** tagged with `openenv` that responds to OpenEnv-style `reset` / `step` / `state` HTTP calls.
Implement everything described below.
=================================================
1. Repository and folder structure
=================================================
Create a Python package repository with this structure (names are important unless clearly labeled as examples):
- `openenv-polypharmacy/`
  - `openenv.yaml`
  - `README.md`
  - `requirements.txt`
  - `Dockerfile`
  - `inference.py` # baseline LLM agent per spec
  - `pyproject.toml` or `setup.cfg` (optional but recommended)
  - `src/`
    - `polypharmacy_env/`
      - `__init__.py`
      - `config.py`
      - `models.py` # Action, Observation, State, helper models
      - `env_core.py` # PolypharmacyEnv implementation
      - `tasks.py` # task setup utilities
      - `graders.py` # deterministic graders for each task
      - `rewards.py` # reward shaping logic
      - `data_loader.py` # load/preprocess patient and lookup data
      - `ddi_simulator.py` # local DDI / guideline simulator
      - `api/`
        - `__init__.py`
        - `schemas.py` # HTTP request/response schemas
        - `server.py` # FastAPI app exposing OpenEnv endpoints
      - `baselines/`
        - `__init__.py`
        - `heuristic_agent.py` # simple rule-based baseline agent
        - `random_agent.py` # trivial random baseline (optional)
  - `tests/`
    - `__init__.py`
    - `test_env_core.py`
    - `test_api.py`
  - `data/`
    - `raw/` # placeholder for real/synthetic source data
    - `processed/`
    - `lookups/`
      - `ddi_rules.csv`
      - `beers_criteria.csv`
      - `drug_metadata.csv`
  - `scripts/`
    - `preprocess_data.py`
    - `run_validation.sh` # optional; runs OpenEnv validator, tests, etc.
Use Python 3.10+ with full type hints, and keep the code black/isort-compatible.
=================================================
2. Domain, data, and clinical abstraction
=================================================
2.1. Core scenario
Model an elderly patient (age ≥ 65) with:
- Demographics: age, sex.
- Comorbidities: e.g., hypertension, diabetes, heart failure, CKD, dementia.
- Basic labs: kidney function (eGFR category), liver function category.
- A current medication list (polypharmacy, e.g., 3–15 drugs depending on task).
Each **episode** is one medication-review session where the agent:
- Observes patient info and current meds.
- Optionally **queries** a DDI/guideline tool for specific drug pairs.
- Proposes **interventions**:
- `stop`: discontinue a drug.
- `dose_reduce`: lower dose of a drug.
- `substitute`: swap to a safer alternative.
- `add_monitoring`: keep the drug but flag extra monitoring.
- Calls `finish_review` when it decides the regimen is acceptable or budgets are exhausted.
No external PHI, EHRs, or online APIs: all data is **synthetic** or de-identified and local to the container (CSV files).
2.2. Data files and CSV schemas
Implement local CSVs under `data/lookups/`:
**`drug_metadata.csv`**
- `drug_id` (string; unique key)
- `generic_name` (string)
- `atc_class` (string)
- `is_high_risk_elderly` (0/1)
- `default_dose_mg` (float)
- `min_dose_mg` (float)
- `max_dose_mg` (float)
**`beers_criteria.csv`**
- `drug_id` (string)
- `criterion_type` (enum string: `avoid`, `caution`, `dose_adjust`, `avoid_in_condition`)
- `condition` (nullable string; e.g., `CKD`, `dementia`)
- `rationale` (brief text)
**`ddi_rules.csv`**
- `drug_id_1` (string; normalized so `drug_id_1 < drug_id_2` lexicographically)
- `drug_id_2` (string)
- `severity` (enum string: `mild`, `moderate`, `severe`)
- `mechanism` (short text)
- `recommendation` (enum string: `avoid_combination`, `monitor_closely`, `dose_adjust`, `no_action`)
- `base_risk_score` (float in [0.0, 1.0])
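For illustration, a couple of `ddi_rules.csv` rows might look like this (the drug IDs, mechanisms, and scores here are hypothetical placeholders; note the pair is stored once with `drug_id_1 < drug_id_2`):

```csv
drug_id_1,drug_id_2,severity,mechanism,recommendation,base_risk_score
D_AMIOD,D_WARF,severe,CYP2C9 inhibition raises INR,avoid_combination,0.85
D_IBUPR,D_LISIN,moderate,NSAID blunts ACE-inhibitor effect; renal risk,monitor_closely,0.40
```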
Implement a synthetic patient-episode dataset under `data/processed/`:
**`patients_polypharmacy.csv`**
- `episode_id` (string)
- `age` (int)
- `sex` (enum: `M`, `F`, `O`)
- `conditions` (semicolon-separated; e.g., `HTN;DM;CKD`)
- `eGFR_category` (enum: `normal`, `mild`, `moderate`, `severe`)
- `liver_function_category` (enum: `normal`, `impaired`)
- `medication_ids` (semicolon-separated list of `drug_id`)
- `baseline_risk_score` (float in [0.0, 1.0])
2.3. Preprocessing script
In `scripts/preprocess_data.py`:
- If real data is not provided, procedurally generate synthetic but plausible data using:
- Random combinations of conditions and drugs constrained by simple rules (e.g., CKD + renally-cleared drugs).
- Controlled distribution of high-risk DDIs and Beers violations.
- Explicitly tag episodes as easy/medium/hard (e.g., via number of drugs, number/severity of DDIs, and number of Beers issues).
- Save `patients_polypharmacy.csv` ready for the environment to consume.
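The difficulty-tagging rule could be sketched as follows; the exact thresholds are assumptions, chosen to mirror the regimen sizes in the task definitions below (3–5 drugs easy, 6–10 medium, 10–15 hard):

```python
def tag_difficulty(num_drugs: int, num_severe_ddis: int, num_beers_issues: int) -> str:
    """Tag a synthetic episode as easy/medium/hard.

    Thresholds are illustrative assumptions, not part of the spec.
    """
    if num_drugs <= 5 and num_severe_ddis <= 1 and num_beers_issues <= 1:
        return "easy"
    if num_drugs <= 10 and num_severe_ddis + num_beers_issues <= 5:
        return "medium"
    return "hard"
```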
=================================================
3. OpenEnv models and environment implementation
=================================================
3.1. Models
In `models.py`, define dataclasses or Pydantic models that extend the appropriate OpenEnv base types (`Action`, `Observation`, `State`) and are JSON-compatible.
Auxiliary models:
**`MedicationEntry`**
- `drug_id: str`
- `generic_name: str`
- `atc_class: str`
- `dose_mg: float`
- `frequency: str` # e.g., `qd`, `bid`
- `route: str` # e.g., `po`
- `is_high_risk_elderly: bool`
- `beers_flags: list[str]` # e.g., `["avoid", "dose_adjust_CKD"]`
**`InteractionQueryRecord`**
- `drug_id_1: str`
- `drug_id_2: str`
- `severity: str | None`
- `recommendation: str | None`
- `risk_score: float | None`
- `step_index: int`
**`InterventionRecord`**
- `target_drug_id: str`
- `action_type: Literal["stop", "dose_reduce", "substitute", "add_monitoring"]`
- `proposed_new_drug_id: str | None`
- `rationale: str`
- `step_index: int`
Core wire models:
**`PolypharmacyObservation`** (extends OpenEnv `Observation`)
- `episode_id: str`
- `task_id: Literal["easy_screening", "budgeted_screening", "complex_tradeoff"]`
- `age: int`
- `sex: str`
- `conditions: list[str]`
- `eGFR_category: str`
- `liver_function_category: str`
- `current_medications: list[MedicationEntry]`
- `interaction_queries: list[InteractionQueryRecord]`
- `interventions: list[InterventionRecord]`
- `step_index: int`
- `remaining_query_budget: int`
- `remaining_intervention_budget: int`
- `shaped_reward: float` # reward from last step
- `done: bool`
**`PolypharmacyAction`** (extends OpenEnv `Action`)
- `action_type: Literal["query_ddi", "propose_intervention", "finish_review"]`
- `drug_id_1: str | None` # for DDI queries or some interventions
- `drug_id_2: str | None` # for DDI queries
- `target_drug_id: str | None` # for interventions
- `intervention_type: Literal["stop", "dose_reduce", "substitute", "add_monitoring", "none"] | None`
- `proposed_new_drug_id: str | None`
- `rationale: str | None`
**`PolypharmacyState`** (extends OpenEnv `State`)
- `episode_id: str`
- `task_id: str`
- `step_count: int`
- `max_steps: int`
- `num_query_actions: int`
- `num_interventions: int`
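A minimal sketch of two of these wire models, using plain dataclasses as stand-ins for the real OpenEnv base types (the actual `models.py` should subclass `Action`/`State` from the OpenEnv package instead):

```python
from dataclasses import asdict, dataclass
from typing import Literal, Optional

@dataclass
class PolypharmacyAction:
    action_type: Literal["query_ddi", "propose_intervention", "finish_review"]
    drug_id_1: Optional[str] = None
    drug_id_2: Optional[str] = None
    target_drug_id: Optional[str] = None
    intervention_type: Optional[str] = None
    proposed_new_drug_id: Optional[str] = None
    rationale: Optional[str] = None

@dataclass
class PolypharmacyState:
    episode_id: str
    task_id: str
    step_count: int = 0
    max_steps: int = 20
    num_query_actions: int = 0
    num_interventions: int = 0
```

Keeping every field optional-with-default on the action makes partial JSON payloads from an LLM easy to deserialize, with validation deferred to `step()`.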
3.2. Environment core
In `env_core.py`, implement `PolypharmacyEnv` extending the appropriate OpenEnv environment base class. It must implement:
**`reset(task_id: str | None = None) -> PolypharmacyObservation`**
- If `task_id` is `None`, default to medium (`budgeted_screening`).
- Sample an episode from `patients_polypharmacy.csv` filtered by difficulty.
- Initialize:
- `episode_id`
- `step_count = 0`
- task-specific budgets (query, interventions, max_steps)
- baseline regimen and risk
- empty `interaction_queries` and `interventions`
- Return the initial `PolypharmacyObservation` with:
- `step_index = 0`
- `shaped_reward = 0.0`
- `done = False`
**`step(action: PolypharmacyAction) -> dict`**
- Validate the action; if invalid:
- Apply a negative reward.
- Do not modify regimen, but log error in `info`.
- If `action_type == "query_ddi"`:
- If query budget exhausted, apply penalty and do not query.
- Else:
- Use `ddi_simulator.lookup_ddi(drug_id_1, drug_id_2)` to get severity, recommendation, base_risk_score.
- Append an `InteractionQueryRecord`.
- Apply a small negative reward for query cost.
- If `action_type == "propose_intervention"`:
- If intervention budget exhausted, apply penalty and ignore change.
- Else:
- Update `current_medications` according to `intervention_type`:
- `stop`: remove medication.
- `dose_reduce`: adjust dose downward within [min_dose_mg, default_dose_mg].
- `substitute`: replace with a safer alternative from same `atc_class`.
- `add_monitoring`: keep drug but tag in internal state.
- Append an `InterventionRecord`.
- Recompute current regimen risk using the risk model (see 3.3).
- Compute shaped reward = (previous_risk - new_risk) - small intervention cost.
- If `action_type == "finish_review"`:
- Mark `done = True`.
- Call the task’s grader to get episode-level score in [0.0, 1.0].
- Add this as a terminal bonus to the current step reward.
- In all cases:
- Increment `step_count`.
- Check `max_steps`; if exceeded, auto-terminate:
- `done = True`
- apply time-out penalty
- call grader with current trajectory for a final score if appropriate.
- Construct next `PolypharmacyObservation` with updated fields.
- Return a dict:
- `observation`: `PolypharmacyObservation`
- `reward`: float shaped reward for this step
- `done`: bool
- `info`: dict with fields like `current_risk`, `baseline_risk`, `grader_score_if_terminal`, and debug flags.
**`state` property**
- Returns `PolypharmacyState` reflecting the current internal state.
3.3. DDI simulator and risk model
In `ddi_simulator.py`:
- Load `ddi_rules.csv` once via `data_loader`.
- Implement `lookup_ddi(drug_id_1, drug_id_2) -> tuple[severity, recommendation, base_risk_score]`:
- Normalize the pair ordering.
- Look up row; if missing, return:
- severity = `"none"`
- recommendation = `"no_action"`
- base_risk_score = 0.0
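The lookup with pair normalization can be sketched like this (row loading via `data_loader` is stubbed out; rows are assumed to arrive as dicts keyed by the CSV column names):

```python
def _normalize_pair(a: str, b: str) -> tuple[str, str]:
    """ddi_rules.csv stores each pair once with drug_id_1 < drug_id_2."""
    return (a, b) if a < b else (b, a)

class DDISimulator:
    def __init__(self, rows: list[dict]):
        # rows: ddi_rules.csv contents, as loaded by data_loader
        self._rules = {
            _normalize_pair(r["drug_id_1"], r["drug_id_2"]): r for r in rows
        }

    def lookup_ddi(self, drug_id_1: str, drug_id_2: str) -> tuple[str, str, float]:
        rule = self._rules.get(_normalize_pair(drug_id_1, drug_id_2))
        if rule is None:
            # unknown pair: treat as no interaction
            return ("none", "no_action", 0.0)
        return (rule["severity"], rule["recommendation"], float(rule["base_risk_score"]))
```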
In `rewards.py` (or a dedicated module), implement:
- `compute_regimen_risk(current_drug_ids, patient_context, ddi_rules, beers_rules, drug_metadata) -> float`
- Aggregate contributions from:
- Beers violations (weighted by `criterion_type` and relevant conditions).
- DDI base risk scores for all present drug pairs.
- High-risk elderly drugs.
- Normalize and clip to [0.0, 1.0].
Use this function to compute:
- `baseline_risk` at episode start.
- Risk after each intervention step.
Also implement:
- `compute_shaped_reward(previous_risk, new_risk, action, context, partial_metrics) -> float`
- Positive component: `previous_risk - new_risk`.
- Negative components: per-query cost, per-intervention cost, invalid-action penalty, time-out penalty.
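A minimal sketch of the shaping function; the cost constants are assumptions (see 4.4 for suggested magnitudes), not prescribed values:

```python
# Illustrative cost constants; tune so per-episode totals stay in a small range.
QUERY_COST = 0.01
INTERVENTION_COST = 0.02
INVALID_PENALTY = 0.1
TIMEOUT_PENALTY = 0.2

def compute_shaped_reward(
    previous_risk: float,
    new_risk: float,
    action_type: str,
    valid: bool = True,
    timed_out: bool = False,
) -> float:
    if not valid:
        return -INVALID_PENALTY  # invalid actions: penalty only, no state change
    reward = previous_risk - new_risk  # positive when risk drops
    if action_type == "query_ddi":
        reward -= QUERY_COST
    elif action_type == "propose_intervention":
        reward -= INTERVENTION_COST
    if timed_out:
        reward -= TIMEOUT_PENALTY
    return reward
```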
=================================================
4. Tasks and graders (3 difficulty levels)
=================================================
Define three task IDs and semantics in `tasks.py` and `graders.py`:
Task IDs:
- `easy_screening`
- `budgeted_screening`
- `complex_tradeoff`
4.1. `easy_screening` (easy)
- Small regimen: 3–5 drugs.
- Exactly one **severe** DDI pair and possibly one simple Beers violation.
- Budgets:
- query_budget ≈ 4
- intervention_budget ≈ 2
- max_steps ≈ 10
Grader:
- Input: full trajectory, baseline risk, final risk, list of interventions.
- Compute:
- `risk_reduction = max(0.0, baseline_risk - final_risk) / max(baseline_risk, ε)` (normalized).
- `targeted_intervention_flag = 1.0` if at least one intervention affects one of the drugs in the known severe DDI pair, else 0.0.
- Score:
- `score = 0.5 * risk_reduction + 0.5 * targeted_intervention_flag`
- Clip to [0.0, 1.0].
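The `easy_screening` grader above reduces to a few lines; this sketch assumes the caller passes the set of drug IDs touched by interventions and the known severe pair:

```python
EPS = 1e-6  # guards division when baseline risk is ~0

def grade_easy_screening(
    baseline_risk: float,
    final_risk: float,
    intervention_drug_ids: set[str],
    severe_pair: tuple[str, str],
) -> float:
    risk_reduction = max(0.0, baseline_risk - final_risk) / max(baseline_risk, EPS)
    targeted = 1.0 if intervention_drug_ids & set(severe_pair) else 0.0
    score = 0.5 * risk_reduction + 0.5 * targeted
    return min(1.0, max(0.0, score))
```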
4.2. `budgeted_screening` (medium)
- Medium regimen: 6–10 drugs.
- Multiple DDIs (mild/moderate/severe) and multiple Beers issues.
- Budgets:
- query_budget ≈ 8
- intervention_budget ≈ 3
- max_steps ≈ 20
Grader:
- Compute:
- `risk_reduction_score` as normalized risk drop.
- `intervention_precision_score` = fraction of interventions that actually reduce risk or fix guideline violations.
- `query_efficiency_score` = (number of severe/moderate DDIs discovered) / (number of queries used), normalized.
- Weighted score, for example:
- `score = 0.5 * risk_reduction_score + 0.3 * intervention_precision_score + 0.2 * query_efficiency_score`
- Clip to [0.0, 1.0].
4.3. `complex_tradeoff` (hard)
- Larger regimen: 10–15 drugs.
- Some drugs are **clinically critical** (e.g., anticoagulants, insulin analogues) and encoded as such in `drug_metadata` or a small internal map.
- Episodes contain:
- multiple DDIs and Beers issues, including ones involving critical drugs.
- safer substitutes for some risky drugs.
Budgets:
- query_budget ≈ 12
- intervention_budget ≈ 5
- max_steps ≈ 30
Grader adds a **regimen disruption penalty** component:
- Metrics:
- `risk_reduction_score` (as above).
- `critical_drug_penalty` = penalty if a critical drug is stopped without substitution to another suitable agent.
- `total_drug_changes` = number of drugs stopped or substituted.
- `regimen_disruption_penalty` derived from `total_drug_changes` and `critical_drug_penalty`.
Example scoring:
- `base = risk_reduction_score`
- `penalty = α * regimen_disruption_penalty`
- `score = clamp(base - penalty, 0.0, 1.0)`
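The example scoring above can be sketched as follows; the per-change weights and `ALPHA` are illustrative assumptions to be tuned:

```python
ALPHA = 0.15  # disruption weight; an assumption, not prescribed by the spec

def grade_complex_tradeoff(
    risk_reduction_score: float,
    total_drug_changes: int,
    critical_stops_without_substitute: int,
) -> float:
    # Weight unsubstituted critical-drug stops much more heavily than ordinary changes.
    disruption = 0.1 * total_drug_changes + 0.5 * critical_stops_without_substitute
    score = risk_reduction_score - ALPHA * disruption
    return min(1.0, max(0.0, score))
```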
4.4. Reward shaping
In `rewards.py`, define a consistent shaping scheme:
- On each query:
- Small negative reward (e.g., −0.01) plus any small bonus if it discovers a severe DDI, if desired.
- On each intervention:
- Reward ≈ (previous_risk - new_risk) − small intervention cost.
- On invalid actions:
- Larger negative reward (e.g., −0.1) and no state change.
- On `finish_review`:
- Add the task-level `score` ∈ [0.0, 1.0] from the corresponding grader to that step’s shaped reward.
Ensure the sum of step rewards per episode remains in a reasonable numeric range (e.g., roughly -5 to +5) while still allowing meaningful differentiation by graders.
=================================================
5. HTTP API server and openenv.yaml
=================================================
5.1. HTTP server (FastAPI)
In `api/server.py`:
- Implement a FastAPI app that maintains a `PolypharmacyEnv` instance (or a multiplexing scheme if needed).
- Endpoints:
- `POST /reset`:
- Request body: may include `task_id` (string).
- Response: serialized `PolypharmacyObservation`.
- `POST /step`:
- Request body: serialized `PolypharmacyAction`.
- Response: dict with:
- `observation`: `PolypharmacyObservation`
- `reward`: float
- `done`: bool
- `info`: dict
- `GET /state`:
- Response: `PolypharmacyState`.
Provide a module-level `app = FastAPI(...)` object for use with uvicorn and Hugging Face Spaces. Ensure the JSON schema is consistent with OpenEnv clients (simple, flat JSON for observation/action/state).
5.2. `openenv.yaml`
At repo root, define `openenv.yaml` consistent with the latest OpenEnv spec. At minimum, include:
- `name`: `polypharmacy_env`
- `version`: e.g., `0.1.0`
- `description`: human-readable description.
- `author`: your details.
- `tags`: e.g., `["healthcare", "polypharmacy", "openenv"]`
- `tasks`:
- One entry per task:
- `id`: `"easy_screening"` / `"budgeted_screening"` / `"complex_tradeoff"`
- `description`: one-line description
- `difficulty`: `"easy"`, `"medium"`, `"hard"`
Ensure `openenv validate` (or equivalent validator) passes once implemented.
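A plausible `openenv.yaml` matching the fields listed above (the exact schema and field names should be verified against the current OpenEnv spec; author details are placeholders):

```yaml
name: polypharmacy_env
version: 0.1.0
description: Elderly polypharmacy medication-review environment for agentic RL.
author: Your Name <you@example.com>
tags: ["healthcare", "polypharmacy", "openenv"]
tasks:
  - id: easy_screening
    description: Small regimen with one severe DDI to find and fix.
    difficulty: easy
  - id: budgeted_screening
    description: Medium regimen with multiple DDIs under tight budgets.
    difficulty: medium
  - id: complex_tradeoff
    description: Large regimen with critical drugs and disruption penalties.
    difficulty: hard
```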
=================================================
6. Baseline heuristic (non-LLM) agent
=================================================
In `baselines/heuristic_agent.py`, implement a simple, deterministic baseline agent that:
For each episode:
- Iterates through all unordered medication pairs within query budget:
- Calls `query_ddi` via the environment for each pair until the query budget is exhausted or all pairs are examined.
- Records severe and moderate interactions.
- After querying:
- For each severe DDI pair:
- Try `substitute` one of the drugs using `drug_metadata`:
- Prefer substitute within same `atc_class` that:
- is not marked high-risk elderly.
- does not participate in known severe DDIs with the rest of the regimen.
- If no substitute exists, propose `stop` for the higher-risk drug.
- Respect intervention budget limits.
- Finally, call `finish_review`.
This baseline should be callable as a simple Python function that interacts with `PolypharmacyEnv` directly (without HTTP).
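The pair-enumeration step of the heuristic is a one-liner; sorting first keeps the iteration order (and thus the baseline) deterministic:

```python
from itertools import combinations

def pairs_to_query(medication_ids: list[str], query_budget: int) -> list[tuple[str, str]]:
    """All unordered medication pairs, truncated to the query budget."""
    return list(combinations(sorted(medication_ids), 2))[:query_budget]
```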
=================================================
7. Baseline LLM inference script (inference.py)
=================================================
At repo root, create `inference.py` that:
7.1. Uses the OpenAI Python client
- Import and configure the official OpenAI Python client.
- Read environment variables:
- `OPENAI_API_KEY` (required).
- `API_BASE_URL` (base URL for LLM; default to OpenAI standard if not set).
- `MODEL_NAME` (e.g., `gpt-4.1` or similar).
- `HF_TOKEN` (if needed for HF auth; do not hardcode).
- Read `POLYPHARMACY_ENV_URL` (or similar) for the environment’s HTTP base URL.
7.2. Implements the required logging format
- For each **run** across all tasks:
- Emit a `[START]` line with a JSON payload exactly matching the evaluation specification:
- Fields such as `run_id`, `task_id`, `model`, etc., in the same order and naming as the sample OpenEnv inference script.
- For each **step** in an episode:
- Emit a `[STEP]` line with JSON fields including:
- `run_id`
- `task_id`
- `episode_id`
- `step_index`
- `observation_summary` (brief, machine-readable summary)
- `action_payload` (the action sent to the env)
- `reward`
- `done`
- After finishing an episode for a task:
- Emit an `[END]` line summarizing:
- `run_id`
- `task_id`
- per-episode statistics (e.g., total reward, grader score from last step’s `info`).
- The stdout format MUST follow the sample exactly:
- Same tags: `[START]`, `[STEP]`, `[END]`.
- Same JSON field names and ordering as the provided reference.
- No extra prints except these structured logs (and necessary error messages to stderr).
7.3. LLM agent loop
- For each task (`easy_screening`, `budgeted_screening`, `complex_tradeoff`):
- Run a fixed small number of episodes (e.g., 5–10 per task) for baseline scoring.
- For each episode:
- Call `/reset` with the task id.
- At each step:
- Summarize the observation into a concise prompt for the LLM:
- Include age, sex, conditions, high-risk flags, budgets, and a compressed view of meds and previous actions.
- Ask the model to output a **strict JSON** representing `PolypharmacyAction` fields.
- Parse and validate the JSON; if invalid, fall back to a safe default (e.g., `finish_review` or a no-op) and penalize in evaluation.
- Send this action to `/step` and log `[STEP]`.
- End when `done=True` or max_steps is reached.
- At the end, print aggregate scores per task and overall.
Ensure total runtime stays under 20 minutes and that the script runs within 2 vCPUs and 8 GB RAM.
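The parse-and-fallback step can be sketched like this (the safe default and the validation depth are assumptions; a fuller version would validate every `PolypharmacyAction` field):

```python
import json

SAFE_DEFAULT = {"action_type": "finish_review"}
VALID_ACTION_TYPES = {"query_ddi", "propose_intervention", "finish_review"}

def parse_llm_action(raw: str) -> dict:
    """Parse the model's strict-JSON action; fall back to a safe default."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return dict(SAFE_DEFAULT)
    if not isinstance(payload, dict) or payload.get("action_type") not in VALID_ACTION_TYPES:
        return dict(SAFE_DEFAULT)
    return payload
```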
=================================================
8. Dockerfile and Hugging Face Space
=================================================
8.1. Dockerfile
Create a `Dockerfile` that:
- Starts from a slim Python image (e.g., `python:3.11-slim`).
- Installs system dependencies as needed (e.g., `build-essential`, `curl`).
- Copies the project into the container.
- Installs Python dependencies from `requirements.txt`.
- Sets appropriate environment variables for the app (e.g., `PORT=7860`).
- Exposes port 7860.
- Uses a `CMD` or `ENTRYPOINT` that runs the FastAPI server, for example:
- `uvicorn polypharmacy_env.api.server:app --host 0.0.0.0 --port 7860`
8.2. Hugging Face Space
Ensure the repository is ready to be used as a Hugging Face Space:
- Space type: `docker`.
- Tag: `openenv`.
- On container start, the server must listen on the correct port and respond to:
- `POST /reset`
- `POST /step`
- `GET /state`
- The environment must start cleanly with `docker build` + `docker run` locally.
=================================================
9. README and documentation
=================================================
In `README.md`, include:
- **Environment description & motivation**:
- What PolypharmacyEnv simulates.
- Why elderly polypharmacy safety matters.
- **Action and observation spaces**:
- Describe `PolypharmacyAction`, `PolypharmacyObservation`, and `PolypharmacyState` fields and semantics.
- **Task descriptions**:
- `easy_screening`, `budgeted_screening`, `complex_tradeoff`, their difficulty and goals.
- **Reward structure**:
- Summarize shaping and terminal rewards.
- **Setup & usage**:
- How to install dependencies.
- How to run the API server locally (uvicorn command).
- How to run the heuristic baseline.
- How to run `inference.py` with environment variables.
- **Baseline scores**:
- Document reproducible baseline scores for each task (heuristic agent, and LLM baseline if available).
=================================================
10. Validation and quality gates
=================================================
- Ensure:
- `openenv.yaml` and the HTTP server pass the OpenEnv validation script.
- `docker build` and `docker run` work without errors.
- `inference.py` completes under 20 minutes, within 2 vCPUs / 8 GB RAM.
- All graders:
- Are deterministic.
- Return scores strictly in [0.0, 1.0].
- No grader returns a constant score irrespective of behavior.
Aim for clean, well-structured, well-documented code with clear separation of concerns between:
- Data loading,
- Environment state & dynamics,
- Reward/grade logic,
- HTTP serving,
- Baseline agents and inference.