You are an expert Python backend, ML, and infrastructure engineer. Your task is to implement a complete, production-ready OpenEnv environment called **PolypharmacyEnv** for training and evaluating agentic RL policies that act as an "elderly polypharmacy safety agent" (clinical pharmacist assistant).

The deliverable MUST satisfy all of the following:

- Fully compliant with the OpenEnv spec (typed models, `step()` / `reset()` / `state()`, `openenv.yaml`, HTTP server, Dockerfile).
- Simulates a realistic healthcare workflow around elderly polypharmacy and dangerous drug combinations.
- Defines at least **3 tasks** (easy → medium → hard) with deterministic graders producing scores in [0.0, 1.0].
- Provides shaped rewards over the trajectory (not just sparse terminal rewards).
- Includes a baseline LLM-based inference script `inference.py` in the repo root, following the evaluation requirements:
  - Uses the OpenAI Python client.
  - Reads `OPENAI_API_KEY`, `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from the environment.
  - Emits structured stdout logs in the exact `[START]`, `[STEP]`, `[END]` format from the OpenEnv sample inference script.
- Is containerized and deployable as a **Hugging Face Space** tagged with `openenv` that responds to OpenEnv-style `reset` / `step` / `state` HTTP calls.

Implement everything described below.

=================================================
1. Repository and folder structure
=================================================

Create a Python package repository with this structure (names are important unless clearly labeled as examples):

- `openenv-polypharmacy/`
  - `openenv.yaml`
  - `README.md`
  - `requirements.txt`
  - `Dockerfile`
  - `inference.py`  # baseline LLM agent per spec
  - `pyproject.toml` or `setup.cfg` (optional but recommended)
  - `src/`
    - `polypharmacy_env/`
      - `__init__.py`
      - `config.py`
      - `models.py`  # Action, Observation, State, helper models
      - `env_core.py`  # PolypharmacyEnv implementation
      - `tasks.py`  # task setup utilities
      - `graders.py`  # deterministic graders for each task
      - `rewards.py`  # reward shaping logic
      - `data_loader.py`  # load/preprocess patient and lookup data
      - `ddi_simulator.py`  # local DDI / guideline simulator
      - `api/`
        - `__init__.py`
        - `schemas.py`  # HTTP request/response schemas
        - `server.py`  # FastAPI app exposing OpenEnv endpoints
      - `baselines/`
        - `__init__.py`
        - `heuristic_agent.py`  # simple rule-based baseline agent
        - `random_agent.py`  # trivial random baseline (optional)
  - `tests/`
    - `__init__.py`
    - `test_env_core.py`
    - `test_api.py`
  - `data/`
    - `raw/`  # placeholder for real/synthetic source data
    - `processed/`
    - `lookups/`
      - `ddi_rules.csv`
      - `beers_criteria.csv`
      - `drug_metadata.csv`
  - `scripts/`
    - `preprocess_data.py`
    - `run_validation.sh`  # optional; runs OpenEnv validator, tests, etc.

Use Python 3.10+ with full type hints, and keep the code black/isort-compatible.

=================================================
2. Domain, data, and clinical abstraction
=================================================

2.1. Core scenario

Model an elderly patient (age ≥ 65) with:

- Demographics: age, sex.
- Comorbidities: e.g., hypertension, diabetes, heart failure, CKD, dementia.
- Basic labs: kidney function (eGFR category), liver function category.
- A current medication list (polypharmacy, e.g., 3–15 drugs depending on task).
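The patient abstraction above can be sketched as a small dataclass. This is illustrative only — the field names follow the CSV schema in section 2.2, and the canonical wire models are the ones specified in section 3.1:

```python
from dataclasses import dataclass, field


@dataclass
class PatientContext:
    """Illustrative patient abstraction for section 2.1 (not the wire model)."""

    age: int
    sex: str                      # "M" | "F" | "O"
    conditions: list[str]         # e.g. ["HTN", "DM", "CKD"]
    eGFR_category: str            # "normal" | "mild" | "moderate" | "severe"
    liver_function_category: str  # "normal" | "impaired"
    medication_ids: list[str] = field(default_factory=list)

    @property
    def is_elderly(self) -> bool:
        # The environment only generates elderly patients (age >= 65).
        return self.age >= 65
```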
Each **episode** is one medication-review session where the agent:

- Observes patient info and current meds.
- Optionally **queries** a DDI/guideline tool for specific drug pairs.
- Proposes **interventions**:
  - `stop`: discontinue a drug.
  - `dose_reduce`: lower the dose of a drug.
  - `substitute`: swap to a safer alternative.
  - `add_monitoring`: keep the drug but flag extra monitoring.
- Calls `finish_review` when it decides the regimen is acceptable or budgets are exhausted.

No external PHI, EHRs, or online APIs: all data is **synthetic** or de-identified and local to the container (CSV files).

2.2. Data files and CSV schemas

Implement local CSVs under `data/lookups/`:

**`drug_metadata.csv`**
- `drug_id` (string; unique key)
- `generic_name` (string)
- `atc_class` (string)
- `is_high_risk_elderly` (0/1)
- `default_dose_mg` (float)
- `min_dose_mg` (float)
- `max_dose_mg` (float)

**`beers_criteria.csv`**
- `drug_id` (string)
- `criterion_type` (enum string: `avoid`, `caution`, `dose_adjust`, `avoid_in_condition`)
- `condition` (nullable string; e.g., `CKD`, `dementia`)
- `rationale` (brief text)

**`ddi_rules.csv`**
- `drug_id_1` (string; normalized so `drug_id_1 < drug_id_2` lexicographically)
- `drug_id_2` (string)
- `severity` (enum string: `mild`, `moderate`, `severe`)
- `mechanism` (short text)
- `recommendation` (enum string: `avoid_combination`, `monitor_closely`, `dose_adjust`, `no_action`)
- `base_risk_score` (float in [0.0, 1.0])

Implement a synthetic patient-episode dataset under `data/processed/`:

**`patients_polypharmacy.csv`**
- `episode_id` (string)
- `age` (int)
- `sex` (enum: `M`, `F`, `O`)
- `conditions` (semicolon-separated; e.g., `HTN;DM;CKD`)
- `eGFR_category` (enum: `normal`, `mild`, `moderate`, `severe`)
- `liver_function_category` (enum: `normal`, `impaired`)
- `medication_ids` (semicolon-separated list of `drug_id`)
- `baseline_risk_score` (float in [0.0, 1.0])

2.3. Preprocessing script

In `scripts/preprocess_data.py`:

- If real data is not provided, procedurally generate synthetic but plausible data using:
  - Random combinations of conditions and drugs constrained by simple rules (e.g., CKD + renally cleared drugs).
  - A controlled distribution of high-risk DDIs and Beers violations.
- Explicitly tag episodes as easy/medium/hard (e.g., via number of drugs, number/severity of DDIs, and number of Beers issues).
- Save `patients_polypharmacy.csv` ready for the environment to consume.

=================================================
3. OpenEnv models and environment implementation
=================================================

3.1. Models

In `models.py`, define dataclasses or Pydantic models that extend the appropriate OpenEnv base types (`Action`, `Observation`, `State`) and are JSON-compatible.

Auxiliary models:

**`MedicationEntry`**
- `drug_id: str`
- `generic_name: str`
- `atc_class: str`
- `dose_mg: float`
- `frequency: str`  # e.g., `qd`, `bid`
- `route: str`  # e.g., `po`
- `is_high_risk_elderly: bool`
- `beers_flags: list[str]`  # e.g., `["avoid", "dose_adjust_CKD"]`

**`InteractionQueryRecord`**
- `drug_id_1: str`
- `drug_id_2: str`
- `severity: str | None`
- `recommendation: str | None`
- `risk_score: float | None`
- `step_index: int`

**`InterventionRecord`**
- `target_drug_id: str`
- `action_type: Literal["stop", "dose_reduce", "substitute", "add_monitoring"]`
- `proposed_new_drug_id: str | None`
- `rationale: str`
- `step_index: int`

Core wire models:

**`PolypharmacyObservation`** (extends OpenEnv `Observation`)
- `episode_id: str`
- `task_id: Literal["easy_screening", "budgeted_screening", "complex_tradeoff"]`
- `age: int`
- `sex: str`
- `conditions: list[str]`
- `eGFR_category: str`
- `liver_function_category: str`
- `current_medications: list[MedicationEntry]`
- `interaction_queries: list[InteractionQueryRecord]`
- `interventions: list[InterventionRecord]`
- `step_index: int`
- `remaining_query_budget: int`
- `remaining_intervention_budget: int`
- `shaped_reward: float`  # reward from last step
- `done: bool`

**`PolypharmacyAction`** (extends OpenEnv `Action`)
- `action_type: Literal["query_ddi", "propose_intervention", "finish_review"]`
- `drug_id_1: str | None`  # for DDI queries or some interventions
- `drug_id_2: str | None`  # for DDI queries
- `target_drug_id: str | None`  # for interventions
- `intervention_type: Literal["stop", "dose_reduce", "substitute", "add_monitoring", "none"] | None`
- `proposed_new_drug_id: str | None`
- `rationale: str | None`

**`PolypharmacyState`** (extends OpenEnv `State`)
- `episode_id: str`
- `task_id: str`
- `step_count: int`
- `max_steps: int`
- `num_query_actions: int`
- `num_interventions: int`

3.2. Environment core

In `env_core.py`, implement `PolypharmacyEnv` extending the appropriate OpenEnv environment base class. It must implement:

**`reset(task_id: str | None = None) -> PolypharmacyObservation`**

- If `task_id` is `None`, default to medium (`budgeted_screening`).
- Sample an episode from `patients_polypharmacy.csv` filtered by difficulty.
- Initialize:
  - `episode_id`
  - `step_count = 0`
  - task-specific budgets (query, interventions, max_steps)
  - the baseline regimen and risk
  - empty `interaction_queries` and `interventions`
- Return the initial `PolypharmacyObservation` with:
  - `step_index = 0`
  - `shaped_reward = 0.0`
  - `done = False`

**`step(action: PolypharmacyAction) -> dict`**

- Validate the action; if invalid:
  - Apply a negative reward.
  - Do not modify the regimen, but log the error in `info`.
- If `action_type == "query_ddi"`:
  - If the query budget is exhausted, apply a penalty and do not query.
  - Else:
    - Use `ddi_simulator.lookup_ddi(drug_id_1, drug_id_2)` to get severity, recommendation, and base_risk_score.
    - Append an `InteractionQueryRecord`.
    - Apply a small negative reward for query cost.
- If `action_type == "propose_intervention"`:
  - If the intervention budget is exhausted, apply a penalty and ignore the change.
  - Else:
    - Update `current_medications` according to `intervention_type`:
      - `stop`: remove the medication.
      - `dose_reduce`: adjust the dose downward within [min_dose_mg, default_dose_mg].
      - `substitute`: replace with a safer alternative from the same `atc_class`.
      - `add_monitoring`: keep the drug but tag it in internal state.
    - Append an `InterventionRecord`.
    - Recompute current regimen risk using the risk model (see 3.3).
    - Compute shaped reward = (previous_risk - new_risk) - small intervention cost.
- If `action_type == "finish_review"`:
  - Mark `done = True`.
  - Call the task's grader to get an episode-level score in [0.0, 1.0].
  - Add this as a terminal bonus to the current step reward.
- In all cases:
  - Increment `step_count`.
  - Check `max_steps`; if exceeded, auto-terminate:
    - `done = True`
    - apply a time-out penalty
    - call the grader with the current trajectory for a final score if appropriate.
  - Construct the next `PolypharmacyObservation` with updated fields.
- Return a dict:
  - `observation`: `PolypharmacyObservation`
  - `reward`: float shaped reward for this step
  - `done`: bool
  - `info`: dict with fields like `current_risk`, `baseline_risk`, `grader_score_if_terminal`, and debug flags.

**`state` property**

- Returns `PolypharmacyState` reflecting the current internal state.

3.3. DDI simulator and risk model

In `ddi_simulator.py`:

- Load `ddi_rules.csv` once via `data_loader`.
- Implement `lookup_ddi(drug_id_1, drug_id_2) -> tuple[severity, recommendation, base_risk_score]`:
  - Normalize the pair ordering.
  - Look up the row; if missing, return:
    - severity = `"none"`
    - recommendation = `"no_action"`
    - base_risk_score = 0.0

In `rewards.py` (or a dedicated module), implement:

- `compute_regimen_risk(current_drug_ids, patient_context, ddi_rules, beers_rules, drug_metadata) -> float`
  - Aggregate contributions from:
    - Beers violations (weighted by `criterion_type` and relevant conditions).
    - DDI base risk scores for all present drug pairs.
    - High-risk elderly drugs.
  - Normalize and clip to [0.0, 1.0].
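As a concrete sketch, `lookup_ddi` and a simplified `compute_regimen_risk` might look like the following, assuming the CSVs have already been loaded into plain dicts (`ddi_rules` keyed by normalized pair, `beers_rules` keyed by `drug_id`) and that patient conditions are passed directly in place of a full patient-context object. The Beers weights are illustrative placeholders, not prescribed values:

```python
from itertools import combinations


def _norm_pair(a: str, b: str) -> tuple[str, str]:
    # ddi_rules.csv stores each pair with drug_id_1 < drug_id_2 lexicographically.
    return (a, b) if a < b else (b, a)


def lookup_ddi(ddi_rules: dict, drug_id_1: str, drug_id_2: str):
    """Return (severity, recommendation, base_risk_score); safe defaults if unknown."""
    row = ddi_rules.get(_norm_pair(drug_id_1, drug_id_2))
    if row is None:
        return ("none", "no_action", 0.0)
    return (row["severity"], row["recommendation"], float(row["base_risk_score"]))


# Illustrative weights per Beers criterion_type (tune during implementation).
BEERS_WEIGHTS = {"avoid": 0.3, "caution": 0.1, "dose_adjust": 0.15, "avoid_in_condition": 0.25}


def compute_regimen_risk(drug_ids, conditions, ddi_rules, beers_rules, drug_metadata) -> float:
    risk = 0.0
    # DDI contributions for every unordered pair currently in the regimen.
    for a, b in combinations(sorted(drug_ids), 2):
        _, _, score = lookup_ddi(ddi_rules, a, b)
        risk += score
    # Beers criteria, gated on patient conditions when condition-specific.
    for d in drug_ids:
        for crit in beers_rules.get(d, []):
            if crit.get("condition") in (None, "") or crit["condition"] in conditions:
                risk += BEERS_WEIGHTS.get(crit["criterion_type"], 0.0)
    # Flat bump for drugs flagged high-risk in the elderly.
    risk += 0.1 * sum(1 for d in drug_ids if drug_metadata.get(d, {}).get("is_high_risk_elderly"))
    return min(1.0, risk)  # clip to [0.0, 1.0]
```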
Use this function to compute:

- `baseline_risk` at episode start.
- the risk after each intervention step.

Also implement:

- `compute_shaped_reward(previous_risk, new_risk, action, context, partial_metrics) -> float`
  - Positive component: `previous_risk - new_risk`.
  - Negative components: per-query cost, per-intervention cost, invalid-action penalty, time-out penalty.

=================================================
4. Tasks and graders (3 difficulty levels)
=================================================

Define three task IDs and their semantics in `tasks.py` and `graders.py`:

Task IDs:
- `easy_screening`
- `budgeted_screening`
- `complex_tradeoff`

4.1. `easy_screening` (easy)

- Small regimen: 3–5 drugs.
- Exactly one **severe** DDI pair and possibly one simple Beers violation.
- Budgets:
  - query_budget ≈ 4
  - intervention_budget ≈ 2
  - max_steps ≈ 10

Grader:
- Input: full trajectory, baseline risk, final risk, list of interventions.
- Compute:
  - `risk_reduction = max(0.0, baseline_risk - final_risk) / max(baseline_risk, ε)` (normalized).
  - `targeted_intervention_flag = 1.0` if at least one intervention affects one of the drugs in the known severe DDI pair, else 0.0.
- Score:
  - `score = 0.5 * risk_reduction + 0.5 * targeted_intervention_flag`
  - Clip to [0.0, 1.0].

4.2. `budgeted_screening` (medium)

- Medium regimen: 6–10 drugs.
- Multiple DDIs (mild/moderate/severe) and multiple Beers issues.
- Budgets:
  - query_budget ≈ 8
  - intervention_budget ≈ 3
  - max_steps ≈ 20

Grader:
- Compute:
  - `risk_reduction_score` as the normalized risk drop.
  - `intervention_precision_score` = fraction of interventions that actually reduce risk or fix guideline violations.
  - `query_efficiency_score` = (number of severe/moderate DDIs discovered) / (number of queries used), normalized.
- Weighted score, for example:
  - `score = 0.5 * risk_reduction_score + 0.3 * intervention_precision_score + 0.2 * query_efficiency_score`
  - Clip to [0.0, 1.0].

4.3. `complex_tradeoff` (hard)

- Larger regimen: 10–15 drugs.
- Some drugs are **clinically critical** (e.g., anticoagulants, insulin analogues) and encoded as such in `drug_metadata` or a small internal map.
- Episodes contain:
  - multiple DDIs and Beers issues, including ones involving critical drugs.
  - safer substitutes for some risky drugs.
- Budgets:
  - query_budget ≈ 12
  - intervention_budget ≈ 5
  - max_steps ≈ 30

Grader adds a **regimen disruption penalty** component:
- Metrics:
  - `risk_reduction_score` (as above).
  - `critical_drug_penalty` = penalty if a critical drug is stopped without substitution to another suitable agent.
  - `total_drug_changes` = number of drugs stopped or substituted.
  - `regimen_disruption_penalty` derived from `total_drug_changes` and `critical_drug_penalty`.
- Example scoring:
  - `base = risk_reduction_score`
  - `penalty = α * regimen_disruption_penalty`
  - `score = clamp(base - penalty, 0.0, 1.0)`

4.4. Reward shaping

In `rewards.py`, define a consistent shaping scheme:

- On each query:
  - A small negative reward (e.g., −0.01), plus an optional small bonus if the query discovers a severe DDI.
- On each intervention:
  - Reward ≈ (previous_risk - new_risk) − small intervention cost.
- On invalid actions:
  - A larger negative reward (e.g., −0.1) and no state change.
- On `finish_review`:
  - Add the task-level `score` ∈ [0.0, 1.0] from the corresponding grader to that step's shaped reward.

Ensure the sum of step rewards per episode remains in a reasonable numeric range (e.g., roughly −5 to +5) while still allowing meaningful differentiation by graders.

=================================================
5. HTTP API server and openenv.yaml
=================================================

5.1. HTTP server (FastAPI)

In `api/server.py`:

- Implement a FastAPI app that maintains a `PolypharmacyEnv` instance (or a multiplexing scheme if needed).
- Endpoints:
  - `POST /reset`:
    - Request body: may include `task_id` (string).
    - Response: serialized `PolypharmacyObservation`.
  - `POST /step`:
    - Request body: serialized `PolypharmacyAction`.
    - Response: dict with:
      - `observation`: `PolypharmacyObservation`
      - `reward`: float
      - `done`: bool
      - `info`: dict
  - `GET /state`:
    - Response: `PolypharmacyState`.

Provide a module-level `app = FastAPI(...)` object for use with uvicorn and Hugging Face Spaces. Ensure the JSON schema is consistent with OpenEnv clients (simple, flat JSON for observation/action/state).

5.2. `openenv.yaml`

At the repo root, define `openenv.yaml` consistent with the latest OpenEnv spec. At minimum, include:

- `name`: `polypharmacy_env`
- `version`: e.g., `0.1.0`
- `description`: human-readable description.
- `author`: your details.
- `tags`: e.g., `["healthcare", "polypharmacy", "openenv"]`
- `tasks`:
  - One entry per task:
    - `id`: `"easy_screening"` / `"budgeted_screening"` / `"complex_tradeoff"`
    - `description`: one-line description
    - `difficulty`: `"easy"`, `"medium"`, `"hard"`

Ensure `openenv validate` (or an equivalent validator) passes once implemented.

=================================================
6. Baseline heuristic (non-LLM) agent
=================================================

In `baselines/heuristic_agent.py`, implement a simple, deterministic baseline agent that, for each episode:

- Iterates through all unordered medication pairs within the query budget:
  - Calls `query_ddi` via the environment for each pair until the query budget is exhausted or all pairs are examined.
  - Records severe and moderate interactions.
- After querying:
  - For each severe DDI pair:
    - Try to `substitute` one of the drugs using `drug_metadata`:
      - Prefer a substitute within the same `atc_class` that:
        - is not marked high-risk elderly.
        - does not participate in known severe DDIs with the rest of the regimen.
    - If no substitute exists, propose `stop` for the higher-risk drug.
  - Respect intervention budget limits.
- Finally, call `finish_review`.
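A minimal sketch of this query-then-intervene loop, driving the environment directly. For brevity it uses plain dicts instead of the typed models and always proposes `stop` for the higher-risk member of each severe pair; the full baseline should prefer a same-`atc_class` substitute first, as described above:

```python
from itertools import combinations


def run_heuristic_episode(env, drug_metadata: dict) -> float:
    """Illustrative sketch of the section 6 baseline (no HTTP, plain-dict actions)."""
    obs = env.reset()
    drug_ids = sorted(m["drug_id"] for m in obs["current_medications"])
    severe_pairs = []
    # Phase 1: enumerate unordered pairs until the query budget is spent.
    for a, b in combinations(drug_ids, 2):
        if obs["remaining_query_budget"] <= 0:
            break
        out = env.step({"action_type": "query_ddi", "drug_id_1": a, "drug_id_2": b})
        obs = out["observation"]
        if obs["interaction_queries"][-1]["severity"] == "severe":
            severe_pairs.append((a, b))
    # Phase 2: intervene on each severe pair within the intervention budget.
    # (Simplified: stop the member flagged high-risk; the full baseline
    # tries a safer same-ATC-class substitute before stopping anything.)
    for a, b in severe_pairs:
        if obs["remaining_intervention_budget"] <= 0:
            break
        target = a if drug_metadata.get(a, {}).get("is_high_risk_elderly") else b
        out = env.step({
            "action_type": "propose_intervention",
            "target_drug_id": target,
            "intervention_type": "stop",
            "rationale": "member of severe DDI pair",
        })
        obs = out["observation"]
    # Phase 3: finish the review; the terminal reward includes the grader score.
    return env.step({"action_type": "finish_review"})["reward"]
```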
This baseline should be callable as a simple Python function that interacts with `PolypharmacyEnv` directly (without HTTP).

=================================================
7. Baseline LLM inference script (inference.py)
=================================================

At the repo root, create `inference.py` that:

7.1. Uses the OpenAI Python client

- Import and configure the official OpenAI Python client.
- Read environment variables:
  - `OPENAI_API_KEY` (required).
  - `API_BASE_URL` (base URL for the LLM; default to the OpenAI standard if not set).
  - `MODEL_NAME` (e.g., `gpt-4.1` or similar).
  - `HF_TOKEN` (if needed for HF auth; do not hardcode).
- Read `POLYPHARMACY_ENV_URL` (or similar) for the environment's HTTP base URL.

7.2. Implements the required logging format

- For each **run** across all tasks:
  - Emit a `[START]` line with a JSON payload exactly matching the evaluation specification:
    - Fields such as `run_id`, `task_id`, `model`, etc., in the same order and naming as the sample OpenEnv inference script.
- For each **step** in an episode:
  - Emit a `[STEP]` line with JSON fields including:
    - `run_id`
    - `task_id`
    - `episode_id`
    - `step_index`
    - `observation_summary` (brief, machine-readable summary)
    - `action_payload` (the action sent to the env)
    - `reward`
    - `done`
- After finishing an episode for a task:
  - Emit an `[END]` line summarizing:
    - `run_id`
    - `task_id`
    - per-episode statistics (e.g., total reward, grader score from the last step's `info`).
- The stdout format MUST follow the sample exactly:
  - Same tags: `[START]`, `[STEP]`, `[END]`.
  - Same JSON field names and ordering as the provided reference.
  - No extra prints except these structured logs (and necessary error messages to stderr).

7.3. LLM agent loop

- For each task (`easy_screening`, `budgeted_screening`, `complex_tradeoff`):
  - Run a fixed small number of episodes (e.g., 5–10 per task) for baseline scoring.
  - For each episode:
    - Call `/reset` with the task id.
    - At each step:
      - Summarize the observation into a concise prompt for the LLM:
        - Include age, sex, conditions, high-risk flags, budgets, and a compressed view of meds and previous actions.
      - Ask the model to output **strict JSON** representing the `PolypharmacyAction` fields.
      - Parse and validate the JSON; if invalid, fall back to a safe default (e.g., `finish_review` or a no-op) and penalize it in evaluation.
      - Send this action to `/step` and log `[STEP]`.
    - End when `done=True` or `max_steps` is reached.
- At the end, print aggregate scores per task and overall.

Make sure runtime < 20 minutes and that the script can run within 2 vCPUs and 8 GB RAM.

=================================================
8. Dockerfile and Hugging Face Space
=================================================

8.1. Dockerfile

Create a `Dockerfile` that:

- Starts from a slim Python image (e.g., `python:3.11-slim`).
- Installs system dependencies as needed (e.g., `build-essential`, `curl`).
- Copies the project into the container.
- Installs Python dependencies from `requirements.txt`.
- Sets appropriate environment variables for the app (e.g., `PORT=7860`).
- Exposes port 7860.
- Uses a `CMD` or `ENTRYPOINT` that runs the FastAPI server, for example:
  - `uvicorn polypharmacy_env.api.server:app --host 0.0.0.0 --port 7860`

8.2. Hugging Face Space

Ensure the repository is ready to be used as a Hugging Face Space:

- Space type: `docker`.
- Tag: `openenv`.
- On container start, the server must listen on the correct port and respond to:
  - `POST /reset`
  - `POST /step`
  - `GET /state`
- The environment must start cleanly with `docker build` + `docker run` locally.

=================================================
9. README and documentation
=================================================

In `README.md`, include:

- **Environment description & motivation**:
  - What PolypharmacyEnv simulates.
  - Why elderly polypharmacy safety matters.
- **Action and observation spaces**:
  - Describe the `PolypharmacyAction`, `PolypharmacyObservation`, and `PolypharmacyState` fields and semantics.
- **Task descriptions**:
  - `easy_screening`, `budgeted_screening`, `complex_tradeoff`, their difficulty and goals.
- **Reward structure**:
  - Summarize shaping and terminal rewards.
- **Setup & usage**:
  - How to install dependencies.
  - How to run the API server locally (uvicorn command).
  - How to run the heuristic baseline.
  - How to run `inference.py` with environment variables.
- **Baseline scores**:
  - Document reproducible baseline scores for each task (heuristic agent, and the LLM baseline if available).

=================================================
10. Validation and quality gates
=================================================

- Ensure:
  - `openenv.yaml` and the HTTP server pass the OpenEnv validation script.
  - `docker build` and `docker run` work without errors.
  - `inference.py` completes under 20 minutes, within 2 vCPUs / 8 GB RAM.
- All graders:
  - Are deterministic.
  - Return scores strictly in [0.0, 1.0].
  - No grader returns a constant score irrespective of behavior.

Aim for clean, well-structured, well-documented code with a clear separation of concerns between:

- Data loading,
- Environment state & dynamics,
- Reward/grade logic,
- HTTP serving,
- Baseline agents and inference.
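The grader gates translate naturally into unit tests. A sketch of one such test (e.g., for `tests/`), using a hypothetical `grade_easy_screening` helper that mirrors the section 4.1 formula; the function name and signature are assumptions for illustration:

```python
def grade_easy_screening(baseline_risk, final_risk, interventions, severe_pair, eps=1e-6):
    # Mirrors the section 4.1 formula: normalized risk drop + targeted-intervention flag.
    risk_reduction = max(0.0, baseline_risk - final_risk) / max(baseline_risk, eps)
    targeted = 1.0 if any(i["target_drug_id"] in severe_pair for i in interventions) else 0.0
    return min(1.0, max(0.0, 0.5 * risk_reduction + 0.5 * targeted))


def test_grader_quality_gates():
    trajectory = dict(
        baseline_risk=0.8,
        final_risk=0.2,
        interventions=[{"target_drug_id": "warfarin"}],
        severe_pair=("warfarin", "ibuprofen"),
    )
    first = grade_easy_screening(**trajectory)
    # Deterministic: identical trajectories must grade identically.
    assert first == grade_easy_screening(**trajectory)
    # Bounded: strictly within [0.0, 1.0].
    assert 0.0 <= first <= 1.0
    # Not constant: a do-nothing trajectory must score lower.
    idle = grade_easy_screening(0.8, 0.8, [], ("warfarin", "ibuprofen"))
    assert idle < first
```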