---
title: PolypharmacyEnv
emoji: 💊
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - healthcare
  - polypharmacy
pinned: false
---

# PolypharmacyEnv: Elderly Medication Safety via Reinforcement Learning

An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant environment that simulates **elderly polypharmacy medication review**. An RL agent acts as a clinical pharmacist assistant: it queries drug-drug interactions (DDIs), identifies Beers-criteria violations, and proposes safe interventions, all under resource-constrained budgets.

Built for the **PyTorch OpenEnv Hackathon** to demonstrate how clinical decision support for polypharmacy can be framed as a sequential RL problem and served as a reusable environment through the OpenEnv hub.

---

## Why This Matters

Polypharmacy, the simultaneous use of five or more medications, affects the majority of adults over 65. Elderly patients often see multiple specialists who may not be aware of each other's prescriptions, leading to dangerous drug combinations. Studies report that **adverse drug events from polypharmacy contribute to 100,000+ hospitalizations annually** in the US alone.

Current solutions use static risk scoring. PolypharmacyEnv goes further by framing medication review as a **sequential decision problem**, in which an RL agent must strategically allocate limited query and intervention budgets to maximize patient safety: exactly the kind of resource-constrained optimization that reinforcement learning excels at.

**Reference**: Larouche, A., Durand, A., Khoury, R. & Sirois, C. (2023). [Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy](https://link.springer.com/chapter/10.1007/978-3-031-36938-4_5). *Advances in Artificial Intelligence*, Springer.

---

## How OpenEnv & RL Power This

### The RL Formulation

PolypharmacyEnv frames medication review as a **Markov Decision Process (MDP)**:

- **State**: Patient profile (age, conditions, organ function) + current medication list + interaction history
- **Action space**: `query_ddi(drug_i, drug_j)` | `propose_intervention(target, type)` | `finish_review`
- **Reward**: Shaped, dense signal at every step (not a sparse end-of-episode score), strictly in the range (0.001, 0.999). Queries carry a small cost, but discovering severe DDIs earns a larger bonus; successful interventions earn a reward proportional to the risk reduction. Invalid actions and timeouts are penalized, but all values are clamped to stay positive. `finish_review` triggers a grader that returns a terminal score in (0.001, 0.999).
- **Constraint**: Finite query and intervention budgets, creating a resource-allocation optimization problem.

This MDP is what makes the problem fundamentally different from static risk scoring: the agent must **decide what information to acquire** (which drug pairs to query) and **which interventions to prioritize**, all under budget constraints, a sequential decision problem that RL is designed to solve.

### OpenEnv Interface

PolypharmacyEnv implements the full **OpenEnv standard**:

- **`reset()`**: Generates a new patient scenario (age, conditions, medication list)
- **`step(action)`**: Processes an agent action, updates the regimen state, and returns a shaped reward
- **`state()`**: Returns the current episode snapshot

All models are typed Pydantic classes extending the OpenEnv base types (`PolypharmacyAction`, `PolypharmacyObservation`, `PolypharmacyState`).
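A minimal sketch of driving one episode over the HTTP interface (endpoint paths and payload shapes follow the API Reference section below; the drug IDs are placeholders, not real identifiers):

```python
# Minimal episode driver against a locally running PolypharmacyEnv.
# Endpoint paths and body shapes follow the API Reference section;
# the drug IDs below are placeholders.
import requests

BASE = "http://localhost:7860"

# Start a new episode on the easy task.
obs = requests.post(f"{BASE}/reset", json={"task_id": "easy_screening"}).json()
print(obs)

# Spend one query, then end the review to trigger the grader.
for action in [
    {"action_type": "query_ddi", "drug_id_1": "drug_a", "drug_id_2": "drug_b"},
    {"action_type": "finish_review"},
]:
    result = requests.post(f"{BASE}/step", json={"action": action}).json()
    print(result)

# Inspect the final episode snapshot.
print(requests.get(f"{BASE}/state").json())
```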
### What the Environment Enables

The shaped reward function provides a continuous signal over the full trajectory, making this environment compatible with standard RL training approaches:

- **Policy gradient methods** (REINFORCE, PPO, GRPO): The per-step reward signal allows policy networks to learn query prioritization and intervention strategies.
- **OpenEnv training pipeline**: Through OpenEnv's `step()`/`reset()` HTTP interface, external RL training loops can connect to this environment and train policies without modification.
- **Neural Bandits (OptimNeuralTS)**: The budget-constrained query selection implements the OptimNeuralTS approach from the reference paper: Neural Thompson Sampling combined with Differential Evolution for efficient search.

### Included Agents

The repository ships with multiple agent implementations spanning rule-based, RL-trained, bandit-based, and LLM-based approaches:

- **OptimNeuralTS bandit** (`train_bandit.py`, `neural_bandits.py`): Implements the paper's core algorithm, Neural Thompson Sampling with Differential Evolution, to efficiently search for dangerous drug combinations. Builds an ensemble of models across training steps for high-precision predictions.
- **REINFORCE-trained policy** (`train_rl.py`): A neural network policy trained via REINFORCE with a learned baseline against the environment's shaped reward. Demonstrates that the MDP formulation and reward shaping enable genuine policy improvement through RL training.
- **Heuristic agent** (`baselines/heuristic_agent.py`): Deterministic rule-based strategy that queries high-risk drug pairs first, then intervenes on severe DDIs (condensed sketch below). Serves as a strong domain-knowledge baseline.
- **LLM agent** (`inference.py`): Uses an LLM (Qwen2.5-72B via an OpenAI-compatible API) for zero-shot action generation. Demonstrates baseline LLM performance without RL fine-tuning.
- **AI suggestion endpoint** (`/agent/suggest`): LLM-powered action suggestions with rule-based guardrails for the interactive UI.
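To make the rule-based baseline concrete, here is a condensed sketch of the heuristic strategy (illustrative only; the helper name and the observation field details such as `drug_id` and `severity` are assumptions, not the actual `heuristic_agent.py` code):

```python
# Condensed heuristic policy: query unexplored drug pairs while budget
# remains, intervene on drugs implicated in severe DDIs, then finish.
# Field names like "severity" inside query records are assumptions.
from itertools import combinations

def heuristic_action(obs: dict) -> dict:
    meds = [m["drug_id"] for m in obs["current_medications"]]
    queried = {frozenset((q["drug_id_1"], q["drug_id_2"]))
               for q in obs["interaction_queries"]}

    # 1) Spend query budget on pairs not yet checked.
    if obs["remaining_query_budget"] > 0:
        for a, b in combinations(meds, 2):
            if frozenset((a, b)) not in queried:
                return {"action_type": "query_ddi",
                        "drug_id_1": a, "drug_id_2": b}

    # 2) Spend intervention budget on drugs involved in severe findings.
    if obs["remaining_intervention_budget"] > 0:
        for q in obs["interaction_queries"]:
            if q.get("severity") == "severe":
                return {"action_type": "propose_intervention",
                        "target_drug_id": q["drug_id_1"],
                        "intervention_type": "substitute"}

    # 3) Nothing left to do: trigger the grader.
    return {"action_type": "finish_review"}
```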
---

## Repository Structure

```
├── backend/
│   ├── main.py                      # ASGI entrypoint (uvicorn target)
│   ├── requirements.txt             # Python dependencies
│   └── src/polypharmacy_env/
│       ├── env_core.py              # OpenEnv environment: reset/step/state
│       ├── models.py                # Typed Pydantic models (Action, Observation, State)
│       ├── rewards.py               # Shaped reward function & regimen risk computation
│       ├── graders.py               # Deterministic graders for 3 task difficulties
│       ├── tasks.py                 # Task configuration & episode sampling
│       ├── config.py                # Reward hyperparameters & task parameters
│       ├── data_loader.py           # CSV data loading with caching
│       ├── ddi_simulator.py         # DDI lookup, Beers flags, drug substitution
│       ├── neural_bandits.py        # NeuralTS + Differential Evolution + OptimNeuralTS
│       ├── api/
│       │   ├── app.py               # FastAPI app factory via OpenEnv create_app
│       │   └── routes/
│       │       ├── agent.py         # POST /agent/suggest (AI-assisted actions)
│       │       └── bandit.py        # POST /bandit/predict, /bandit/screen
│       ├── baselines/
│       │   ├── heuristic_agent.py   # Deterministic baseline agent
│       │   └── random_agent.py      # Random baseline agent
│       ├── services/
│       │   └── groq_agent.py        # LLM-powered action suggestions
│       └── tests/
│           ├── test_env_core.py     # Environment unit tests
│           └── test_api.py          # HTTP + WebSocket integration tests
├── frontend/
│   ├── src/
│   │   ├── App.jsx                  # React control center UI
│   │   └── styles.css               # Production-quality dark theme
│   ├── package.json
│   └── vite.config.js
├── data/
│   ├── lookups/                     # drug_metadata.csv, ddi_rules.csv, beers_criteria.csv
│   └── processed/                   # patients_polypharmacy.csv (120 episodes)
├── scripts/
│   ├── preprocess_data.py           # Synthetic data generation
│   ├── dev_backend.sh               # Local backend runner
│   ├── dev_frontend.sh              # Local frontend runner
│   └── run_validation.sh            # Automated test + baseline validation
├── Dockerfile                       # Production multi-stage build (frontend + backend)
├── docker-compose.yml               # Development orchestration
├── inference.py                     # Submission baseline inference script
├── train_rl.py                      # REINFORCE RL training script (PyTorch)
├── train_bandit.py                  # OptimNeuralTS neural bandit training
├── openenv.yaml                     # OpenEnv manifest
└── .env.example                     # Environment variable template
```

---

## Action & Observation Spaces

### Actions

| Action Type | Parameters | Description |
|---|---|---|
| `query_ddi` | `drug_id_1`, `drug_id_2` | Check a drug pair for interactions. Returns severity, recommendation, and risk score. Costs 1 query budget. |
| `propose_intervention` | `target_drug_id`, `intervention_type`, `proposed_new_drug_id` (opt), `rationale` (opt) | Modify the medication regimen. Types: `stop`, `dose_reduce`, `substitute`, `add_monitoring`. Costs 1 intervention budget. |
| `finish_review` | — | End the episode. Triggers grader evaluation and returns the final score. |
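For reference, the JSON action payloads that `/step` accepts look like this (shapes inferred from the table above and the API Reference section; the drug IDs are placeholders):

```python
# One example payload per action type; wrap each in {"action": ...}
# when POSTing to /step. Drug IDs here are placeholders.
query = {
    "action_type": "query_ddi",
    "drug_id_1": "drug_a",
    "drug_id_2": "drug_b",
}

intervene = {
    "action_type": "propose_intervention",
    "target_drug_id": "drug_a",
    "intervention_type": "substitute",      # stop | dose_reduce | substitute | add_monitoring
    "proposed_new_drug_id": "drug_c",       # optional
    "rationale": "Severe DDI with drug_b",  # optional
}

finish = {"action_type": "finish_review"}
```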
### Observations

Each observation contains the full patient context:

| Field | Type | Description |
|---|---|---|
| `episode_id` | string | Unique episode identifier |
| `task_id` | string | Current task (easy_screening / budgeted_screening / complex_tradeoff) |
| `age`, `sex` | int, string | Patient demographics |
| `conditions` | list[string] | Active medical conditions |
| `eGFR_category`, `liver_function_category` | string | Organ function status |
| `current_medications` | list[MedicationEntry] | Active drugs with dose, ATC class, Beers flags |
| `interaction_queries` | list[InteractionQueryRecord] | History of DDI queries and results |
| `interventions` | list[InterventionRecord] | History of proposed interventions |
| `remaining_query_budget` | int | Remaining DDI query budget |
| `remaining_intervention_budget` | int | Remaining intervention budget |
| `shaped_reward` | float | Step reward signal |
| `done` | bool | Whether the episode has ended |

---

## Tasks & Difficulty Progression

| Task | Difficulty | Drugs | Query Budget | Intervention Budget | Max Steps | Description |
|---|---|---|---|---|---|---|
| **Easy Screening** | Easy | 3-5 | 4 | 2 | 10 | Small regimen with one severe DDI. Identify and resolve it. |
| **Budgeted Screening** | Medium | 6-10 | 8 | 3 | 20 | Multiple DDIs and Beers issues under tighter budgets. Must prioritize effectively. |
| **Complex Tradeoff** | Hard | 10-15 | 12 | 5 | 30 | Large regimen with critical drugs (warfarin, insulin). Balance risk reduction against regimen disruption. |

### Grading Criteria

- **Easy**: 50% risk reduction + 50% targeted intervention on severe DDI drugs
- **Medium**: 50% risk reduction + 30% intervention precision + 20% query efficiency
- **Hard**: Risk reduction minus penalties for excessive drug changes and for stopping critical medications without substitution

All graders are deterministic, producing scores strictly in `(0.001, 0.999)`.

---

## Reward Function Design

The shaped reward provides a signal at every step, not just at episode end. All rewards are strictly positive, clamped to the range **(0.001, 0.999)**:

| Event | Raw Signal | Clamped Output |
|---|---|---|
| DDI query (no finding) | small cost | 0.001 (floor) |
| Discovering a severe DDI | cost + bonus | ~0.035 |
| Discovering a moderate DDI | cost + bonus | ~0.005 |
| Successful intervention | risk_reduction - cost | proportional to risk improvement |
| Invalid action | penalty | 0.001 (floor) |
| Episode timeout | penalty | 0.001 (floor) |
| Finish review | grader_score | 0.001-0.999 |

**Regimen risk** aggregates pairwise DDI scores, Beers-criteria violation weights, and high-risk elderly drug penalties, normalized by regimen size and clipped to `[0.0, 1.0]`.
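A minimal sketch of the clamping behavior described above (the actual shaping coefficients live in `config.py`; the raw signal values below are illustrative assumptions chosen to match the table):

```python
# All shaped rewards are squeezed into (0.001, 0.999), so penalties
# never go negative and no single step dominates the trajectory.
REWARD_FLOOR, REWARD_CEIL = 0.001, 0.999

def clamp_reward(raw: float) -> float:
    return max(REWARD_FLOOR, min(REWARD_CEIL, raw))

# Illustrative raw signals (coefficients are assumptions, see config.py):
print(clamp_reward(-0.010))           # query with no finding -> 0.001 floor
print(clamp_reward(-0.005 + 0.040))   # severe DDI discovery  -> 0.035
print(clamp_reward(-0.050))           # invalid action        -> 0.001 floor
```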
---

## Prerequisites

- **Python** 3.10+
- **Node.js** 18+ (20+ recommended)
- **Docker** + Docker Compose (for containerized runs)

---

## Setup & Local Development

### 1. Clone and configure

```bash
git clone <repository-url>
cd PolypharmacyEnv
cp .env.example .env
# Edit .env with your API keys if using the AI suggestion feature
```

### 2. Install dependencies

```bash
# Backend
pip install -r backend/requirements.txt

# Frontend
cd frontend && npm install && cd ..
```

### 3. Generate synthetic data (if not already present)

```bash
python scripts/preprocess_data.py
```

### 4. Start services

**Terminal 1 (backend, port 7860):**

```bash
./scripts/dev_backend.sh
```

**Terminal 2 (frontend, port 5173):**

```bash
./scripts/dev_frontend.sh
```

### 5. Open the application

- **Frontend UI**: [http://localhost:5173](http://localhost:5173)
- **Backend health check**: [http://localhost:7860/health](http://localhost:7860/health)

---

## Docker Deployment

### Build and run (single container, production mode)

```bash
docker build -t polypharmacy-env .
docker run -p 7860:7860 polypharmacy-env
```

The UI and API are both served from port 7860.

### Development mode (separate services)

```bash
docker compose up --build
```

- Backend: port 7860
- Frontend: port 5173

---

## Hugging Face Spaces Deployment

### 1. Create a new Space

- Go to [Hugging Face Spaces](https://huggingface.co/new-space)
- Choose the **Docker** SDK
- Tag the Space with `openenv`

### 2. Set secrets and variables

In Space Settings → Variables and Secrets:

| Type | Key | Value |
|---|---|---|
| Secret | `HF_TOKEN` | Your Hugging Face API token |
| Variable | `API_BASE_URL` | `https://router.huggingface.co/v1` |
| Variable | `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` |

### 3. Push the repository to the Space

```bash
git remote add space https://huggingface.co/spaces/<username>/<space-name>
git push space master
```

### 4. Verify

- The Space root URL loads the React UI
- `/health` returns a healthy status
- `/reset`, `/step`, `/state` respond to API calls

---

## API Reference

### OpenEnv Endpoints

| Method | Path | Description |
|---|---|---|
| `POST` | `/reset` | Start a new episode. Body: `{ "task_id": "easy_screening" }` |
| `POST` | `/step` | Execute an action. Body: `{ "action": { "action_type": "query_ddi", ... } }` |
| `GET` | `/state` | Get current episode state |
| `GET` | `/health` | Health check |
| `GET` | `/schema` | Action/observation schema |
| `WS` | `/ws` | WebSocket for stateful multi-step sessions |

### Additional Endpoints

| Method | Path | Description |
|---|---|---|
| `POST` | `/agent/suggest` | AI-powered action suggestion. Body: `{ "observation": {...} }` |

---

## Running the Baseline Inference

```bash
# Set required environment variables
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token"

# Start the environment server (in another terminal)
./scripts/dev_backend.sh

# Run inference
python inference.py
```

The inference script runs all 3 tasks and emits structured `[START]`, `[STEP]`, `[END]` logs for the evaluator.

---

## RL Training (REINFORCE with Learned Baseline)

The repository includes `train_rl.py`, a complete **REINFORCE policy gradient** training loop that trains a neural network policy directly against the environment's shaped reward signal.
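For orientation, a generic sketch of the REINFORCE-with-baseline update that this loop performs (simplified PyTorch with illustrative names, not the exact `train_rl.py` code; the actual architecture and masking are described in the table below):

```python
# Generic REINFORCE-with-baseline loss: policy gradient weighted by
# advantage (discounted return minus value baseline) plus an entropy
# bonus for exploration. Illustrative only.
import torch
import torch.nn.functional as F

def reinforce_loss(log_probs, values, rewards, entropies,
                   gamma=0.99, entropy_coeff=0.02):
    # Discounted returns, computed backwards over one episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    values = torch.stack(values).squeeze(-1)
    advantages = returns - values.detach()

    policy_loss = -(torch.stack(log_probs) * advantages).mean()
    value_loss = F.mse_loss(values, returns)       # trains the baseline
    entropy_bonus = torch.stack(entropies).mean()  # encourages exploration
    return policy_loss + value_loss - entropy_coeff * entropy_bonus
```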
### How It Works

| Component | Description |
|---|---|
| **State encoder** | 16-dimensional feature vector: med count, high-risk drug count, Beers-flagged drugs, budget utilization, query outcomes (severe/moderate fractions), step progress, pair coverage |
| **Policy network** | 3-layer MLP (16 → 128 → 128 → 166) with ReLU, outputs masked logits over the discrete action space |
| **Value baseline** | 3-layer MLP (16 → 128 → 64 → 1) trained with MSE against discounted returns |
| **Action space** | 166 discrete actions: 105 query_ddi pairs (C(15,2)), 60 interventions (4 types × 15 slots), 1 finish_review |
| **Action masking** | Invalid actions (exhausted budgets, already-queried pairs, empty drug slots) are masked to `-inf` before softmax (see the sketch at the end of this section) |
| **Optimization** | REINFORCE with advantage (return - baseline), entropy bonus for exploration, gradient clipping |

### Training

```bash
# Install PyTorch (CPU is sufficient)
pip install torch --index-url https://download.pytorch.org/whl/cpu

# Train on easy task (fast, ~30s)
python train_rl.py --task easy_screening --episodes 200

# Train on medium task
python train_rl.py --task budgeted_screening --episodes 500

# Train on hard task (longer episodes)
python train_rl.py --task complex_tradeoff --episodes 500 --batch-size 10

# Full options
python train_rl.py --task easy_screening --episodes 200 \
    --lr 0.0003 --gamma 0.99 --entropy-coeff 0.02 \
    --hidden-dim 128 --batch-size 5 --print-every 10
```

**Outputs:**

- Policy checkpoints: `backend/src/polypharmacy_env/checkpoints/best_{task}.pt` and `final_{task}.pt`
- Training metrics: `training_metrics.json` (per-episode rewards, grader scores, losses)

### Observed Training Results

| Task | Episodes | Greedy Eval (Grader Score) | Stochastic Eval |
|---|---|---|---|
| Easy Screening | 200 | **0.698** | 0.475 |
| Budgeted Screening | 200 | **0.195** | 0.170 |
| Complex Tradeoff | 200 | **0.040** | 0.035 |

The easy task shows clear policy improvement. The medium and hard tasks benefit from more episodes (500+) and hyperparameter tuning: their larger action spaces and longer episodes create a harder credit assignment problem, exactly as designed.

### Integration with OpenEnv Training Pipeline

For production-scale training, this environment is compatible with **TRL's `GRPOTrainer`** via OpenEnv's standard interface:

```python
# Conceptual integration with TRL GRPO
from trl import GRPOTrainer
from openenv import GenericEnvClient

def rollout_func(prompts, trainer):
    env = GenericEnvClient("ws://localhost:7860/ws")
    # ... collect trajectories with token-level logprobs
    # ... return prompt_ids, completion_ids, logprobs, rewards

trainer = GRPOTrainer(model, rollout_function=rollout_func, ...)
trainer.train()
```

The included `train_rl.py` demonstrates the core RL loop with a lightweight MLP policy. For LLM-based policies, connect TRL/veRL/SkyRL to this environment via the WebSocket or HTTP interface.
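A minimal sketch of the action-masking step referenced in the How It Works table (illustrative PyTorch; the actual mask construction lives in `train_rl.py`, and the ordering of the 166 action slots is an assumption):

```python
# Invalid actions are masked to -inf so softmax assigns them exactly
# zero probability. The 105-query / 60-intervention / 1-finish slot
# ordering below is an assumption for illustration.
import torch

def masked_policy(logits: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    masked = logits.masked_fill(~valid, float("-inf"))
    return torch.softmax(masked, dim=-1)

logits = torch.randn(166)                  # raw policy scores
valid = torch.ones(166, dtype=torch.bool)
valid[:105] = False                        # e.g., query budget exhausted
probs = masked_policy(logits, valid)
assert probs[:105].sum().item() == 0.0     # masked actions never sampled
```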
---

## Neural Bandit Training (OptimNeuralTS)

The repository implements the **OptimNeuralTS** algorithm from the reference paper. It combines Neural Thompson Sampling with Differential Evolution to efficiently search for dangerous drug combinations in a large combinatorial space.

### How OptimNeuralTS Works

| Phase | What Happens |
|---|---|
| **Warm-up** | Randomly sample drug combinations and observe their risk scores to initialize the model |
| **Neural Thompson Sampling** | A neural network predicts risk for any drug combination, while gradient-based uncertainty drives exploration toward combinations that could be dangerous |
| **Differential Evolution** | Evolves a population of candidate drug combinations, guided by the neural network, to propose new combinations worth investigating |
| **Nearest-neighbor mapping** | Since DE can suggest combinations not in the dataset, each suggestion is mapped to the closest real combination by Hamming distance |
| **Ensemble building** | Each training step saves a model snapshot; the final ensemble combines all snapshots for high-precision predictions |

### Key Components (in `neural_bandits.py`)

| Component | Description |
|---|---|
| `RewardNetwork` | Neural network that predicts the Relative Risk (RR) for a multi-hot drug combination vector |
| `NeuralTS` | Thompson Sampling agent using gradient-based uncertainty: `s_t(x) = sqrt(λ · g(x)^T · U^{-1} · g(x))` |
| `differential_evolution()` | DE best/1/bin optimization over the multi-hot feature space |
| `OptimNeuralTS` | Full pipeline: warm-up → NeuralTS+DE exploration → ensemble building |

### Training

```bash
# Quick run (small dataset, fast)
python train_bandit.py --total-steps 500 --warmup-steps 100

# Full training (closer to paper settings)
python train_bandit.py --total-steps 3000 --warmup-steps 500 --n-combinations 10000

# Custom DE parameters
python train_bandit.py --de-population 32 --de-steps 16 --de-crossover 0.9

# All options
python train_bandit.py --help
```

**Outputs:**

- Ensemble model: `backend/src/polypharmacy_env/checkpoints/bandit_ensemble.pt`
- Training metrics: `bandit_metrics.json` (precision, recall, patterns detected at each eval step)

### API Endpoints

The trained ensemble is also accessible via the API:

| Method | Path | Description |
|---|---|---|
| `POST` | `/bandit/predict` | Predict risk for a single drug combination |
| `POST` | `/bandit/screen` | Screen multiple combinations in bulk |
| `GET` | `/bandit/metrics` | Get current bandit training metrics |

---

## Testing & Validation

```bash
# Unit tests
python -m pytest backend/src/polypharmacy_env/tests -v

# Full validation (tests + heuristic baseline)
./scripts/run_validation.sh

# OpenEnv spec validation
openenv validate
```
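For a sense of what the unit tests cover, a minimal example in the style of `test_env_core.py` (the class name, import path, and method signatures are assumptions based on the repository layout, not the actual test code):

```python
# Illustrative unit test: a fresh episode exposes budgets and a
# medication list, and finish_review ends it with a graded score.
# PolypharmacyEnv / PolypharmacyAction names are assumed from the
# repo structure; the real signatures may differ.
from polypharmacy_env.env_core import PolypharmacyEnv
from polypharmacy_env.models import PolypharmacyAction

def test_reset_and_finish():
    env = PolypharmacyEnv()
    obs = env.reset(task_id="easy_screening")
    assert obs.remaining_query_budget > 0
    assert 3 <= len(obs.current_medications) <= 5   # easy task regimen size

    obs = env.step(PolypharmacyAction(action_type="finish_review"))
    assert obs.done
    assert 0.001 <= obs.shaped_reward <= 0.999
```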
---

## Data Sources & Future Plans

### Current Implementation

- **Drug interaction data**: Extracted from curated clinical databases and research literature, yielding 24 DDI pairs across 33 drugs, 15 Beers-criteria entries, and 120 patient episodes across 3 difficulty levels. Data is stored as CSV for deterministic, reproducible evaluation.
- **RL training**: A lightweight REINFORCE policy gradient training loop (`train_rl.py`) trains a neural network (MLP) policy directly against the environment's shaped reward signal. This validates the MDP formulation and demonstrates that the reward shaping enables genuine policy improvement: the trained policy achieves a 0.698 grader score on easy screening after 200 episodes.

### Planned Enhancements

- **Full-scale GRPO training on GPU**: We are provisioning AWS GPU resources (A100/H100 instances) to run full-scale GRPO (Group Relative Policy Optimization) training using TRL's `GRPOTrainer` with LLM-based policies. This will train language models to generate optimal clinical actions by collecting batched rollouts against the environment and computing policy gradient updates on token-level log-probabilities. The OpenEnv WebSocket interface enables the high-throughput parallel rollout collection needed for efficient GRPO training.
- **LLM fine-tuning via the OpenEnv training pipeline**: Integrate with the TRL, veRL, and SkyRL frameworks to fine-tune open-weight LLMs (Llama 3, Qwen 2.5) using the environment's shaped reward as the RL training signal, producing specialized clinical pharmacist agents.
- **Live drug database integration**: Connect directly to established drug interaction databases (DrugBank, RxNorm, the FDA Adverse Event Reporting System) for real-time DDI lookup instead of static CSV files, enabling the environment to scale to thousands of drug combinations.
- **EHR integration pipeline**: Develop FHIR-compatible data ingestion so the environment can accept de-identified electronic health record data, making it applicable to real hospital deployments.
- **Multi-agent training**: Extend the environment to support multi-agent scenarios where specialist agents (cardiologist, endocrinologist, etc.) must coordinate on a shared patient regimen.
- **Pharmacogenomics layer**: Incorporate genetic variant data (CYP450 metabolizer status) to personalize drug interaction severity, adding a pharmacogenomics dimension to the RL training signal.

---

## Architecture & Design Decisions

- **OpenEnv compliance**: Fully typed Pydantic models for Action, Observation, and State. The environment extends `openenv.core.env_server.interfaces.Environment`.
- **Shaped rewards**: A continuous reward signal at every step enables efficient RL training (not sparse end-of-episode rewards only).
- **Budget constraints**: Query and intervention budgets create a resource-allocation problem that makes the RL optimization non-trivial.
- **Critical drug handling**: The hard task penalizes stopping critical medications (warfarin, insulin, etc.) without substitution, teaching the agent real-world clinical constraints.
- **Deterministic graders**: All graders produce reproducible scores for consistent evaluation.

---

## Troubleshooting

| Issue | Solution |
|---|---|
| `ModuleNotFoundError: polypharmacy_env` | Start the backend via `./scripts/dev_backend.sh` from the repo root |
| `/agent/suggest` returns errors | Check `.env` for valid API keys, then restart the backend |
| UI shows stale data | Hard-refresh the browser (Ctrl+Shift+R) and click Reset Episode |
| Docker build fails | Ensure Docker has at least 4 GB of memory allocated |
| WebSocket connection refused | Verify the backend is running on port 7860 |

---

## License

MIT