---
title: PolypharmacyEnv
emoji: 💊
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - healthcare
  - polypharmacy
pinned: false
---

# PolypharmacyEnv: Elderly Medication Safety via Reinforcement Learning

An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant environment that simulates **elderly polypharmacy medication review**. An RL agent acts as a clinical pharmacist assistant: it queries drug-drug interactions (DDIs), identifies Beers-criteria violations, and proposes safe interventions, all under resource-constrained budgets.

Built for the **PyTorch OpenEnv Hackathon** to demonstrate how clinical decision support for polypharmacy can be framed as a sequential RL problem and served as a reusable environment through the OpenEnv hub.

---

## Why This Matters

Polypharmacy, the simultaneous use of five or more medications, affects the majority of adults over 65. Elderly patients often see multiple specialists who may not be aware of each other's prescriptions, leading to dangerous drug combinations. Studies report that **adverse drug events from polypharmacy contribute to 100,000+ hospitalizations annually** in the US alone.

Current solutions use static risk scoring. PolypharmacyEnv goes further by framing medication review as a **sequential decision problem**, in which an RL agent must strategically allocate limited query and intervention budgets to maximize patient safety: exactly the kind of resource-constrained optimization that reinforcement learning excels at.

**Reference**: Larouche, A., Durand, A., Khoury, R. & Sirois, C. (2023). [Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy](https://link.springer.com/chapter/10.1007/978-3-031-36938-4_5). *Advances in Artificial Intelligence*, Springer.

---

## How OpenEnv & RL Power This

### The RL Formulation

PolypharmacyEnv frames medication review as a **Markov Decision Process (MDP)**:

- **State**: Patient profile (age, conditions, organ function) + current medication list + interaction history
- **Action space**: `query_ddi(drug_i, drug_j)` | `propose_intervention(target, type)` | `finish_review`
- **Reward**: Shaped, dense signal at every step (not a sparse end-of-episode score), strictly in the range (0.001, 0.999). Queries carry a small cost, but discovering severe DDIs earns a larger bonus; successful interventions earn a reward proportional to the risk reduction. Invalid actions and timeouts are penalized, but all values are clamped to stay positive. `finish_review` triggers a grader that returns a terminal score in (0.001, 0.999).
- **Constraint**: Finite query and intervention budgets, creating a resource-allocation optimization problem.

This MDP is what makes the problem fundamentally different from static risk scoring: the agent must **decide what information to acquire** (which drug pairs to query) and **which interventions to prioritize**, all under budget constraints, a sequential decision problem that RL is designed to solve.

### OpenEnv Interface

PolypharmacyEnv implements the full **OpenEnv standard**:

- **`reset()`**: Generates a new patient scenario (age, conditions, medication list)
- **`step(action)`**: Processes an agent action, updates the regimen state, and returns a shaped reward
- **`state()`**: Returns the current episode snapshot

All models are typed Pydantic classes extending the OpenEnv base types (`PolypharmacyAction`, `PolypharmacyObservation`, `PolypharmacyState`).
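A minimal sketch of driving one episode over the HTTP interface (endpoint paths and payload shapes follow the API Reference section below; the drug IDs are placeholders, not real identifiers):

```python
# Minimal episode driver against a locally running PolypharmacyEnv.
# Endpoint paths and body shapes follow the API Reference section;
# the drug IDs below are placeholders.
import requests

BASE = "http://localhost:7860"

# Start a new episode on the easy task.
obs = requests.post(f"{BASE}/reset", json={"task_id": "easy_screening"}).json()
print(obs)

# Spend one query, then end the review to trigger the grader.
for action in [
    {"action_type": "query_ddi", "drug_id_1": "drug_a", "drug_id_2": "drug_b"},
    {"action_type": "finish_review"},
]:
    result = requests.post(f"{BASE}/step", json={"action": action}).json()
    print(result)

# Inspect the final episode snapshot.
print(requests.get(f"{BASE}/state").json())
```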
### What the Environment Enables

The shaped reward function provides a continuous signal over the full trajectory, making this environment compatible with standard RL training approaches:

- **Policy gradient methods** (REINFORCE, PPO, GRPO): The per-step reward signal allows policy networks to learn query prioritization and intervention strategies.
- **OpenEnv training pipeline**: Through OpenEnv's `step()`/`reset()` HTTP interface, external RL training loops can connect to this environment and train policies without modification.
- **Neural Bandits (OptimNeuralTS)**: The budget-constrained query selection implements the OptimNeuralTS approach from the reference paper: Neural Thompson Sampling combined with Differential Evolution for efficient search.

### Included Agents

The repository ships with multiple agent implementations spanning rule-based, RL-trained, bandit-based, and LLM-based approaches:

- **OptimNeuralTS bandit** (`train_bandit.py`, `neural_bandits.py`): Implements the paper's core algorithm, Neural Thompson Sampling with Differential Evolution, to efficiently search for dangerous drug combinations. Builds an ensemble of models across training steps for high-precision predictions.
- **REINFORCE-trained policy** (`train_rl.py`): A neural network policy trained via REINFORCE with a learned baseline against the environment's shaped reward. Demonstrates that the MDP formulation and reward shaping enable genuine policy improvement through RL training.
- **Heuristic agent** (`baselines/heuristic_agent.py`): Deterministic rule-based strategy that queries high-risk drug pairs first, then intervenes on severe DDIs (condensed sketch below). Serves as a strong domain-knowledge baseline.
- **LLM agent** (`inference.py`): Uses an LLM (Qwen2.5-72B via an OpenAI-compatible API) for zero-shot action generation. Demonstrates baseline LLM performance without RL fine-tuning.
- **AI suggestion endpoint** (`/agent/suggest`): LLM-powered action suggestions with rule-based guardrails for the interactive UI.
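To make the rule-based baseline concrete, here is a condensed sketch of the heuristic strategy (illustrative only; the helper name and the observation field details such as `drug_id` and `severity` are assumptions, not the actual `heuristic_agent.py` code):

```python
# Condensed heuristic policy: query unexplored drug pairs while budget
# remains, intervene on drugs implicated in severe DDIs, then finish.
# Field names like "severity" inside query records are assumptions.
from itertools import combinations

def heuristic_action(obs: dict) -> dict:
    meds = [m["drug_id"] for m in obs["current_medications"]]
    queried = {frozenset((q["drug_id_1"], q["drug_id_2"]))
               for q in obs["interaction_queries"]}

    # 1) Spend query budget on pairs not yet checked.
    if obs["remaining_query_budget"] > 0:
        for a, b in combinations(meds, 2):
            if frozenset((a, b)) not in queried:
                return {"action_type": "query_ddi",
                        "drug_id_1": a, "drug_id_2": b}

    # 2) Spend intervention budget on drugs involved in severe findings.
    if obs["remaining_intervention_budget"] > 0:
        for q in obs["interaction_queries"]:
            if q.get("severity") == "severe":
                return {"action_type": "propose_intervention",
                        "target_drug_id": q["drug_id_1"],
                        "intervention_type": "substitute"}

    # 3) Nothing left to do: trigger the grader.
    return {"action_type": "finish_review"}
```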
---

## Repository Structure

```
├── backend/
│   ├── main.py                      # ASGI entrypoint (uvicorn target)
│   ├── requirements.txt             # Python dependencies
│   └── src/polypharmacy_env/
│       ├── env_core.py              # OpenEnv environment: reset/step/state
│       ├── models.py                # Typed Pydantic models (Action, Observation, State)
│       ├── rewards.py               # Shaped reward function & regimen risk computation
│       ├── graders.py               # Deterministic graders for 3 task difficulties
│       ├── tasks.py                 # Task configuration & episode sampling
│       ├── config.py                # Reward hyperparameters & task parameters
│       ├── data_loader.py           # CSV data loading with caching
│       ├── ddi_simulator.py         # DDI lookup, Beers flags, drug substitution
│       ├── neural_bandits.py        # NeuralTS + Differential Evolution + OptimNeuralTS
│       ├── api/
│       │   ├── app.py               # FastAPI app factory via OpenEnv create_app
│       │   └── routes/
│       │       ├── agent.py         # POST /agent/suggest (AI-assisted actions)
│       │       └── bandit.py        # POST /bandit/predict, /bandit/screen
│       ├── baselines/
│       │   ├── heuristic_agent.py   # Deterministic baseline agent
│       │   └── random_agent.py      # Random baseline agent
│       ├── services/
│       │   └── groq_agent.py        # LLM-powered action suggestions
│       └── tests/
│           ├── test_env_core.py     # Environment unit tests
│           └── test_api.py          # HTTP + WebSocket integration tests
├── frontend/
│   ├── src/
│   │   ├── App.jsx                  # React control center UI
│   │   └── styles.css               # Production-quality dark theme
│   ├── package.json
│   └── vite.config.js
├── data/
│   ├── lookups/                     # drug_metadata.csv, ddi_rules.csv, beers_criteria.csv
│   └── processed/                   # patients_polypharmacy.csv (120 episodes)
├── scripts/
│   ├── preprocess_data.py           # Synthetic data generation
│   ├── dev_backend.sh               # Local backend runner
│   ├── dev_frontend.sh              # Local frontend runner
│   └── run_validation.sh            # Automated test + baseline validation
├── Dockerfile                       # Production multi-stage build (frontend + backend)
├── docker-compose.yml               # Development orchestration
├── inference.py                     # Submission baseline inference script
├── train_rl.py                      # REINFORCE RL training script (PyTorch)
├── train_bandit.py                  # OptimNeuralTS neural bandit training
├── openenv.yaml                     # OpenEnv manifest
└── .env.example                     # Environment variable template
```

---

## Action & Observation Spaces

### Actions

| Action Type | Parameters | Description |
|---|---|---|
| `query_ddi` | `drug_id_1`, `drug_id_2` | Check a drug pair for interactions. Returns severity, recommendation, and risk score. Costs 1 query budget. |
| `propose_intervention` | `target_drug_id`, `intervention_type`, `proposed_new_drug_id` (opt), `rationale` (opt) | Modify the medication regimen. Types: `stop`, `dose_reduce`, `substitute`, `add_monitoring`. Costs 1 intervention budget. |
| `finish_review` | — | End the episode. Triggers grader evaluation and returns the final score. |
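For reference, the JSON action payloads that `/step` accepts look like this (shapes inferred from the table above and the API Reference section; the drug IDs are placeholders):

```python
# One example payload per action type; wrap each in {"action": ...}
# when POSTing to /step. Drug IDs here are placeholders.
query = {
    "action_type": "query_ddi",
    "drug_id_1": "drug_a",
    "drug_id_2": "drug_b",
}

intervene = {
    "action_type": "propose_intervention",
    "target_drug_id": "drug_a",
    "intervention_type": "substitute",      # stop | dose_reduce | substitute | add_monitoring
    "proposed_new_drug_id": "drug_c",       # optional
    "rationale": "Severe DDI with drug_b",  # optional
}

finish = {"action_type": "finish_review"}
```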
### Observations

Each observation contains the full patient context:

| Field | Type | Description |
|---|---|---|
| `episode_id` | string | Unique episode identifier |
| `task_id` | string | Current task (easy_screening / budgeted_screening / complex_tradeoff) |
| `age`, `sex` | int, string | Patient demographics |
| `conditions` | list[string] | Active medical conditions |
| `eGFR_category`, `liver_function_category` | string | Organ function status |
| `current_medications` | list[MedicationEntry] | Active drugs with dose, ATC class, Beers flags |
| `interaction_queries` | list[InteractionQueryRecord] | History of DDI queries and results |
| `interventions` | list[InterventionRecord] | History of proposed interventions |
| `remaining_query_budget` | int | Remaining DDI query budget |
| `remaining_intervention_budget` | int | Remaining intervention budget |
| `shaped_reward` | float | Step reward signal |
| `done` | bool | Whether the episode has ended |

---

## Tasks & Difficulty Progression

| Task | Difficulty | Drugs | Query Budget | Intervention Budget | Max Steps | Description |
|---|---|---|---|---|---|---|
| **Easy Screening** | Easy | 3-5 | 4 | 2 | 10 | Small regimen with one severe DDI. Identify and resolve it. |
| **Budgeted Screening** | Medium | 6-10 | 8 | 3 | 20 | Multiple DDIs and Beers issues under tighter budgets. Must prioritize effectively. |
| **Complex Tradeoff** | Hard | 10-15 | 12 | 5 | 30 | Large regimen with critical drugs (warfarin, insulin). Balance risk reduction against regimen disruption. |

### Grading Criteria

- **Easy**: 50% risk reduction + 50% targeted intervention on severe DDI drugs
- **Medium**: 50% risk reduction + 30% intervention precision + 20% query efficiency
- **Hard**: Risk reduction minus penalties for excessive drug changes and for stopping critical medications without substitution

All graders are deterministic, producing scores strictly in `(0.001, 0.999)`.

---

## Reward Function Design

The shaped reward provides a signal at every step, not just at episode end. All rewards are strictly positive, clamped to the range **(0.001, 0.999)**:

| Event | Raw Signal | Clamped Output |
|---|---|---|
| DDI query (no finding) | small cost | 0.001 (floor) |
| Discovering a severe DDI | cost + bonus | ~0.035 |
| Discovering a moderate DDI | cost + bonus | ~0.005 |
| Successful intervention | risk_reduction - cost | proportional to risk improvement |
| Invalid action | penalty | 0.001 (floor) |
| Episode timeout | penalty | 0.001 (floor) |
| Finish review | grader_score | 0.001-0.999 |

**Regimen risk** aggregates pairwise DDI scores, Beers-criteria violation weights, and high-risk elderly drug penalties, normalized by regimen size and clipped to `[0.0, 1.0]`.
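A minimal sketch of the clamping behavior described above (the actual shaping coefficients live in `config.py`; the raw signal values below are illustrative assumptions chosen to match the table):

```python
# All shaped rewards are squeezed into (0.001, 0.999), so penalties
# never go negative and no single step dominates the trajectory.
REWARD_FLOOR, REWARD_CEIL = 0.001, 0.999

def clamp_reward(raw: float) -> float:
    return max(REWARD_FLOOR, min(REWARD_CEIL, raw))

# Illustrative raw signals (coefficients are assumptions, see config.py):
print(clamp_reward(-0.010))           # query with no finding -> 0.001 floor
print(clamp_reward(-0.005 + 0.040))   # severe DDI discovery  -> 0.035
print(clamp_reward(-0.050))           # invalid action        -> 0.001 floor
```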
---

## Prerequisites

- **Python** 3.10+
- **Node.js** 18+ (20+ recommended)
- **Docker** + Docker Compose (for containerized runs)

---

## Setup & Local Development

### 1. Clone and configure

```bash
git clone <repository-url>
cd PolypharmacyEnv
cp .env.example .env
# Edit .env with your API keys if using the AI suggestion feature
```

### 2. Install dependencies

```bash
# Backend
pip install -r backend/requirements.txt

# Frontend
cd frontend && npm install && cd ..
```

### 3. Generate synthetic data (if not already present)

```bash
python scripts/preprocess_data.py
```

### 4. Start services

**Terminal 1 (backend, port 7860):**

```bash
./scripts/dev_backend.sh
```

**Terminal 2 (frontend, port 5173):**

```bash
./scripts/dev_frontend.sh
```

### 5. Open the application

- **Frontend UI**: [http://localhost:5173](http://localhost:5173)
- **Backend health check**: [http://localhost:7860/health](http://localhost:7860/health)

---

## Docker Deployment

### Build and run (single container, production mode)

```bash
docker build -t polypharmacy-env .
docker run -p 7860:7860 polypharmacy-env
```

The UI and API are both served from port 7860.

### Development mode (separate services)

```bash
docker compose up --build
```

- Backend: port 7860
- Frontend: port 5173

---

## Hugging Face Spaces Deployment

### 1. Create a new Space

- Go to [Hugging Face Spaces](https://huggingface.co/new-space)
- Choose the **Docker** SDK
- Tag the Space with `openenv`

### 2. Set secrets and variables

In Space Settings → Variables and Secrets:

| Type | Key | Value |
|---|---|---|
| Secret | `HF_TOKEN` | Your Hugging Face API token |
| Variable | `API_BASE_URL` | `https://router.huggingface.co/v1` |
| Variable | `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` |

### 3. Push the repository to the Space

```bash
git remote add space https://huggingface.co/spaces/<username>/<space-name>
git push space master
```

### 4. Verify

- The Space root URL loads the React UI
- `/health` returns a healthy status
- `/reset`, `/step`, `/state` respond to API calls

---

## API Reference

### OpenEnv Endpoints

| Method | Path | Description |
|---|---|---|
| `POST` | `/reset` | Start a new episode. Body: `{ "task_id": "easy_screening" }` |
| `POST` | `/step` | Execute an action. Body: `{ "action": { "action_type": "query_ddi", ... } }` |
| `GET` | `/state` | Get current episode state |
| `GET` | `/health` | Health check |
| `GET` | `/schema` | Action/observation schema |
| `WS` | `/ws` | WebSocket for stateful multi-step sessions |

### Additional Endpoints

| Method | Path | Description |
|---|---|---|
| `POST` | `/agent/suggest` | AI-powered action suggestion. Body: `{ "observation": {...} }` |

---

## Running the Baseline Inference

```bash
# Set required environment variables
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token"

# Start the environment server (in another terminal)
./scripts/dev_backend.sh

# Run inference
python inference.py
```

The inference script runs all 3 tasks and emits structured `[START]`, `[STEP]`, `[END]` logs for the evaluator.

---

## RL Training (REINFORCE with Learned Baseline)

The repository includes `train_rl.py`, a complete **REINFORCE policy gradient** training loop that trains a neural network policy directly against the environment's shaped reward signal.
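For orientation, a generic sketch of the REINFORCE-with-baseline update that this loop performs (simplified PyTorch with illustrative names, not the exact `train_rl.py` code; the actual architecture and masking are described in the table below):

```python
# Generic REINFORCE-with-baseline loss: policy gradient weighted by
# advantage (discounted return minus value baseline) plus an entropy
# bonus for exploration. Illustrative only.
import torch
import torch.nn.functional as F

def reinforce_loss(log_probs, values, rewards, entropies,
                   gamma=0.99, entropy_coeff=0.02):
    # Discounted returns, computed backwards over one episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    values = torch.stack(values).squeeze(-1)
    advantages = returns - values.detach()

    policy_loss = -(torch.stack(log_probs) * advantages).mean()
    value_loss = F.mse_loss(values, returns)       # trains the baseline
    entropy_bonus = torch.stack(entropies).mean()  # encourages exploration
    return policy_loss + value_loss - entropy_coeff * entropy_bonus
```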
### How It Works

| Component | Description |
|---|---|
| **State encoder** | 16-dimensional feature vector: med count, high-risk drug count, Beers-flagged drugs, budget utilization, query outcomes (severe/moderate fractions), step progress, pair coverage |
| **Policy network** | 3-layer MLP (16 → 128 → 128 → 166) with ReLU, outputs masked logits over the discrete action space |
| **Value baseline** | 3-layer MLP (16 → 128 → 64 → 1) trained with MSE against discounted returns |
| **Action space** | 166 discrete actions: 105 query_ddi pairs (C(15,2)), 60 interventions (4 types × 15 slots), 1 finish_review |
| **Action masking** | Invalid actions (exhausted budgets, already-queried pairs, empty drug slots) are masked to `-inf` before softmax (see the sketch at the end of this section) |
| **Optimization** | REINFORCE with advantage (return - baseline), entropy bonus for exploration, gradient clipping |

### Training

```bash
# Install PyTorch (CPU is sufficient)
pip install torch --index-url https://download.pytorch.org/whl/cpu

# Train on easy task (fast, ~30s)
python train_rl.py --task easy_screening --episodes 200

# Train on medium task
python train_rl.py --task budgeted_screening --episodes 500

# Train on hard task (longer episodes)
python train_rl.py --task complex_tradeoff --episodes 500 --batch-size 10

# Full options
python train_rl.py --task easy_screening --episodes 200 \
    --lr 0.0003 --gamma 0.99 --entropy-coeff 0.02 \
    --hidden-dim 128 --batch-size 5 --print-every 10
```

**Outputs:**

- Policy checkpoints: `backend/src/polypharmacy_env/checkpoints/best_{task}.pt` and `final_{task}.pt`
- Training metrics: `training_metrics.json` (per-episode rewards, grader scores, losses)

### Observed Training Results

| Task | Episodes | Greedy Eval (Grader Score) | Stochastic Eval |
|---|---|---|---|
| Easy Screening | 200 | **0.698** | 0.475 |
| Budgeted Screening | 200 | **0.195** | 0.170 |
| Complex Tradeoff | 200 | **0.040** | 0.035 |

The easy task shows clear policy improvement. The medium and hard tasks benefit from more episodes (500+) and hyperparameter tuning: their larger action spaces and longer episodes create a harder credit assignment problem, exactly as designed.

### Integration with OpenEnv Training Pipeline

For production-scale training, this environment is compatible with **TRL's `GRPOTrainer`** via OpenEnv's standard interface:

```python
# Conceptual integration with TRL GRPO
from trl import GRPOTrainer
from openenv import GenericEnvClient

def rollout_func(prompts, trainer):
    env = GenericEnvClient("ws://localhost:7860/ws")
    # ... collect trajectories with token-level logprobs
    # ... return prompt_ids, completion_ids, logprobs, rewards

trainer = GRPOTrainer(model, rollout_function=rollout_func, ...)
trainer.train()
```

The included `train_rl.py` demonstrates the core RL loop with a lightweight MLP policy. For LLM-based policies, connect TRL/veRL/SkyRL to this environment via the WebSocket or HTTP interface.
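A minimal sketch of the action-masking step referenced in the How It Works table (illustrative PyTorch; the actual mask construction lives in `train_rl.py`, and the ordering of the 166 action slots is an assumption):

```python
# Invalid actions are masked to -inf so softmax assigns them exactly
# zero probability. The 105-query / 60-intervention / 1-finish slot
# ordering below is an assumption for illustration.
import torch

def masked_policy(logits: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    masked = logits.masked_fill(~valid, float("-inf"))
    return torch.softmax(masked, dim=-1)

logits = torch.randn(166)                  # raw policy scores
valid = torch.ones(166, dtype=torch.bool)
valid[:105] = False                        # e.g., query budget exhausted
probs = masked_policy(logits, valid)
assert probs[:105].sum().item() == 0.0     # masked actions never sampled
```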
---

## Neural Bandit Training (OptimNeuralTS)

The repository implements the **OptimNeuralTS** algorithm from the reference paper. It combines Neural Thompson Sampling with Differential Evolution to efficiently search for dangerous drug combinations in a large combinatorial space.

### How OptimNeuralTS Works

| Phase | What Happens |
|---|---|
| **Warm-up** | Randomly sample drug combinations and observe their risk scores to initialize the model |
| **Neural Thompson Sampling** | A neural network predicts risk for any drug combination, while gradient-based uncertainty drives exploration toward combinations that could be dangerous |
| **Differential Evolution** | Evolves a population of candidate drug combinations, guided by the neural network, to propose new combinations worth investigating |
| **Nearest-neighbor mapping** | Since DE can suggest combinations not in the dataset, each suggestion is mapped to the closest real combination by Hamming distance |
| **Ensemble building** | Each training step saves a model snapshot; the final ensemble combines all snapshots for high-precision predictions |

### Key Components (in `neural_bandits.py`)

| Component | Description |
|---|---|
| `RewardNetwork` | Neural network that predicts the Relative Risk (RR) for a multi-hot drug combination vector |
| `NeuralTS` | Thompson Sampling agent using gradient-based uncertainty: `s_t(x) = sqrt(λ · g(x)^T · U^{-1} · g(x))` |
| `differential_evolution()` | DE best/1/bin optimization over the multi-hot feature space |
| `OptimNeuralTS` | Full pipeline: warm-up → NeuralTS+DE exploration → ensemble building |

### Training

```bash
# Quick run (small dataset, fast)
python train_bandit.py --total-steps 500 --warmup-steps 100

# Full training (closer to paper settings)
python train_bandit.py --total-steps 3000 --warmup-steps 500 --n-combinations 10000

# Custom DE parameters
python train_bandit.py --de-population 32 --de-steps 16 --de-crossover 0.9

# All options
python train_bandit.py --help
```

**Outputs:**

- Ensemble model: `backend/src/polypharmacy_env/checkpoints/bandit_ensemble.pt`
- Training metrics: `bandit_metrics.json` (precision, recall, patterns detected at each eval step)

### API Endpoints

The trained ensemble is also accessible via the API:

| Method | Path | Description |
|---|---|---|
| `POST` | `/bandit/predict` | Predict risk for a single drug combination |
| `POST` | `/bandit/screen` | Screen multiple combinations in bulk |
| `GET` | `/bandit/metrics` | Get current bandit training metrics |

---

## Testing & Validation

```bash
# Unit tests
python -m pytest backend/src/polypharmacy_env/tests -v

# Full validation (tests + heuristic baseline)
./scripts/run_validation.sh

# OpenEnv spec validation
openenv validate
```
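For a sense of what the unit tests cover, a minimal example in the style of `test_env_core.py` (the class name, import path, and method signatures are assumptions based on the repository layout, not the actual test code):

```python
# Illustrative unit test: a fresh episode exposes budgets and a
# medication list, and finish_review ends it with a graded score.
# PolypharmacyEnv / PolypharmacyAction names are assumed from the
# repo structure; the real signatures may differ.
from polypharmacy_env.env_core import PolypharmacyEnv
from polypharmacy_env.models import PolypharmacyAction

def test_reset_and_finish():
    env = PolypharmacyEnv()
    obs = env.reset(task_id="easy_screening")
    assert obs.remaining_query_budget > 0
    assert 3 <= len(obs.current_medications) <= 5   # easy task regimen size

    obs = env.step(PolypharmacyAction(action_type="finish_review"))
    assert obs.done
    assert 0.001 <= obs.shaped_reward <= 0.999
```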
---

## Data Sources & Future Plans

### Current Implementation

- **Drug interaction data**: Extracted from curated clinical databases and research literature, yielding 24 DDI pairs across 33 drugs, 15 Beers-criteria entries, and 120 patient episodes across 3 difficulty levels. Data is stored as CSV for deterministic, reproducible evaluation.
- **RL training**: A lightweight REINFORCE policy gradient training loop (`train_rl.py`) trains a neural network (MLP) policy directly against the environment's shaped reward signal. This validates the MDP formulation and demonstrates that the reward shaping enables genuine policy improvement: the trained policy achieves a 0.698 grader score on easy screening after 200 episodes.

### Planned Enhancements

- **Full-scale GRPO training on GPU**: We are provisioning AWS GPU resources (A100/H100 instances) to run full-scale GRPO (Group Relative Policy Optimization) training using TRL's `GRPOTrainer` with LLM-based policies. This will train language models to generate optimal clinical actions by collecting batched rollouts against the environment and computing policy gradient updates on token-level log-probabilities. The OpenEnv WebSocket interface enables the high-throughput parallel rollout collection needed for efficient GRPO training.
- **LLM fine-tuning via the OpenEnv training pipeline**: Integrate with the TRL, veRL, and SkyRL frameworks to fine-tune open-weight LLMs (Llama 3, Qwen 2.5) using the environment's shaped reward as the RL training signal, producing specialized clinical pharmacist agents.
- **Live drug database integration**: Connect directly to established drug interaction databases (DrugBank, RxNorm, the FDA Adverse Event Reporting System) for real-time DDI lookup instead of static CSV files, enabling the environment to scale to thousands of drug combinations.
- **EHR integration pipeline**: Develop FHIR-compatible data ingestion so the environment can accept de-identified electronic health record data, making it applicable to real hospital deployments.
- **Multi-agent training**: Extend the environment to support multi-agent scenarios where specialist agents (cardiologist, endocrinologist, etc.) must coordinate on a shared patient regimen.
- **Pharmacogenomics layer**: Incorporate genetic variant data (CYP450 metabolizer status) to personalize drug interaction severity, adding a pharmacogenomics dimension to the RL training signal.

---

## Architecture & Design Decisions

- **OpenEnv compliance**: Fully typed Pydantic models for Action, Observation, and State. The environment extends `openenv.core.env_server.interfaces.Environment`.
- **Shaped rewards**: A continuous reward signal at every step enables efficient RL training (not sparse end-of-episode rewards only).
- **Budget constraints**: Query and intervention budgets create a resource-allocation problem that makes the RL optimization non-trivial.
- **Critical drug handling**: The hard task penalizes stopping critical medications (warfarin, insulin, etc.) without substitution, teaching the agent real-world clinical constraints.
- **Deterministic graders**: All graders produce reproducible scores for consistent evaluation.

---

## Troubleshooting

| Issue | Solution |
|---|---|
| `ModuleNotFoundError: polypharmacy_env` | Start the backend via `./scripts/dev_backend.sh` from the repo root |
| `/agent/suggest` returns errors | Check `.env` for valid API keys, then restart the backend |
| UI shows stale data | Hard-refresh the browser (Ctrl+Shift+R) and click Reset Episode |
| Docker build fails | Ensure Docker has at least 4 GB of memory allocated |
| WebSocket connection refused | Verify the backend is running on port 7860 |

---

## License

MIT