---
title: PolypharmacyEnv
emoji: 💊
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - healthcare
  - polypharmacy
pinned: false
---

# PolypharmacyEnv — Elderly Medication Safety via Reinforcement Learning

An OpenEnv-compliant environment that simulates elderly polypharmacy medication review. An RL agent acts as a clinical pharmacist assistant: it queries drug-drug interactions (DDIs), identifies Beers-criteria violations, and proposes safe interventions — all under resource-constrained budgets.

Built for the PyTorch OpenEnv Hackathon to demonstrate how clinical decision support for polypharmacy can be framed as a sequential RL problem and served as a reusable environment through the OpenEnv hub.


## Why This Matters

Polypharmacy — the simultaneous use of five or more medications — affects the majority of adults over 65. Elderly patients often see multiple specialists who may not be aware of each other's prescriptions, leading to dangerous drug combinations. Studies report that adverse drug events from polypharmacy contribute to 100,000+ hospitalizations annually in the US alone.

Current solutions use static risk scoring. PolypharmacyEnv goes further by framing medication review as a sequential decision problem, where an RL agent must strategically allocate limited query and intervention budgets to maximize patient safety — exactly the kind of resource-constrained optimization that reinforcement learning excels at.

Reference: Larouche, A., Durand, A., Khoury, R. & Sirois, C. (2023). Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy. Advances in Artificial Intelligence, Springer.


## How OpenEnv & RL Power This

### The RL Formulation

PolypharmacyEnv frames medication review as a Markov Decision Process (MDP):

  • State: Patient profile (age, conditions, organ function) + current medication list + interaction history
  • Action space: query_ddi(drug_i, drug_j) | propose_intervention(target, type) | finish_review
  • Reward: Shaped, dense signal at every step (not sparse end-of-episode), strictly in the range (0.001, 0.999). Queries have a small cost, but discovering severe DDIs earns a larger bonus. Successful interventions earn proportional risk reduction. Invalid actions and timeouts are penalized but all values are clamped to positive. finish_review triggers a grader returning a terminal score in (0.001, 0.999).
  • Constraint: Finite query and intervention budgets, creating a resource-allocation optimization problem.

This MDP is what makes the problem fundamentally different from static risk scoring: the agent must decide what information to acquire (which drug pairs to query) and which interventions to prioritize, all under budget constraints — a sequential decision problem that RL is designed to solve.
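
For intuition, the sketch below shows the per-step decision this MDP poses. It is a toy budget-aware policy, not one of the shipped agents; the observation keys follow the schema documented under "Observations", while the nested record fields (`drug_id`, `severity`) are illustrative assumptions.

```python
from itertools import combinations

def choose_action(obs: dict) -> dict:
    """Toy policy: query first, then fix the worst finding, then finish."""
    meds = [m["drug_id"] for m in obs["current_medications"]]  # field name assumed
    queried = {(q["drug_id_1"], q["drug_id_2"]) for q in obs["interaction_queries"]}
    unqueried = [p for p in combinations(meds, 2) if p not in queried]

    # Spend query budget first: information is cheap relative to a bad intervention.
    if obs["remaining_query_budget"] > 0 and unqueried:
        d1, d2 = unqueried[0]
        return {"action_type": "query_ddi", "drug_id_1": d1, "drug_id_2": d2}

    # Then intervene on a confirmed severe interaction.
    severe = [q for q in obs["interaction_queries"] if q.get("severity") == "severe"]
    if obs["remaining_intervention_budget"] > 0 and severe:
        return {"action_type": "propose_intervention",
                "target_drug_id": severe[0]["drug_id_1"],
                "intervention_type": "stop"}

    return {"action_type": "finish_review"}
```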

### OpenEnv Interface

PolypharmacyEnv implements the full OpenEnv standard:

  • reset() β€” Generates a new patient scenario (age, conditions, medication list)
  • step(action) β€” Processes an agent action, updates regimen state, returns shaped reward
  • state() β€” Returns the current episode snapshot

All payloads are typed Pydantic models extending OpenEnv base types (`PolypharmacyAction`, `PolypharmacyObservation`, `PolypharmacyState`), sketched below.
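
As a rough illustration of those typed models, the sketch below mirrors the fields documented in the action table further down; the real classes live in `backend/src/polypharmacy_env/models.py`, so treat the exact field names and defaults here as assumptions.

```python
from typing import Optional
from pydantic import BaseModel

# Illustrative shape only; the shipped class extends OpenEnv's base Action type.
class PolypharmacyAction(BaseModel):
    action_type: str                          # "query_ddi" | "propose_intervention" | "finish_review"
    drug_id_1: Optional[str] = None           # used by query_ddi
    drug_id_2: Optional[str] = None
    target_drug_id: Optional[str] = None      # used by propose_intervention
    intervention_type: Optional[str] = None   # "stop" | "dose_reduce" | "substitute" | "add_monitoring"
    proposed_new_drug_id: Optional[str] = None
    rationale: Optional[str] = None

action = PolypharmacyAction(action_type="query_ddi",
                            drug_id_1="warfarin", drug_id_2="aspirin")
```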

### What the Environment Enables

The shaped reward function provides continuous signal over the full trajectory, making this environment compatible with standard RL training approaches:

  • Policy gradient methods (REINFORCE, PPO, GRPO): The per-step reward signal allows policy networks to learn query prioritization and intervention strategies.
  • OpenEnv training pipeline: Through OpenEnv's step()/reset() HTTP interface, external RL training loops can connect to this environment and train policies without modification.
  • Neural Bandits (OptimNeuralTS): The budget-constrained query selection implements the OptimNeuralTS approach from the reference paper β€” Neural Thompson Sampling combined with Differential Evolution for efficient search.

### Included Agents

The repository ships with multiple agent implementations spanning rule-based, RL-trained, bandit-based, and LLM-based approaches:

  • OptimNeuralTS bandit (train_bandit.py, neural_bandits.py): Implements the paper's core algorithm β€” Neural Thompson Sampling with Differential Evolution to efficiently search for dangerous drug combinations. Builds an ensemble of models across training steps for high-precision predictions.
  • REINFORCE-trained policy (train_rl.py): A neural network policy trained via REINFORCE with learned baseline against the environment's shaped reward. Demonstrates that the MDP formulation and reward shaping enable genuine policy improvement through RL training.
  • Heuristic agent (baselines/heuristic_agent.py): Deterministic rule-based strategy that queries high-risk drug pairs first, then intervenes on severe DDIs. Serves as a strong domain-knowledge baseline.
  • LLM agent (inference.py): Uses an LLM (Qwen2.5-72B via OpenAI-compatible API) for zero-shot action generation. Demonstrates baseline LLM performance without RL fine-tuning.
  • AI suggestion endpoint (/agent/suggest): LLM-powered action suggestions with rule-based guardrails for the interactive UI.

## Repository Structure

```
├── backend/
│   ├── main.py                       # ASGI entrypoint (uvicorn target)
│   ├── requirements.txt              # Python dependencies
│   └── src/polypharmacy_env/
│       ├── env_core.py               # OpenEnv environment: reset/step/state
│       ├── models.py                 # Typed Pydantic models (Action, Observation, State)
│       ├── rewards.py                # Shaped reward function & regimen risk computation
│       ├── graders.py                # Deterministic graders for 3 task difficulties
│       ├── tasks.py                  # Task configuration & episode sampling
│       ├── config.py                 # Reward hyperparameters & task parameters
│       ├── data_loader.py            # CSV data loading with caching
│       ├── ddi_simulator.py          # DDI lookup, Beers flags, drug substitution
│       ├── neural_bandits.py         # NeuralTS + Differential Evolution + OptimNeuralTS
│       ├── api/
│       │   ├── app.py                # FastAPI app factory via OpenEnv create_app
│       │   └── routes/
│       │       ├── agent.py          # POST /agent/suggest (AI-assisted actions)
│       │       └── bandit.py         # POST /bandit/predict, /bandit/screen
│       ├── baselines/
│       │   ├── heuristic_agent.py    # Deterministic baseline agent
│       │   └── random_agent.py       # Random baseline agent
│       ├── services/
│       │   └── groq_agent.py         # LLM-powered action suggestions
│       └── tests/
│           ├── test_env_core.py      # Environment unit tests
│           └── test_api.py           # HTTP + WebSocket integration tests
├── frontend/
│   ├── src/
│   │   ├── App.jsx                   # React control center UI
│   │   └── styles.css                # Production-quality dark theme
│   ├── package.json
│   └── vite.config.js
├── data/
│   ├── lookups/                      # drug_metadata.csv, ddi_rules.csv, beers_criteria.csv
│   └── processed/                    # patients_polypharmacy.csv (120 episodes)
├── scripts/
│   ├── preprocess_data.py            # Synthetic data generation
│   ├── dev_backend.sh                # Local backend runner
│   ├── dev_frontend.sh               # Local frontend runner
│   └── run_validation.sh             # Automated test + baseline validation
├── Dockerfile                        # Production multi-stage build (frontend + backend)
├── docker-compose.yml                # Development orchestration
├── inference.py                      # Submission baseline inference script
├── train_rl.py                       # REINFORCE RL training script (PyTorch)
├── train_bandit.py                   # OptimNeuralTS neural bandit training
├── openenv.yaml                      # OpenEnv manifest
└── .env.example                      # Environment variable template
```

## Action & Observation Spaces

### Actions

| Action Type | Parameters | Description |
|---|---|---|
| `query_ddi` | `drug_id_1`, `drug_id_2` | Check a drug pair for interactions. Returns severity, recommendation, and risk score. Costs 1 query budget. |
| `propose_intervention` | `target_drug_id`, `intervention_type`, `proposed_new_drug_id` (optional), `rationale` (optional) | Modify the medication regimen. Types: `stop`, `dose_reduce`, `substitute`, `add_monitoring`. Costs 1 intervention budget. |
| `finish_review` | — | End the episode. Triggers grader evaluation and returns the final score. |

### Observations

Each observation contains the full patient context:

| Field | Type | Description |
|---|---|---|
| `episode_id` | string | Unique episode identifier |
| `task_id` | string | Current task (`easy_screening` / `budgeted_screening` / `complex_tradeoff`) |
| `age`, `sex` | int, string | Patient demographics |
| `conditions` | list[string] | Active medical conditions |
| `eGFR_category`, `liver_function_category` | string | Organ function status |
| `current_medications` | list[MedicationEntry] | Active drugs with dose, ATC class, Beers flags |
| `interaction_queries` | list[InteractionQueryRecord] | History of DDI queries and results |
| `interventions` | list[InterventionRecord] | History of proposed interventions |
| `remaining_query_budget` | int | Remaining DDI query budget |
| `remaining_intervention_budget` | int | Remaining intervention budget |
| `shaped_reward` | float | Step reward signal |
| `done` | bool | Whether the episode has ended |
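
For concreteness, a freshly reset `budgeted_screening` observation might look roughly like this (values and nested record fields are illustrative, not taken from the dataset):

```python
observation = {
    "episode_id": "ep_0042",
    "task_id": "budgeted_screening",
    "age": 78,
    "sex": "F",
    "conditions": ["atrial fibrillation", "type 2 diabetes"],
    "eGFR_category": "moderate",
    "liver_function_category": "normal",
    "current_medications": [
        {"drug_id": "warfarin", "dose": "5 mg", "atc_class": "B01AA03", "beers_flag": False},
        # ... remaining drugs in the regimen
    ],
    "interaction_queries": [],       # fills in as query budget is spent
    "interventions": [],
    "remaining_query_budget": 8,     # budgeted_screening starts with 8 queries
    "remaining_intervention_budget": 3,
    "shaped_reward": 0.001,
    "done": False,
}
```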

## Tasks & Difficulty Progression

| Task | Difficulty | Drugs | Query Budget | Intervention Budget | Max Steps | Description |
|---|---|---|---|---|---|---|
| Easy Screening | Easy | 3–5 | 4 | 2 | 10 | Small regimen with one severe DDI. Identify and resolve it. |
| Budgeted Screening | Medium | 6–10 | 8 | 3 | 20 | Multiple DDIs and Beers issues under tighter budgets. Must prioritize effectively. |
| Complex Tradeoff | Hard | 10–15 | 12 | 5 | 30 | Large regimen with critical drugs (warfarin, insulin). Balance risk reduction against regimen disruption. |

### Grading Criteria

  • Easy: 50% risk reduction + 50% targeted intervention on severe DDI drugs
  • Medium: 50% risk reduction + 30% intervention precision + 20% query efficiency
  • Hard: Risk reduction minus penalties for excessive drug changes and stopping critical medications without substitution

All graders are deterministic, producing scores strictly in (0.001, 0.999).
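
As a sketch of how the weighting above could compose into a score, here is the easy grader's shape; the real implementation lives in `graders.py`, so the input definitions are assumptions:

```python
def grade_easy(risk_reduction: float, targeted_severe_ddi: bool) -> float:
    """Sketch: 50% risk reduction + 50% targeted intervention on severe-DDI drugs.

    risk_reduction: fraction of regimen risk removed, assumed in [0, 1].
    targeted_severe_ddi: whether an intervention hit a severe-DDI drug.
    """
    raw = 0.5 * risk_reduction + 0.5 * (1.0 if targeted_severe_ddi else 0.0)
    # Grader scores stay strictly inside (0.001, 0.999).
    return min(max(raw, 0.001), 0.999)
```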


## Reward Function Design

The shaped reward provides signal at every step (not just episode end). All rewards are strictly positive, clamped to the range (0.001, 0.999):

| Event | Raw Signal | Clamped Output |
|---|---|---|
| DDI query (no finding) | small cost | 0.001 (floor) |
| Discovering a severe DDI | cost + bonus | ~0.035 |
| Discovering a moderate DDI | cost + bonus | ~0.005 |
| Successful intervention | risk_reduction − cost | proportional to risk improvement |
| Invalid action | penalty | 0.001 (floor) |
| Episode timeout | penalty | 0.001 (floor) |
| Finish review | grader_score | 0.001–0.999 |

Regimen risk aggregates DDI pairwise scores, Beers-criteria violation weights, and high-risk elderly drug penalties, normalized by regimen size and clipped to [0.0, 1.0].
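
A minimal sketch of the two conventions just described; the actual shaping terms and weights live in `rewards.py` and `config.py`, and the exact normalization is an assumption here:

```python
REWARD_FLOOR, REWARD_CEIL = 0.001, 0.999

def clamp_reward(raw: float) -> float:
    # Every per-step signal, penalties included, lands in (0.001, 0.999),
    # so the environment never emits zero or negative rewards.
    return min(max(raw, REWARD_FLOOR), REWARD_CEIL)

def regimen_risk(ddi_scores, beers_weights, high_risk_penalties, n_drugs):
    # Aggregate pairwise DDI scores, Beers violation weights, and high-risk
    # elderly-drug penalties, normalized by regimen size and clipped to [0, 1].
    total = sum(ddi_scores) + sum(beers_weights) + sum(high_risk_penalties)
    return min(max(total / max(n_drugs, 1), 0.0), 1.0)
```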


## Prerequisites

  • Python 3.10+
  • Node.js 18+ (20+ recommended)
  • Docker + Docker Compose (for containerized runs)

## Setup & Local Development

### 1. Clone and configure

```bash
git clone <repo-url>
cd PolypharmacyEnv
cp .env.example .env
# Edit .env with your API keys if using the AI suggestion feature
```

### 2. Install dependencies

```bash
# Backend
pip install -r backend/requirements.txt

# Frontend
cd frontend && npm install && cd ..
```

### 3. Generate synthetic data (if not already present)

```bash
python scripts/preprocess_data.py
```

### 4. Start services

Terminal 1 — backend (port 7860):

```bash
./scripts/dev_backend.sh
```

Terminal 2 — frontend (port 5173):

```bash
./scripts/dev_frontend.sh
```

### 5. Open the application

Open http://localhost:5173 in your browser (the backend API listens on port 7860).


## Docker Deployment

### Build and run (single container — production mode)

```bash
docker build -t polypharmacy-env .
docker run -p 7860:7860 polypharmacy-env
```

The UI and API are both served from port 7860.

### Development mode (separate services)

```bash
docker compose up --build
```

- Backend: port 7860
- Frontend: port 5173

## Hugging Face Spaces Deployment

### 1. Create a new Space

Create the Space with the Docker SDK; the repository ships its own Dockerfile.

### 2. Set secrets and variables

In Space Settings → Variables and Secrets:

| Type | Key | Value |
|---|---|---|
| Secret | `HF_TOKEN` | Your Hugging Face API token |
| Variable | `API_BASE_URL` | `https://router.huggingface.co/v1` |
| Variable | `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` |

### 3. Push the repository to the Space

```bash
git remote add space https://huggingface.co/spaces/<your-username>/<space-name>
git push space master
```

### 4. Verify

- The Space root URL loads the React UI
- `/health` returns a healthy status
- `/reset`, `/step`, and `/state` respond to API calls

## API Reference

### OpenEnv Endpoints

| Method | Path | Description |
|---|---|---|
| POST | `/reset` | Start a new episode. Body: `{ "task_id": "easy_screening" }` |
| POST | `/step` | Execute an action. Body: `{ "action": { "action_type": "query_ddi", ... } }` |
| GET | `/state` | Get the current episode state |
| GET | `/health` | Health check |
| GET | `/schema` | Action/observation schema |
| WS | `/ws` | WebSocket for stateful multi-step sessions |
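
A minimal client round-trip against these endpoints, assuming the server is running locally on port 7860 (the drug identifiers in the `query_ddi` body are illustrative):

```python
import requests

BASE = "http://localhost:7860"

# Start an episode on the easy task.
obs = requests.post(f"{BASE}/reset", json={"task_id": "easy_screening"}).json()

# Spend one unit of query budget on a drug pair.
step = requests.post(f"{BASE}/step", json={
    "action": {"action_type": "query_ddi",
               "drug_id_1": "warfarin", "drug_id_2": "aspirin"},
}).json()

# Inspect the episode snapshot, then end the review for a graded terminal score.
snapshot = requests.get(f"{BASE}/state").json()
final = requests.post(f"{BASE}/step",
                      json={"action": {"action_type": "finish_review"}}).json()
```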

### Additional Endpoints

| Method | Path | Description |
|---|---|---|
| POST | `/agent/suggest` | AI-powered action suggestion. Body: `{ "observation": {...} }` |

## Running the Baseline Inference

```bash
# Set required environment variables
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token"

# Start the environment server (in another terminal)
./scripts/dev_backend.sh

# Run inference
python inference.py
```

The inference script runs all 3 tasks and emits structured [START], [STEP], [END] logs for the evaluator.


## RL Training (REINFORCE with Learned Baseline)

The repository includes train_rl.py β€” a complete REINFORCE policy gradient training loop that trains a neural network policy directly against the environment's shaped reward signal.

### How It Works

| Component | Description |
|---|---|
| State encoder | 16-dimensional feature vector: medication count, high-risk drug count, Beers-flagged drugs, budget utilization, query outcomes (severe/moderate fractions), step progress, pair coverage |
| Policy network | 3-layer MLP (16 → 128 → 128 → 166) with ReLU; outputs masked logits over the discrete action space |
| Value baseline | 3-layer MLP (16 → 128 → 64 → 1) trained with MSE against discounted returns |
| Action space | 166 discrete actions: 105 `query_ddi` pairs (C(15, 2)), 60 interventions (4 types × 15 slots), 1 `finish_review` |
| Action masking | Invalid actions (exhausted budgets, already-queried pairs, empty drug slots) are masked to −inf before the softmax (see the sketch below) |
| Optimization | REINFORCE with advantage (return − baseline), entropy bonus for exploration, gradient clipping |
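
The masking step is worth seeing in code. Below is a sketch of the idea as described above; the mask construction in `train_rl.py` is more involved, but the shapes follow the 166-action layout:

```python
import torch

def masked_action_distribution(logits: torch.Tensor,
                               valid_mask: torch.Tensor) -> torch.distributions.Categorical:
    """logits: (166,) raw policy outputs; valid_mask: (166,) bool, True = legal.

    Illegal actions (exhausted budgets, already-queried pairs, empty drug
    slots) get -inf logits, so the softmax assigns them zero probability.
    """
    masked = logits.masked_fill(~valid_mask, float("-inf"))
    return torch.distributions.Categorical(logits=masked)

# Example: only finish_review (the last index) is still legal.
logits = torch.randn(166)
mask = torch.zeros(166, dtype=torch.bool)
mask[-1] = True
action = masked_action_distribution(logits, mask).sample()  # always index 165
```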

### Training

```bash
# Install PyTorch (CPU is sufficient)
pip install torch --index-url https://download.pytorch.org/whl/cpu

# Train on the easy task (fast, ~30 s)
python train_rl.py --task easy_screening --episodes 200

# Train on the medium task
python train_rl.py --task budgeted_screening --episodes 500

# Train on the hard task (longer episodes)
python train_rl.py --task complex_tradeoff --episodes 500 --batch-size 10

# Full options
python train_rl.py --task easy_screening --episodes 200 \
  --lr 0.0003 --gamma 0.99 --entropy-coeff 0.02 \
  --hidden-dim 128 --batch-size 5 --print-every 10
```

Outputs:

- Policy checkpoints: `backend/src/polypharmacy_env/checkpoints/best_{task}.pt` and `final_{task}.pt`
- Training metrics: `training_metrics.json` (per-episode rewards, grader scores, losses)

### Observed Training Results

| Task | Episodes | Greedy Eval (Grader Score) | Stochastic Eval |
|---|---|---|---|
| Easy Screening | 200 | 0.698 | 0.475 |
| Budgeted Screening | 200 | 0.195 | 0.170 |
| Complex Tradeoff | 200 | 0.040 | 0.035 |

The easy task shows clear policy improvement. Medium and hard tasks benefit from more episodes (500+) and hyperparameter tuning — the larger action spaces and longer episodes create a harder credit-assignment problem, exactly as designed.

### Integration with OpenEnv Training Pipeline

For production-scale training, this environment is compatible with TRL's GRPOTrainer via OpenEnv's standard interface:

```python
# Conceptual integration with TRL GRPO
from trl import GRPOTrainer
from openenv import GenericEnvClient

def rollout_func(prompts, trainer):
    env = GenericEnvClient("ws://localhost:7860/ws")
    # ... collect trajectories with token-level logprobs
    # ... return prompt_ids, completion_ids, logprobs, rewards

trainer = GRPOTrainer(model, rollout_function=rollout_func, ...)
trainer.train()
```

The included train_rl.py demonstrates the core RL loop with a lightweight MLP policy. For LLM-based policies, connect TRL/veRL/SkyRL to this environment via the WebSocket or HTTP interface.


## Neural Bandit Training (OptimNeuralTS)

The repository implements the OptimNeuralTS algorithm from the reference paper. This combines Neural Thompson Sampling with Differential Evolution to efficiently search for dangerous drug combinations in a large combinatorial space.

### How OptimNeuralTS Works

| Phase | What Happens |
|---|---|
| Warm-up | Randomly sample drug combinations and observe their risk scores to initialize the model's understanding |
| Neural Thompson Sampling | A neural network predicts risk for any drug combination, while gradient-based uncertainty drives exploration toward combinations that could be dangerous |
| Differential Evolution | Evolves a population of candidate drug combinations, guided by the neural network, to propose new combinations worth investigating |
| Nearest-neighbor mapping | Since DE can suggest combinations not in the dataset, we map to the closest real combination using Hamming distance (sketched below) |
| Ensemble building | Each training step saves a model snapshot; the final ensemble combines all snapshots for high-precision predictions |
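
The nearest-neighbor mapping step is simple enough to sketch directly; over multi-hot binary vectors, Hamming distance is just the count of differing bits (the dataset layout here is an assumption):

```python
import numpy as np

def nearest_real_combination(candidate: np.ndarray, dataset: np.ndarray) -> np.ndarray:
    """Map a DE-proposed multi-hot drug vector to the closest observed one.

    candidate: (n_drugs,) 0/1 vector; dataset: (n_rows, n_drugs) 0/1 matrix.
    """
    hamming = (dataset != candidate).sum(axis=1)  # differing bits per row
    return dataset[int(np.argmin(hamming))]
```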

### Key Components (in `neural_bandits.py`)

| Component | Description |
|---|---|
| `RewardNetwork` | Neural network that predicts the Relative Risk (RR) for a multi-hot drug-combination vector |
| `NeuralTS` | Thompson Sampling agent using gradient-based uncertainty: `s_t(x) = sqrt(λ · g(x)^T · U^{-1} · g(x))` |
| `differential_evolution()` | DE best/1/bin optimization over the multi-hot feature space |
| `OptimNeuralTS` | Full pipeline: warm-up → NeuralTS + DE exploration → ensemble building |
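
The uncertainty term from the `NeuralTS` row can be sketched as follows. This is a simplified dense-matrix version: it assumes the model maps one combination to a scalar risk and that `U_inv` is maintained elsewhere, whereas the shipped implementation may approximate `U` (e.g., diagonally) for speed:

```python
import torch

def neural_ts_uncertainty(model: torch.nn.Module, x: torch.Tensor,
                          U_inv: torch.Tensor, lam: float) -> torch.Tensor:
    """s_t(x) = sqrt(lam * g(x)^T U^{-1} g(x)), with g(x) the gradient of the
    predicted reward w.r.t. the flattened network parameters."""
    pred = model(x).squeeze()                            # scalar risk prediction
    grads = torch.autograd.grad(pred, list(model.parameters()))
    g = torch.cat([gr.reshape(-1) for gr in grads])      # flatten all param grads
    return torch.sqrt(lam * g @ U_inv @ g)

# Thompson Sampling then draws a score per candidate, e.g. from
# Normal(pred, s_t(x)), and explores the combination with the highest draw.
```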

### Training

```bash
# Quick run (small dataset, fast)
python train_bandit.py --total-steps 500 --warmup-steps 100

# Full training (closer to paper settings)
python train_bandit.py --total-steps 3000 --warmup-steps 500 --n-combinations 10000

# Custom DE parameters
python train_bandit.py --de-population 32 --de-steps 16 --de-crossover 0.9

# All options
python train_bandit.py --help
```

Outputs:

- Ensemble model: `backend/src/polypharmacy_env/checkpoints/bandit_ensemble.pt`
- Training metrics: `bandit_metrics.json` (precision, recall, and patterns detected at each eval step)

### API Endpoints

The trained ensemble is also accessible via API:

| Method | Path | Description |
|---|---|---|
| POST | `/bandit/predict` | Predict risk for a single drug combination |
| POST | `/bandit/screen` | Screen multiple combinations in bulk |
| GET | `/bandit/metrics` | Get current bandit training metrics |

## Testing & Validation

```bash
# Unit tests
python -m pytest backend/src/polypharmacy_env/tests -v

# Full validation (tests + heuristic baseline)
./scripts/run_validation.sh

# OpenEnv spec validation
openenv validate
```

## Data Sources & Future Plans

### Current Implementation

  • Drug interaction data: Currently extracted from curated clinical databases and research literature, generating 24 DDI pairs across 33 drugs, 15 Beers criteria entries, and 120 patient episodes across 3 difficulty levels. Data is stored as CSV for deterministic, reproducible evaluation.
  • RL training: A lightweight REINFORCE policy gradient training loop (train_rl.py) trains a neural network policy (MLP) directly against the environment's shaped reward signal. This validates the MDP formulation and demonstrates that the reward shaping enables genuine policy improvement. The trained policy achieves a 0.698 grader score on easy screening after 200 episodes.

### Planned Enhancements

  • Full-scale GRPO training on GPU: We are provisioning AWS GPU resources (A100/H100 instances) to run full-scale GRPO (Group Relative Policy Optimization) training using TRL's GRPOTrainer with LLM-based policies. This will train language models to generate optimal clinical actions by collecting batched rollouts against the environment and computing policy gradient updates on token-level log-probabilities. The OpenEnv WebSocket interface enables high-throughput parallel rollout collection needed for efficient GRPO training.
  • LLM fine-tuning via OpenEnv training pipeline: Integrate with TRL, veRL, and SkyRL frameworks to fine-tune open-weight LLMs (Llama 3, Qwen 2.5) using the environment's shaped reward as the RL training signal, producing specialized clinical pharmacist agents.
  • Live drug database integration: Connect directly to established drug interaction databases (DrugBank, RxNorm, FDA Adverse Event Reporting System) for real-time DDI lookup instead of static CSV files, enabling the environment to scale to thousands of drug combinations.
  • EHR integration pipeline: Develop FHIR-compatible data ingestion so the environment can accept de-identified electronic health record data, making it applicable to real hospital deployments.
  • Multi-agent training: Extend the environment to support multi-agent scenarios where specialist agents (cardiologist, endocrinologist, etc.) must coordinate on a shared patient regimen.
  • Pharmacogenomics layer: Incorporate genetic variant data (CYP450 metabolizer status) to personalize drug interaction severity, adding a pharmacogenomics dimension to the RL training signal.

## Architecture & Design Decisions

  • OpenEnv compliance: Full typed Pydantic models for Action, Observation, and State. Environment extends openenv.core.env_server.interfaces.Environment.
  • Shaped rewards: Continuous reward signal at every step to enable efficient RL training (not sparse end-of-episode only).
  • Budget constraints: Query and intervention budgets create a resource-allocation problem that makes the RL optimization non-trivial.
  • Critical drug handling: The hard task penalizes stopping critical medications (warfarin, insulin, etc.) without substitution, teaching the agent about real-world clinical constraints.
  • Deterministic graders: All graders produce reproducible scores for consistent evaluation.

## Troubleshooting

| Issue | Solution |
|---|---|
| `ModuleNotFoundError: polypharmacy_env` | Start the backend via `./scripts/dev_backend.sh` from the repo root |
| `/agent/suggest` returns errors | Check `.env` for valid API keys, then restart the backend |
| UI shows stale data | Hard-refresh the browser (Ctrl+Shift+R) and click Reset Episode |
| Docker build fails | Ensure Docker has at least 4 GB of memory allocated |
| WebSocket connection refused | Verify the backend is running on port 7860 |

## License

MIT