---
title: PolypharmacyEnv
emoji: 💊
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - healthcare
  - polypharmacy
pinned: false
---

# PolypharmacyEnv — Elderly Medication Safety via Reinforcement Learning

An OpenEnv-compliant environment that simulates elderly polypharmacy medication review. An RL agent acts as a clinical pharmacist assistant: it queries drug-drug interactions (DDIs), identifies Beers-criteria violations, and proposes safe interventions — all under resource-constrained budgets.

Built for the PyTorch OpenEnv Hackathon to demonstrate how clinical decision support for polypharmacy can be framed as a sequential RL problem and served as a reusable environment through the OpenEnv hub.


## Why This Matters

Polypharmacy — the simultaneous use of five or more medications — affects the majority of adults over 65. Elderly patients often see multiple specialists who may not be aware of each other's prescriptions, leading to dangerous drug combinations. Studies report that adverse drug events from polypharmacy contribute to 100,000+ hospitalizations annually in the US alone.

Current solutions use static risk scoring. PolypharmacyEnv goes further by framing medication review as a sequential decision problem, where an RL agent must strategically allocate limited query and intervention budgets to maximize patient safety — exactly the kind of resource-constrained optimization that reinforcement learning excels at.

Reference: Larouche, A., Durand, A., Khoury, R. & Sirois, C. (2023). Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy. Advances in Artificial Intelligence, Springer.


## How OpenEnv & RL Power This

### The RL Formulation

PolypharmacyEnv frames medication review as a Markov Decision Process (MDP):

  • State: Patient profile (age, conditions, organ function) + current medication list + interaction history
  • Action space: query_ddi(drug_i, drug_j) | propose_intervention(target, type) | finish_review
  • Reward: Shaped, dense signal at every step (not sparse end-of-episode), strictly in the range (0.001, 0.999). Queries have a small cost, but discovering severe DDIs earns a larger bonus. Successful interventions earn proportional risk reduction. Invalid actions and timeouts are penalized but all values are clamped to positive. finish_review triggers a grader returning a terminal score in (0.001, 0.999).
  • Constraint: Finite query and intervention budgets, creating a resource-allocation optimization problem.

This MDP is what makes the problem fundamentally different from static risk scoring: the agent must decide what information to acquire (which drug pairs to query) and which interventions to prioritize, all under budget constraints — a sequential decision problem that RL is designed to solve.
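
For intuition, the sketch below shows the per-step decision this MDP poses. It is a toy budget-aware policy, not one of the shipped agents; the observation keys follow the schema documented under "Observations", while the nested record fields (`drug_id`, `severity`) are illustrative assumptions.

```python
from itertools import combinations

def choose_action(obs: dict) -> dict:
    """Toy policy: query first, then fix the worst finding, then finish."""
    meds = [m["drug_id"] for m in obs["current_medications"]]  # field name assumed
    queried = {(q["drug_id_1"], q["drug_id_2"]) for q in obs["interaction_queries"]}
    unqueried = [p for p in combinations(meds, 2) if p not in queried]

    # Spend query budget first: information is cheap relative to a bad intervention.
    if obs["remaining_query_budget"] > 0 and unqueried:
        d1, d2 = unqueried[0]
        return {"action_type": "query_ddi", "drug_id_1": d1, "drug_id_2": d2}

    # Then intervene on a confirmed severe interaction.
    severe = [q for q in obs["interaction_queries"] if q.get("severity") == "severe"]
    if obs["remaining_intervention_budget"] > 0 and severe:
        return {"action_type": "propose_intervention",
                "target_drug_id": severe[0]["drug_id_1"],
                "intervention_type": "stop"}

    return {"action_type": "finish_review"}
```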

### OpenEnv Interface

PolypharmacyEnv implements the full OpenEnv standard:

  • reset() β€” Generates a new patient scenario (age, conditions, medication list)
  • step(action) β€” Processes an agent action, updates regimen state, returns shaped reward
  • state() β€” Returns the current episode snapshot

All payloads are typed Pydantic models extending OpenEnv base types (`PolypharmacyAction`, `PolypharmacyObservation`, `PolypharmacyState`), sketched below.
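
As a rough illustration of those typed models, the sketch below mirrors the fields documented in the action table further down; the real classes live in `backend/src/polypharmacy_env/models.py`, so treat the exact field names and defaults here as assumptions.

```python
from typing import Optional
from pydantic import BaseModel

# Illustrative shape only; the shipped class extends OpenEnv's base Action type.
class PolypharmacyAction(BaseModel):
    action_type: str                          # "query_ddi" | "propose_intervention" | "finish_review"
    drug_id_1: Optional[str] = None           # used by query_ddi
    drug_id_2: Optional[str] = None
    target_drug_id: Optional[str] = None      # used by propose_intervention
    intervention_type: Optional[str] = None   # "stop" | "dose_reduce" | "substitute" | "add_monitoring"
    proposed_new_drug_id: Optional[str] = None
    rationale: Optional[str] = None

action = PolypharmacyAction(action_type="query_ddi",
                            drug_id_1="warfarin", drug_id_2="aspirin")
```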

### What the Environment Enables

The shaped reward function provides continuous signal over the full trajectory, making this environment compatible with standard RL training approaches:

  • Policy gradient methods (REINFORCE, PPO, GRPO): The per-step reward signal allows policy networks to learn query prioritization and intervention strategies.
  • OpenEnv training pipeline: Through OpenEnv's step()/reset() HTTP interface, external RL training loops can connect to this environment and train policies without modification.
  • Neural Bandits (OptimNeuralTS): The budget-constrained query selection implements the OptimNeuralTS approach from the reference paper β€” Neural Thompson Sampling combined with Differential Evolution for efficient search.

### Included Agents

The repository ships with multiple agent implementations spanning rule-based, RL-trained, bandit-based, and LLM-based approaches:

  • OptimNeuralTS bandit (train_bandit.py, neural_bandits.py): Implements the paper's core algorithm β€” Neural Thompson Sampling with Differential Evolution to efficiently search for dangerous drug combinations. Builds an ensemble of models across training steps for high-precision predictions.
  • REINFORCE-trained policy (train_rl.py): A neural network policy trained via REINFORCE with learned baseline against the environment's shaped reward. Demonstrates that the MDP formulation and reward shaping enable genuine policy improvement through RL training.
  • Heuristic agent (baselines/heuristic_agent.py): Deterministic rule-based strategy that queries high-risk drug pairs first, then intervenes on severe DDIs. Serves as a strong domain-knowledge baseline.
  • LLM agent (inference.py): Uses an LLM (Qwen2.5-72B via OpenAI-compatible API) for zero-shot action generation. Demonstrates baseline LLM performance without RL fine-tuning.
  • AI suggestion endpoint (/agent/suggest): LLM-powered action suggestions with rule-based guardrails for the interactive UI.

## Repository Structure

```
├── backend/
│   ├── main.py                       # ASGI entrypoint (uvicorn target)
│   ├── requirements.txt              # Python dependencies
│   └── src/polypharmacy_env/
│       ├── env_core.py               # OpenEnv environment: reset/step/state
│       ├── models.py                 # Typed Pydantic models (Action, Observation, State)
│       ├── rewards.py                # Shaped reward function & regimen risk computation
│       ├── graders.py                # Deterministic graders for 3 task difficulties
│       ├── tasks.py                  # Task configuration & episode sampling
│       ├── config.py                 # Reward hyperparameters & task parameters
│       ├── data_loader.py            # CSV data loading with caching
│       ├── ddi_simulator.py          # DDI lookup, Beers flags, drug substitution
│       ├── neural_bandits.py         # NeuralTS + Differential Evolution + OptimNeuralTS
│       ├── api/
│       │   ├── app.py                # FastAPI app factory via OpenEnv create_app
│       │   └── routes/
│       │       ├── agent.py          # POST /agent/suggest (AI-assisted actions)
│       │       └── bandit.py         # POST /bandit/predict, /bandit/screen
│       ├── baselines/
│       │   ├── heuristic_agent.py    # Deterministic baseline agent
│       │   └── random_agent.py       # Random baseline agent
│       ├── services/
│       │   └── groq_agent.py         # LLM-powered action suggestions
│       └── tests/
│           ├── test_env_core.py      # Environment unit tests
│           └── test_api.py           # HTTP + WebSocket integration tests
├── frontend/
│   ├── src/
│   │   ├── App.jsx                   # React control center UI
│   │   └── styles.css                # Production-quality dark theme
│   ├── package.json
│   └── vite.config.js
├── data/
│   ├── lookups/                      # drug_metadata.csv, ddi_rules.csv, beers_criteria.csv
│   └── processed/                    # patients_polypharmacy.csv (120 episodes)
├── scripts/
│   ├── preprocess_data.py            # Synthetic data generation
│   ├── dev_backend.sh                # Local backend runner
│   ├── dev_frontend.sh               # Local frontend runner
│   └── run_validation.sh             # Automated test + baseline validation
├── Dockerfile                        # Production multi-stage build (frontend + backend)
├── docker-compose.yml                # Development orchestration
├── inference.py                      # Submission baseline inference script
├── train_rl.py                       # REINFORCE RL training script (PyTorch)
├── train_bandit.py                   # OptimNeuralTS neural bandit training
├── openenv.yaml                      # OpenEnv manifest
└── .env.example                      # Environment variable template
```

## Action & Observation Spaces

### Actions

| Action Type | Parameters | Description |
|---|---|---|
| `query_ddi` | `drug_id_1`, `drug_id_2` | Check a drug pair for interactions. Returns severity, recommendation, and risk score. Costs 1 query budget. |
| `propose_intervention` | `target_drug_id`, `intervention_type`, `proposed_new_drug_id` (optional), `rationale` (optional) | Modify the medication regimen. Types: `stop`, `dose_reduce`, `substitute`, `add_monitoring`. Costs 1 intervention budget. |
| `finish_review` | — | End the episode. Triggers grader evaluation and returns the final score. |

### Observations

Each observation contains the full patient context:

| Field | Type | Description |
|---|---|---|
| `episode_id` | string | Unique episode identifier |
| `task_id` | string | Current task (`easy_screening` / `budgeted_screening` / `complex_tradeoff`) |
| `age`, `sex` | int, string | Patient demographics |
| `conditions` | list[string] | Active medical conditions |
| `eGFR_category`, `liver_function_category` | string | Organ function status |
| `current_medications` | list[MedicationEntry] | Active drugs with dose, ATC class, Beers flags |
| `interaction_queries` | list[InteractionQueryRecord] | History of DDI queries and results |
| `interventions` | list[InterventionRecord] | History of proposed interventions |
| `remaining_query_budget` | int | Remaining DDI query budget |
| `remaining_intervention_budget` | int | Remaining intervention budget |
| `shaped_reward` | float | Step reward signal |
| `done` | bool | Whether the episode has ended |
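
For concreteness, a freshly reset `budgeted_screening` observation might look roughly like this (values and nested record fields are illustrative, not taken from the dataset):

```python
observation = {
    "episode_id": "ep_0042",
    "task_id": "budgeted_screening",
    "age": 78,
    "sex": "F",
    "conditions": ["atrial fibrillation", "type 2 diabetes"],
    "eGFR_category": "moderate",
    "liver_function_category": "normal",
    "current_medications": [
        {"drug_id": "warfarin", "dose": "5 mg", "atc_class": "B01AA03", "beers_flag": False},
        # ... remaining drugs in the regimen
    ],
    "interaction_queries": [],       # fills in as query budget is spent
    "interventions": [],
    "remaining_query_budget": 8,     # budgeted_screening starts with 8 queries
    "remaining_intervention_budget": 3,
    "shaped_reward": 0.001,
    "done": False,
}
```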

## Tasks & Difficulty Progression

| Task | Difficulty | Drugs | Query Budget | Intervention Budget | Max Steps | Description |
|---|---|---|---|---|---|---|
| Easy Screening | Easy | 3–5 | 4 | 2 | 10 | Small regimen with one severe DDI. Identify and resolve it. |
| Budgeted Screening | Medium | 6–10 | 8 | 3 | 20 | Multiple DDIs and Beers issues under tighter budgets. Must prioritize effectively. |
| Complex Tradeoff | Hard | 10–15 | 12 | 5 | 30 | Large regimen with critical drugs (warfarin, insulin). Balance risk reduction against regimen disruption. |

### Grading Criteria

  • Easy: 50% risk reduction + 50% targeted intervention on severe DDI drugs
  • Medium: 50% risk reduction + 30% intervention precision + 20% query efficiency
  • Hard: Risk reduction minus penalties for excessive drug changes and stopping critical medications without substitution

All graders are deterministic, producing scores strictly in (0.001, 0.999).
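
As a sketch of how the weighting above could compose into a score, here is the easy grader's shape; the real implementation lives in `graders.py`, so the input definitions are assumptions:

```python
def grade_easy(risk_reduction: float, targeted_severe_ddi: bool) -> float:
    """Sketch: 50% risk reduction + 50% targeted intervention on severe-DDI drugs.

    risk_reduction: fraction of regimen risk removed, assumed in [0, 1].
    targeted_severe_ddi: whether an intervention hit a severe-DDI drug.
    """
    raw = 0.5 * risk_reduction + 0.5 * (1.0 if targeted_severe_ddi else 0.0)
    # Grader scores stay strictly inside (0.001, 0.999).
    return min(max(raw, 0.001), 0.999)
```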


## Reward Function Design

The shaped reward provides signal at every step (not just episode end). All rewards are strictly positive, clamped to the range (0.001, 0.999):

| Event | Raw Signal | Clamped Output |
|---|---|---|
| DDI query (no finding) | small cost | 0.001 (floor) |
| Discovering a severe DDI | cost + bonus | ~0.035 |
| Discovering a moderate DDI | cost + bonus | ~0.005 |
| Successful intervention | risk_reduction − cost | proportional to risk improvement |
| Invalid action | penalty | 0.001 (floor) |
| Episode timeout | penalty | 0.001 (floor) |
| Finish review | grader_score | 0.001–0.999 |

Regimen risk aggregates DDI pairwise scores, Beers-criteria violation weights, and high-risk elderly drug penalties, normalized by regimen size and clipped to [0.0, 1.0].
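
A minimal sketch of the two conventions just described; the actual shaping terms and weights live in `rewards.py` and `config.py`, and the exact normalization is an assumption here:

```python
REWARD_FLOOR, REWARD_CEIL = 0.001, 0.999

def clamp_reward(raw: float) -> float:
    # Every per-step signal, penalties included, lands in (0.001, 0.999),
    # so the environment never emits zero or negative rewards.
    return min(max(raw, REWARD_FLOOR), REWARD_CEIL)

def regimen_risk(ddi_scores, beers_weights, high_risk_penalties, n_drugs):
    # Aggregate pairwise DDI scores, Beers violation weights, and high-risk
    # elderly-drug penalties, normalized by regimen size and clipped to [0, 1].
    total = sum(ddi_scores) + sum(beers_weights) + sum(high_risk_penalties)
    return min(max(total / max(n_drugs, 1), 0.0), 1.0)
```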


## Prerequisites

  • Python 3.10+
  • Node.js 18+ (20+ recommended)
  • Docker + Docker Compose (for containerized runs)

## Setup & Local Development

### 1. Clone and configure

```bash
git clone <repo-url>
cd PolypharmacyEnv
cp .env.example .env
# Edit .env with your API keys if using the AI suggestion feature
```

### 2. Install dependencies

```bash
# Backend
pip install -r backend/requirements.txt

# Frontend
cd frontend && npm install && cd ..
```

### 3. Generate synthetic data (if not already present)

```bash
python scripts/preprocess_data.py
```

### 4. Start services

Terminal 1 — backend (port 7860):

```bash
./scripts/dev_backend.sh
```

Terminal 2 — frontend (port 5173):

```bash
./scripts/dev_frontend.sh
```

### 5. Open the application

Open http://localhost:5173 in your browser (the backend API listens on port 7860).


## Docker Deployment

### Build and run (single container — production mode)

```bash
docker build -t polypharmacy-env .
docker run -p 7860:7860 polypharmacy-env
```

The UI and API are both served from port 7860.

### Development mode (separate services)

```bash
docker compose up --build
```

- Backend: port 7860
- Frontend: port 5173

## Hugging Face Spaces Deployment

### 1. Create a new Space

Create the Space with the Docker SDK; the repository ships its own Dockerfile.

### 2. Set secrets and variables

In Space Settings → Variables and Secrets:

| Type | Key | Value |
|---|---|---|
| Secret | `HF_TOKEN` | Your Hugging Face API token |
| Variable | `API_BASE_URL` | `https://router.huggingface.co/v1` |
| Variable | `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` |

### 3. Push the repository to the Space

```bash
git remote add space https://huggingface.co/spaces/<your-username>/<space-name>
git push space master
```

### 4. Verify

- The Space root URL loads the React UI
- `/health` returns a healthy status
- `/reset`, `/step`, and `/state` respond to API calls

## API Reference

### OpenEnv Endpoints

| Method | Path | Description |
|---|---|---|
| POST | `/reset` | Start a new episode. Body: `{ "task_id": "easy_screening" }` |
| POST | `/step` | Execute an action. Body: `{ "action": { "action_type": "query_ddi", ... } }` |
| GET | `/state` | Get the current episode state |
| GET | `/health` | Health check |
| GET | `/schema` | Action/observation schema |
| WS | `/ws` | WebSocket for stateful multi-step sessions |
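
A minimal client round-trip against these endpoints, assuming the server is running locally on port 7860 (the drug identifiers in the `query_ddi` body are illustrative):

```python
import requests

BASE = "http://localhost:7860"

# Start an episode on the easy task.
obs = requests.post(f"{BASE}/reset", json={"task_id": "easy_screening"}).json()

# Spend one unit of query budget on a drug pair.
step = requests.post(f"{BASE}/step", json={
    "action": {"action_type": "query_ddi",
               "drug_id_1": "warfarin", "drug_id_2": "aspirin"},
}).json()

# Inspect the episode snapshot, then end the review for a graded terminal score.
snapshot = requests.get(f"{BASE}/state").json()
final = requests.post(f"{BASE}/step",
                      json={"action": {"action_type": "finish_review"}}).json()
```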

### Additional Endpoints

| Method | Path | Description |
|---|---|---|
| POST | `/agent/suggest` | AI-powered action suggestion. Body: `{ "observation": {...} }` |

## Running the Baseline Inference

```bash
# Set required environment variables
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token"

# Start the environment server (in another terminal)
./scripts/dev_backend.sh

# Run inference
python inference.py
```

The inference script runs all 3 tasks and emits structured [START], [STEP], [END] logs for the evaluator.


## RL Training (REINFORCE with Learned Baseline)

The repository includes train_rl.py β€” a complete REINFORCE policy gradient training loop that trains a neural network policy directly against the environment's shaped reward signal.

### How It Works

| Component | Description |
|---|---|
| State encoder | 16-dimensional feature vector: medication count, high-risk drug count, Beers-flagged drugs, budget utilization, query outcomes (severe/moderate fractions), step progress, pair coverage |
| Policy network | 3-layer MLP (16 → 128 → 128 → 166) with ReLU; outputs masked logits over the discrete action space |
| Value baseline | 3-layer MLP (16 → 128 → 64 → 1) trained with MSE against discounted returns |
| Action space | 166 discrete actions: 105 `query_ddi` pairs (C(15, 2)), 60 interventions (4 types × 15 slots), 1 `finish_review` |
| Action masking | Invalid actions (exhausted budgets, already-queried pairs, empty drug slots) are masked to −inf before the softmax (see the sketch below) |
| Optimization | REINFORCE with advantage (return − baseline), entropy bonus for exploration, gradient clipping |
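
The masking step is worth seeing in code. Below is a sketch of the idea as described above; the mask construction in `train_rl.py` is more involved, but the shapes follow the 166-action layout:

```python
import torch

def masked_action_distribution(logits: torch.Tensor,
                               valid_mask: torch.Tensor) -> torch.distributions.Categorical:
    """logits: (166,) raw policy outputs; valid_mask: (166,) bool, True = legal.

    Illegal actions (exhausted budgets, already-queried pairs, empty drug
    slots) get -inf logits, so the softmax assigns them zero probability.
    """
    masked = logits.masked_fill(~valid_mask, float("-inf"))
    return torch.distributions.Categorical(logits=masked)

# Example: only finish_review (the last index) is still legal.
logits = torch.randn(166)
mask = torch.zeros(166, dtype=torch.bool)
mask[-1] = True
action = masked_action_distribution(logits, mask).sample()  # always index 165
```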

### Training

```bash
# Install PyTorch (CPU is sufficient)
pip install torch --index-url https://download.pytorch.org/whl/cpu

# Train on the easy task (fast, ~30 s)
python train_rl.py --task easy_screening --episodes 200

# Train on the medium task
python train_rl.py --task budgeted_screening --episodes 500

# Train on the hard task (longer episodes)
python train_rl.py --task complex_tradeoff --episodes 500 --batch-size 10

# Full options
python train_rl.py --task easy_screening --episodes 200 \
  --lr 0.0003 --gamma 0.99 --entropy-coeff 0.02 \
  --hidden-dim 128 --batch-size 5 --print-every 10
```

Outputs:

- Policy checkpoints: `backend/src/polypharmacy_env/checkpoints/best_{task}.pt` and `final_{task}.pt`
- Training metrics: `training_metrics.json` (per-episode rewards, grader scores, losses)

### Observed Training Results

| Task | Episodes | Greedy Eval (Grader Score) | Stochastic Eval |
|---|---|---|---|
| Easy Screening | 200 | 0.698 | 0.475 |
| Budgeted Screening | 200 | 0.195 | 0.170 |
| Complex Tradeoff | 200 | 0.040 | 0.035 |

The easy task shows clear policy improvement. Medium and hard tasks benefit from more episodes (500+) and hyperparameter tuning — the larger action spaces and longer episodes create a harder credit-assignment problem, exactly as designed.

### Integration with OpenEnv Training Pipeline

For production-scale training, this environment is compatible with TRL's GRPOTrainer via OpenEnv's standard interface:

```python
# Conceptual integration with TRL GRPO
from trl import GRPOTrainer
from openenv import GenericEnvClient

def rollout_func(prompts, trainer):
    env = GenericEnvClient("ws://localhost:7860/ws")
    # ... collect trajectories with token-level logprobs
    # ... return prompt_ids, completion_ids, logprobs, rewards

trainer = GRPOTrainer(model, rollout_function=rollout_func, ...)
trainer.train()
```

The included train_rl.py demonstrates the core RL loop with a lightweight MLP policy. For LLM-based policies, connect TRL/veRL/SkyRL to this environment via the WebSocket or HTTP interface.


## Neural Bandit Training (OptimNeuralTS)

The repository implements the OptimNeuralTS algorithm from the reference paper. This combines Neural Thompson Sampling with Differential Evolution to efficiently search for dangerous drug combinations in a large combinatorial space.

### How OptimNeuralTS Works

| Phase | What Happens |
|---|---|
| Warm-up | Randomly sample drug combinations and observe their risk scores to initialize the model's understanding |
| Neural Thompson Sampling | A neural network predicts risk for any drug combination, while gradient-based uncertainty drives exploration toward combinations that could be dangerous |
| Differential Evolution | Evolves a population of candidate drug combinations, guided by the neural network, to propose new combinations worth investigating |
| Nearest-neighbor mapping | Since DE can suggest combinations not in the dataset, we map to the closest real combination using Hamming distance (sketched below) |
| Ensemble building | Each training step saves a model snapshot; the final ensemble combines all snapshots for high-precision predictions |
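
The nearest-neighbor mapping step is simple enough to sketch directly; over multi-hot binary vectors, Hamming distance is just the count of differing bits (the dataset layout here is an assumption):

```python
import numpy as np

def nearest_real_combination(candidate: np.ndarray, dataset: np.ndarray) -> np.ndarray:
    """Map a DE-proposed multi-hot drug vector to the closest observed one.

    candidate: (n_drugs,) 0/1 vector; dataset: (n_rows, n_drugs) 0/1 matrix.
    """
    hamming = (dataset != candidate).sum(axis=1)  # differing bits per row
    return dataset[int(np.argmin(hamming))]
```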

### Key Components (in `neural_bandits.py`)

| Component | Description |
|---|---|
| `RewardNetwork` | Neural network that predicts the Relative Risk (RR) for a multi-hot drug-combination vector |
| `NeuralTS` | Thompson Sampling agent using gradient-based uncertainty: `s_t(x) = sqrt(λ · g(x)^T · U^{-1} · g(x))` |
| `differential_evolution()` | DE best/1/bin optimization over the multi-hot feature space |
| `OptimNeuralTS` | Full pipeline: warm-up → NeuralTS + DE exploration → ensemble building |
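
The uncertainty term from the `NeuralTS` row can be sketched as follows. This is a simplified dense-matrix version: it assumes the model maps one combination to a scalar risk and that `U_inv` is maintained elsewhere, whereas the shipped implementation may approximate `U` (e.g., diagonally) for speed:

```python
import torch

def neural_ts_uncertainty(model: torch.nn.Module, x: torch.Tensor,
                          U_inv: torch.Tensor, lam: float) -> torch.Tensor:
    """s_t(x) = sqrt(lam * g(x)^T U^{-1} g(x)), with g(x) the gradient of the
    predicted reward w.r.t. the flattened network parameters."""
    pred = model(x).squeeze()                            # scalar risk prediction
    grads = torch.autograd.grad(pred, list(model.parameters()))
    g = torch.cat([gr.reshape(-1) for gr in grads])      # flatten all param grads
    return torch.sqrt(lam * g @ U_inv @ g)

# Thompson Sampling then draws a score per candidate, e.g. from
# Normal(pred, s_t(x)), and explores the combination with the highest draw.
```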

### Training

```bash
# Quick run (small dataset, fast)
python train_bandit.py --total-steps 500 --warmup-steps 100

# Full training (closer to paper settings)
python train_bandit.py --total-steps 3000 --warmup-steps 500 --n-combinations 10000

# Custom DE parameters
python train_bandit.py --de-population 32 --de-steps 16 --de-crossover 0.9

# All options
python train_bandit.py --help
```

Outputs:

- Ensemble model: `backend/src/polypharmacy_env/checkpoints/bandit_ensemble.pt`
- Training metrics: `bandit_metrics.json` (precision, recall, and patterns detected at each eval step)

### API Endpoints

The trained ensemble is also accessible via API:

| Method | Path | Description |
|---|---|---|
| POST | `/bandit/predict` | Predict risk for a single drug combination |
| POST | `/bandit/screen` | Screen multiple combinations in bulk |
| GET | `/bandit/metrics` | Get current bandit training metrics |

## Testing & Validation

```bash
# Unit tests
python -m pytest backend/src/polypharmacy_env/tests -v

# Full validation (tests + heuristic baseline)
./scripts/run_validation.sh

# OpenEnv spec validation
openenv validate
```

## Data Sources & Future Plans

### Current Implementation

  • Drug interaction data: Currently extracted from curated clinical databases and research literature, generating 24 DDI pairs across 33 drugs, 15 Beers criteria entries, and 120 patient episodes across 3 difficulty levels. Data is stored as CSV for deterministic, reproducible evaluation.
  • RL training: A lightweight REINFORCE policy gradient training loop (train_rl.py) trains a neural network policy (MLP) directly against the environment's shaped reward signal. This validates the MDP formulation and demonstrates that the reward shaping enables genuine policy improvement. The trained policy achieves a 0.698 grader score on easy screening after 200 episodes.

### Planned Enhancements

  • Full-scale GRPO training on GPU: We are provisioning AWS GPU resources (A100/H100 instances) to run full-scale GRPO (Group Relative Policy Optimization) training using TRL's GRPOTrainer with LLM-based policies. This will train language models to generate optimal clinical actions by collecting batched rollouts against the environment and computing policy gradient updates on token-level log-probabilities. The OpenEnv WebSocket interface enables high-throughput parallel rollout collection needed for efficient GRPO training.
  • LLM fine-tuning via OpenEnv training pipeline: Integrate with TRL, veRL, and SkyRL frameworks to fine-tune open-weight LLMs (Llama 3, Qwen 2.5) using the environment's shaped reward as the RL training signal, producing specialized clinical pharmacist agents.
  • Live drug database integration: Connect directly to established drug interaction databases (DrugBank, RxNorm, FDA Adverse Event Reporting System) for real-time DDI lookup instead of static CSV files, enabling the environment to scale to thousands of drug combinations.
  • EHR integration pipeline: Develop FHIR-compatible data ingestion so the environment can accept de-identified electronic health record data, making it applicable to real hospital deployments.
  • Multi-agent training: Extend the environment to support multi-agent scenarios where specialist agents (cardiologist, endocrinologist, etc.) must coordinate on a shared patient regimen.
  • Pharmacogenomics layer: Incorporate genetic variant data (CYP450 metabolizer status) to personalize drug interaction severity, adding a pharmacogenomics dimension to the RL training signal.

## Architecture & Design Decisions

  • OpenEnv compliance: Full typed Pydantic models for Action, Observation, and State. Environment extends openenv.core.env_server.interfaces.Environment.
  • Shaped rewards: Continuous reward signal at every step to enable efficient RL training (not sparse end-of-episode only).
  • Budget constraints: Query and intervention budgets create a resource-allocation problem that makes the RL optimization non-trivial.
  • Critical drug handling: The hard task penalizes stopping critical medications (warfarin, insulin, etc.) without substitution, teaching the agent about real-world clinical constraints.
  • Deterministic graders: All graders produce reproducible scores for consistent evaluation.

## Troubleshooting

| Issue | Solution |
|---|---|
| `ModuleNotFoundError: polypharmacy_env` | Start the backend via `./scripts/dev_backend.sh` from the repo root |
| `/agent/suggest` returns errors | Check `.env` for valid API keys, then restart the backend |
| UI shows stale data | Hard-refresh the browser (Ctrl+Shift+R) and click Reset Episode |
| Docker build fails | Ensure Docker has at least 4 GB of memory allocated |
| WebSocket connection refused | Verify the backend is running on port 7860 |

## License

MIT