SepsisPilot - OpenEnv
Reinforcement learning environment for optimal sepsis treatment sequencing
Meta PyTorch OpenEnv Hackathon 2026 Submission
Environment Description & Motivation
Sepsis kills ~11 million people per year, yet optimal treatment sequencing remains one of the hardest challenges in critical care. The right antibiotic at the right time, combined with precise vasopressor dosing, can mean the difference between survival and multi-organ failure.
SepsisPilot simulates an ICU sepsis patient at hourly resolution. An AI agent observes real clinical vitals (MAP, lactate, WBC, temperature, heart rate, creatinine) and decides which antibiotic and vasopressor combination to administer each hour. The environment models realistic physiology including:
- Gram-specific antibiotic efficacy: broad-spectrum covers gram-negative; narrow-spectrum (vancomycin) covers gram-positive
- Antibiotic resistance accumulation: repeated suboptimal antibiotic use degrades efficacy
- Haemodynamic-metabolic coupling: low MAP causes tissue ischaemia (rising lactate) and compensatory tachycardia
- Renal vasoconstriction: high-dose vasopressors raise MAP but risk acute kidney injury
This is a real clinical problem with life-or-death stakes, well-defined physiology, meaningful partial progress signals, and a genuinely hard exploration challenge, which makes it a strong fit for an RL environment.
Action Space
| ID | Name | Description |
|---|---|---|
| 0 | no_treatment | Watchful waiting, no intervention |
| 1 | broad_antibiotics | Piperacillin-tazobactam, gram-negative coverage |
| 2 | narrow_antibiotics | Vancomycin, gram-positive / MRSA coverage |
| 3 | low_vasopressor | Norepinephrine 0.1 mcg/kg/min, raises MAP |
| 4 | high_vasopressor | Norepinephrine 0.3 mcg/kg/min, raises MAP more, with higher renal risk |
| 5 | broad_plus_low_vaso | Broad-spectrum AB + low-dose vasopressor |
| 6 | broad_plus_high_vaso | Broad-spectrum AB + high-dose vasopressor |
| 7 | narrow_plus_low_vaso | Narrow-spectrum AB + low-dose vasopressor |
| 8 | narrow_plus_high_vaso | Narrow-spectrum AB + high-dose vasopressor |
Type: Discrete · n: 9
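Each of the nine actions is a combination of an antibiotic choice and a vasopressor dose. The sketch below spells out that decomposition using the IDs and names from the table above. The `Action` enum and the `decode` helper are illustrative names chosen for this example, not identifiers confirmed to exist in the SepsisPilot code.

```python
from enum import IntEnum
from typing import Dict, Optional, Tuple


class Action(IntEnum):
    """Discrete action IDs, copied from the action table."""
    NO_TREATMENT = 0
    BROAD_ANTIBIOTICS = 1
    NARROW_ANTIBIOTICS = 2
    LOW_VASOPRESSOR = 3
    HIGH_VASOPRESSOR = 4
    BROAD_PLUS_LOW_VASO = 5
    BROAD_PLUS_HIGH_VASO = 6
    NARROW_PLUS_LOW_VASO = 7
    NARROW_PLUS_HIGH_VASO = 8


# (antibiotic, vasopressor) for each action; None means "not given".
_COMPONENTS: Dict[Action, Tuple[Optional[str], Optional[str]]] = {
    Action.NO_TREATMENT: (None, None),
    Action.BROAD_ANTIBIOTICS: ("broad", None),
    Action.NARROW_ANTIBIOTICS: ("narrow", None),
    Action.LOW_VASOPRESSOR: (None, "low"),
    Action.HIGH_VASOPRESSOR: (None, "high"),
    Action.BROAD_PLUS_LOW_VASO: ("broad", "low"),
    Action.BROAD_PLUS_HIGH_VASO: ("broad", "high"),
    Action.NARROW_PLUS_LOW_VASO: ("narrow", "low"),
    Action.NARROW_PLUS_HIGH_VASO: ("narrow", "high"),
}


def decode(action_id: int) -> Tuple[Optional[str], Optional[str]]:
    """Split an action ID into its (antibiotic, vasopressor) components."""
    return _COMPONENTS[Action(action_id)]
```

For example, `decode(5)` yields the broad-spectrum antibiotic paired with the low vasopressor dose, and an ID outside 0 to 8 raises `ValueError`.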
Observation Space
| Field | Unit | Normal Range | Clinical Meaning |
|---|---|---|---|
| map_mmhg | mmHg | 70–100 | Mean Arterial Pressure; sepsis goal ≥ 65 |
| lactate | mmol/L | 0.5–2.0 | Tissue ischaemia marker; target < 2.0 |
| wbc | k/uL | 4–11 | White blood cells; infection proxy |
| temperature | °C | 36.5–37.5 | Fever indicates active infection |
| heart_rate | bpm | 60–100 | Tachycardia in sepsis |
| creatinine | mg/dL | 0.6–1.2 | Renal function; rises with AKI |
| sofa_score | 0–24 | 0–2 | Multi-organ failure composite |
| resistance | 0–1 | 0.0 | Antibiotic resistance index (hard task) |
| step_fraction | 0–1 | - | Fraction of episode elapsed |
Type: Continuous · Shape: [9]
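One way to hold these nine fields in client code is a small dataclass that can be flattened into a length-9 vector and checked against the normal ranges in the table. This is a sketch under assumptions: the `Observation` class, its method names, and the field order of the vector (taken from the table's row order) are illustrative and may not match `environment/models.py`.

```python
from dataclasses import astuple, dataclass
from typing import List

# Normal ranges copied from the observation table: (low, high).
NORMAL_RANGES = {
    "map_mmhg": (70.0, 100.0),
    "lactate": (0.5, 2.0),
    "wbc": (4.0, 11.0),
    "temperature": (36.5, 37.5),
    "heart_rate": (60.0, 100.0),
    "creatinine": (0.6, 1.2),
    "sofa_score": (0.0, 2.0),
}


@dataclass
class Observation:
    # ASSUMPTION: field order follows the table rows.
    map_mmhg: float
    lactate: float
    wbc: float
    temperature: float
    heart_rate: float
    creatinine: float
    sofa_score: float
    resistance: float
    step_fraction: float

    def to_vector(self) -> List[float]:
        """Flatten the observation into a list of 9 floats."""
        return [float(v) for v in astuple(self)]

    def abnormal_fields(self) -> List[str]:
        """Names of vitals that fall outside their normal range."""
        return [
            name
            for name, (low, high) in NORMAL_RANGES.items()
            if not low <= getattr(self, name) <= high
        ]
```

An agent can pass `to_vector()` to a policy network, while `abnormal_fields()` gives a compact summary for a text prompt.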
Task Descriptions
Task 1: mild_sepsis · Easy
- Scenario: Mild sepsis secondary to gram-negative urinary tract infection
- Initial state: MAP 65, Lactate 2.5, WBC 14, Temp 38.2°C
- Optimal strategy: Broad-spectrum antibiotics; vasopressors only if MAP drops below 65
- Max steps: 24 (24 hours)
- Expected baseline score: 0.55β0.75
- Key challenge: Learning that broad-spectrum is the right antibiotic class
Task 2: septic_shock · Medium
- Scenario: Septic shock from gram-positive bacteraemia (MRSA suspected)
- Initial state: MAP 52 (low), Lactate 4.2, WBC 18, Temp 38.9°C
- Optimal strategy: Immediate vasopressors + narrow-spectrum antibiotics (vancomycin). Every delayed hour increases organ failure risk.
- Max steps: 48 (48 hours)
- Expected baseline score: 0.35β0.60
- Key challenge: Correctly identifying gram-positive infection; mandatory haemodynamic support
Task 3: severe_mods · Hard
- Scenario: Severe sepsis with Multi-Organ Dysfunction Syndrome (MODS). Mixed drug-resistant infection.
- Initial state: MAP 42 (very low), Lactate 7.0, WBC 22, Temp 39.6°C, Creatinine 2.2
- Optimal strategy: Broad-spectrum first (2 steps) → switch to narrow-spectrum → maintain MAP ≥ 65 with lowest effective vasopressor dose
- Max steps: 72 (72 hours)
- Expected baseline score: 0.20β0.45
- Key challenge: Precise antibiotic sequencing to manage resistance; renal protection; multi-objective optimisation
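The optimal strategy described for the hard task can be written as a simple rule-based policy, which is useful as a sanity-check baseline. The sketch below is an assumption-laden illustration: `severe_mods_policy` is a hypothetical helper, steps are assumed to be zero-indexed, and the 55 mmHg cut-off between low and high vasopressor dose is an arbitrary placeholder, not a value from the environment.

```python
from typing import Optional

# Action IDs from the action table, keyed by (antibiotic, vasopressor).
ACTION_IDS = {
    ("broad", None): 1,
    ("broad", "low"): 5,
    ("broad", "high"): 6,
    ("narrow", None): 2,
    ("narrow", "low"): 7,
    ("narrow", "high"): 8,
}


def severe_mods_policy(step: int, map_mmhg: float,
                       high_dose_below: float = 55.0) -> int:
    """Broad-spectrum for the first 2 steps, then narrow-spectrum,
    with the lowest vasopressor dose that the current MAP allows."""
    antibiotic = "broad" if step < 2 else "narrow"

    vaso: Optional[str]
    if map_mmhg >= 65.0:
        vaso = None      # MAP goal met: no vasopressor
    elif map_mmhg >= high_dose_below:
        vaso = "low"     # mildly hypotensive: low dose
    else:
        vaso = "high"    # ASSUMPTION: escalate below the placeholder cut-off
    return ACTION_IDS[(antibiotic, vaso)]
```

Withholding the vasopressor once MAP reaches 65 reflects the renal-protection goal; whether the simulator rewards that exact rule is something to verify empirically.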
Reward Function
Dense rewards at every timestep (not just episode end):
Per step:
+0.35 MAP ≥ 65 mmHg (haemodynamic stability)
+0.30 Lactate < 2.0 mmol/L (tissue perfusion restored)
+0.10 WBC in 4–12 k/uL (infection controlled)
+0.08 Temperature 36–38°C (fever resolved)
+0.05 Creatinine improving (renal protection)
-0.15 Resistance increasing (wrong antibiotic penalty)
-0.025 Per step (time pressure)
Terminal:
+5.00 All vitals stable (full stabilisation bonus)
-8.00 Patient death (MAP < 35 or Lactate > 15)
Range: approximately -8.0 to +5.775 per step.
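The per-step terms can be read as a sum of threshold checks. The sketch below restates the listed weights in code. It is not the environment's implementation: the function names are illustrative, and "creatinine improving" and "resistance increasing" are assumed to mean a negative or positive change since the previous step.

```python
def step_reward(map_mmhg: float, lactate: float, wbc: float,
                temperature: float, creatinine_delta: float,
                resistance_delta: float) -> float:
    """Dense per-step reward using the weights listed above.

    ASSUMPTION: the *_delta arguments are (current - previous) values.
    """
    reward = -0.025                      # time pressure, applied every step
    if map_mmhg >= 65.0:
        reward += 0.35                   # haemodynamic stability
    if lactate < 2.0:
        reward += 0.30                   # tissue perfusion restored
    if 4.0 <= wbc <= 12.0:
        reward += 0.10                   # infection controlled
    if 36.0 <= temperature <= 38.0:
        reward += 0.08                   # fever resolved
    if creatinine_delta < 0.0:
        reward += 0.05                   # renal function improving
    if resistance_delta > 0.0:
        reward -= 0.15                   # wrong-antibiotic penalty
    return reward


def is_dead(map_mmhg: float, lactate: float) -> bool:
    """Death condition from the terminal rules: MAP < 35 or lactate > 15."""
    return map_mmhg < 35.0 or lactate > 15.0


def terminal_reward(stabilised: bool, died: bool) -> float:
    """Terminal bonus or penalty; how 'stabilised' is decided is not shown here."""
    if died:
        return -8.0
    return 5.0 if stabilised else 0.0
```

Keeping the death check separate from the reward makes it easy to test the termination rule on its own.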
Grader (0.0–1.0)
Each completed episode is scored by a task-specific grader:
| Component | Easy | Medium | Hard |
|---|---|---|---|
| Survival | 40% | 30% | 25% |
| MAP normalisation | 25% | 20% | - |
| Lactate clearance | 20% | 15% | - |
| Vital combo (MAP + lactate) | - | - | 20% |
| WBC / temperature | 10% | 5% | - |
| Correct antibiotic class | - | 15% | - |
| Vasopressor usage | - | 5% | - |
| Antibiotic sequencing | - | - | 15% |
| Resistance management | - | - | 15% |
| Renal protection | - | - | 15% |
| Speed bonus | 5% | 5% | 10% |
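A grader of this shape is a weighted sum of component scores. The sketch below uses the Easy column as an example. The component keys and the `weighted_grade` function are hypothetical, and how each component score in [0, 1] is derived from an episode is left out because the table does not specify it.

```python
from typing import Dict

# Weights from the Easy column of the grader table.
EASY_WEIGHTS: Dict[str, float] = {
    "survival": 0.40,
    "map_normalisation": 0.25,
    "lactate_clearance": 0.20,
    "wbc_temperature": 0.10,
    "speed_bonus": 0.05,
}


def _clamp01(value: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return min(max(value, 0.0), 1.0)


def weighted_grade(component_scores: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Combine per-component scores (each in [0, 1]) into a final grade.

    Missing components count as 0; the result is clamped to [0, 1].
    """
    total = sum(
        weight * _clamp01(component_scores.get(name, 0.0))
        for name, weight in weights.items()
    )
    return _clamp01(total)
```

With this structure, an episode in which the patient survives but no other component scores would receive the survival weight alone.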
Baseline Scores
Baseline LLM agent (Nemotron 3 Super via NVIDIA API, seed=42):
| Task | Baseline Score | Notes |
|---|---|---|
| mild_sepsis | ~0.62 | LLM correctly identifies broad-spectrum; moderate speed |
| septic_shock | ~0.44 | Often misses narrow-spectrum; vasopressors applied correctly |
| severe_mods | ~0.31 | Sequencing rarely optimal; resistance accumulates |
Random agent (action sampled uniformly):
| Task | Random Score |
|---|---|
| mild_sepsis | ~0.35 |
| septic_shock | ~0.18 |
| severe_mods | ~0.08 |
Setup & Usage
Quick Start (Docker)
# Build
docker build -t sepsispilot .
# Run
docker run -p 7860:7860 sepsispilot
# Test
curl http://localhost:7860/health
Local Development
# Install dependencies
pip install -r requirements.txt
# Start server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
# Open dashboard
open http://localhost:7860
Running the Baseline Agent
export OPENAI_API_KEY="your-api-key"
export API_BASE_URL="https://integrate.api.nvidia.com/v1"
export MODEL_NAME="nvidia/llama-3.1-nemotron-70b-instruct"
export ENV_BASE_URL="http://localhost:7860"
python inference.py
# Runs all 3 tasks, 1 episode each (seed=42)
python inference.py --episodes 3 --seed 42
# 3 episodes per task
python inference.py --task mild_sepsis
# Single task only
Running Tests
pip install pytest
pytest tests/ -v
Pre-Submission Validation
# With server running:
python validate.py --url http://localhost:7860
API Reference
| Method | Endpoint | Description |
|---|---|---|
| POST | /reset | Start new episode {"task": "mild_sepsis", "seed": 42} |
| POST | /step | Take action {"action": 5} |
| GET | /state | Current patient state |
| GET | /grade | Score completed episode (0.0–1.0) |
| GET | /tasks | List all tasks |
| GET | /health | Server health check |
| GET | / | Interactive visual dashboard |
Interactive API docs: http://localhost:7860/docs
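The endpoints above are enough to drive a full episode from a script. The following sketch runs a uniformly random agent using only the standard library. The request payloads come from the table; the response schema is an assumption (in particular a boolean `done` field in the `/step` reply), and `call` and `run_random_episode` are illustrative names.

```python
import json
import random
from typing import Any, Callable, Optional
from urllib import request

BASE_URL = "http://localhost:7860"


def call(method: str, path: str, payload: Optional[dict] = None) -> Any:
    """Send a JSON request to the environment server and decode the reply."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())


def run_random_episode(task: str, seed: int, max_steps: int,
                       call_fn: Callable[..., Any] = call) -> Any:
    """Reset the task, take random actions until done, return the /grade reply."""
    rng = random.Random(seed)
    call_fn("POST", "/reset", {"task": task, "seed": seed})
    for _ in range(max_steps):
        result = call_fn("POST", "/step", {"action": rng.randrange(9)})
        # ASSUMPTION: the step reply is a JSON object with a boolean "done".
        if isinstance(result, dict) and result.get("done"):
            break
    return call_fn("GET", "/grade")
```

With the server running, `run_random_episode("mild_sepsis", seed=42, max_steps=24)` plays one episode of the easy task. Passing a different `call_fn` lets the loop be tested without a live server.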
Project Structure
sepsispilot/
├── openenv.yaml          # OpenEnv spec config
├── Dockerfile            # HF Spaces container
├── requirements.txt
├── README.md
├── inference.py          # Baseline LLM agent (mandatory)
├── validate.py           # Pre-submission validation
├── app.py                # FastAPI HTTP server + dashboard
├── environment/
│   ├── __init__.py
│   ├── models.py         # Pydantic typed models
│   ├── patient_sim.py    # Physiology simulation engine
│   ├── graders.py        # Episode scoring (0.0–1.0)
│   └── env.py            # OpenEnv class (reset/step/state/grade)
└── tests/
    └── test_env.py       # Unit tests (pytest)
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| OPENAI_API_KEY | | - | API key for LLM endpoint |
| API_BASE_URL | | https://integrate.api.nvidia.com/v1 | LLM endpoint |
| MODEL_NAME | | nvidia/llama-3.1-nemotron-70b-instruct | Model name |
| HF_TOKEN | | - | Hugging Face token |
| ENV_BASE_URL | | http://localhost:7860 | Environment server URL |
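A script that consumes these variables can fall back to the defaults listed in the table. The sketch below shows one way to do that. `load_config` and its dictionary keys are hypothetical, and whether `inference.py` reads its settings this way is an assumption.

```python
import os
from typing import Mapping, Optional


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read settings from the environment, using the table's defaults.

    Variables without a listed default come back as None when unset.
    """
    env = os.environ if env is None else env
    return {
        "api_key": env.get("OPENAI_API_KEY"),
        "api_base_url": env.get(
            "API_BASE_URL", "https://integrate.api.nvidia.com/v1"
        ),
        "model_name": env.get(
            "MODEL_NAME", "nvidia/llama-3.1-nemotron-70b-instruct"
        ),
        "hf_token": env.get("HF_TOKEN"),
        "env_base_url": env.get("ENV_BASE_URL", "http://localhost:7860"),
    }
```

Accepting an explicit mapping keeps the function easy to test without touching the real process environment.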
Design Decisions
Why sepsis? It's a genuine, high-stakes clinical problem where optimal treatment sequencing has enormous impact. The physiology is well understood but the decision-making is hard, which makes it well suited to RL.
Why synthetic simulation instead of direct MIMIC-IV replay? MIMIC-IV access requires credentialing. The simulation is calibrated to match MIMIC-IV population statistics, making it accessible while remaining medically realistic. An RL agent trained here can be evaluated against real MIMIC-IV data in future work.
Why dense rewards? Sepsis treatment spans 24–72 hours. Episode-end-only rewards create too sparse a signal for meaningful learning. Per-step rewards tied to vital signs provide a rich learning signal throughout the episode.
Built with ❤️ for the Meta PyTorch OpenEnv Hackathon 2026