---
title: ClinicalBench
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
tags:
- openenv
---
# 🔬 ClinicalBench
### A Benchmark for Evaluating Agentic Reasoning in Safety-Critical Clinical Workflows
[OpenEnv](https://github.com/meta-pytorch/OpenEnv) · [Hugging Face Space](https://huggingface.co/spaces/Timusgeorge/clinical_trial_auditor) · [License](LICENSE)
> **🎯 Llama 3.3 70B beats the 405B frontier model (0.66 vs 0.50).** ClinicalBench is an OpenEnv benchmark in which LLMs audit up to 720 oncology patient records against procedurally generated protocols. With multi-hop comorbidity traps, Simpson's Paradox confounders, and a brutal -0.30 false-positive penalty, its results indicate that agentic tool-calling efficiency (Llama 3.3 70B) can outperform raw parameter count (Llama 3.1 405B) in safety-critical workflows.
[Live Demo](https://huggingface.co/spaces/Timusgeorge/clinical_trial_auditor) · [Architecture](#architecture) · [Results](#benchmark-results) · [Quick Start](#quick-start) · [Leaderboard](#-frontier-model-leaderboard)
---
## 🖥️ The Enterprise Audit Dashboard (Live Demo)
*Because safety-critical AI requires transparency, ClinicalBench includes a production-ready enterprise dashboard to visualize the agent's ReAct loop in real-time.*
Launch the **[Hugging Face Space](https://huggingface.co/spaces/Timusgeorge/clinical_trial_auditor)** to see the 70B reasoning agent actively triage patients, compute bias distributions, and flag protocol violations while safely navigating the 8K token context limit.
---
## The Problem
Clinical data auditing is one of medicine's most consequential workflows. A single undetected protocol violation can invalidate years of trial data, delay drug approvals, and, in the worst cases, put patients at risk. Today's AI systems fail at this task in three specific ways:
| Failure Mode | What Happens | Why It Matters |
|:---|:---|:---|
| **Overflagging** | LLMs flag valid edge cases (e.g., Stage IV patients with extended treatment windows) as violations | False alarms waste reviewer time and erode trust in AI-assisted auditing |
| **Temporal Confusion** | Models miss impossible date orderings (death before treatment) while fixating on superficial anomalies | Critical safety signals go undetected |
| **Bias Misinterpretation** | Models detect demographic skew in raw statistics but cannot distinguish genuine selection bias from confounded high-risk cohorts | Naive bias detection causes incorrect escalations or dangerous dismissals |
ClinicalBench is designed to evaluate and train agents that can overcome all three failure modes simultaneously.
---
## Why ClinicalBench Exists
Existing RL benchmarks for agents fall into two categories: **game-like environments** (code golf, math puzzles) where memorization helps, and **static dataset tasks** (classification, extraction) where the answer is fixed. Neither captures the reality of clinical auditing, where:
- **Rules change every episode**: eligibility criteria, timing windows, and bias thresholds are protocol-specific
- **Edge cases are not errors**: Stage IV patients legitimately have longer treatment windows
- **Statistics lie without context**: a minority group's higher mortality rate may reflect disease severity, not unfair sampling
- **The step budget is limited**: agents must prioritize which patients and which patterns to investigate
ClinicalBench fills this gap by generating a new procedural dataset and protocol for every `reset()`, forcing agents to **read and reason** rather than memorize.
---
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                   ClinicalBench Architecture                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  reset(seed, task_id)                                           │
│             │                                                   │
│             ▼                                                   │
│  ┌──────────────────────┐    ┌─────────────────────────────┐    │
│  │ Procedural Dataset   │───▶│ Episode-Specific Protocol   │    │
│  │ Generator            │    │ Excerpt                     │    │
│  │ • 300-720 patients   │    │ • Dynamic age range         │    │
│  │ • Seeded RNG         │    │ • Variable timing windows   │    │
│  │ • Adversarial traps  │    │ • Stage IV exceptions       │    │
│  │ • Hidden confounders │    │ • Bias thresholds           │    │
│  └──────────────────────┘    └─────────────────────────────┘    │
│             │                               │                   │
│             ▼                               ▼                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                  Agent Interaction Loop                   │  │
│  │       Thought → Tool → Observation → Flag → Report        │  │
│  ├───────────────────────────────────────────────────────────┤  │
│  │  investigate_pattern(var)  → distribution summary         │  │
│  │  compute_distribution(var) → cohort breakdown             │  │
│  │  flag_error(patient, type) → correct/false positive       │  │
│  │  submit_report(text)       → quality score                │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│                                ▼                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                 Multi-Dimensional Grading                 │  │
│  │      Recall (70%) + Precision (15%) + Workflow (5%)       │  │
│  │          + Efficiency (5%) + Report Quality (5%)          │  │
│  │       Dense step rewards + episode benchmark score        │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
### Key Design Decisions
1. **Procedural Generation**: Each `reset()` samples a new protocol with different age ranges, timing windows, and bias thresholds using seeded stochastic processes. No two environments are identical, preventing memorization.
2. **Adversarial Traps**: Valid edge cases (boundary ages, near-window delays, valid Stage IV exceptions) are deliberately injected to punish agents that use naive threshold-based heuristics.
3. **Confounder-Aware Bias**: Hard episodes may contain either genuine selection bias OR a confounded high-risk cohort. The confounder (high-risk outreach site with more late-stage patients) creates an overall mortality gap that disappears after stage-stratified analysis. Agents must perform this adjustment before flagging.
4. **Phase-Gated Workflow**: Agents must investigate variables before flagging errors, and compute distributions before claiming bias. Skipping phases is penalized, encouraging structured reasoning over guessing.
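
The phase gate in item 4 can be pictured with a small sketch. This is not the environment's actual code: the `PhaseGate` class and its method names are hypothetical, and the only rules encoded are the two stated above (investigate before flagging, compute a distribution before claiming bias).

```python
from dataclasses import dataclass, field


@dataclass
class PhaseGate:
    """Hypothetical gate that tracks what the agent has done so far."""

    investigated: set = field(default_factory=set)
    distributions: set = field(default_factory=set)

    def record(self, action_type: str, variable: str) -> None:
        # Remember which variables were investigated or summarized.
        if action_type == "investigate_pattern":
            self.investigated.add(variable)
        elif action_type == "compute_distribution":
            self.distributions.add(variable)

    def allows_flag(self, error_type: str) -> bool:
        # Any flag requires at least one prior investigation.
        if not self.investigated:
            return False
        # A bias claim additionally requires a computed distribution.
        if error_type == "selection_bias" and not self.distributions:
            return False
        return True


gate = PhaseGate()
blocked_before = gate.allows_flag("invalid_age")      # nothing investigated yet
gate.record("investigate_pattern", "age")
allowed_after = gate.allows_flag("invalid_age")       # investigation done
bias_blocked = gate.allows_flag("selection_bias")     # no distribution yet
gate.record("compute_distribution", "ethnicity")
bias_allowed = gate.allows_flag("selection_bias")
```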
---
## Task Suite
### Task 1: `task_easy` - Dynamic Eligibility Screening
| Property | Value |
|:---|:---|
| Dataset | ~300 patients |
| Error types | `invalid_age` |
| Difficulty source | Age bounds are episode-specific (e.g., 35-75, 45-85), not fixed at 18-120 |
| Traps | Valid boundary ages at exact protocol limits |
| Step budget | 25 |
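
As a minimal sketch of the eligibility rule, assuming the protocol's age limits are inclusive (the table lists exact boundary ages as valid traps): the function name is illustrative and the 35-75 range is just one of the example ranges mentioned above.

```python
def is_invalid_age(age: int, min_age: int, max_age: int) -> bool:
    """Return True only when age falls outside the inclusive protocol range."""
    return not (min_age <= age <= max_age)


# Illustrative episode with protocol bounds 35-75.
ages = [34, 35, 60, 75, 76]
flags = [is_invalid_age(a, 35, 75) for a in ages]
print(flags)  # [True, False, False, False, True]
```

An agent that hardcodes a generic 18-120 range would flag none of these patients, and one that treats the limits as exclusive would wrongly flag the two boundary ages.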
### Task 2: `task_medium` - Protocol Timeline Audit
| Property | Value |
|:---|:---|
| Dataset | ~480 patients |
| Error types | `invalid_age`, `temporal_inconsistency`, `protocol_window_violation` |
| Difficulty source | Treatment-start window is protocol-specific; Stage IV has a longer valid window |
| Traps | Near-boundary delays, valid Stage IV exceptions, near-immediate valid deaths |
| Step budget | 50 |
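
A temporal-consistency check might look like the sketch below. The argument names `treatment_start` and `death_date` are assumptions, not the dataset's confirmed column names, and treating a same-day death as valid is also an assumption based on the "near-immediate valid deaths" trap.

```python
from datetime import date
from typing import Optional


def has_temporal_inconsistency(treatment_start: date, death_date: Optional[date]) -> bool:
    """Flag only impossible orderings: death strictly before treatment start."""
    if death_date is None:
        # Patient is alive or the outcome is unknown: nothing to compare.
        return False
    return death_date < treatment_start


start = date(2024, 3, 10)
impossible = has_temporal_inconsistency(start, date(2024, 3, 1))       # death before treatment
near_immediate = has_temporal_inconsistency(start, date(2024, 3, 11))  # valid trap
alive = has_temporal_inconsistency(start, None)
```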
### Task 3: `task_hard` - Equity + Protocol Audit
| Property | Value |
|:---|:---|
| Dataset | ~720 patients with **25+ fields** (including 11 clinical noise columns) |
| Error types | `invalid_age`, `temporal_inconsistency`, `protocol_window_violation`, `selection_bias` |
| Difficulty source | Multi-hop comorbidity exception, Simpson's Paradox bias, context dilution from EHR noise |
| Traps | Comorbidity-negated Stage IV exceptions, confounder cohorts, treatment-arm skew, near-boundary windows |
| Step budget | 75 (tight for 29 batches + investigations + flags) |
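
The multi-hop comorbidity exception can be sketched as a two-step lookup. All dictionary keys and the numeric values below are hypothetical placeholders, since real protocols are generated per episode; the only rule taken from this README is that the Stage IV extension is revoked when `comorbidity_index` exceeds the protocol threshold.

```python
def allowed_window_days(stage: str, comorbidity_index: float, protocol: dict) -> int:
    """Pick the treatment-start window that applies to one patient."""
    stage_iv_exception = (
        stage == "IV" and comorbidity_index <= protocol["comorbidity_threshold"]
    )
    if stage_iv_exception:
        return protocol["stage_iv_window_days"]
    # Non-Stage-IV patients, and Stage IV patients above the comorbidity
    # threshold, fall back to the standard window.
    return protocol["standard_window_days"]


def is_window_violation(patient: dict, protocol: dict) -> bool:
    limit = allowed_window_days(patient["stage"], patient["comorbidity_index"], protocol)
    return patient["days_to_treatment"] > limit


# Illustrative protocol values only.
protocol = {"standard_window_days": 30, "stage_iv_window_days": 60, "comorbidity_threshold": 3}
valid_exception = {"stage": "IV", "comorbidity_index": 2, "days_to_treatment": 45}
revoked_exception = {"stage": "IV", "comorbidity_index": 5, "days_to_treatment": 45}
```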
---
## Why ClinicalBench Is Hard
This benchmark is designed to expose fundamental limitations in current AI systems:
| Challenge | Why It Breaks Naive Agents |
|:---|:---|
| **Dynamic protocols** | Rules embedded in natural language change every episode, so hardcoded thresholds fail |
| **Multi-hop comorbidity override** | Stage IV exception is revoked when `comorbidity_index > threshold`, which requires 3-step cross-referencing (stage → comorbidity → window) that LLMs almost always miss |
| **Clinical noise columns** | 11 realistic EHR fields (BMI, LDH, medications, etc.) dilute LLM attention across 720 × 25+ field records |
| **Simpson's Paradox** | High-risk sites inflate mortality for minorities, but the cause is disease severity, not sampling bias; the raw gap disappears once outcomes are stratified by stage |
| **Tight step budget** | 75 steps for 40+ errors in 720 patients, so agents must triage across 29 batches and cannot check everything |
| **Phased workflow** | Flagging before investigating is blocked and penalized, which forces structured reasoning |
| **Overconfidence penalty** | High-confidence wrong flags are penalized 1.8×, which discourages guessing |
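
To make the Simpson's Paradox row concrete, here is a toy cohort with invented numbers (not benchmark data). Group A is enrolled mostly at late stage, so its raw mortality is much higher than group B's, yet within each stage the two groups have identical rates.

```python
def make_cohort(group, stage, n, deaths):
    """Build n toy records for one group/stage cell, the first `deaths` of which died."""
    return [{"group": group, "stage": stage, "died": i < deaths} for i in range(n)]


# Invented numbers: group A is mostly Stage IV, group B mostly Stage I.
records = (
    make_cohort("A", "IV", 80, 40) + make_cohort("A", "I", 20, 2)
    + make_cohort("B", "IV", 20, 10) + make_cohort("B", "I", 80, 8)
)


def mortality(rows, group, stage=None):
    """Mortality rate for a group, optionally restricted to one stage."""
    subset = [
        r for r in rows
        if r["group"] == group and (stage is None or r["stage"] == stage)
    ]
    return sum(r["died"] for r in subset) / len(subset)


overall_gap = mortality(records, "A") - mortality(records, "B")
stratified_gaps = {
    stage: mortality(records, "A", stage) - mortality(records, "B", stage)
    for stage in ("I", "IV")
}
# The overall gap is about 0.24 (0.42 vs 0.18); both per-stage gaps are zero.
```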
---
## Benchmark Results
> **All LLM agent scores come from genuine LLM inference**: the model reads raw patient data, decides what to flag, and gets scored by the environment. No Python detectors, no hardcoded logic. The LLM is the brain; Python is just the hands. The Heuristic row is the one exception and is labeled as such.
Reproducible benchmark scores (`seed=20260402`):
| Agent | Easy | Medium | Hard | **Average** | Precision | Description |
|:---|:---:|:---:|:---:|:---:|:---:|:---|
| 🔴 **Naive LLM** | 0.19 | 0.16 | 0.02 | **0.12** | 10% | Single prompt, tiny sample, zero feedback |
| 🟡 **Heuristic** | 0.98 | 0.79 | 0.73 | **0.83** | 67% | Deterministic Python rules (honestly labeled, no LLM) |
| 🟠 **ReAct (3.1 405B)** | 0.77 | 0.38 | 0.34 | **0.50** | 26% | Massive parameters lead to false-positive hallucinations |
| 🟢 **ReAct (3.3 70B)** | 0.98 | 0.60 | 0.40 | **0.66** | 45% | Specialized tool-calling efficiently avoids logic traps |
### 🧠 The Generational Leap: Why 3.3 70B beats 3.1 405B
When forced to play the game fairly, the 405-billion-parameter frontier model scored just **0.50**, while the newer, smaller **Llama 3.3 70B scored 0.66**. The ClinicalBench runs point to a behavioral difference between the two generations:
1. **The Overthinking Trap (405B's Flaw):** In our runs, 3.1 405B looked at the EHR noise in the Hard task and hallucinated complex, non-existent clinical reasons to flag a patient. Our brutal `-0.30` penalty for false positives then dragged its score down (26% precision).
2. **Agentic Tool Mastery (70B's Advantage):** Llama 3.3 appears to be better tuned for ReAct-style tool use. It doesn't hallucinate ghosts; it calls the `[INV]` tool, reads the JSON, flags the exact patients, and stops. It navigates the environment better because it is a better "driver."
**What This Proves:**
* **Language understanding ≠ clinical reasoning.**
* **Bigger is not always better in auditing.** In these runs, the larger model was more overconfident and produced more false-positive hallucinations.
* **Llama 3.3's agentic tuning pays off.** On ClinicalBench, the 3.3 model's tool-calling behavior translates into more precise, more accurate clinical compliance auditing.
### 🏆 Frontier Model Leaderboard
We challenge all frontier models to beat the benchmark. Submit your scores via PR.
| Rank | Model | Easy | Medium | Hard | **Avg Score** |
|:---:|:---|:---:|:---:|:---:|:---:|
| 1 | Meta-Llama-3.3-70B-Instruct | 0.98 | 0.60 | 0.40 | **0.66** |
| 2 | Meta-Llama-3.1-405B-Instruct | 0.77 | 0.38 | 0.34 | **0.50** |
| - | _Your model here_ | - | - | - | - |
> **Challenge:** Can any model beat 0.66 average on genuine ReAct evaluation? The 2-hop comorbidity trap, overconfidence penalty, and Simpson's Paradox remain a stress test for every model we evaluate.
### 🏗️ ReAct Agent Architecture
```
┌────────────────────────────────────────────────────────────┐
│                      INFERENCE ENGINE                      │
│ ┌────────────┐  ┌──────────────┐  ┌──────────────────────┐ │
│ │ Phase 1    │  │ Phase 2      │  │ Phase 3              │ │
│ │ INVEST.    │→ │ BATCHED SCAN │→ │ REPORT               │ │
│ │ 1 LLM call │  │ 25 pts/batch │  │ 1 LLM call           │ │
│ │ ~500 tok   │  │ ~2K tok each │  │ ~500 tok             │ │
│ │            │  │ MEMORY WIPE ↻│  │                      │ │
│ └────────────┘  └──────────────┘  └──────────────────────┘ │
│                                                            │
│  Token Budget:  ~2K per call (fits 8K context window)      │
│  Memory Policy: FRESH context each batch (no snowball)     │
│  Error Budget:  -0.30 per false positive, 1.8x overconf    │
└────────────────────────────────────────────────────────────┘
                    ↓ JSON actions (investigate/flag/report)
┌────────────────────────────────────────────────────────────┐
│               OPENENV ENVIRONMENT (Grading)                │
│  Procedural Generation → Phase Gate → Scoring → Feedback   │
└────────────────────────────────────────────────────────────┘
```
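
The batched scan with a fresh context per batch can be sketched as follows. The helper names and the prompt layout are hypothetical; only the batch size of 25 and the "no carried-over history" policy come from the diagram above.

```python
from typing import Iterator, List


def iter_batches(patients: List[dict], batch_size: int = 25) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most batch_size patients."""
    for start in range(0, len(patients), batch_size):
        yield patients[start:start + batch_size]


def build_batch_prompt(protocol_excerpt: str, batch: List[dict]) -> str:
    """Build a fresh prompt from the protocol and the current batch only."""
    lines = [protocol_excerpt, "", "Patients:"]
    lines += [str(p) for p in batch]
    return "\n".join(lines)


# Placeholder records with made-up IDs; real records carry 25+ fields.
patients = [{"patient_id": f"P{i:04d}"} for i in range(720)]
batches = list(iter_batches(patients))
print(len(batches), len(batches[-1]))  # 29 20

# The second prompt contains nothing from the first batch.
second_prompt = build_batch_prompt("Protocol rules go here.", batches[1])
```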
---
## Action Space
```python
class AuditAction(Action):
    action_type: str              # investigate_pattern | compute_distribution |
                                  # flag_error | propose_fix | submit_report
    variable: Optional[str]       # Field to investigate or compute
    patient_id: Optional[str]     # Patient to flag
    error_type: Optional[str]     # invalid_age | temporal_inconsistency |
                                  # protocol_window_violation | selection_bias
    reason: Optional[str]         # Justification text
    proposed_value: Optional[str]
    report: Optional[str]         # Final audit report
    confidence: Optional[float]   # 0.0-1.0 confidence in the flag
```
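
For illustration, a `flag_error` action could be assembled as a plain dictionary that mirrors the fields above. The helper `make_flag_action` and the patient ID format are hypothetical, and how the payload is actually sent (for example through `client.py`) is not shown here.

```python
import json

VALID_ERROR_TYPES = {
    "invalid_age",
    "temporal_inconsistency",
    "protocol_window_violation",
    "selection_bias",
}


def make_flag_action(patient_id: str, error_type: str, reason: str, confidence: float) -> dict:
    """Build a flag_error payload after basic sanity checks (illustrative helper)."""
    if error_type not in VALID_ERROR_TYPES:
        raise ValueError(f"unknown error_type: {error_type}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within [0.0, 1.0]")
    return {
        "action_type": "flag_error",
        "patient_id": patient_id,
        "error_type": error_type,
        "reason": reason,
        "confidence": confidence,
    }


action = make_flag_action(
    "P0042", "invalid_age", "Age 34 is below the protocol minimum of 35", 0.9
)
payload = json.dumps(action)  # JSON string an agent could emit
```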
## Observation Space
```python
class AuditObservation(Observation):
    done: bool                         # Episode finished?
    reward: float                      # Dense step reward
    task_id: str                       # task_easy | task_medium | task_hard
    task_type: str                     # Audit category
    task_description: str              # Task instructions
    protocol_title: str                # Episode protocol ID
    trial_protocol_excerpt: str        # Natural language protocol rules
    dataset: list[dict]                # Full patient records
    errors_found: list[str]            # Correctly flagged patients
    patterns_investigated: list[str]   # Variables investigated
    distributions_computed: list[str]  # Distributions computed
    feedback: str                      # Step-by-step feedback
    score_so_far: float                # Current benchmark score [0, 1]
    dense_reward_total: float          # Cumulative dense reward
    score_breakdown: dict[str, float]  # {recall, precision, workflow, efficiency, report}
    attempts_remaining: int            # Steps left in budget
    phase: str                         # investigation | flagging
```
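
One way an agent might use these fields to pace itself is sketched below. The policy is purely illustrative (it is not the baseline in `inference.py`), and the observation is represented as a plain dictionary for simplicity.

```python
def next_action_type(obs: dict, reserve_steps: int = 1) -> str:
    """Choose the next action type from the observation (hypothetical policy)."""
    if obs["done"]:
        return "none"
    # Keep the last step(s) for the report so the episode never ends without one.
    if obs["attempts_remaining"] <= reserve_steps:
        return "submit_report"
    # Respect the phase gate: investigate first, flag afterwards.
    if obs["phase"] == "investigation":
        return "investigate_pattern"
    return "flag_error"


early = next_action_type({"done": False, "attempts_remaining": 25, "phase": "investigation"})
middle = next_action_type({"done": False, "attempts_remaining": 10, "phase": "flagging"})
last = next_action_type({"done": False, "attempts_remaining": 1, "phase": "flagging"})
```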
---
## Reward Design
ClinicalBench uses **two scoring layers** to separate RL training signal from benchmark evaluation:
### Dense Step Reward (for RL training)
- **Correct flag**: +0.16
- **False positive**: -0.26 (asymmetric to penalize guessing)
- **Duplicate flag**: -0.08
- **New investigation**: +0.04
- **Overconfident wrong flag**: false-positive penalty × 1.8
- **Per-step cost**: -0.004 × step_count (increasing pressure)
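
A rough reading of these rules as code is given below. Several details are assumptions rather than confirmed behavior: that the 1.8 multiplier scales the false-positive penalty, that "overconfident" means a confidence of at least 0.8, and that the step cost is subtracted on every action.

```python
BASE_REWARDS = {
    "correct_flag": 0.16,
    "false_positive": -0.26,
    "duplicate_flag": -0.08,
    "new_investigation": 0.04,
}


def dense_step_reward(outcome, step_count, confidence=None, overconfidence_threshold=0.8):
    """Approximate the dense reward for a single step (sketch, not the real grader)."""
    reward = BASE_REWARDS[outcome]
    overconfident = confidence is not None and confidence >= overconfidence_threshold
    if outcome == "false_positive" and overconfident:
        # Assumed interpretation: the wrong-flag penalty is scaled by 1.8.
        reward *= 1.8
    # The step cost grows linearly with the number of steps taken so far.
    return reward - 0.004 * step_count


correct_at_step_10 = dense_step_reward("correct_flag", 10)              # 0.16 - 0.04
overconfident_miss = dense_step_reward("false_positive", 0, confidence=0.95)  # -0.26 * 1.8
```

Under this reading, one overconfident false positive costs nearly three correct flags, which is why guessing is a losing strategy.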
### Episode Benchmark Score (for evaluation)
| Component | Weight | Signal |
|:---|:---:|:---|
| Recall | 70% | What fraction of real errors were caught? |
| Precision | 15% | How many flags were correct? |
| Workflow Discipline | 5% | Did the agent investigate before flagging? |
| Efficiency | 5% | Ratio of useful actions to total actions |
| Report Quality | 5% | Does the report cite protocol, root cause, risk, corrective action, fairness? |
This separation keeps the RL signal dense (partial progress on every step) while preventing early score saturation from hiding later mistakes.
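
Assuming the episode score is a plain weighted sum of components that each lie in `[0, 1]` (the table gives the weights but not the exact formula), the combination can be sketched like this. The keys follow `score_breakdown`; the patient IDs and the convention that precision is 0 when nothing is flagged are illustrative.

```python
WEIGHTS = {
    "recall": 0.70,
    "precision": 0.15,
    "workflow": 0.05,
    "efficiency": 0.05,
    "report": 0.05,
}


def precision_recall(flagged: set, true_errors: set):
    """Compute precision and recall over sets of flagged patient IDs."""
    hits = len(flagged & true_errors)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_errors) if true_errors else 0.0
    return precision, recall


def benchmark_score(breakdown: dict) -> float:
    """Weighted sum of the five components (assumed linear combination)."""
    return sum(WEIGHTS[name] * breakdown[name] for name in WEIGHTS)


# Toy episode: 4 flags, 2 of them correct, 5 real errors in total.
precision, recall = precision_recall({"P1", "P2", "P3", "P4"}, {"P1", "P2", "P5", "P6", "P7"})
score = benchmark_score({
    "recall": recall,        # 0.4
    "precision": precision,  # 0.5
    "workflow": 1.0,
    "efficiency": 0.5,
    "report": 0.8,
})
```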
---
## Procedural Generation
Each episode generates a unique dataset with new protocol constraints:
```bash
python3 server/dataset_generator.py
```
**Guarantees:**
- Same seed → identical dataset, protocol, and ground truth
- Different seeds → different protocols with different rules
- Deterministic grading: reproducible scores across machines
- Hard mode alternates between `true_bias` and `confounded_no_bias`
**Example validated profile (seed=42):**
- Easy: 300 patients, 8 errors, 13 traps
- Medium: 480 patients, 23 errors, 25 traps
- Hard: 720 patients, 43 errors, 40 traps (incl. 10 comorbidity override traps)
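
The seed guarantee rests on a standard pattern: give each episode its own random number generator seeded from the episode seed. The sketch below shows the idea only; the parameter names and sampling ranges are invented and do not reflect what `server/dataset_generator.py` actually samples.

```python
import random


def sample_protocol(seed: int) -> dict:
    """Sample illustrative protocol parameters from an isolated, seeded RNG."""
    rng = random.Random(seed)  # private generator: no dependence on global state
    min_age = rng.randint(30, 50)
    max_age = min_age + rng.randint(30, 45)
    standard_window = rng.randint(21, 45)
    return {
        "min_age": min_age,
        "max_age": max_age,
        "standard_window_days": standard_window,
        "stage_iv_window_days": standard_window + rng.randint(14, 45),
        "bias_threshold": round(rng.uniform(0.05, 0.15), 3),
    }


first = sample_protocol(42)
second = sample_protocol(42)   # same seed, same protocol
distinct = {tuple(sorted(sample_protocol(s).items())) for s in range(20)}
```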
---
## Quick Start
### 1. Start the Server
```bash
cd server
PYTHONPATH=.. python3 -m uvicorn app:app --host 0.0.0.0 --port 8000
```
### 2. Open the Dashboard
Navigate to [http://localhost:8000](http://localhost:8000) to see the enterprise audit command center. Select an agent and task, then click **Start Audit** to watch the reasoning loop in real time.
### 3. Health Check
```bash
curl -s http://localhost:8000/health
```
### 4. Run Baseline Inference
```bash
# Full comparison (all 3 agents × all 3 tasks)
ENV_BASE_URL=inprocess python3 inference.py --mode all --seed 20260402
# Single agent mode
python3 inference.py --mode full
```
### 5. OpenEnv Validation
```bash
openenv validate .
```
---
## Docker
```bash
docker build -t clinical-bench:latest .
docker run -p 8000:8000 clinical-bench:latest
```
The container exposes:
- `/health` for health checks
- `/` for the enterprise dashboard
- WebSocket endpoints for OpenEnv `reset()` / `step()` / `state()`
---
## Real-World Relevance
ClinicalBench models tasks that clinical data managers perform daily:
| Real-World Task | ClinicalBench Equivalent |
|:---|:---|
| ICH-E6(R2) protocol compliance review | Age eligibility + treatment window verification |
| FDA 21 CFR Part 11 data integrity audit | Temporal consistency checking |
| DSMB safety signal assessment | Stage-adjusted outcome disparity analysis |
| IRB equity review | Confounder-aware selection bias detection |
This benchmark is immediately useful for evaluating whether an LLM-based agent can be safely deployed in a clinical data management workflow, one of healthcare AI's highest-value, highest-risk applications.
---
## OpenEnv Compliance
- [x] Typed `Action`, `Observation`, `State` models (Pydantic)
- [x] `reset(seed, task_id) → Observation`
- [x] `step(action) → Observation`
- [x] `state → current state`
- [x] `openenv.yaml` with metadata and 3 tasks
- [x] `openenv validate .` passes
- [x] 3 tasks with deterministic graders, scores in `[0.0, 1.0]`
- [x] Dense reward shaping + benchmark rubric
- [x] Reproducible `inference.py` at repo root
- [x] Dockerized with health check
- [x] Inference runtime < 3 minutes
- [x] Runs on 2 vCPU / 8GB memory
## Project Structure
```
clinical_trial_auditor/
├── openenv.yaml                  # OpenEnv manifest with 3 tasks
├── inference.py                  # Baseline inference (naive/heuristic/full)
├── client.py                     # EnvClient implementation
├── models.py                     # Typed Action/Observation/State
├── README.md
├── Dockerfile
├── requirements.txt
├── pyproject.toml
├── docs/
│   └── architecture.md           # Detailed system architecture
└── server/
    ├── app.py                    # FastAPI + dashboard API
    ├── clinical_trial_auditor_environment.py
    ├── dataset_generator.py      # Procedural adversarial data engine
    ├── models.py
    ├── requirements.txt
    └── static/
        └── index.html            # Enterprise audit dashboard
```
---
**Built for the Meta × Scaler School of Technology OpenEnv Hackathon**
### 𧬠Developer Note & Lineage
ClinicalBench is deeply informed by my ongoing research and architecture development on a **SEER (Surveillance, Epidemiology, and End Results) based oncology project**, active since 2024. The complexities modeled in this benchmarkβspecifically the Simpson's Paradox confounders, Stage IV comorbidity overrides, and the immense noise of real-world Electronic Health Recordsβare direct reflections of the challenges encountered when processing live clinical oncology data.
*Because the hardest thing about AI in healthcare isn't the model β it's knowing when to trust it.*
β **Sumit Saraswat** | GLA University