---
title: ClinicalBench
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
tags:
  - openenv
---
<div align="center">

# 🔬 ClinicalBench

### A Benchmark for Evaluating Agentic Reasoning in Safety-Critical Clinical Workflows

[OpenEnv](https://github.com/meta-pytorch/OpenEnv) · [Live Demo](https://huggingface.co/spaces/Timusgeorge/clinical_trial_auditor) · [Docker](#docker) · [License](LICENSE) · [Benchmark Results](#benchmark-results) · [Task Suite](#task-suite) · [Why It's Hard](#why-clinicalbench-is-hard)

> **🎯 Llama 3.3 70B beats the 405B frontier model (0.66 vs 0.50).** ClinicalBench is an OpenEnv benchmark in which LLMs audit 720 oncology patient records against procedurally generated protocols. With multi-hop comorbidity traps, Simpson's Paradox confounders, and a brutal −0.30 false-positive penalty, ClinicalBench demonstrates that agentic tool-calling efficiency (Llama 3.3 70B) outperforms raw parameter count (Llama 3.1 405B) in safety-critical workflows.

[Live Demo](https://huggingface.co/spaces/Timusgeorge/clinical_trial_auditor) · [Architecture](#architecture) · [Results](#benchmark-results) · [Quick Start](#quick-start) · [Leaderboard](#-frontier-model-leaderboard)

</div>
---

## 🖥️ The Enterprise Audit Dashboard (Live Demo)

*Because safety-critical AI requires transparency, ClinicalBench includes a production-ready enterprise dashboard that visualizes the agent's ReAct loop in real time.*

Launch the **[Hugging Face Space](https://huggingface.co/spaces/Timusgeorge/clinical_trial_auditor)** to watch the 70B reasoning agent triage patients, compute bias distributions, and flag protocol violations while safely navigating the 8K-token context limit.

---
## The Problem

Clinical data auditing is one of medicine's most consequential workflows. A single undetected protocol violation can invalidate years of trial data, delay drug approvals, and — in the worst cases — put patients at risk. Today's AI systems fail at this task in three specific ways:

| Failure Mode | What Happens | Why It Matters |
|:---|:---|:---|
| **Overflagging** | LLMs flag valid edge cases (e.g., Stage IV patients with extended treatment windows) as violations | False alarms waste reviewer time and erode trust in AI-assisted auditing |
| **Temporal Confusion** | Models miss impossible date orderings (death before treatment) while fixating on superficial anomalies | Critical safety signals go undetected |
| **Bias Misinterpretation** | Models detect demographic skew in raw statistics but cannot distinguish genuine selection bias from confounded high-risk cohorts | Naive bias detection causes incorrect escalations or dangerous dismissals |

ClinicalBench is designed to evaluate and train agents that can overcome all three failure modes simultaneously.
---

## Why ClinicalBench Exists

Existing RL benchmarks for agents fall into two categories: **game-like environments** (code golf, math puzzles), where memorization helps, and **static dataset tasks** (classification, extraction), where the answer is fixed. Neither captures the reality of clinical auditing, where:

- **Rules change every episode** — eligibility criteria, timing windows, and bias thresholds are protocol-specific
- **Edge cases are not errors** — Stage IV patients legitimately have longer treatment windows
- **Statistics lie without context** — a minority group's higher mortality rate may reflect disease severity, not unfair sampling
- **The step budget is limited** — agents must prioritize which patients and which patterns to investigate

ClinicalBench fills this gap by generating a new procedural dataset and protocol on every `reset()`, forcing agents to **read and reason** rather than memorize.
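To make the regeneration concrete, here is a minimal client-side sketch, assuming the `EnvClient` in `client.py` follows the standard OpenEnv `reset()` contract (the `base_url` argument is illustrative):

```python
# Minimal sketch: two resets of the same task yield different protocols.
# Assumes EnvClient (client.py) follows the standard OpenEnv contract and
# that the server from Quick Start is running; the base_url is illustrative.
from client import EnvClient

env = EnvClient(base_url="http://localhost:8000")

obs_a = env.reset(seed=101, task_id="task_easy")
obs_b = env.reset(seed=202, task_id="task_easy")

# Same task, different episode: age bounds, timing windows, and bias
# thresholds all change, so a memorized rule set cannot transfer.
print(obs_a.trial_protocol_excerpt)
print(obs_b.trial_protocol_excerpt)
```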
---

## Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                    ClinicalBench Architecture                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  reset(seed, task_id)                                            │
│          │                                                       │
│          ▼                                                       │
│  ┌──────────────────────┐      ┌─────────────────────────────┐   │
│  │  Procedural Dataset  │─────▶│  Episode-Specific Protocol  │   │
│  │  Generator           │      │  Excerpt                    │   │
│  │  • 300-720 patients  │      │  • Dynamic age range        │   │
│  │  • Seeded RNG        │      │  • Variable timing windows  │   │
│  │  • Adversarial traps │      │  • Stage IV exceptions      │   │
│  │  • Hidden confounders│      │  • Bias thresholds          │   │
│  └──────────────────────┘      └─────────────────────────────┘   │
│          │                                  │                    │
│          ▼                                  ▼                    │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                   Agent Interaction Loop                   │  │
│  │    Thought → Tool → Observation → Flag → Report            │  │
│  ├────────────────────────────────────────────────────────────┤  │
│  │  investigate_pattern(var)  → distribution summary          │  │
│  │  compute_distribution(var) → cohort breakdown               │  │
│  │  flag_error(patient, type) → correct/false positive        │  │
│  │  submit_report(text)       → quality score                 │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                │                                 │
│                                ▼                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                  Multi-Dimensional Grading                 │  │
│  │   Recall (70%) + Precision (15%) + Workflow (5%)           │  │
│  │   + Efficiency (5%) + Report Quality (5%)                  │  │
│  │   Dense step rewards + episode benchmark score             │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```
### Key Design Decisions

1. **Procedural Generation** — Each `reset()` samples a new protocol with different age ranges, timing windows, and bias thresholds using seeded stochastic processes. No two environments are identical, which prevents memorization.
2. **Adversarial Traps** — Valid edge cases (boundary ages, near-window delays, valid Stage IV exceptions) are deliberately injected to punish agents that rely on naive threshold-based heuristics.
3. **Confounder-Aware Bias** — Hard episodes contain either genuine selection bias or a confounded high-risk cohort. The confounder (a high-risk outreach site with more late-stage patients) creates an overall mortality gap that disappears after stage-stratified analysis. Agents must perform this adjustment before flagging — see the sketch after this list.
4. **Phase-Gated Workflow** — Agents must investigate variables before flagging errors, and compute distributions before claiming bias. Skipping phases is penalized, encouraging structured reasoning over guessing.
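The stratification behind decision 3 is small enough to sketch. The field names below (`ethnicity`, `stage`, `deceased`) are illustrative assumptions, not the environment's actual schema:

```python
# Illustrative stage-stratified mortality check. Field names (ethnicity,
# stage, deceased) are assumptions, not the environment's real schema.
from collections import defaultdict

def mortality_rates(patients, group_field="ethnicity", outcome="deceased"):
    """Per-group mortality rate over a list of patient dicts."""
    counts = defaultdict(lambda: [0, 0])  # group -> [deaths, total]
    for p in patients:
        counts[p[group_field]][0] += int(p[outcome])
        counts[p[group_field]][1] += 1
    return {g: deaths / total for g, (deaths, total) in counts.items()}

def stratified_rates(patients, strata_field="stage", **kwargs):
    """Recompute per-group rates within each disease stage. If an overall
    gap vanishes inside every stratum, the skew is a confounded high-risk
    cohort (do NOT flag), not selection bias (flag)."""
    strata = defaultdict(list)
    for p in patients:
        strata[p[strata_field]].append(p)
    return {s: mortality_rates(ps, **kwargs) for s, ps in strata.items()}
```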
---

## Task Suite

### Task 1: `task_easy` — Dynamic Eligibility Screening

| Property | Value |
|:---|:---|
| Dataset | ~300 patients |
| Error types | `invalid_age` |
| Difficulty source | Age bounds are episode-specific (e.g., 35-75, 45-85), not fixed at 18-120 |
| Traps | Valid boundary ages at exact protocol limits |
| Step budget | 25 |

### Task 2: `task_medium` — Protocol Timeline Audit

| Property | Value |
|:---|:---|
| Dataset | ~480 patients |
| Error types | `invalid_age`, `temporal_inconsistency`, `protocol_window_violation` |
| Difficulty source | Treatment-start window is protocol-specific; Stage IV has a longer valid window |
| Traps | Near-boundary delays, valid Stage IV exceptions, near-immediate valid deaths |
| Step budget | 50 |

### Task 3: `task_hard` — Equity + Protocol Audit

| Property | Value |
|:---|:---|
| Dataset | ~720 patients with **25+ fields** (including 11 clinical noise columns) |
| Error types | `invalid_age`, `temporal_inconsistency`, `protocol_window_violation`, `selection_bias` |
| Difficulty source | Multi-hop comorbidity exception, Simpson's Paradox bias, context dilution from EHR noise |
| Traps | Comorbidity-negated Stage IV exceptions, confounder cohorts, treatment-arm skew, near-boundary windows |
| Step budget | 75 (tight for 29 batches + investigations + flags) |
---

## Why ClinicalBench Is Hard

This benchmark is designed to expose fundamental limitations in current AI systems:

| Challenge | Why It Breaks Naive Agents |
|:---|:---|
| **Dynamic protocols** | Rules embedded in natural language change every episode — hardcoded thresholds fail |
| **Multi-hop comorbidity override** | The Stage IV exception is revoked when `comorbidity_index > threshold` — a 3-step cross-reference (stage → comorbidity → window) that LLMs almost always miss |
| **Clinical noise columns** | 11 realistic EHR fields (BMI, LDH, medications, etc.) dilute LLM attention across 720 × 25+ field records |
| **Simpson's Paradox** | High-risk sites inflate mortality for minorities, but the cause is disease severity, not sampling bias — the gap vanishes under stage-stratified analysis |
| **Tight step budget** | 75 steps for 40+ errors across 720 patients — agents must triage across 29 batches and cannot check everything |
| **Phased workflow** | Flagging before investigating is blocked and penalized — forces structured reasoning |
| **Overconfidence penalty** | High-confidence wrong flags are penalized 1.8× — discourages guessing |
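To illustrate the multi-hop override from the table above, here is the kind of rule an agent must reconstruct from the protocol text. All field and key names are illustrative assumptions, not the real dataset schema:

```python
# The multi-hop window rule an agent must reconstruct from protocol text.
# All field and key names here are illustrative assumptions.
def allowed_treatment_window(patient: dict, protocol: dict) -> int:
    """Base window, extended for Stage IV, unless a high comorbidity
    index revokes the exception (the hop most models miss)."""
    window = protocol["treatment_window_days"]
    if patient["stage"] == "IV":
        if patient["comorbidity_index"] <= protocol["comorbidity_threshold"]:
            window = protocol["stage_iv_window_days"]  # exception applies
        # else: exception revoked -> patient keeps the base window
    return window

def violates_window(patient: dict, protocol: dict) -> bool:
    delay = patient["treatment_start_day"] - patient["enrollment_day"]
    return delay > allowed_treatment_window(patient, protocol)
```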
---

## Benchmark Results

> **All scores come from genuine LLM inference** — the model reads raw patient data, decides what to flag, and is scored by the environment. No Python detectors, no hardcoded logic. The LLM is the brain; Python is just the hands.

Reproducible benchmark scores (`seed=20260402`):

| Agent | Easy | Medium | Hard | **Average** | Precision | Description |
|:---|:---:|:---:|:---:|:---:|:---:|:---|
| 🔴 **Naive LLM** | 0.19 | 0.16 | 0.02 | **0.12** | 10% | Single prompt, tiny sample, zero feedback |
| 🟡 **Heuristic** | 0.98 | 0.79 | 0.73 | **0.83** | 67% | Deterministic Python rules (honestly labeled, no LLM) |
| 🟠 **ReAct (3.1 405B)** | 0.77 | 0.38 | 0.34 | **0.50** | 26% | Massive parameter count leads to false-positive hallucinations |
| 🟢 **ReAct (3.3 70B)** | 0.98 | 0.60 | 0.40 | **0.66** | 45% | Specialized tool-calling efficiently avoids logic traps |

### 🧠 The Generational Leap: Why 3.3 70B Beats 3.1 405B

When forced to play the game fairly, the 405-billion-parameter frontier model scored just **0.50**, while the newer, smaller **Llama 3.3 70B scored 0.66**. ClinicalBench exposes the difference between the two generations:

1. **The Overthinking Trap (405B's flaw):** As a massive generalist, Llama 3.1 405B looks at the EHR noise in the Hard task and hallucinates complex, non-existent clinical reasons to flag a patient. The brutal `-0.30` false-positive penalty made the 405B destroy its own score.
2. **Agentic Tool Mastery (70B's advantage):** Llama 3.3 was heavily fine-tuned for ReAct-style tool use. It doesn't hallucinate ghosts; it calls the `[INV]` tool, reads the JSON, flags the exact patients, and stops. It navigates the environment better because it is the better "driver."

**What this shows:**

* **Language understanding ≠ clinical reasoning.**
* **Bigger is not always better in auditing.** Raw parameter count invites overconfidence and false-positive hallucinations.
* **Meta's 3.3 training recipe works.** ClinicalBench independently confirms that 3.3's agentic fine-tuning translates into safer, more accurate clinical compliance.

### 🏆 Frontier Model Leaderboard

We challenge all frontier models to beat the benchmark. Submit your scores via PR.

| Rank | Model | Easy | Medium | Hard | **Avg Score** |
|:---:|:---|:---:|:---:|:---:|:---:|
| 1 | Meta-Llama-3.3-70B-Instruct | 0.98 | 0.60 | 0.40 | **0.66** |
| 2 | Meta-Llama-3.1-405B-Instruct | 0.77 | 0.38 | 0.34 | **0.50** |
| — | _Your model here_ | — | — | — | — |

> **Challenge:** Can any model beat a 0.66 average under genuine ReAct evaluation? The multi-hop comorbidity trap, overconfidence penalty, and Simpson's Paradox remain a stress test for every model we evaluate.
### 🏗️ ReAct Agent Architecture

```
┌────────────────────────────────────────────────────────────┐
│                      INFERENCE ENGINE                      │
│  ┌────────────┐   ┌──────────────┐   ┌──────────────────┐  │
│  │  Phase 1   │   │   Phase 2    │   │     Phase 3      │  │
│  │  INVEST.   │─▶ │ BATCHED SCAN │─▶ │      REPORT      │  │
│  │ 1 LLM call │   │ 25 pts/batch │   │    1 LLM call    │  │
│  │  ~500 tok  │   │ ~2K tok each │   │     ~500 tok     │  │
│  │            │   │ MEMORY WIPE ↻│   │                  │  │
│  └────────────┘   └──────────────┘   └──────────────────┘  │
│                                                            │
│  Token Budget:  ~2K per call (fits 8K context window)      │
│  Memory Policy: FRESH context each batch (no snowball)     │
│  Error Budget:  -0.30 per false positive, 1.8x overconf    │
└────────────────────────────────────────────────────────────┘
                ↕ JSON actions (investigate/flag/report)
┌────────────────────────────────────────────────────────────┐
│               OPENENV ENVIRONMENT (Grading)                │
│  Procedural Generation → Phase Gate → Scoring → Feedback   │
└────────────────────────────────────────────────────────────┘
```
---

## Action Space

```python
from typing import Optional

# The Action base class comes from the environment's typed models (models.py).
class AuditAction(Action):
    action_type: str                # investigate_pattern | compute_distribution |
                                    # flag_error | propose_fix | submit_report
    variable: Optional[str]         # Field to investigate or compute
    patient_id: Optional[str]       # Patient to flag
    error_type: Optional[str]       # invalid_age | temporal_inconsistency |
                                    # protocol_window_violation | selection_bias
    reason: Optional[str]           # Justification text
    proposed_value: Optional[str]   # Suggested correction (for propose_fix)
    report: Optional[str]           # Final audit report
    confidence: Optional[float]     # 0.0-1.0 confidence in the flag
```
## Observation Space

```python
# The Observation base class comes from the environment's typed models (models.py).
class AuditObservation(Observation):
    done: bool                          # Episode finished?
    reward: float                       # Dense step reward
    task_id: str                        # task_easy | task_medium | task_hard
    task_type: str                      # Audit category
    task_description: str               # Task instructions
    protocol_title: str                 # Episode protocol ID
    trial_protocol_excerpt: str         # Natural-language protocol rules
    dataset: list[dict]                 # Full patient records
    errors_found: list[str]             # Correctly flagged patients
    patterns_investigated: list[str]    # Variables investigated
    distributions_computed: list[str]   # Distributions computed
    feedback: str                       # Step-by-step feedback
    score_so_far: float                 # Current benchmark score [0, 1]
    dense_reward_total: float           # Cumulative dense reward
    score_breakdown: dict[str, float]   # {recall, precision, workflow, efficiency, report}
    attempts_remaining: int             # Steps left in budget
    phase: str                          # investigation | flagging
```
---

## Reward Design

ClinicalBench uses **two scoring layers** to separate the RL training signal from benchmark evaluation:

### Dense Step Reward (for RL training)

- **Correct flag**: +0.16
- **False positive**: −0.26 (asymmetric to penalize guessing)
- **Duplicate flag**: −0.08
- **New investigation**: +0.04
- **Overconfident wrong flag**: penalty scaled ×1.8
- **Per-step cost**: −0.004 × step_count (increasing pressure)
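As a readable sketch (not the environment's actual grader), these terms might combine as follows; the 0.8 confidence threshold for "overconfident" is an assumption:

```python
# A readable sketch of the shaping above (not the environment's grader).
# The 0.8 threshold for "overconfident" is an assumption.
def dense_step_reward(outcome: str, step_count: int,
                      confidence: float = 0.5) -> float:
    base = {
        "correct_flag": 0.16,
        "false_positive": -0.26,
        "duplicate_flag": -0.08,
        "new_investigation": 0.04,
    }[outcome]
    if outcome == "false_positive" and confidence >= 0.8:
        base *= 1.8  # overconfident wrong flags hurt 1.8x more
    return base - 0.004 * step_count  # increasing per-step pressure
```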
### Episode Benchmark Score (for evaluation)

| Component | Weight | Signal |
|:---|:---:|:---|
| Recall | 70% | What fraction of real errors were caught? |
| Precision | 15% | How many flags were correct? |
| Workflow Discipline | 5% | Did the agent investigate before flagging? |
| Efficiency | 5% | Ratio of useful actions to total actions |
| Report Quality | 5% | Does the report cite protocol, root cause, risk, corrective action, fairness? |

This separation keeps the RL signal dense (partial progress on every step) while preventing early score saturation from hiding later mistakes.
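The episode score is the weighted sum of the rubric components, using the same keys the observation exposes in `score_breakdown`:

```python
# The rubric as a weighted sum, using the same keys the observation
# exposes in score_breakdown.
WEIGHTS = {"recall": 0.70, "precision": 0.15, "workflow": 0.05,
           "efficiency": 0.05, "report": 0.05}

def benchmark_score(breakdown: dict[str, float]) -> float:
    """Each component lies in [0, 1], so the weighted sum stays in [0, 1]."""
    return sum(WEIGHTS[k] * breakdown[k] for k in WEIGHTS)

# Perfect recall but sloppy precision still costs dearly:
print(benchmark_score({"recall": 1.0, "precision": 0.4, "workflow": 1.0,
                       "efficiency": 1.0, "report": 1.0}))  # -> 0.91
```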
---

## Procedural Generation

Each episode generates a unique dataset with new protocol constraints:

```bash
python3 server/dataset_generator.py
```

**Guarantees:**

- Same seed → identical dataset, protocol, and ground truth
- Different seeds → different protocols with different rules
- Deterministic grading: reproducible scores across machines
- Hard mode alternates between `true_bias` and `confounded_no_bias`

**Example validated profile (seed=42):**

- Easy: 300 patients, 8 errors, 13 traps
- Medium: 480 patients, 23 errors, 25 traps
- Hard: 720 patients, 43 errors, 40 traps (incl. 10 comorbidity override traps)
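A reproducibility smoke test is then a few lines. Note that `generate_dataset` is a hypothetical name used for illustration; check `server/dataset_generator.py` for the actual entry point:

```python
# Reproducibility smoke test. generate_dataset is a HYPOTHETICAL name;
# see server/dataset_generator.py for the actual entry point.
from server.dataset_generator import generate_dataset

a = generate_dataset(seed=42, task_id="task_hard")
b = generate_dataset(seed=42, task_id="task_hard")
c = generate_dataset(seed=43, task_id="task_hard")

assert a == b   # same seed -> identical dataset, protocol, ground truth
assert a != c   # different seed -> different rules and planted errors
```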
---

## Quick Start

### 1. Start the Server

```bash
cd server
PYTHONPATH=.. python3 -m uvicorn app:app --host 0.0.0.0 --port 8000
```

### 2. Open the Dashboard

Navigate to [http://localhost:8000](http://localhost:8000) to see the enterprise audit command center. Select an agent and task, then click **Start Audit** to watch the reasoning loop in real time.

### 3. Health Check

```bash
curl -s http://localhost:8000/health
```

### 4. Run Baseline Inference

```bash
# Full comparison (all 3 agents × all 3 tasks)
ENV_BASE_URL=inprocess python3 inference.py --mode all --seed 20260402

# Single-agent mode
python3 inference.py --mode full
```

### 5. OpenEnv Validation

```bash
openenv validate .
```
---

## Docker

```bash
docker build -t clinical-bench:latest .
docker run -p 8000:8000 clinical-bench:latest
```

The container exposes:

- `/health` for health checks
- `/` for the enterprise dashboard
- WebSocket endpoints for OpenEnv `reset()` / `step()` / `state()`
---

## Real-World Relevance

ClinicalBench models tasks that clinical data managers perform daily:

| Real-World Task | ClinicalBench Equivalent |
|:---|:---|
| ICH-E6(R2) protocol compliance review | Age eligibility + treatment window verification |
| FDA 21 CFR Part 11 data integrity audit | Temporal consistency checking |
| DSMB safety signal assessment | Stage-adjusted outcome disparity analysis |
| IRB equity review | Confounder-aware selection bias detection |

This benchmark is immediately useful for evaluating whether an LLM-based agent can be safely deployed in a clinical data management workflow — one of healthcare AI's highest-value, highest-risk applications.
---

## OpenEnv Compliance

- [x] Typed `Action`, `Observation`, `State` models (Pydantic)
- [x] `reset(seed, task_id) → Observation`
- [x] `step(action) → Observation`
- [x] `state → current state`
- [x] `openenv.yaml` with metadata and 3 tasks
- [x] `openenv validate .` passes
- [x] 3 tasks with deterministic graders, scores in `[0.0, 1.0]`
- [x] Dense reward shaping + benchmark rubric
- [x] Reproducible `inference.py` at repo root
- [x] Dockerized with health check
- [x] Inference runtime < 3 minutes
- [x] Runs on 2 vCPU / 8 GB memory
## Project Structure

```
clinical_trial_auditor/
├── openenv.yaml          # OpenEnv manifest with 3 tasks
├── inference.py          # Baseline inference (naive/heuristic/full)
├── client.py             # EnvClient implementation
├── models.py             # Typed Action/Observation/State
├── README.md
├── Dockerfile
├── requirements.txt
├── pyproject.toml
├── docs/
│   └── architecture.md   # Detailed system architecture
└── server/
    ├── app.py            # FastAPI + dashboard API
    ├── clinical_trial_auditor_environment.py
    ├── dataset_generator.py   # Procedural adversarial data engine
    ├── models.py
    ├── requirements.txt
    └── static/
        └── index.html    # Enterprise audit dashboard
```
---

<div align="center">

**Built for the Meta × Scaler School of Technology OpenEnv Hackathon**

### 🧬 Developer Note & Lineage

ClinicalBench is deeply informed by my ongoing research and architecture work on a **SEER (Surveillance, Epidemiology, and End Results)-based oncology project**, active since 2024. The complexities modeled in this benchmark — the Simpson's Paradox confounders, the Stage IV comorbidity overrides, and the sheer noise of real-world Electronic Health Records — are direct reflections of challenges encountered while processing live clinical oncology data.

*Because the hardest thing about AI in healthcare isn't the model — it's knowing when to trust it.* <br>
— **Sumit Saraswat** | GLA University

</div>