ClinicalBench
Agentic Clinical Trial Audit Benchmark
OpenEnv v3
3 Tasks (Easy → Hard) · 300–720 Patients/Episode · Seeded Episodes

Active Episode Protocol

Protocol ID: awaiting reset()
Trial Protocol Excerpt
Start an audit to load the episode-specific protocol. Each episode generates a unique protocol with dynamic rules:
• Age eligibility ranges change per episode
• Treatment scheduling windows vary
• Stage IV exceptions create valid edge cases
• Bias thresholds are protocol-specific
The agent must READ these rules rather than assume defaults.
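As a concrete illustration of that read-the-rules requirement, here is a minimal Python sketch of parsing one episode's age rule and applying a Stage IV exception. The excerpt wording, the regex, and the function names are assumptions for illustration; they are not the actual OpenEnv v3 observation format.

```python
import re

# A sample excerpt in the spirit of the dynamic rules listed above.
# The exact wording is an assumption for illustration only.
protocol_excerpt = (
    "Eligibility: patients aged 45-70 at enrollment. "
    "Exception: Stage IV patients are exempt from the age window. "
    "Scheduling: treatment visits must fall within 14 days of the planned date."
)

def parse_age_rule(text: str) -> tuple[int, int]:
    """Extract THIS episode's age-eligibility window from the protocol text."""
    m = re.search(r"aged (\d+)\s*-\s*(\d+)", text)
    if m is None:
        raise ValueError("protocol excerpt did not state an age range")
    return int(m.group(1)), int(m.group(2))

def violates_age_rule(age: int, stage: str, text: str) -> bool:
    """Check one patient against the parsed rule, honoring the Stage IV exemption."""
    lo, hi = parse_age_rule(text)
    if stage == "IV" and "Stage IV" in text:  # protocol-specific exception
        return False
    return not (lo <= age <= hi)

print(violates_age_rule(72, "II", protocol_excerpt))  # True: outside 45-70
print(violates_age_rule(72, "IV", protocol_excerpt))  # False: exempt edge case
```

Because the age window changes per episode, a hard-coded default (say, 18–65) would be wrong on most episodes, so any agent that skips the parsing step fails systematically.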
Per-task parameters: Difficulty · Patients · Max Steps · Errors
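Collecting those dashboard parameters in one place, an episode configuration might look like the sketch below. The dataclass name, field names, and concrete values are illustrative assumptions; only the task count and the 300–720 patient range come from the page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EpisodeConfig:
    task: str       # "easy" | "medium" | "hard" -- the three tasks above
    patients: int   # 300-720 per episode, per the stats strip
    max_steps: int  # step budget for the agent loop
    errors: int     # number of protocol violations injected
    seed: int       # makes procedural generation reproducible

# Illustrative values only; the real per-task settings are not shown here.
cfg = EpisodeConfig(task="hard", patients=720, max_steps=60, errors=25, seed=7)
```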

Live Agent Telemetry

🔬 Select an agent and task, then click Start Audit to watch the reasoning loop in real time.

The benchmark runs Naive → Heuristic → ReAct LLM agents against procedurally generated clinical trial data.
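To make "the reasoning loop" concrete, below is a minimal ReAct-style audit loop, assuming a text-in/text-out model and a step function that returns an observation and a done flag. Both callables, the "ACTION: flag" convention, and the transcript format are hypothetical stand-ins, not the actual OpenEnv v3 interface.

```python
from typing import Callable

def react_audit(llm: Callable[[str], str],
                env_step: Callable[[str], tuple[str, bool]],
                protocol: str,
                max_steps: int = 30) -> list[str]:
    """Minimal ReAct loop: think, act, observe, repeat until done or budget."""
    transcript = [f"PROTOCOL:\n{protocol}"]
    flags: list[str] = []
    for _ in range(max_steps):
        response = llm("\n".join(transcript))   # thought + action text
        transcript.append(response)
        action = response.splitlines()[-1] if response else ""
        if action.startswith("ACTION: flag"):   # assumed action convention
            flags.append(action.removeprefix("ACTION: flag").strip())
        observation, done = env_step(action)
        transcript.append(f"OBSERVATION: {observation}")
        if done:
            break
    return flags
```

Under this framing, the Naive agent is presumably a single llm call with no environment feedback, and the Heuristic agent presumably replaces llm with hand-written rules; only the ReAct agent closes the observe-think-act loop.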

Evaluation Metrics

Benchmark Score · Precision · Recall · Workflow Efficiency
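Of these, precision and recall have standard definitions over the agent's flags versus the violations actually injected into an episode. A short sketch, assuming flags and ground-truth errors are sets of patient IDs (workflow efficiency is not defined on this panel, so it is omitted):

```python
def precision_recall(flagged: set[str], injected: set[str]) -> tuple[float, float]:
    """Precision/recall over patient IDs: flags the agent raised vs. the
    errors the generator actually injected into the episode."""
    tp = len(flagged & injected)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(injected) if injected else 0.0
    return precision, recall

# "74% wrong flags" in the key finding below is a precision of 0.26,
# regardless of how many true errors were caught.
p, r = precision_recall({"P-01", "P-02", "P-03", "P-04"}, {"P-01", "P-09"})
print(p, r)  # 0.25 0.5
```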
⚡ LLM Capability Gap (Average Score)
Naive 0.12 · Heuristic 0.83 · ReAct LLM 0.50
📊 Per-Task Breakdown
Agent         Easy   Med    Hard   Avg
Naive LLM     0.19   0.16   0.02   0.12
Heuristic     0.98   0.79   0.73   0.83
ReAct 405B    0.77   0.38   0.34   0.50
Key finding: the 405B, the strongest frontier model in the lineup, scores only 0.50 under genuine ReAct evaluation. It drowns in false positives (74% of its flags are wrong), fails calendar math, and misses 2-hop comorbidity traps. No cheating; pure LLM struggle.
OpenEnv Spec v3 · Phase III Oncology · Procedural Generation