ClinicalBench
Agentic Clinical Trial Audit Benchmark
OpenEnv v3
3 Tasks (Easy → Hard) · 300–720 Patients/Episode · Seeded Episodes

Active Episode Protocol

Protocol ID: awaiting reset()
Trial Protocol Excerpt
Start an audit to load the episode-specific protocol. Each episode generates a unique protocol with dynamic rules:
• Age eligibility ranges change per episode
• Treatment scheduling windows vary
• Stage IV exceptions create valid edge cases
• Bias thresholds are protocol-specific
The agent must READ these rules rather than assume defaults.
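As a concrete illustration of that read-the-rules requirement, here is a minimal Python sketch of parsing one episode's age rule and applying a Stage IV exception. The excerpt wording, the regex, and the function names are assumptions for illustration; they are not the actual OpenEnv v3 observation format.

```python
import re

# A sample excerpt in the spirit of the dynamic rules listed above.
# The exact wording is an assumption for illustration only.
protocol_excerpt = (
    "Eligibility: patients aged 45-70 at enrollment. "
    "Exception: Stage IV patients are exempt from the age window. "
    "Scheduling: treatment visits must fall within 14 days of the planned date."
)

def parse_age_rule(text: str) -> tuple[int, int]:
    """Extract THIS episode's age-eligibility window from the protocol text."""
    m = re.search(r"aged (\d+)\s*-\s*(\d+)", text)
    if m is None:
        raise ValueError("protocol excerpt did not state an age range")
    return int(m.group(1)), int(m.group(2))

def violates_age_rule(age: int, stage: str, text: str) -> bool:
    """Check one patient against the parsed rule, honoring the Stage IV exemption."""
    lo, hi = parse_age_rule(text)
    if stage == "IV" and "Stage IV" in text:  # protocol-specific exception
        return False
    return not (lo <= age <= hi)

print(violates_age_rule(72, "II", protocol_excerpt))  # True: outside 45-70
print(violates_age_rule(72, "IV", protocol_excerpt))  # False: exempt edge case
```

Because the age window changes per episode, a hard-coded default (say, 18–65) would be wrong on most episodes, so any agent that skips the parsing step fails systematically.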
Per-task parameters: Difficulty · Patients · Max Steps · Errors
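Collecting those dashboard parameters in one place, an episode configuration might look like the sketch below. The dataclass name, field names, and concrete values are illustrative assumptions; only the task count and the 300–720 patient range come from the page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EpisodeConfig:
    task: str       # "easy" | "medium" | "hard" -- the three tasks above
    patients: int   # 300-720 per episode, per the stats strip
    max_steps: int  # step budget for the agent loop
    errors: int     # number of protocol violations injected
    seed: int       # makes procedural generation reproducible

# Illustrative values only; the real per-task settings are not shown here.
cfg = EpisodeConfig(task="hard", patients=720, max_steps=60, errors=25, seed=7)
```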

Live Agent Telemetry

🔬 Select an agent and task, then click Start Audit to watch the reasoning loop in real time.

The benchmark runs Naive → Heuristic → ReAct LLM agents against procedurally generated clinical trial data.
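To make "the reasoning loop" concrete, below is a minimal ReAct-style audit loop, assuming a text-in/text-out model and a step function that returns an observation and a done flag. Both callables, the "ACTION: flag" convention, and the transcript format are hypothetical stand-ins, not the actual OpenEnv v3 interface.

```python
from typing import Callable

def react_audit(llm: Callable[[str], str],
                env_step: Callable[[str], tuple[str, bool]],
                protocol: str,
                max_steps: int = 30) -> list[str]:
    """Minimal ReAct loop: think, act, observe, repeat until done or budget."""
    transcript = [f"PROTOCOL:\n{protocol}"]
    flags: list[str] = []
    for _ in range(max_steps):
        response = llm("\n".join(transcript))   # thought + action text
        transcript.append(response)
        action = response.splitlines()[-1] if response else ""
        if action.startswith("ACTION: flag"):   # assumed action convention
            flags.append(action.removeprefix("ACTION: flag").strip())
        observation, done = env_step(action)
        transcript.append(f"OBSERVATION: {observation}")
        if done:
            break
    return flags
```

Under this framing, the Naive agent is presumably a single llm call with no environment feedback, and the Heuristic agent presumably replaces llm with hand-written rules; only the ReAct agent closes the observe-think-act loop.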

Evaluation Metrics

Benchmark Score · Precision · Recall · Workflow Efficiency
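Of these, precision and recall have standard definitions over the agent's flags versus the violations actually injected into an episode. A short sketch, assuming flags and ground-truth errors are sets of patient IDs (workflow efficiency is not defined on this panel, so it is omitted):

```python
def precision_recall(flagged: set[str], injected: set[str]) -> tuple[float, float]:
    """Precision/recall over patient IDs: flags the agent raised vs. the
    errors the generator actually injected into the episode."""
    tp = len(flagged & injected)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(injected) if injected else 0.0
    return precision, recall

# "74% wrong flags" in the key finding below is a precision of 0.26,
# regardless of how many true errors were caught.
p, r = precision_recall({"P-01", "P-02", "P-03", "P-04"}, {"P-01", "P-09"})
print(p, r)  # 0.25 0.5
```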
⚡ LLM Capability Gap (Average Score)
Naive 0.12 · Heuristic 0.83 · ReAct LLM 0.50
📊 Per-Task Breakdown
Agent         Easy   Med    Hard   Avg
Naive LLM     0.19   0.16   0.02   0.12
Heuristic     0.98   0.79   0.73   0.83
ReAct 405B    0.77   0.38   0.34   0.50
Key finding: the 405B, the strongest frontier model in the lineup, scores only 0.50 under genuine ReAct evaluation. It drowns in false positives (74% of its flags are wrong), fails calendar math, and misses 2-hop comorbidity traps. No cheating; pure LLM struggle.
OpenEnv Spec v3 · Phase III Oncology · Procedural Generation