Active Episode Protocol
Protocol ID
Awaiting reset()
Trial Protocol Excerpt
Start an audit to load the episode-specific protocol.
Each episode generates a unique protocol with dynamic rules:
• Age eligibility ranges change per episode
• Treatment scheduling windows vary
• Stage IV exceptions create valid edge cases
• Bias thresholds are protocol-specific
The agent must read these rules from the episode's protocol rather than assume defaults.
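As a rough illustration of reading rules instead of assuming them, the sketch below checks age eligibility against values taken from a per-episode protocol object. The `Protocol` class, its field names, the patient dictionary keys, and the idea that Stage IV patients are exempt from the age limit are all hypothetical; the actual protocol schema and the exact form of the Stage IV exception are not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    """Hypothetical per-episode protocol; field names are assumptions."""
    min_age: int
    max_age: int
    stage_iv_exempt: bool  # assumed form of the Stage IV exception


def age_violation(patient: dict, protocol: Protocol) -> bool:
    """Return True when the patient's age breaks this episode's rule.

    The bounds come from the protocol passed in, never from a
    hard-coded default such as 18-65.
    """
    # Assumed exception: Stage IV patients are not flagged on age.
    if protocol.stage_iv_exempt and patient.get("stage") == "IV":
        return False
    return not (protocol.min_age <= patient["age"] <= protocol.max_age)


# Two episodes with different rules give different verdicts
# for the same 70-year-old Stage II patient.
episode_a = Protocol(min_age=18, max_age=65, stage_iv_exempt=True)
episode_b = Protocol(min_age=21, max_age=80, stage_iv_exempt=False)
patient = {"age": 70, "stage": "II"}
verdict_a = age_violation(patient, episode_a)
verdict_b = age_violation(patient, episode_b)
```

An agent that hard-codes one age range would get at most one of these two episodes right, which is the failure mode the dynamic rules are meant to expose.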
Live Agent Telemetry
Select an agent and task, then click Start Audit
to watch the reasoning loop in real time.
The benchmark runs Naive → Heuristic → ReAct LLM agents
against procedurally generated clinical trial data.
Evaluation Metrics
Precision
—
Recall
—
Workflow
—
Efficiency
—
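For reference, precision and recall over flagged records can be computed as in the sketch below. It uses the standard definitions and assumes flags and ground-truth errors are identified by record IDs; the benchmark's own scoring code, and its Workflow and Efficiency metrics, are not described on this page and are not modeled here. The function name and the sample IDs are illustrative.

```python
def precision_recall(flagged, true_errors):
    """Standard precision and recall over sets of record IDs.

    flagged:     IDs the agent marked as protocol violations
    true_errors: IDs that actually violate the protocol
    """
    flagged, true_errors = set(flagged), set(true_errors)
    true_positives = len(flagged & true_errors)
    # Guard against division by zero when nothing was flagged
    # or when the episode contains no real errors.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(true_errors) if true_errors else 0.0
    return precision, recall


# Illustrative IDs only: four flags, one of them correct,
# and one real error that was missed.
precision, recall = precision_recall(
    flagged=["P1", "P2", "P3", "P4"],
    true_errors=["P1", "P5"],
)
```

Under these definitions, an agent whose flags are mostly wrong has low precision even if its recall is acceptable.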
⚡ LLM Capability Gap (Average Score)
📊 Per-Task Breakdown
| Agent | Easy | Medium | Hard | Avg |
|---|---|---|---|---|
| Naive LLM | 0.19 | 0.16 | 0.02 | 0.12 |
| Heuristic | 0.98 | 0.79 | 0.73 | 0.83 |
| ReAct 405B | 0.77 | 0.38 | 0.34 | 0.50 |
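Key finding: The 405B model, presented here as the most capable frontier model, averages only 0.50 under genuine ReAct evaluation, well below the Heuristic agent's 0.83. It drowns in false positives (74% of its flags are wrong), fails calendar math, and misses 2-hop comorbidity traps. The score reflects unassisted LLM reasoning, with no shortcuts.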
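The Avg column in the table above is consistent with an unweighted mean of the three difficulty scores, rounded to two decimals. The short check below reproduces it; the unweighted-mean assumption is inferred from the numbers and is not stated by the benchmark.

```python
# Per-task scores copied from the Per-Task Breakdown table.
scores = {
    "Naive LLM": (0.19, 0.16, 0.02),
    "Heuristic": (0.98, 0.79, 0.73),
    "ReAct 405B": (0.77, 0.38, 0.34),
}


def average_score(task_scores):
    """Unweighted mean of the task scores, rounded to two decimals."""
    return round(sum(task_scores) / len(task_scores), 2)


averages = {agent: average_score(s) for agent, s in scores.items()}
```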