Sumit Saraswat committed on
Commit ·
bfa8604
1
Parent(s): 4b5cda3
feat: added Enterprise UI dashboard, ReAct reasoning traces, and health endpoints
- Dockerfile +1 -1
- README.md +278 -165
- docs/architecture.md +126 -0
- inference.py +54 -14
- requirements.txt +2 -1
- server/app.py +513 -0
- server/static/index.html +818 -0
Dockerfile
CHANGED
|
@@ -9,7 +9,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf
|
|
| 9 |
COPY requirements.txt .
|
| 10 |
RUN pip install --no-cache-dir -r requirements.txt
|
| 11 |
|
| 12 |
-
# Copy all
|
| 13 |
COPY . .
|
| 14 |
|
| 15 |
EXPOSE 8000
|
|
|
|
| 9 |
COPY requirements.txt .
|
| 10 |
RUN pip install --no-cache-dir -r requirements.txt
|
| 11 |
|
| 12 |
+
# Copy all project files
|
| 13 |
COPY . .
|
| 14 |
|
| 15 |
EXPOSE 8000
|
README.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: green
|
| 6 |
sdk: docker
|
|
@@ -10,259 +10,372 @@ tags:
|
|
| 10 |
- openenv
|
| 11 |
---
|
| 12 |
|
|
|
|
| 13 |
|
| 14 |
-
#
|
| 15 |
|
| 16 |
-
|
| 17 |
|
| 18 |
-
|
|
|
|
|
|
|
|
|
|
| 19 |
|
| 20 |
-
|
| 21 |
|
| 22 |
-
|
| 23 |
-
- eligibility criteria vary by protocol,
|
| 24 |
-
- timeline rules include exceptions,
|
| 25 |
-
- suspicious subgroup outcomes are not always evidence of bias,
|
| 26 |
-
- false positives waste reviewer time and can trigger unnecessary escalations.
|
| 27 |
|
| 28 |
-
|
| 29 |
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
| 33 |
-
- Cross-modal audit logic: the agent must apply text rules from the protocol to tabular patient data.
|
| 34 |
-
- Stage-aware timing exceptions: Stage IV patients can have a longer enrollment-to-treatment window, which creates valid edge cases that trap shortcut heuristics.
|
| 35 |
-
- Hallucination traps: hard episodes can contain a confounded high-risk cohort that looks biased overall but is not actionable after stage-adjusted review.
|
| 36 |
-
- Dense reward plus benchmark rubric: step rewards encourage learning, while `score_so_far` tracks a judge-facing episode rubric emphasizing recall, precision, workflow discipline, efficiency, and report quality.
|
| 37 |
|
| 38 |
-
|
| 39 |
|
| 40 |
-
|
| 41 |
-
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
- `openenv.yaml` at the repo root.
|
| 46 |
|
| 47 |
-
|
| 48 |
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
|
| 53 |
-
|
|
|
|
|
|
|
| 54 |
|
| 55 |
-
```text
|
| 56 |
-
[OK] : Ready for multi-mode deployment
|
| 57 |
```
|
| 58 |
|
| 59 |
## Task Suite
|
| 60 |
|
| 61 |
### Task 1: `task_easy` - Dynamic Eligibility Screening
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
-
|
| 65 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 66 |
|
| 67 |
### Task 2: `task_medium` - Protocol Timeline Audit
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
-
|
| 71 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 72 |
|
| 73 |
### Task 3: `task_hard` - Equity + Protocol Audit
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
-
|
| 77 |
-
|
| 78 |
|
| 79 |
## Action Space
|
| 80 |
|
| 81 |
```python
|
| 82 |
class AuditAction(Action):
|
| 83 |
-
action_type: str
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
|
|
|
|
|
|
| 88 |
proposed_value: Optional[str]
|
| 89 |
-
report: Optional[str]
|
| 90 |
-
confidence: Optional[float]
|
| 91 |
```
|
| 92 |
|
| 93 |
## Observation Space
|
| 94 |
|
| 95 |
```python
|
| 96 |
class AuditObservation(Observation):
|
| 97 |
-
done: bool
|
| 98 |
-
reward: float
|
| 99 |
-
task_id: str
|
| 100 |
-
task_type: str
|
| 101 |
-
task_description: str
|
| 102 |
-
protocol_title: str
|
| 103 |
-
trial_protocol_excerpt: str
|
| 104 |
-
dataset: list[dict]
|
| 105 |
-
errors_found: list[str]
|
| 106 |
-
patterns_investigated: list[str]
|
| 107 |
-
distributions_computed: list[str]
|
| 108 |
-
feedback: str
|
| 109 |
-
score_so_far: float
|
| 110 |
-
dense_reward_total: float
|
| 111 |
-
score_breakdown: dict[str, float]
|
| 112 |
-
attempts_remaining: int
|
| 113 |
-
phase: str
|
| 114 |
```
|
| 115 |
|
| 116 |
-
|
| 117 |
|
| 118 |
-
|
| 119 |
|
| 120 |
-
|
| 121 |
-
- correct flags,
|
| 122 |
-
- false-positive penalties,
|
| 123 |
-
- duplicate penalties,
|
| 124 |
-
- investigation/distribution bonuses,
|
| 125 |
-
- confidence penalties for overconfident wrong flags,
|
| 126 |
-
- per-step costs.
|
| 127 |
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
|
|
|
| 134 |
|
| 135 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 136 |
|
| 137 |
-
|
| 138 |
|
| 139 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 140 |
|
| 141 |
```bash
|
| 142 |
python3 server/dataset_generator.py
|
| 143 |
```
|
| 144 |
|
| 145 |
-
|
| 146 |
-
-
|
| 147 |
-
-
|
| 148 |
-
-
|
| 149 |
-
-
|
| 150 |
-
|
| 151 |
-
Example validated seeded profile:
|
| 152 |
-
|
| 153 |
-
- Easy: `300` patients, `8` record-level errors, `13` traps
|
| 154 |
-
- Medium: `480` patients, `23` record-level errors, `25` traps
|
| 155 |
-
- Hard: `720` patients, `34` total issues including protocol/timing/bias logic, `40` traps
|
| 156 |
-
|
| 157 |
-
## Baseline Inference (`inference.py`)
|
| 158 |
|
| 159 |
-
|
|
|
|
|
|
|
|
|
|
| 160 |
|
| 161 |
-
-
|
| 162 |
-
- `heuristic`: rule-based but trap-prone
|
| 163 |
-
- `full`: protocol parser + stage-aware detectors + structured reporting
|
| 164 |
-
- `all`: side-by-side comparison
|
| 165 |
|
| 166 |
-
|
| 167 |
|
| 168 |
-
|
| 169 |
-
python3 inference.py --mode all
|
| 170 |
-
```
|
| 171 |
-
|
| 172 |
-
Isolated local validation mode with no socket bind:
|
| 173 |
|
| 174 |
```bash
|
| 175 |
-
|
|
|
|
| 176 |
```
|
| 177 |
|
| 178 |
-
|
| 179 |
-
- When `OPENAI_API_KEY` or `HF_TOKEN` is present, naive mode and report generation use the OpenAI-compatible client pointed at `API_BASE_URL`.
|
| 180 |
-
- Without a key, the script falls back to deterministic local behavior so validation still runs end-to-end.
|
| 181 |
|
| 182 |
-
|
| 183 |
|
| 184 |
-
|
| 185 |
|
| 186 |
```bash
|
| 187 |
-
|
| 188 |
```
|
| 189 |
|
| 190 |
-
|
| 191 |
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
| Heuristic | 0.81 | 0.56 | 0.45 | 0.60 |
|
| 196 |
-
| Full | 0.98 | 0.99 | 0.99 | 0.99 |
|
| 197 |
-
|
| 198 |
-
This is the intended story:
|
| 199 |
-
- naive agents underperform badly,
|
| 200 |
-
- shallow heuristics get trapped by dynamic protocol edges and confounded bias signals,
|
| 201 |
-
- protocol-aware agents perform strongly.
|
| 202 |
|
| 203 |
-
#
|
|
|
|
|
|
|
| 204 |
|
| 205 |
-
###
|
| 206 |
|
| 207 |
```bash
|
| 208 |
-
|
| 209 |
-
PYTHONPATH=.. python3 -m uvicorn app:app --host 0.0.0.0 --port 8000
|
| 210 |
```
|
| 211 |
|
| 212 |
-
|
|
|
|
|
|
|
| 213 |
|
| 214 |
```bash
|
| 215 |
-
|
|
|
|
| 216 |
```
|
| 217 |
|
| 218 |
-
|
|
|
|
|
|
|
|
|
|
| 219 |
|
| 220 |
-
|
| 221 |
-
cd ..
|
| 222 |
-
python3 inference.py --mode all
|
| 223 |
-
```
|
| 224 |
|
| 225 |
-
##
|
| 226 |
|
| 227 |
-
|
| 228 |
|
| 229 |
-
|
| 230 |
-
|
| 231 |
-
|
| 232 |
-
|
| 233 |
-
|
|
|
|
| 234 |
|
| 235 |
-
|
| 236 |
|
| 237 |
-
|
| 238 |
|
| 239 |
-
|
| 240 |
-
|
| 241 |
-
- [x] `
|
| 242 |
-
- [x]
|
| 243 |
-
- [x]
|
| 244 |
-
- [x]
|
| 245 |
-
- [x]
|
| 246 |
- [x] `openenv validate .` passes
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 247 |
|
| 248 |
## Project Structure
|
| 249 |
|
| 250 |
-
```
|
| 251 |
clinical_trial_auditor/
|
| 252 |
-
βββ openenv.yaml
|
| 253 |
-
βββ inference.py
|
| 254 |
-
βββ client.py
|
| 255 |
-
βββ models.py
|
| 256 |
βββ README.md
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 257 |
βββ server/
|
| 258 |
-
βββ app.py
|
| 259 |
βββ clinical_trial_auditor_environment.py
|
| 260 |
-
βββ dataset_generator.py
|
| 261 |
βββ models.py
|
| 262 |
βββ requirements.txt
|
| 263 |
-
βββ
|
|
|
|
| 264 |
```
|
| 265 |
|
| 266 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 267 |
|
| 268 |
-
|
|
|
|
| 1 |
---
|
| 2 |
+
title: ClinicalBench
|
| 3 |
+
emoji: 🔬
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: green
|
| 6 |
sdk: docker
|
|
|
|
| 10 |
- openenv
|
| 11 |
---
|
| 12 |
|
| 13 |
+
<div align="center">
|
| 14 |
|
| 15 |
+
# 🔬 ClinicalBench
|
| 16 |
|
| 17 |
+
### A Benchmark for Evaluating Agentic Reasoning in Safety-Critical Clinical Workflows
|
| 18 |
|
| 19 |
+
[](https://github.com/meta-pytorch/OpenEnv)
|
| 20 |
+
[](https://huggingface.co/spaces/Timusgeorge/clinical_trial_auditor)
|
| 21 |
+
[](#docker)
|
| 22 |
+
[](LICENSE)
|
| 23 |
|
| 24 |
+
**Modern AI systems fail silently in high-stakes domains like clinical trials because they cannot reason about protocol constraints, temporal causality, and fairness simultaneously. ClinicalBench is an OpenEnv benchmark that exposes these failure modes.**
|
| 25 |
|
| 26 |
+
[Live Demo](https://huggingface.co/spaces/Timusgeorge/clinical_trial_auditor) · [Architecture](#architecture) · [Results](#benchmark-results) · [Quick Start](#quick-start)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 27 |
|
| 28 |
+
</div>
|
| 29 |
|
| 30 |
+
---
|
| 31 |
|
| 32 |
+
## The Problem
|
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
|
| 34 |
+
Clinical data auditing is one of medicine's most consequential workflows. A single undetected protocol violation can invalidate years of trial data, delay drug approvals, and, in the worst cases, put patients at risk. Today's AI systems fail at this task in three specific ways:
|
| 35 |
|
| 36 |
+
| Failure Mode | What Happens | Why It Matters |
|
| 37 |
+
|:---|:---|:---|
|
| 38 |
+
| **Overflagging** | LLMs flag valid edge cases (e.g., Stage IV patients with extended treatment windows) as violations | False alarms waste reviewer time and erode trust in AI-assisted auditing |
|
| 39 |
+
| **Temporal Confusion** | Models miss impossible date orderings (death before treatment) while fixating on superficial anomalies | Critical safety signals go undetected |
|
| 40 |
+
| **Bias Misinterpretation** | Models detect demographic skew in raw statistics but cannot distinguish genuine selection bias from confounded high-risk cohorts | Naive bias detection causes incorrect escalations or dangerous dismissals |
|
|
|
|
| 41 |
|
| 42 |
+
ClinicalBench is designed to evaluate and train agents that can overcome all three failure modes simultaneously.
|
| 43 |
|
| 44 |
+
---
|
| 45 |
+
|
| 46 |
+
## Why ClinicalBench Exists
|
| 47 |
+
|
| 48 |
+
Existing RL benchmarks for agents fall into two categories: **game-like environments** (code golf, math puzzles) where memorization helps, and **static dataset tasks** (classification, extraction) where the answer is fixed. Neither captures the reality of clinical auditing, where:
|
| 49 |
+
|
| 50 |
+
- **Rules change every episode**: eligibility criteria, timing windows, and bias thresholds are protocol-specific
|
| 51 |
+
- **Edge cases are not errors**: Stage IV patients legitimately have longer treatment windows
|
| 52 |
+
- **Statistics lie without context**: a minority group's higher mortality rate may reflect disease severity, not unfair sampling
|
| 53 |
+
- **The step budget is limited**: agents must prioritize which patients and which patterns to investigate
|
| 54 |
+
|
| 55 |
+
ClinicalBench fills this gap by generating a new procedural dataset and protocol for every `reset()`, forcing agents to **read and reason** rather than memorize.
|
| 56 |
|
| 57 |
+
---
|
| 58 |
+
|
| 59 |
+
## Architecture
|
| 60 |
|
|
|
|
|
|
|
| 61 |
```
┌─────────────────────────────────────────────────────────────────┐
│                   ClinicalBench Architecture                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  reset(seed, task_id)                                           │
│         │                                                       │
│         ▼                                                       │
│  ┌──────────────────────┐    ┌─────────────────────────────┐    │
│  │  Procedural Dataset  │───▶│  Episode-Specific Protocol  │    │
│  │  Generator           │    │  Excerpt                    │    │
│  │  • 300-720 patients  │    │  • Dynamic age range        │    │
│  │  • Seeded RNG        │    │  • Variable timing windows  │    │
│  │  • Adversarial traps │    │  • Stage IV exceptions      │    │
│  │  • Hidden confounders│    │  • Bias thresholds          │    │
│  └──────────────────────┘    └─────────────────────────────┘    │
│         │                              │                        │
│         ▼                              ▼                        │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                 Agent Interaction Loop                  │    │
│  │  Thought → Tool → Observation → Flag → Report           │    │
│  ├─────────────────────────────────────────────────────────┤    │
│  │  investigate_pattern(var)  → distribution summary       │    │
│  │  compute_distribution(var) → cohort breakdown           │    │
│  │  flag_error(patient, type) → correct/false positive     │    │
│  │  submit_report(text)       → quality score              │    │
│  └─────────────────────────────────────────────────────────┘    │
│         │                                                       │
│         ▼                                                       │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                Multi-Dimensional Grading                │    │
│  │  Recall (70%) + Precision (15%) + Workflow (5%)         │    │
│  │  + Efficiency (5%) + Report Quality (5%)                │    │
│  │  Dense step rewards + episode benchmark score           │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
|
| 99 |
+
|
| 100 |
+
### Key Design Decisions
|
| 101 |
+
|
| 102 |
+
1. **Procedural Generation**: Each `reset()` samples a new protocol with different age ranges, timing windows, and bias thresholds using a seeded random process. Episodes generated from different seeds get different rules, which prevents memorization.
|
| 103 |
+
|
| 104 |
+
2. **Adversarial Traps**: Valid edge cases (boundary ages, near-window delays, valid Stage IV exceptions) are deliberately injected to punish agents that use naive threshold-based heuristics.
|
| 105 |
+
|
| 106 |
+
3. **Confounder-Aware Bias**: Hard episodes may contain either genuine selection bias OR a confounded high-risk cohort. The confounder (a high-risk outreach site with more late-stage patients) creates an overall mortality gap that disappears after stage-stratified analysis. Agents must perform this adjustment before flagging.
|
| 107 |
+
|
| 108 |
+
4. **Phase-Gated Workflow**: Agents must investigate variables before flagging errors, and compute distributions before claiming bias. Skipping phases is penalized, encouraging structured reasoning over guessing.
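
The stage adjustment described in item 3 can be illustrated with a minimal sketch. The record fields (`group`, `stage`, `died`), the function names, and the equal-weight averaging over strata are assumptions made for illustration; the benchmark's actual algorithm may differ.

```python
from collections import defaultdict


def mortality_rate(records):
    """Fraction of records with died=True (0.0 for an empty list)."""
    return sum(r["died"] for r in records) / len(records) if records else 0.0


def overall_gap(records, group_a, group_b):
    """Raw mortality difference between two groups, ignoring stage."""
    a = [r for r in records if r["group"] == group_a]
    b = [r for r in records if r["group"] == group_b]
    return mortality_rate(a) - mortality_rate(b)


def stage_adjusted_gap(records, group_a, group_b):
    """Average the per-stage gap over stages in which both groups appear."""
    by_stage = defaultdict(list)
    for r in records:
        by_stage[r["stage"]].append(r)
    gaps = []
    for stage_records in by_stage.values():
        a = [r for r in stage_records if r["group"] == group_a]
        b = [r for r in stage_records if r["group"] == group_b]
        if a and b:
            gaps.append(mortality_rate(a) - mortality_rate(b))
    return sum(gaps) / len(gaps) if gaps else 0.0


def make(group, stage, deaths, total):
    """Build `total` toy records, the first `deaths` of which died."""
    return [{"group": group, "stage": stage, "died": i < deaths} for i in range(total)]


# Confounded toy cohort: group A has more Stage IV patients, but within each
# stage both groups die at the same rate (80% in Stage IV, 20% in Stage II).
records = (
    make("A", "IV", 8, 10) + make("A", "II", 1, 5)
    + make("B", "IV", 4, 5) + make("B", "II", 2, 10)
)

print(round(overall_gap(records, "A", "B"), 3))         # raw gap looks large
print(round(stage_adjusted_gap(records, "A", "B"), 3))  # gap vanishes per stage
```

In this toy cohort the raw gap comes entirely from the stage mix, which is the situation a confounded episode is meant to reproduce.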
|
| 109 |
+
|
| 110 |
+
---
|
| 111 |
|
| 112 |
## Task Suite
|
| 113 |
|
| 114 |
### Task 1: `task_easy` - Dynamic Eligibility Screening
|
| 115 |
+
|
| 116 |
+
| Property | Value |
|
| 117 |
+
|:---|:---|
|
| 118 |
+
| Dataset | ~300 patients |
|
| 119 |
+
| Error types | `invalid_age` |
|
| 120 |
+
| Difficulty source | Age bounds are episode-specific (e.g., 35-75, 45-85), not fixed at 18-120 |
|
| 121 |
+
| Traps | Valid boundary ages at exact protocol limits |
|
| 122 |
+
| Step budget | 18 |
|
| 123 |
|
| 124 |
### Task 2: `task_medium` - Protocol Timeline Audit
|
| 125 |
+
|
| 126 |
+
| Property | Value |
|
| 127 |
+
|:---|:---|
|
| 128 |
+
| Dataset | ~480 patients |
|
| 129 |
+
| Error types | `invalid_age`, `temporal_inconsistency`, `protocol_window_violation` |
|
| 130 |
+
| Difficulty source | Treatment-start window is protocol-specific; Stage IV has a longer valid window |
|
| 131 |
+
| Traps | Near-boundary delays, valid Stage IV exceptions, near-immediate valid deaths |
|
| 132 |
+
| Step budget | 34 |
|
| 133 |
|
| 134 |
### Task 3: `task_hard` - Equity + Protocol Audit
|
| 135 |
+
|
| 136 |
+
| Property | Value |
|
| 137 |
+
|:---|:---|
|
| 138 |
+
| Dataset | ~720 patients |
|
| 139 |
+
| Error types | `invalid_age`, `temporal_inconsistency`, `protocol_window_violation`, `selection_bias` |
|
| 140 |
+
| Difficulty source | Some episodes have genuine bias; others have a confounded high-risk cohort that only looks biased before stage adjustment |
|
| 141 |
+
| Traps | Treatment-arm skew, high-risk outreach sites, false-positive bias patterns |
|
| 142 |
+
| Step budget | 46 |
|
| 143 |
+
|
| 144 |
+
---
|
| 145 |
+
|
| 146 |
+
## Why ClinicalBench Is Hard
|
| 147 |
+
|
| 148 |
+
This benchmark is designed to expose fundamental limitations in current AI systems:
|
| 149 |
+
|
| 150 |
+
| Challenge | Why It Breaks Naive Agents |
|
| 151 |
+
|:---|:---|
|
| 152 |
+
| **Dynamic protocols** | Rules embedded in natural language change every episode, so hardcoded thresholds fail |
|
| 153 |
+
| **Non-linear constraints** | Stage IV exception creates a conditional rule that requires cross-referencing two fields |
|
| 154 |
+
| **Conflicting signals** | High-risk sites inflate mortality for minorities, but the cause is disease severity, not sampling bias |
|
| 155 |
+
| **Limited step budget** | Agents cannot check every patient; they must prioritize investigations and triage efficiently |
|
| 156 |
+
| **Phased workflow** | Flagging before investigating is blocked and penalized, which forces structured reasoning |
|
| 157 |
+
| **Overconfidence penalty** | High-confidence wrong flags are penalized 1.8×, which discourages guessing |
|
| 158 |
+
|
| 159 |
+
---
|
| 160 |
+
|
| 161 |
+
## Benchmark Results
|
| 162 |
+
|
| 163 |
+
Reproducible baseline scores (`seed=20260402`):
|
| 164 |
+
|
| 165 |
+
| Agent | Easy | Medium | Hard | Average | Precision | Description |
|
| 166 |
+
|:---|:---:|:---:|:---:|:---:|:---:|:---|
|
| 167 |
+
| **Naive LLM** | 0.19 | 0.06 | 0.06 | **0.10** | 5% | Raw prompt + small sample, no structured reasoning |
|
| 168 |
+
| **Heuristic** | 0.81 | 0.56 | 0.45 | **0.60** | 61% | Parses rules but ignores Stage IV exceptions, uses overall (not stage-adjusted) bias |
|
| 169 |
+
| **Reasoning Agent** | 0.97 | 0.97 | 0.98 | **0.98** | 100% | Full protocol parsing + stage-aware detectors + structured workflow |
|
| 170 |
+
|
| 171 |
+
**The 88-point gap** between the naive LLM (0.10) and the tool-augmented reasoning agent (0.98) demonstrates the necessity of structured protocol comprehension and staged investigation. The heuristic agent's mediocre performance (0.60) shows that even rule-based approaches fail when they don't account for conditional exceptions and confounded statistics.
|
| 172 |
+
|
| 173 |
+
### What This Tells Us
|
| 174 |
+
|
| 175 |
+
- **Language understanding alone is insufficient**: the naive LLM reads the protocol but cannot systematically apply it across hundreds of records
|
| 176 |
+
- **Heuristics miss conditional logic**: ignoring the Stage IV exception and using raw (not stage-adjusted) mortality gaps causes cascading false positives and missed real violations
|
| 177 |
+
- **Structured reasoning closes the gap**: the reasoning agent's workflow (parse protocol → investigate → flag → verify → report) achieves near-perfect scores by respecting the environment's phase constraints
|
| 178 |
+
|
| 179 |
+
---
|
| 180 |
|
| 181 |
## Action Space
|
| 182 |
|
| 183 |
```python
|
| 184 |
class AuditAction(Action):
|
| 185 |
+
action_type: str # investigate_pattern | compute_distribution |
|
| 186 |
+
# flag_error | propose_fix | submit_report
|
| 187 |
+
variable: Optional[str] # Field to investigate or compute
|
| 188 |
+
patient_id: Optional[str] # Patient to flag
|
| 189 |
+
error_type: Optional[str] # invalid_age | temporal_inconsistency |
|
| 190 |
+
# protocol_window_violation | selection_bias
|
| 191 |
+
reason: Optional[str] # Justification text
|
| 192 |
proposed_value: Optional[str]
|
| 193 |
+
report: Optional[str] # Final audit report
|
| 194 |
+
confidence: Optional[float] # 0.0-1.0 confidence in the flag
|
| 195 |
```
|
| 196 |
|
| 197 |
## Observation Space
|
| 198 |
|
| 199 |
```python
|
| 200 |
class AuditObservation(Observation):
|
| 201 |
+
done: bool # Episode finished?
|
| 202 |
+
reward: float # Dense step reward
|
| 203 |
+
task_id: str # task_easy | task_medium | task_hard
|
| 204 |
+
task_type: str # Audit category
|
| 205 |
+
task_description: str # Task instructions
|
| 206 |
+
protocol_title: str # Episode protocol ID
|
| 207 |
+
trial_protocol_excerpt: str # Natural language protocol rules
|
| 208 |
+
dataset: list[dict] # Full patient records
|
| 209 |
+
errors_found: list[str] # Correctly flagged patients
|
| 210 |
+
patterns_investigated: list[str] # Variables investigated
|
| 211 |
+
distributions_computed: list[str] # Distributions computed
|
| 212 |
+
feedback: str # Step-by-step feedback
|
| 213 |
+
score_so_far: float # Current benchmark score [0, 1]
|
| 214 |
+
dense_reward_total: float # Cumulative dense reward
|
| 215 |
+
score_breakdown: dict[str, float] # {recall, precision, workflow, efficiency, report}
|
| 216 |
+
attempts_remaining: int # Steps left in budget
|
| 217 |
+
phase: str # investigation | flagging
|
| 218 |
```
|
| 219 |
|
| 220 |
+
---
|
| 221 |
|
| 222 |
+
## Reward Design
|
| 223 |
|
| 224 |
+
ClinicalBench uses **two scoring layers** to separate RL training signal from benchmark evaluation:
|
| 225 |
|
| 226 |
+
### Dense Step Reward (for RL training)
|
| 227 |
+
- **Correct flag**: +0.16
|
| 228 |
+
- **False positive**: −0.26 (asymmetric to penalize guessing)
|
| 229 |
+
- **Duplicate flag**: −0.08
|
| 230 |
+
- **New investigation**: +0.04
|
| 231 |
+
- **Overconfident wrong flag**: reward × −1.8
|
| 232 |
+
- **Per-step cost**: −0.004 × step_count (increasing pressure)
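
A rough sketch of how these terms might combine into a single step reward follows. The numeric constants come from the list above; everything else is an assumption made for illustration: the `step_reward` function and its outcome labels are hypothetical, the overconfidence rule is read here as "scale the false-positive penalty by 1.8 when the stated confidence is high", and the 0.8 confidence threshold is invented.

```python
# Constants taken from the reward list above.
CORRECT_FLAG = 0.16
FALSE_POSITIVE = -0.26
DUPLICATE_FLAG = -0.08
NEW_INVESTIGATION = 0.04
OVERCONFIDENCE_MULTIPLIER = 1.8
COST_PER_STEP = 0.004

# Hypothetical: the confidence level that counts as "overconfident".
CONFIDENCE_THRESHOLD = 0.8

BASE_REWARDS = {
    "correct_flag": CORRECT_FLAG,
    "false_positive": FALSE_POSITIVE,
    "duplicate_flag": DUPLICATE_FLAG,
    "new_investigation": NEW_INVESTIGATION,
}


def step_reward(outcome, step_count, confidence=None):
    """Base reward for the outcome, minus a cost that grows with step_count."""
    reward = BASE_REWARDS[outcome]
    overconfident = confidence is not None and confidence >= CONFIDENCE_THRESHOLD
    if outcome == "false_positive" and overconfident:
        # Assumed interpretation: a confident wrong flag costs 1.8x as much.
        reward *= OVERCONFIDENCE_MULTIPLIER
    return reward - COST_PER_STEP * step_count


print(round(step_reward("correct_flag", step_count=1), 3))
print(round(step_reward("false_positive", step_count=10, confidence=0.95), 3))
```

Under these assumptions the same wrong flag costs more later in the episode and more again when it is asserted with high confidence.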
|
| 233 |
|
| 234 |
+
### Episode Benchmark Score (for evaluation)
|
| 235 |
+
| Component | Weight | Signal |
|
| 236 |
+
|:---|:---:|:---|
|
| 237 |
+
| Recall | 70% | What fraction of real errors were caught? |
|
| 238 |
+
| Precision | 15% | How many flags were correct? |
|
| 239 |
+
| Workflow Discipline | 5% | Did the agent investigate before flagging? |
|
| 240 |
+
| Efficiency | 5% | Ratio of useful actions to total actions |
|
| 241 |
+
| Report Quality | 5% | Does the report cite protocol, root cause, risk, corrective action, fairness? |
|
| 242 |
|
| 243 |
+
This separation keeps the RL signal dense (partial progress on every step) while preventing early score saturation from hiding later mistakes.
|
| 244 |
|
| 245 |
+
---
|
| 246 |
+
|
| 247 |
+
## Procedural Generation
|
| 248 |
+
|
| 249 |
+
Each episode generates a unique dataset with new protocol constraints:
|
| 250 |
|
| 251 |
```bash
|
| 252 |
python3 server/dataset_generator.py
|
| 253 |
```
|
| 254 |
|
| 255 |
+
**Guarantees:**
|
| 256 |
+
- Same seed → identical dataset, protocol, and ground truth
|
| 257 |
+
- Different seeds → different protocols with different rules
|
| 258 |
+
- Deterministic grading: reproducible scores across machines
|
| 259 |
+
- Hard mode alternates between `true_bias` and `confounded_no_bias`
|
| 260 |
|
| 261 |
+
**Example validated profile (seed=42):**
|
| 262 |
+
- Easy: 300 patients, 8 errors, 13 traps
|
| 263 |
+
- Medium: 480 patients, 23 errors, 25 traps
|
| 264 |
+
- Hard: 720 patients, 34 errors, 40 traps
|
| 265 |
|
| 266 |
+
---
|
|
|
|
|
|
|
|
|
|
| 267 |
|
| 268 |
+
## Quick Start
|
| 269 |
|
| 270 |
+
### 1. Start the Server
|
|
|
|
|
|
|
|
|
|
|
|
|
| 271 |
|
| 272 |
```bash
|
| 273 |
+
cd server
|
| 274 |
+
PYTHONPATH=.. python3 -m uvicorn app:app --host 0.0.0.0 --port 8000
|
| 275 |
```
|
| 276 |
|
| 277 |
+
### 2. Open the Dashboard
|
|
|
|
|
|
|
| 278 |
|
| 279 |
+
Navigate to [http://localhost:8000](http://localhost:8000) to see the enterprise audit command center. Select an agent and task, then click **Start Audit** to watch the reasoning loop in real time.
|
| 280 |
|
| 281 |
+
### 3. Health Check
|
| 282 |
|
| 283 |
```bash
|
| 284 |
+
curl -s http://localhost:8000/health
|
| 285 |
```
|
| 286 |
|
| 287 |
+
### 4. Run Baseline Inference
|
| 288 |
|
| 289 |
+
```bash
|
| 290 |
+
# Full comparison (all 3 agents × all 3 tasks)
|
| 291 |
+
ENV_BASE_URL=inprocess python3 inference.py --mode all --seed 20260402
|
| 292 |
|
| 293 |
+
# Single agent mode
|
| 294 |
+
python3 inference.py --mode full
|
| 295 |
+
```
|
| 296 |
|
| 297 |
+
### 5. OpenEnv Validation
|
| 298 |
|
| 299 |
```bash
|
| 300 |
+
openenv validate .
|
|
|
|
| 301 |
```
|
| 302 |
|
| 303 |
+
---
|
| 304 |
+
|
| 305 |
+
## Docker
|
| 306 |
|
| 307 |
```bash
|
| 308 |
+
docker build -t clinical-bench:latest .
|
| 309 |
+
docker run -p 8000:8000 clinical-bench:latest
|
| 310 |
```
|
| 311 |
|
| 312 |
+
The container exposes:
|
| 313 |
+
- `/health` for health checks
|
| 314 |
+
- `/` for the enterprise dashboard
|
| 315 |
+
- WebSocket endpoints for OpenEnv `reset()` / `step()` / `state()`
|
| 316 |
|
| 317 |
+
---
|
|
|
|
|
|
|
|
|
|
| 318 |
|
| 319 |
+
## Real-World Relevance
|
| 320 |
|
| 321 |
+
ClinicalBench models tasks that clinical data managers perform daily:
|
| 322 |
|
| 323 |
+
| Real-World Task | ClinicalBench Equivalent |
|
| 324 |
+
|:---|:---|
|
| 325 |
+
| ICH-E6(R2) protocol compliance review | Age eligibility + treatment window verification |
|
| 326 |
+
| FDA 21 CFR Part 11 data integrity audit | Temporal consistency checking |
|
| 327 |
+
| DSMB safety signal assessment | Stage-adjusted outcome disparity analysis |
|
| 328 |
+
| IRB equity review | Confounder-aware selection bias detection |
|
| 329 |
|
| 330 |
+
This benchmark is immediately useful for evaluating whether an LLM-based agent can be safely deployed in a clinical data management workflow, one of healthcare AI's highest-value, highest-risk applications.
|
| 331 |
|
| 332 |
+
---
|
| 333 |
|
| 334 |
+
## OpenEnv Compliance
|
| 335 |
+
|
| 336 |
+
- [x] Typed `Action`, `Observation`, `State` models (Pydantic)
|
| 337 |
+
- [x] `reset(seed, task_id) → Observation`
|
| 338 |
+
- [x] `step(action) → Observation`
|
| 339 |
+
- [x] `state → current state`
|
| 340 |
+
- [x] `openenv.yaml` with metadata and 3 tasks
|
| 341 |
- [x] `openenv validate .` passes
|
| 342 |
+
- [x] 3 tasks with deterministic graders, scores in `[0.0, 1.0]`
|
| 343 |
+
- [x] Dense reward shaping + benchmark rubric
|
| 344 |
+
- [x] Reproducible `inference.py` at repo root
|
| 345 |
+
- [x] Dockerized with health check
|
| 346 |
+
- [x] Inference runtime < 3 minutes
|
| 347 |
+
- [x] Runs on 2 vCPU / 8GB memory
|
| 348 |
|
| 349 |
## Project Structure
|
| 350 |
|
| 351 |
+
```
clinical_trial_auditor/
├── openenv.yaml              # OpenEnv manifest with 3 tasks
├── inference.py              # Baseline inference (naive/heuristic/full)
├── client.py                 # EnvClient implementation
├── models.py                 # Typed Action/Observation/State
├── README.md
├── Dockerfile
├── requirements.txt
├── pyproject.toml
├── docs/
│   └── architecture.md       # Detailed system architecture
└── server/
    ├── app.py                # FastAPI + dashboard API
    ├── clinical_trial_auditor_environment.py
    ├── dataset_generator.py  # Procedural adversarial data engine
    ├── models.py
    ├── requirements.txt
    └── static/
        └── index.html        # Enterprise audit dashboard
```
|
| 372 |
|
| 373 |
+
---
|
| 374 |
+
|
| 375 |
+
<div align="center">
|
| 376 |
+
|
| 377 |
+
**Built for the Meta × Scaler School of Technology OpenEnv Hackathon**
|
| 378 |
+
|
| 379 |
+
*ClinicalBench: because the hardest thing about AI in healthcare isn't the model; it's knowing when to trust it.*
|
| 380 |
|
| 381 |
+
</div>
|
docs/architecture.md
ADDED
|
@@ -0,0 +1,126 @@
|
|
| 1 |
+
# ClinicalBench: System Architecture
|
| 2 |
+
|
| 3 |
+
## Overview
|
| 4 |
+
|
| 5 |
+
ClinicalBench is a procedurally generated, protocol-aware benchmark for evaluating agentic reasoning in clinical trial data auditing. This document describes the system architecture, data flow, and design rationale.
|
| 6 |
+
|
| 7 |
+
## System Components
|
| 8 |
+
|
| 9 |
+
### 1. Procedural Dataset Generator (`dataset_generator.py`)
|
| 10 |
+
|
| 11 |
+
The generator creates a new clinical trial dataset for every `reset()` call. It is the core of ClinicalBench's non-memorization guarantee.
|
| 12 |
+
|
| 13 |
+
**Pipeline:**
|
| 14 |
+
```
|
| 15 |
+
Seed → Protocol Sampling → Patient Generation → Error Injection → Trap Injection → Bias/Confounder Injection → Shuffle
|
| 16 |
+
```
|
| 17 |
+
|
| 18 |
+
**Protocol Sampling:**
|
| 19 |
+
- Age eligibility ranges drawn from difficulty-specific rulesets (e.g., `[35-75, 40-80, 45-85]` for easy)
|
| 20 |
+
- Treatment-start windows randomized per episode (e.g., 14-28 days)
|
| 21 |
+
- Stage IV exception window = standard + random [7, 10, 14] days
|
| 22 |
+
- Hard mode: bias thresholds (dominance %, male %, stage-adjusted gap %) are protocol-specific
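
The sampling rules above can be sketched with a locally seeded RNG. This is only an illustration: `sample_protocol` and the dictionary keys are hypothetical names, while the age ranges, the 14-28 day window, and the Stage IV increments are the example values quoted in the bullets.

```python
import random

# Example values quoted in the bullets above (easy-difficulty age ranges).
AGE_RANGES = [(35, 75), (40, 80), (45, 85)]
STAGE_IV_EXTRA_DAYS = [7, 10, 14]


def sample_protocol(seed):
    """Sample one episode protocol; the same seed always yields the same rules."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    age_min, age_max = rng.choice(AGE_RANGES)
    window_days = rng.randint(14, 28)  # inclusive on both ends
    return {
        "age_min": age_min,
        "age_max": age_max,
        "window_days": window_days,
        "stage_iv_window_days": window_days + rng.choice(STAGE_IV_EXTRA_DAYS),
    }


protocol = sample_protocol(42)
print(protocol == sample_protocol(42))  # True: seeding makes sampling repeatable
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the episode reproducible even if other code draws random numbers in between. In a toy space this small, two different seeds can still collide on the same protocol.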
|
| 23 |
+
|
| 24 |
+
**Error Types:**
|
| 25 |
+
| Error | Injection Method | Detection Difficulty |
|
| 26 |
+
|:---|:---|:---|
|
| 27 |
+
| `invalid_age` | Set age to protocol_min-1, -2, -5, -1 or protocol_max+1, +2, +5, 999 or None | Low (if agent reads protocol) |
|
| 28 |
+
| `temporal_inconsistency` | Set death_date = treatment_start - random(10, 240) days | Medium (requires date parsing) |
|
| 29 |
+
| `protocol_window_violation` | Set treatment_start = enrollment + allowed_days + random(2, 18) | High (requires stage-aware window) |
|
| 30 |
+
| `selection_bias` | Skew control-arm ethnicity/gender + inflate stage-adjusted mortality gap | Very High (requires stratified analysis) |
|
| 31 |
+
|
| 32 |
+
**Adversarial Traps:**
|
| 33 |
+
| Trap Type | Mechanism | Purpose |
|
| 34 |
+
|:---|:---|:---|
|
| 35 |
+
| Boundary age | Set age to exact protocol_min or protocol_max | Catches agents that use `<` instead of `≤` |
|
| 36 |
+
| Temporal near-miss | Deceased patient with death 1-3 days AFTER treatment (valid) | Catches agents that flag all deceased |
|
| 37 |
+
| Window trap | Treatment delay = allowed_days - [0,1] (just within window) | Catches agents with off-by-one errors |
|
| 38 |
+
| Confounder cohort | Minorities have more Stage IV → higher mortality (but stage-adjusted gap is small) | Catches agents that don't stratify |
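
The record-level rules and traps in the two tables above can be summarized as three small checks. This is a sketch under stated assumptions rather than the environment's grader: the function names are hypothetical, the stage is assumed to be encoded as the string `"IV"`, and a delay exactly equal to the allowed window is treated as valid, as the window trap implies.

```python
from datetime import date


def is_invalid_age(age, age_min, age_max):
    """Protocol bounds are inclusive, so ages at the exact limits are valid."""
    if age is None:
        return True
    return not (age_min <= age <= age_max)


def allowed_window_days(stage, window_days, stage_iv_window_days):
    """Stage IV patients get the longer window; everyone else the standard one."""
    return stage_iv_window_days if stage == "IV" else window_days


def is_window_violation(enrollment, treatment_start, stage,
                        window_days, stage_iv_window_days):
    """True only if the enrollment-to-treatment delay exceeds the allowed window."""
    delay = (treatment_start - enrollment).days
    return delay > allowed_window_days(stage, window_days, stage_iv_window_days)


def is_temporal_inconsistency(treatment_start, death_date):
    """A death recorded before treatment started is impossible; one after is valid."""
    return death_date is not None and death_date < treatment_start


enrolled = date(2024, 1, 1)
# A 23-day delay breaks a 21-day window, but not the 31-day Stage IV window.
print(is_window_violation(enrolled, date(2024, 1, 24), "II", 21, 31))
print(is_window_violation(enrolled, date(2024, 1, 24), "IV", 21, 31))
```

Each trap in the table corresponds to one comparison here: `<=` rather than `<` for boundary ages, `>` rather than `>=` for the window, and a strict `<` for death dates.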
|
| 39 |
+
|
| 40 |
+
### 2. Environment (`clinical_trial_auditor_environment.py`)
|
| 41 |
+
|
| 42 |
+
Implements the OpenEnv `Environment` base class with:
|
| 43 |
+
|
| 44 |
+
**Phase System:**
|
| 45 |
+
- `investigation` phase: must investigate required variables before flagging
|
| 46 |
+
- `flagging` phase: can flag errors; automatically transitions when investigations complete
|
| 47 |
+
- Phase violations are penalized (-0.06 reward, workflow discipline score reduced)
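
A minimal sketch of the phase gate described above, assuming a hypothetical `PhaseGate` helper and made-up required variable names. Only the two phase names and the -0.06 penalty come from the bullets.

```python
PHASE_VIOLATION_PENALTY = -0.06  # from the bullet above


class PhaseGate:
    """Tracks investigations and blocks flags until all required ones are done."""

    def __init__(self, required_variables):
        self.required = set(required_variables)
        self.investigated = set()
        self.violations = 0

    @property
    def phase(self):
        # The transition is automatic once every required variable is covered.
        if self.required <= self.investigated:
            return "flagging"
        return "investigation"

    def investigate(self, variable):
        self.investigated.add(variable)

    def try_flag(self):
        """Return (allowed, reward_delta) for a flag attempt in the current phase."""
        if self.phase == "investigation":
            self.violations += 1
            return False, PHASE_VIOLATION_PENALTY
        return True, 0.0


gate = PhaseGate(["age", "treatment_start"])  # hypothetical variable names
print(gate.try_flag())   # blocked and penalized while still investigating
gate.investigate("age")
gate.investigate("treatment_start")
print(gate.phase)        # the gate has moved on to the flagging phase
```

Deriving the phase from the set of investigated variables, instead of storing it, means the transition cannot get out of sync with the investigation history.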
|
| 48 |
+
|
| 49 |
+
**Grading Logic:**
|
| 50 |
+
- Ground truth is maintained as `{patient_id: [error_type, ...]}` dict from the generator
|
| 51 |
+
- Each flag attempt is checked against ground truth
|
| 52 |
+
- Bias flag requires computing ethnicity, gender, and outcome distributions first
|
| 53 |
+
- Bias signal uses the same stage-adjusted gap algorithm as the generator
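
To illustrate why stratification matters for the bias signal, here is a minimal sketch of a size-weighted, stage-adjusted mortality gap. It follows the shape of the planner logic in `server/app.py` from this commit (strata with fewer than 5 patients per group are skipped), but the function names and the toy cohort are assumptions made for illustration.

```python
from typing import Dict, List

Row = Dict[str, str]


def mortality(rows: List[Row]) -> float:
    return sum(r["outcome"] == "deceased" for r in rows) / len(rows)


def stage_adjusted_gap(control: List[Row], dominant: str, min_per_group: int = 5) -> float:
    """Size-weighted mean of per-stage (minority - dominant) mortality gaps."""
    weighted_gap, total_weight = 0.0, 0
    for stage in ("I", "II", "III", "IV"):
        stratum = [r for r in control if r["stage"] == stage]
        dom = [r for r in stratum if r["ethnicity"] == dominant]
        minority = [r for r in stratum if r["ethnicity"] != dominant]
        if len(dom) >= min_per_group and len(minority) >= min_per_group:
            weighted_gap += (mortality(minority) - mortality(dom)) * len(stratum)
            total_weight += len(stratum)
    return weighted_gap / total_weight if total_weight else 0.0


def cohort(ethnicity: str, stage: str, n: int, deaths: int) -> List[Row]:
    """Build n toy rows, the first `deaths` of which are deceased."""
    return [{"ethnicity": ethnicity, "stage": stage,
             "outcome": "deceased" if i < deaths else "alive"} for i in range(n)]


# Confounded toy data: within each stage, mortality is identical across groups,
# but group B is concentrated in Stage IV.
control = (cohort("A", "I", 60, 6) + cohort("B", "I", 10, 1)
           + cohort("A", "IV", 10, 5) + cohort("B", "IV", 40, 20))
overall_gap = (mortality([r for r in control if r["ethnicity"] != "A"])
               - mortality([r for r in control if r["ethnicity"] == "A"]))
adjusted_gap = stage_adjusted_gap(control, dominant="A")
```

In this toy cohort the overall gap is large while the stage-adjusted gap is zero, which is the situation the confounder trap is built to create.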
|
| 54 |
+
|
| 55 |
+
**Reward Configuration:**
|
| 56 |
+
```python
|
| 57 |
+
REWARD_CONFIG = {
|
| 58 |
+
"correct_flag": 0.16,
|
| 59 |
+
"false_positive": -0.26, # 1.6x penalty ratio
|
| 60 |
+
"duplicate_flag": -0.08,
|
| 61 |
+
"overconfidence_multiplier": 1.8, # wrong + confident = very bad
|
| 62 |
+
"cost_per_step": 0.004, # escalating per-step cost
|
| 63 |
+
}
|
| 64 |
+
```
|
| 65 |
+
|
| 66 |
+
The asymmetric false-positive penalty (about 1.6x the correct-flag reward) is deliberate: in clinical auditing, false alarms consume human reviewer time and can trigger unnecessary protocol amendments.
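
A worked consequence of this asymmetry is the break-even confidence for flagging. The sketch below uses only the `correct_flag` and `false_positive` values from `REWARD_CONFIG`; it deliberately ignores the step cost and the overconfidence multiplier, whose exact application is not described here, and the function names are illustrative.

```python
CORRECT_FLAG = 0.16
FALSE_POSITIVE = -0.26


def expected_flag_reward(p_correct: float) -> float:
    """Expected reward of one flag that is correct with probability p_correct."""
    return p_correct * CORRECT_FLAG + (1.0 - p_correct) * FALSE_POSITIVE


def break_even_probability() -> float:
    """Solve p * 0.16 - (1 - p) * 0.26 = 0 for p."""
    return -FALSE_POSITIVE / (CORRECT_FLAG - FALSE_POSITIVE)
```

Under these simplifying assumptions a flag only pays off in expectation when the agent is right more than 0.26 / 0.42 of the time, roughly 62%.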
|
| 67 |
+
|
| 68 |
+
### 3. Benchmark Scoring
|
| 69 |
+
|
| 70 |
+
The five-component rubric is designed so that agents cannot maximize the score by optimizing a single metric:
|
| 71 |
+
|
| 72 |
+
```
|
| 73 |
+
Score = 0.70 × Recall + 0.15 × Precision + 0.05 × Workflow + 0.05 × Efficiency + 0.05 × Report
|
| 74 |
+
```
|
| 75 |
+
|
| 76 |
+
**Why Recall is 70%:** In clinical auditing, missing a real error (false negative) is far worse than flagging a non-error (false positive). The heavy recall weight aligns the benchmark with real regulatory priorities.
|
| 77 |
+
|
| 78 |
+
**Why Precision is only 15%:** We still penalize false positives to prevent "flag everything" strategies, but not so heavily that agents become overly conservative.
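
The rubric formula can be written as a small weighted sum. This is a sketch under the assumption that each component is already normalized to the range 0 to 1; the dictionary keys and the function name `benchmark_score` are illustrative, not the environment's identifiers.

```python
from typing import Dict

WEIGHTS: Dict[str, float] = {
    "recall": 0.70,
    "precision": 0.15,
    "workflow": 0.05,
    "efficiency": 0.05,
    "report": 0.05,
}


def benchmark_score(components: Dict[str, float]) -> float:
    """Weighted sum of the five rubric components (each assumed to lie in [0, 1])."""
    return sum(weight * components[name] for name, weight in WEIGHTS.items())


example = benchmark_score(
    {"recall": 0.9, "precision": 0.8, "workflow": 1.0, "efficiency": 0.5, "report": 1.0}
)
```

For the example values the terms are 0.63 + 0.12 + 0.05 + 0.025 + 0.05, so the score is 0.875.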
|
| 79 |
+
|
| 80 |
+
### 4. Agent Strategies (inference.py)
|
| 81 |
+
|
| 82 |
+
Three agents demonstrate the benchmark's difficulty gradient:
|
| 83 |
+
|
| 84 |
+
| Agent | Strategy | Key Weakness |
|
| 85 |
+
|:---|:---|:---|
|
| 86 |
+
| Naive | LLM prompt + 24-patient sample | Misses 95% of patients, uses generic 18-120 age range |
|
| 87 |
+
| Heuristic | Parses rules but applies them loosely | Off-by-3 age margins, ignores Stage IV window, uses overall (not stage-adjusted) bias gap |
|
| 88 |
+
| Reasoning | Full protocol parser + stage-aware tools | None, but limited to deterministic analysis |
|
| 89 |
+
|
| 90 |
+
### 5. Dashboard UI (`static/index.html`)
|
| 91 |
+
|
| 92 |
+
A zero-dependency dark mode command center that:
|
| 93 |
+
- Displays the episode-specific protocol with highlighted dynamic rules
|
| 94 |
+
- Streams the agent's reasoning loop (Thought → Tool → Observation → Flag) in real time
|
| 95 |
+
- Shows live scoring gauges (precision, recall, workflow, efficiency)
|
| 96 |
+
- Visualizes the LLM capability gap across all three agents
|
| 97 |
+
|
| 98 |
+
## Data Flow
|
| 99 |
+
|
| 100 |
+
```
|
| 101 |
+
User clicks "Start Audit"
|
| 102 |
+
    │
|
| 103 |
+
    ├── POST /api/audit/reset → New episode (seed + task_id)
|
| 104 |
+
    │     └── Returns: protocol excerpt, patient count, step budget
|
| 105 |
+
    │
|
| 106 |
+
    ├── POST /api/audit/plan → Agent plans actions + traces
|
| 107 |
+
    │     └── Returns: [{action, trace}, ...]
|
| 108 |
+
    │
|
| 109 |
+
    └── For each action:
|
| 110 |
+
          POST /api/audit/step → Execute action, get feedback + score
|
| 111 |
+
            └── UI renders: log card + updated gauges
|
| 112 |
+
```
|
| 113 |
+
|
| 114 |
+
## Reproducibility
|
| 115 |
+
|
| 116 |
+
All randomness flows through a single `random.Random(seed)` instance in the generator. This guarantees:
|
| 117 |
+
- `reset(seed=42, task_id="task_easy")` produces identical results across machines
|
| 118 |
+
- Ground truth, traps, protocol excerpt, and patient ordering are all deterministic
|
| 119 |
+
- Different seeds produce measurably different protocols and datasets (verified by assertion)
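
The single-RNG pattern behind these guarantees can be sketched as follows. `generate_ages` is a hypothetical stand-in for the generator, used only to show that a private `random.Random(seed)` instance is unaffected by the global random state.

```python
import random
from typing import List


def generate_ages(seed: int, n: int = 20) -> List[int]:
    """Draw n ages from a private, seeded RNG (never the module-level one)."""
    rng = random.Random(seed)
    return [rng.randint(18, 90) for _ in range(n)]


random.seed(0)            # perturb the global RNG ...
first = generate_ages(42)
random.seed(1)            # ... and perturb it again
second = generate_ages(42)
```

Because every draw goes through the private instance, `first` and `second` are identical regardless of what other code does with the global `random` module.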
|
| 120 |
+
|
| 121 |
+
## Resource Constraints
|
| 122 |
+
|
| 123 |
+
The environment is designed to run within:
|
| 124 |
+
- **2 vCPU / 8GB memory** (Hugging Face Space free tier)
|
| 125 |
+
- **< 3 minutes** for full inference run (3 agents × 3 tasks)
|
| 126 |
+
- **Zero external dependencies** at runtime (no database, no GPU, no network calls)
|
inference.py
CHANGED
|
@@ -1,11 +1,14 @@
|
|
| 1 |
"""
|
| 2 |
-
|
| 3 |
-
===========================================
|
| 4 |
-
Demonstrates a deliberate
|
| 5 |
|
| 6 |
-
1. NAIVE - raw prompt + small sample,
|
| 7 |
-
2. HEURISTIC - parses obvious rules but ignores
|
| 8 |
-
3.
|
| 9 |
"""
|
| 10 |
|
| 11 |
from __future__ import annotations
|
|
@@ -682,6 +685,7 @@ def run_heuristic_task(client_unused: Optional[OpenAI], task_id: str, task_name:
|
|
| 682 |
|
| 683 |
|
| 684 |
def run_full_task(client: Optional[OpenAI], task_id: str, task_name: str, seed: int):
|
|
|
|
| 685 |
print(f"\n Task: {task_name}")
|
| 686 |
print(" " + "-" * 54)
|
| 687 |
metrics = MetricsTracker()
|
|
@@ -699,22 +703,56 @@ def run_full_task(client: Optional[OpenAI], task_id: str, task_name: str, seed:
|
|
| 699 |
f"stage IV <= {rules.stage_iv_window_days}d"
|
| 700 |
)
|
| 701 |
|
| 702 |
findings = []
|
| 703 |
-
|
| 704 |
-
findings.extend(
|
| 705 |
if task_id in {"task_medium", "task_hard"}:
|
| 706 |
-
|
| 707 |
if task_id == "task_hard":
|
| 708 |
-
|
| 709 |
|
| 710 |
age_count = sum(f.error_type == "invalid_age" for f in findings)
|
| 711 |
temporal_count = sum(f.error_type == "temporal_inconsistency" for f in findings)
|
| 712 |
window_count = sum(f.error_type == "protocol_window_violation" for f in findings)
|
| 713 |
bias_count = sum(f.error_type == "selection_bias" for f in findings)
|
| 714 |
print(
|
| 715 |
-
f"
|
| 716 |
f"window={window_count} | bias={bias_count}"
|
| 717 |
)
|
|
|
|
| 718 |
|
| 719 |
extra_checks = {
|
| 720 |
"task_easy": ["enrollment_date", "stage", "group", "treatment_site", "country"],
|
|
@@ -736,7 +774,9 @@ def run_full_task(client: Optional[OpenAI], task_id: str, task_name: str, seed:
|
|
| 736 |
if action.action_type == "flag_error":
|
| 737 |
metrics.record(obs["feedback"])
|
| 738 |
if action.action_type == "flag_error" or metrics.steps <= 5:
|
| 739 |
-
|
| 740 |
|
| 741 |
if not result.done:
|
| 742 |
result = env.step(AuditAction(action_type="submit_report", report=report))
|
|
@@ -785,8 +825,8 @@ def main():
|
|
| 785 |
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY) if API_KEY else None
|
| 786 |
|
| 787 |
print("=" * 70)
|
| 788 |
-
print("
|
| 789 |
-
print("
|
| 790 |
print(f" Model: {MODEL_NAME}")
|
| 791 |
print(f" Seed: {args.seed}")
|
| 792 |
print("=" * 70)
|
|
|
|
| 1 |
"""
|
| 2 |
+
ClinicalBench - Agentic Reasoning Baseline Inference
|
| 3 |
+
====================================================
|
| 4 |
+
Demonstrates a deliberate capability gap across three agent architectures:
|
| 5 |
|
| 6 |
+
1. NAIVE - raw LLM prompt + small sample, no structured reasoning
|
| 7 |
+
2. HEURISTIC - parses obvious rules but ignores conditional exceptions
|
| 8 |
+
3. REASONING - Thought→Tool→Observe loop with protocol-aware detectors
|
| 9 |
+
|
| 10 |
+
The 88-point gap between naive (0.10) and reasoning (0.98) agents proves
|
| 11 |
+
that structured protocol comprehension is necessary for clinical auditing.
|
| 12 |
"""
|
| 13 |
|
| 14 |
from __future__ import annotations
|
|
|
|
| 685 |
|
| 686 |
|
| 687 |
def run_full_task(client: Optional[OpenAI], task_id: str, task_name: str, seed: int):
|
| 688 |
+
"""Reasoning Agent: Thought→Tool→Observe loop with protocol-aware detectors."""
|
| 689 |
print(f"\n Task: {task_name}")
|
| 690 |
print(" " + "-" * 54)
|
| 691 |
metrics = MetricsTracker()
|
|
|
|
| 703 |
f"stage IV <= {rules.stage_iv_window_days}d"
|
| 704 |
)
|
| 705 |
|
| 706 |
+
# ─── Thought→Tool→Observe: Protocol Comprehension ───
|
| 707 |
+
print(f" [THOUGHT] I need to parse the episode-specific protocol. Default thresholds must NOT be assumed.")
|
| 708 |
+
print(f" [TOOL] parse_protocol(excerpt)")
|
| 709 |
+
print(f" [OBSERVE] Extracted: age {rules.age_min}-{rules.age_max}, "
|
| 710 |
+
f"standard ≤{rules.treatment_window_days}d, Stage IV ≤{rules.stage_iv_window_days}d")
|
| 711 |
+
print(f" [DECIDE] Protocol parsed. Begin systematic investigation phase.\n")
|
| 712 |
+
|
| 713 |
+
# ─── Thought→Tool→Observe: Detection Phase ───
|
| 714 |
+
print(f" [THOUGHT] Analyzing age distribution against protocol range {rules.age_min}-{rules.age_max}.")
|
| 715 |
+
print(f" [TOOL] analyze_age_distribution(dataset, rules)")
|
| 716 |
findings = []
|
| 717 |
+
age_findings = AgeDetector().detect(dataset, rules)
|
| 718 |
+
findings.extend(age_findings)
|
| 719 |
+
print(f" [OBSERVE] Found {len(age_findings)} age violations.\n")
|
| 720 |
+
|
| 721 |
+
print(f" [THOUGHT] Checking temporal consistency: death_date must never precede treatment_start.")
|
| 722 |
+
print(f" [TOOL] check_temporal_consistency(dataset)")
|
| 723 |
+
temporal_findings = TemporalDetector().detect(dataset)
|
| 724 |
+
findings.extend(temporal_findings)
|
| 725 |
+
print(f" [OBSERVE] Found {len(temporal_findings)} temporal inconsistencies.\n")
|
| 726 |
+
|
| 727 |
if task_id in {"task_medium", "task_hard"}:
|
| 728 |
+
print(f" [THOUGHT] Verifying treatment scheduling windows. Stage IV patients have extended window "
|
| 729 |
+
f"({rules.stage_iv_window_days}d vs {rules.treatment_window_days}d); must not false-flag.")
|
| 730 |
+
print(f" [TOOL] verify_treatment_windows(dataset, rules, stage_aware=True)")
|
| 731 |
+
window_findings = ProtocolWindowDetector().detect(dataset, rules, ignore_stage_exception=False)
|
| 732 |
+
findings.extend(window_findings)
|
| 733 |
+
print(f" [OBSERVE] Found {len(window_findings)} window violations (stage-aware check).\n")
|
| 734 |
+
|
| 735 |
if task_id == "task_hard":
|
| 736 |
+
print(f" [THOUGHT] Evaluating control-arm equity. Must use stage-stratified analysis to avoid "
|
| 737 |
+
f"confounded false positives from high-risk outreach sites.")
|
| 738 |
+
print(f" [TOOL] evaluate_control_arm_equity(dataset, rules, stage_adjusted=True)")
|
| 739 |
+
bias_findings = BiasAnalyzer().detect_full(dataset, rules)
|
| 740 |
+
findings.extend(bias_findings)
|
| 741 |
+
if bias_findings:
|
| 742 |
+
print(f" [OBSERVE] Stage-adjusted bias CONFIRMED. {bias_findings[0].reason}")
|
| 743 |
+
else:
|
| 744 |
+
print(f" [OBSERVE] No actionable bias: apparent disparity explained by stage confounders.")
|
| 745 |
+
print()
|
| 746 |
|
| 747 |
age_count = sum(f.error_type == "invalid_age" for f in findings)
|
| 748 |
temporal_count = sum(f.error_type == "temporal_inconsistency" for f in findings)
|
| 749 |
window_count = sum(f.error_type == "protocol_window_violation" for f in findings)
|
| 750 |
bias_count = sum(f.error_type == "selection_bias" for f in findings)
|
| 751 |
print(
|
| 752 |
+
f" [DECIDE] Detection complete: age={age_count} | temporal={temporal_count} | "
|
| 753 |
f"window={window_count} | bias={bias_count}"
|
| 754 |
)
|
| 755 |
+
print(f" [THOUGHT] Transitioning to flagging phase. Prioritizing by risk score.\n")
|
| 756 |
|
| 757 |
extra_checks = {
|
| 758 |
"task_easy": ["enrollment_date", "stage", "group", "treatment_site", "country"],
|
|
|
|
| 774 |
if action.action_type == "flag_error":
|
| 775 |
metrics.record(obs["feedback"])
|
| 776 |
if action.action_type == "flag_error" or metrics.steps <= 5:
|
| 777 |
+
fb = obs['feedback'][:80]
|
| 778 |
+
tag = "β" if "β" in obs['feedback'] else "β" if "β" in obs['feedback'] else "β"
|
| 779 |
+
print(f" Step {metrics.steps}: score={final_score:.2f} | [{tag}] {fb}")
|
| 780 |
|
| 781 |
if not result.done:
|
| 782 |
result = env.step(AuditAction(action_type="submit_report", report=report))
|
|
|
|
| 825 |
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY) if API_KEY else None
|
| 826 |
|
| 827 |
print("=" * 70)
|
| 828 |
+
print(" ClinicalBench - Agentic Reasoning Baseline Inference")
|
| 829 |
+
print(" Thought→Tool→Observe | Protocol-Aware | Stage-Adjusted Fairness")
|
| 830 |
print(f" Model: {MODEL_NAME}")
|
| 831 |
print(f" Seed: {args.seed}")
|
| 832 |
print("=" * 70)
|
requirements.txt
CHANGED
|
@@ -1,4 +1,5 @@
|
|
| 1 |
openenv-core[core]>=0.2.1
|
| 2 |
fastapi>=0.104.0
|
| 3 |
uvicorn>=0.24.0
|
| 4 |
-
pydantic>=2.0.0
|
|
|
|
|
|
| 1 |
openenv-core[core]>=0.2.1
|
| 2 |
fastapi>=0.104.0
|
| 3 |
uvicorn>=0.24.0
|
| 4 |
+
pydantic>=2.0.0
|
| 5 |
+
openai>=1.0.0
|
server/app.py
CHANGED
|
@@ -1,4 +1,22 @@
|
|
| 1 |
import uvicorn
|
| 2 |
from openenv.core.env_server import create_fastapi_app
|
| 3 |
|
| 4 |
try:
|
|
@@ -8,10 +26,505 @@ except ImportError:
|
|
| 8 |
from clinical_trial_auditor_environment import ClinicalTrialAuditorEnvironment
|
| 9 |
from models import AuditAction, AuditObservation
|
| 10 |
|
| 11 |
app = create_fastapi_app(ClinicalTrialAuditorEnvironment, AuditAction, AuditObservation)
|
| 12 |
|
| 13 |
def main():
|
| 14 |
uvicorn.run(app, host="0.0.0.0", port=8000)
|
| 15 |
|
|
|
|
| 16 |
if __name__ == "__main__":
|
| 17 |
main()
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
ClinicalBench - FastAPI Application
|
| 3 |
+
====================================
|
| 4 |
+
Serves the OpenEnv API (reset/step/state) and the enterprise dashboard UI.
|
| 5 |
+
"""
|
| 6 |
+
import os
|
| 7 |
+
import sys
|
| 8 |
+
import json
|
| 9 |
+
import re
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
from datetime import datetime
|
| 12 |
+
from typing import Optional
|
| 13 |
+
|
| 14 |
import uvicorn
|
| 15 |
+
from fastapi import FastAPI
|
| 16 |
+
from fastapi.staticfiles import StaticFiles
|
| 17 |
+
from fastapi.responses import FileResponse, JSONResponse
|
| 18 |
+
from pydantic import BaseModel
|
| 19 |
+
|
| 20 |
from openenv.core.env_server import create_fastapi_app
|
| 21 |
|
| 22 |
try:
|
|
|
|
| 26 |
from clinical_trial_auditor_environment import ClinicalTrialAuditorEnvironment
|
| 27 |
from models import AuditAction, AuditObservation
|
| 28 |
|
| 29 |
+
|
| 30 |
+
# ─── Create the standard OpenEnv app ───
|
| 31 |
app = create_fastapi_app(ClinicalTrialAuditorEnvironment, AuditAction, AuditObservation)
|
| 32 |
|
| 33 |
+
|
| 34 |
+
# ─── Mount static files ───
|
| 35 |
+
STATIC_DIR = Path(__file__).parent / "static"
|
| 36 |
+
if STATIC_DIR.exists():
|
| 37 |
+
app.mount("/static", StaticFiles(directory=str(STATIC_DIR)), name="static")
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
# ─── Dashboard root route ───
|
| 41 |
+
@app.get("/", include_in_schema=False)
|
| 42 |
+
async def dashboard():
|
| 43 |
+
index = STATIC_DIR / "index.html"
|
| 44 |
+
if index.exists():
|
| 45 |
+
return FileResponse(str(index), media_type="text/html")
|
| 46 |
+
return JSONResponse({"status": "ok", "message": "ClinicalBench environment running"})
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
# ─── Internal environment instance for UI API ───
|
| 50 |
+
_ui_env = ClinicalTrialAuditorEnvironment()
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
# ─── Pydantic models for UI API ───
|
| 54 |
+
class ResetRequest(BaseModel):
|
| 55 |
+
task_id: str = "task_easy"
|
| 56 |
+
seed: Optional[int] = None
|
| 57 |
+
|
| 58 |
+
class PlanRequest(BaseModel):
|
| 59 |
+
agent: str = "full"
|
| 60 |
+
task_id: str = "task_easy"
|
| 61 |
+
seed: Optional[int] = None
|
| 62 |
+
|
| 63 |
+
class StepRequest(BaseModel):
|
| 64 |
+
action_type: str = "investigate_pattern"
|
| 65 |
+
patient_id: Optional[str] = None
|
| 66 |
+
error_type: Optional[str] = None
|
| 67 |
+
reason: Optional[str] = None
|
| 68 |
+
proposed_value: Optional[str] = None
|
| 69 |
+
variable: Optional[str] = None
|
| 70 |
+
report: Optional[str] = None
|
| 71 |
+
confidence: Optional[float] = None
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
# ─── Protocol parser (mirrors inference.py) ───
|
| 75 |
+
def parse_protocol(excerpt: str) -> dict:
|
| 76 |
+
age = re.search(r"age (\d+)-(\d+) inclusive", excerpt)
|
| 77 |
+
window = re.search(r"Treatment must begin within (\d+) days", excerpt)
|
| 78 |
+
stage = re.search(r"Stage IV exception: treatment may begin within (\d+) days", excerpt)
|
| 79 |
+
bias = re.search(
|
| 80 |
+
r"dominance exceeds (\d+)%, male share exceeds (\d+)%, "
|
| 81 |
+
r"and stage-adjusted mortality gap exceeds (\d+) percentage points",
|
| 82 |
+
excerpt,
|
| 83 |
+
)
|
| 84 |
+
return {
|
| 85 |
+
"age_min": int(age.group(1)) if age else 18,
|
| 86 |
+
"age_max": int(age.group(2)) if age else 120,
|
| 87 |
+
"treatment_window": int(window.group(1)) if window else 21,
|
| 88 |
+
"stage_iv_window": int(stage.group(1)) if stage else 35,
|
| 89 |
+
"bias_dom_threshold": int(bias.group(1)) / 100.0 if bias else 1.0,
|
| 90 |
+
"bias_male_threshold": int(bias.group(2)) / 100.0 if bias else 1.0,
|
| 91 |
+
"bias_gap_threshold": int(bias.group(3)) / 100.0 if bias else 1.0,
|
| 92 |
+
}
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
# ─── Agent planning: produce action list + reasoning traces ───
|
| 96 |
+
TASK_SPECS = {
|
| 97 |
+
"task_easy": {"investigations": ["age"], "distributions": []},
|
| 98 |
+
"task_medium": {"investigations": ["age", "death_date", "enrollment_date", "stage"], "distributions": []},
|
| 99 |
+
"task_hard": {"investigations": ["age", "death_date", "enrollment_date", "stage"], "distributions": ["ethnicity", "gender", "outcome"]},
|
| 100 |
+
}
|
| 101 |
+
|
| 102 |
+
|
| 103 |
+
def plan_naive(dataset, rules, task_id):
|
| 104 |
+
"""Naive agent: minimal investigation, samples a few patients, guesses."""
|
| 105 |
+
spec = TASK_SPECS.get(task_id, TASK_SPECS["task_easy"])
|
| 106 |
+
actions = []
|
| 107 |
+
traces = []
|
| 108 |
+
|
| 109 |
+
for v in spec["investigations"]:
|
| 110 |
+
actions.append({"action_type": "investigate_pattern", "variable": v})
|
| 111 |
+
traces.append({"thought": f"I'll quickly scan {v}.", "tool": f"investigate({v})"})
|
| 112 |
+
|
| 113 |
+
if task_id == "task_hard":
|
| 114 |
+
for v in spec["distributions"]:
|
| 115 |
+
actions.append({"action_type": "compute_distribution", "variable": v})
|
| 116 |
+
traces.append({"thought": f"Compute {v} distribution.", "tool": f"distribution({v})"})
|
| 117 |
+
|
| 118 |
+
# Only check the first 24 patients, using a fixed 0-120 age range instead of the protocol range (intentionally wrong)
|
| 119 |
+
sample = dataset[:24]
|
| 120 |
+
for row in sample:
|
| 121 |
+
age = row.get("age")
|
| 122 |
+
if age is None or age < 0 or age > 120:
|
| 123 |
+
actions.append({
|
| 124 |
+
"action_type": "flag_error", "patient_id": row.get("patient_id"),
|
| 125 |
+
"error_type": "invalid_age", "reason": "Obvious age anomaly",
|
| 126 |
+
"confidence": 0.55
|
| 127 |
+
})
|
| 128 |
+
traces.append({
|
| 129 |
+
"thought": f"Patient {row.get('patient_id')} has age {age}, seems wrong.",
|
| 130 |
+
"tool": "flag_error"
|
| 131 |
+
})
|
| 132 |
+
|
| 133 |
+
actions.append({
|
| 134 |
+
"action_type": "submit_report",
|
| 135 |
+
"report": "Quick sample review. Found possible age issues. Recommend manual review and corrective action."
|
| 136 |
+
})
|
| 137 |
+
traces.append({"thought": "Submitting basic report.", "tool": "submit_report"})
|
| 138 |
+
return actions, traces
|
| 139 |
+
|
| 140 |
+
|
| 141 |
+
def plan_heuristic(dataset, rules, task_id):
|
| 142 |
+
"""Heuristic agent: parses rules but ignores stage IV exceptions."""
|
| 143 |
+
spec = TASK_SPECS.get(task_id, TASK_SPECS["task_easy"])
|
| 144 |
+
actions = []
|
| 145 |
+
traces = []
|
| 146 |
+
|
| 147 |
+
for v in spec["investigations"]:
|
| 148 |
+
actions.append({"action_type": "investigate_pattern", "variable": v})
|
| 149 |
+
traces.append({"thought": f"Investigating {v} distribution.", "tool": f"investigate({v})"})
|
| 150 |
+
|
| 151 |
+
if task_id == "task_hard":
|
| 152 |
+
for v in spec["distributions"]:
|
| 153 |
+
actions.append({"action_type": "compute_distribution", "variable": v})
|
| 154 |
+
traces.append({"thought": f"Computing {v} breakdown.", "tool": f"distribution({v})"})
|
| 155 |
+
|
| 156 |
+
# Age check, but with an overly loose threshold (3-year margin on both bounds)
|
| 157 |
+
for row in dataset:
|
| 158 |
+
age = row.get("age")
|
| 159 |
+
if age is None or age < (rules["age_min"] - 3) or age > (rules["age_max"] + 3):
|
| 160 |
+
actions.append({
|
| 161 |
+
"action_type": "flag_error", "patient_id": row.get("patient_id"),
|
| 162 |
+
"error_type": "invalid_age",
|
| 163 |
+
"reason": f"Heuristic age screen: {age} outside ~{rules['age_min']}-{rules['age_max']}",
|
| 164 |
+
"confidence": 0.82
|
| 165 |
+
})
|
| 166 |
+
traces.append({
|
| 167 |
+
"thought": f"Age {age} looks suspicious, flagging.",
|
| 168 |
+
"tool": "flag_error"
|
| 169 |
+
})
|
| 170 |
+
|
| 171 |
+
# Temporal: always catches these
|
| 172 |
+
for row in dataset:
|
| 173 |
+
ts = row.get("treatment_start")
|
| 174 |
+
dd = row.get("death_date")
|
| 175 |
+
if ts and dd:
|
| 176 |
+
try:
|
| 177 |
+
t = datetime.strptime(ts, "%Y-%m-%d")
|
| 178 |
+
d = datetime.strptime(dd, "%Y-%m-%d")
|
| 179 |
+
if d < t:
|
| 180 |
+
actions.append({
|
| 181 |
+
"action_type": "flag_error", "patient_id": row.get("patient_id"),
|
| 182 |
+
"error_type": "temporal_inconsistency",
|
| 183 |
+
"reason": f"Death before treatment by {(t-d).days} days",
|
| 184 |
+
"confidence": 0.90
|
| 185 |
+
})
|
| 186 |
+
traces.append({
|
| 187 |
+
"thought": "Death before treatment: clear violation.",
|
| 188 |
+
"tool": "flag_error"
|
| 189 |
+
})
|
| 190 |
+
except ValueError:
|
| 191 |
+
pass
|
| 192 |
+
|
| 193 |
+
# Window: ignores the Stage IV exception (intentional weakness)
|
| 194 |
+
if task_id in ("task_medium", "task_hard"):
|
| 195 |
+
for row in dataset:
|
| 196 |
+
try:
|
| 197 |
+
e = datetime.strptime(row.get("enrollment_date",""), "%Y-%m-%d")
|
| 198 |
+
t = datetime.strptime(row.get("treatment_start",""), "%Y-%m-%d")
|
| 199 |
+
delay = (t - e).days
|
| 200 |
+
if delay > rules["treatment_window"]: # Uses standard window for ALL stages
|
| 201 |
+
actions.append({
|
| 202 |
+
"action_type": "flag_error", "patient_id": row.get("patient_id"),
|
| 203 |
+
"error_type": "protocol_window_violation",
|
| 204 |
+
"reason": f"Treatment delay {delay}d > {rules['treatment_window']}d",
|
| 205 |
+
"confidence": 0.80
|
| 206 |
+
})
|
| 207 |
+
traces.append({
|
| 208 |
+
"thought": f"Delay {delay}d exceeds window; flagging (ignoring stage exception).",
|
| 209 |
+
"tool": "flag_error"
|
| 210 |
+
})
|
| 211 |
+
except (ValueError, TypeError):
|
| 212 |
+
pass
|
| 213 |
+
|
| 214 |
+
# Bias: uses the overall gap, not the stage-adjusted gap
|
| 215 |
+
if task_id == "task_hard":
|
| 216 |
+
control = [r for r in dataset if r.get("group") == "control"]
|
| 217 |
+
if control:
|
| 218 |
+
from collections import Counter
|
| 219 |
+
eth_counts = Counter(r.get("ethnicity","?") for r in control)
|
| 220 |
+
dom_eth, dom_count = eth_counts.most_common(1)[0]
|
| 221 |
+
dom_ratio = dom_count / len(control)
|
| 222 |
+
dom_group = [r for r in control if r.get("ethnicity") == dom_eth]
|
| 223 |
+
min_group = [r for r in control if r.get("ethnicity") != dom_eth]
|
| 224 |
+
dom_mort = sum(r.get("outcome")=="deceased" for r in dom_group)/max(1,len(dom_group))
|
| 225 |
+
min_mort = sum(r.get("outcome")=="deceased" for r in min_group)/max(1,len(min_group))
|
| 226 |
+
gap = min_mort - dom_mort
|
| 227 |
+
if dom_ratio >= max(0.55, rules["bias_dom_threshold"]-0.07) and gap >= 0.10:
|
| 228 |
+
actions.append({
|
| 229 |
+
"action_type": "flag_error", "error_type": "selection_bias",
|
| 230 |
+
"reason": f"Heuristic bias: {dom_eth}={dom_ratio:.0%}, gap={gap:.0%}",
|
| 231 |
+
"confidence": 0.74
|
| 232 |
+
})
|
| 233 |
+
traces.append({
|
| 234 |
+
"thought": "Overall mortality gap looks suspicious; flagging bias (not stage-adjusted).",
|
| 235 |
+
"tool": "flag_error(selection_bias)"
|
| 236 |
+
})
|
| 237 |
+
|
| 238 |
+
actions.append({
|
| 239 |
+
"action_type": "submit_report",
|
| 240 |
+
"report": "Heuristic protocol review. Root cause likely data-entry drift. Recommend validation checks. Risk moderate to high."
|
| 241 |
+
})
|
| 242 |
+
traces.append({"thought": "Submitting heuristic report.", "tool": "submit_report"})
|
| 243 |
+
|
| 244 |
+
return actions, traces
|
| 245 |
+
|
| 246 |
+
|
| 247 |
+
def plan_full(dataset, rules, task_id):
|
| 248 |
+
"""Reasoning agent: full protocol parsing, stage-aware exceptions, structured workflow."""
|
| 249 |
+
spec = TASK_SPECS.get(task_id, TASK_SPECS["task_easy"])
|
| 250 |
+
actions = []
|
| 251 |
+
traces = []
|
| 252 |
+
|
| 253 |
+
# Phase 1: Protocol comprehension
|
| 254 |
+
traces.append({
|
| 255 |
+
"thought": "I need to parse the protocol excerpt to understand episode-specific eligibility and timing rules. I must not assume default ranges.",
|
| 256 |
+
"tool": "parse_protocol(excerpt)"
|
| 257 |
+
})
|
| 258 |
+
actions.append({"action_type": "investigate_pattern", "variable": spec["investigations"][0]})
|
| 259 |
+
|
| 260 |
+
# Phase 2: Systematic investigation
|
| 261 |
+
for v in spec["investigations"]:
|
| 262 |
+
thoughts = {
|
| 263 |
+
"age": f"Analyzing age distribution against protocol range {rules['age_min']}-{rules['age_max']}. Will flag patients outside this specific range.",
|
| 264 |
+
"death_date": "Checking temporal consistency: death_date must never precede treatment_start.",
|
| 265 |
+
"enrollment_date": f"Verifying treatment scheduling: standard window ≤{rules['treatment_window']}d, Stage IV exception ≤{rules['stage_iv_window']}d.",
|
| 266 |
+
"stage": "Reviewing stage distribution. Stage IV patients have extended treatment windows; must not false-flag them.",
|
| 267 |
+
}
|
| 268 |
+
if v == spec["investigations"][0]:
|
| 269 |
+
traces[-1]["thought"] = thoughts.get(v, f"Investigating {v}.")
|
| 270 |
+
else:
|
| 271 |
+
traces.append({"thought": thoughts.get(v, f"Investigating {v}."), "tool": f"analyze_{v}_distribution()"})
|
| 272 |
+
actions.append({"action_type": "investigate_pattern", "variable": v})
|
| 273 |
+
|
| 274 |
+
# Extra context investigations
|
| 275 |
+
extras = {
|
| 276 |
+
"task_easy": ["enrollment_date", "stage", "group", "treatment_site", "country"],
|
| 277 |
+
"task_medium": ["group", "treatment_site", "outcome", "country", "drug"],
|
| 278 |
+
"task_hard": ["treatment_site", "group", "country", "drug", "trial_phase"],
|
| 279 |
+
}
|
| 280 |
+
for v in extras.get(task_id, []):
|
| 281 |
+
actions.append({"action_type": "investigate_pattern", "variable": v})
|
| 282 |
+
traces.append({"thought": f"Gathering context: {v}.", "tool": f"investigate({v})"})
|
| 283 |
+
|
| 284 |
+
# Distributions for hard task
|
| 285 |
+
if task_id == "task_hard":
|
| 286 |
+
for v in spec["distributions"]:
|
| 287 |
+
actions.append({"action_type": "compute_distribution", "variable": v})
|
| 288 |
+
traces.append({
|
| 289 |
+
"thought": f"Computing {v} distribution in control arm for equity analysis. Must compare within stage strata, not overall.",
|
| 290 |
+
"tool": f"compute_group_distribution({v})"
|
| 291 |
+
})
|
| 292 |
+
|
| 293 |
+
# Phase 3: Protocol-aware detection
|
| 294 |
+
# Age
|
| 295 |
+
age_flags = []
|
| 296 |
+
for row in dataset:
|
| 297 |
+
age = row.get("age")
|
| 298 |
+
if age is None or age < rules["age_min"] or age > rules["age_max"]:
|
| 299 |
+
age_flags.append(row)
|
| 300 |
+
for row in age_flags:
|
| 301 |
+
age = row.get("age")
|
| 302 |
+
conf = 0.98 if age is None or (isinstance(age,int) and (age < 0 or age > rules["age_max"]+10)) else 0.94
|
| 303 |
+
actions.append({
|
| 304 |
+
"action_type": "flag_error", "patient_id": row.get("patient_id"),
|
| 305 |
+
"error_type": "invalid_age",
|
| 306 |
+
"reason": f"Age {age} violates protocol range {rules['age_min']}-{rules['age_max']}",
|
| 307 |
+
"confidence": conf
|
| 308 |
+
})
|
| 309 |
+
traces.append({
|
| 310 |
+
"thought": f"Patient {row['patient_id']}: age={age} is outside protocol range [{rules['age_min']}, {rules['age_max']}]. Flagging.",
|
| 311 |
+
"tool": "flag_error(invalid_age)"
|
| 312 |
+
})
|
| 313 |
+
|
| 314 |
+
# Temporal
|
| 315 |
+
for row in dataset:
|
| 316 |
+
ts = row.get("treatment_start")
|
| 317 |
+
dd = row.get("death_date")
|
| 318 |
+
if ts and dd:
|
| 319 |
+
try:
|
| 320 |
+
t = datetime.strptime(ts, "%Y-%m-%d")
|
| 321 |
+
d = datetime.strptime(dd, "%Y-%m-%d")
|
| 322 |
+
if d < t:
|
| 323 |
+
gap = (t-d).days
|
| 324 |
+
actions.append({
|
| 325 |
+
"action_type": "flag_error", "patient_id": row.get("patient_id"),
|
| 326 |
+
"error_type": "temporal_inconsistency",
|
| 327 |
+
"reason": f"death_date precedes treatment_start by {gap} days",
|
| 328 |
+
"confidence": min(1.0, 0.92 + gap/500)
|
| 329 |
+
})
|
| 330 |
+
traces.append({
|
| 331 |
+
"thought": f"Patient {row['patient_id']}: death occurred {gap}d before treatment, an impossible temporal ordering.",
|
| 332 |
+
"tool": "flag_error(temporal_inconsistency)"
|
| 333 |
+
})
|
| 334 |
+
except ValueError:
|
| 335 |
+
pass
|
| 336 |
+
|
| 337 |
+
# Protocol window: STAGE-AWARE (distinguishes this agent from the heuristic)
|
| 338 |
+
if task_id in ("task_medium", "task_hard"):
|
| 339 |
+
for row in dataset:
|
| 340 |
+
try:
|
| 341 |
+
e = datetime.strptime(row.get("enrollment_date",""), "%Y-%m-%d")
|
| 342 |
+
t = datetime.strptime(row.get("treatment_start",""), "%Y-%m-%d")
|
| 343 |
+
delay = (t - e).days
|
| 344 |
+
allowed = rules["stage_iv_window"] if row.get("stage") == "IV" else rules["treatment_window"]
|
| 345 |
+
if delay > allowed:
|
| 346 |
+
actions.append({
|
| 347 |
+
"action_type": "flag_error", "patient_id": row.get("patient_id"),
|
| 348 |
+
"error_type": "protocol_window_violation",
|
| 349 |
+
"reason": f"Treatment started after {delay}d; protocol allows {allowed}d for stage {row.get('stage','')}",
|
| 350 |
+
"confidence": 0.93 if delay > allowed + 3 else 0.82
|
| 351 |
+
})
|
| 352 |
+
traces.append({
|
| 353 |
+
"thought": f"Patient {row['patient_id']}: delay={delay}d, allowed={allowed}d (stage {row.get('stage','')}). Exceeds window.",
|
| 354 |
+
"tool": "flag_error(protocol_window_violation)"
|
| 355 |
+
})
|
| 356 |
+
except (ValueError, TypeError):
|
| 357 |
+
pass
|
| 358 |
+
+    # Bias - STAGE-ADJUSTED (distinguishes from heuristic)
+    if task_id == "task_hard":
+        control = [r for r in dataset if r.get("group") == "control"]
+        if control:
+            from collections import Counter
+            eth_counts = Counter(r.get("ethnicity","?") for r in control)
+            dom_eth, dom_count = eth_counts.most_common(1)[0]
+            dom_ratio = dom_count / len(control)
+            male_ratio = sum(r.get("gender")=="M" for r in control) / len(control)
+
+            # Stage-adjusted gap
+            weighted_gap = 0
+            total_weight = 0
+            for stg in ("I","II","III","IV"):
+                stg_rows = [r for r in control if r.get("stage") == stg]
+                dom_rows = [r for r in stg_rows if r.get("ethnicity") == dom_eth]
+                min_rows = [r for r in stg_rows if r.get("ethnicity") != dom_eth]
+                if len(dom_rows) >= 5 and len(min_rows) >= 5:
+                    d_m = sum(r.get("outcome")=="deceased" for r in dom_rows)/len(dom_rows)
+                    m_m = sum(r.get("outcome")=="deceased" for r in min_rows)/len(min_rows)
+                    w = len(stg_rows)
+                    weighted_gap += (m_m - d_m) * w
+                    total_weight += w
+
+            adj_gap = weighted_gap / total_weight if total_weight else 0.0
+
+            traces.append({
+                "thought": f"Stage-adjusted bias analysis: {dom_eth}={dom_ratio:.0%}, male={male_ratio:.0%}, stage-adjusted gap={adj_gap:.0%}. "
+                           f"Thresholds: dom≥{rules['bias_dom_threshold']:.0%}, male≥{rules['bias_male_threshold']:.0%}, gap≥{rules['bias_gap_threshold']:.0%}.",
+                "tool": "evaluate_control_arm_equity(stage_adjusted=True)"
+            })
+
+            if (dom_ratio >= rules["bias_dom_threshold"] and
+                    male_ratio >= rules["bias_male_threshold"] and
+                    adj_gap >= rules["bias_gap_threshold"]):
+                actions.append({
+                    "action_type": "flag_error", "error_type": "selection_bias",
+                    "reason": f"Control-arm skew: {dom_eth}={dom_ratio:.0%}, male={male_ratio:.0%}, stage-adjusted gap={adj_gap:.0%}",
+                    "confidence": 0.92
+                })
+                traces.append({
+                    "thought": "All three bias thresholds exceeded after stage adjustment. This is genuine selection bias, not a confounder.",
+                    "tool": "flag_error(selection_bias)"
+                })
+            else:
+                # No flag is raised here; only the reasoning trace is recorded
+                traces.append({
+                    "thought": "Stage-adjusted gap is below threshold. The apparent disparity is explained by confounding variables (e.g., stage distribution). No actionable bias.",
+                    "tool": "- (no flag)"
+                })
+
+    # Report
+    has_bias = any(a.get("error_type") == "selection_bias" for a in actions)
+    fairness = ("control-arm bias confirmed via stage-stratified analysis"
+                if has_bias else
+                "no actionable bias after stage-adjusted review - apparent disparities explained by confounders")
+    actions.append({
+        "action_type": "submit_report",
+        "report": (
+            f"Protocol-grounded audit for this episode. "
+            f"Root cause analysis: site-level data capture and scheduling control weaknesses. "
+            f"Risk assessment: protocol compliance and endpoint validity affected. "
+            f"Recommended corrective actions: quarantine impacted records, tighten enrollment-to-treatment validations, "
+            f"retrain site coordinators. Fairness review: {fairness}. "
+            f"Impact: patient safety and regulatory compliance require immediate attention."
+        )
+    })
+    traces.append({
+        "thought": "Compiling audit report with protocol grounding, root cause, risk assessment, corrective actions, and fairness reasoning.",
+        "tool": "submit_report"
+    })
+
+    return actions, traces
+
+
+# Limit total actions to max_steps
+def trim_actions(actions, traces, max_steps):
+    """Ensure we don't exceed the step budget."""
+    if len(actions) <= max_steps:
+        return actions, traces
+    # Keep non-flag actions (investigations/distributions) and the report; drop surplus flags from the end
+    non_flags = [(i,a,t) for i,(a,t) in enumerate(zip(actions,traces)) if a.get("action_type") not in ("flag_error",)]
+    flags = [(i,a,t) for i,(a,t) in enumerate(zip(actions,traces)) if a.get("action_type") == "flag_error"]
+    report = [(i,a,t) for i,(a,t) in enumerate(zip(actions,traces)) if a.get("action_type") == "submit_report"]
+
+    # Remove report from non_flags to add back at end
+    non_flags_no_report = [x for x in non_flags if x[1].get("action_type") != "submit_report"]
+
+    budget = max_steps - len(non_flags_no_report) - len(report)
+    trimmed_flags = flags[:max(0, budget)]
+
+    combined = non_flags_no_report + trimmed_flags + report
+    combined.sort(key=lambda x: x[0])
+
+    return [a for _,a,_ in combined], [t for _,_,t in combined]
+
+
+# ─── UI API Endpoints ───
+
+@app.post("/api/audit/reset")
+async def api_reset(req: ResetRequest):
+    obs = _ui_env.reset(seed=req.seed, task_id=req.task_id)
+    obs_dict = obs.model_dump()
+    # Don't send full dataset to client to keep response small
+    dataset_summary = {
+        "count": len(obs_dict.get("dataset", [])),
+        "sample": obs_dict.get("dataset", [])[:5],
+    }
+    return {
+        "observation": {
+            **{k: v for k, v in obs_dict.items() if k != "dataset"},
+            "dataset_count": dataset_summary["count"],
+        },
+        "total_errors": _ui_env._state.total_errors,
+    }
+
+
+@app.post("/api/audit/plan")
+async def api_plan(req: PlanRequest):
+    """Plan an agent's actions for a task. Returns action list + reasoning traces."""
+    # Reset environment to get fresh data
+    obs = _ui_env.reset(seed=req.seed, task_id=req.task_id)
+    obs_dict = obs.model_dump()
+    dataset = obs_dict.get("dataset", [])
+    excerpt = obs_dict.get("trial_protocol_excerpt", "")
+    rules = parse_protocol(excerpt)
+    max_steps = obs_dict.get("attempts_remaining", 20)
+
+    planners = {"naive": plan_naive, "heuristic": plan_heuristic, "full": plan_full}
+    planner = planners.get(req.agent, plan_full)
+    actions, traces = planner(dataset, rules, req.task_id)
+    actions, traces = trim_actions(actions, traces, max_steps)
+
+    return {"actions": actions, "traces": traces, "max_steps": max_steps}
+
+
+@app.post("/api/audit/step")
+async def api_step(req: StepRequest):
+    """Execute a single step in the current episode."""
+    action = AuditAction(
+        action_type=req.action_type,
+        patient_id=req.patient_id,
+        error_type=req.error_type,
+        reason=req.reason,
+        proposed_value=req.proposed_value,
+        variable=req.variable,
+        report=req.report,
+        confidence=req.confidence,
+    )
+    obs = _ui_env.step(action)
+    obs_dict = obs.model_dump()
+    # Don't send dataset back on each step
+    return {"observation": {k: v for k, v in obs_dict.items() if k != "dataset"}}
+
+
+@app.get("/api/tasks")
+async def api_tasks():
+    return {
+        "tasks": [
+            {"id": "task_easy", "name": "Dynamic Eligibility Screening", "difficulty": "easy", "patients": "~300"},
+            {"id": "task_medium", "name": "Protocol Timeline Audit", "difficulty": "medium", "patients": "~480"},
+            {"id": "task_hard", "name": "Equity + Protocol Audit", "difficulty": "hard", "patients": "~720"},
+        ]
+    }
+
+
 def main():
     uvicorn.run(app, host="0.0.0.0", port=8000)
 
+
 if __name__ == "__main__":
     main()
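Review note: the stage-adjusted gap that `plan_full` computes in `server/app.py` can be tried in isolation. The sketch below is a minimal standalone version under stated assumptions: the function name `stage_adjusted_gap`, the helper `rows`, the toy cohort, and the `"alive"` outcome label are hypothetical and do not appear in the commit. Only the field names (`ethnicity`, `stage`, `outcome`), the `"deceased"` label, and the minimum of 5 rows per cell come from the diff.

```python
from collections import Counter


def stage_adjusted_gap(control, stages=("I", "II", "III", "IV"), min_cell=5):
    """Stage-weighted mortality gap: other ethnicities minus the dominant one."""
    if not control:
        return 0.0
    # Dominant ethnicity = most frequent value in the control arm.
    dom_eth, _ = Counter(r.get("ethnicity", "?") for r in control).most_common(1)[0]
    weighted_gap = 0.0
    total_weight = 0
    for stage in stages:
        stage_rows = [r for r in control if r.get("stage") == stage]
        dom = [r for r in stage_rows if r.get("ethnicity") == dom_eth]
        rest = [r for r in stage_rows if r.get("ethnicity") != dom_eth]
        # Skip strata where either cell is too small to compare.
        if len(dom) >= min_cell and len(rest) >= min_cell:
            dom_rate = sum(r.get("outcome") == "deceased" for r in dom) / len(dom)
            rest_rate = sum(r.get("outcome") == "deceased" for r in rest) / len(rest)
            weighted_gap += (rest_rate - dom_rate) * len(stage_rows)
            total_weight += len(stage_rows)
    return weighted_gap / total_weight if total_weight else 0.0


def rows(ethnicity, stage, n, deaths):
    """Hypothetical helper: n control-arm rows, the first `deaths` of them deceased."""
    return [
        {"ethnicity": ethnicity, "stage": stage,
         "outcome": "deceased" if i < deaths else "alive"}
        for i in range(n)
    ]


def mortality(cohort):
    """Share of rows in the cohort whose outcome is 'deceased'."""
    return sum(r["outcome"] == "deceased" for r in cohort) / len(cohort)


# Toy cohort: group B is concentrated in stage IV, so its crude mortality is
# higher even though the per-stage mortality rates are identical.
confounded = (
    rows("A", "I", 30, 3) + rows("A", "IV", 10, 5)
    + rows("B", "I", 10, 1) + rows("B", "IV", 20, 10)
)
crude_gap = (
    mortality([r for r in confounded if r["ethnicity"] == "B"])
    - mortality([r for r in confounded if r["ethnicity"] == "A"])
)
adjusted_gap = stage_adjusted_gap(confounded)
print(f"crude gap={crude_gap:.1%}, stage-adjusted gap={adjusted_gap:.1%}")
```

In this toy cohort the crude comparison shows a sizeable gap while the stage-adjusted gap is zero, which is the kind of confounded case the "no actionable bias" trace in the diff is meant to describe.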
|
server/static/index.html
ADDED
|
@@ -0,0 +1,818 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+<title>ClinicalBench – Agentic Clinical Trial Audit Benchmark</title>
+<meta name="description" content="A benchmark for evaluating agentic reasoning in safety-critical clinical workflows. OpenEnv environment for Phase III oncology trial auditing.">
+<link rel="preconnect" href="https://fonts.googleapis.com">
+<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+<link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&family=JetBrains+Mono:wght@400;500;600&display=swap" rel="stylesheet">
+<style>
+*,*::before,*::after{box-sizing:border-box;margin:0;padding:0}
+:root{
+  --bg-root:#060a13;
+  --bg-surface:#0c1120;
+  --bg-card:#111827;
+  --bg-card-hover:#161d2e;
+  --border:rgba(255,255,255,0.06);
+  --border-accent:rgba(59,130,246,0.25);
+  --text-primary:#f1f5f9;
+  --text-secondary:#94a3b8;
+  --text-muted:#64748b;
+  --accent-blue:#3b82f6;
+  --accent-green:#10b981;
+  --accent-gradient:linear-gradient(135deg,#3b82f6,#10b981);
+  --accent-gradient-h:linear-gradient(90deg,#3b82f6,#10b981);
+  --danger:#ef4444;
+  --warning:#f59e0b;
+  --success:#10b981;
+  --info:#3b82f6;
+  --font-sans:'Inter',system-ui,-apple-system,sans-serif;
+  --font-mono:'JetBrains Mono',ui-monospace,monospace;
+  --radius:10px;
+  --radius-sm:6px;
+  --radius-lg:14px;
+  --shadow:0 4px 24px rgba(0,0,0,0.4);
+  --glow-blue:0 0 20px rgba(59,130,246,0.15);
+  --glow-green:0 0 20px rgba(16,185,129,0.15);
+}
+html,body{height:100%;overflow:hidden;background:var(--bg-root);color:var(--text-primary);font-family:var(--font-sans)}
+body{display:flex;flex-direction:column}
+
+/* ─── HEADER ─── */
+.header{
+  display:flex;align-items:center;justify-content:space-between;
+  padding:12px 24px;
+  background:var(--bg-surface);
+  border-bottom:1px solid var(--border);
+  flex-shrink:0;
+  position:relative;
+  z-index:10;
+}
+.header::after{
+  content:'';position:absolute;bottom:0;left:0;right:0;height:1px;
+  background:var(--accent-gradient-h);opacity:0.4;
+}
+.header-brand{display:flex;align-items:center;gap:12px}
+.header-logo{
+  width:36px;height:36px;border-radius:8px;
+  background:var(--accent-gradient);
+  display:flex;align-items:center;justify-content:center;
+  font-size:18px;font-weight:800;color:#fff;
+  box-shadow:var(--glow-blue);
+}
+.header-title{font-size:16px;font-weight:700;letter-spacing:-0.02em}
+.header-subtitle{font-size:11px;color:var(--text-muted);font-weight:500;letter-spacing:0.03em;text-transform:uppercase}
+.header-badge{
+  padding:4px 10px;border-radius:20px;font-size:10px;font-weight:600;
+  background:rgba(16,185,129,0.12);color:var(--accent-green);
+  border:1px solid rgba(16,185,129,0.2);
+  letter-spacing:0.04em;text-transform:uppercase;
+}
+.header-meta{display:flex;align-items:center;gap:16px}
+.header-stat{text-align:right}
+.header-stat-val{font-size:13px;font-weight:600;font-family:var(--font-mono);color:var(--text-primary)}
+.header-stat-label{font-size:10px;color:var(--text-muted);text-transform:uppercase;letter-spacing:0.05em}
+
+/* ─── MAIN GRID ─── */
+.main{flex:1;display:grid;grid-template-columns:280px 1fr 300px;gap:0;overflow:hidden}
+
+/* ─── PANELS ─── */
+.panel{
+  display:flex;flex-direction:column;overflow:hidden;
+  border-right:1px solid var(--border);
+  background:var(--bg-surface);
+}
+.panel:last-child{border-right:none}
+.panel-header{
+  padding:14px 18px;
+  border-bottom:1px solid var(--border);
+  flex-shrink:0;
+}
+.panel-header h2{
+  font-size:11px;font-weight:600;text-transform:uppercase;
+  letter-spacing:0.08em;color:var(--text-muted);
+  display:flex;align-items:center;gap:8px;
+}
+.panel-header h2 .dot{
+  width:6px;height:6px;border-radius:50%;
+  background:var(--accent-green);
+  box-shadow:0 0 6px var(--accent-green);
+  animation:pulse-dot 2s ease-in-out infinite;
+}
+@keyframes pulse-dot{0%,100%{opacity:1}50%{opacity:0.4}}
+.panel-body{flex:1;overflow-y:auto;padding:14px 18px}
+.panel-body::-webkit-scrollbar{width:4px}
+.panel-body::-webkit-scrollbar-track{background:transparent}
+.panel-body::-webkit-scrollbar-thumb{background:rgba(255,255,255,0.1);border-radius:4px}
+
+/* ─── LEFT PANEL: PROTOCOL ─── */
+.protocol-card{
+  background:var(--bg-card);border:1px solid var(--border);
+  border-radius:var(--radius);padding:14px;margin-bottom:12px;
+}
+.protocol-card-title{
+  font-size:10px;font-weight:600;color:var(--text-muted);
+  text-transform:uppercase;letter-spacing:0.06em;margin-bottom:8px;
+}
+.protocol-id{
+  font-family:var(--font-mono);font-size:14px;font-weight:600;
+  background:var(--accent-gradient);-webkit-background-clip:text;
+  -webkit-text-fill-color:transparent;margin-bottom:4px;
+}
+.protocol-excerpt{
+  font-family:var(--font-mono);font-size:11px;line-height:1.65;
+  color:var(--text-secondary);white-space:pre-wrap;word-break:break-word;
+}
+.protocol-excerpt .hl-rule{
+  color:var(--accent-green);font-weight:600;
+  background:rgba(16,185,129,0.08);padding:1px 3px;border-radius:3px;
+}
+.protocol-excerpt .hl-danger{
+  color:var(--danger);font-weight:600;
+}
+.episode-meta{
+  display:grid;grid-template-columns:1fr 1fr;gap:8px;margin-top:12px;
+}
+.meta-chip{
+  background:var(--bg-card);border:1px solid var(--border);
+  border-radius:var(--radius-sm);padding:8px 10px;
+}
+.meta-chip-label{font-size:9px;color:var(--text-muted);text-transform:uppercase;letter-spacing:0.06em}
+.meta-chip-value{font-size:13px;font-weight:600;font-family:var(--font-mono);margin-top:2px}
+
+/* ─── CENTER PANEL: LIVE FEED ─── */
+.controls{
+  display:flex;gap:10px;align-items:center;
+  padding:14px 18px;border-bottom:1px solid var(--border);
+  flex-shrink:0;
+}
+.control-select{
+  flex:1;padding:8px 12px;border-radius:var(--radius-sm);
+  background:var(--bg-card);border:1px solid var(--border);
+  color:var(--text-primary);font-family:var(--font-sans);font-size:12px;
+  cursor:pointer;appearance:none;
+  background-image:url("data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' width='12' height='12' fill='%2394a3b8'%3E%3Cpath d='M2 4l4 4 4-4'/%3E%3C/svg%3E");
+  background-repeat:no-repeat;background-position:right 10px center;
+  padding-right:28px;
+}
+.control-select:focus{outline:none;border-color:var(--accent-blue)}
+.btn-start{
+  padding:8px 20px;border:none;border-radius:var(--radius-sm);
+  background:var(--accent-gradient);color:#fff;font-weight:600;
+  font-size:12px;cursor:pointer;position:relative;overflow:hidden;
+  transition:transform 0.15s,box-shadow 0.15s;
+  box-shadow:var(--glow-blue);font-family:var(--font-sans);
+}
+.btn-start:hover{transform:translateY(-1px);box-shadow:0 0 30px rgba(59,130,246,0.3)}
+.btn-start:active{transform:scale(0.97)}
+.btn-start:disabled{opacity:0.5;cursor:not-allowed;transform:none}
+.btn-start.running{animation:glow-pulse 1.5s ease-in-out infinite}
+@keyframes glow-pulse{0%,100%{box-shadow:var(--glow-blue)}50%{box-shadow:0 0 30px rgba(59,130,246,0.4)}}
+
+.feed{flex:1;overflow-y:auto;padding:14px 18px}
+.feed::-webkit-scrollbar{width:4px}
+.feed::-webkit-scrollbar-track{background:transparent}
+.feed::-webkit-scrollbar-thumb{background:rgba(255,255,255,0.1);border-radius:4px}
+
+.feed-empty{
+  display:flex;flex-direction:column;align-items:center;justify-content:center;
+  height:100%;color:var(--text-muted);text-align:center;gap:12px;
+}
+.feed-empty-icon{font-size:40px;opacity:0.3}
+.feed-empty-text{font-size:13px;line-height:1.5}
+
+.log-card{
+  background:var(--bg-card);border:1px solid var(--border);
+  border-radius:var(--radius-sm);padding:10px 12px;margin-bottom:6px;
+  font-family:var(--font-mono);font-size:11px;line-height:1.5;
+  animation:card-in 0.25s ease-out;
+  border-left:3px solid transparent;
+}
+@keyframes card-in{from{opacity:0;transform:translateY(8px)}to{opacity:1;transform:translateY(0)}}
+.log-card.type-thought{border-left-color:var(--info);color:var(--text-secondary)}
+.log-card.type-tool{border-left-color:#8b5cf6;color:var(--text-secondary)}
+.log-card.type-observe{border-left-color:var(--text-muted);color:var(--text-secondary)}
+.log-card.type-flag-ok{border-left-color:var(--success);color:var(--success)}
+.log-card.type-flag-bad{border-left-color:var(--danger);color:var(--danger)}
+.log-card.type-report{border-left-color:var(--accent-green);color:var(--accent-green)}
+.log-card.type-info{border-left-color:var(--text-muted);color:var(--text-muted)}
+.log-card.type-phase{
+  border-left-color:var(--warning);color:var(--warning);
+  background:rgba(245,158,11,0.05);
+}
+.log-tag{
+  font-weight:600;font-size:10px;text-transform:uppercase;
+  letter-spacing:0.04em;margin-right:6px;
+}
+.log-score{
+  float:right;font-weight:600;font-size:10px;
+  padding:2px 6px;border-radius:3px;
+  background:rgba(16,185,129,0.1);color:var(--accent-green);
+}
+
+.agent-divider{
+  text-align:center;padding:14px 0;font-size:11px;font-weight:600;
+  color:var(--text-muted);text-transform:uppercase;letter-spacing:0.08em;
+  display:flex;align-items:center;gap:12px;
+}
+.agent-divider::before,.agent-divider::after{
+  content:'';flex:1;height:1px;
+  background:var(--border);
+}
+
+/* ─── RIGHT PANEL: ANALYTICS ─── */
+.gauge-container{
+  display:flex;flex-direction:column;align-items:center;
+  margin-bottom:16px;
+}
+.gauge-svg{width:180px;height:100px}
+.gauge-label{font-size:10px;color:var(--text-muted);text-transform:uppercase;letter-spacing:0.06em;margin-top:4px}
+.gauge-value{font-size:28px;font-weight:700;font-family:var(--font-mono)}
+
+.mini-gauges{display:grid;grid-template-columns:1fr 1fr;gap:10px;margin-bottom:18px}
+.mini-gauge{
+  background:var(--bg-card);border:1px solid var(--border);
+  border-radius:var(--radius-sm);padding:10px;text-align:center;
+}
+.mini-gauge-label{font-size:9px;color:var(--text-muted);text-transform:uppercase;letter-spacing:0.06em}
+.mini-gauge-value{font-size:18px;font-weight:700;font-family:var(--font-mono);margin-top:4px}
+.mini-gauge-bar{
+  height:3px;border-radius:2px;background:rgba(255,255,255,0.06);
+  margin-top:6px;overflow:hidden;
+}
+.mini-gauge-fill{height:100%;border-radius:2px;transition:width 0.6s ease}
+
+.comparison-card{
+  background:var(--bg-card);border:1px solid var(--border);
+  border-radius:var(--radius);padding:14px;margin-bottom:12px;
+}
+.comparison-title{
+  font-size:10px;font-weight:600;color:var(--text-muted);
+  text-transform:uppercase;letter-spacing:0.06em;margin-bottom:12px;
+}
+.bar-row{display:flex;align-items:center;gap:10px;margin-bottom:8px}
+.bar-label{font-size:11px;font-family:var(--font-mono);min-width:72px;color:var(--text-secondary)}
+.bar-track{flex:1;height:18px;background:rgba(255,255,255,0.04);border-radius:3px;overflow:hidden;position:relative}
+.bar-fill{height:100%;border-radius:3px;transition:width 1s ease;position:relative}
+.bar-fill.naive{background:linear-gradient(90deg,#ef4444,#f97316);width:10%}
+.bar-fill.heuristic{background:linear-gradient(90deg,#f59e0b,#eab308);width:60%}
+.bar-fill.full{background:var(--accent-gradient-h);width:98%}
+.bar-val{
+  font-size:10px;font-weight:600;font-family:var(--font-mono);
+  min-width:32px;text-align:right;
+}
+
+.task-results-table{width:100%;border-collapse:collapse;margin-top:10px}
+.task-results-table th{
+  font-size:9px;color:var(--text-muted);text-transform:uppercase;
+  letter-spacing:0.06em;text-align:right;padding:4px 6px;
+  border-bottom:1px solid var(--border);font-weight:600;
+}
+.task-results-table th:first-child{text-align:left}
+.task-results-table td{
+  font-size:11px;font-family:var(--font-mono);padding:5px 6px;
+  text-align:right;border-bottom:1px solid rgba(255,255,255,0.03);
+}
+.task-results-table td:first-child{text-align:left;color:var(--text-secondary);font-family:var(--font-sans);font-weight:500}
+.score-high{color:var(--accent-green)}
+.score-mid{color:var(--warning)}
+.score-low{color:var(--danger)}
+
+.insight-box{
+  background:rgba(59,130,246,0.05);border:1px solid rgba(59,130,246,0.15);
+  border-radius:var(--radius-sm);padding:10px 12px;margin-top:12px;
+  font-size:11px;line-height:1.55;color:var(--text-secondary);
+}
+.insight-box strong{color:var(--text-primary)}
+
+/* ─── STATUS BAR ─── */
+.status-bar{
+  display:flex;align-items:center;justify-content:space-between;
+  padding:6px 24px;background:var(--bg-root);border-top:1px solid var(--border);
+  font-size:10px;color:var(--text-muted);flex-shrink:0;
+  font-family:var(--font-mono);
+}
+.status-dot{
+  display:inline-block;width:6px;height:6px;border-radius:50%;
+  margin-right:6px;
+}
+.status-dot.online{background:var(--accent-green);box-shadow:0 0 6px var(--accent-green)}
+.status-dot.offline{background:var(--danger)}
+
+/* ─── RESPONSIVE ─── */
+@media(max-width:1200px){
+  .main{grid-template-columns:240px 1fr 260px}
+}
+@media(max-width:900px){
+  .main{grid-template-columns:1fr;grid-template-rows:auto 1fr auto}
+  .panel{border-right:none;border-bottom:1px solid var(--border)}
+}
+</style>
+</head>
+<body>
+
+<!-- ─── HEADER ─── -->
+<header class="header">
+  <div class="header-brand">
+    <div class="header-logo">CB</div>
+    <div>
+      <div class="header-title">ClinicalBench</div>
+      <div class="header-subtitle">Agentic Clinical Trial Audit Benchmark</div>
+    </div>
+    <span class="header-badge">OpenEnv v3</span>
+  </div>
+  <div class="header-meta">
+    <div class="header-stat">
+      <div class="header-stat-val" id="stat-tasks">3 Tasks</div>
+      <div class="header-stat-label">Easy → Hard</div>
+    </div>
+    <div class="header-stat">
+      <div class="header-stat-val" id="stat-patients">300–720</div>
+      <div class="header-stat-label">Patients/Episode</div>
+    </div>
+    <div class="header-stat">
+      <div class="header-stat-val" id="stat-seed">–</div>
+      <div class="header-stat-label">Seed</div>
+    </div>
+  </div>
+</header>
+
|
| 342 |
+
<!-- βββ MAIN 3-PANEL βββ -->
|
| 343 |
+
<main class="main">
|
| 344 |
+
|
| 345 |
+
<!-- βββ LEFT: PROTOCOL MANIFEST βββ -->
|
| 346 |
+
<div class="panel" id="panel-protocol">
|
| 347 |
+
<div class="panel-header">
|
| 348 |
+
<h2><span class="dot"></span>Active Episode Protocol</h2>
|
| 349 |
+
</div>
|
| 350 |
+
<div class="panel-body">
|
| 351 |
+
<div class="protocol-card">
|
| 352 |
+
<div class="protocol-card-title">Protocol ID</div>
|
| 353 |
+
<div class="protocol-id" id="proto-id">Awaiting reset()</div>
|
| 354 |
+
</div>
|
| 355 |
+
<div class="protocol-card">
|
| 356 |
+
<div class="protocol-card-title">Trial Protocol Excerpt</div>
|
| 357 |
+
<div class="protocol-excerpt" id="proto-excerpt">
|
| 358 |
+
Start an audit to load the episode-specific protocol.
|
| 359 |
+
|
| 360 |
+
Each episode generates a unique protocol with dynamic rules:
|
| 361 |
+
• Age eligibility ranges change per episode
|
| 362 |
+
• Treatment scheduling windows vary
|
| 363 |
+
• Stage IV exceptions create valid edge cases
|
| 364 |
+
• Bias thresholds are protocol-specific
|
| 365 |
+
|
| 366 |
+
The agent must READ these rules, not assume defaults.</div>
|
| 367 |
+
</div>
|
| 368 |
+
<div class="episode-meta">
|
| 369 |
+
<div class="meta-chip">
|
| 370 |
+
<div class="meta-chip-label">Difficulty</div>
|
| 371 |
+
<div class="meta-chip-value" id="meta-difficulty">–</div>
|
| 372 |
+
</div>
|
| 373 |
+
<div class="meta-chip">
|
| 374 |
+
<div class="meta-chip-label">Patients</div>
|
| 375 |
+
<div class="meta-chip-value" id="meta-patients">–</div>
|
| 376 |
+
</div>
|
| 377 |
+
<div class="meta-chip">
|
| 378 |
+
<div class="meta-chip-label">Max Steps</div>
|
| 379 |
+
<div class="meta-chip-value" id="meta-steps">–</div>
|
| 380 |
+
</div>
|
| 381 |
+
<div class="meta-chip">
|
| 382 |
+
<div class="meta-chip-label">Errors</div>
|
| 383 |
+
<div class="meta-chip-value" id="meta-errors">–</div>
|
| 384 |
+
</div>
|
| 385 |
+
</div>
|
| 386 |
+
</div>
|
| 387 |
+
</div>
|
| 388 |
+
|
| 389 |
+
<!-- === CENTER: LIVE AUDIT TELEMETRY === -->
|
| 390 |
+
<div class="panel" id="panel-feed" style="border-right:1px solid var(--border)">
|
| 391 |
+
<div class="panel-header">
|
| 392 |
+
<h2><span class="dot"></span>Live Agent Telemetry</h2>
|
| 393 |
+
</div>
|
| 394 |
+
<div class="controls">
|
| 395 |
+
<select class="control-select" id="sel-agent">
|
| 396 |
+
<option value="all">▶ All Agents (Comparison Run)</option>
|
| 397 |
+
<option value="naive">Naive LLM Agent</option>
|
| 398 |
+
<option value="heuristic">Heuristic Agent</option>
|
| 399 |
+
<option value="full">Reasoning Agent (Full)</option>
|
| 400 |
+
</select>
|
| 401 |
+
<select class="control-select" id="sel-task">
|
| 402 |
+
<option value="all">All Tasks</option>
|
| 403 |
+
<option value="task_easy">Easy – Eligibility Screening</option>
|
| 404 |
+
<option value="task_medium">Medium – Timeline Audit</option>
|
| 405 |
+
<option value="task_hard">Hard – Equity + Protocol</option>
|
| 406 |
+
</select>
|
| 407 |
+
<button class="btn-start" id="btn-start" onclick="startAudit()">
|
| 408 |
+
▶ Start Audit
|
| 409 |
+
</button>
|
| 410 |
+
</div>
|
| 411 |
+
<div class="feed" id="feed">
|
| 412 |
+
<div class="feed-empty">
|
| 413 |
+
<div class="feed-empty-icon">🔬</div>
|
| 414 |
+
<div class="feed-empty-text">
|
| 415 |
+
Select an agent and task, then click <strong>Start Audit</strong><br>
|
| 416 |
+
to watch the reasoning loop in real time.<br><br>
|
| 417 |
+
<span style="color:var(--text-muted);font-size:11px">
|
| 418 |
+
The benchmark runs <strong>Naive → Heuristic → Reasoning</strong> agents<br>
|
| 419 |
+
against procedurally generated clinical trial data.
|
| 420 |
+
</span>
|
| 421 |
+
</div>
|
| 422 |
+
</div>
|
| 423 |
+
</div>
|
| 424 |
+
</div>
|
| 425 |
+
|
| 426 |
+
<!-- === RIGHT: ANALYTICS === -->
|
| 427 |
+
<div class="panel" id="panel-analytics">
|
| 428 |
+
<div class="panel-header">
|
| 429 |
+
<h2><span class="dot"></span>Evaluation Metrics</h2>
|
| 430 |
+
</div>
|
| 431 |
+
<div class="panel-body">
|
| 432 |
+
<!-- Main Score Gauge -->
|
| 433 |
+
<div class="gauge-container">
|
| 434 |
+
<svg class="gauge-svg" viewBox="0 0 200 110">
|
| 435 |
+
<defs>
|
| 436 |
+
<linearGradient id="gaugeGrad" x1="0%" y1="0%" x2="100%" y2="0%">
|
| 437 |
+
<stop offset="0%" stop-color="#ef4444"/>
|
| 438 |
+
<stop offset="40%" stop-color="#f59e0b"/>
|
| 439 |
+
<stop offset="100%" stop-color="#10b981"/>
|
| 440 |
+
</linearGradient>
|
| 441 |
+
</defs>
|
| 442 |
+
<!-- Track -->
|
| 443 |
+
<path d="M 20 100 A 80 80 0 0 1 180 100" fill="none" stroke="rgba(255,255,255,0.06)" stroke-width="10" stroke-linecap="round"/>
|
| 444 |
+
<!-- Fill -->
|
| 445 |
+
<path id="gauge-fill" d="M 20 100 A 80 80 0 0 1 180 100" fill="none" stroke="url(#gaugeGrad)" stroke-width="10" stroke-linecap="round"
|
| 446 |
+
stroke-dasharray="251.3" stroke-dashoffset="251.3" style="transition:stroke-dashoffset 0.8s ease"/>
|
| 447 |
+
<!-- Value -->
|
| 448 |
+
<text x="100" y="85" text-anchor="middle" fill="var(--text-primary)" font-family="var(--font-mono)" font-size="28" font-weight="700" id="gauge-text">0.00</text>
|
| 449 |
+
<text x="100" y="102" text-anchor="middle" fill="var(--text-muted)" font-family="var(--font-sans)" font-size="10" font-weight="600" letter-spacing="0.08em">BENCHMARK SCORE</text>
|
| 450 |
+
</svg>
|
| 451 |
+
</div>
|
| 452 |
+
|
| 453 |
+
<!-- Mini Gauges -->
|
| 454 |
+
<div class="mini-gauges">
|
| 455 |
+
<div class="mini-gauge">
|
| 456 |
+
<div class="mini-gauge-label">Precision</div>
|
| 457 |
+
<div class="mini-gauge-value" id="mg-precision">–</div>
|
| 458 |
+
<div class="mini-gauge-bar"><div class="mini-gauge-fill" id="mg-precision-bar" style="width:0;background:var(--accent-blue)"></div></div>
|
| 459 |
+
</div>
|
| 460 |
+
<div class="mini-gauge">
|
| 461 |
+
<div class="mini-gauge-label">Recall</div>
|
| 462 |
+
<div class="mini-gauge-value" id="mg-recall">–</div>
|
| 463 |
+
<div class="mini-gauge-bar"><div class="mini-gauge-fill" id="mg-recall-bar" style="width:0;background:var(--accent-green)"></div></div>
|
| 464 |
+
</div>
|
| 465 |
+
<div class="mini-gauge">
|
| 466 |
+
<div class="mini-gauge-label">Workflow</div>
|
| 467 |
+
<div class="mini-gauge-value" id="mg-workflow">–</div>
|
| 468 |
+
<div class="mini-gauge-bar"><div class="mini-gauge-fill" id="mg-workflow-bar" style="width:0;background:#8b5cf6"></div></div>
|
| 469 |
+
</div>
|
| 470 |
+
<div class="mini-gauge">
|
| 471 |
+
<div class="mini-gauge-label">Efficiency</div>
|
| 472 |
+
<div class="mini-gauge-value" id="mg-efficiency">–</div>
|
| 473 |
+
<div class="mini-gauge-bar"><div class="mini-gauge-fill" id="mg-efficiency-bar" style="width:0;background:var(--warning)"></div></div>
|
| 474 |
+
</div>
|
| 475 |
+
</div>
|
| 476 |
+
|
| 477 |
+
<!-- LLM Capability Gap Chart -->
|
| 478 |
+
<div class="comparison-card">
|
| 479 |
+
<div class="comparison-title">⚡ LLM Capability Gap (Average Score)</div>
|
| 480 |
+
<div class="bar-row">
|
| 481 |
+
<div class="bar-label">Naive</div>
|
| 482 |
+
<div class="bar-track"><div class="bar-fill naive" id="bar-naive"></div></div>
|
| 483 |
+
<div class="bar-val score-low" id="bar-naive-val">0.10</div>
|
| 484 |
+
</div>
|
| 485 |
+
<div class="bar-row">
|
| 486 |
+
<div class="bar-label">Heuristic</div>
|
| 487 |
+
<div class="bar-track"><div class="bar-fill heuristic" id="bar-heuristic"></div></div>
|
| 488 |
+
<div class="bar-val score-mid" id="bar-heuristic-val">0.60</div>
|
| 489 |
+
</div>
|
| 490 |
+
<div class="bar-row">
|
| 491 |
+
<div class="bar-label">Reasoning</div>
|
| 492 |
+
<div class="bar-track"><div class="bar-fill full" id="bar-full"></div></div>
|
| 493 |
+
<div class="bar-val score-high" id="bar-full-val">0.98</div>
|
| 494 |
+
</div>
|
| 495 |
+
</div>
|
| 496 |
+
|
| 497 |
+
<!-- Detailed Results Table -->
|
| 498 |
+
<div class="comparison-card">
|
| 499 |
+
<div class="comparison-title">📊 Per-Task Breakdown</div>
|
| 500 |
+
<table class="task-results-table" id="results-table">
|
| 501 |
+
<thead>
|
| 502 |
+
<tr><th>Agent</th><th>Easy</th><th>Med</th><th>Hard</th><th>Avg</th></tr>
|
| 503 |
+
</thead>
|
| 504 |
+
<tbody>
|
| 505 |
+
<tr>
|
| 506 |
+
<td>Naive</td>
|
| 507 |
+
<td class="score-low">0.19</td><td class="score-low">0.06</td>
|
| 508 |
+
<td class="score-low">0.06</td><td class="score-low">0.10</td>
|
| 509 |
+
</tr>
|
| 510 |
+
<tr>
|
| 511 |
+
<td>Heuristic</td>
|
| 512 |
+
<td class="score-mid">0.81</td><td class="score-mid">0.56</td>
|
| 513 |
+
<td class="score-mid">0.45</td><td class="score-mid">0.60</td>
|
| 514 |
+
</tr>
|
| 515 |
+
<tr>
|
| 516 |
+
<td>Reasoning</td>
|
| 517 |
+
<td class="score-high">0.97</td><td class="score-high">0.97</td>
|
| 518 |
+
<td class="score-high">0.98</td><td class="score-high">0.98</td>
|
| 519 |
+
</tr>
|
| 520 |
+
</tbody>
|
| 521 |
+
</table>
|
| 522 |
+
<div class="insight-box">
|
| 523 |
+
<strong>Key finding:</strong> The 0.88 gap in average score between the naive LLM (0.10) and the tool-augmented reasoning agent (0.98) suggests that structured protocol comprehension and staged investigation are <strong>necessary</strong> for clinical audit tasks; raw language modeling alone is insufficient.
|
| 524 |
+
</div>
|
| 525 |
+
</div>
|
| 526 |
+
</div>
|
| 527 |
+
</div>
|
| 528 |
+
|
| 529 |
+
</main>
|
| 530 |
+
|
| 531 |
+
<!-- === STATUS BAR === -->
|
| 532 |
+
<div class="status-bar">
|
| 533 |
+
<div>
|
| 534 |
+
<span class="status-dot online" id="status-dot"></span>
|
| 535 |
+
<span id="status-text">Environment ready</span>
|
| 536 |
+
</div>
|
| 537 |
+
<div>OpenEnv Spec v3 · Phase III Oncology · Procedural Generation</div>
|
| 538 |
+
<div id="status-time"></div>
|
| 539 |
+
</div>
|
| 540 |
+
|
| 541 |
+
<script>
|
| 542 |
+
// ===============================================================
|
| 543 |
+
// ClinicalBench Dashboard: Vanilla JS
|
| 544 |
+
// ===============================================================
|
| 545 |
+
|
| 546 |
+
const BASE = window.location.origin;
|
| 547 |
+
const AGENTS = {naive:'Naive LLM',heuristic:'Heuristic',full:'Reasoning Agent'};
|
| 548 |
+
const TASKS = {
|
| 549 |
+
task_easy:{name:'Dynamic Eligibility Screening',difficulty:'easy'},
|
| 550 |
+
task_medium:{name:'Protocol Timeline Audit',difficulty:'medium'},
|
| 551 |
+
task_hard:{name:'Equity + Protocol Audit',difficulty:'hard'}
|
| 552 |
+
};
|
| 553 |
+
const SEED = 20260402;
|
| 554 |
+
let running = false;
|
| 555 |
+
let allResults = {};
|
| 556 |
+
|
| 557 |
+
// === Utilities ===
|
| 558 |
+
function $(id){return document.getElementById(id)}
|
| 559 |
+
function qs(sel){return document.querySelector(sel)}
|
| 560 |
+
|
| 561 |
+
function highlightProtocol(text){
|
| 562 |
+
return text
|
| 563 |
+
.replace(/age (\d+-\d+) inclusive/g,'age <span class="hl-rule">$1</span> inclusive')
|
| 564 |
+
.replace(/within (\d+) days/g,'within <span class="hl-rule">$1 days</span>')
|
| 565 |
+
.replace(/(Stage IV exception)/g,'<span class="hl-rule">$1</span>')
|
| 566 |
+
.replace(/(death_date must never precede treatment_start)/g,'<span class="hl-danger">$1</span>')
|
| 567 |
+
.replace(/dominance exceeds (\d+)%/g,'dominance exceeds <span class="hl-rule">$1%</span>')
|
| 568 |
+
.replace(/male share exceeds (\d+)%/g,'male share exceeds <span class="hl-rule">$1%</span>')
|
| 569 |
+
.replace(/gap exceeds (\d+) percentage/g,'gap exceeds <span class="hl-rule">$1</span> percentage')
|
| 570 |
+
.replace(/(Missing age is a protocol violation)/g,'<span class="hl-danger">$1</span>');
|
| 571 |
+
}
|
| 572 |
+
|
| 573 |
+
function updateGauge(score){
|
| 574 |
+
const maxDash = 251.3;
|
| 575 |
+
const offset = maxDash - (maxDash * Math.min(1, Math.max(0, score)));
|
| 576 |
+
$('gauge-fill').style.strokeDashoffset = offset;
|
| 577 |
+
$('gauge-text').textContent = score.toFixed(2);
|
| 578 |
+
}
|
| 579 |
+
|
| 580 |
+
function updateMiniGauge(id, value){
|
| 581 |
+
const el = $(id);
|
| 582 |
+
const bar = $(id + '-bar');
|
| 583 |
+
if(el) el.textContent = (typeof value==='number') ? value.toFixed(3) : (value ?? '–'); // avoid rendering "undefined" when a metric is missing
|
| 584 |
+
if(bar) bar.style.width = ((typeof value==='number' ? value : 0) * 100) + '%';
|
| 585 |
+
}
|
| 586 |
+
|
| 587 |
+
function setStatus(text, online=true){
|
| 588 |
+
$('status-text').textContent = text;
|
| 589 |
+
$('status-dot').className = 'status-dot ' + (online?'online':'offline');
|
| 590 |
+
}
|
| 591 |
+
|
| 592 |
+
function addLog(type, tag, text, score){
|
| 593 |
+
const feed = $('feed');
|
| 594 |
+
if(feed.querySelector('.feed-empty')) feed.innerHTML = '';
|
| 595 |
+
const card = document.createElement('div');
|
| 596 |
+
card.className = 'log-card type-' + type;
|
| 597 |
+
let html = '<span class="log-tag">[' + tag + ']</span>';
|
| 598 |
+
if(score !== undefined) html += '<span class="log-score">' + score.toFixed(2) + '</span>';
|
| 599 |
+
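html += String(text).replace(/&/g,'&amp;').replace(/</g,'&lt;'); // escape server-provided text before it reaches innerHTML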
|
| 600 |
+
card.innerHTML = html;
|
| 601 |
+
feed.appendChild(card);
|
| 602 |
+
feed.scrollTop = feed.scrollHeight;
|
| 603 |
+
}
|
| 604 |
+
|
| 605 |
+
function addDivider(text){
|
| 606 |
+
const feed = $('feed');
|
| 607 |
+
const div = document.createElement('div');
|
| 608 |
+
div.className = 'agent-divider';
|
| 609 |
+
div.textContent = text;
|
| 610 |
+
feed.appendChild(div);
|
| 611 |
+
feed.scrollTop = feed.scrollHeight;
|
| 612 |
+
}
|
| 613 |
+
|
| 614 |
+
function updateProtocol(obs){
|
| 615 |
+
$('proto-id').textContent = obs.protocol_title || '–';
|
| 616 |
+
$('proto-excerpt').innerHTML = highlightProtocol(obs.trial_protocol_excerpt || '');
|
| 617 |
+
$('meta-difficulty').textContent = obs.task_type || '–';
|
| 618 |
+
$('meta-patients').textContent = (obs.dataset||[]).length || '–';
|
| 619 |
+
$('meta-steps').textContent = obs.attempts_remaining || '–';
|
| 620 |
+
}
|
| 621 |
+
|
| 622 |
+
function updateMetrics(bd){
|
| 623 |
+
if(!bd) return;
|
| 624 |
+
updateMiniGauge('mg-precision', bd.precision);
|
| 625 |
+
updateMiniGauge('mg-recall', bd.recall);
|
| 626 |
+
updateMiniGauge('mg-workflow', bd.workflow);
|
| 627 |
+
updateMiniGauge('mg-efficiency', bd.efficiency);
|
| 628 |
+
}
|
| 629 |
+
|
| 630 |
+
function updateBars(results){
|
| 631 |
+
const agents = ['naive','heuristic','full'];
|
| 632 |
+
agents.forEach(a=>{
|
| 633 |
+
if(results[a]){
|
| 634 |
+
const avg = results[a].avg || 0;
|
| 635 |
+
const bar = $('bar-'+a);
|
| 636 |
+
const val = $('bar-'+a+'-val');
|
| 637 |
+
if(bar) bar.style.width = (avg*100)+'%';
|
| 638 |
+
if(val) val.textContent = avg.toFixed(2);
|
| 639 |
+
}
|
| 640 |
+
});
|
| 641 |
+
}
|
| 642 |
+
|
| 643 |
+
function sleep(ms){return new Promise(r=>setTimeout(r,ms))}
|
| 644 |
+
|
| 645 |
+
// === Main Audit Runner ===
|
| 646 |
+
async function runSingleEpisode(agentMode, taskId){
|
| 647 |
+
// Reset
|
| 648 |
+
const resetPayload = {task_id:taskId, seed:SEED};
|
| 649 |
+
const resetRes = await fetch(BASE+'/api/audit/reset', {
|
| 650 |
+
method:'POST', headers:{'Content-Type':'application/json'},
|
| 651 |
+
body:JSON.stringify(resetPayload)
|
| 652 |
+
});
|
| 653 |
+
if(!resetRes.ok) throw new Error('Reset failed: HTTP ' + resetRes.status);
const resetData = await resetRes.json();
|
| 654 |
+
const obs = resetData.observation || resetData;
|
| 655 |
+
|
| 656 |
+
updateProtocol(obs);
|
| 657 |
+
$('meta-errors').textContent = resetData.total_errors || '?';
|
| 658 |
+
$('stat-seed').textContent = SEED;
|
| 659 |
+
|
| 660 |
+
addLog('info','RESET', `Episode started: ${obs.protocol_title} | ${(obs.dataset||[]).length} patients | ${obs.attempts_remaining} steps`);
|
| 661 |
+
|
| 662 |
+
// Get agent plan
|
| 663 |
+
const planRes = await fetch(BASE+'/api/audit/plan', {
|
| 664 |
+
method:'POST', headers:{'Content-Type':'application/json'},
|
| 665 |
+
body:JSON.stringify({agent:agentMode, task_id:taskId, seed:SEED})
|
| 666 |
+
});
|
| 667 |
+
if(!planRes.ok) throw new Error('Plan request failed: HTTP ' + planRes.status);
const planData = await planRes.json();
|
| 668 |
+
const actions = planData.actions || [];
|
| 669 |
+
const traces = planData.traces || [];
|
| 670 |
+
|
| 671 |
+
// Display traces and execute actions
|
| 672 |
+
let lastScore = 0;
|
| 673 |
+
let lastBreakdown = {};
|
| 674 |
+
|
| 675 |
+
for(let i=0; i<actions.length; i++){
|
| 676 |
+
if(!running) break;
|
| 677 |
+
const action = actions[i];
|
| 678 |
+
const trace = traces[i] || {};
|
| 679 |
+
|
| 680 |
+
// Show thought
|
| 681 |
+
if(trace.thought){
|
| 682 |
+
addLog('thought','THINK', trace.thought);
|
| 683 |
+
await sleep(60);
|
| 684 |
+
}
|
| 685 |
+
|
| 686 |
+
// Show tool usage
|
| 687 |
+
if(trace.tool){
|
| 688 |
+
addLog('tool','TOOL', trace.tool);
|
| 689 |
+
await sleep(40);
|
| 690 |
+
}
|
| 691 |
+
|
| 692 |
+
// Execute step
|
| 693 |
+
const stepRes = await fetch(BASE+'/api/audit/step', {
|
| 694 |
+
method:'POST', headers:{'Content-Type':'application/json'},
|
| 695 |
+
body:JSON.stringify(action)
|
| 696 |
+
});
|
| 697 |
+
if(!stepRes.ok) throw new Error('Step failed: HTTP ' + stepRes.status);
const stepData = await stepRes.json();
|
| 698 |
+
const sObs = stepData.observation || stepData;
|
| 699 |
+
|
| 700 |
+
lastScore = sObs.score_so_far || 0;
|
| 701 |
+
lastBreakdown = sObs.score_breakdown || {};
|
| 702 |
+
|
| 703 |
+
// Determine log type
|
| 704 |
+
const fb = sObs.feedback || '';
|
| 705 |
+
let logType = 'observe';
|
| 706 |
+
let logTag = 'OBSERVE';
|
| 707 |
+
|
| 708 |
+
if(action.action_type === 'flag_error'){
|
| 709 |
+
logType = fb.includes('β') ? 'flag-ok' : 'flag-bad';
|
| 710 |
+
logTag = fb.includes('β') ? 'FLAG β' : 'FLAG β';
|
| 711 |
+
} else if(action.action_type === 'submit_report'){
|
| 712 |
+
logType = 'report';
|
| 713 |
+
logTag = 'REPORT';
|
| 714 |
+
} else if(action.action_type === 'investigate_pattern'){
|
| 715 |
+
logTag = 'INVESTIGATE';
|
| 716 |
+
} else if(action.action_type === 'compute_distribution'){
|
| 717 |
+
logTag = 'COMPUTE';
|
| 718 |
+
}
|
| 719 |
+
|
| 720 |
+
addLog(logType, logTag, fb.substring(0,120), lastScore);
|
| 721 |
+
updateGauge(lastScore);
|
| 722 |
+
updateMetrics(lastBreakdown);
|
| 723 |
+
await sleep(30);
|
| 724 |
+
|
| 725 |
+
if(sObs.done) break;
|
| 726 |
+
}
|
| 727 |
+
|
| 728 |
+
return {score:lastScore, breakdown:lastBreakdown};
|
| 729 |
+
}
|
| 730 |
+
|
| 731 |
+
async function startAudit(){
|
| 732 |
+
if(running) return;
|
| 733 |
+
running = true;
|
| 734 |
+
const btn = $('btn-start');
|
| 735 |
+
btn.disabled = true;
|
| 736 |
+
btn.classList.add('running');
|
| 737 |
+
btn.textContent = 'β Running...';
|
| 738 |
+
$('feed').innerHTML = '';
|
| 739 |
+
allResults = {};
|
| 740 |
+
setStatus('Audit in progress...', true);
|
| 741 |
+
|
| 742 |
+
const selAgent = $('sel-agent').value;
|
| 743 |
+
const selTask = $('sel-task').value;
|
| 744 |
+
|
| 745 |
+
const agentList = selAgent === 'all' ? ['naive','heuristic','full'] : [selAgent];
|
| 746 |
+
const taskList = selTask === 'all' ? ['task_easy','task_medium','task_hard'] : [selTask];
|
| 747 |
+
|
| 748 |
+
try{
|
| 749 |
+
for(const agent of agentList){
|
| 750 |
+
addDivider(AGENTS[agent] || agent.toUpperCase());
|
| 751 |
+
allResults[agent] = {scores:{}, avg:0};
|
| 752 |
+
|
| 753 |
+
for(const task of taskList){
|
| 754 |
+
const taskName = TASKS[task]?.name || task;
|
| 755 |
+
addLog('phase','TASK', `${taskName} (${TASKS[task]?.difficulty || ''})`);
|
| 756 |
+
await sleep(100);
|
| 757 |
+
|
| 758 |
+
const result = await runSingleEpisode(agent, task);
|
| 759 |
+
allResults[agent].scores[task] = result.score;
|
| 760 |
+
addLog('info','SCORE', `Final: ${result.score.toFixed(2)}`);
|
| 761 |
+
}
|
| 762 |
+
|
| 763 |
+
const scores = Object.values(allResults[agent].scores);
|
| 764 |
+
allResults[agent].avg = scores.reduce((a,b)=>a+b,0)/scores.length;
|
| 765 |
+
}
|
| 766 |
+
|
| 767 |
+
updateBars(allResults);
|
| 768 |
+
|
| 769 |
+
// Update results table if full run
|
| 770 |
+
if(selAgent==='all' && selTask==='all'){
|
| 771 |
+
const tbody = $('results-table').querySelector('tbody');
|
| 772 |
+
tbody.innerHTML = '';
|
| 773 |
+
for(const agent of agentList){
|
| 774 |
+
const r = allResults[agent];
|
| 775 |
+
const tr = document.createElement('tr');
|
| 776 |
+
const scoreClass = r.avg >= 0.8 ? 'score-high' : r.avg >= 0.4 ? 'score-mid' : 'score-low';
|
| 777 |
+
tr.innerHTML = `<td>${AGENTS[agent]}</td>` +
|
| 778 |
+
['task_easy','task_medium','task_hard'].map(t=>`<td class="${scoreClass}">${(r.scores[t]||0).toFixed(2)}</td>`).join('') +
|
| 779 |
+
`<td class="${scoreClass}">${r.avg.toFixed(2)}</td>`;
|
| 780 |
+
tbody.appendChild(tr);
|
| 781 |
+
}
|
| 782 |
+
}
|
| 783 |
+
|
| 784 |
+
addDivider('AUDIT COMPLETE');
|
| 785 |
+
setStatus('Audit complete', true);
|
| 786 |
+
|
| 787 |
+
} catch(err){
|
| 788 |
+
addLog('flag-bad','ERROR', err.message || 'Audit failed');
|
| 789 |
+
setStatus('Error: ' + (err.message||'unknown'), false);
|
| 790 |
+
}
|
| 791 |
+
|
| 792 |
+
running = false;
|
| 793 |
+
btn.disabled = false;
|
| 794 |
+
btn.classList.remove('running');
|
| 795 |
+
btn.textContent = '▶ Start Audit';
|
| 796 |
+
}
|
| 797 |
+
|
| 798 |
+
// === Clock ===
|
| 799 |
+
function updateClock(){
|
| 800 |
+
$('status-time').textContent = new Date().toLocaleTimeString('en-US',{hour12:false});
|
| 801 |
+
}
|
| 802 |
+
setInterval(updateClock, 1000);
|
| 803 |
+
updateClock();
|
| 804 |
+
|
| 805 |
+
// === Health check on load ===
|
| 806 |
+
(async function(){
|
| 807 |
+
try{
|
| 808 |
+
const r = await fetch(BASE+'/health');
|
| 809 |
+
if(r.ok) setStatus('Environment ready', true);
|
| 810 |
+
else setStatus('Environment unavailable', false);
|
| 811 |
+
}catch(e){
|
| 812 |
+
setStatus('Connecting...', false);
|
| 813 |
+
}
|
| 814 |
+
})();
|
| 815 |
+
</script>
|
| 816 |
+
|
| 817 |
+
</body>
|
| 818 |
+
</html>
|