# ClinicalBench: System Architecture

## Overview

ClinicalBench is a procedurally generated, protocol-aware benchmark for evaluating agentic reasoning in clinical trial data auditing. This document describes the system architecture, data flow, and design rationale.

## System Components

### 1. Procedural Dataset Generator (`dataset_generator.py`)

The generator creates a new clinical trial dataset for every `reset()` call. It is the core of ClinicalBench's non-memorization guarantee.

**Pipeline:**
```
Seed → Protocol Sampling → Patient Generation → Error Injection → Trap Injection → Bias/Confounder Injection → Shuffle
```

**Protocol Sampling:**
- Age eligibility ranges drawn from difficulty-specific rulesets (e.g., `[35-75, 40-80, 45-85]` for easy)
- Treatment-start windows randomized per episode (e.g., 14-28 days)
- Stage IV exception window = standard window + an extension drawn at random from [7, 10, 14] days
- Hard mode: bias thresholds (dominance %, male %, stage-adjusted gap %) are protocol-specific
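
The sampling step can be sketched roughly as follows. This is a minimal illustration, not the code in `dataset_generator.py`: the function name `sample_protocol`, the dictionary keys, and the ruleset layout are all assumptions. Only the easy-mode age ranges, the 14-28 day window, and the [7, 10, 14] extensions come from the list above.

```python
import random

# Values taken from the text; the container layout is an assumption.
AGE_RANGES = {"easy": [(35, 75), (40, 80), (45, 85)]}
STAGE_IV_EXTENSIONS = [7, 10, 14]


def sample_protocol(seed: int, difficulty: str = "easy") -> dict:
    """Draw one episode's protocol parameters from a seeded RNG (hypothetical helper)."""
    rng = random.Random(seed)
    age_min, age_max = rng.choice(AGE_RANGES[difficulty])
    window = rng.randint(14, 28)  # randint is inclusive on both ends
    extension = rng.choice(STAGE_IV_EXTENSIONS)
    return {
        "age_min": age_min,
        "age_max": age_max,
        "treatment_window_days": window,
        "stage_iv_window_days": window + extension,
    }


protocol = sample_protocol(seed=42)
```

Because every draw goes through the same seeded `rng`, calling `sample_protocol` twice with the same seed yields the same protocol.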

**Error Types:**
| Error | Injection Method | Detection Difficulty |
|:---|:---|:---|
| `invalid_age` | Set age to protocol_min-1, -2, -5, -1 or protocol_max+1, +2, +5, 999 or None | Low (if agent reads protocol) |
| `temporal_inconsistency` | Set death_date = treatment_start - random(10, 240) days | Medium (requires date parsing) |
| `protocol_window_violation` | Set treatment_start = enrollment + allowed_days + random(2, 18) | High (requires stage-aware window) |
| `selection_bias` | Skew control-arm ethnicity/gender + inflate stage-adjusted mortality gap | Very High (requires stratified analysis) |
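
As an illustration of the two date-based injections in the table, the sketch below corrupts a copy of a patient record. The field names (`enrollment_date`, `treatment_start`, `death_date`) and function names are hypothetical, and treating the `random(a, b)` ranges as inclusive is an assumption.

```python
import random
from datetime import date, timedelta


def inject_temporal_inconsistency(patient: dict, rng: random.Random) -> dict:
    """Return a copy whose death_date falls 10-240 days BEFORE treatment start."""
    corrupted = dict(patient)
    offset = rng.randint(10, 240)  # assumed inclusive
    corrupted["death_date"] = patient["treatment_start"] - timedelta(days=offset)
    return corrupted


def inject_window_violation(patient: dict, allowed_days: int, rng: random.Random) -> dict:
    """Return a copy whose treatment starts 2-18 days past the allowed window."""
    corrupted = dict(patient)
    overshoot = rng.randint(2, 18)  # assumed inclusive
    corrupted["treatment_start"] = patient["enrollment_date"] + timedelta(
        days=allowed_days + overshoot
    )
    return corrupted


rng = random.Random(0)
patient = {
    "enrollment_date": date(2024, 1, 1),
    "treatment_start": date(2024, 1, 15),
    "death_date": None,
}
dead_before_treatment = inject_temporal_inconsistency(patient, rng)
late_start = inject_window_violation(patient, allowed_days=21, rng=rng)
```

For a Stage IV patient, `allowed_days` would be the extended window rather than the standard one, which is what makes this error harder to detect.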

**Adversarial Traps:**
| Trap Type | Mechanism | Purpose |
|:---|:---|:---|
| Boundary age | Set age to exact protocol_min or protocol_max | Catches agents that use `<` instead of `≤` |
| Temporal near-miss | Deceased patient with death 1-3 days AFTER treatment (valid) | Catches agents that flag all deceased |
| Window trap | Treatment delay = allowed_days - [0,1] (just within window) | Catches agents with off-by-one errors |
| Confounder cohort | Minorities have more Stage IV → higher mortality (but stage-adjusted gap is small) | Catches agents that don't stratify |
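
The first three traps all target comparison boundaries. A checker that survives them might look like the sketch below; the function names are hypothetical, and treating a same-day death as consistent is an assumption, since the table only covers deaths 1-3 days after treatment.

```python
from datetime import date
from typing import Optional


def age_is_valid(age: Optional[int], age_min: int, age_max: int) -> bool:
    """Inclusive bounds: an age exactly at protocol_min or protocol_max is eligible."""
    if age is None:
        return False
    return age_min <= age <= age_max


def within_window(delay_days: int, allowed_days: int) -> bool:
    """A delay equal to allowed_days is still inside the window."""
    return delay_days <= allowed_days


def death_is_consistent(treatment_start: date, death_date: Optional[date]) -> bool:
    """Only a death strictly before treatment start is an inconsistency (assumption)."""
    return death_date is None or death_date >= treatment_start
```

An agent that writes `age_min < age` or `delay_days < allowed_days` will raise false positives on the boundary and window traps.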

### 2. Environment (`clinical_trial_auditor_environment.py`)

Implements the OpenEnv `Environment` base class with:

**Phase System:**
- `investigation` phase: must investigate required variables before flagging
- `flagging` phase: can flag errors; automatically transitions when investigations complete
- Phase violations are penalized (-0.06 reward, workflow discipline score reduced)
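
A minimal sketch of the phase gate, assuming the environment tracks a set of required variables. `PhaseTracker` and its methods are hypothetical names; only the two phase names and the -0.06 penalty come from the list above.

```python
from dataclasses import dataclass, field

PHASE_VIOLATION_PENALTY = -0.06  # value from the text


@dataclass
class PhaseTracker:
    """Tracks which required variables have been investigated (hypothetical class)."""

    required: frozenset
    investigated: set = field(default_factory=set)

    @property
    def phase(self) -> str:
        # Transition happens automatically once every required variable is covered.
        return "flagging" if self.required <= self.investigated else "investigation"

    def investigate(self, variable: str) -> None:
        self.investigated.add(variable)

    def flag_penalty(self) -> float:
        """Penalty applied to a flag attempt: 0.0 when flagging is allowed."""
        return 0.0 if self.phase == "flagging" else PHASE_VIOLATION_PENALTY


tracker = PhaseTracker(required=frozenset({"age", "death_date"}))
early_penalty = tracker.flag_penalty()   # flagging too early
tracker.investigate("age")
tracker.investigate("death_date")
late_penalty = tracker.flag_penalty()    # now in the flagging phase
```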

**Grading Logic:**
- Ground truth is maintained as `{patient_id: [error_type, ...]}` dict from the generator
- Each flag attempt is checked against ground truth
- Bias flag requires computing ethnicity, gender, and outcome distributions first
- Bias signal uses the same stage-adjusted gap algorithm as the generator
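
Checking a flag against the ground-truth dict could be sketched as below. The function name, the returned labels (borrowed from the reward keys), and the choice to treat any repeated `(patient_id, error_type)` pair as a duplicate are assumptions.

```python
def grade_flag(
    ground_truth: dict,
    already_flagged: set,
    patient_id: str,
    error_type: str,
) -> str:
    """Classify one flag attempt against {patient_id: [error_type, ...]} (hypothetical)."""
    key = (patient_id, error_type)
    if key in already_flagged:
        return "duplicate_flag"
    already_flagged.add(key)
    if error_type in ground_truth.get(patient_id, []):
        return "correct_flag"
    return "false_positive"


ground_truth = {"P001": ["invalid_age"], "P007": ["temporal_inconsistency"]}
seen = set()
first = grade_flag(ground_truth, seen, "P001", "invalid_age")
repeat = grade_flag(ground_truth, seen, "P001", "invalid_age")
wrong = grade_flag(ground_truth, seen, "P002", "invalid_age")
```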

**Reward Configuration:**
```python
REWARD_CONFIG = {
    "correct_flag": 0.16,
    "false_positive": -0.26,      # 1.6x penalty ratio
    "duplicate_flag": -0.08,
    "overconfidence_multiplier": 1.8,  # wrong + confident = very bad
    "cost_per_step": 0.004,       # escalating per-step cost
}
```

The asymmetric false-positive penalty (about 1.6x the correct-flag reward) is deliberate: in clinical auditing, false alarms consume human reviewer time and can trigger unnecessary protocol amendments.
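
One way to read this asymmetry is as a break-even precision for flagging. The arithmetic below uses only the `correct_flag` and `false_positive` values; it ignores the step cost and the overconfidence multiplier, whose exact application is not described here, so treat it as a rough sketch.

```python
CORRECT_FLAG = 0.16
FALSE_POSITIVE = -0.26


def expected_flag_reward(p_correct: float) -> float:
    """Expected reward of one flag that is correct with probability p_correct."""
    return p_correct * CORRECT_FLAG + (1.0 - p_correct) * FALSE_POSITIVE


# Solve p * 0.16 + (1 - p) * (-0.26) = 0 for p.
break_even = -FALSE_POSITIVE / (CORRECT_FLAG - FALSE_POSITIVE)
penalty_ratio = -FALSE_POSITIVE / CORRECT_FLAG

print(f"penalty ratio: {penalty_ratio:.3f}")       # 1.625
print(f"break-even precision: {break_even:.3f}")   # 0.619
```

Under these assumptions, a flag is only worth raising when the agent is right more than about 62% of the time.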

### 3. Benchmark Scoring

The five-component rubric ensures agents can't game the score:

```
Score = 0.70 × Recall + 0.15 × Precision + 0.05 × Workflow + 0.05 × Efficiency + 0.05 × Report
```

**Why Recall is 70%:** In clinical auditing, missing a real error (false negative) is far worse than flagging a non-error (false positive). The heavy recall weight aligns the benchmark with real regulatory priorities.

**Why Precision is only 15%:** We still penalize false positives to prevent "flag everything" strategies, but not so heavily that agents become overly conservative.
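
The rubric is a plain weighted sum. The sketch below assumes each component is already normalized to the range 0-1; the function name and the sample component values are illustrative, not measured results.

```python
WEIGHTS = {
    "recall": 0.70,
    "precision": 0.15,
    "workflow": 0.05,
    "efficiency": 0.05,
    "report": 0.05,
}


def benchmark_score(components: dict) -> float:
    """Weighted sum of the five rubric components (each assumed to lie in [0, 1])."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Illustrative inputs only.
example = {
    "recall": 0.8,
    "precision": 0.9,
    "workflow": 1.0,
    "efficiency": 0.5,
    "report": 1.0,
}
score = benchmark_score(example)  # 0.56 + 0.135 + 0.05 + 0.025 + 0.05
```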

### 4. Agent Strategies (`inference.py`)

Three agents demonstrate the benchmark's difficulty gradient:

| Agent | Strategy | Key Weakness |
|:---|:---|:---|
| Naive | LLM prompt + 24-patient sample | Misses 95% of patients, uses generic 18-120 age range |
| Heuristic | Parses rules but applies them loosely | Off-by-3 age margins, ignores Stage IV window, uses overall (not stage-adjusted) bias gap |
| Reasoning | Full protocol parser + stage-aware tools | None, but limited to deterministic analysis |
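
The difference between an overall and a stage-adjusted bias gap can be shown with a small synthetic cohort. The sketch below uses a common stratification approach (a size-weighted average of per-stage mortality differences); it is not necessarily the algorithm the benchmark uses, and the row fields and group labels are hypothetical.

```python
from collections import defaultdict


def mortality(rows: list) -> float:
    return sum(r["deceased"] for r in rows) / len(rows) if rows else 0.0


def split(rows: list) -> tuple:
    minority = [r for r in rows if r["group"] == "minority"]
    majority = [r for r in rows if r["group"] == "majority"]
    return minority, majority


def overall_gap(rows: list) -> float:
    """Raw mortality difference, ignoring disease stage."""
    minority, majority = split(rows)
    return mortality(minority) - mortality(majority)


def stage_adjusted_gap(rows: list) -> float:
    """Size-weighted mean of per-stage mortality differences."""
    by_stage = defaultdict(list)
    for r in rows:
        by_stage[r["stage"]].append(r)
    weighted, total = 0.0, 0
    for stage_rows in by_stage.values():
        minority, majority = split(stage_rows)
        if not minority or not majority:
            continue  # no comparison possible within this stage
        weighted += len(stage_rows) * (mortality(minority) - mortality(majority))
        total += len(stage_rows)
    return weighted / total if total else 0.0


def cohort(group: str, stage: str, n: int, deaths: int) -> list:
    return [{"group": group, "stage": stage, "deceased": i < deaths} for i in range(n)]


# Synthetic confounder cohort: identical per-stage mortality, different stage mix.
rows = (
    cohort("minority", "I", 10, 1) + cohort("minority", "IV", 40, 20)
    + cohort("majority", "I", 40, 4) + cohort("majority", "IV", 10, 5)
)
raw = overall_gap(rows)              # 21/50 - 9/50 = 0.24
adjusted = stage_adjusted_gap(rows)  # 0.0: the gap vanishes within each stage
```

An agent that looks only at `raw` would flag selection bias here, while the stratified view shows the gap is explained entirely by stage mix.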

### 5. Dashboard UI (`static/index.html`)

A zero-dependency dark mode command center that:
- Displays the episode-specific protocol with highlighted dynamic rules
- Streams the agent's reasoning loop (Thought → Tool → Observation → Flag) in real time
- Shows live scoring gauges (precision, recall, workflow, efficiency)
- Visualizes the LLM capability gap across all three agents

## Data Flow

```
User clicks "Start Audit"
    │
    ├── POST /api/audit/reset    → New episode (seed + task_id)
    │     └── Returns: protocol excerpt, patient count, step budget
    │
    ├── POST /api/audit/plan     → Agent plans actions + traces
    │     └── Returns: [{action, trace}, ...]
    │
    └── For each action:
          POST /api/audit/step   → Execute action, get feedback + score
                └── UI renders: log card + updated gauges
```
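
The same sequence can be driven from a script. In the sketch below the three endpoint paths come from the diagram, while the request payload shapes and the `run_audit` helper are assumptions. The HTTP call is passed in as a `post` callable so the flow can be exercised without a running server.

```python
from typing import Any, Callable


def run_audit(post: Callable[[str, dict], Any], seed: int, task_id: str) -> list:
    """Reset an episode, fetch the agent's plan, then execute each planned action."""
    # Response carries the protocol excerpt, patient count, and step budget;
    # this sketch does not use it.
    post("/api/audit/reset", {"seed": seed, "task_id": task_id})
    plan = post("/api/audit/plan", {})  # expected shape: [{action, trace}, ...]
    results = []
    for item in plan:
        # Payload shape is an assumption; each response carries feedback and score.
        results.append(post("/api/audit/step", {"action": item["action"]}))
    return results


# Stand-in for a real HTTP client, used only to show the call order.
calls = []


def fake_post(path: str, payload: dict) -> Any:
    calls.append(path)
    if path.endswith("/reset"):
        return {"protocol": "...", "patient_count": 3, "step_budget": 10}
    if path.endswith("/plan"):
        return [{"action": "investigate", "trace": "t1"}, {"action": "flag", "trace": "t2"}]
    return {"feedback": "ok", "score": 0.5}


results = run_audit(fake_post, seed=42, task_id="task_easy")
```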

## Reproducibility

All randomness flows through a single `random.Random(seed)` instance in the generator. This guarantees:
- `reset(seed=42, task_id="task_easy")` produces identical results across machines
- Ground truth, traps, protocol excerpt, and patient ordering are all deterministic
- Different seeds produce measurably different protocols and datasets (verified by assertion)
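
The property is easy to demonstrate with a toy generator. The helper below is hypothetical and only shuffles patient IDs, but it follows the same rule: every random draw goes through one `random.Random(seed)` instance rather than the module-level `random` functions.

```python
import random


def generate_patient_order(seed: int, n: int = 50) -> list:
    """Toy stand-in for the generator: shuffle n patient IDs with a private RNG."""
    rng = random.Random(seed)  # isolated from the global random state
    ids = [f"P{i:03d}" for i in range(n)]
    rng.shuffle(ids)
    return ids


# Same seed, same output, regardless of what the global RNG has been doing.
first = generate_patient_order(42)
random.random()  # perturb global state; the private instance is unaffected
second = generate_patient_order(42)
assert first == second

# A different seed should produce a different ordering.
assert generate_patient_order(43) != first
```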

## Resource Constraints

The environment is designed to run within:
- **2 vCPU / 8GB memory** (Hugging Face Space free tier)
- **< 3 minutes** for full inference run (3 agents × 3 tasks)
- **Zero external dependencies** at runtime (no database, no GPU, no network calls)