Sumit Saraswat committed on
Commit
bfa8604
·
1 Parent(s): 4b5cda3

feat: added Enterprise UI dashboard, ReAct reasoning traces, and health endpoints

Files changed (7)
  1. Dockerfile +1 -1
  2. README.md +278 -165
  3. docs/architecture.md +126 -0
  4. inference.py +54 -14
  5. requirements.txt +2 -1
  6. server/app.py +513 -0
  7. server/static/index.html +818 -0
Dockerfile CHANGED
@@ -9,7 +9,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf
9
  COPY requirements.txt .
10
  RUN pip install --no-cache-dir -r requirements.txt
11
 
12
- # Copy all server files
13
  COPY . .
14
 
15
  EXPOSE 8000
 
9
  COPY requirements.txt .
10
  RUN pip install --no-cache-dir -r requirements.txt
11
 
12
+ # Copy all project files
13
  COPY . .
14
 
15
  EXPOSE 8000
README.md CHANGED
@@ -1,6 +1,6 @@
1
  ---
2
- title: Clinical Trial Auditor
3
- emoji: 🏥
4
  colorFrom: blue
5
  colorTo: green
6
  sdk: docker
@@ -10,259 +10,372 @@ tags:
10
  - openenv
11
  ---
12
 
 
13
 
14
- # Clinical Trial Auditor (OpenEnv)
15
 
16
- Clinical Trial Auditor is a protocol-aware OpenEnv benchmark for clinical data auditing. The agent acts as a Senior Clinical Data Manager reviewing procedurally generated Phase III oncology trial data under dynamic per-episode rules.
17
 
18
- This is not a static spreadsheet puzzle. Every `reset()` samples a new protocol excerpt and a new dataset, so the agent must read the rules for that episode and then audit the records accordingly.
 
 
 
19
 
20
- ## Why This Matters
21
 
22
- Real clinical audits are messy:
23
- - eligibility criteria vary by protocol,
24
- - timeline rules include exceptions,
25
- - suspicious subgroup outcomes are not always evidence of bias,
26
- - false positives waste reviewer time and can trigger unnecessary escalations.
27
 
28
- This environment is built to evaluate exactly those failure modes. It targets the gap between "can parse a table" and "can follow a high-stakes auditing workflow with protocol friction and adversarial traps."
29
 
30
- ## What Makes This Benchmark Different
31
 
32
- - Dynamic protocol reasoning: each episode exposes a new `trial_protocol_excerpt` with episode-specific age ranges and treatment-start windows.
33
- - Cross-modal audit logic: the agent must apply text rules from the protocol to tabular patient data.
34
- - Stage-aware timing exceptions: Stage IV patients can have a longer enrollment-to-treatment window, which creates valid edge cases that trap shortcut heuristics.
35
- - Hallucination traps: hard episodes can contain a confounded high-risk cohort that looks biased overall but is not actionable after stage-adjusted review.
36
- - Dense reward plus benchmark rubric: step rewards encourage learning, while `score_so_far` tracks a judge-facing episode rubric emphasizing recall, precision, workflow discipline, efficiency, and report quality.
37
 
38
- ## OpenEnv Compliance
39
 
40
- This project implements the required OpenEnv interface:
41
- - typed `Action`, `Observation`, and `State` models with Pydantic,
42
- - `reset(seed, task_id, ...) -> Observation`,
43
- - `step(action) -> Observation`,
44
- - `state -> current state`,
45
- - `openenv.yaml` at the repo root.
46
 
47
- Validation:
48
 
49
- ```bash
50
- openenv validate .
51
- ```
52
 
53
- Local validation result:
 
 
54
 
55
- ```text
56
- [OK] : Ready for multi-mode deployment
57
  ```
58
 
59
  ## Task Suite
60
 
61
  ### Task 1: `task_easy` β€” Dynamic Eligibility Screening
62
- - Dataset size: about `300` patients
63
- - Goal: flag `invalid_age`
64
- - Difficulty source: the age bounds are episode-specific, not fixed at 18-120
65
- - Traps: valid edge ages at the protocol boundary
66
 
67
  ### Task 2: `task_medium` β€” Protocol Timeline Audit
68
- - Dataset size: about `480` patients
69
- - Goal: flag `invalid_age`, `temporal_inconsistency`, and `protocol_window_violation`
70
- - Difficulty source: the treatment-start window is protocol-specific and Stage IV has a longer valid window
71
- - Traps: valid near-boundary start delays and near-immediate but valid deaths
72
 
73
  ### Task 3: `task_hard` β€” Equity + Protocol Audit
74
- - Dataset size: about `720` patients
75
- - Goal: flag record-level issues and determine whether actionable `selection_bias` exists
76
- - Difficulty source: some hard episodes contain real control-arm bias, while others contain a confounded high-risk cohort that only looks biased before stage adjustment
77
- - Traps: treatment-arm skew, high-risk outreach sites, and false-positive bias patterns
78
 
79
  ## Action Space
80
 
81
  ```python
82
  class AuditAction(Action):
83
- action_type: str # investigate_pattern | compute_distribution | flag_error | propose_fix | submit_report
84
- variable: Optional[str]
85
- patient_id: Optional[str]
86
- error_type: Optional[str] # invalid_age | temporal_inconsistency | protocol_window_violation | selection_bias
87
- reason: Optional[str]
88
  proposed_value: Optional[str]
89
- report: Optional[str]
90
- confidence: Optional[float]
91
  ```
92
 
93
  ## Observation Space
94
 
95
  ```python
96
  class AuditObservation(Observation):
97
- done: bool
98
- reward: float
99
- task_id: str
100
- task_type: str
101
- task_description: str
102
- protocol_title: str
103
- trial_protocol_excerpt: str
104
- dataset: list[dict]
105
- errors_found: list[str]
106
- patterns_investigated: list[str]
107
- distributions_computed: list[str]
108
- feedback: str
109
- score_so_far: float
110
- dense_reward_total: float
111
- score_breakdown: dict[str, float]
112
- attempts_remaining: int
113
- phase: str
114
  ```
115
 
116
- ## Reward Design and Benchmark Score
117
 
118
- The environment uses two scoring layers:
119
 
120
- - Dense step reward:
121
- - correct flags,
122
- - false-positive penalties,
123
- - duplicate penalties,
124
- - investigation/distribution bonuses,
125
- - confidence penalties for overconfident wrong flags,
126
- - per-step costs.
127
 
128
- - Episode benchmark score (`score_so_far`):
129
- - recall: `70%`
130
- - precision: `15%`
131
- - workflow discipline: `5%`
132
- - efficiency: `5%`
133
- - report quality: `5%`
 
134
 
135
- This separation keeps the RL signal dense while preventing early score saturation from hiding later mistakes.
136
 
137
- ## Procedural Generation and Reproducibility
138
 
139
- Run the generator self-test:
140
 
141
  ```bash
142
  python3 server/dataset_generator.py
143
  ```
144
 
145
- What it guarantees:
146
- - same seed -> same dataset, same protocol excerpt, same ground truth,
147
- - different seeds -> different protocols and different datasets,
148
- - deterministic grading compatibility,
149
- - hard mode can alternate between `true_bias` and `confounded_no_bias`.
150
-
151
- Example validated seeded profile:
152
-
153
- - Easy: `300` patients, `8` record-level errors, `13` traps
154
- - Medium: `480` patients, `23` record-level errors, `25` traps
155
- - Hard: `720` patients, `34` total issues including protocol/timing/bias logic, `40` traps
156
-
157
- ## Baseline Inference (`inference.py`)
158
 
159
- `inference.py` now demonstrates a clean difficulty gradient:
160
 
161
- - `naive`: raw sample-level behavior
162
- - `heuristic`: rule-based but trap-prone
163
- - `full`: protocol parser + stage-aware detectors + structured reporting
164
- - `all`: side-by-side comparison
165
 
166
- HTTP mode:
167
 
168
- ```bash
169
- python3 inference.py --mode all
170
- ```
171
-
172
- Isolated local validation mode with no socket bind:
173
 
174
  ```bash
175
- ENV_BASE_URL=inprocess python3 inference.py --mode all
 
176
  ```
177
 
178
- LLM integration:
179
- - When `OPENAI_API_KEY` or `HF_TOKEN` is present, naive mode and report generation use the OpenAI-compatible client pointed at `API_BASE_URL`.
180
- - Without a key, the script falls back to deterministic local behavior so validation still runs end-to-end.
181
 
182
- Current reproducible local benchmark result:
183
 
184
- Command:
185
 
186
  ```bash
187
- ENV_BASE_URL=inprocess python3 inference.py --mode all --seed 20260402
188
  ```
189
 
190
- Scores:
191
 
192
- | Agent | Easy | Medium | Hard | Average |
193
- |---|---:|---:|---:|---:|
194
- | Naive | 0.36 | 0.08 | 0.09 | 0.18 |
195
- | Heuristic | 0.81 | 0.56 | 0.45 | 0.60 |
196
- | Full | 0.98 | 0.99 | 0.99 | 0.99 |
197
-
198
- This is the intended story:
199
- - naive agents underperform badly,
200
- - shallow heuristics get trapped by dynamic protocol edges and confounded bias signals,
201
- - protocol-aware agents perform strongly.
202
 
203
- ## Local Usage
204
 
205
- ### 1) Start the server
206
 
207
  ```bash
208
- cd server
209
- PYTHONPATH=.. python3 -m uvicorn app:app --host 0.0.0.0 --port 8000
210
  ```
211
 
212
- ### 2) Health check
213
 
214
  ```bash
215
- curl -s http://localhost:8000/health
 
216
  ```
217
 
218
- ### 3) Run the baseline
219
 
220
- ```bash
221
- cd ..
222
- python3 inference.py --mode all
223
- ```
224
 
225
- ## Docker
226
 
227
- Build and run:
228
 
229
- ```bash
230
- cd server
231
- docker build -t clinical-trial-auditor:latest .
232
- docker run -p 8000:8000 clinical-trial-auditor:latest
233
- ```
 
234
 
235
- The container exposes `/health` for health checks and is ready for Hugging Face Spaces container deployment.
236
 
237
- ## Hugging Face Space Readiness Checklist
238
 
239
- - [x] OpenEnv interface implemented
240
- - [x] typed models for action/observation/state
241
- - [x] `openenv.yaml` present
242
- - [x] 3 tasks with deterministic graders and scores in `[0.0, 1.0]`
243
- - [x] dense reward shaping and benchmark rubric
244
- - [x] reproducible `inference.py` at repo root
245
- - [x] dockerized server
246
  - [x] `openenv validate .` passes
247
 
248
  ## Project Structure
249
 
250
- ```text
  clinical_trial_auditor/
- ├── openenv.yaml
- ├── inference.py
- ├── client.py
- ├── models.py
  ├── README.md
  └── server/
-     ├── app.py
      ├── clinical_trial_auditor_environment.py
-     ├── dataset_generator.py
      ├── models.py
      ├── requirements.txt
-     └── Dockerfile
  ```
265
 
266
- ## Motivation
267
 
268
- This benchmark is built to test whether an agent can read a changing clinical protocol, audit patient records against that protocol, avoid hallucinated escalations, and write a grounded operational report under a limited action budget.
 
1
  ---
2
+ title: ClinicalBench
3
+ emoji: 🔬
4
  colorFrom: blue
5
  colorTo: green
6
  sdk: docker
 
10
  - openenv
11
  ---
12
 
13
+ <div align="center">
14
 
15
+ # 🔬 ClinicalBench
16
 
17
+ ### A Benchmark for Evaluating Agentic Reasoning in Safety-Critical Clinical Workflows
18
 
19
+ [![OpenEnv](https://img.shields.io/badge/OpenEnv-v3-blue?style=flat-square&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSIyNCIgaGVpZ2h0PSIyNCIgdmlld0JveD0iMCAwIDI0IDI0IiBmaWxsPSJ3aGl0ZSI+PHBhdGggZD0iTTEyIDJDNi40OCAyIDIgNi40OCAyIDEyczQuNDggMTAgMTAgMTAgMTAtNC40OCAxMC0xMFMxNy41MiAyIDEyIDJ6Ii8+PC9zdmc+)](https://github.com/meta-pytorch/OpenEnv)
20
+ [![HF Space](https://img.shields.io/badge/%F0%9F%A4%97-Live%20Space-orange?style=flat-square)](https://huggingface.co/spaces/Timusgeorge/clinical_trial_auditor)
21
+ [![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?style=flat-square&logo=docker&logoColor=white)](#docker)
22
+ [![License](https://img.shields.io/badge/License-BSD%203--Clause-green?style=flat-square)](LICENSE)
23
 
24
+ **Modern AI systems fail silently in high-stakes domains such as clinical trials because they cannot reason about protocol constraints, temporal causality, and fairness simultaneously. ClinicalBench is an OpenEnv benchmark that exposes these failure modes.**
25
 
26
+ [Live Demo](https://huggingface.co/spaces/Timusgeorge/clinical_trial_auditor) Β· [Architecture](#architecture) Β· [Results](#benchmark-results) Β· [Quick Start](#quick-start)
27
 
28
+ </div>
29
 
30
+ ---
31
 
32
+ ## The Problem
33
 
34
+ Clinical data auditing is one of medicine's most consequential workflows. A single undetected protocol violation can invalidate years of trial data, delay drug approvals, and, in the worst cases, put patients at risk. Today's AI systems fail at this task in three specific ways:
35
 
36
+ | Failure Mode | What Happens | Why It Matters |
37
+ |:---|:---|:---|
38
+ | **Overflagging** | LLMs flag valid edge cases (e.g., Stage IV patients with extended treatment windows) as violations | False alarms waste reviewer time and erode trust in AI-assisted auditing |
39
+ | **Temporal Confusion** | Models miss impossible date orderings (death before treatment) while fixating on superficial anomalies | Critical safety signals go undetected |
40
+ | **Bias Misinterpretation** | Models detect demographic skew in raw statistics but cannot distinguish genuine selection bias from confounded high-risk cohorts | Naive bias detection causes incorrect escalations or dangerous dismissals |
 
41
 
42
+ ClinicalBench is designed to evaluate and train agents that can overcome all three failure modes simultaneously.
43
 
44
+ ---
45
+
46
+ ## Why ClinicalBench Exists
47
+
48
+ Existing RL benchmarks for agents fall into two categories: **game-like environments** (code golf, math puzzles) where memorization helps, and **static dataset tasks** (classification, extraction) where the answer is fixed. Neither captures the reality of clinical auditing, where:
49
+
50
+ - **Rules change every episode**: eligibility criteria, timing windows, and bias thresholds are protocol-specific
51
+ - **Edge cases are not errors**: Stage IV patients legitimately have longer treatment windows
52
+ - **Statistics lie without context**: a minority group's higher mortality rate may reflect disease severity, not unfair sampling
53
+ - **The step budget is limited**: agents must prioritize which patients and which patterns to investigate
54
+
55
+ ClinicalBench fills this gap by generating a new procedural dataset and protocol for every `reset()`, forcing agents to **read and reason** rather than memorize.
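This per-episode re-sampling can be sketched with nothing more than a seeded RNG. The specific age ranges and window lengths below are illustrative placeholders, not the benchmark's actual values:

```python
import random

def sample_protocol(seed: int) -> dict:
    """Illustrative per-episode protocol sampling: the same seed always
    yields the same rules; different seeds yield different rules."""
    rng = random.Random(seed)
    min_age, max_age = rng.choice([(35, 75), (40, 80), (45, 85)])
    window = rng.choice([14, 21, 28])         # enrollment-to-treatment days
    stage_iv_extra = rng.choice([7, 10, 14])  # Stage IV exception
    return {
        "min_age": min_age,
        "max_age": max_age,
        "treatment_window_days": window,
        "stage_iv_window_days": window + stage_iv_extra,
    }

# Same seed -> identical protocol; the agent must re-read the rules each episode.
assert sample_protocol(7) == sample_protocol(7)
```

Because hard-coded thresholds only match one seed's protocol, memorization buys an agent nothing.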
56
 
57
+ ---
58
+
59
+ ## Architecture
60
61
```
+ ┌────────────────────────────────────────────────────────────────┐
+ │                   ClinicalBench Architecture                   │
+ ├────────────────────────────────────────────────────────────────┤
+ │                                                                │
+ │   reset(seed, task_id)                                         │
+ │         │                                                      │
+ │         ▼                                                      │
+ │   ┌──────────────────────┐     ┌───────────────────────────┐   │
+ │   │ Procedural Dataset   │────▶│ Episode-Specific Protocol │   │
+ │   │ Generator            │     │ Excerpt                   │   │
+ │   │ • 300-720 patients   │     │ • Dynamic age range       │   │
+ │   │ • Seeded RNG         │     │ • Variable timing windows │   │
+ │   │ • Adversarial traps  │     │ • Stage IV exceptions     │   │
+ │   │ • Hidden confounders │     │ • Bias thresholds         │   │
+ │   └──────────────────────┘     └───────────────────────────┘   │
+ │              │                              │                  │
+ │              ▼                              ▼                  │
+ │   ┌────────────────────────────────────────────────────────┐   │
+ │   │               Agent Interaction Loop                   │   │
+ │   │     Thought → Tool → Observation → Flag → Report       │   │
+ │   ├────────────────────────────────────────────────────────┤   │
+ │   │ investigate_pattern(var)  → distribution summary       │   │
+ │   │ compute_distribution(var) → cohort breakdown           │   │
+ │   │ flag_error(patient, type) → correct/false positive     │   │
+ │   │ submit_report(text)       → quality score              │   │
+ │   └────────────────────────────────────────────────────────┘   │
+ │                             │                                  │
+ │                             ▼                                  │
+ │   ┌────────────────────────────────────────────────────────┐   │
+ │   │              Multi-Dimensional Grading                 │   │
+ │   │   Recall (70%) + Precision (15%) + Workflow (5%)       │   │
+ │   │       + Efficiency (5%) + Report Quality (5%)          │   │
+ │   │   Dense step rewards + episode benchmark score         │   │
+ │   └────────────────────────────────────────────────────────┘   │
+ │                                                                │
+ └────────────────────────────────────────────────────────────────┘
+ ```
99
+
100
+ ### Key Design Decisions
101
+
102
+ 1. **Procedural Generation**: Each `reset()` samples a new protocol with different age ranges, timing windows, and bias thresholds using seeded stochastic processes. No two environments are identical, preventing memorization.
103
+
104
+ 2. **Adversarial Traps**: Valid edge cases (boundary ages, near-window delays, valid Stage IV exceptions) are deliberately injected to punish agents that use naive threshold-based heuristics.
105
+
106
+ 3. **Confounder-Aware Bias**: Hard episodes may contain either genuine selection bias or a confounded high-risk cohort. The confounder (a high-risk outreach site with more late-stage patients) creates an overall mortality gap that disappears after stage-stratified analysis. Agents must perform this adjustment before flagging.
107
+
108
+ 4. **Phase-Gated Workflow**: Agents must investigate variables before flagging errors, and compute distributions before claiming bias. Skipping phases is penalized, encouraging structured reasoning over guessing.
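A minimal sketch of such a phase gate, with hypothetical required-variable names and illustrative reward values (the environment's real logic lives in `clinical_trial_auditor_environment.py`):

```python
# Hypothetical per-task set of variables that must be investigated first.
REQUIRED_INVESTIGATIONS = {"age", "treatment_start"}

class PhaseGate:
    """Blocks flag_error until the required variables have been investigated."""
    def __init__(self):
        self.investigated = set()
        self.phase = "investigation"

    def act(self, action_type, variable=None):
        if action_type == "investigate_pattern":
            self.investigated.add(variable)
            if REQUIRED_INVESTIGATIONS <= self.investigated:
                self.phase = "flagging"  # auto-transition once complete
            return 0.04                  # small investigation bonus
        if action_type == "flag_error" and self.phase != "flagging":
            return -0.06                 # workflow-violation penalty
        return 0.0

gate = PhaseGate()
assert gate.act("flag_error") == -0.06   # flagged before investigating
gate.act("investigate_pattern", "age")
gate.act("investigate_pattern", "treatment_start")
assert gate.phase == "flagging"
```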
109
+
110
+ ---
111
 
112
  ## Task Suite
113
 
114
  ### Task 1: `task_easy` β€” Dynamic Eligibility Screening
115
+
116
+ | Property | Value |
117
+ |:---|:---|
118
+ | Dataset | ~300 patients |
119
+ | Error types | `invalid_age` |
120
+ | Difficulty source | Age bounds are episode-specific (e.g., 35-75, 45-85), not fixed at 18-120 |
121
+ | Traps | Valid boundary ages at exact protocol limits |
122
+ | Step budget | 18 |
123
 
124
  ### Task 2: `task_medium` β€” Protocol Timeline Audit
125
+
126
+ | Property | Value |
127
+ |:---|:---|
128
+ | Dataset | ~480 patients |
129
+ | Error types | `invalid_age`, `temporal_inconsistency`, `protocol_window_violation` |
130
+ | Difficulty source | Treatment-start window is protocol-specific; Stage IV has a longer valid window |
131
+ | Traps | Near-boundary delays, valid Stage IV exceptions, near-immediate valid deaths |
132
+ | Step budget | 34 |
133
 
134
  ### Task 3: `task_hard` β€” Equity + Protocol Audit
135
+
136
+ | Property | Value |
137
+ |:---|:---|
138
+ | Dataset | ~720 patients |
139
+ | Error types | `invalid_age`, `temporal_inconsistency`, `protocol_window_violation`, `selection_bias` |
140
+ | Difficulty source | Some episodes have genuine bias; others have a confounded high-risk cohort that only looks biased before stage adjustment |
141
+ | Traps | Treatment-arm skew, high-risk outreach sites, false-positive bias patterns |
142
+ | Step budget | 46 |
143
+
144
+ ---
145
+
146
+ ## Why ClinicalBench Is Hard
147
+
148
+ This benchmark is designed to expose fundamental limitations in current AI systems:
149
+
150
+ | Challenge | Why It Breaks Naive Agents |
151
+ |:---|:---|
152
+ | **Dynamic protocols** | Rules embedded in natural language change every episode, so hardcoded thresholds fail |
153
+ | **Non-linear constraints** | Stage IV exception creates a conditional rule that requires cross-referencing two fields |
154
+ | **Conflicting signals** | High-risk sites inflate mortality for minorities, but the cause is disease severity, not sampling bias |
155
+ | **Limited step budget** | Agents cannot check every patient; they must prioritize investigations and triage efficiently |
156
+ | **Phased workflow** | Flagging before investigating is blocked and penalized, forcing structured reasoning |
157
+ | **Overconfidence penalty** | High-confidence wrong flags are penalized 1.8×, discouraging guessing |
158
+
159
+ ---
160
+
161
+ ## Benchmark Results
162
+
163
+ Reproducible baseline scores (`seed=20260402`):
164
+
165
+ | Agent | Easy | Medium | Hard | Average | Precision | Description |
166
+ |:---|:---:|:---:|:---:|:---:|:---:|:---|
167
+ | **Naive LLM** | 0.19 | 0.06 | 0.06 | **0.10** | 5% | Raw prompt + small sample, no structured reasoning |
168
+ | **Heuristic** | 0.81 | 0.56 | 0.45 | **0.60** | 61% | Parses rules but ignores Stage IV exceptions, uses overall (not stage-adjusted) bias |
169
+ | **Reasoning Agent** | 0.97 | 0.97 | 0.98 | **0.97** | 100% | Full protocol parsing + stage-aware detectors + structured workflow |
170
+
171
+ **The 87-point gap** between the naive LLM (0.10) and the tool-augmented reasoning agent (0.97) demonstrates the necessity of structured protocol comprehension and staged investigation. The heuristic agent's middling performance (0.60) shows that even rule-based approaches fail when they don't account for conditional exceptions and confounded statistics.
172
+
173
+ ### What This Tells Us
174
+
175
+ - **Language understanding alone is insufficient**: the naive LLM reads the protocol but cannot systematically apply it across hundreds of records
176
+ - **Heuristics miss conditional logic**: ignoring the Stage IV exception and using raw (not stage-adjusted) mortality gaps causes cascading false positives and missed real violations
177
+ - **Structured reasoning closes the gap**: the reasoning agent's workflow (parse protocol → investigate → flag → verify → report) achieves near-perfect scores by respecting the environment's phase constraints
178
+
179
+ ---
180
 
181
  ## Action Space
182
 
183
  ```python
184
  class AuditAction(Action):
185
+ action_type: str # investigate_pattern | compute_distribution |
186
+ # flag_error | propose_fix | submit_report
187
+ variable: Optional[str] # Field to investigate or compute
188
+ patient_id: Optional[str] # Patient to flag
189
+ error_type: Optional[str] # invalid_age | temporal_inconsistency |
190
+ # protocol_window_violation | selection_bias
191
+ reason: Optional[str] # Justification text
192
  proposed_value: Optional[str]
193
+ report: Optional[str] # Final audit report
194
+ confidence: Optional[float] # 0.0-1.0 confidence in the flag
195
  ```
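As a usage sketch, here is how actions might be constructed, using a plain dataclass stand-in for the Pydantic model above (the patient id and reason text are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AuditAction:  # plain-dataclass stand-in for the Pydantic model above
    action_type: str
    variable: Optional[str] = None
    patient_id: Optional[str] = None
    error_type: Optional[str] = None
    reason: Optional[str] = None
    proposed_value: Optional[str] = None
    report: Optional[str] = None
    confidence: Optional[float] = None

# Investigate first, then flag with an explicit justification and confidence.
investigate = AuditAction(action_type="investigate_pattern", variable="age")
flag = AuditAction(
    action_type="flag_error",
    patient_id="P-0042",  # hypothetical patient id
    error_type="invalid_age",
    reason="Age is below the protocol minimum for this episode.",
    confidence=0.9,
)
assert flag.error_type == "invalid_age"
```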
196
 
197
  ## Observation Space
198
 
199
  ```python
200
  class AuditObservation(Observation):
201
+ done: bool # Episode finished?
202
+ reward: float # Dense step reward
203
+ task_id: str # task_easy | task_medium | task_hard
204
+ task_type: str # Audit category
205
+ task_description: str # Task instructions
206
+ protocol_title: str # Episode protocol ID
207
+ trial_protocol_excerpt: str # Natural language protocol rules
208
+ dataset: list[dict] # Full patient records
209
+ errors_found: list[str] # Correctly flagged patients
210
+ patterns_investigated: list[str] # Variables investigated
211
+ distributions_computed: list[str] # Distributions computed
212
+ feedback: str # Step-by-step feedback
213
+ score_so_far: float # Current benchmark score [0, 1]
214
+ dense_reward_total: float # Cumulative dense reward
215
+ score_breakdown: dict[str, float] # {recall, precision, workflow, efficiency, report}
216
+ attempts_remaining: int # Steps left in budget
217
+ phase: str # investigation | flagging
218
  ```
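An agent loop would typically branch on `phase` and `attempts_remaining`. A toy policy sketch, using a plain dict in place of the model above (the variable and patient id are hypothetical):

```python
def choose_action(obs: dict) -> dict:
    """Toy policy: investigate until the env switches to the flagging phase,
    then spend the remaining budget on flags and a final report."""
    if obs["phase"] == "investigation":
        return {"action_type": "investigate_pattern", "variable": "age"}
    if obs["attempts_remaining"] <= 1:
        return {"action_type": "submit_report", "report": "Audit summary..."}
    return {"action_type": "flag_error", "patient_id": "P-0001",
            "error_type": "invalid_age", "reason": "below protocol minimum"}

obs = {"phase": "investigation", "attempts_remaining": 18}
assert choose_action(obs)["action_type"] == "investigate_pattern"
obs = {"phase": "flagging", "attempts_remaining": 1}
assert choose_action(obs)["action_type"] == "submit_report"
```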
219
 
220
+ ---
221
 
222
+ ## Reward Design
223
 
224
+ ClinicalBench uses **two scoring layers** to separate RL training signal from benchmark evaluation:
225
 
226
+ ### Dense Step Reward (for RL training)
227
+ - **Correct flag**: +0.16
+ - **False positive**: −0.26 (asymmetric, to penalize guessing)
+ - **Duplicate flag**: −0.08
+ - **New investigation**: +0.04
+ - **Overconfident wrong flag**: penalty × 1.8
+ - **Per-step cost**: −0.004 × step_count (increasing pressure)
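Putting those numbers together, the per-step reward might be computed roughly like this (a simplified reconstruction for illustration, not the environment's actual code):

```python
def step_reward(outcome, step_count, confidence=0.5):
    """outcome: 'correct_flag', 'false_positive', 'duplicate_flag',
    'new_investigation', or None (no scoring event this step)."""
    base = {
        "correct_flag": 0.16,
        "false_positive": -0.26,
        "duplicate_flag": -0.08,
        "new_investigation": 0.04,
    }.get(outcome, 0.0)
    # Overconfident wrong flags get their penalty scaled up by 1.8x.
    if outcome == "false_positive" and confidence >= 0.8:
        base *= 1.8
    return base - 0.004 * step_count  # escalating per-step cost

# A confident wrong flag hurts more than a hedged one.
assert step_reward("false_positive", 1, confidence=0.9) < \
       step_reward("false_positive", 1, confidence=0.5)
```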
233
 
234
+ ### Episode Benchmark Score (for evaluation)
235
+ | Component | Weight | Signal |
236
+ |:---|:---:|:---|
237
+ | Recall | 70% | What fraction of real errors were caught? |
238
+ | Precision | 15% | How many flags were correct? |
239
+ | Workflow Discipline | 5% | Did the agent investigate before flagging? |
240
+ | Efficiency | 5% | Ratio of useful actions to total actions |
241
+ | Report Quality | 5% | Does the report cite protocol, root cause, risk, corrective action, fairness? |
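The episode score is then a weighted sum of the five components; a sketch using the weights from the table, assuming each component is already normalized to [0, 1]:

```python
WEIGHTS = {"recall": 0.70, "precision": 0.15, "workflow": 0.05,
           "efficiency": 0.05, "report": 0.05}

def episode_score(components: dict) -> float:
    """Weighted blend of the rubric components, each in [0, 1]."""
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

# Perfect recall with poor precision still scores well: recall dominates.
score = episode_score({"recall": 1.0, "precision": 0.2, "workflow": 1.0,
                       "efficiency": 1.0, "report": 1.0})
assert abs(score - 0.88) < 1e-9
```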
242
 
243
+ This separation keeps the RL signal dense (partial progress on every step) while preventing early score saturation from hiding later mistakes.
244
 
245
+ ---
246
+
247
+ ## Procedural Generation
248
+
249
+ Each episode generates a unique dataset with new protocol constraints:
250
 
251
  ```bash
252
  python3 server/dataset_generator.py
253
  ```
254
 
255
+ **Guarantees:**
256
+ - Same seed → identical dataset, protocol, and ground truth
257
+ - Different seeds → different protocols with different rules
258
+ - Deterministic grading: reproducible scores across machines
259
+ - Hard mode alternates between `true_bias` and `confounded_no_bias`
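The same-seed guarantee can be checked mechanically by fingerprinting two runs; a sketch with a toy record generator standing in for `dataset_generator.py`:

```python
import hashlib
import json
import random

def generate_dataset(seed: int, n: int = 5):
    """Toy stand-in for the real generator: seeded, order-stable records."""
    rng = random.Random(seed)
    return [{"patient_id": f"P-{i:04d}", "age": rng.randint(18, 90)}
            for i in range(n)]

def fingerprint(records) -> str:
    return hashlib.sha256(json.dumps(records, sort_keys=True).encode()).hexdigest()

# Same seed -> identical fingerprint; different seeds -> different data.
assert fingerprint(generate_dataset(42)) == fingerprint(generate_dataset(42))
assert fingerprint(generate_dataset(42)) != fingerprint(generate_dataset(43))
```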
260
 
261
+ **Example validated profile (seed=42):**
262
+ - Easy: 300 patients, 8 errors, 13 traps
263
+ - Medium: 480 patients, 23 errors, 25 traps
264
+ - Hard: 720 patients, 34 errors, 40 traps
265
 
266
+ ---
267
 
268
+ ## Quick Start
269
 
270
+ ### 1. Start the Server
271
 
272
  ```bash
273
+ cd server
274
+ PYTHONPATH=.. python3 -m uvicorn app:app --host 0.0.0.0 --port 8000
275
  ```
276
 
277
+ ### 2. Open the Dashboard
 
 
278
 
279
+ Navigate to [http://localhost:8000](http://localhost:8000) to see the enterprise audit command center. Select an agent and task, then click **Start Audit** to watch the reasoning loop in real time.
280
 
281
+ ### 3. Health Check
282
 
283
  ```bash
284
+ curl -s http://localhost:8000/health
285
  ```
286
 
287
+ ### 4. Run Baseline Inference
288
 
289
+ ```bash
290
+ # Full comparison (all 3 agents × all 3 tasks)
291
+ ENV_BASE_URL=inprocess python3 inference.py --mode all --seed 20260402
292
 
293
+ # Single agent mode
294
+ python3 inference.py --mode full
295
+ ```
296
 
297
+ ### 5. OpenEnv Validation
298
 
299
  ```bash
300
+ openenv validate .
 
301
  ```
302
 
303
+ ---
304
+
305
+ ## Docker
306
 
307
  ```bash
308
+ docker build -t clinical-bench:latest .
309
+ docker run -p 8000:8000 clinical-bench:latest
310
  ```
311
 
312
+ The container exposes:
313
+ - `/health` for health checks
314
+ - `/` for the enterprise dashboard
315
+ - WebSocket endpoints for OpenEnv `reset()` / `step()` / `state()`
316
 
317
+ ---
318
 
319
+ ## Real-World Relevance
320
 
321
+ ClinicalBench models tasks that clinical data managers perform daily:
322
 
323
+ | Real-World Task | ClinicalBench Equivalent |
324
+ |:---|:---|
325
+ | ICH-E6(R2) protocol compliance review | Age eligibility + treatment window verification |
326
+ | FDA 21 CFR Part 11 data integrity audit | Temporal consistency checking |
327
+ | DSMB safety signal assessment | Stage-adjusted outcome disparity analysis |
328
+ | IRB equity review | Confounder-aware selection bias detection |
329
 
330
+ This benchmark is immediately useful for evaluating whether an LLM-based agent can be safely deployed in a clinical data management workflow, one of healthcare AI's highest-value, highest-risk applications.
331
 
332
+ ---
333
 
334
+ ## OpenEnv Compliance
335
+
336
+ - [x] Typed `Action`, `Observation`, `State` models (Pydantic)
337
+ - [x] `reset(seed, task_id) → Observation`
338
+ - [x] `step(action) → Observation`
339
+ - [x] `state → current state`
340
+ - [x] `openenv.yaml` with metadata and 3 tasks
341
  - [x] `openenv validate .` passes
342
+ - [x] 3 tasks with deterministic graders, scores in `[0.0, 1.0]`
343
+ - [x] Dense reward shaping + benchmark rubric
344
+ - [x] Reproducible `inference.py` at repo root
345
+ - [x] Dockerized with health check
346
+ - [x] Inference runtime < 3 minutes
347
+ - [x] Runs on 2 vCPU / 8GB memory
348
 
349
  ## Project Structure
350
 
351
+ ```
  clinical_trial_auditor/
+ ├── openenv.yaml                    # OpenEnv manifest with 3 tasks
+ ├── inference.py                    # Baseline inference (naive/heuristic/full)
+ ├── client.py                       # EnvClient implementation
+ ├── models.py                       # Typed Action/Observation/State
  ├── README.md
+ ├── Dockerfile
+ ├── requirements.txt
+ ├── pyproject.toml
+ ├── docs/
+ │   └── architecture.md             # Detailed system architecture
  └── server/
+     ├── app.py                      # FastAPI + dashboard API
      ├── clinical_trial_auditor_environment.py
+     ├── dataset_generator.py        # Procedural adversarial data engine
      ├── models.py
      ├── requirements.txt
+     └── static/
+         └── index.html              # Enterprise audit dashboard
  ```
372
 
373
+ ---
374
+
375
+ <div align="center">
376
+
377
+ **Built for the Meta × Scaler School of Technology OpenEnv Hackathon**
378
+
379
+ *ClinicalBench: because the hardest thing about AI in healthcare isn't the model; it's knowing when to trust it.*
380
 
381
+ </div>
docs/architecture.md ADDED
@@ -0,0 +1,126 @@
1
+ # ClinicalBench β€” System Architecture
2
+
3
+ ## Overview
4
+
5
+ ClinicalBench is a procedurally generated, protocol-aware benchmark for evaluating agentic reasoning in clinical trial data auditing. This document describes the system architecture, data flow, and design rationale.
6
+
7
+ ## System Components
8
+
9
+ ### 1. Procedural Dataset Generator (`dataset_generator.py`)
10
+
11
+ The generator creates a new clinical trial dataset for every `reset()` call. It is the core of ClinicalBench's non-memorization guarantee.
12
+
13
+ **Pipeline:**
14
+ ```
15
+ Seed → Protocol Sampling → Patient Generation → Error Injection → Trap Injection → Bias/Confounder Injection → Shuffle
16
+ ```
17
+
18
+ **Protocol Sampling:**
19
+ - Age eligibility ranges drawn from difficulty-specific rulesets (e.g., `[35-75, 40-80, 45-85]` for easy)
20
+ - Treatment-start windows randomized per episode (e.g., 14-28 days)
21
+ - Stage IV exception window = standard + random [7, 10, 14] days
22
+ - Hard mode: bias thresholds (dominance %, male %, stage-adjusted gap %) are protocol-specific
23
+
24
+ **Error Types:**
25
+ | Error | Injection Method | Detection Difficulty |
26
+ |:---|:---|:---|
27
+ | `invalid_age` | Set age to protocol_min-1, -2, -5, -1 or protocol_max+1, +2, +5, 999 or None | Low (if agent reads protocol) |
28
+ | `temporal_inconsistency` | Set death_date = treatment_start - random(10, 240) days | Medium (requires date parsing) |
29
+ | `protocol_window_violation` | Set treatment_start = enrollment + allowed_days + random(2, 18) | High (requires stage-aware window) |
30
+ | `selection_bias` | Skew control-arm ethnicity/gender + inflate stage-adjusted mortality gap | Very High (requires stratified analysis) |
31
+
32
+ **Adversarial Traps:**
33
+ | Trap Type | Mechanism | Purpose |
34
+ |:---|:---|:---|
35
+ | Boundary age | Set age to exact protocol_min or protocol_max | Catches agents that use `<` instead of `≤` |
36
+ | Temporal near-miss | Deceased patient with death 1-3 days AFTER treatment (valid) | Catches agents that flag all deceased |
37
+ | Window trap | Treatment delay = allowed_days - [0,1] (just within window) | Catches agents with off-by-one errors |
38
+ | Confounder cohort | Minorities have more Stage IV → higher mortality (but stage-adjusted gap is small) | Catches agents that don't stratify |
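The confounder trap only defeats agents that compare raw mortality rates. A minimal sketch of the stage-stratified gap computation, mirroring the reasoning agent in `server/app.py` (field names follow the dataset schema):

```python
def stage_adjusted_gap(control, dom_eth, stages=("I", "II", "III", "IV")):
    # Size-weighted average of per-stage mortality gaps (minority - dominant);
    # strata with fewer than 5 patients on either side are skipped.
    weighted, total = 0.0, 0
    for stg in stages:
        rows = [r for r in control if r["stage"] == stg]
        dom = [r for r in rows if r["ethnicity"] == dom_eth]
        mino = [r for r in rows if r["ethnicity"] != dom_eth]
        if len(dom) >= 5 and len(mino) >= 5:
            d = sum(r["outcome"] == "deceased" for r in dom) / len(dom)
            m = sum(r["outcome"] == "deceased" for r in mino) / len(mino)
            weighted += (m - d) * len(rows)
            total += len(rows)
    return weighted / total if total else 0.0
```

A cohort where minorities cluster in Stage IV can show a large overall gap but a near-zero stage-adjusted gap, which is exactly what the trap exploits.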
39
+
40
+ ### 2. Environment (`clinical_trial_auditor_environment.py`)
41
+
42
+ Implements the OpenEnv `Environment` base class with:
43
+
44
+ **Phase System:**
45
+ - `investigation` phase: must investigate required variables before flagging
46
+ - `flagging` phase: entered automatically once the required investigations complete; flags are now permitted
47
+ - Phase violations are penalized (-0.06 reward, workflow discipline score reduced)
48
+
49
+ **Grading Logic:**
50
+ - Ground truth is maintained as `{patient_id: [error_type, ...]}` dict from the generator
51
+ - Each flag attempt is checked against ground truth
52
+ - Bias flag requires computing ethnicity, gender, and outcome distributions first
53
+ - Bias signal uses the same stage-adjusted gap algorithm as the generator
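The grading step itself can be sketched as follows (names are illustrative; the real environment also factors in confidence and the phase system):

```python
def grade_flag(ground_truth, patient_id, error_type, already_flagged):
    """Classify a flag attempt against the generator's ground truth.

    ground_truth: {patient_id: [error_type, ...]} from the generator.
    already_flagged: set of (patient_id, error_type) pairs seen so far.
    """
    key = (patient_id, error_type)
    if key in already_flagged:
        return "duplicate_flag"
    already_flagged.add(key)
    if error_type in ground_truth.get(patient_id, []):
        return "correct_flag"
    return "false_positive"
```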
54
+
55
+ **Reward Configuration:**
56
+ ```python
57
+ REWARD_CONFIG = {
58
+ "correct_flag": 0.16,
59
+ "false_positive": -0.26, # 1.6x penalty ratio
60
+ "duplicate_flag": -0.08,
61
+ "overconfidence_multiplier": 1.8, # wrong + confident = very bad
62
+ "cost_per_step": 0.004, # escalating per-step cost
63
+ }
64
+ ```
65
+
66
+ The asymmetric false positive penalty (1.6x the correct reward) is deliberate: in clinical auditing, false alarms consume human reviewer time and can trigger unnecessary protocol amendments.
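Putting the table together, a hedged sketch of how a single flag might be scored; the 0.9 confidence cutoff for "overconfident" is an assumption here, not a documented value:

```python
REWARD_CONFIG = {
    "correct_flag": 0.16,
    "false_positive": -0.26,
    "duplicate_flag": -0.08,
    "overconfidence_multiplier": 1.8,
    "cost_per_step": 0.004,
}

def flag_reward(correct: bool, confidence: float, cfg=REWARD_CONFIG) -> float:
    # A confident wrong flag is penalized harder than an uncertain one.
    if correct:
        return cfg["correct_flag"]
    penalty = cfg["false_positive"]
    if confidence >= 0.9:  # assumed cutoff for "overconfident"
        penalty *= cfg["overconfidence_multiplier"]
    return penalty
```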
67
+
68
+ ### 3. Benchmark Scoring
69
+
70
+ The five-component rubric ensures agents can't game the score:
71
+
72
+ ```
73
+ Score = 0.70 × Recall + 0.15 × Precision + 0.05 × Workflow + 0.05 × Efficiency + 0.05 × Report
74
+ ```
75
+
76
+ **Why Recall is 70%:** In clinical auditing, missing a real error (false negative) is far worse than flagging a non-error (false positive). The heavy recall weight aligns the benchmark with real regulatory priorities.
77
+
78
+ **Why Precision is only 15%:** We still penalize false positives to prevent "flag everything" strategies, but not so heavily that agents become overly conservative.
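The rubric reduces to a weighted sum whose weights total 1.0, which makes the recall dominance easy to verify:

```python
def benchmark_score(recall, precision, workflow, efficiency, report):
    # Recall dominates (0.70); the other four components share the rest.
    return (0.70 * recall + 0.15 * precision
            + 0.05 * workflow + 0.05 * efficiency + 0.05 * report)
```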
79
+
80
+ ### 4. Agent Strategies (inference.py)
81
+
82
+ Three agents demonstrate the benchmark's difficulty gradient:
83
+
84
+ | Agent | Strategy | Key Weakness |
85
+ |:---|:---|:---|
86
+ | Naive | LLM prompt + 24-patient sample | Audits only a 24-patient sample, assumes a generic 18-120 age range |
87
+ | Heuristic | Parses rules but applies them loosely | Off-by-3 age margins, ignores Stage IV window, uses overall (not stage-adjusted) bias gap |
88
+ | Reasoning | Full protocol parser + stage-aware tools | None by design, though limited to deterministic analysis |
89
+
90
+ ### 5. Dashboard UI (`static/index.html`)
91
+
92
+ A zero-dependency dark mode command center that:
93
+ - Displays the episode-specific protocol with highlighted dynamic rules
94
+ - Streams the agent's reasoning loop (Thought → Tool → Observation → Flag) in real time
95
+ - Shows live scoring gauges (precision, recall, workflow, efficiency)
96
+ - Visualizes the LLM capability gap across all three agents
97
+
98
+ ## Data Flow
99
+
100
+ ```
101
+ User clicks "Start Audit"
102
+ │
103
+ ├── POST /api/audit/reset → New episode (seed + task_id)
104
+ │      └── Returns: protocol excerpt, patient count, step budget
105
+ │
106
+ ├── POST /api/audit/plan → Agent plans actions + traces
107
+ │      └── Returns: [{action, trace}, ...]
108
+ │
109
+ └── For each action:
110
+        POST /api/audit/step → Execute action, get feedback + score
111
+        └── UI renders: log card + updated gauges
112
+ ```
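Request bodies follow the Pydantic models in `server/app.py`. A small illustrative helper (not part of the repo) that builds a `/api/audit/step` body, omitting unset optional fields as `StepRequest` allows:

```python
import json

def step_payload(action_type: str, **fields) -> str:
    # Mirrors StepRequest: everything except action_type is Optional,
    # so None-valued fields are simply left out of the JSON body.
    body = {"action_type": action_type}
    body.update({k: v for k, v in fields.items() if v is not None})
    return json.dumps(body)
```

For example, `step_payload("flag_error", patient_id="P014", error_type="invalid_age", confidence=0.94)` produces a body the UI API would accept.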
113
+
114
+ ## Reproducibility
115
+
116
+ All randomness flows through a single `random.Random(seed)` instance in the generator. This guarantees:
117
+ - `reset(seed=42, task_id="task_easy")` produces identical results across machines
118
+ - Ground truth, traps, protocol excerpt, and patient ordering are all deterministic
119
+ - Different seeds produce measurably different protocols and datasets (verified by assertion)
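The single-RNG discipline is easy to verify in miniature; `generate_dataset` below is a stand-in for the real generator:

```python
import random

def generate_dataset(seed: int, n: int = 5) -> list:
    # One Random instance per episode; no calls to the global random module.
    rng = random.Random(seed)
    return [{"patient_id": f"P{i:03d}", "age": rng.randint(30, 90)}
            for i in range(n)]
```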
120
+
121
+ ## Resource Constraints
122
+
123
+ The environment is designed to run within:
124
+ - **2 vCPU / 8GB memory** (Hugging Face Space free tier)
125
+ - **< 3 minutes** for full inference run (3 agents Γ— 3 tasks)
126
+ - **Zero external dependencies** at runtime (no database, no GPU, no network calls)
inference.py CHANGED
@@ -1,11 +1,14 @@
1
  """
2
- Clinical Trial Auditor β€” Baseline Inference
3
- ===========================================
4
- Demonstrates a deliberate difficulty gradient on the protocol-aware benchmark:
5
 
6
- 1. NAIVE β€” raw prompt + small sample, weak structure
7
- 2. HEURISTIC β€” parses obvious rules but ignores key exceptions
8
- 3. FULL β€” parses protocol, honors stage exceptions, stage-adjusts bias
 
 
 
9
  """
10
 
11
  from __future__ import annotations
@@ -682,6 +685,7 @@ def run_heuristic_task(client_unused: Optional[OpenAI], task_id: str, task_name:
682
 
683
 
684
  def run_full_task(client: Optional[OpenAI], task_id: str, task_name: str, seed: int):
 
685
  print(f"\n Task: {task_name}")
686
  print(" " + "-" * 54)
687
  metrics = MetricsTracker()
@@ -699,22 +703,56 @@ def run_full_task(client: Optional[OpenAI], task_id: str, task_name: str, seed:
699
  f"stage IV <= {rules.stage_iv_window_days}d"
700
  )
701
 
702
  findings = []
703
- findings.extend(AgeDetector().detect(dataset, rules))
704
- findings.extend(TemporalDetector().detect(dataset))
705
  if task_id in {"task_medium", "task_hard"}:
706
- findings.extend(ProtocolWindowDetector().detect(dataset, rules, ignore_stage_exception=False))
707
  if task_id == "task_hard":
708
- findings.extend(BiasAnalyzer().detect_full(dataset, rules))
709
 
710
  age_count = sum(f.error_type == "invalid_age" for f in findings)
711
  temporal_count = sum(f.error_type == "temporal_inconsistency" for f in findings)
712
  window_count = sum(f.error_type == "protocol_window_violation" for f in findings)
713
  bias_count = sum(f.error_type == "selection_bias" for f in findings)
714
  print(
715
- f" Detected: age={age_count} | temporal={temporal_count} | "
716
  f"window={window_count} | bias={bias_count}"
717
  )
 
718
 
719
  extra_checks = {
720
  "task_easy": ["enrollment_date", "stage", "group", "treatment_site", "country"],
@@ -736,7 +774,9 @@ def run_full_task(client: Optional[OpenAI], task_id: str, task_name: str, seed:
736
  if action.action_type == "flag_error":
737
  metrics.record(obs["feedback"])
738
  if action.action_type == "flag_error" or metrics.steps <= 5:
739
- print(f" Step {metrics.steps}: score={final_score:.2f} | {obs['feedback'][:80]}")
 
 
740
 
741
  if not result.done:
742
  result = env.step(AuditAction(action_type="submit_report", report=report))
@@ -785,8 +825,8 @@ def main():
785
  client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY) if API_KEY else None
786
 
787
  print("=" * 70)
788
- print(" Clinical Trial Auditor β€” Protocol-Aware Baseline Inference")
789
- print(" Dynamic Rules | Adversarial Traps | Stage-Adjusted Fairness Review")
790
  print(f" Model: {MODEL_NAME}")
791
  print(f" Seed: {args.seed}")
792
  print("=" * 70)
 
1
  """
2
+ ClinicalBench — Agentic Reasoning Baseline Inference
3
+ ====================================================
4
+ Demonstrates a deliberate capability gap across three agent architectures:
5

6
+ 1. NAIVE — raw LLM prompt + small sample, no structured reasoning
7
+ 2. HEURISTIC — parses obvious rules but ignores conditional exceptions
8
+ 3. REASONING — Thought→Tool→Observe loop with protocol-aware detectors
9
+
10
+ The 0.88 score gap between the naive (0.10) and reasoning (0.98) agents shows
11
+ that structured protocol comprehension is necessary for clinical auditing.
12
  """
13
 
14
  from __future__ import annotations
 
685
 
686
 
687
  def run_full_task(client: Optional[OpenAI], task_id: str, task_name: str, seed: int):
688
+ """Reasoning Agent: Thought→Tool→Observe loop with protocol-aware detectors."""
689
  print(f"\n Task: {task_name}")
690
  print(" " + "-" * 54)
691
  metrics = MetricsTracker()
 
703
  f"stage IV <= {rules.stage_iv_window_days}d"
704
  )
705
 
706
+ # ─── Thought→Tool→Observe: Protocol Comprehension ───
707
+ print(f" [THOUGHT] I need to parse the episode-specific protocol. Default thresholds must NOT be assumed.")
708
+ print(f" [TOOL] parse_protocol(excerpt)")
709
+ print(f" [OBSERVE] Extracted: age {rules.age_min}-{rules.age_max}, "
710
+ f"standard ≤{rules.treatment_window_days}d, Stage IV ≤{rules.stage_iv_window_days}d")
711
+ print(f" [DECIDE] Protocol parsed. Begin systematic investigation phase.\n")
712
+
713
+ # ─── Thought→Tool→Observe: Detection Phase ───
714
+ print(f" [THOUGHT] Analyzing age distribution against protocol range {rules.age_min}-{rules.age_max}.")
715
+ print(f" [TOOL] analyze_age_distribution(dataset, rules)")
716
  findings = []
717
+ age_findings = AgeDetector().detect(dataset, rules)
718
+ findings.extend(age_findings)
719
+ print(f" [OBSERVE] Found {len(age_findings)} age violations.\n")
720
+
721
+ print(f" [THOUGHT] Checking temporal consistency: death_date must never precede treatment_start.")
722
+ print(f" [TOOL] check_temporal_consistency(dataset)")
723
+ temporal_findings = TemporalDetector().detect(dataset)
724
+ findings.extend(temporal_findings)
725
+ print(f" [OBSERVE] Found {len(temporal_findings)} temporal inconsistencies.\n")
726
+
727
  if task_id in {"task_medium", "task_hard"}:
728
+ print(f" [THOUGHT] Verifying treatment scheduling windows. Stage IV patients have extended window "
729
+ f"({rules.stage_iv_window_days}d vs {rules.treatment_window_days}d) — must not false-flag.")
730
+ print(f" [TOOL] verify_treatment_windows(dataset, rules, stage_aware=True)")
731
+ window_findings = ProtocolWindowDetector().detect(dataset, rules, ignore_stage_exception=False)
732
+ findings.extend(window_findings)
733
+ print(f" [OBSERVE] Found {len(window_findings)} window violations (stage-aware check).\n")
734
+
735
  if task_id == "task_hard":
736
+ print(f" [THOUGHT] Evaluating control-arm equity. Must use stage-stratified analysis to avoid "
737
+ f"confounded false positives from high-risk outreach sites.")
738
+ print(f" [TOOL] evaluate_control_arm_equity(dataset, rules, stage_adjusted=True)")
739
+ bias_findings = BiasAnalyzer().detect_full(dataset, rules)
740
+ findings.extend(bias_findings)
741
+ if bias_findings:
742
+ print(f" [OBSERVE] Stage-adjusted bias CONFIRMED. {bias_findings[0].reason}")
743
+ else:
744
+ print(f" [OBSERVE] No actionable bias: apparent disparity explained by stage confounders.")
745
+ print()
746
 
747
  age_count = sum(f.error_type == "invalid_age" for f in findings)
748
  temporal_count = sum(f.error_type == "temporal_inconsistency" for f in findings)
749
  window_count = sum(f.error_type == "protocol_window_violation" for f in findings)
750
  bias_count = sum(f.error_type == "selection_bias" for f in findings)
751
  print(
752
+ f" [DECIDE] Detection complete: age={age_count} | temporal={temporal_count} | "
753
  f"window={window_count} | bias={bias_count}"
754
  )
755
+ print(f" [THOUGHT] Transitioning to flagging phase. Prioritizing by risk score.\n")
756
 
757
  extra_checks = {
758
  "task_easy": ["enrollment_date", "stage", "group", "treatment_site", "country"],
 
774
  if action.action_type == "flag_error":
775
  metrics.record(obs["feedback"])
776
  if action.action_type == "flag_error" or metrics.steps <= 5:
777
+ fb = obs['feedback'][:80]
778
+ tag = "βœ“" if "βœ“" in obs['feedback'] else "βœ—" if "βœ—" in obs['feedback'] else "β†’"
779
+ print(f" Step {metrics.steps}: score={final_score:.2f} | [{tag}] {fb}")
780
 
781
  if not result.done:
782
  result = env.step(AuditAction(action_type="submit_report", report=report))
 
825
  client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY) if API_KEY else None
826
 
827
  print("=" * 70)
828
+ print(" ClinicalBench β€” Agentic Reasoning Baseline Inference")
829
+ print(" Thought→Tool→Observe | Protocol-Aware | Stage-Adjusted Fairness")
830
  print(f" Model: {MODEL_NAME}")
831
  print(f" Seed: {args.seed}")
832
  print("=" * 70)
requirements.txt CHANGED
@@ -1,4 +1,5 @@
1
  openenv-core[core]>=0.2.1
2
  fastapi>=0.104.0
3
  uvicorn>=0.24.0
4
- pydantic>=2.0.0
 
 
1
  openenv-core[core]>=0.2.1
2
  fastapi>=0.104.0
3
  uvicorn>=0.24.0
4
+ pydantic>=2.0.0
5
+ openai>=1.0.0
server/app.py CHANGED
@@ -1,4 +1,22 @@
1
  import uvicorn
 
 
 
 
 
2
  from openenv.core.env_server import create_fastapi_app
3
 
4
  try:
@@ -8,10 +26,505 @@ except ImportError:
8
  from clinical_trial_auditor_environment import ClinicalTrialAuditorEnvironment
9
  from models import AuditAction, AuditObservation
10
 
 
 
11
  app = create_fastapi_app(ClinicalTrialAuditorEnvironment, AuditAction, AuditObservation)
12
 
13
  def main():
14
  uvicorn.run(app, host="0.0.0.0", port=8000)
15
 
 
16
  if __name__ == "__main__":
17
  main()
 
1
+ """
2
+ ClinicalBench — FastAPI Application
3
+ ====================================
4
+ Serves the OpenEnv API (reset/step/state) and the enterprise dashboard UI.
5
+ """
6
+ import os
7
+ import sys
8
+ import json
9
+ import re
10
+ from pathlib import Path
11
+ from datetime import datetime
12
+ from typing import Optional
13
+
14
  import uvicorn
15
+ from fastapi import FastAPI
16
+ from fastapi.staticfiles import StaticFiles
17
+ from fastapi.responses import FileResponse, JSONResponse
18
+ from pydantic import BaseModel
19
+
20
  from openenv.core.env_server import create_fastapi_app
21
 
22
  try:
 
26
  from clinical_trial_auditor_environment import ClinicalTrialAuditorEnvironment
27
  from models import AuditAction, AuditObservation
28
 
29
+
30
+ # ─── Create the standard OpenEnv app ───
31
  app = create_fastapi_app(ClinicalTrialAuditorEnvironment, AuditAction, AuditObservation)
32
 
33
+
34
+ # ─── Mount static files ───
35
+ STATIC_DIR = Path(__file__).parent / "static"
36
+ if STATIC_DIR.exists():
37
+ app.mount("/static", StaticFiles(directory=str(STATIC_DIR)), name="static")
38
+
39
+
40
+ # ─── Dashboard root route ───
41
+ @app.get("/", include_in_schema=False)
42
+ async def dashboard():
43
+ index = STATIC_DIR / "index.html"
44
+ if index.exists():
45
+ return FileResponse(str(index), media_type="text/html")
46
+ return JSONResponse({"status": "ok", "message": "ClinicalBench environment running"})
47
+
48
+
49
+ # ─── Internal environment instance for UI API ───
50
+ _ui_env = ClinicalTrialAuditorEnvironment()
51
+
52
+
53
+ # ─── Pydantic models for UI API ───
54
+ class ResetRequest(BaseModel):
55
+ task_id: str = "task_easy"
56
+ seed: Optional[int] = None
57
+
58
+ class PlanRequest(BaseModel):
59
+ agent: str = "full"
60
+ task_id: str = "task_easy"
61
+ seed: Optional[int] = None
62
+
63
+ class StepRequest(BaseModel):
64
+ action_type: str = "investigate_pattern"
65
+ patient_id: Optional[str] = None
66
+ error_type: Optional[str] = None
67
+ reason: Optional[str] = None
68
+ proposed_value: Optional[str] = None
69
+ variable: Optional[str] = None
70
+ report: Optional[str] = None
71
+ confidence: Optional[float] = None
72
+
73
+
74
+ # ─── Protocol parser (mirrors inference.py) ───
75
+ def parse_protocol(excerpt: str) -> dict:
76
+ age = re.search(r"age (\d+)-(\d+) inclusive", excerpt)
77
+ window = re.search(r"Treatment must begin within (\d+) days", excerpt)
78
+ stage = re.search(r"Stage IV exception: treatment may begin within (\d+) days", excerpt)
79
+ bias = re.search(
80
+ r"dominance exceeds (\d+)%, male share exceeds (\d+)%, "
81
+ r"and stage-adjusted mortality gap exceeds (\d+) percentage points",
82
+ excerpt,
83
+ )
84
+ return {
85
+ "age_min": int(age.group(1)) if age else 18,
86
+ "age_max": int(age.group(2)) if age else 120,
87
+ "treatment_window": int(window.group(1)) if window else 21,
88
+ "stage_iv_window": int(stage.group(1)) if stage else 35,
89
+ "bias_dom_threshold": int(bias.group(1)) / 100.0 if bias else 1.0,
90
+ "bias_male_threshold": int(bias.group(2)) / 100.0 if bias else 1.0,
91
+ "bias_gap_threshold": int(bias.group(3)) / 100.0 if bias else 1.0,
92
+ }
93
+
94
+
95
+ # ─── Agent planning: produce action list + reasoning traces ───
96
+ TASK_SPECS = {
97
+ "task_easy": {"investigations": ["age"], "distributions": []},
98
+ "task_medium": {"investigations": ["age", "death_date", "enrollment_date", "stage"], "distributions": []},
99
+ "task_hard": {"investigations": ["age", "death_date", "enrollment_date", "stage"], "distributions": ["ethnicity", "gender", "outcome"]},
100
+ }
101
+
102
+
103
+ def plan_naive(dataset, rules, task_id):
104
+ """Naive agent: minimal investigation, samples a few patients, guesses."""
105
+ spec = TASK_SPECS.get(task_id, TASK_SPECS["task_easy"])
106
+ actions = []
107
+ traces = []
108
+
109
+ for v in spec["investigations"]:
110
+ actions.append({"action_type": "investigate_pattern", "variable": v})
111
+ traces.append({"thought": f"I'll quickly scan {v}.", "tool": f"investigate({v})"})
112
+
113
+ if task_id == "task_hard":
114
+ for v in spec["distributions"]:
115
+ actions.append({"action_type": "compute_distribution", "variable": v})
116
+ traces.append({"thought": f"Compute {v} distribution.", "tool": f"distribution({v})"})
117
+
118
+ # Only check first 24 patients with fixed 18-120 rule (intentionally wrong)
119
+ sample = dataset[:24]
120
+ for row in sample:
121
+ age = row.get("age")
122
+ if age is None or age < 0 or age > 120:
123
+ actions.append({
124
+ "action_type": "flag_error", "patient_id": row.get("patient_id"),
125
+ "error_type": "invalid_age", "reason": "Obvious age anomaly",
126
+ "confidence": 0.55
127
+ })
128
+ traces.append({
129
+ "thought": f"Patient {row.get('patient_id')} has age {age}, seems wrong.",
130
+ "tool": "flag_error"
131
+ })
132
+
133
+ actions.append({
134
+ "action_type": "submit_report",
135
+ "report": "Quick sample review. Found possible age issues. Recommend manual review and corrective action."
136
+ })
137
+ traces.append({"thought": "Submitting basic report.", "tool": "submit_report"})
138
+ return actions, traces
139
+
140
+
141
+ def plan_heuristic(dataset, rules, task_id):
142
+ """Heuristic agent: parses rules but ignores stage IV exceptions."""
143
+ spec = TASK_SPECS.get(task_id, TASK_SPECS["task_easy"])
144
+ actions = []
145
+ traces = []
146
+
147
+ for v in spec["investigations"]:
148
+ actions.append({"action_type": "investigate_pattern", "variable": v})
149
+ traces.append({"thought": f"Investigating {v} distribution.", "tool": f"investigate({v})"})
150
+
151
+ if task_id == "task_hard":
152
+ for v in spec["distributions"]:
153
+ actions.append({"action_type": "compute_distribution", "variable": v})
154
+ traces.append({"thought": f"Computing {v} breakdown.", "tool": f"distribution({v})"})
155
+
156
+ # Age check — but uses overly loose threshold
157
+ for row in dataset:
158
+ age = row.get("age")
159
+ if age is None or age < (rules["age_min"] - 3) or age > (rules["age_max"] + 3):
160
+ actions.append({
161
+ "action_type": "flag_error", "patient_id": row.get("patient_id"),
162
+ "error_type": "invalid_age",
163
+ "reason": f"Heuristic age screen: {age} outside ~{rules['age_min']}-{rules['age_max']}",
164
+ "confidence": 0.82
165
+ })
166
+ traces.append({
167
+ "thought": f"Age {age} looks suspicious, flagging.",
168
+ "tool": "flag_error"
169
+ })
170
+
171
+ # Temporal β€” always catches these
172
+ for row in dataset:
173
+ ts = row.get("treatment_start")
174
+ dd = row.get("death_date")
175
+ if ts and dd:
176
+ try:
177
+ t = datetime.strptime(ts, "%Y-%m-%d")
178
+ d = datetime.strptime(dd, "%Y-%m-%d")
179
+ if d < t:
180
+ actions.append({
181
+ "action_type": "flag_error", "patient_id": row.get("patient_id"),
182
+ "error_type": "temporal_inconsistency",
183
+ "reason": f"Death before treatment by {(t-d).days} days",
184
+ "confidence": 0.90
185
+ })
186
+ traces.append({
187
+ "thought": f"Death before treatment β€” clear violation.",
188
+ "tool": "flag_error"
189
+ })
190
+ except ValueError:
191
+ pass
192
+
193
+ # Window — ignores stage IV exception (intentional weakness)
194
+ if task_id in ("task_medium", "task_hard"):
195
+ for row in dataset:
196
+ try:
197
+ e = datetime.strptime(row.get("enrollment_date",""), "%Y-%m-%d")
198
+ t = datetime.strptime(row.get("treatment_start",""), "%Y-%m-%d")
199
+ delay = (t - e).days
200
+ if delay > rules["treatment_window"]: # Uses standard window for ALL stages
201
+ actions.append({
202
+ "action_type": "flag_error", "patient_id": row.get("patient_id"),
203
+ "error_type": "protocol_window_violation",
204
+ "reason": f"Treatment delay {delay}d > {rules['treatment_window']}d",
205
+ "confidence": 0.80
206
+ })
207
+ traces.append({
208
+ "thought": f"Delay {delay}d exceeds window β€” flagging (ignoring stage exception).",
209
+ "tool": "flag_error"
210
+ })
211
+ except (ValueError, TypeError):
212
+ pass
213
+
214
+ # Bias — uses overall gap, not stage-adjusted
215
+ if task_id == "task_hard":
216
+ control = [r for r in dataset if r.get("group") == "control"]
217
+ if control:
218
+ from collections import Counter
219
+ eth_counts = Counter(r.get("ethnicity","?") for r in control)
220
+ dom_eth, dom_count = eth_counts.most_common(1)[0]
221
+ dom_ratio = dom_count / len(control)
222
+ dom_group = [r for r in control if r.get("ethnicity") == dom_eth]
223
+ min_group = [r for r in control if r.get("ethnicity") != dom_eth]
224
+ dom_mort = sum(r.get("outcome")=="deceased" for r in dom_group)/max(1,len(dom_group))
225
+ min_mort = sum(r.get("outcome")=="deceased" for r in min_group)/max(1,len(min_group))
226
+ gap = min_mort - dom_mort
227
+ if dom_ratio >= max(0.55, rules["bias_dom_threshold"]-0.07) and gap >= 0.10:
228
+ actions.append({
229
+ "action_type": "flag_error", "error_type": "selection_bias",
230
+ "reason": f"Heuristic bias: {dom_eth}={dom_ratio:.0%}, gap={gap:.0%}",
231
+ "confidence": 0.74
232
+ })
233
+ traces.append({
234
+ "thought": "Overall mortality gap looks suspicious β€” flagging bias (not stage-adjusted).",
235
+ "tool": "flag_error(selection_bias)"
236
+ })
237
+
238
+ actions.append({
239
+ "action_type": "submit_report",
240
+ "report": "Heuristic protocol review. Root cause likely data-entry drift. Recommend validation checks. Risk moderate to high."
241
+ })
242
+ traces.append({"thought": "Submitting heuristic report.", "tool": "submit_report"})
243
+
244
+ return actions, traces
245
+
246
+
247
+ def plan_full(dataset, rules, task_id):
248
+ """Reasoning agent: full protocol parsing, stage-aware exceptions, structured workflow."""
249
+ spec = TASK_SPECS.get(task_id, TASK_SPECS["task_easy"])
250
+ actions = []
251
+ traces = []
252
+
253
+ # Phase 1: Protocol comprehension
254
+ traces.append({
255
+ "thought": "I need to parse the protocol excerpt to understand episode-specific eligibility and timing rules. I must not assume default ranges.",
256
+ "tool": "parse_protocol(excerpt)"
257
+ })
258
+ actions.append({"action_type": "investigate_pattern", "variable": spec["investigations"][0]})
259
+
260
+ # Phase 2: Systematic investigation
261
+ for v in spec["investigations"]:
262
+ thoughts = {
263
+ "age": f"Analyzing age distribution against protocol range {rules['age_min']}-{rules['age_max']}. Will flag patients outside this specific range.",
264
+ "death_date": "Checking temporal consistency: death_date must never precede treatment_start.",
265
+ "enrollment_date": f"Verifying treatment scheduling: standard window ≀{rules['treatment_window']}d, Stage IV exception ≀{rules['stage_iv_window']}d.",
266
+ "stage": "Reviewing stage distribution. Stage IV patients have extended treatment windows β€” must not false-flag them.",
267
+ }
268
+ if v == spec["investigations"][0]:
269
+ traces[-1]["thought"] = thoughts.get(v, f"Investigating {v}.")
270
+ else:
271
+ traces.append({"thought": thoughts.get(v, f"Investigating {v}."), "tool": f"analyze_{v}_distribution()"})
272
+ actions.append({"action_type": "investigate_pattern", "variable": v})
273
+
274
+ # Extra context investigations
275
+ extras = {
276
+ "task_easy": ["enrollment_date", "stage", "group", "treatment_site", "country"],
277
+ "task_medium": ["group", "treatment_site", "outcome", "country", "drug"],
278
+ "task_hard": ["treatment_site", "group", "country", "drug", "trial_phase"],
279
+ }
280
+ for v in extras.get(task_id, []):
281
+ actions.append({"action_type": "investigate_pattern", "variable": v})
282
+ traces.append({"thought": f"Gathering context: {v}.", "tool": f"investigate({v})"})
283
+
284
+ # Distributions for hard task
285
+ if task_id == "task_hard":
286
+ for v in spec["distributions"]:
287
+ actions.append({"action_type": "compute_distribution", "variable": v})
288
+ traces.append({
289
+ "thought": f"Computing {v} distribution in control arm for equity analysis. Must compare within stage strata, not overall.",
290
+ "tool": f"compute_group_distribution({v})"
291
+ })
292
+
293
+ # Phase 3: Protocol-aware detection
294
+ # Age
295
+ age_flags = []
296
+ for row in dataset:
297
+ age = row.get("age")
298
+ if age is None or age < rules["age_min"] or age > rules["age_max"]:
299
+ age_flags.append(row)
300
+ for row in age_flags:
301
+ age = row.get("age")
302
+ conf = 0.98 if age is None or (isinstance(age,int) and (age < 0 or age > rules["age_max"]+10)) else 0.94
303
+ actions.append({
304
+ "action_type": "flag_error", "patient_id": row.get("patient_id"),
305
+ "error_type": "invalid_age",
306
+ "reason": f"Age {age} violates protocol range {rules['age_min']}-{rules['age_max']}",
307
+ "confidence": conf
308
+ })
309
+ traces.append({
310
+ "thought": f"Patient {row['patient_id']}: age={age} is outside protocol range [{rules['age_min']}, {rules['age_max']}]. Flagging.",
311
+ "tool": "flag_error(invalid_age)"
312
+ })
313
+
314
+ # Temporal
315
+ for row in dataset:
316
+ ts = row.get("treatment_start")
317
+ dd = row.get("death_date")
318
+ if ts and dd:
319
+ try:
320
+ t = datetime.strptime(ts, "%Y-%m-%d")
321
+ d = datetime.strptime(dd, "%Y-%m-%d")
322
+ if d < t:
323
+ gap = (t-d).days
324
+ actions.append({
325
+ "action_type": "flag_error", "patient_id": row.get("patient_id"),
326
+ "error_type": "temporal_inconsistency",
327
+ "reason": f"death_date precedes treatment_start by {gap} days",
328
+ "confidence": min(1.0, 0.92 + gap/500)
329
+ })
330
+ traces.append({
331
+ "thought": f"Patient {row['patient_id']}: death occurred {gap}d before treatment β€” impossible temporal ordering.",
332
+ "tool": "flag_error(temporal_inconsistency)"
333
+ })
334
+ except ValueError:
335
+ pass
336
+
337
+ # Protocol window — STAGE-AWARE (distinguishes from heuristic)
338
+ if task_id in ("task_medium", "task_hard"):
339
+ for row in dataset:
340
+ try:
341
+ e = datetime.strptime(row.get("enrollment_date",""), "%Y-%m-%d")
342
+ t = datetime.strptime(row.get("treatment_start",""), "%Y-%m-%d")
343
+ delay = (t - e).days
344
+ allowed = rules["stage_iv_window"] if row.get("stage") == "IV" else rules["treatment_window"]
345
+ if delay > allowed:
346
+ actions.append({
347
+ "action_type": "flag_error", "patient_id": row.get("patient_id"),
348
+ "error_type": "protocol_window_violation",
349
+ "reason": f"Treatment started after {delay}d; protocol allows {allowed}d for stage {row.get('stage','')}",
350
+ "confidence": 0.93 if delay > allowed + 3 else 0.82
351
+ })
352
+ traces.append({
353
+ "thought": f"Patient {row['patient_id']}: delay={delay}d, allowed={allowed}d (stage {row.get('stage','')}). Exceeds window.",
354
+ "tool": "flag_error(protocol_window_violation)"
355
+ })
356
+ except (ValueError, TypeError):
357
+ pass
358
+
359
+ # Bias — STAGE-ADJUSTED (distinguishes from heuristic)
360
+ if task_id == "task_hard":
361
+ control = [r for r in dataset if r.get("group") == "control"]
362
+ if control:
363
+ from collections import Counter
364
+ eth_counts = Counter(r.get("ethnicity","?") for r in control)
365
+ dom_eth, dom_count = eth_counts.most_common(1)[0]
366
+ dom_ratio = dom_count / len(control)
367
+ male_ratio = sum(r.get("gender")=="M" for r in control) / len(control)
368
+
369
+ # Stage-adjusted gap
370
+ weighted_gap = 0
371
+ total_weight = 0
372
+ for stg in ("I","II","III","IV"):
373
+ stg_rows = [r for r in control if r.get("stage") == stg]
374
+ dom_rows = [r for r in stg_rows if r.get("ethnicity") == dom_eth]
375
+ min_rows = [r for r in stg_rows if r.get("ethnicity") != dom_eth]
376
+ if len(dom_rows) >= 5 and len(min_rows) >= 5:
377
+ d_m = sum(r.get("outcome")=="deceased" for r in dom_rows)/len(dom_rows)
378
+ m_m = sum(r.get("outcome")=="deceased" for r in min_rows)/len(min_rows)
379
+ w = len(stg_rows)
380
+ weighted_gap += (m_m - d_m) * w
381
+ total_weight += w
382
+
383
+ adj_gap = weighted_gap / total_weight if total_weight else 0.0
384
+
385
+ traces.append({
386
+ "thought": f"Stage-adjusted bias analysis: {dom_eth}={dom_ratio:.0%}, male={male_ratio:.0%}, stage-adjusted gap={adj_gap:.0%}. "
387
+ f"Thresholds: domβ‰₯{rules['bias_dom_threshold']:.0%}, maleβ‰₯{rules['bias_male_threshold']:.0%}, gapβ‰₯{rules['bias_gap_threshold']:.0%}.",
388
+ "tool": "evaluate_control_arm_equity(stage_adjusted=True)"
389
+ })
390
+
391
+ if (dom_ratio >= rules["bias_dom_threshold"] and
392
+ male_ratio >= rules["bias_male_threshold"] and
393
+ adj_gap >= rules["bias_gap_threshold"]):
394
+ actions.append({
395
+ "action_type": "flag_error", "error_type": "selection_bias",
396
+ "reason": f"Control-arm skew: {dom_eth}={dom_ratio:.0%}, male={male_ratio:.0%}, stage-adjusted gap={adj_gap:.0%}",
397
+ "confidence": 0.92
398
+ })
399
+ traces.append({
400
+ "thought": "All three bias thresholds exceeded after stage adjustment. This is genuine selection bias, not a confounder.",
401
+ "tool": "flag_error(selection_bias)"
402
+ })
403
+ else:
404
+ # Trace-only entry: record the reasoning, but append no flag action
405
+ traces.append({
406
+ "thought": "Stage-adjusted gap is below threshold. The apparent disparity is explained by confounding variables (e.g., stage distribution). No actionable bias.",
407
+ "tool": "β€” (no flag)"
408
+ })
409
+
410
+ # Report
411
+ has_bias = any(a.get("error_type") == "selection_bias" for a in actions)
412
+ fairness = ("control-arm bias confirmed via stage-stratified analysis"
413
+ if has_bias else
414
+ "no actionable bias after stage-adjusted review β€” apparent disparities explained by confounders")
415
+ actions.append({
416
+ "action_type": "submit_report",
417
+ "report": (
418
+ f"Protocol-grounded audit for this episode. "
419
+ f"Root cause analysis: site-level data capture and scheduling control weaknesses. "
420
+ f"Risk assessment: protocol compliance and endpoint validity affected. "
421
+ f"Recommended corrective actions: quarantine impacted records, tighten enrollment-to-treatment validations, "
422
+ f"retrain site coordinators. Fairness review: {fairness}. "
423
+ f"Impact: patient safety and regulatory compliance require immediate attention."
424
+ )
425
+ })
426
+ traces.append({
427
+ "thought": "Compiling audit report with protocol grounding, root cause, risk assessment, corrective actions, and fairness reasoning.",
428
+ "tool": "submit_report"
429
+ })
430
+
431
+ return actions, traces
432
+
433
+
434
+ # Limit total actions to max_steps
435
+ def trim_actions(actions, traces, max_steps):
436
+ """Ensure we don't exceed the step budget."""
437
+ if len(actions) <= max_steps:
438
+ return actions, traces
439
+ # Keep investigations/distributions, trim flags from middle
440
+ non_flags = [(i,a,t) for i,(a,t) in enumerate(zip(actions,traces)) if a.get("action_type") not in ("flag_error",)]
441
+ flags = [(i,a,t) for i,(a,t) in enumerate(zip(actions,traces)) if a.get("action_type") == "flag_error"]
442
+ report = [(i,a,t) for i,(a,t) in enumerate(zip(actions,traces)) if a.get("action_type") == "submit_report"]
443
+
444
+ # Remove report from non_flags to add back at end
445
+ non_flags_no_report = [x for x in non_flags if x[1].get("action_type") != "submit_report"]
446
+
447
+ budget = max_steps - len(non_flags_no_report) - len(report)
448
+ trimmed_flags = flags[:max(0, budget)]
449
+
450
+ combined = non_flags_no_report + trimmed_flags + report
451
+ combined.sort(key=lambda x: x[0])
452
+
453
+ return [a for _,a,_ in combined], [t for _,_,t in combined]
454
+
455
+
456
+ # ─── UI API Endpoints ───
457
+
458
+ @app.post("/api/audit/reset")
459
+ async def api_reset(req: ResetRequest):
460
+ obs = _ui_env.reset(seed=req.seed, task_id=req.task_id)
461
+ obs_dict = obs.model_dump()
462
+ # Don't send full dataset to client to keep response small
463
+ dataset_summary = {
464
+ "count": len(obs_dict.get("dataset", [])),
465
+ "sample": obs_dict.get("dataset", [])[:5],
466
+ }
467
+ return {
468
+ "observation": {
469
+ **{k: v for k, v in obs_dict.items() if k != "dataset"},
470
+ "dataset_count": dataset_summary["count"],
+ "dataset_sample": dataset_summary["sample"],
471
+ },
472
+ "total_errors": _ui_env._state.total_errors,
473
+ }
474
+
475
+
476
+ @app.post("/api/audit/plan")
477
+ async def api_plan(req: PlanRequest):
478
+ """Plan an agent's actions for a task. Returns action list + reasoning traces."""
479
+ # Reset environment to get fresh data
480
+ obs = _ui_env.reset(seed=req.seed, task_id=req.task_id)
481
+ obs_dict = obs.model_dump()
482
+ dataset = obs_dict.get("dataset", [])
483
+ excerpt = obs_dict.get("trial_protocol_excerpt", "")
484
+ rules = parse_protocol(excerpt)
485
+ max_steps = obs_dict.get("attempts_remaining", 20)
486
+
487
+ planners = {"naive": plan_naive, "heuristic": plan_heuristic, "full": plan_full}
488
+ planner = planners.get(req.agent, plan_full)
489
+ actions, traces = planner(dataset, rules, req.task_id)
490
+ actions, traces = trim_actions(actions, traces, max_steps)
491
+
492
+ return {"actions": actions, "traces": traces, "max_steps": max_steps}
493
+
494
+
495
+ @app.post("/api/audit/step")
496
+ async def api_step(req: StepRequest):
497
+ """Execute a single step in the current episode."""
498
+ action = AuditAction(
499
+ action_type=req.action_type,
500
+ patient_id=req.patient_id,
501
+ error_type=req.error_type,
502
+ reason=req.reason,
503
+ proposed_value=req.proposed_value,
504
+ variable=req.variable,
505
+ report=req.report,
506
+ confidence=req.confidence,
507
+ )
508
+ obs = _ui_env.step(action)
509
+ obs_dict = obs.model_dump()
510
+ # Don't send dataset back on each step
511
+ return {"observation": {k: v for k, v in obs_dict.items() if k != "dataset"}}
512
+
513
+
514
+ @app.get("/api/tasks")
515
+ async def api_tasks():
516
+ return {
517
+ "tasks": [
518
+ {"id": "task_easy", "name": "Dynamic Eligibility Screening", "difficulty": "easy", "patients": "~300"},
519
+ {"id": "task_medium", "name": "Protocol Timeline Audit", "difficulty": "medium", "patients": "~480"},
520
+ {"id": "task_hard", "name": "Equity + Protocol Audit", "difficulty": "hard", "patients": "~720"},
521
+ ]
522
+ }
523
+
524
+
525
  def main():
526
  uvicorn.run(app, host="0.0.0.0", port=8000)
527
 
528
+
529
  if __name__ == "__main__":
530
  main()
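The step-budget trimming in `trim_actions` above (keep investigations and the final report, drop surplus `flag_error` entries while preserving order) can be sketched as a standalone function. The action dicts below are simplified stand-ins for the real `AuditAction` payloads:

```python
def trim_to_budget(actions, max_steps):
    """Keep non-flag actions (including the report); trim excess flag_error
    entries so the total action count fits within max_steps.

    Original ordering is preserved; only 'flag_error' actions are dropped.
    """
    if len(actions) <= max_steps:
        return actions
    non_flags = [(i, a) for i, a in enumerate(actions)
                 if a["action_type"] != "flag_error"]
    flags = [(i, a) for i, a in enumerate(actions)
             if a["action_type"] == "flag_error"]
    budget = max_steps - len(non_flags)
    kept = non_flags + flags[:max(0, budget)]
    kept.sort(key=lambda x: x[0])  # restore submission order
    return [a for _, a in kept]

acts = (
    [{"action_type": "investigate"}]
    + [{"action_type": "flag_error"} for _ in range(5)]
    + [{"action_type": "submit_report"}]
)
trimmed = trim_to_budget(acts, 4)
# 4 actions survive: the investigation, two flags, and the report (still last)
```

This mirrors the design choice in the diff: investigations and the report are never sacrificed to the step budget, because losing the `submit_report` action would zero out the workflow component of the score.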
server/static/index.html ADDED
@@ -0,0 +1,818 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>ClinicalBench β€” Agentic Clinical Trial Audit Benchmark</title>
7
+ <meta name="description" content="A benchmark for evaluating agentic reasoning in safety-critical clinical workflows. OpenEnv environment for Phase III oncology trial auditing.">
8
+ <link rel="preconnect" href="https://fonts.googleapis.com">
9
+ <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
10
+ <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&family=JetBrains+Mono:wght@400;500;600&display=swap" rel="stylesheet">
11
+ <style>
12
+ *,*::before,*::after{box-sizing:border-box;margin:0;padding:0}
13
+ :root{
14
+ --bg-root:#060a13;
15
+ --bg-surface:#0c1120;
16
+ --bg-card:#111827;
17
+ --bg-card-hover:#161d2e;
18
+ --border:rgba(255,255,255,0.06);
19
+ --border-accent:rgba(59,130,246,0.25);
20
+ --text-primary:#f1f5f9;
21
+ --text-secondary:#94a3b8;
22
+ --text-muted:#64748b;
23
+ --accent-blue:#3b82f6;
24
+ --accent-green:#10b981;
25
+ --accent-gradient:linear-gradient(135deg,#3b82f6,#10b981);
26
+ --accent-gradient-h:linear-gradient(90deg,#3b82f6,#10b981);
27
+ --danger:#ef4444;
28
+ --warning:#f59e0b;
29
+ --success:#10b981;
30
+ --info:#3b82f6;
31
+ --font-sans:'Inter',system-ui,-apple-system,sans-serif;
32
+ --font-mono:'JetBrains Mono',ui-monospace,monospace;
33
+ --radius:10px;
34
+ --radius-sm:6px;
35
+ --radius-lg:14px;
36
+ --shadow:0 4px 24px rgba(0,0,0,0.4);
37
+ --glow-blue:0 0 20px rgba(59,130,246,0.15);
38
+ --glow-green:0 0 20px rgba(16,185,129,0.15);
39
+ }
40
+ html,body{height:100%;overflow:hidden;background:var(--bg-root);color:var(--text-primary);font-family:var(--font-sans)}
41
+ body{display:flex;flex-direction:column}
42
+
43
+ /* ─── HEADER ─── */
44
+ .header{
45
+ display:flex;align-items:center;justify-content:space-between;
46
+ padding:12px 24px;
47
+ background:var(--bg-surface);
48
+ border-bottom:1px solid var(--border);
49
+ flex-shrink:0;
50
+ position:relative;
51
+ z-index:10;
52
+ }
53
+ .header::after{
54
+ content:'';position:absolute;bottom:0;left:0;right:0;height:1px;
55
+ background:var(--accent-gradient-h);opacity:0.4;
56
+ }
57
+ .header-brand{display:flex;align-items:center;gap:12px}
58
+ .header-logo{
59
+ width:36px;height:36px;border-radius:8px;
60
+ background:var(--accent-gradient);
61
+ display:flex;align-items:center;justify-content:center;
62
+ font-size:18px;font-weight:800;color:#fff;
63
+ box-shadow:var(--glow-blue);
64
+ }
65
+ .header-title{font-size:16px;font-weight:700;letter-spacing:-0.02em}
66
+ .header-subtitle{font-size:11px;color:var(--text-muted);font-weight:500;letter-spacing:0.03em;text-transform:uppercase}
67
+ .header-badge{
68
+ padding:4px 10px;border-radius:20px;font-size:10px;font-weight:600;
69
+ background:rgba(16,185,129,0.12);color:var(--accent-green);
70
+ border:1px solid rgba(16,185,129,0.2);
71
+ letter-spacing:0.04em;text-transform:uppercase;
72
+ }
73
+ .header-meta{display:flex;align-items:center;gap:16px}
74
+ .header-stat{text-align:right}
75
+ .header-stat-val{font-size:13px;font-weight:600;font-family:var(--font-mono);color:var(--text-primary)}
76
+ .header-stat-label{font-size:10px;color:var(--text-muted);text-transform:uppercase;letter-spacing:0.05em}
77
+
78
+ /* ─── MAIN GRID ─── */
79
+ .main{flex:1;display:grid;grid-template-columns:280px 1fr 300px;gap:0;overflow:hidden}
80
+
81
+ /* ─── PANELS ─── */
82
+ .panel{
83
+ display:flex;flex-direction:column;overflow:hidden;
84
+ border-right:1px solid var(--border);
85
+ background:var(--bg-surface);
86
+ }
87
+ .panel:last-child{border-right:none}
88
+ .panel-header{
89
+ padding:14px 18px;
90
+ border-bottom:1px solid var(--border);
91
+ flex-shrink:0;
92
+ }
93
+ .panel-header h2{
94
+ font-size:11px;font-weight:600;text-transform:uppercase;
95
+ letter-spacing:0.08em;color:var(--text-muted);
96
+ display:flex;align-items:center;gap:8px;
97
+ }
98
+ .panel-header h2 .dot{
99
+ width:6px;height:6px;border-radius:50%;
100
+ background:var(--accent-green);
101
+ box-shadow:0 0 6px var(--accent-green);
102
+ animation:pulse-dot 2s ease-in-out infinite;
103
+ }
104
+ @keyframes pulse-dot{0%,100%{opacity:1}50%{opacity:0.4}}
105
+ .panel-body{flex:1;overflow-y:auto;padding:14px 18px}
106
+ .panel-body::-webkit-scrollbar{width:4px}
107
+ .panel-body::-webkit-scrollbar-track{background:transparent}
108
+ .panel-body::-webkit-scrollbar-thumb{background:rgba(255,255,255,0.1);border-radius:4px}
109
+
110
+ /* ─── LEFT PANEL: PROTOCOL ─── */
111
+ .protocol-card{
112
+ background:var(--bg-card);border:1px solid var(--border);
113
+ border-radius:var(--radius);padding:14px;margin-bottom:12px;
114
+ }
115
+ .protocol-card-title{
116
+ font-size:10px;font-weight:600;color:var(--text-muted);
117
+ text-transform:uppercase;letter-spacing:0.06em;margin-bottom:8px;
118
+ }
119
+ .protocol-id{
120
+ font-family:var(--font-mono);font-size:14px;font-weight:600;
121
+ background:var(--accent-gradient);-webkit-background-clip:text;
122
+ -webkit-text-fill-color:transparent;margin-bottom:4px;
123
+ }
124
+ .protocol-excerpt{
125
+ font-family:var(--font-mono);font-size:11px;line-height:1.65;
126
+ color:var(--text-secondary);white-space:pre-wrap;word-break:break-word;
127
+ }
128
+ .protocol-excerpt .hl-rule{
129
+ color:var(--accent-green);font-weight:600;
130
+ background:rgba(16,185,129,0.08);padding:1px 3px;border-radius:3px;
131
+ }
132
+ .protocol-excerpt .hl-danger{
133
+ color:var(--danger);font-weight:600;
134
+ }
135
+ .episode-meta{
136
+ display:grid;grid-template-columns:1fr 1fr;gap:8px;margin-top:12px;
137
+ }
138
+ .meta-chip{
139
+ background:var(--bg-card);border:1px solid var(--border);
140
+ border-radius:var(--radius-sm);padding:8px 10px;
141
+ }
142
+ .meta-chip-label{font-size:9px;color:var(--text-muted);text-transform:uppercase;letter-spacing:0.06em}
143
+ .meta-chip-value{font-size:13px;font-weight:600;font-family:var(--font-mono);margin-top:2px}
144
+
145
+ /* ─── CENTER PANEL: LIVE FEED ─── */
146
+ .controls{
147
+ display:flex;gap:10px;align-items:center;
148
+ padding:14px 18px;border-bottom:1px solid var(--border);
149
+ flex-shrink:0;
150
+ }
151
+ .control-select{
152
+ flex:1;padding:8px 12px;border-radius:var(--radius-sm);
153
+ background:var(--bg-card);border:1px solid var(--border);
154
+ color:var(--text-primary);font-family:var(--font-sans);font-size:12px;
155
+ cursor:pointer;appearance:none;
156
+ background-image:url("data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' width='12' height='12' fill='%2394a3b8'%3E%3Cpath d='M2 4l4 4 4-4'/%3E%3C/svg%3E");
157
+ background-repeat:no-repeat;background-position:right 10px center;
158
+ padding-right:28px;
159
+ }
160
+ .control-select:focus{outline:none;border-color:var(--accent-blue)}
161
+ .btn-start{
162
+ padding:8px 20px;border:none;border-radius:var(--radius-sm);
163
+ background:var(--accent-gradient);color:#fff;font-weight:600;
164
+ font-size:12px;cursor:pointer;position:relative;overflow:hidden;
165
+ transition:transform 0.15s,box-shadow 0.15s;
166
+ box-shadow:var(--glow-blue);font-family:var(--font-sans);
167
+ }
168
+ .btn-start:hover{transform:translateY(-1px);box-shadow:0 0 30px rgba(59,130,246,0.3)}
169
+ .btn-start:active{transform:scale(0.97)}
170
+ .btn-start:disabled{opacity:0.5;cursor:not-allowed;transform:none}
171
+ .btn-start.running{animation:glow-pulse 1.5s ease-in-out infinite}
172
+ @keyframes glow-pulse{0%,100%{box-shadow:var(--glow-blue)}50%{box-shadow:0 0 30px rgba(59,130,246,0.4)}}
173
+
174
+ .feed{flex:1;overflow-y:auto;padding:14px 18px}
175
+ .feed::-webkit-scrollbar{width:4px}
176
+ .feed::-webkit-scrollbar-track{background:transparent}
177
+ .feed::-webkit-scrollbar-thumb{background:rgba(255,255,255,0.1);border-radius:4px}
178
+
179
+ .feed-empty{
180
+ display:flex;flex-direction:column;align-items:center;justify-content:center;
181
+ height:100%;color:var(--text-muted);text-align:center;gap:12px;
182
+ }
183
+ .feed-empty-icon{font-size:40px;opacity:0.3}
184
+ .feed-empty-text{font-size:13px;line-height:1.5}
185
+
186
+ .log-card{
187
+ background:var(--bg-card);border:1px solid var(--border);
188
+ border-radius:var(--radius-sm);padding:10px 12px;margin-bottom:6px;
189
+ font-family:var(--font-mono);font-size:11px;line-height:1.5;
190
+ animation:card-in 0.25s ease-out;
191
+ border-left:3px solid transparent;
192
+ }
193
+ @keyframes card-in{from{opacity:0;transform:translateY(8px)}to{opacity:1;transform:translateY(0)}}
194
+ .log-card.type-thought{border-left-color:var(--info);color:var(--text-secondary)}
195
+ .log-card.type-tool{border-left-color:#8b5cf6;color:var(--text-secondary)}
196
+ .log-card.type-observe{border-left-color:var(--text-muted);color:var(--text-secondary)}
197
+ .log-card.type-flag-ok{border-left-color:var(--success);color:var(--success)}
198
+ .log-card.type-flag-bad{border-left-color:var(--danger);color:var(--danger)}
199
+ .log-card.type-report{border-left-color:var(--accent-green);color:var(--accent-green)}
200
+ .log-card.type-info{border-left-color:var(--text-muted);color:var(--text-muted)}
201
+ .log-card.type-phase{
202
+ border-left-color:var(--warning);color:var(--warning);
203
+ background:rgba(245,158,11,0.05);
204
+ }
205
+ .log-tag{
206
+ font-weight:600;font-size:10px;text-transform:uppercase;
207
+ letter-spacing:0.04em;margin-right:6px;
208
+ }
209
+ .log-score{
210
+ float:right;font-weight:600;font-size:10px;
211
+ padding:2px 6px;border-radius:3px;
212
+ background:rgba(16,185,129,0.1);color:var(--accent-green);
213
+ }
214
+
215
+ .agent-divider{
216
+ text-align:center;padding:14px 0;font-size:11px;font-weight:600;
217
+ color:var(--text-muted);text-transform:uppercase;letter-spacing:0.08em;
218
+ display:flex;align-items:center;gap:12px;
219
+ }
220
+ .agent-divider::before,.agent-divider::after{
221
+ content:'';flex:1;height:1px;
222
+ background:var(--border);
223
+ }
224
+
225
+ /* ─── RIGHT PANEL: ANALYTICS ─── */
226
+ .gauge-container{
227
+ display:flex;flex-direction:column;align-items:center;
228
+ margin-bottom:16px;
229
+ }
230
+ .gauge-svg{width:180px;height:100px}
231
+ .gauge-label{font-size:10px;color:var(--text-muted);text-transform:uppercase;letter-spacing:0.06em;margin-top:4px}
232
+ .gauge-value{font-size:28px;font-weight:700;font-family:var(--font-mono)}
233
+
234
+ .mini-gauges{display:grid;grid-template-columns:1fr 1fr;gap:10px;margin-bottom:18px}
235
+ .mini-gauge{
236
+ background:var(--bg-card);border:1px solid var(--border);
237
+ border-radius:var(--radius-sm);padding:10px;text-align:center;
238
+ }
239
+ .mini-gauge-label{font-size:9px;color:var(--text-muted);text-transform:uppercase;letter-spacing:0.06em}
240
+ .mini-gauge-value{font-size:18px;font-weight:700;font-family:var(--font-mono);margin-top:4px}
241
+ .mini-gauge-bar{
242
+ height:3px;border-radius:2px;background:rgba(255,255,255,0.06);
243
+ margin-top:6px;overflow:hidden;
244
+ }
245
+ .mini-gauge-fill{height:100%;border-radius:2px;transition:width 0.6s ease}
246
+
247
+ .comparison-card{
248
+ background:var(--bg-card);border:1px solid var(--border);
249
+ border-radius:var(--radius);padding:14px;margin-bottom:12px;
250
+ }
251
+ .comparison-title{
252
+ font-size:10px;font-weight:600;color:var(--text-muted);
253
+ text-transform:uppercase;letter-spacing:0.06em;margin-bottom:12px;
254
+ }
255
+ .bar-row{display:flex;align-items:center;gap:10px;margin-bottom:8px}
256
+ .bar-label{font-size:11px;font-family:var(--font-mono);min-width:72px;color:var(--text-secondary)}
257
+ .bar-track{flex:1;height:18px;background:rgba(255,255,255,0.04);border-radius:3px;overflow:hidden;position:relative}
258
+ .bar-fill{height:100%;border-radius:3px;transition:width 1s ease;position:relative}
259
+ .bar-fill.naive{background:linear-gradient(90deg,#ef4444,#f97316);width:10%}
260
+ .bar-fill.heuristic{background:linear-gradient(90deg,#f59e0b,#eab308);width:60%}
261
+ .bar-fill.full{background:var(--accent-gradient-h);width:98%}
262
+ .bar-val{
263
+ font-size:10px;font-weight:600;font-family:var(--font-mono);
264
+ min-width:32px;text-align:right;
265
+ }
266
+
267
+ .task-results-table{width:100%;border-collapse:collapse;margin-top:10px}
268
+ .task-results-table th{
269
+ font-size:9px;color:var(--text-muted);text-transform:uppercase;
270
+ letter-spacing:0.06em;text-align:right;padding:4px 6px;
271
+ border-bottom:1px solid var(--border);font-weight:600;
272
+ }
273
+ .task-results-table th:first-child{text-align:left}
274
+ .task-results-table td{
275
+ font-size:11px;font-family:var(--font-mono);padding:5px 6px;
276
+ text-align:right;border-bottom:1px solid rgba(255,255,255,0.03);
277
+ }
278
+ .task-results-table td:first-child{text-align:left;color:var(--text-secondary);font-family:var(--font-sans);font-weight:500}
279
+ .score-high{color:var(--accent-green)}
280
+ .score-mid{color:var(--warning)}
281
+ .score-low{color:var(--danger)}
282
+
283
+ .insight-box{
284
+ background:rgba(59,130,246,0.05);border:1px solid rgba(59,130,246,0.15);
285
+ border-radius:var(--radius-sm);padding:10px 12px;margin-top:12px;
286
+ font-size:11px;line-height:1.55;color:var(--text-secondary);
287
+ }
288
+ .insight-box strong{color:var(--text-primary)}
289
+
290
+ /* ─── STATUS BAR ─── */
291
+ .status-bar{
292
+ display:flex;align-items:center;justify-content:space-between;
293
+ padding:6px 24px;background:var(--bg-root);border-top:1px solid var(--border);
294
+ font-size:10px;color:var(--text-muted);flex-shrink:0;
295
+ font-family:var(--font-mono);
296
+ }
297
+ .status-dot{
298
+ display:inline-block;width:6px;height:6px;border-radius:50%;
299
+ margin-right:6px;
300
+ }
301
+ .status-dot.online{background:var(--accent-green);box-shadow:0 0 6px var(--accent-green)}
302
+ .status-dot.offline{background:var(--danger)}
303
+
304
+ /* ─── RESPONSIVE ─── */
305
+ @media(max-width:1200px){
306
+ .main{grid-template-columns:240px 1fr 260px}
307
+ }
308
+ @media(max-width:900px){
309
+ .main{grid-template-columns:1fr;grid-template-rows:auto 1fr auto}
310
+ .panel{border-right:none;border-bottom:1px solid var(--border)}
311
+ }
312
+ </style>
313
+ </head>
314
+ <body>
315
+
316
+ <!-- ═══ HEADER ═══ -->
317
+ <header class="header">
318
+ <div class="header-brand">
319
+ <div class="header-logo">CB</div>
320
+ <div>
321
+ <div class="header-title">ClinicalBench</div>
322
+ <div class="header-subtitle">Agentic Clinical Trial Audit Benchmark</div>
323
+ </div>
324
+ <span class="header-badge">OpenEnv v3</span>
325
+ </div>
326
+ <div class="header-meta">
327
+ <div class="header-stat">
328
+ <div class="header-stat-val" id="stat-tasks">3 Tasks</div>
329
+ <div class="header-stat-label">Easy β†’ Hard</div>
330
+ </div>
331
+ <div class="header-stat">
332
+ <div class="header-stat-val" id="stat-patients">300–720</div>
333
+ <div class="header-stat-label">Patients/Episode</div>
334
+ </div>
335
+ <div class="header-stat">
336
+ <div class="header-stat-val" id="stat-seed">β€”</div>
337
+ <div class="header-stat-label">Seed</div>
338
+ </div>
339
+ </div>
340
+ </header>
341
+
342
+ <!-- ═══ MAIN 3-PANEL ═══ -->
343
+ <main class="main">
344
+
345
+ <!-- ─── LEFT: PROTOCOL MANIFEST ─── -->
346
+ <div class="panel" id="panel-protocol">
347
+ <div class="panel-header">
348
+ <h2><span class="dot"></span>Active Episode Protocol</h2>
349
+ </div>
350
+ <div class="panel-body">
351
+ <div class="protocol-card">
352
+ <div class="protocol-card-title">Protocol ID</div>
353
+ <div class="protocol-id" id="proto-id">Awaiting reset()</div>
354
+ </div>
355
+ <div class="protocol-card">
356
+ <div class="protocol-card-title">Trial Protocol Excerpt</div>
357
+ <div class="protocol-excerpt" id="proto-excerpt">
358
+ Start an audit to load the episode-specific protocol.
359
+
360
+ Each episode generates a unique protocol with dynamic rules:
361
+ β€’ Age eligibility ranges change per episode
362
+ β€’ Treatment scheduling windows vary
363
+ β€’ Stage IV exceptions create valid edge cases
364
+ β€’ Bias thresholds are protocol-specific
365
+
366
+ The agent must READ these rules β€” not assume defaults.</div>
367
+ </div>
368
+ <div class="episode-meta">
369
+ <div class="meta-chip">
370
+ <div class="meta-chip-label">Difficulty</div>
371
+ <div class="meta-chip-value" id="meta-difficulty">β€”</div>
372
+ </div>
373
+ <div class="meta-chip">
374
+ <div class="meta-chip-label">Patients</div>
375
+ <div class="meta-chip-value" id="meta-patients">β€”</div>
376
+ </div>
377
+ <div class="meta-chip">
378
+ <div class="meta-chip-label">Max Steps</div>
379
+ <div class="meta-chip-value" id="meta-steps">β€”</div>
380
+ </div>
381
+ <div class="meta-chip">
382
+ <div class="meta-chip-label">Errors</div>
383
+ <div class="meta-chip-value" id="meta-errors">β€”</div>
384
+ </div>
385
+ </div>
386
+ </div>
387
+ </div>
388
+
389
+ <!-- ─── CENTER: LIVE AUDIT TELEMETRY ─── -->
390
+ <div class="panel" id="panel-feed" style="border-right:1px solid var(--border)">
391
+ <div class="panel-header">
392
+ <h2><span class="dot"></span>Live Agent Telemetry</h2>
393
+ </div>
394
+ <div class="controls">
395
+ <select class="control-select" id="sel-agent">
396
+ <option value="all">β–Ά All Agents (Comparison Run)</option>
397
+ <option value="naive">Naive LLM Agent</option>
398
+ <option value="heuristic">Heuristic Agent</option>
399
+ <option value="full">Reasoning Agent (Full)</option>
400
+ </select>
401
+ <select class="control-select" id="sel-task">
402
+ <option value="all">All Tasks</option>
403
+ <option value="task_easy">Easy β€” Eligibility Screening</option>
404
+ <option value="task_medium">Medium β€” Timeline Audit</option>
405
+ <option value="task_hard">Hard β€” Equity + Protocol</option>
406
+ </select>
407
+ <button class="btn-start" id="btn-start" onclick="startAudit()">
408
+ β–Ά Start Audit
409
+ </button>
410
+ </div>
411
+ <div class="feed" id="feed">
412
+ <div class="feed-empty">
413
+ <div class="feed-empty-icon">πŸ”¬</div>
414
+ <div class="feed-empty-text">
415
+ Select an agent and task, then click <strong>Start Audit</strong><br>
416
+ to watch the reasoning loop in real time.<br><br>
417
+ <span style="color:var(--text-muted);font-size:11px">
418
+ The benchmark runs <strong>Naive β†’ Heuristic β†’ Reasoning</strong> agents<br>
419
+ against procedurally generated clinical trial data.
420
+ </span>
421
+ </div>
422
+ </div>
423
+ </div>
424
+ </div>
425
+
426
+ <!-- ─── RIGHT: ANALYTICS ─── -->
427
+ <div class="panel" id="panel-analytics">
428
+ <div class="panel-header">
429
+ <h2><span class="dot"></span>Evaluation Metrics</h2>
430
+ </div>
431
+ <div class="panel-body">
432
+ <!-- Main Score Gauge -->
433
+ <div class="gauge-container">
434
+ <svg class="gauge-svg" viewBox="0 0 200 110">
435
+ <defs>
436
+ <linearGradient id="gaugeGrad" x1="0%" y1="0%" x2="100%" y2="0%">
437
+ <stop offset="0%" stop-color="#ef4444"/>
438
+ <stop offset="40%" stop-color="#f59e0b"/>
439
+ <stop offset="100%" stop-color="#10b981"/>
440
+ </linearGradient>
441
+ </defs>
442
+ <!-- Track -->
443
+ <path d="M 20 100 A 80 80 0 0 1 180 100" fill="none" stroke="rgba(255,255,255,0.06)" stroke-width="10" stroke-linecap="round"/>
444
+ <!-- Fill -->
445
+ <path id="gauge-fill" d="M 20 100 A 80 80 0 0 1 180 100" fill="none" stroke="url(#gaugeGrad)" stroke-width="10" stroke-linecap="round"
446
+ stroke-dasharray="251.3" stroke-dashoffset="251.3" style="transition:stroke-dashoffset 0.8s ease"/>
447
+ <!-- Value -->
448
+ <text x="100" y="85" text-anchor="middle" fill="var(--text-primary)" font-family="var(--font-mono)" font-size="28" font-weight="700" id="gauge-text">0.00</text>
449
+ <text x="100" y="102" text-anchor="middle" fill="var(--text-muted)" font-family="var(--font-sans)" font-size="10" font-weight="600" letter-spacing="0.08em">BENCHMARK SCORE</text>
450
+ </svg>
451
+ </div>
452
+
453
+ <!-- Mini Gauges -->
454
+ <div class="mini-gauges">
455
+ <div class="mini-gauge">
456
+ <div class="mini-gauge-label">Precision</div>
457
+ <div class="mini-gauge-value" id="mg-precision">β€”</div>
458
+ <div class="mini-gauge-bar"><div class="mini-gauge-fill" id="mg-precision-bar" style="width:0;background:var(--accent-blue)"></div></div>
459
+ </div>
460
+ <div class="mini-gauge">
461
+ <div class="mini-gauge-label">Recall</div>
462
+ <div class="mini-gauge-value" id="mg-recall">β€”</div>
463
+ <div class="mini-gauge-bar"><div class="mini-gauge-fill" id="mg-recall-bar" style="width:0;background:var(--accent-green)"></div></div>
464
+ </div>
465
+ <div class="mini-gauge">
466
+ <div class="mini-gauge-label">Workflow</div>
467
+ <div class="mini-gauge-value" id="mg-workflow">β€”</div>
468
+ <div class="mini-gauge-bar"><div class="mini-gauge-fill" id="mg-workflow-bar" style="width:0;background:#8b5cf6"></div></div>
469
+ </div>
470
+ <div class="mini-gauge">
471
+ <div class="mini-gauge-label">Efficiency</div>
472
+ <div class="mini-gauge-value" id="mg-efficiency">β€”</div>
473
+ <div class="mini-gauge-bar"><div class="mini-gauge-fill" id="mg-efficiency-bar" style="width:0;background:var(--warning)"></div></div>
474
+ </div>
475
+ </div>
476
+
477
+ <!-- LLM Capability Gap Chart -->
478
+ <div class="comparison-card">
479
+ <div class="comparison-title">⚑ LLM Capability Gap (Average Score)</div>
480
+ <div class="bar-row">
481
+ <div class="bar-label">Naive</div>
482
+ <div class="bar-track"><div class="bar-fill naive" id="bar-naive"></div></div>
483
+ <div class="bar-val score-low" id="bar-naive-val">0.10</div>
484
+ </div>
485
+ <div class="bar-row">
486
+ <div class="bar-label">Heuristic</div>
487
+ <div class="bar-track"><div class="bar-fill heuristic" id="bar-heuristic"></div></div>
488
+ <div class="bar-val score-mid" id="bar-heuristic-val">0.60</div>
489
+ </div>
490
+ <div class="bar-row">
491
+ <div class="bar-label">Reasoning</div>
492
+ <div class="bar-track"><div class="bar-fill full" id="bar-full"></div></div>
493
+ <div class="bar-val score-high" id="bar-full-val">0.98</div>
494
+ </div>
495
+ </div>
496
+
497
+ <!-- Detailed Results Table -->
498
+ <div class="comparison-card">
499
+ <div class="comparison-title">πŸ“Š Per-Task Breakdown</div>
500
+ <table class="task-results-table" id="results-table">
501
+ <thead>
502
+ <tr><th>Agent</th><th>Easy</th><th>Med</th><th>Hard</th><th>Avg</th></tr>
503
+ </thead>
504
+ <tbody>
505
+ <tr>
506
+ <td>Naive</td>
507
+ <td class="score-low">0.19</td><td class="score-low">0.06</td>
508
+ <td class="score-low">0.06</td><td class="score-low">0.10</td>
509
+ </tr>
510
+ <tr>
511
+ <td>Heuristic</td>
512
+ <td class="score-mid">0.81</td><td class="score-mid">0.56</td>
513
+ <td class="score-mid">0.45</td><td class="score-mid">0.60</td>
514
+ </tr>
515
+ <tr>
516
+ <td>Reasoning</td>
517
+ <td class="score-high">0.97</td><td class="score-high">0.97</td>
518
+ <td class="score-high">0.98</td><td class="score-high">0.98</td>
519
+ </tr>
520
+ </tbody>
521
+ </table>
522
+ <div class="insight-box">
523
+ <strong>Key finding:</strong> The 88-point gap between naive LLM (0.10) and tool-augmented reasoning agent (0.98) demonstrates that structured protocol comprehension and staged investigation are <strong>necessary</strong> for clinical audit tasks β€” raw language modeling is insufficient.
524
+ </div>
525
+ </div>
526
+ </div>
527
+ </div>
528
+
529
+ </main>
530
+
531
+ <!-- ═══ STATUS BAR ═══ -->
532
+ <div class="status-bar">
533
+ <div>
534
 + <span class="status-dot online" id="status-dot"></span>
 + <span id="status-text">Environment ready</span>
 + </div>
 + <div>OpenEnv Spec v3 · Phase III Oncology · Procedural Generation</div>
 + <div id="status-time"></div>
 + </div>
 +
 + <script>
 + // ═══════════════════════════════════════════════════════════════
 + // ClinicalBench Dashboard — Vanilla JS
 + // ═══════════════════════════════════════════════════════════════
 +
 + const BASE = window.location.origin;
 + const AGENTS = {naive:'Naive LLM',heuristic:'Heuristic',full:'Reasoning Agent'};
 + const TASKS = {
 + task_easy:{name:'Dynamic Eligibility Screening',difficulty:'easy'},
 + task_medium:{name:'Protocol Timeline Audit',difficulty:'medium'},
 + task_hard:{name:'Equity + Protocol Audit',difficulty:'hard'}
 + };
 + const SEED = 20260402;
 + let running = false;
 + let allResults = {};
 +
 + // ─── Utilities ───
 + function $(id){return document.getElementById(id)}
 + function qs(sel){return document.querySelector(sel)}
 +
 + function highlightProtocol(text){
 + return text
 + .replace(/age (\d+-\d+) inclusive/g,'age <span class="hl-rule">$1</span> inclusive')
 + .replace(/within (\d+) days/g,'within <span class="hl-rule">$1 days</span>')
 + .replace(/(Stage IV exception)/g,'<span class="hl-rule">$1</span>')
 + .replace(/(death_date must never precede treatment_start)/g,'<span class="hl-danger">$1</span>')
 + .replace(/dominance exceeds (\d+)%/g,'dominance exceeds <span class="hl-rule">$1%</span>')
 + .replace(/male share exceeds (\d+)%/g,'male share exceeds <span class="hl-rule">$1%</span>')
 + .replace(/gap exceeds (\d+) percentage/g,'gap exceeds <span class="hl-rule">$1</span> percentage')
 + .replace(/(Missing age is a protocol violation)/g,'<span class="hl-danger">$1</span>');
 + }
 +
 + function updateGauge(score){
 + const maxDash = 251.3;
 + const offset = maxDash - (maxDash * Math.min(1, Math.max(0, score)));
 + $('gauge-fill').style.strokeDashoffset = offset;
 + $('gauge-text').textContent = score.toFixed(2);
 + }
 +
 + function updateMiniGauge(id, value){
 + const el = $(id);
 + const bar = $(id + '-bar');
 + if(el) el.textContent = (typeof value==='number') ? value.toFixed(3) : value;
 + if(bar) bar.style.width = ((typeof value==='number' ? value : 0) * 100) + '%';
 + }
 +
 + function setStatus(text, online=true){
 + $('status-text').textContent = text;
 + $('status-dot').className = 'status-dot ' + (online?'online':'offline');
 + }
 +
 + function addLog(type, tag, text, score){
 + const feed = $('feed');
 + if(feed.querySelector('.feed-empty')) feed.innerHTML = '';
 + const card = document.createElement('div');
 + card.className = 'log-card type-' + type;
 + let html = '<span class="log-tag">[' + tag + ']</span>';
 + if(score !== undefined) html += '<span class="log-score">' + score.toFixed(2) + '</span>';
 + html += text;
 + card.innerHTML = html;
 + feed.appendChild(card);
 + feed.scrollTop = feed.scrollHeight;
 + }
 +
 + function addDivider(text){
 + const feed = $('feed');
 + const div = document.createElement('div');
 + div.className = 'agent-divider';
 + div.textContent = text;
 + feed.appendChild(div);
 + feed.scrollTop = feed.scrollHeight;
 + }
 +
 + function updateProtocol(obs){
 + $('proto-id').textContent = obs.protocol_title || '—';
 + $('proto-excerpt').innerHTML = highlightProtocol(obs.trial_protocol_excerpt || '');
 + $('meta-difficulty').textContent = obs.task_type || '—';
 + $('meta-patients').textContent = (obs.dataset||[]).length || '—';
 + $('meta-steps').textContent = obs.attempts_remaining || '—';
 + }
 +
 + function updateMetrics(bd){
 + if(!bd) return;
 + updateMiniGauge('mg-precision', bd.precision);
 + updateMiniGauge('mg-recall', bd.recall);
 + updateMiniGauge('mg-workflow', bd.workflow);
 + updateMiniGauge('mg-efficiency', bd.efficiency);
 + }
 +
 + function updateBars(results){
 + const agents = ['naive','heuristic','full'];
 + agents.forEach(a=>{
 + if(results[a]){
 + const avg = results[a].avg || 0;
 + const bar = $('bar-'+a);
 + const val = $('bar-'+a+'-val');
 + if(bar) bar.style.width = (avg*100)+'%';
 + if(val) val.textContent = avg.toFixed(2);
 + }
 + });
 + }
 +
 + function sleep(ms){return new Promise(r=>setTimeout(r,ms))}
 +
 + // ─── Main Audit Runner ───
 + async function runSingleEpisode(agentMode, taskId){
 + // Reset
 + const resetPayload = {task_id:taskId, seed:SEED};
 + const resetRes = await fetch(BASE+'/api/audit/reset', {
 + method:'POST', headers:{'Content-Type':'application/json'},
 + body:JSON.stringify(resetPayload)
 + });
 + const resetData = await resetRes.json();
 + const obs = resetData.observation || resetData;
 +
 + updateProtocol(obs);
 + $('meta-errors').textContent = resetData.total_errors || '?';
 + $('stat-seed').textContent = SEED;
 +
 + addLog('info','RESET', `Episode started: ${obs.protocol_title} | ${(obs.dataset||[]).length} patients | ${obs.attempts_remaining} steps`);
 +
 + // Get agent plan
 + const planRes = await fetch(BASE+'/api/audit/plan', {
 + method:'POST', headers:{'Content-Type':'application/json'},
 + body:JSON.stringify({agent:agentMode, task_id:taskId, seed:SEED})
 + });
 + const planData = await planRes.json();
 + const actions = planData.actions || [];
 + const traces = planData.traces || [];
 +
 + // Display traces and execute actions
 + let lastScore = 0;
 + let lastBreakdown = {};
 +
 + for(let i=0; i<actions.length; i++){
 + if(!running) break;
 + const action = actions[i];
 + const trace = traces[i] || {};
 +
 + // Show thought
 + if(trace.thought){
 + addLog('thought','THINK', trace.thought);
 + await sleep(60);
 + }
 +
 + // Show tool usage
 + if(trace.tool){
 + addLog('tool','TOOL', trace.tool);
 + await sleep(40);
 + }
 +
 + // Execute step
 + const stepRes = await fetch(BASE+'/api/audit/step', {
 + method:'POST', headers:{'Content-Type':'application/json'},
 + body:JSON.stringify(action)
 + });
 + const stepData = await stepRes.json();
 + const sObs = stepData.observation || stepData;
 +
 + lastScore = sObs.score_so_far || 0;
 + lastBreakdown = sObs.score_breakdown || {};
 +
 + // Determine log type
 + const fb = sObs.feedback || '';
 + let logType = 'observe';
 + let logTag = 'OBSERVE';
 +
 + if(action.action_type === 'flag_error'){
 + logType = fb.includes('✓') ? 'flag-ok' : 'flag-bad';
 + logTag = fb.includes('✓') ? 'FLAG ✓' : 'FLAG ✗';
 + } else if(action.action_type === 'submit_report'){
 + logType = 'report';
 + logTag = 'REPORT';
 + } else if(action.action_type === 'investigate_pattern'){
 + logTag = 'INVESTIGATE';
 + } else if(action.action_type === 'compute_distribution'){
 + logTag = 'COMPUTE';
 + }
 +
 + addLog(logType, logTag, fb.substring(0,120), lastScore);
 + updateGauge(lastScore);
 + updateMetrics(lastBreakdown);
 + await sleep(30);
 +
 + if(sObs.done) break;
 + }
 +
 + return {score:lastScore, breakdown:lastBreakdown};
 + }
 +
 + async function startAudit(){
 + if(running) return;
 + running = true;
 + const btn = $('btn-start');
 + btn.disabled = true;
 + btn.classList.add('running');
 + btn.textContent = '● Running...';
 + $('feed').innerHTML = '';
 + allResults = {};
 + setStatus('Audit in progress...', true);
 +
 + const selAgent = $('sel-agent').value;
 + const selTask = $('sel-task').value;
 +
 + const agentList = selAgent === 'all' ? ['naive','heuristic','full'] : [selAgent];
 + const taskList = selTask === 'all' ? ['task_easy','task_medium','task_hard'] : [selTask];
 +
 + try{
 + for(const agent of agentList){
 + addDivider(AGENTS[agent] || agent.toUpperCase());
 + allResults[agent] = {scores:{}, avg:0};
 +
 + for(const task of taskList){
 + const taskName = TASKS[task]?.name || task;
 + addLog('phase','TASK', `${taskName} (${TASKS[task]?.difficulty || ''})`);
 + await sleep(100);
 +
 + const result = await runSingleEpisode(agent, task);
 + allResults[agent].scores[task] = result.score;
 + addLog('info','SCORE', `Final: ${result.score.toFixed(2)}`);
 + }
 +
 + const scores = Object.values(allResults[agent].scores);
 + allResults[agent].avg = scores.reduce((a,b)=>a+b,0)/scores.length;
 + }
 +
 + updateBars(allResults);
 +
 + // Update results table if full run
 + if(selAgent==='all' && selTask==='all'){
 + const tbody = $('results-table').querySelector('tbody');
 + tbody.innerHTML = '';
 + for(const agent of agentList){
 + const r = allResults[agent];
 + const tr = document.createElement('tr');
 + const scoreClass = r.avg >= 0.8 ? 'score-high' : r.avg >= 0.4 ? 'score-mid' : 'score-low';
 + tr.innerHTML = `<td>${AGENTS[agent]}</td>` +
 + ['task_easy','task_medium','task_hard'].map(t=>`<td class="${scoreClass}">${(r.scores[t]||0).toFixed(2)}</td>`).join('') +
 + `<td class="${scoreClass}">${r.avg.toFixed(2)}</td>`;
 + tbody.appendChild(tr);
 + }
 + }
 +
 + addDivider('AUDIT COMPLETE');
 + setStatus('Audit complete', true);
 +
 + } catch(err){
 + addLog('flag-bad','ERROR', err.message || 'Audit failed');
 + setStatus('Error: ' + (err.message||'unknown'), false);
 + }
 +
 + running = false;
 + btn.disabled = false;
 + btn.classList.remove('running');
 + btn.textContent = '▶ Start Audit';
 + }
 +
 + // ─── Clock ───
 + function updateClock(){
 + $('status-time').textContent = new Date().toLocaleTimeString('en-US',{hour12:false});
 + }
 + setInterval(updateClock, 1000);
 + updateClock();
 +
 + // ─── Health check on load ───
 + (async function(){
 + try{
 + const r = await fetch(BASE+'/health');
 + if(r.ok) setStatus('Environment ready', true);
 + else setStatus('Environment unavailable', false);
 + }catch(e){
 + setStatus('Connecting...', false);
 + }
 + })();
 + </script>
 +
 + </body>
 + </html>