Sumit Saraswat committed on
Commit a7bca03 · 1 Parent(s): 0dca8bf

Restructured Dockerfile and requirements to root for Hugging Face deployment

.DS_Store CHANGED
Binary files a/.DS_Store and b/.DS_Store differ
 
server/Dockerfile → Dockerfile RENAMED
@@ -17,4 +17,4 @@ EXPOSE 8000
  HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
+ CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
README.md CHANGED
@@ -1,88 +1,67 @@
  # Clinical Trial Auditor (OpenEnv)

- Production-grade OpenEnv benchmark for clinical trial data quality and bias auditing.
-
- The agent plays the role of a Senior Clinical Data Manager and must detect:
- - syntactic data quality errors (invalid/missing age),
- - temporal inconsistencies (death before treatment),
- - multi-dimensional selection bias in control cohorts.
-
- This environment is designed as a real benchmark system, not a static puzzle:
- - procedural generation on every `reset()`,
- - deterministic seed reproducibility,
- - adversarial traps that punish shallow heuristics,
- - deterministic programmatic graders with scores in `[0.0, 1.0]`.
-
- ---
-
- ## Why This Matters (Real-World Utility)
-
- Clinical trial audits are high-stakes workflows. Data defects and subgroup bias can:
- - invalidate endpoints,
- - distort treatment effect estimates,
- - create regulatory and patient-safety risk.
-
- This environment models a realistic multi-site Phase III oncology pipeline where agents must balance recall and precision under strict action budgets, with penalties for over-flagging.
-
- ---

  ## OpenEnv Compliance

  This project implements the required OpenEnv interface:
- - typed `Action`, `Observation`, `State` models (Pydantic),
  - `reset(seed, task_id, ...) -> Observation`,
  - `step(action) -> Observation`,
  - `state -> current state`,
- - `openenv.yaml` manifest at repo root.

  Validation:
  ```bash
  openenv validate .
  ```

- ---
-
- ## Environment Architecture
-
- `ClinicalTrialAuditorEnvironment` is intentionally layered:
-
- 1. **Data Engine** (`server/dataset_generator.py`)
-    - Procedural patient generation using statistical distributions.
-    - Difficulty-specific dataset size and error composition.
- 2. **Trap Engine**
-    - Boundary-valid traps (`18`, `120`, etc.),
-    - near-temporal valid traps (death 1-3 days after treatment),
-    - fake bias distractors.
- 3. **Scoring Engine**
-    - Deterministic ground-truth lookup for each flag.
-    - Partial progress rewards + false-positive penalties.
-    - Confidence calibration (overconfident wrong answers are punished harder).
- 4. **Agent Interface**
-    - Standard OpenEnv `step/reset/state`.
-
- ---
-
- ## Task Suite (Easy -> Medium -> Hard)
-
- ### Task 1: `task_easy` (Syntactic Cleaning)
- - Typical size: `300` patients
- - Objective: detect all `invalid_age` cases only
- - Includes valid edge-case age traps to punish naive thresholding
- - Bias grading disabled
-
- ### Task 2: `task_medium` (Temporal Consistency)
- - Typical size: `500` patients
- - Objective: detect both `invalid_age` and `temporal_inconsistency`
- - Includes near-boundary and near-temporal traps
- - Bias grading disabled
-
- ### Task 3: `task_hard` (Comprehensive Audit)
- - Typical size: `800` patients
- - Objective: detect `invalid_age` + `temporal_inconsistency` + `selection_bias`
- - Bias injected with representation + outcome + gender skew signals
- - Includes fake patterns to avoid shortcut behavior
-
- ---

  ## Action Space

@@ -91,7 +70,7 @@ class AuditAction(Action):
  action_type: str  # investigate_pattern | compute_distribution | flag_error | propose_fix | submit_report
  variable: Optional[str]
  patient_id: Optional[str]
- error_type: Optional[str]  # invalid_age | temporal_inconsistency | selection_bias
  reason: Optional[str]
  proposed_value: Optional[str]
  report: Optional[str]
@@ -107,130 +86,151 @@ class AuditObservation(Observation):
  task_id: str
  task_type: str
  task_description: str
  dataset: list[dict]
  errors_found: list[str]
  patterns_investigated: list[str]
  distributions_computed: list[str]
  feedback: str
  score_so_far: float
  attempts_remaining: int
  phase: str
  ```

- ---
-
- ## Reward Design (Meaningful Shaping)
-
- Reward is dense and trajectory-aware (not sparse binary).
-
- - correct flag: `+0.10`
- - false positive: `-0.30` (3x stronger than correct flag)
- - duplicate flag: `-0.10`
- - investigation/distribution bonuses and redundancy penalties
- - per-step cost to discourage long loops
- - workflow and efficiency bonuses
- - hard-task bias detection bonus: `+0.20`
- - difficulty multipliers by task
- - score clamped to `[0.0, 1.0]`
-
- This reward design explicitly creates precision pressure and separates robust agents from brute-force flaggers.
-
- ---
-
- ## Procedural Generation + Reproducibility
-
- Generator script:
  ```bash
- cd server
- python3 dataset_generator.py
  ```

  What it guarantees:
- - same seed -> identical dataset + identical ground truth,
- - different seeds -> different datasets,
- - controlled error injection rates,
- - deterministic grader compatibility.
-
- Example validated generation profile (seeded):
- - Easy: `300` patients, `12` injected errors, traps enabled
- - Medium: `500` patients, `37` injected errors, traps enabled
- - Hard: `800` patients, `49` injected errors + bias signal, traps enabled
-
- ---

  ## Baseline Inference (`inference.py`)

- `inference.py` supports multiple agent modes:
- - `naive`: raw LLM behavior,
- - `heuristic`: simple rules (no LLM),
- - `full`: statistical detector + planning + LLM report,
- - `all`: run all modes side-by-side.
-
- Run:
  ```bash
  python3 inference.py --mode all
  ```

- Reproducibility env vars:
- - `API_BASE_URL`
- - `MODEL_NAME`
- - `HF_TOKEN` or `OPENAI_API_KEY`
- - `ENV_BASE_URL` (defaults to `http://localhost:8000`)
-
- Current measured results (seeded local run):
- - **Heuristic** average: `0.98`
- - **Full** average: `1.00`
-
- Note: for judge-facing benchmarking, include a `--mode all` table from the same seed and model in this README before final submission.
-
- ---
-
- ## Local Run
-
- ### 1) Start server
  ```bash
  cd server
  PYTHONPATH=.. python3 -m uvicorn app:app --host 0.0.0.0 --port 8000
  ```

  ### 2) Health check
  ```bash
  curl -s http://localhost:8000/health
  ```

- ### 3) Run baseline
  ```bash
  cd ..
- python3 inference.py --mode full
  ```

- ---
-
  ## Docker

  Build and run:
  ```bash
  cd server
  docker build -t clinical-trial-auditor:latest .
  docker run -p 8000:8000 clinical-trial-auditor:latest
  ```

- Container includes healthcheck at `/health`.
-
- ---

  ## Hugging Face Space Readiness Checklist

- - [x] OpenEnv interface implemented (`step/reset/state`)
- - [x] typed models for actions/observations/state
  - [x] `openenv.yaml` present
- - [x] 3 tasks with deterministic graders and score range `[0.0, 1.0]`
- - [x] meaningful reward shaping across trajectory
- - [x] baseline script at project root: `inference.py`
- - [x] dockerized server (`server/Dockerfile`)
- - [x] `openenv validate .` passes locally
-
- ---

  ## Project Structure

@@ -250,8 +250,6 @@ clinical_trial_auditor/
  └── Dockerfile
  ```

- ---
-
  ## Motivation

- This benchmark is intended to evaluate whether an AI agent can do rigorous, workflow-constrained, clinically relevant data auditing under adversarial conditions, not just solve a fixed toy dataset.

  # Clinical Trial Auditor (OpenEnv)

+ Clinical Trial Auditor is a protocol-aware OpenEnv benchmark for clinical data auditing. The agent acts as a Senior Clinical Data Manager reviewing procedurally generated Phase III oncology trial data under dynamic per-episode rules.
+
+ This is not a static spreadsheet puzzle. Every `reset()` samples a new protocol excerpt and a new dataset, so the agent must read the rules for that episode and then audit the records accordingly.
+
+ ## Why This Matters
+
+ Real clinical audits are messy:
+ - eligibility criteria vary by protocol,
+ - timeline rules include exceptions,
+ - suspicious subgroup outcomes are not always evidence of bias,
+ - false positives waste reviewer time and can trigger unnecessary escalations.
+
+ This environment is built to evaluate exactly those failure modes. It targets the gap between "can parse a table" and "can follow a high-stakes auditing workflow with protocol friction and adversarial traps."
+
+ ## What Makes This Benchmark Different
+
+ - Dynamic protocol reasoning: each episode exposes a new `trial_protocol_excerpt` with episode-specific age ranges and treatment-start windows.
+ - Cross-modal audit logic: the agent must apply text rules from the protocol to tabular patient data.
+ - Stage-aware timing exceptions: Stage IV patients can have a longer enrollment-to-treatment window, which creates valid edge cases that trap shortcut heuristics.
+ - Hallucination traps: hard episodes can contain a confounded high-risk cohort that looks biased overall but is not actionable after stage-adjusted review.
+ - Dense reward plus benchmark rubric: step rewards encourage learning, while `score_so_far` tracks a judge-facing episode rubric emphasizing recall, precision, workflow discipline, efficiency, and report quality.

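The stage-aware timing exception can be sketched as a small validity check. This is a minimal illustration of the idea, not the environment's actual grader; the function name and the `base_window_days`/`stage_iv_window_days` defaults are hypothetical stand-ins for the episode-specific protocol values:

```python
def within_protocol_window(stage: str, days_to_treatment: int,
                           base_window_days: int = 14,
                           stage_iv_window_days: int = 28) -> bool:
    """Return True if the enrollment-to-treatment delay is protocol-valid.

    Stage IV patients are allowed a longer window, so a delay that is a
    violation for a Stage II patient can be a valid edge case for Stage IV.
    """
    limit = stage_iv_window_days if stage == "IV" else base_window_days
    return 0 <= days_to_treatment <= limit


# A naive fixed-threshold heuristic flags any delay over the base window;
# the protocol-aware check accepts the Stage IV case.
print(within_protocol_window("IV", 21))  # True: inside the Stage IV window
print(within_protocol_window("II", 21))  # False: outside the base window
```

This is exactly the kind of rule a shortcut heuristic misses: the same 21-day delay is valid or invalid depending on a field elsewhere in the record.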
  ## OpenEnv Compliance

  This project implements the required OpenEnv interface:
+ - typed `Action`, `Observation`, and `State` models with Pydantic,
  - `reset(seed, task_id, ...) -> Observation`,
  - `step(action) -> Observation`,
  - `state -> current state`,
+ - `openenv.yaml` at the repo root.

  Validation:
+
  ```bash
  openenv validate .
  ```

+ Local validation result:
+
+ ```text
+ [OK] : Ready for multi-mode deployment
+ ```

+ ## Task Suite

+ ### Task 1: `task_easy` (Dynamic Eligibility Screening)
+ - Dataset size: about `300` patients
+ - Goal: flag `invalid_age`
+ - Difficulty source: the age bounds are episode-specific, not fixed at 18-120
+ - Traps: valid edge ages at the protocol boundary

+ ### Task 2: `task_medium` (Protocol Timeline Audit)
+ - Dataset size: about `480` patients
+ - Goal: flag `invalid_age`, `temporal_inconsistency`, and `protocol_window_violation`
+ - Difficulty source: the treatment-start window is protocol-specific and Stage IV has a longer valid window
+ - Traps: valid near-boundary start delays and near-immediate but valid deaths

+ ### Task 3: `task_hard` (Equity + Protocol Audit)
+ - Dataset size: about `720` patients
+ - Goal: flag record-level issues and determine whether actionable `selection_bias` exists
+ - Difficulty source: some hard episodes contain real control-arm bias, while others contain a confounded high-risk cohort that only looks biased before stage adjustment
+ - Traps: treatment-arm skew, high-risk outreach sites, and false-positive bias patterns

  ## Action Space

  action_type: str  # investigate_pattern | compute_distribution | flag_error | propose_fix | submit_report
  variable: Optional[str]
  patient_id: Optional[str]
+ error_type: Optional[str]  # invalid_age | temporal_inconsistency | protocol_window_violation | selection_bias
  reason: Optional[str]
  proposed_value: Optional[str]
  report: Optional[str]

  task_id: str
  task_type: str
  task_description: str
+ protocol_title: str
+ trial_protocol_excerpt: str
  dataset: list[dict]
  errors_found: list[str]
  patterns_investigated: list[str]
  distributions_computed: list[str]
  feedback: str
  score_so_far: float
+ dense_reward_total: float
+ score_breakdown: dict[str, float]
  attempts_remaining: int
  phase: str
  ```
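For readers skimming the diff, the action shape above can be mocked up as a plain dataclass. The real models in `models.py` are Pydantic classes; this standalone sketch only mirrors the fields shown in this README and the example values are made up:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditAction:
    """Sketch of the action fields shown above (not the real Pydantic model)."""
    action_type: str                  # investigate_pattern | compute_distribution | flag_error | propose_fix | submit_report
    variable: Optional[str] = None
    patient_id: Optional[str] = None
    error_type: Optional[str] = None  # invalid_age | temporal_inconsistency | protocol_window_violation | selection_bias
    reason: Optional[str] = None
    proposed_value: Optional[str] = None
    report: Optional[str] = None


# Flag a timeline violation for one (hypothetical) patient.
action = AuditAction(
    action_type="flag_error",
    patient_id="P-0042",
    error_type="protocol_window_violation",
    reason="Treatment started 45 days after enrollment; protocol allows 14.",
)
print(action.action_type)  # flag_error
```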

+ ## Reward Design and Benchmark Score

+ The environment uses two scoring layers:

+ - Dense step reward:
+   - correct flags,
+   - false-positive penalties,
+   - duplicate penalties,
+   - investigation/distribution bonuses,
+   - confidence penalties for overconfident wrong flags,
+   - per-step costs.
+ - Episode benchmark score (`score_so_far`):
+   - recall: `70%`
+   - precision: `15%`
+   - workflow discipline: `5%`
+   - efficiency: `5%`
+   - report quality: `5%`

+ This separation keeps the RL signal dense while preventing early score saturation from hiding later mistakes.

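With those weights, the episode rubric reduces to a weighted sum. A minimal sketch, using the weights listed above; the component values in the example are invented, and the actual grader may normalize differently:

```python
# Rubric weights from the README: recall 70%, precision 15%,
# workflow discipline 5%, efficiency 5%, report quality 5%.
WEIGHTS = {
    "recall": 0.70,
    "precision": 0.15,
    "workflow": 0.05,
    "efficiency": 0.05,
    "report_quality": 0.05,
}


def episode_score(components: dict[str, float]) -> float:
    """Weighted rubric score, clamped to [0.0, 1.0]."""
    raw = sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS)
    return max(0.0, min(1.0, round(raw, 4)))


# Perfect recall with sloppy precision still scores high, which is
# why the rubric is recall-dominated.
print(episode_score({"recall": 1.0, "precision": 0.4, "workflow": 1.0,
                     "efficiency": 1.0, "report_quality": 1.0}))  # 0.91
```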
+ ## Procedural Generation and Reproducibility

+ Run the generator self-test:

  ```bash
+ python3 server/dataset_generator.py
  ```

  What it guarantees:
+ - same seed -> same dataset, same protocol excerpt, same ground truth,
+ - different seeds -> different protocols and different datasets,
+ - deterministic grading compatibility,
+ - hard mode can alternate between `true_bias` and `confounded_no_bias`.

+ Example validated seeded profile:
+ - Easy: `300` patients, `8` record-level errors, `13` traps
+ - Medium: `480` patients, `23` record-level errors, `25` traps
+ - Hard: `720` patients, `34` total issues including protocol/timing/bias logic, `40` traps

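The same-seed guarantee is the standard seeded-PRNG pattern: all randomness flows from one generator constructed from the seed. A toy illustration of the property (this is not the repo's generator, just the mechanism it relies on):

```python
import random


def toy_dataset(seed: int, n: int = 5) -> list[dict]:
    """Tiny stand-in for a procedural generator: the output is a pure
    function of the seed because every draw comes from one seeded PRNG."""
    rng = random.Random(seed)
    return [{"patient_id": f"P-{i:04d}", "age": rng.randint(18, 90)}
            for i in range(n)]


# Same seed -> identical dataset; different seeds -> different datasets.
assert toy_dataset(42) == toy_dataset(42)
assert toy_dataset(42) != toy_dataset(43)
print("determinism holds")
```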
  ## Baseline Inference (`inference.py`)

+ `inference.py` now demonstrates a clean difficulty gradient:
+
+ - `naive`: raw sample-level behavior
+ - `heuristic`: rule-based but trap-prone
+ - `full`: protocol parser + stage-aware detectors + structured reporting
+ - `all`: side-by-side comparison
+
+ HTTP mode:

  ```bash
  python3 inference.py --mode all
  ```

+ Isolated local validation mode with no socket bind:
+
+ ```bash
+ ENV_BASE_URL=inprocess python3 inference.py --mode all
+ ```

+ LLM integration:
+ - When `OPENAI_API_KEY` or `HF_TOKEN` is present, naive mode and report generation use the OpenAI-compatible client pointed at `API_BASE_URL`.
+ - Without a key, the script falls back to deterministic local behavior so validation still runs end-to-end.

+ Current reproducible local benchmark result:
+
+ ```bash
+ ENV_BASE_URL=inprocess python3 inference.py --mode all --seed 20260402
+ ```
+
+ Scores:
+
+ | Agent | Easy | Medium | Hard | Average |
+ |---|---:|---:|---:|---:|
+ | Naive | 0.36 | 0.08 | 0.09 | 0.18 |
+ | Heuristic | 0.81 | 0.56 | 0.45 | 0.60 |
+ | Full | 0.98 | 0.99 | 0.99 | 0.99 |
+
+ This is the intended story:
+ - naive agents underperform badly,
+ - shallow heuristics get trapped by dynamic protocol edges and confounded bias signals,
+ - protocol-aware agents perform strongly.
+
+ ## Local Usage
+
+ ### 1) Start the server

  ```bash
  cd server
  PYTHONPATH=.. python3 -m uvicorn app:app --host 0.0.0.0 --port 8000
  ```

  ### 2) Health check
+
  ```bash
  curl -s http://localhost:8000/health
  ```

+ ### 3) Run the baseline
+
  ```bash
  cd ..
+ python3 inference.py --mode all
  ```

  ## Docker

  Build and run:
+
  ```bash
  cd server
  docker build -t clinical-trial-auditor:latest .
  docker run -p 8000:8000 clinical-trial-auditor:latest
  ```

+ The container exposes `/health` for health checks and is ready for Hugging Face Spaces container deployment.

  ## Hugging Face Space Readiness Checklist

+ - [x] OpenEnv interface implemented
+ - [x] typed models for action/observation/state
  - [x] `openenv.yaml` present
+ - [x] 3 tasks with deterministic graders and scores in `[0.0, 1.0]`
+ - [x] dense reward shaping and benchmark rubric
+ - [x] reproducible `inference.py` at repo root
+ - [x] dockerized server
+ - [x] `openenv validate .` passes

  ## Project Structure

  └── Dockerfile
  ```

  ## Motivation

+ This benchmark is built to test whether an agent can read a changing clinical protocol, audit patient records against that protocol, avoid hallucinated escalations, and write a grounded operational report under a limited action budget.
inference.py CHANGED
@@ -1,458 +1,429 @@
1
  """
2
- Clinical Trial Auditor — Multi-Agent Baseline Inference
3
- =========================================================
4
- Three agent modes to demonstrate environment difficulty gradient:
5
-
6
- 1. NAIVE Raw LLM prompt, no statistical tools expected ~0.25-0.40
7
- 2. HEURISTIC — Simple rule-based agent expected ~0.45-0.60
8
- 3. FULL Statistical Detection Engine + LLM Reasoning → expected ~0.85-0.95
9
-
10
- Usage:
11
- python inference.py # Full agent (default)
12
- python inference.py --mode naive # Naive LLM-only agent
13
- python inference.py --mode heuristic # Simple heuristic agent
14
- python inference.py --mode full # Full agentic pipeline
15
- python inference.py --mode all # Run all three, side-by-side
16
-
17
- Pipeline (full mode):
18
- 1. PROFILE → Schema-aware statistical analysis of dataset
19
- 2. DETECT → Multi-detector anomaly pipeline with confidence scoring
20
- 3. ASSESS → Risk severity + clinical impact evaluation
21
- 4. PLAN → Task-adaptive optimal action sequence
22
- 5. REASON → LLM for ambiguous cases + expert report generation
23
- 6. EXECUTE → Deterministic environment interaction
24
- 7. EVALUATE → Precision/recall/F1 metrics tracking
25
  """
 
 
 
 
 
26
  import os
 
 
27
  import sys
28
  import time
29
- import json
30
- import math
31
- import argparse
32
- import statistics
33
- from datetime import datetime
34
  from collections import Counter
 
 
 
35
  from typing import Optional
36
 
 
 
37
  sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
38
 
39
- from openai import OpenAI
40
  from client import ClinicalTrialAuditorEnv
41
  from models import AuditAction
42
 
43
- # ── Configuration ─────────────────────────────────────────────────────────
 
 
 
 
 
44
  API_BASE_URL = os.getenv("API_BASE_URL", "https://api.groq.com/openai/v1")
45
  API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or os.getenv("API_KEY")
46
  MODEL_NAME = os.getenv("MODEL_NAME", "llama-3.3-70b-versatile")
47
  ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:8000")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
48
 
49
- # Reproducible seed for baseline evaluation
50
- BASELINE_SEED = 20240401
 
 
 
 
 
 
 
 
 
51
 
 
 
52
 
53
- # ═══════════════════════════════════════════════════════════════════════════
54
- # CORE DATA STRUCTURES
55
- # ═══════════════════════════════════════════════════════════════════════════
56
 
 
57
  class Finding:
58
- """A detected anomaly with confidence, risk severity, and explanation."""
59
- def __init__(self, patient_id: str, error_type: str, reason: str,
60
- confidence: float, risk: str = "medium",
61
- value=None, statistical_context: str = ""):
62
- self.patient_id = patient_id
63
- self.error_type = error_type
64
- self.reason = reason
65
- self.confidence = min(1.0, max(0.0, confidence))
66
- self.risk = risk
67
- self.value = value
68
- self.statistical_context = statistical_context
69
 
70
  @property
71
  def priority_score(self) -> float:
72
- risk_weights = {"critical": 1.0, "high": 0.8, "medium": 0.5, "low": 0.2}
73
- return self.confidence * risk_weights.get(self.risk, 0.5)
74
-
75
- def explain(self) -> str:
76
- parts = [f"{self.error_type}: {self.reason}"]
77
- if self.statistical_context:
78
- parts.append(f" Evidence: {self.statistical_context}")
79
- parts.append(f" Confidence: {self.confidence:.0%} | Risk: {self.risk.upper()}")
80
- return "\n".join(parts)
81
-
82
-
83
- # ═══════════════════════════════════════════════════════════════════════════
84
- # MODULE 1: DATA PROFILER — Robust statistical summary
85
- # ═══════════════════════════════════════════════════════════════════════════
86
-
87
- class DataProfiler:
88
- """Schema-aware statistical profiler using robust estimators (IQR, MAD)."""
89
-
90
- def __init__(self, dataset: list[dict]):
91
- self.dataset = dataset
92
- self.n = len(dataset)
93
- self.columns = sorted({k for row in dataset for k in row.keys()})
94
- self.types = self._infer_types()
95
- self.profiles = {}
96
-
97
- def _infer_types(self) -> dict:
98
- types = {}
99
- for col in self.columns:
100
- vals = [r.get(col) for r in self.dataset if r.get(col) is not None]
101
- if not vals:
102
- types[col] = "unknown"
103
- elif all(isinstance(v, (int, float)) for v in vals):
104
- types[col] = "numeric"
105
- elif all(isinstance(v, str) and self._is_date(v) for v in vals[:5]):
106
- types[col] = "date"
107
- elif col.lower().endswith("_id") or col.lower() == "id":
108
- types[col] = "id"
109
- else:
110
- types[col] = "categorical"
111
- return types
112
 
 
 
113
  @staticmethod
114
- def _is_date(s: str) -> bool:
115
- for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y"):
116
- try:
117
- datetime.strptime(s, fmt)
118
- return True
119
- except ValueError:
120
- pass
121
- return False
 
 
 
 
 
 
122
 
123
- def profile_numeric(self, col: str) -> dict:
124
- values = [r[col] for r in self.dataset if r.get(col) is not None]
125
- null_count = sum(1 for r in self.dataset if r.get(col) is None)
126
- if not values:
127
- return {"null_count": null_count, "valid_count": 0}
128
-
129
- sorted_vals = sorted(values)
130
- n = len(sorted_vals)
131
- median = statistics.median(sorted_vals)
132
- mean = statistics.mean(sorted_vals)
133
- std = statistics.stdev(sorted_vals) if n > 1 else 0
134
-
135
- q1 = sorted_vals[n // 4] if n >= 4 else sorted_vals[0]
136
- q3 = sorted_vals[3 * n // 4] if n >= 4 else sorted_vals[-1]
137
- iqr = q3 - q1
138
-
139
- mad = statistics.median([abs(v - median) for v in sorted_vals])
140
- mad_scaled = mad * 1.4826
141
-
142
- return {
143
- "mean": round(mean, 2), "std": round(std, 2),
144
- "median": round(median, 2), "mad": round(mad_scaled, 2),
145
- "min": min(values), "max": max(values),
146
- "q1": q1, "q3": q3, "iqr": iqr,
147
- "null_count": null_count, "valid_count": n,
148
- "iqr_lower": q1 - 1.5 * iqr,
149
- "iqr_upper": q3 + 1.5 * iqr,
150
- }
151
-
152
- def profile_categorical(self, col: str) -> dict:
153
- vals = [str(r.get(col, "None")) for r in self.dataset]
154
- counter = Counter(vals)
155
- total = len(vals)
156
- return {
157
- "distribution": dict(counter),
158
- "unique_count": len(counter),
159
- "mode": counter.most_common(1)[0][0] if counter else None,
160
- "mode_ratio": counter.most_common(1)[0][1] / total if counter else 0,
161
- }
162
-
163
- def profile_all(self) -> dict:
164
- for col in self.columns:
165
- if self.types.get(col) == "numeric":
166
- self.profiles[col] = self.profile_numeric(col)
167
- elif self.types.get(col) == "categorical":
168
- self.profiles[col] = self.profile_categorical(col)
169
- return self.profiles
170
-
171
-
172
- # ═══════════════════════════════════════════════════════════════════════════
173
- # MODULE 2: ANOMALY DETECTORS — Confidence + Risk scoring
174
- # ═══════════════════════════════════════════════════════════════════════════
175
-
176
- class AgeAnomalyDetector:
177
- """
178
- Multi-layer age anomaly detection:
179
- - Layer 1: Clinical domain constraints (18-120 for trial eligibility)
180
- - Layer 2: Statistical outliers via IQR
181
- - Layer 3: Biological plausibility
182
- """
183
- CLINICAL_MIN, CLINICAL_MAX = 18, 120
184
-
185
- def detect(self, dataset: list[dict], profile: dict) -> list[Finding]:
186
- findings = []
187
- age_prof = profile.get("age", {})
188
- median = age_prof.get("median", 60)
189
- mad = age_prof.get("mad", 15)
190
 
191
- for row in dataset:
192
- pid = row.get("patient_id", "?")
193
- age = row.get("age")
194
 
195
- if age is None:
196
- findings.append(Finding(
197
- patient_id=pid, error_type="invalid_age",
198
- reason="Missing age required for trial eligibility",
199
- confidence=1.0, risk="high", value=None,
200
- statistical_context="Null value in mandatory field",
201
- ))
202
- continue
 
203
 
204
- is_domain_violation = age < self.CLINICAL_MIN or age > self.CLINICAL_MAX
205
-
206
- if is_domain_violation:
207
- deviation = abs(age - median) / mad if mad > 0 else 0
208
- is_biological_impossible = age < 0 or age > 122
209
- if is_biological_impossible:
210
- conf, risk = 1.0, "critical"
211
- context = f"Biologically impossible (age={age})"
212
- elif age > 200:
213
- conf, risk = 0.99, "critical"
214
- context = f"Likely sentinel/data entry error: {deviation:.1f} MAD from median"
215
- else:
216
- conf, risk = 0.95, "high"
217
- context = f"Outside range [{self.CLINICAL_MIN}-{self.CLINICAL_MAX}]"
218
-
219
- findings.append(Finding(
220
- patient_id=pid, error_type="invalid_age",
221
- reason=f"Age {age} violates clinical trial range [{self.CLINICAL_MIN}-{self.CLINICAL_MAX}]",
222
- confidence=conf, risk=risk, value=age,
223
- statistical_context=context,
224
- ))
225
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
226
  return findings
227
 
228
 
229
- class TemporalConsistencyDetector:
230
- """Detects death_date before treatment_start violations."""
231
-
232
- @staticmethod
233
- def _parse_date(val) -> Optional[datetime]:
234
- if not val or val in ("", "N/A", "None", "null"):
235
- return None
236
- for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d"):
237
- try:
238
- return datetime.strptime(str(val), fmt)
239
- except (ValueError, TypeError):
240
- pass
241
- return None
242
-
243
- def detect(self, dataset: list[dict], profile: dict) -> list[Finding]:
244
  findings = []
245
  for row in dataset:
246
- pid = row.get("patient_id", "?")
247
- early = self._parse_date(row.get("treatment_start"))
248
- late = self._parse_date(row.get("death_date"))
249
- if early and late and late < early:
250
- gap = (early - late).days
251
- risk = "critical" if gap > 180 else "high" if gap > 30 else "medium"
252
- conf = min(1.0, 0.90 + gap / 3650)
253
- findings.append(Finding(
254
- patient_id=pid, error_type="temporal_inconsistency",
255
- reason=f"death_date {row.get('death_date')} is {gap} days before treatment_start {row.get('treatment_start')}",
256
- confidence=conf, risk=risk,
257
- value=f"{gap}-day violation",
258
- statistical_context=f"Chronological ordering violated by {gap} days",
259
- ))
260
  return findings
261
 
262
 
263
- class SelectionBiasDetector:
264
- """Multi-dimensional bias detection in control group."""
265
- REPRESENTATION_THRESHOLD = 0.65
266
- OUTCOME_DISPARITY_THRESHOLD = 0.20
267
-
268
- def detect(self, dataset: list[dict], profile: dict) -> list[Finding]:
269
  findings = []
270
- control = [r for r in dataset if r.get("group") == "control"]
271
- if not control:
272
- return findings
273
-
274
- total_control = len(control)
275
- eth_counts = Counter(r.get("ethnicity", "Unknown") for r in control)
276
- dominant = eth_counts.most_common(1)[0] if eth_counts else None
277
- if not dominant:
278
- return findings
279
-
280
- dominant_name, dominant_count = dominant
281
- representation_ratio = dominant_count / total_control
282
-
283
- outcome_rates = {}
284
- for eth, count in eth_counts.items():
285
- deceased = sum(1 for r in control if r.get("ethnicity") == eth and r.get("outcome") == "deceased")
286
- outcome_rates[eth] = deceased / count if count > 0 else 0
 
287
 
288
- rates = list(outcome_rates.values())
289
- max_disparity = max(rates) - min(rates) if len(rates) > 1 else 0
290
 
291
- minority_deceased = sum(
292
- 1 for r in control
293
- if r.get("ethnicity") != dominant_name and r.get("outcome") == "deceased"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
294
  )
295
- minority_total = total_control - dominant_count
296
- minority_mortality = minority_deceased / minority_total if minority_total > 0 else 0
297
-
298
- male_control = sum(1 for r in control if r.get("gender") == "M")
299
- male_ratio = male_control / total_control
300
-
301
- evidence = []
302
- confidence = 0.0
303
-
304
- if representation_ratio >= self.REPRESENTATION_THRESHOLD:
305
- evidence.append(f"Representation: {dominant_name}={representation_ratio:.0%} of control")
306
- confidence += 0.4
307
- if minority_deceased > 0:
308
- evidence.append(f"Outcome disparity: minority mortality={minority_mortality:.0%}")
309
- confidence += 0.2
310
- if male_ratio >= 0.5:
311
- evidence.append(f"Gender imbalance: male={male_ratio:.0%}")
312
- confidence += 0.1
313
- if max_disparity > self.OUTCOME_DISPARITY_THRESHOLD:
314
- evidence.append(f"Statistically significant disparity: Δ={max_disparity:.0%}")
315
- confidence += 0.15
316
-
317
- confidence = min(1.0, confidence)
318
-
319
- if confidence >= 0.6 and representation_ratio >= self.REPRESENTATION_THRESHOLD:
320
- findings.append(Finding(
321
- patient_id=None, error_type="selection_bias",
322
- reason="Multi-dimensional selection bias: " + "; ".join(evidence),
323
- confidence=confidence, risk="critical",
324
- value=f"{dominant_name}={representation_ratio:.0%}",
325
- statistical_context=f"Representation: {dominant_name}={representation_ratio:.0%} | Disparity: Δ={max_disparity:.0%} | Minority mortality: {minority_mortality:.0%}",
326
- ))
327
-
328
- return findings
329
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
-# ═══════════════════════════════════════════════════════════════════════════
-# MODULE 3: ACTION PLANNER
-# ═══════════════════════════════════════════════════════════════════════════
 
 class ActionPlanner:
-    """Plans optimal action sequence adapted to task type and step budget."""
-
-    def plan(self, findings: list[Finding], task_type: str,
-             max_steps: int = 20) -> list[AuditAction]:
-        actions = []
-
-        # Phase 1: Investigation (3 steps)
-        investigate = ["age", "death_date", "ethnicity"]
-        for var in investigate:
-            actions.append(AuditAction(action_type="investigate_pattern", variable=var))
-
-        # Phase 2: Flag findings by priority
-        data_findings = [f for f in findings if f.error_type != "selection_bias"]
-        bias_findings = [f for f in findings if f.error_type == "selection_bias"]
-
-        data_findings.sort(key=lambda f: -f.priority_score)
-
-        bias_slots = 1 if (bias_findings and task_type == "comprehensive_audit") else 0
-        max_data_flags = max_steps - len(investigate) - 1 - bias_slots
-
-        flagged = set()
-        for f in data_findings[:max_data_flags]:
-            if f.patient_id in flagged:
-                continue
-            flagged.add(f.patient_id)
-            actions.append(AuditAction(
-                action_type="flag_error",
-                patient_id=f.patient_id,
-                error_type=f.error_type,
-                reason=f.reason,
-            ))
-
-        if bias_findings and task_type == "comprehensive_audit":
-            actions.append(AuditAction(
-                action_type="flag_error",
-                error_type="selection_bias",
-                reason=bias_findings[0].reason,
-            ))
-
-        return actions
 
 
-# ═══════════════════════════════════════════════════════════════════════════
-# MODULE 4: LLM REASONING LAYER
-# ═══════════════════════════════════════════════════════════════════════════
-
-def generate_expert_report(client, findings: list[Finding],
-                           profiles: dict, task_type: str) -> str:
-    """LLM generates expert audit report from pre-analyzed findings."""
-    age_f = [f for f in findings if f.error_type == "invalid_age"]
-    temp_f = [f for f in findings if f.error_type == "temporal_inconsistency"]
-    bias_f = [f for f in findings if f.error_type == "selection_bias"]
-    age_p = profiles.get("age", {})
-
-    sections = [
-        f"AUDIT ANALYSIS — Task: {task_type}",
-        f"Dataset: {age_p.get('valid_count', 0) + age_p.get('null_count', 0)} patients",
-        f"Age: median={age_p.get('median', '?')}, range=[{age_p.get('min', '?')}, {age_p.get('max', '?')}]",
-        "", "ISSUES:",
-    ]
-
-    if age_f:
-        sections.append(f"• {len(age_f)} age anomalies")
-        for f in age_f[:3]:
-            sections.append(f"  - {f.patient_id}: age={f.value}")
-    if temp_f:
-        sections.append(f"• {len(temp_f)} temporal violations")
-        for f in temp_f[:3]:
-            sections.append(f"  - {f.patient_id}: {f.value}")
-    if bias_f:
-        sections.append(f"• Selection bias: {bias_f[0].statistical_context}")
-
-    try:
-        completion = client.chat.completions.create(
-            model=MODEL_NAME,
-            messages=[
-                {
-                    "role": "system",
-                    "content": (
-                        "You are a Senior Clinical Data Manager writing a formal audit report. "
-                        "Provide: 1) SUMMARY with severity, 2) ROOT CAUSE analysis, "
-                        "3) RISK ASSESSMENT (impact on trial validity), "
-                        "4) RECOMMENDED corrective actions, "
-                        "5) REGULATORY compliance impact. "
-                        "Be concise (max 150 words). Use professional clinical language."
-                    ),
-                },
-                {"role": "user", "content": "\n".join(sections)},
-            ],
-            max_tokens=250,
-            temperature=0,
-        )
-        report = completion.choices[0].message.content or ""
-        if "recommend" not in report.lower():
-            report += "\nRecommend immediate corrective action for all identified issues."
-        return report
-    except Exception as e:
-        # Deterministic fallback
-        severity = "CRITICAL" if bias_f else "HIGH" if temp_f else "MEDIUM"
-        parts = [
-            f"CLINICAL DATA AUDIT REPORT — {task_type.replace('_', ' ').title()}",
-            f"\nSUMMARY: {len(findings)} data quality issues identified.",
-        ]
-        if age_f:
-            parts.append(f"\nAGE ANOMALIES ({len(age_f)}): Root cause: data entry errors or ETL pipeline failures.")
-        if temp_f:
-            parts.append(f"\nTEMPORAL VIOLATIONS ({len(temp_f)}): Root cause: date field mapping errors.")
-        if bias_f:
-            parts.append(f"\nSELECTION BIAS: {bias_f[0].statistical_context}.")
-        parts.append(f"\nRISK LEVEL: {severity}. Recommend immediate corrective action: "
-                     "quarantine affected records, audit data entry workflows, implement validation "
-                     "rules, and rebalance demographic representation in control group. "
-                     "This impacts regulatory compliance with FDA 21 CFR Part 11 and ICH-GCP guidelines.")
-        return "\n".join(parts)
-
-
-# ═══════════════════════════════════════════════════════════════════════════
-# MODULE 5: METRICS TRACKER
-# ═══════════════════════════════════════════════════════════════════════════
-
 class MetricsTracker:
     def __init__(self):
         self.true_pos = 0
@@ -461,123 +432,176 @@ class MetricsTracker:
         self.steps = 0
         self.llm_calls = 0
 
-    def record(self, feedback: str):
         self.total_flagged += 1
-        if "Correct" in feedback or "✓" in feedback:
         self.true_pos += 1
-        elif "False positive" in feedback or "REJECTED" in feedback or "✗" in feedback:
         self.false_pos += 1
 
     @property
     def precision(self) -> float:
-        return self.true_pos / self.total_flagged if self.total_flagged else 0
 
     def summary(self) -> str:
         return (
-            f"  📊 Metrics: {self.true_pos}/{self.total_flagged} correct "
-            f"(precision={self.precision:.0%}) | "
-            f"{self.steps} steps | {self.llm_calls} LLM call(s)"
         )
 
 
-# ═══════════════════════════════════════════════════════════════════════════
-# AGENT MODE 1: NAIVE LLM (raw prompt, no statistical tools)
-# ═══════════════════════════════════════════════════════════════════════════
 
 
-def run_naive_task(client, task_id: str, task_name: str):
-    """
-    Naive agent: sends raw data to LLM, asks it to find errors.
-    No statistical analysis, no planning. Expected score: ~0.25-0.40
-    """
-    print(f"\n  Task: {task_name}")
-    print("  " + "-" * 50)
 
     metrics = MetricsTracker()
     final_score = 0.0
 
-    with ClinicalTrialAuditorEnv(base_url=ENV_BASE_URL).sync() as env:
-        result = env.reset(task_id=task_id, seed=BASELINE_SEED)
         obs = result.observation.model_dump()
         dataset = obs["dataset"]
-        task_type = obs["task_type"]
         max_steps = obs["attempts_remaining"]
-        print(f"  Patients: {len(dataset)} | Max steps: {max_steps}")
 
-        # Send first 30 patients to LLM (token limit)
-        sample = dataset[:30]
-        sample_str = json.dumps(sample, indent=1, default=str)
 
-        try:
-            completion = client.chat.completions.create(
-                model=MODEL_NAME,
-                messages=[
-                    {
-                        "role": "system",
-                        "content": "You are a clinical data auditor. Find errors in patient data."
-                    },
-                    {
-                        "role": "user",
-                        "content": (
-                            f"Here are {len(sample)} patient records from a clinical trial. "
-                            f"Find ALL data quality issues.\n"
-                            f"For each issue, respond with ONE line: PATIENT_ID|ERROR_TYPE|REASON\n"
-                            f"ERROR_TYPE must be: invalid_age OR temporal_inconsistency\n"
-                            f"Valid age range: 18-120. Death date must not precede treatment start.\n\n"
-                            f"{sample_str}"
-                        ),
-                    },
-                ],
-                max_tokens=500,
-                temperature=0,
             )
-            llm_response = completion.choices[0].message.content or ""
-            metrics.llm_calls += 1
-        except Exception as e:
-            print(f"  LLM Error: {e}")
-            llm_response = ""
 
-        # Investigate (required phase gate)
-        for var in ["age", "death_date", "ethnicity"]:
-            if result.done:
-                break
-            result = env.step(AuditAction(action_type="investigate_pattern", variable=var))
-            metrics.steps += 1
 
-        # Parse LLM response and flag
-        lines = llm_response.strip().split("\n")
-        for line in lines:
            if result.done:
                break
-            parts = line.strip().split("|")
-            if len(parts) >= 2:
-                pid = parts[0].strip()
-                etype = parts[1].strip().lower().replace(" ", "_")
-                if etype not in ("invalid_age", "temporal_inconsistency"):
-                    continue
-                # Check if this patient_id exists
-                if not any(p.get("patient_id") == pid for p in dataset):
-                    continue
-                result = env.step(AuditAction(
-                    action_type="flag_error",
-                    patient_id=pid,
-                    error_type=etype,
-                    reason=parts[2].strip() if len(parts) > 2 else "LLM detected",
-                ))
-                obs = result.observation.model_dump()
-                final_score = obs["score_so_far"]
                metrics.record(obs["feedback"])
-                metrics.steps += 1
 
-        # Submit report
        if not result.done:
-            result = env.step(AuditAction(
-                action_type="submit_report",
-                report=(
-                    "Clinical data audit report. Issues found in patient ages and temporal "
-                    "sequences. Recommend corrective action for data entry validation. "
-                    "Risk assessment: HIGH. Impact on regulatory compliance noted."
-                ),
-            ))
 
            obs = result.observation.model_dump()
            final_score = obs["score_so_far"]
            metrics.steps += 1
@@ -586,96 +610,69 @@ def run_naive_task(client, task_id: str, task_name: str):
     return final_score, metrics
 
 
-# ═══════════════════════════════════════════════════════════════════════════
-# AGENT MODE 2: HEURISTIC (simple rules, no LLM)
-# ═══════════════════════════════════════════════════════════════════════════
-
-def run_heuristic_task(client_unused, task_id: str, task_name: str):
-    """
-    Heuristic agent: simple threshold rules, no LLM.
-    Catches obvious errors but falls for traps. Expected score: ~0.45-0.60
-    """
     print(f"\n  Task: {task_name}")
-    print("  " + "-" * 50)
-
     metrics = MetricsTracker()
     final_score = 0.0
 
-    with ClinicalTrialAuditorEnv(base_url=ENV_BASE_URL).sync() as env:
-        result = env.reset(task_id=task_id, seed=BASELINE_SEED)
        obs = result.observation.model_dump()
        dataset = obs["dataset"]
-        task_type = obs["task_type"]
        max_steps = obs["attempts_remaining"]
-        print(f"  Patients: {len(dataset)} | Max steps: {max_steps}")
 
-        # Investigate
-        for var in ["age", "death_date", "ethnicity"]:
-            if result.done:
-                break
-            result = env.step(AuditAction(action_type="investigate_pattern", variable=var))
-            metrics.steps += 1
 
-        step_budget = max_steps - metrics.steps - 1  # Reserve 1 for report
-        flags_made = 0
 
-        # Simple age check — catches most but may false-positive on boundaries
-        for p in dataset:
-            if flags_made >= step_budget or result.done:
                break
-            age = p.get("age")
-            pid = p.get("patient_id")
-
-            # BUG: heuristic uses < 18 instead of < 18, catching age=18 incorrectly? No.
-            # BUG: heuristic uses > 100 instead of > 120, missing ages 101-120 OR
-            #      flagging valid old patients
-            if age is None or age < 18 or age > 100:  # Deliberately wrong threshold
-                result = env.step(AuditAction(
-                    action_type="flag_error",
-                    patient_id=pid,
-                    error_type="invalid_age",
-                    reason=f"Age {age} outside expected range",
-                ))
-                obs = result.observation.model_dump()
-                final_score = obs["score_so_far"]
                metrics.record(obs["feedback"])
-                metrics.steps += 1
-                flags_made += 1
-
-        # Simple temporal check (if applicable)
-        if task_type in ("temporal_consistency", "comprehensive_audit"):
-            for p in dataset:
-                if flags_made >= step_budget or result.done:
-                    break
-                ts = p.get("treatment_start")
-                dd = p.get("death_date")
-                if ts and dd:
-                    try:
-                        t = datetime.strptime(ts, "%Y-%m-%d")
-                        d = datetime.strptime(dd, "%Y-%m-%d")
-                        # BUG: heuristic flags ANY death within 7 days (catches traps)
-                        if d < t or (d - t).days < 7:
-                            pid = p.get("patient_id")
-                            result = env.step(AuditAction(
-                                action_type="flag_error",
-                                patient_id=pid,
-                                error_type="temporal_inconsistency",
-                                reason=f"Suspicious temporal sequence",
-                            ))
-                            obs = result.observation.model_dump()
-                            final_score = obs["score_so_far"]
-                            metrics.record(obs["feedback"])
-                            metrics.steps += 1
-                            flags_made += 1
-                    except ValueError:
-                        pass
-
-        # Submit report
        if not result.done:
-            result = env.step(AuditAction(
-                action_type="submit_report",
-                report="Audit complete. Found age and temporal issues. Action recommended.",
-            ))
 
            obs = result.observation.model_dump()
            final_score = obs["score_so_far"]
            metrics.steps += 1
@@ -684,205 +681,152 @@ def run_heuristic_task(client_unused, task_id: str, task_name: str):
 
 
-# ═══════════════════════════════════════════════════════════════════════════
-# AGENT MODE 3: FULL AGENTIC PIPELINE
-# ═══════════════════════════════════════════════════════════════════════════
-
-def run_full_task(client, task_id: str, task_name: str):
-    """
-    Full agent: Statistical detection + LLM reasoning.
-    Expected score: ~0.85-0.95
-    """
     print(f"\n  Task: {task_name}")
-    print("  " + "-" * 50)
-
     metrics = MetricsTracker()
     final_score = 0.0
 
-    with ClinicalTrialAuditorEnv(base_url=ENV_BASE_URL).sync() as env:
-        result = env.reset(task_id=task_id, seed=BASELINE_SEED)
        obs = result.observation.model_dump()
        dataset = obs["dataset"]
-        task_type = obs["task_type"]
        max_steps = obs["attempts_remaining"]
-        print(f"  Type: {task_type} | Patients: {len(dataset)} | Max steps: {max_steps}")
-
-        # 1. PROFILE
-        profiler = DataProfiler(dataset)
-        profiles = profiler.profile_all()
-        ap = profiles.get("age", {})
-        print(f"  Profile: age median={ap.get('median','?')}, "
-              f"range=[{ap.get('min','?')}-{ap.get('max','?')}], "
-              f"nulls={ap.get('null_count',0)}")
-
-        # 2. DETECT
-        all_findings = []
-        all_findings.extend(AgeAnomalyDetector().detect(dataset, profiles))
-        if task_type in ("temporal_consistency", "comprehensive_audit"):
-            all_findings.extend(TemporalConsistencyDetector().detect(dataset, profiles))
-        if task_type == "comprehensive_audit":
-            all_findings.extend(SelectionBiasDetector().detect(dataset, profiles))
-
-        age_n = sum(1 for f in all_findings if f.error_type == "invalid_age")
-        temp_n = sum(1 for f in all_findings if f.error_type == "temporal_inconsistency")
-        bias_n = sum(1 for f in all_findings if f.error_type == "selection_bias")
-        print(f"  Detected: {age_n} age | {temp_n} temporal | {bias_n} bias")
-
-        # 3. PLAN
-        planner = ActionPlanner()
-        actions = planner.plan(all_findings, task_type, max_steps=max_steps)
 
-        # 4. REASON (1 LLM call for report)
-        report_text = generate_expert_report(client, all_findings, profiles, task_type)
-        metrics.llm_calls += 1
 
-        # 5. EXECUTE
-        step = 0
        for action in actions:
            if result.done:
                break
            result = env.step(action)
            obs = result.observation.model_dump()
            final_score = obs["score_so_far"]
-            feedback = obs["feedback"]
-            step += 1
-            metrics.steps = step
            if action.action_type == "flag_error":
-                metrics.record(feedback)
-            # Print progress every 5 steps or for flags
-            if action.action_type == "flag_error" or step <= 3:
-                print(f"  Step {step}: score={final_score:.2f} | {feedback[:65]}")
 
-        # 6. REPORT
        if not result.done:
-            result = env.step(AuditAction(action_type="submit_report", report=report_text))
            obs = result.observation.model_dump()
            final_score = obs["score_so_far"]
-            step += 1
-            metrics.steps = step
-            print(f"  Step {step}: score={final_score:.2f} | Report submitted")
 
        print(metrics.summary())
        return final_score, metrics
 
 
-# ═══════════════════════════════════════════════════════════════════════════
-# ORCHESTRATOR
-# ═══════════════════════════════════════════════════════════════════════════
-
-TASK_LIST = {
-    "task_easy": "Syntactic Cleaning (Easy)",
-    "task_medium": "Temporal Consistency (Medium)",
-    "task_hard": "Equity Bias Audit (Hard)",
-}
-
-
-def run_agent(mode: str, client):
-    """Run one agent mode across all tasks."""
    runner = {
        "naive": run_naive_task,
        "heuristic": run_heuristic_task,
        "full": run_full_task,
    }[mode]
 
-    scores, all_metrics = [], []
-    t0 = time.time()
-
-    for tid, tname in TASK_LIST.items():
-        score, m = runner(client, tid, tname)
        scores.append(score)
-        all_metrics.append(m)
-        print(f"  Final: {score:.2f}\n")
-
-    elapsed = time.time() - t0
-    avg = sum(scores) / len(scores)
-    total_steps = sum(m.steps for m in all_metrics)
-    total_llm = sum(m.llm_calls for m in all_metrics)
-    avg_prec = statistics.mean(m.precision for m in all_metrics) if all_metrics else 0
 
    return {
        "mode": mode,
        "scores": dict(zip(TASK_LIST.keys(), scores)),
-        "average": avg,
-        "elapsed": elapsed,
-        "total_steps": total_steps,
-        "total_llm": total_llm,
-        "avg_precision": avg_prec,
    }
 
 
 def main():
-    parser = argparse.ArgumentParser(description="Clinical Trial Auditor Baseline Inference")
-    parser.add_argument("--mode", choices=["naive", "heuristic", "full", "all"],
-                        default="full", help="Agent mode (default: full)")
    args = parser.parse_args()
 
-    # Only create LLM client when needed (heuristic mode doesn't use LLM)
-    needs_llm = args.mode in ("naive", "full", "all")
-    if needs_llm:
-        api_key = API_KEY or os.getenv("OPENAI_API_KEY")
-        if not api_key:
-            print("WARNING: No API key found. Set HF_TOKEN, API_KEY, or OPENAI_API_KEY.")
-            print("         Falling back to heuristic mode.")
-            args.mode = "heuristic"
-            client = None
-        else:
-            client = OpenAI(base_url=API_BASE_URL, api_key=api_key)
-    else:
-        client = None
 
-    print("=" * 65)
-    print("  Clinical Trial Auditor — Baseline Inference")
-    print("  Procedural Dataset Generation | Adversarial Traps | Seed-Reproducible")
    print(f"  Model: {MODEL_NAME}")
-    print(f"  Seed: {BASELINE_SEED}")
-    print("=" * 65)
 
-    if args.mode == "all":
-        modes = ["naive", "heuristic", "full"]
-    else:
-        modes = [args.mode]
-
-    all_results = []
    for mode in modes:
-        print(f"\n{'═' * 65}")
        print(f"  AGENT: {mode.upper()}")
-        print(f"{'═' * 65}")
-        result = run_agent(mode, client)
-        all_results.append(result)
 
-    # ── Final Report ──
-    print("\n" + "=" * 65)
    print("  BENCHMARK RESULTS")
-    print("=" * 65)
-
-    if len(all_results) > 1:
-        # Multi-agent comparison table
-        header = f"  {'Agent':<15} {'Easy':>8} {'Medium':>8} {'Hard':>8} {'Avg':>8} {'Prec':>8} {'Time':>8}"
        print(header)
-        print("  " + "-" * 63)
-        for r in all_results:
-            scores = r["scores"]
-            print(f"  {r['mode'].upper():<15} "
-                  f"{scores.get('task_easy', 0):.2f} "
-                  f"{scores.get('task_medium', 0):.2f} "
-                  f"{scores.get('task_hard', 0):.2f} "
-                  f"{r['average']:.2f} "
-                  f"{r['avg_precision']:.0%} "
-                  f"{r['elapsed']:.1f}s")
    else:
-        r = all_results[0]
-        for tid, tname in TASK_LIST.items():
-            score = r["scores"].get(tid, 0)
-            print(f"  {tname:35s}: {score:.2f}")
-        print(f"\n  Average score: {r['average']:.2f}")
-        print(f"  Total time: {r['elapsed']:.1f}s")
-        print(f"  LLM calls: {r['total_llm']}")
-        print(f"  Total steps: {r['total_steps']}")
-        print(f"  Average precision: {r['avg_precision']:.0%}")
-
-    print("=" * 65)
 
 
 if __name__ == "__main__":
-    main()
 """
+Clinical Trial Auditor — Baseline Inference
+===========================================
+Demonstrates a deliberate difficulty gradient on the protocol-aware benchmark:
+
+1. NAIVE     — raw prompt + small sample, weak structure
+2. HEURISTIC — parses obvious rules but ignores key exceptions
+3. FULL      — parses protocol, honors stage exceptions, stage-adjusts bias
 """
+
+from __future__ import annotations
+
+import argparse
+import json
 import os
+import re
+import statistics
 import sys
 import time
 from collections import Counter
+from dataclasses import dataclass, field
+from datetime import datetime
+from types import SimpleNamespace
 from typing import Optional
 
+from openai import OpenAI
+
 sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
 
 from client import ClinicalTrialAuditorEnv
 from models import AuditAction
 
+try:
+    from server.clinical_trial_auditor_environment import ClinicalTrialAuditorEnvironment
+except ImportError:
+    ClinicalTrialAuditorEnvironment = None
+
+
 API_BASE_URL = os.getenv("API_BASE_URL", "https://api.groq.com/openai/v1")
 API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or os.getenv("API_KEY")
 MODEL_NAME = os.getenv("MODEL_NAME", "llama-3.3-70b-versatile")
 ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:8000")
+BASELINE_SEED = int(os.getenv("BASELINE_SEED", "20260402"))
+
+TASK_LIST = {
+    "task_easy": "Dynamic Eligibility Screening (Easy)",
+    "task_medium": "Protocol Timeline Audit (Medium)",
+    "task_hard": "Equity + Protocol Audit (Hard)",
+}
+
+TASK_SPECS = {
+    "task_easy": {
+        "investigations": ["age"],
+        "distributions": [],
+    },
+    "task_medium": {
+        "investigations": ["age", "death_date", "enrollment_date", "stage"],
+        "distributions": [],
+    },
+    "task_hard": {
+        "investigations": ["age", "death_date", "enrollment_date", "stage"],
+        "distributions": ["ethnicity", "gender", "outcome"],
+    },
+}
+
 
+@dataclass
+class ProtocolRules:
+    protocol_title: str
+    age_min: int
+    age_max: int
+    treatment_window_days: int
+    stage_iv_window_days: int
+    high_risk_sites: list[str] = field(default_factory=list)
+    bias_control_dominance_threshold: float = 1.0
+    bias_male_threshold: float = 1.0
+    bias_stage_adjusted_gap: float = 1.0
+
+    def allowed_window(self, stage: str) -> int:
+        return self.stage_iv_window_days if stage == "IV" else self.treatment_window_days
 
+@dataclass
 class Finding:
+    error_type: str
+    reason: str
+    patient_id: Optional[str] = None
+    confidence: float = 1.0
+    risk: str = "medium"
+    evidence: str = ""
 
     @property
     def priority_score(self) -> float:
+        risk_weight = {"critical": 1.0, "high": 0.8, "medium": 0.5, "low": 0.2}
+        return self.confidence * risk_weight.get(self.risk, 0.5)
 
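The `priority_score` property in the diff is a plain confidence-times-severity product that the planner later sorts on. A minimal standalone sketch of that weighting (the `MiniFinding` class and the sample values are illustrative stand-ins, not part of the repository):

```python
from dataclasses import dataclass

# Same weights as the diff's risk_weight mapping.
RISK_WEIGHT = {"critical": 1.0, "high": 0.8, "medium": 0.5, "low": 0.2}


@dataclass
class MiniFinding:
    confidence: float
    risk: str

    @property
    def priority_score(self) -> float:
        # Confidence scaled by a severity weight; 0.5 is the
        # fallback for unrecognized risk labels.
        return self.confidence * RISK_WEIGHT.get(self.risk, 0.5)


findings = [
    MiniFinding(confidence=0.92, risk="critical"),
    MiniFinding(confidence=0.98, risk="high"),
    MiniFinding(confidence=0.80, risk="unknown"),
]
# Sort descending, as ActionPlanner.plan does before flagging.
ranked = sorted(findings, key=lambda f: -f.priority_score)
```

Note that a high-confidence "high" finding (0.98 × 0.8 = 0.784) still ranks below a slightly less confident "critical" one (0.92 × 1.0 = 0.92).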
 
+
+class ProtocolParser:
     @staticmethod
+    def parse(excerpt: str) -> ProtocolRules:
+        title_match = re.search(r"TRIAL PROTOCOL EXCERPT\s+[—-]\s+([A-Z0-9-]+)", excerpt)
+        age_match = re.search(r"age (\d+)-(\d+) inclusive", excerpt)
+        window_match = re.search(r"Treatment must begin within (\d+) days", excerpt)
+        stage_match = re.search(r"Stage IV exception: treatment may begin within (\d+) days", excerpt)
+        sites_match = re.search(
+            r"Stage IV patients at (.+?) are a known high-risk outreach cohort",
+            excerpt,
+        )
+        bias_match = re.search(
+            r"dominance exceeds (\d+)%, male share exceeds (\d+)%, "
+            r"and stage-adjusted mortality gap exceeds (\d+) percentage points",
+            excerpt,
+        )
 
+        high_risk_sites = []
+        if sites_match:
+            high_risk_sites = [site.strip() for site in sites_match.group(1).split(",")]
+
+        bias_values = (100, 100, 100)
+        if bias_match:
+            bias_values = tuple(int(value) for value in bias_match.groups())
+
+        if not age_match or not window_match or not stage_match:
+            raise ValueError("Unable to parse protocol excerpt.")
+
+        return ProtocolRules(
+            protocol_title=(title_match.group(1) if title_match else "UNKNOWN"),
+            age_min=int(age_match.group(1)),
+            age_max=int(age_match.group(2)),
+            treatment_window_days=int(window_match.group(1)),
+            stage_iv_window_days=int(stage_match.group(1)),
+            high_risk_sites=high_risk_sites,
+            bias_control_dominance_threshold=bias_values[0] / 100.0,
+            bias_male_threshold=bias_values[1] / 100.0,
+            bias_stage_adjusted_gap=bias_values[2] / 100.0,
+        )
 
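`ProtocolParser.parse` above recovers numeric thresholds from free-text protocol language with regular expressions. A reduced sketch of the same idea, using the same patterns on a hypothetical excerpt (the wording below is invented to match those patterns and is not taken from the environment):

```python
import re

# Hypothetical excerpt written to satisfy the parser's regexes.
excerpt = (
    "TRIAL PROTOCOL EXCERPT — ONC-2026-11\n"
    "Eligibility: age 18-85 inclusive.\n"
    "Treatment must begin within 14 days of enrollment.\n"
    "Stage IV exception: treatment may begin within 21 days."
)

# Capture groups pull the numbers out of the prose.
age = re.search(r"age (\d+)-(\d+) inclusive", excerpt)
window = re.search(r"Treatment must begin within (\d+) days", excerpt)
stage = re.search(r"Stage IV exception: treatment may begin within (\d+) days", excerpt)

age_min, age_max = int(age.group(1)), int(age.group(2))
standard_days, stage_iv_days = int(window.group(1)), int(stage.group(1))
```

This is also why `parse` raises when the age, window, or stage patterns fail: a silently missing threshold would make every downstream detector wrong.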
+def parse_date(value: Optional[str]) -> Optional[datetime]:
+    if not value:
+        return None
+    for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%Y"):
+        try:
+            return datetime.strptime(str(value), fmt)
+        except ValueError:
+            continue
+    return None
 
 
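`parse_date` tries several layouts in order and returns `None` only when all of them fail. The same fallback pattern in isolation (`parse_any` is an illustrative replica, not the repository function):

```python
from datetime import datetime
from typing import Optional

# Same candidate layouts as the diff's parse_date.
FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%Y")


def parse_any(value: Optional[str]) -> Optional[datetime]:
    # Try each known layout; the first that parses wins.
    if not value:
        return None
    for fmt in FORMATS:
        try:
            return datetime.strptime(str(value), fmt)
        except ValueError:
            continue
    return None
```

Order matters for ambiguous strings: a value that could parse under two layouts is resolved by whichever format appears first in the tuple.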
 
+class AgeDetector:
+    def detect(self, dataset: list[dict], rules: ProtocolRules) -> list[Finding]:
+        findings = []
+        for row in dataset:
+            age = row.get("age")
+            if age is None or age < rules.age_min or age > rules.age_max:
+                findings.append(
+                    Finding(
+                        patient_id=row.get("patient_id"),
+                        error_type="invalid_age",
+                        reason=f"Age {age} violates protocol range {rules.age_min}-{rules.age_max}",
+                        confidence=0.98 if age is None or age < 0 or age > (rules.age_max + 10) else 0.94,
+                        risk="high",
+                    )
+                )
         return findings
 
 
+class TemporalDetector:
+    def detect(self, dataset: list[dict]) -> list[Finding]:
         findings = []
         for row in dataset:
+            treatment = parse_date(row.get("treatment_start"))
+            death = parse_date(row.get("death_date"))
+            if treatment and death and death < treatment:
+                gap = (treatment - death).days
+                findings.append(
+                    Finding(
+                        patient_id=row.get("patient_id"),
+                        error_type="temporal_inconsistency",
+                        reason=f"death_date precedes treatment_start by {gap} days",
+                        confidence=min(1.0, 0.92 + gap / 500.0),
+                        risk="critical" if gap > 120 else "high",
+                    )
+                )
         return findings
 
 
+class ProtocolWindowDetector:
+    def detect(self, dataset: list[dict], rules: ProtocolRules, ignore_stage_exception: bool = False) -> list[Finding]:
         findings = []
+        for row in dataset:
+            enrollment = parse_date(row.get("enrollment_date"))
+            treatment = parse_date(row.get("treatment_start"))
+            if not enrollment or not treatment:
+                continue
+            allowed_days = rules.treatment_window_days if ignore_stage_exception else rules.allowed_window(row.get("stage", ""))
+            delay = (treatment - enrollment).days
+            if delay > allowed_days:
+                findings.append(
+                    Finding(
+                        patient_id=row.get("patient_id"),
+                        error_type="protocol_window_violation",
+                        reason=f"treatment started after {delay} days (allowed {allowed_days})",
+                        confidence=0.93 if delay > allowed_days + 3 else 0.82,
+                        risk="high",
+                    )
+                )
+        return findings
 
+class BiasAnalyzer:
+    @staticmethod
+    def summarize_control(dataset: list[dict]) -> tuple[list[dict], str, float, float, float]:
+        control = [row for row in dataset if row.get("group") == "control"]
+        if not control:
+            return [], "Unknown", 0.0, 0.0, 0.0
+
+        counts = Counter(row.get("ethnicity", "Unknown") for row in control)
+        dominant_ethnicity, dominant_count = counts.most_common(1)[0]
+        dominant_ratio = dominant_count / len(control)
+        male_ratio = sum(row.get("gender") == "M" for row in control) / len(control)
+
+        dominant_group = [row for row in control if row.get("ethnicity") == dominant_ethnicity]
+        minority_group = [row for row in control if row.get("ethnicity") != dominant_ethnicity]
+        dom_mortality = (
+            sum(row.get("outcome") == "deceased" for row in dominant_group) / len(dominant_group)
+            if dominant_group
+            else 0.0
         )
+        min_mortality = (
+            sum(row.get("outcome") == "deceased" for row in minority_group) / len(minority_group)
+            if minority_group
+            else 0.0
+        )
+        overall_gap = min_mortality - dom_mortality
+        return control, dominant_ethnicity, dominant_ratio, male_ratio, overall_gap
 
+    @staticmethod
+    def stage_adjusted_gap(control: list[dict], dominant_ethnicity: str) -> float:
+        weighted_gap = 0.0
+        total_weight = 0
+        for stage in ("I", "II", "III", "IV"):
+            stage_rows = [row for row in control if row.get("stage") == stage]
+            dominant_rows = [row for row in stage_rows if row.get("ethnicity") == dominant_ethnicity]
+            minority_rows = [row for row in stage_rows if row.get("ethnicity") != dominant_ethnicity]
+            if len(dominant_rows) < 5 or len(minority_rows) < 5:
+                continue
+            dominant_mortality = sum(row.get("outcome") == "deceased" for row in dominant_rows) / len(dominant_rows)
+            minority_mortality = sum(row.get("outcome") == "deceased" for row in minority_rows) / len(minority_rows)
+            weight = len(stage_rows)
+            weighted_gap += (minority_mortality - dominant_mortality) * weight
+            total_weight += weight
+        return weighted_gap / total_weight if total_weight else 0.0
+
+    def detect_full(self, dataset: list[dict], rules: ProtocolRules) -> list[Finding]:
+        control, dominant_ethnicity, dominant_ratio, male_ratio, overall_gap = self.summarize_control(dataset)
+        if not control:
+            return []
+        adjusted_gap = self.stage_adjusted_gap(control, dominant_ethnicity)
+        if (
+            dominant_ratio >= rules.bias_control_dominance_threshold
+            and male_ratio >= rules.bias_male_threshold
+            and adjusted_gap >= rules.bias_stage_adjusted_gap
+        ):
+            return [
+                Finding(
+                    patient_id=None,
+                    error_type="selection_bias",
+                    reason=(
+                        f"Control-arm skew detected: {dominant_ethnicity}={dominant_ratio:.0%}, "
+                        f"male={male_ratio:.0%}, stage-adjusted mortality gap={adjusted_gap:.0%}"
+                    ),
+                    confidence=0.92,
+                    risk="critical",
+                    evidence=f"overall gap={overall_gap:.0%}",
+                )
+            ]
+        return []
+
+    def detect_heuristic(self, dataset: list[dict], rules: ProtocolRules) -> list[Finding]:
+        control, dominant_ethnicity, dominant_ratio, male_ratio, overall_gap = self.summarize_control(dataset)
+        if not control:
+            return []
+        loose_threshold = max(0.10, rules.bias_stage_adjusted_gap - 0.04)
+        if dominant_ratio >= max(0.55, rules.bias_control_dominance_threshold - 0.07) and overall_gap >= loose_threshold:
+            return [
+                Finding(
+                    patient_id=None,
+                    error_type="selection_bias",
+                    reason=(
+                        f"Heuristic bias concern: {dominant_ethnicity}={dominant_ratio:.0%}, "
+                        f"male={male_ratio:.0%}, overall mortality gap={overall_gap:.0%}"
+                    ),
+                    confidence=0.74,
+                    risk="high",
+                )
+            ]
+        return []
 
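`BiasAnalyzer.stage_adjusted_gap` is a stratified mortality comparison: the minority-vs-dominant gap is computed within each disease stage and then averaged with stage sizes as weights. This protects the full agent's bias check from Simpson's-paradox traps, where stage mix alone produces a large overall gap. A toy version of the same computation (the cohort data is invented for illustration; the repository's per-stratum minimum of 5 is relaxed to 1 here so the tiny sample qualifies):

```python
def stage_adjusted_gap(control, dominant="A", min_n=1):
    # Weighted average of per-stage (minority - dominant) mortality gaps.
    weighted, total = 0.0, 0
    for stage in ("I", "II", "III", "IV"):
        rows = [r for r in control if r["stage"] == stage]
        dom = [r for r in rows if r["ethnicity"] == dominant]
        mino = [r for r in rows if r["ethnicity"] != dominant]
        if len(dom) < min_n or len(mino) < min_n:
            continue  # skip strata too small to compare
        gap = (sum(r["dead"] for r in mino) / len(mino)
               - sum(r["dead"] for r in dom) / len(dom))
        weighted += gap * len(rows)
        total += len(rows)
    return weighted / total if total else 0.0


# Toy cohort: within each stage, mortality is identical across groups,
# but group B is concentrated in stage IV, so the OVERALL gap looks large.
control = (
    [{"stage": "I", "ethnicity": "A", "dead": 0}] * 8
    + [{"stage": "I", "ethnicity": "B", "dead": 0}] * 2
    + [{"stage": "IV", "ethnicity": "A", "dead": 1}] * 2
    + [{"stage": "IV", "ethnicity": "B", "dead": 1}] * 8
)
minority = [r for r in control if r["ethnicity"] != "A"]
dominant_rows = [r for r in control if r["ethnicity"] == "A"]
overall_gap = (sum(r["dead"] for r in minority) / len(minority)
               - sum(r["dead"] for r in dominant_rows) / len(dominant_rows))
adjusted = stage_adjusted_gap(control)
```

Here the naive overall gap is 0.6, while the stage-adjusted gap is 0.0, which is exactly the distinction that separates `detect_full` from `detect_heuristic`.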
301
  class ActionPlanner:
302
+ def plan(
303
+ self,
304
+ task_id: str,
305
+ findings: list[Finding],
306
+ max_steps: int,
307
+ extra_investigations: Optional[list[str]] = None,
308
+ ) -> list[AuditAction]:
309
+ spec = TASK_SPECS[task_id]
310
+ actions: list[AuditAction] = []
311
+
312
+ investigations = list(spec["investigations"])
313
+ distributions = list(spec["distributions"])
314
+ if extra_investigations:
315
+ investigations.extend(extra_investigations)
316
+
317
+ for variable in investigations:
318
+ actions.append(AuditAction(action_type="investigate_pattern", variable=variable))
319
+ for variable in distributions:
320
+ actions.append(AuditAction(action_type="compute_distribution", variable=variable))
321
+
322
+ record_findings = [finding for finding in findings if finding.error_type != "selection_bias"]
323
+ bias_findings = [finding for finding in findings if finding.error_type == "selection_bias"]
324
+ record_findings.sort(key=lambda finding: -finding.priority_score)
325
+
326
+ max_flag_slots = max_steps - len(actions) - 1 - (1 if bias_findings else 0)
327
+ flagged_ids = set()
328
+ for finding in record_findings[:max_flag_slots]:
329
+ if not finding.patient_id or finding.patient_id in flagged_ids:
330
+ continue
331
+ flagged_ids.add(finding.patient_id)
332
+ actions.append(
333
+ AuditAction(
334
+ action_type="flag_error",
335
+ patient_id=finding.patient_id,
336
+ error_type=finding.error_type,
337
+ reason=finding.reason,
338
+ confidence=finding.confidence,
339
+ )
340
+ )
341
 
342
+ if bias_findings:
343
+ bias = bias_findings[0]
344
+ actions.append(
345
+ AuditAction(
346
+ action_type="flag_error",
347
+ error_type="selection_bias",
348
+ reason=bias.reason,
349
+ confidence=bias.confidence,
350
+ )
351
+ )
352
 
353
+ return actions
 
 
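`ActionPlanner.plan` budgets per-patient flags so that one step always remains for `submit_report`, plus one more when a dataset-level bias finding must be flagged. That arithmetic, pulled out into a hypothetical helper:

```python
def flag_budget(max_steps: int, planned_probes: int,
                has_bias_finding: bool) -> int:
    """Per-patient flag slots left after investigation/distribution probes,
    the reserved submit_report step, and an optional dataset-level bias flag."""
    reserved = 1 + (1 if has_bias_finding else 0)
    return max_steps - planned_probes - reserved
```

For example, with a 15-step budget, 5 probe actions, and a bias finding present, 8 slots remain for record-level flags.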
354
 
 
355
 
356
+ def generate_expert_report(
357
+ client: Optional[OpenAI],
358
+ rules: ProtocolRules,
359
+ findings: list[Finding],
360
+ task_name: str,
361
+ ) -> str:
362
+ finding_lines = []
363
+ for finding in findings[:8]:
364
+ if finding.patient_id:
365
+ finding_lines.append(f"- {finding.patient_id}: {finding.error_type} | {finding.reason}")
366
+ else:
367
+ finding_lines.append(f"- {finding.error_type}: {finding.reason}")
368
+
369
+ prompt = "\n".join(
370
+ [
371
+ f"Protocol: {rules.protocol_title}",
372
+ f"Task: {task_name}",
373
+ f"Key rules: age {rules.age_min}-{rules.age_max}, "
374
+ f"standard start <= {rules.treatment_window_days} days, "
375
+ f"stage IV <= {rules.stage_iv_window_days} days.",
376
+ "",
377
+ "Findings:",
378
+ *finding_lines,
379
+ "",
380
+ "Write a concise audit report with protocol grounding, root cause, risk, corrective actions, "
381
+ "and fairness reasoning when relevant.",
382
+ ]
383
+ )
384
 
385
+ if client is not None:
386
+ try:
387
+ completion = client.chat.completions.create(
388
+ model=MODEL_NAME,
389
+ messages=[
390
+ {
391
+ "role": "system",
392
+ "content": (
393
+ "You are a senior clinical data manager. Produce a concise report "
394
+ "with protocol grounding, root cause, risk, corrective action, and "
395
+ "fairness reasoning when applicable."
396
+ ),
397
+ },
398
+ {"role": "user", "content": prompt},
399
+ ],
400
+ temperature=0,
401
+ max_tokens=240,
402
+ )
403
+ content = completion.choices[0].message.content or ""
404
+ if content:
405
+ return content
406
+ except Exception:
407
+ pass
408
+
409
+ if any(finding.error_type == "selection_bias" for finding in findings):
410
+ fairness_line = (
411
+ "Fairness review: control-arm patterns were reviewed with stage-adjusted comparisons "
412
+ "before escalating the bias conclusion."
413
+ )
414
+ else:
415
+ fairness_line = (
416
+ "Fairness review: no actionable control-arm bias was confirmed after stage-adjusted review."
417
+ )
418
 
419
+ return (
420
+ f"Protocol-grounded audit for {rules.protocol_title}. Root cause analysis indicates site-level "
421
+ f"data capture and scheduling control weaknesses. Risk assessment: protocol compliance and endpoint "
422
+ f"validity are affected. Recommended corrective actions include quarantining impacted records, "
423
+ f"tightening enrollment-to-treatment validations, and retraining site coordinators. {fairness_line}"
424
+ )
425
 
426
 
427
  class MetricsTracker:
428
  def __init__(self):
429
  self.true_pos = 0
 
432
  self.steps = 0
433
  self.llm_calls = 0
434
 
435
+ def record(self, feedback: str) -> None:
436
  self.total_flagged += 1
437
+ if "Correct" in feedback:
438
  self.true_pos += 1
439
+ elif "REJECTED" in feedback:
440
  self.false_pos += 1
441
 
442
  @property
443
  def precision(self) -> float:
444
+ return self.true_pos / self.total_flagged if self.total_flagged else 0.0
445
 
446
  def summary(self) -> str:
447
  return (
448
+ f" Metrics: {self.true_pos}/{self.total_flagged} correct "
449
+ f"(precision={self.precision:.0%}) | {self.steps} steps | {self.llm_calls} LLM call(s)"
 
450
  )
451
 
452
 
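`MetricsTracker.precision` matters because the environment's reward asymmetry (+0.10 per correct flag vs. -0.30 per false positive, per the REWARD_CONFIG in this repository) makes flagging net-negative below 75% precision. A quick check of that break-even point (helper name is illustrative):

```python
def expected_flag_reward(precision: float,
                         correct: float = 0.10,
                         false_positive: float = -0.30) -> float:
    """Expected per-flag reward for a policy with the given precision."""
    return precision * correct + (1.0 - precision) * false_positive
```

At precision 0.75 the expected reward is exactly zero, which is why the heuristic and full agents prune low-confidence findings instead of flagging everything.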
453
+ class InProcessEnvSession:
454
+ def __init__(self):
455
+ if ClinicalTrialAuditorEnvironment is None:
456
+ raise RuntimeError("In-process environment is unavailable.")
457
+ self._env = ClinicalTrialAuditorEnvironment()
458
 
459
+ def __enter__(self):
460
+ return self
461
+
462
+ def __exit__(self, exc_type, exc, tb):
463
+ return False
464
+
465
+ def reset(self, **kwargs):
466
+ observation = self._env.reset(**kwargs)
467
+ return SimpleNamespace(observation=observation, reward=observation.reward, done=observation.done)
468
+
469
+ def step(self, action: AuditAction):
470
+ observation = self._env.step(action)
471
+ return SimpleNamespace(observation=observation, reward=observation.reward, done=observation.done)
472
+
473
+
474
+ def open_env_session():
475
+ if ENV_BASE_URL.lower() == "inprocess":
476
+ return InProcessEnvSession()
477
+ return ClinicalTrialAuditorEnv(base_url=ENV_BASE_URL).sync()
478
 
479
+
480
+ def run_naive_task(client: Optional[OpenAI], task_id: str, task_name: str, seed: int):
481
+ print(f"\n Task: {task_name}")
482
+ print(" " + "-" * 54)
483
  metrics = MetricsTracker()
484
  final_score = 0.0
485
 
486
+ with open_env_session() as env:
487
+ result = env.reset(task_id=task_id, seed=seed)
488
  obs = result.observation.model_dump()
489
  dataset = obs["dataset"]
490
+ protocol_excerpt = obs["trial_protocol_excerpt"]
491
  max_steps = obs["attempts_remaining"]
492
+ rules = ProtocolParser.parse(protocol_excerpt)
493
+ print(f" Protocol: {rules.protocol_title} | Patients: {len(dataset)} | Max steps: {max_steps}")
494
 
495
+ sample = dataset[:24]
496
+ guessed_findings: list[Finding] = []
 
497
 
498
+ if client is not None:
499
+ try:
500
+ completion = client.chat.completions.create(
501
+ model=MODEL_NAME,
502
+ messages=[
503
+ {
504
+ "role": "system",
505
+ "content": (
506
+ "You are auditing patient records from a clinical trial. "
507
+ "Return one issue per line as PATIENT_ID|ERROR_TYPE|REASON."
508
+ ),
509
+ },
510
+ {
511
+ "role": "user",
512
+ "content": (
513
+ f"Protocol excerpt:\n{protocol_excerpt}\n\n"
514
+ f"Review only these {len(sample)} records:\n{json.dumps(sample, default=str)}\n\n"
515
+ "Allowed ERROR_TYPE values: invalid_age, temporal_inconsistency, "
516
+ "protocol_window_violation, selection_bias."
517
+ ),
518
+ },
519
+ ],
520
+ temperature=0,
521
+ max_tokens=450,
522
+ )
523
+ metrics.llm_calls += 1
524
+ lines = (completion.choices[0].message.content or "").splitlines()
525
+ for line in lines:
526
+ parts = [part.strip() for part in line.split("|")]
527
+ if len(parts) >= 2:
528
+ guessed_findings.append(
529
+ Finding(
530
+ patient_id=parts[0] if parts[0] and parts[0] != "None" else None,
531
+ error_type=parts[1],
532
+ reason=parts[2] if len(parts) > 2 else "LLM guess",
533
+ confidence=0.65,
534
+ )
535
+ )
536
+ except Exception as exc:
537
+ print(f" LLM error: {exc}")
538
+
539
+ if not guessed_findings:
540
+ for row in sample:
541
+ age = row.get("age")
542
+ treatment = parse_date(row.get("treatment_start"))
543
+ death = parse_date(row.get("death_date"))
544
+ enrollment = parse_date(row.get("enrollment_date"))
545
+ if age is None or age < 0 or age > 120:
546
+ guessed_findings.append(
547
+ Finding(
548
+ patient_id=row.get("patient_id"),
549
+ error_type="invalid_age",
550
+ reason="Sample-level obvious age anomaly",
551
+ confidence=0.55,
552
+ )
553
+ )
554
+ if treatment and death and death < treatment:
555
+ guessed_findings.append(
556
+ Finding(
557
+ patient_id=row.get("patient_id"),
558
+ error_type="temporal_inconsistency",
559
+ reason="Sample-level temporal anomaly",
560
+ confidence=0.60,
561
+ )
562
+ )
563
+ plan_actions = []
564
+ for variable in TASK_SPECS[task_id]["investigations"]:
565
+ plan_actions.append(AuditAction(action_type="investigate_pattern", variable=variable))
566
+ if task_id == "task_hard":
567
+ plan_actions.extend(
568
+ AuditAction(action_type="compute_distribution", variable=variable)
569
+ for variable in TASK_SPECS[task_id]["distributions"]
570
  )
 
 
 
 
 
571
 
572
+ max_flag_slots = max_steps - len(plan_actions) - 1
573
+ for finding in guessed_findings[:max_flag_slots]:
574
+ plan_actions.append(
575
+ AuditAction(
576
+ action_type="flag_error",
577
+ patient_id=finding.patient_id,
578
+ error_type=finding.error_type,
579
+ reason=finding.reason,
580
+ confidence=finding.confidence,
581
+ )
582
+ )
583
 
584
+ for action in plan_actions:
 
 
585
  if result.done:
586
  break
587
+ result = env.step(action)
588
+ obs = result.observation.model_dump()
589
+ final_score = obs["score_so_far"]
590
+ metrics.steps += 1
591
+ if action.action_type == "flag_error":
 
592
  metrics.record(obs["feedback"])
 
593
 
 
594
  if not result.done:
595
+ result = env.step(
596
+ AuditAction(
597
+ action_type="submit_report",
598
+ report=(
599
+ f"Protocol grounding for {rules.protocol_title}. "
600
+ "Sample review found possible age and timing issues. "
601
+ "Recommend manual review and corrective action."
602
+ ),
603
+ )
604
+ )
605
  obs = result.observation.model_dump()
606
  final_score = obs["score_so_far"]
607
  metrics.steps += 1
 
610
  return final_score, metrics
611
 
612
 
613
+ def run_heuristic_task(client_unused: Optional[OpenAI], task_id: str, task_name: str, seed: int):
614
  print(f"\n Task: {task_name}")
615
+ print(" " + "-" * 54)
 
616
  metrics = MetricsTracker()
617
  final_score = 0.0
618
 
619
+ with open_env_session() as env:
620
+ result = env.reset(task_id=task_id, seed=seed)
621
  obs = result.observation.model_dump()
622
  dataset = obs["dataset"]
623
+ rules = ProtocolParser.parse(obs["trial_protocol_excerpt"])
624
  max_steps = obs["attempts_remaining"]
625
+ print(f" Protocol: {rules.protocol_title} | Patients: {len(dataset)} | Max steps: {max_steps}")
626
 
627
+ actions: list[AuditAction] = []
628
+ for variable in TASK_SPECS[task_id]["investigations"]:
629
+ actions.append(AuditAction(action_type="investigate_pattern", variable=variable))
630
+ for variable in TASK_SPECS[task_id]["distributions"]:
631
+ actions.append(AuditAction(action_type="compute_distribution", variable=variable))
 
632
 
633
+ findings = []
634
+ for row in dataset:
635
+ age = row.get("age")
636
+ if age is None or age < (rules.age_min - 3) or age > (rules.age_max + 3):
637
+ findings.append(
638
+ Finding(
639
+ patient_id=row.get("patient_id"),
640
+ error_type="invalid_age",
641
+ reason=f"Heuristic age screen triggered on {age}",
642
+ confidence=0.82,
643
+ risk="high",
644
+ )
645
+ )
646
+ findings.extend(TemporalDetector().detect(dataset))
647
+ if task_id in {"task_medium", "task_hard"}:
648
+ findings.extend(ProtocolWindowDetector().detect(dataset, rules, ignore_stage_exception=True))
649
+ if task_id == "task_hard":
650
+ findings.extend(BiasAnalyzer().detect_heuristic(dataset, rules))
651
 
652
+ planner = ActionPlanner()
653
+ planned_flags = planner.plan(task_id, findings, max_steps=max_steps)
654
+ actions = planned_flags
655
+
656
+ for action in actions:
657
+ if result.done:
658
  break
659
+ result = env.step(action)
660
+ obs = result.observation.model_dump()
661
+ final_score = obs["score_so_far"]
662
+ metrics.steps += 1
663
+ if action.action_type == "flag_error":
 
664
  metrics.record(obs["feedback"])
665
+
666
  if not result.done:
667
+ result = env.step(
668
+ AuditAction(
669
+ action_type="submit_report",
670
+ report=(
671
+ f"Protocol review for {rules.protocol_title}. Root cause is likely data-entry drift. "
672
+ "Recommend validation checks and operational follow-up. Risk is moderate to high."
673
+ ),
674
+ )
675
+ )
676
  obs = result.observation.model_dump()
677
  final_score = obs["score_so_far"]
678
  metrics.steps += 1
 
681
  return final_score, metrics
682
 
683
 
684
+ def run_full_task(client: Optional[OpenAI], task_id: str, task_name: str, seed: int):
685
  print(f"\n Task: {task_name}")
686
+ print(" " + "-" * 54)
 
687
  metrics = MetricsTracker()
688
  final_score = 0.0
689
 
690
+ with open_env_session() as env:
691
+ result = env.reset(task_id=task_id, seed=seed)
692
  obs = result.observation.model_dump()
693
  dataset = obs["dataset"]
694
+ rules = ProtocolParser.parse(obs["trial_protocol_excerpt"])
695
  max_steps = obs["attempts_remaining"]
696
+ print(f" Protocol: {rules.protocol_title} | Patients: {len(dataset)} | Max steps: {max_steps}")
697
+ print(
698
+ f" Rules: age {rules.age_min}-{rules.age_max} | standard <= {rules.treatment_window_days}d | "
699
+ f"stage IV <= {rules.stage_iv_window_days}d"
700
+ )
 
701
 
702
+ findings = []
703
+ findings.extend(AgeDetector().detect(dataset, rules))
704
+ findings.extend(TemporalDetector().detect(dataset))
705
+ if task_id in {"task_medium", "task_hard"}:
706
+ findings.extend(ProtocolWindowDetector().detect(dataset, rules, ignore_stage_exception=False))
707
+ if task_id == "task_hard":
708
+ findings.extend(BiasAnalyzer().detect_full(dataset, rules))
709
+
710
+ age_count = sum(f.error_type == "invalid_age" for f in findings)
711
+ temporal_count = sum(f.error_type == "temporal_inconsistency" for f in findings)
712
+ window_count = sum(f.error_type == "protocol_window_violation" for f in findings)
713
+ bias_count = sum(f.error_type == "selection_bias" for f in findings)
714
+ print(
715
+ f" Detected: age={age_count} | temporal={temporal_count} | "
716
+ f"window={window_count} | bias={bias_count}"
717
+ )
718
+
719
+ extra_checks = {
720
+ "task_easy": ["enrollment_date", "stage", "group", "treatment_site", "country"],
721
+ "task_medium": ["group", "treatment_site", "outcome", "country", "drug"],
722
+ "task_hard": ["treatment_site", "group", "country", "drug", "trial_phase"],
723
+ }.get(task_id, [])
724
+ actions = ActionPlanner().plan(task_id, findings, max_steps=max_steps, extra_investigations=extra_checks)
725
+ report = generate_expert_report(client, rules, findings, task_name)
726
+ if client is not None:
727
+ metrics.llm_calls += 1
728
 
 
 
729
  for action in actions:
730
  if result.done:
731
  break
732
  result = env.step(action)
733
  obs = result.observation.model_dump()
734
  final_score = obs["score_so_far"]
735
+ metrics.steps += 1
 
 
736
  if action.action_type == "flag_error":
737
+ metrics.record(obs["feedback"])
738
+ if action.action_type == "flag_error" or metrics.steps <= 5:
739
+ print(f" Step {metrics.steps}: score={final_score:.2f} | {obs['feedback'][:80]}")
 
740
 
 
741
  if not result.done:
742
+ result = env.step(AuditAction(action_type="submit_report", report=report))
743
  obs = result.observation.model_dump()
744
  final_score = obs["score_so_far"]
745
+ metrics.steps += 1
746
+ print(f" Step {metrics.steps}: score={final_score:.2f} | report submitted")
 
747
 
748
  print(metrics.summary())
749
  return final_score, metrics
750
 
751
 
752
+ def run_agent(mode: str, client: Optional[OpenAI], seed: int):
 
753
  runner = {
754
  "naive": run_naive_task,
755
  "heuristic": run_heuristic_task,
756
  "full": run_full_task,
757
  }[mode]
758
 
759
+ scores = []
760
+ metrics_list = []
761
+ start = time.time()
762
+ for task_id, task_name in TASK_LIST.items():
763
+ score, metrics = runner(client, task_id, task_name, seed)
764
  scores.append(score)
765
+ metrics_list.append(metrics)
766
+ print(f" Final score: {score:.2f}\n")
 
767
 
768
  return {
769
  "mode": mode,
770
  "scores": dict(zip(TASK_LIST.keys(), scores)),
771
+ "average": sum(scores) / len(scores),
772
+ "elapsed": time.time() - start,
773
+ "total_steps": sum(metric.steps for metric in metrics_list),
774
+ "total_llm": sum(metric.llm_calls for metric in metrics_list),
775
+ "avg_precision": statistics.mean(metric.precision for metric in metrics_list),
776
  }
777
 
778
 
779
  def main():
780
+ parser = argparse.ArgumentParser(description="Clinical Trial Auditor baseline inference")
781
+ parser.add_argument("--mode", choices=["naive", "heuristic", "full", "all"], default="full")
782
+ parser.add_argument("--seed", type=int, default=BASELINE_SEED)
783
  args = parser.parse_args()
784
 
785
+ client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY) if API_KEY else None
786
 
787
+ print("=" * 70)
788
+ print(" Clinical Trial Auditor — Protocol-Aware Baseline Inference")
789
+ print(" Dynamic Rules | Adversarial Traps | Stage-Adjusted Fairness Review")
790
  print(f" Model: {MODEL_NAME}")
791
+ print(f" Seed: {args.seed}")
792
+ print("=" * 70)
793
+ if client is None:
794
+ print(" Note: no API key detected. Naive/full report generation will use deterministic fallbacks.")
795
 
796
+ modes = ["naive", "heuristic", "full"] if args.mode == "all" else [args.mode]
797
+ results = []
 
 
 
 
798
  for mode in modes:
799
+ print(f"\n{'═' * 70}")
800
  print(f" AGENT: {mode.upper()}")
801
+ print(f"{'═' * 70}")
802
+ results.append(run_agent(mode, client, args.seed))
 
803
 
804
+ print("\n" + "=" * 70)
 
805
  print(" BENCHMARK RESULTS")
806
+ print("=" * 70)
807
+ if len(results) > 1:
808
+ header = f" {'Agent':<12} {'Easy':>8} {'Medium':>8} {'Hard':>8} {'Avg':>8} {'Prec':>8} {'Time':>8}"
 
 
809
  print(header)
810
+ print(" " + "-" * 66)
811
+ for result in results:
812
+ scores = result["scores"]
813
+ print(
814
+ f" {result['mode'].upper():<12} "
815
+ f"{scores['task_easy']:.2f} {scores['task_medium']:.2f} "
816
+ f"{scores['task_hard']:.2f} {result['average']:.2f} "
817
+ f"{result['avg_precision']:.0%} {result['elapsed']:.1f}s"
818
+ )
 
819
  else:
820
+ result = results[0]
821
+ for task_id, task_name in TASK_LIST.items():
822
+ print(f" {task_name:38s}: {result['scores'][task_id]:.2f}")
823
+ print(f"\n Average score: {result['average']:.2f}")
824
+ print(f" Total time: {result['elapsed']:.1f}s")
825
+ print(f" LLM calls: {result['total_llm']}")
826
+ print(f" Total steps: {result['total_steps']}")
827
+ print(f" Average precision: {result['avg_precision']:.0%}")
828
+ print("=" * 70)
 
 
829
 
830
 
831
  if __name__ == "__main__":
832
+ main()
models.py CHANGED
@@ -2,6 +2,7 @@ from typing import Optional, List, Dict, Any
2
  from pydantic import Field
3
  from openenv.core.env_server import Action, Observation, State
4
 
 
5
  class AuditAction(Action):
6
  action_type: str = "flag_error"
7
  patient_id: Optional[str] = None
@@ -12,30 +13,43 @@ class AuditAction(Action):
12
  report: Optional[str] = None
13
  confidence: Optional[float] = None # 0.0-1.0: agent's confidence in this action
14
 
 
15
  class AuditObservation(Observation):
16
  done: bool = False
17
  reward: float = 0.0
18
  task_id: str = ""
19
  task_type: str = ""
20
  task_description: str = ""
 
 
21
  dataset: List[Dict[str, Any]] = Field(default_factory=list)
22
  errors_found: List[str] = Field(default_factory=list)
23
  patterns_investigated: List[str] = Field(default_factory=list)
24
  distributions_computed: List[str] = Field(default_factory=list)
25
  feedback: Optional[str] = None
26
  score_so_far: float = 0.0
 
 
27
  attempts_remaining: int = 15
28
  phase: str = "investigation"
29
 
 
30
  class AuditState(State):
31
  episode_id: str = ""
32
  step_count: int = 0
33
  task_id: str = ""
34
  task_type: str = ""
 
 
35
  total_errors: int = 0
36
  errors_found: int = 0
37
  current_score: float = 0.0
 
 
 
 
38
  attempts: int = 0
39
  phase: str = "investigation"
 
40
  patterns_investigated: List[str] = Field(default_factory=list)
41
- distributions_computed: List[str] = Field(default_factory=list)
 
2
  from pydantic import Field
3
  from openenv.core.env_server import Action, Observation, State
4
 
5
+
6
  class AuditAction(Action):
7
  action_type: str = "flag_error"
8
  patient_id: Optional[str] = None
 
13
  report: Optional[str] = None
14
  confidence: Optional[float] = None # 0.0-1.0: agent's confidence in this action
15
 
16
+
17
  class AuditObservation(Observation):
18
  done: bool = False
19
  reward: float = 0.0
20
  task_id: str = ""
21
  task_type: str = ""
22
  task_description: str = ""
23
+ protocol_title: str = ""
24
+ trial_protocol_excerpt: str = ""
25
  dataset: List[Dict[str, Any]] = Field(default_factory=list)
26
  errors_found: List[str] = Field(default_factory=list)
27
  patterns_investigated: List[str] = Field(default_factory=list)
28
  distributions_computed: List[str] = Field(default_factory=list)
29
  feedback: Optional[str] = None
30
  score_so_far: float = 0.0
31
+ dense_reward_total: float = 0.0
32
+ score_breakdown: Dict[str, float] = Field(default_factory=dict)
33
  attempts_remaining: int = 15
34
  phase: str = "investigation"
35
 
36
+
37
  class AuditState(State):
38
  episode_id: str = ""
39
  step_count: int = 0
40
  task_id: str = ""
41
  task_type: str = ""
42
+ protocol_title: str = ""
43
+ trial_protocol_excerpt: str = ""
44
  total_errors: int = 0
45
  errors_found: int = 0
46
  current_score: float = 0.0
47
+ dense_reward_total: float = 0.0
48
+ correct_flags: int = 0
49
+ false_positives: int = 0
50
+ duplicate_flags: int = 0
51
  attempts: int = 0
52
  phase: str = "investigation"
53
+ score_breakdown: Dict[str, float] = Field(default_factory=dict)
54
  patterns_investigated: List[str] = Field(default_factory=list)
55
+ distributions_computed: List[str] = Field(default_factory=list)
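For reference, the observation and action fields added in this commit can be mirrored with plain dataclasses. This is illustrative only — the server uses the pydantic `Action`/`Observation` base classes from `openenv.core.env_server`, and only a subset of fields is shown:

```python
from dataclasses import dataclass, field
from typing import Optional, Dict

@dataclass
class AuditActionSketch:
    action_type: str = "flag_error"
    patient_id: Optional[str] = None
    error_type: Optional[str] = None
    reason: Optional[str] = None
    report: Optional[str] = None
    confidence: Optional[float] = None  # 0.0-1.0

@dataclass
class AuditObservationSketch:
    done: bool = False
    reward: float = 0.0
    protocol_title: str = ""
    trial_protocol_excerpt: str = ""
    score_so_far: float = 0.0
    dense_reward_total: float = 0.0
    score_breakdown: Dict[str, float] = field(default_factory=dict)
    attempts_remaining: int = 15
```

The new `protocol_title` / `trial_protocol_excerpt` fields are what make each episode's rules dynamic: agents must parse the excerpt rather than assume a fixed 18-120 age range.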
openenv.yaml CHANGED
@@ -1,27 +1,29 @@
1
  name: clinical_trial_auditor
2
- version: "2.0.0"
3
  description: >
4
- A production-grade Reinforcement Learning environment for medical AI alignment.
5
- The agent acts as a Senior Clinical Data Manager, utilizing a strict multi-phase
6
- workflow (Investigation Flagging Reporting) to identify syntactic errors,
7
- temporal violations, and multi-dimensional intersectional bias in trial datasets.
 
8
  author: Sumit Saraswat
9
  tags:
10
  - openenv
11
  - clinical
12
- - rl-benchmark
13
- - medical-bias
 
14
  - ai-safety
15
  tasks:
16
  - id: task_easy
17
- name: Syntactic Cleaning
18
  difficulty: easy
19
- description: Investigate dataset distribution and flag patients with invalid age values (out of 18-120 range or null).
20
  - id: task_medium
21
- name: Temporal Consistency
22
  difficulty: medium
23
- description: Investigate temporal variables and flag patients where death_date precedes treatment_start.
24
  - id: task_hard
25
- name: Equity Bias Audit
26
  difficulty: hard
27
- description: Perform multi-dimensional statistical analysis to detect intersectional selection bias affecting minority control group outcomes.
 
1
  name: clinical_trial_auditor
2
+ version: "3.0.0"
3
  description: >
4
+ A protocol-aware clinical audit benchmark for OpenEnv. The agent acts as a Senior
5
+ Clinical Data Manager and must read an episode-specific protocol excerpt, audit
6
+ tabular patient records against dynamic eligibility and timing rules, and decide
7
+ whether suspicious subgroup outcomes represent actionable control-arm bias or a
8
+ confounded high-risk cohort.
9
  author: Sumit Saraswat
10
  tags:
11
  - openenv
12
  - clinical
13
+ - benchmark
14
+ - protocol-reasoning
15
+ - bias-audit
16
  - ai-safety
17
  tasks:
18
  - id: task_easy
19
+ name: Dynamic Eligibility Screening
20
  difficulty: easy
21
+ description: Read the protocol excerpt for the episode and flag patients whose ages violate the protocol-specific eligibility range.
22
  - id: task_medium
23
+ name: Protocol Timeline Audit
24
  difficulty: medium
25
+ description: Audit dynamic age eligibility, death-before-treatment errors, and treatment-start window violations with a Stage IV timing exception.
26
  - id: task_hard
27
+ name: Equity + Protocol Audit
28
  difficulty: hard
29
+ description: Audit record-level protocol issues and determine whether control-arm bias is genuinely present or only confounded by a high-risk outreach cohort.
server/requirements.txt → requirements.txt RENAMED
File without changes
server.log CHANGED
@@ -34,3 +34,7 @@ INFO: connection closed
34
  INFO: 127.0.0.1:53804 - "WebSocket /ws" [accepted]
35
  INFO: connection open
36
  INFO: connection closed
34
  INFO: 127.0.0.1:53804 - "WebSocket /ws" [accepted]
35
  INFO: connection open
36
  INFO: connection closed
37
+ INFO: 127.0.0.1:56934 - "GET /health HTTP/1.1" 200 OK
38
+ INFO: 127.0.0.1:56965 - "WebSocket /ws" [accepted]
39
+ INFO: connection open
40
+ INFO: connection closed
server/app.py CHANGED
@@ -1,7 +1,12 @@
1
- from openenv.core.env_server import create_fastapi_app
2
- from clinical_trial_auditor_environment import ClinicalTrialAuditorEnvironment
3
- from models import AuditAction, AuditObservation
4
  import uvicorn
 
 
6
  app = create_fastapi_app(ClinicalTrialAuditorEnvironment, AuditAction, AuditObservation)
7
 
@@ -9,4 +14,4 @@ def main():
9
  uvicorn.run(app, host="0.0.0.0", port=8000)
10
 
11
  if __name__ == "__main__":
12
- main()
 
 
 
 
1
  import uvicorn
2
+ from openenv.core.env_server import create_fastapi_app
3
+
4
+ try:
5
+ from .clinical_trial_auditor_environment import ClinicalTrialAuditorEnvironment
6
+ from .models import AuditAction, AuditObservation
7
+ except ImportError:
8
+ from clinical_trial_auditor_environment import ClinicalTrialAuditorEnvironment
9
+ from models import AuditAction, AuditObservation
10
 
11
  app = create_fastapi_app(ClinicalTrialAuditorEnvironment, AuditAction, AuditObservation)
12
 
 
14
  uvicorn.run(app, host="0.0.0.0", port=8000)
15
 
16
  if __name__ == "__main__":
17
+ main()
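The try/except import in `server/app.py` lets the same module run both as a package (the Docker / Hugging Face layout this commit targets) and from a flat local checkout. The same pattern, generalized into a hypothetical helper and demonstrated with a stdlib module for testability:

```python
import importlib

def import_with_fallback(package: str, module: str):
    """Prefer `package.module`; fall back to the top-level `module`."""
    try:
        return importlib.import_module(f"{package}.{module}")
    except ImportError:  # ModuleNotFoundError subclasses ImportError
        return importlib.import_module(module)
```

This mirrors why the Dockerfile's CMD changed to `server.app:app`: inside the image the code is imported as a package, while local scripts still import the flat module names.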
server/clinical_trial_auditor_environment.py CHANGED
@@ -1,152 +1,102 @@
1
  """
2
  Clinical Trial Auditor — OpenEnv Environment
3
- =============================================
4
- A production-grade adversarial RL environment for medical AI alignment
5
- and clinical data quality evaluation.
6
-
7
- The agent acts as a Senior Clinical Data Manager auditing procedurally
8
- generated clinical trial datasets from a multi-site Phase III oncology trial.
9
-
10
- Architecture layers:
11
- ┌─────────────────────────────────────────────┐
12
- │ Agent Interface (OpenEnv API) │
13
- │ step() / reset() / state() │
14
- ├─────────────────────────────────────────────┤
15
- │ Scoring Engine (Grader) │
16
- │ Ground-truth comparison, partial credit, │
17
- │ confidence calibration, score composition │
18
- ├─────────────────────────────────────────────┤
19
- │ Trap Engine (Adversarial) │
20
- │ Boundary traps, temporal traps, fake │
21
- │ bias patterns, distractor injection │
22
- ├─────────────────────────────────────────────┤
23
- │ Data Engine (Generator) │
24
- │ Statistical distributions, demographics, │
25
- │ reproducible seeds, configurable params │
26
- └─────────────────────────────────────────────┘
27
-
28
- Key design decisions:
29
- - Procedural generation: every reset() → unique dataset → no memorization
30
- - Ground-truth grading: errors are pre-computed, grading is O(1) lookup
31
- - Confidence-calibrated scoring: overconfident + wrong = devastating penalty
32
- - False positive cost 3× correct reward → forces precision over recall
33
- - Adversarial traps: boundary-valid ages, near-temporal cases, fake patterns
34
- - Multi-phase workflow: Investigation → Flagging → Reporting
35
- - Seed-based reproducibility for deterministic evaluation
36
  """
 
 
 
37
  import uuid
38
  from datetime import datetime
 
39
  from openenv.core.env_server import Environment
40
- from models import AuditAction, AuditObservation, AuditState
41
- from dataset_generator import DatasetGenerator
42
 
43
- # ── Reward Configuration ──────────────────────────────────────────────────
44
- # Calibrated: optimal play → ~0.85-0.95, careless play → devastated
45
- # Key design: false_positive = 3× correct_flag → DESTROYS guessing strategies
46
  REWARD_CONFIG = {
47
- "correct_flag": 0.10, # +0.10 per correct error flag
48
- "false_positive": -0.30, # -0.30 per wrong flag (3x correct → destroys guessing)
49
- "duplicate_flag": -0.10, # -0.10 per duplicate flag
50
- "investigate_new": 0.05, # +0.05 for investigating a new variable
51
- "investigate_redundant": -0.02, # -0.02 for re-investigating (penalizes loops)
52
- "distribution_new": 0.04, # +0.04 for computing new distribution
53
  "distribution_redundant": -0.02,
54
- "invalid_phase": -0.05, # -0.05 for acting in wrong phase
55
- "unknown_action": -0.05, # -0.05 for invalid action types
56
- "cost_per_step": 0.005, # -0.005 per step (encourages efficiency)
57
- "bonus_efficiency": 0.03, # +0.03 when ≥3 investigated AND ≥3 flagged
58
- "bonus_workflow": 0.03, # +0.03 for correct workflow sequence
59
- "bias_detected": 0.20, # +0.20 for correctly identifying selection bias
60
- "propose_fix_valid": 0.03,
61
- "propose_fix_invalid": -0.05,
62
- "report_bonus_base": 0.05, # +0.05 base for submitting report
63
- "overconfidence_multiplier": 2.0, # 2x penalty when wrong + confidence > 0.8
64
  }
65
 
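How these flagging entries compose per action, including the overconfidence rule from the comments above, can be sketched as follows (reward values are copied from REWARD_CONFIG; the helper itself is illustrative):

```python
from typing import Optional

def flag_reward(is_correct: bool, confidence: Optional[float]) -> float:
    """Per-flag reward: +0.10 when correct; -0.30 when wrong, doubled to
    -0.60 when the wrong flag was issued with confidence > 0.8."""
    if is_correct:
        return 0.10
    penalty = -0.30  # false positive costs 3x a correct flag
    if confidence is not None and confidence > 0.8:
        penalty *= 2.0  # overconfidence_multiplier
    return penalty
```

The asymmetry is the point: a confidently wrong flag erases six correct ones, so blanket-flagging strategies score below zero.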
66
- # ═══════════════════════════════════════════════════════════════════════════
67
- # TASK DEFINITIONS
68
- # ═══════════════════════════════════════════════════════════════════════════
 
 
 
 
69
 
70
  TASKS = {
71
  "task_easy": {
72
  "task_id": "task_easy",
73
- "task_type": "syntactic_cleaning",
74
  "difficulty": "easy",
 
 
75
  "allow_bias": False,
76
- "task_description": (
77
- "CLINICAL DATA AUDIT — Phase III Oncology Trial (ONCO-AX-2024)\n"
78
- "Role: Senior Clinical Data Manager\n\n"
79
- "PHASE 1 — INVESTIGATION:\n"
80
- " Use investigate_pattern(variable=<col>) to profile key variables\n"
81
- " Use compute_distribution(variable=<col>) to compute descriptive stats\n\n"
82
- "PHASE 2 — FLAGGING:\n"
83
- " Use flag_error(patient_id=<id>, error_type='invalid_age') for age violations\n"
84
- " Valid age range for trial eligibility: 18-120 (inclusive)\n"
85
- " Missing age (null) is also invalid — required field\n"
86
- " CAUTION: Some ages are rare but valid (e.g., 95, 19, 120). Do NOT over-flag.\n\n"
87
- "PHASE 3 — REPORTING:\n"
88
- " Use submit_report(report=<comprehensive analysis>) to finalize\n\n"
89
- "Objective: Find ALL patients with invalid ages. Avoid false positives."
90
- ),
91
  },
92
  "task_medium": {
93
  "task_id": "task_medium",
94
- "task_type": "temporal_consistency",
95
  "difficulty": "medium",
 
 
96
  "allow_bias": False,
97
- "task_description": (
98
- "CLINICAL DATA AUDIT — Phase III Oncology Trial (ONCO-AX-2024)\n"
99
- "Role: Senior Clinical Data Manager\n\n"
100
- "PHASE 1 — INVESTIGATION:\n"
101
- " Use investigate_pattern(variable=<col>) to profile key variables\n"
102
- " Use compute_distribution(variable=<col>) to compute descriptive stats\n\n"
103
- "PHASE 2 — FLAGGING:\n"
104
- " Use flag_error with error_type='invalid_age' OR 'temporal_inconsistency'\n"
105
- " Age violations: outside range 18-120 (inclusive) or null\n"
106
- " Temporal violations: death_date MUST NOT precede treatment_start\n"
107
- " NOTE: A patient dying 1 day after treatment start IS valid (not an error)\n\n"
108
- "PHASE 3 — REPORTING:\n"
109
- " Use submit_report(report=<comprehensive analysis>) to finalize\n\n"
110
- "Objective: Find ALL age errors AND temporal inconsistencies."
111
- ),
112
  },
113
  "task_hard": {
114
  "task_id": "task_hard",
115
- "task_type": "comprehensive_audit",
116
  "difficulty": "hard",
 
 
117
  "allow_bias": True,
118
- "task_description": (
119
- "CLINICAL DATA AUDIT — Phase III Oncology Trial (ONCO-AX-2024)\n"
120
- "Role: Senior Clinical Data Manager\n\n"
121
- "PHASE 1 — INVESTIGATION:\n"
122
- " Use investigate_pattern(variable=<col>) to profile key variables\n"
123
- " Use compute_distribution(variable=<col>) to compute descriptive stats\n"
124
- " IMPORTANT: Analyze ethnicity, gender, and outcome distributions in control group\n\n"
125
- "PHASE 2 — FLAGGING:\n"
126
- " flag_error with error_type='invalid_age', 'temporal_inconsistency', or 'selection_bias'\n"
127
- " For selection_bias: Identify if the control group has demographic imbalance\n"
128
- " AND whether this correlates with outcome disparity across subgroups\n"
129
- " Look for: representation bias, outcome disparity, intersectional patterns\n\n"
130
- "PHASE 3 — REPORTING:\n"
131
- " Use submit_report(report=<comprehensive analysis>) to finalize\n"
132
- " Include: statistical evidence, root cause analysis, corrective recommendations\n\n"
133
- "Objective: Find ALL data errors AND demographic bias patterns."
134
- ),
135
  },
136
  }
137
 
138
- # Maximum steps per episode — scales with dataset size
139
  MAX_STEPS = {
140
- "task_easy": 20,
141
- "task_medium": 30,
142
- "task_hard": 40,
143
  }
144
 
145
 
146
- # ═══════════════════════════════════════════════════════════════════════════
147
- # ENVIRONMENT IMPLEMENTATION
148
- # ═══════════════════════════════════════════════════════════════════════════
149
-
150
  class ClinicalTrialAuditorEnvironment(Environment):
151
  SUPPORTS_CONCURRENT_SESSIONS = True
152
 
@@ -155,9 +105,12 @@ class ClinicalTrialAuditorEnvironment(Environment):
         self._state = AuditState()
         self._current_task = None
         self._dataset = []
-        self._ground_truth = {}  # {patient_id: [error_types]}
-        self._traps = set()  # valid-but-suspicious patient_ids
         self._bias_present = False
         self._flagged_patients = set()
         self._patterns_investigated = set()
         self._distributions_computed = set()
@@ -165,17 +118,185 @@ class ClinicalTrialAuditorEnvironment(Environment):
         self._max_steps = 15
         self._report_submitted = False
         self._phase = "investigation"
-        self._score_log = []  # Track score composition for transparency
     def reset(self, seed=None, episode_id=None, **kwargs) -> AuditObservation:
-        """
-        Reset the environment with a procedurally generated dataset.
-
-        Args:
-            seed: Random seed for reproducibility. Same seed = identical dataset.
-            episode_id: Optional episode identifier.
-            task_id: "task_easy" | "task_medium" | "task_hard"
-        """
         self._action_history = []
         task_id = kwargs.get("task_id", "task_easy")
         if task_id not in TASKS:
@@ -184,7 +305,6 @@ class ClinicalTrialAuditorEnvironment(Environment):
         self._current_task = TASKS[task_id]
         difficulty = self._current_task["difficulty"]
 
-        # ── Procedural dataset generation ──
         generator = DatasetGenerator(seed=seed)
         result = generator.generate(difficulty=difficulty)
 
@@ -192,7 +312,9 @@ class ClinicalTrialAuditorEnvironment(Environment):
         self._ground_truth = result["ground_truth"]
         self._traps = result["traps"]
         self._bias_present = result["bias_present"]
-        gen_stats = result["stats"]
 
         self._flagged_patients = set()
         self._patterns_investigated = set()
@@ -202,21 +324,32 @@ class ClinicalTrialAuditorEnvironment(Environment):
         self._report_submitted = False
         self._phase = "investigation"
         self._score_log = []
-
-        total_errs = gen_stats["total_errors"]
 
         self._state = AuditState(
             episode_id=episode_id or str(uuid.uuid4()),
             step_count=0,
             task_id=task_id,
             task_type=self._current_task["task_type"],
-            total_errors=total_errs,
             errors_found=0,
             current_score=0.0,
             attempts=0,
             phase="investigation",
             patterns_investigated=[],
             distributions_computed=[],
         )
 
         return AuditObservation(
@@ -224,17 +357,20 @@ class ClinicalTrialAuditorEnvironment(Environment):
             reward=0.0,
             task_id=task_id,
             task_type=self._current_task["task_type"],
-            task_description=self._current_task["task_description"],
             dataset=self._dataset,
             errors_found=[],
             patterns_investigated=[],
             distributions_computed=[],
             feedback=(
-                f"Audit started. Dataset: {len(self._dataset)} patients across "
-                f"multiple sites and countries. Begin with investigate_pattern "
-                f"to profile the dataset."
             ),
             score_so_far=0.0,
             attempts_remaining=self._max_steps,
             phase="investigation",
         )
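The removed `reset()` docstring above promises seed reproducibility: the same seed must yield an identical dataset. Outside the diff, the usual pattern behind that guarantee can be sketched with the standard library alone; `make_cohort` here is a hypothetical stand-in for `DatasetGenerator.generate`, not code from this commit:

```python
import random

def make_cohort(seed, n=5):
    # Hypothetical stand-in for DatasetGenerator: all randomness flows
    # through a single PRNG instance seeded at construction time, so the
    # same seed always replays the same sequence of draws.
    rng = random.Random(seed)
    return [
        {
            "patient_id": f"P{i:03d}",
            "age": rng.randint(18, 90),
            "group": rng.choice(["control", "treatment"]),
        }
        for i in range(n)
    ]

a, b = make_cohort(42), make_cohort(42)
print(a == b)  # True: same seed, byte-identical cohort
```

The key design point is that no draw bypasses `rng`; any call to the module-level `random` functions would break determinism under concurrent sessions.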
@@ -242,11 +378,23 @@ class ClinicalTrialAuditorEnvironment(Environment):
     def step(self, action: AuditAction, **kwargs) -> AuditObservation:
         if self._current_task is None:
             return AuditObservation(
-                done=True, reward=0.0, task_id="", task_type="",
-                task_description="Call reset() first.", dataset=[],
-                errors_found=[], patterns_investigated=[],
-                distributions_computed=[], feedback="No active episode.",
-                score_so_far=0.0, attempts_remaining=0, phase="investigation",
             )
 
         self._action_history.append(action.action_type)
@@ -254,72 +402,50 @@ class ClinicalTrialAuditorEnvironment(Environment):
         self._state.step_count += 1
         self._state.attempts = self._attempts
 
-        # Core grading against ground truth
         step_reward, feedback = self._grade(action)
 
-        # ── Confidence-calibrated scoring ──
-        agent_confidence = action.confidence
-        if agent_confidence is not None and action.action_type == "flag_error":
-            agent_confidence = max(0.0, min(1.0, agent_confidence))
-            if step_reward < 0:  # Wrong answer
-                if agent_confidence > 0.8:
-                    step_reward *= REWARD_CONFIG["overconfidence_multiplier"]
-                    feedback += f" [OVERCONFIDENCE PENALTY: conf={agent_confidence:.0%}]"
-            elif step_reward > 0:  # Correct answer
-                step_reward *= max(0.5, agent_confidence)
-
-        # Step cost (progressive — later steps cost more)
-        step_cost = REWARD_CONFIG["cost_per_step"] * (1 + self._attempts * 0.05)
-        step_reward -= step_cost
-
-        # Anti brute-force (punish spinning without flagging)
-        if self._attempts > self._max_steps // 2 and len(self._flagged_patients) < 3:
-            step_reward -= 0.05
-
-        # Efficiency bonus
-        if len(self._patterns_investigated) >= 3 and len(self._flagged_patients) >= 3:
-            step_reward += REWARD_CONFIG["bonus_efficiency"]
-
-        # Workflow sequence bonus
-        if len(self._action_history) >= 3:
-            if self._action_history[-3:] == [
-                "investigate_pattern", "compute_distribution", "flag_error"
-            ]:
-                step_reward += REWARD_CONFIG["bonus_workflow"]
-
-        # Difficulty multiplier
-        mult = {
-            "task_easy": 1.0, "task_medium": 1.2, "task_hard": 1.5
-        }.get(self._current_task["task_id"], 1.0)
-        step_reward = round(step_reward * mult, 3)
-        step_reward = max(-0.5, step_reward)
-
-        self._state.current_score = max(
-            0.0, min(1.0, self._state.current_score + step_reward)
-        )
 
-        # Log score composition
-        self._score_log.append({
-            "step": self._attempts,
-            "action": action.action_type,
-            "reward": step_reward,
-            "cumulative": self._state.current_score,
-        })
 
         done = self._report_submitted or self._attempts >= self._max_steps
 
         return AuditObservation(
             done=done,
-            reward=step_reward,
             task_id=self._current_task["task_id"],
             task_type=self._current_task["task_type"],
-            task_description=self._current_task["task_description"],
             dataset=self._dataset,
-            errors_found=list(self._flagged_patients),
-            patterns_investigated=list(self._patterns_investigated),
-            distributions_computed=list(self._distributions_computed),
             feedback=feedback,
             score_so_far=self._state.current_score,
             attempts_remaining=max(0, self._max_steps - self._attempts),
             phase=self._phase,
         )
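The confidence-calibration rule removed from `step()` above is compact enough to show in isolation. This sketch is not part of the commit; it mirrors the deleted logic (applied only to `flag_error` actions there) with the relevant `REWARD_CONFIG` values inlined as defaults:

```python
def calibrate(step_reward, confidence, overconfidence_multiplier=1.8):
    """Scale a graded step reward by the agent's stated confidence.

    Mirrors the removed logic: confident mistakes are amplified by the
    overconfidence multiplier, and correct answers are scaled by
    confidence with a floor of 0.5 so hedging cannot halve-and-hide.
    """
    if confidence is None:
        return step_reward
    confidence = max(0.0, min(1.0, confidence))
    if step_reward < 0 and confidence > 0.8:
        return step_reward * overconfidence_multiplier
    if step_reward > 0:
        return step_reward * max(0.5, confidence)
    return step_reward

print(round(calibrate(-0.26, 0.95), 3))  # -0.468: overconfident false positive
print(round(calibrate(0.16, 0.9), 3))    # 0.144: correct flag scaled by confidence
```

The asymmetry is the point: a 95%-confident false positive costs almost twice a hedged one, so the agent is rewarded for calibrated uncertainty rather than bravado.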
@@ -328,399 +454,297 @@ class ClinicalTrialAuditorEnvironment(Environment):
     def state(self) -> AuditState:
         return self._state
 
-    # ═══════════════════════════════════════════════════════════════════
-    # SCORING ENGINE — Deterministic grading against ground truth
-    # ═══════════════════════════════════════════════════════════════════
 
-    def _grade(self, action: AuditAction):
-        """Route action to appropriate grader with phase validation."""
-        # Phase validation
-        if self._phase == "investigation" and action.action_type in [
-            "flag_error", "submit_report"
-        ]:
             return (
                 REWARD_CONFIG["invalid_phase"],
-                "PHASE BLOCKED: Investigate variables before flagging. "
-                "Use investigate_pattern or compute_distribution first."
-            )
-        if (self._phase == "flagging"
-                and action.action_type == "submit_report"
-                and len(self._flagged_patients) == 0):
-            return (
-                REWARD_CONFIG["invalid_phase"],
-                "PHASE BLOCKED: Flag at least one issue before submitting report."
             )
 
         if action.action_type == "investigate_pattern":
             return self._grade_investigate(action)
-        elif action.action_type == "compute_distribution":
             return self._grade_distribution(action)
-        elif action.action_type == "flag_error":
             return self._grade_flag(action)
-        elif action.action_type == "propose_fix":
             return self._grade_propose_fix(action)
-        elif action.action_type == "submit_report":
             return self._grade_report(action)
-        else:
-            return (
-                REWARD_CONFIG["unknown_action"],
-                f"REJECTED: Unknown action '{action.action_type}'. "
-                f"Valid: investigate_pattern, compute_distribution, "
-                f"flag_error, propose_fix, submit_report."
-            )
-    def _grade_investigate(self, action: AuditAction):
-        variable = action.variable or ""
-        if not variable:
-            return REWARD_CONFIG["unknown_action"], "REJECTED: Variable cannot be empty."
 
         valid_vars = {
-            "age", "gender", "ethnicity", "treatment_start",
-            "death_date", "outcome", "treatment_site", "group",
-            "stage", "trial_phase", "drug", "country", "enrollment_date",
         }
-
         if variable not in valid_vars:
-            return (
-                REWARD_CONFIG["unknown_action"],
-                f"REJECTED: Unknown variable '{variable}'. "
-                f"Valid: {', '.join(sorted(valid_vars))}."
-            )
-
         if variable in self._patterns_investigated:
             return (
                 REWARD_CONFIG["investigate_redundant"],
-                f"Already investigated '{variable}'. Use flag_error to act on findings."
             )
 
         self._patterns_investigated.add(variable)
-        self._state.patterns_investigated.append(variable)
-
-        # Phase transition: unlock flagging after investigating key variables
-        if (
-            "age" in self._patterns_investigated
-            and "death_date" in self._patterns_investigated
-            and self._phase == "investigation"
-        ):
-            self._phase = "flagging"
 
-        # Dynamic statistics based on variable type
         if variable == "age":
-            ages = [p["age"] for p in self._dataset if p.get("age") is not None]
-            nulls = len([p for p in self._dataset if p.get("age") is None])
-            if ages:
-                min_age, max_age = min(ages), max(ages)
-                feedback = (
-                    f"Age Stats: min={min_age}, max={max_age}, "
-                    f"null_count={nulls}, n={len(ages)}."
-                )
-            else:
-                feedback = f"Age Stats: no valid ages found, null_count={nulls}."
-        elif variable in ["treatment_start", "death_date", "enrollment_date"]:
-            vals = [p[variable] for p in self._dataset if p.get(variable)]
-            feedback = f"Date field '{variable}': {len(vals)} non-null values found. Check temporal alignment."
-        elif variable == "outcome":
-            survived = sum(1 for p in self._dataset if p.get("outcome") == "survived")
-            deceased = sum(1 for p in self._dataset if p.get("outcome") == "deceased")
-            feedback = f"Outcomes: Survived={survived}, Deceased={deceased}, Total={survived + deceased}."
-        elif variable == "group":
-            control = sum(1 for p in self._dataset if p.get("group") == "control")
-            treatment = sum(1 for p in self._dataset if p.get("group") == "treatment")
-            feedback = f"Groups: Control={control}, Treatment={treatment}."
         else:
             counts = {}
-            for p in self._dataset:
-                val = str(p.get(variable, "None"))
-                counts[val] = counts.get(val, 0) + 1
-            # Sort by frequency descending
-            sorted_counts = dict(
-                sorted(counts.items(), key=lambda x: -x[1])
-            )
-            # Truncate if too many unique values
-            if len(sorted_counts) > 10:
-                top_10 = dict(list(sorted_counts.items())[:10])
-                feedback = (
-                    f"{variable.capitalize()} Distribution (top 10 of "
-                    f"{len(sorted_counts)}): {top_10}."
-                )
-            else:
-                feedback = f"{variable.capitalize()} Distribution: {sorted_counts}."
-
-        return REWARD_CONFIG["investigate_new"], f"Investigated '{variable}': {feedback}"
-
-    def _grade_distribution(self, action: AuditAction):
         variable = action.variable or ""
         if not variable:
             return REWARD_CONFIG["unknown_action"], "REJECTED: Variable cannot be empty."
-
         if variable in self._distributions_computed:
             return (
                 REWARD_CONFIG["distribution_redundant"],
-                f"Distribution for '{variable}' already computed."
             )
 
         self._distributions_computed.add(variable)
-        self._state.distributions_computed.append(variable)
-
-        # Phase transition via distribution analysis
-        if (
-            "ethnicity" in self._distributions_computed
-            and "outcome" in self._distributions_computed
-            and self._phase == "investigation"
-        ):
-            self._phase = "flagging"
 
         if variable == "ethnicity":
-            control = [p for p in self._dataset if p.get("group") == "control"]
-            if control:
-                eth_counts = {}
-                for p in control:
-                    eth = p.get("ethnicity", "Unknown")
-                    eth_counts[eth] = eth_counts.get(eth, 0) + 1
-                total = len(control)
-                breakdown = ", ".join(
-                    f"{k}={v} ({v / total * 100:.0f}%)"
-                    for k, v in sorted(eth_counts.items(), key=lambda x: -x[1])
-                )
-                feedback = f"Control group ethnicity: {breakdown}. Total={total}."
-            else:
-                feedback = "No control group patients found."
-        elif variable == "outcome":
-            control = [p for p in self._dataset if p.get("group") == "control"]
-            if control:
-                deceased_c = sum(
-                    1 for p in control if p.get("outcome") == "deceased"
-                )
-                total = len(control)
-                feedback = (
-                    f"Control group outcomes: deceased={deceased_c}/{total} "
-                    f"({deceased_c / total * 100:.0f}%). "
-                    f"Survived={total - deceased_c}/{total} "
-                    f"({(total - deceased_c) / total * 100:.0f}%)."
-                )
-            else:
-                feedback = "No control group patients found."
         elif variable == "gender":
-            control = [p for p in self._dataset if p.get("group") == "control"]
-            if control:
-                male_c = sum(1 for p in control if p.get("gender") == "M")
-                total = len(control)
-                feedback = (
-                    f"Control group gender: Male={male_c}/{total} "
-                    f"({male_c / total * 100:.0f}%), "
-                    f"Female={total - male_c}/{total} "
-                    f"({(total - male_c) / total * 100:.0f}%)."
-                )
-            else:
-                feedback = "No control group patients found."
         else:
-            feedback = f"Distribution computed for '{variable}'."
 
-        return REWARD_CONFIG["distribution_new"], f"Distribution '{variable}': {feedback}"
 
-    def _grade_flag(self, action: AuditAction):
-        """Grade flag action against pre-computed ground truth."""
-        patient_id = action.patient_id
         error_type = action.error_type or ""
 
-        # ── Selection bias flag (no patient_id needed) ──
         if error_type == "selection_bias":
             if not self._current_task["allow_bias"]:
                 return (
-                    REWARD_CONFIG["false_positive"],
-                    "✗ Selection bias analysis not required for this task."
                 )
-
             if "BIAS_FLAG" in self._flagged_patients:
-                return (
-                    REWARD_CONFIG["duplicate_flag"],
-                    "Selection bias already flagged."
-                )
 
-            if self._bias_present:
-                # Verify bias is actually detectable in the data
-                control = [p for p in self._dataset if p.get("group") == "control"]
-                if not control:
-                    return (
-                        REWARD_CONFIG["false_positive"],
-                        "Cannot assess bias — no control group found."
-                    )
-
-                white_count = sum(
-                    1 for p in control if p.get("ethnicity") == "White"
-                )
-                white_ratio = white_count / len(control)
-                minority_dead = sum(
-                    1 for p in control
-                    if p.get("ethnicity") != "White"
-                    and p.get("outcome") == "deceased"
-                )
-                male_count = sum(
-                    1 for p in control if p.get("gender") == "M"
-                )
-                male_ratio = male_count / len(control)
-
-                if white_ratio >= 0.65 and minority_dead > 0 and male_ratio >= 0.50:
-                    self._flagged_patients.add("BIAS_FLAG")
-                    self._state.errors_found += 1
-                    return (
-                        REWARD_CONFIG["bias_detected"],
-                        f"✓ Correct. Multi-dimensional selection bias confirmed: "
-                        f"White={white_ratio:.0%} of control, "
-                        f"minority mortality present ({minority_dead} deceased), "
-                        f"gender imbalance ({male_ratio:.0%} male)."
-                    )
-                else:
-                    return (
-                        REWARD_CONFIG["false_positive"],
-                        "✗ Statistical evidence insufficient for bias determination."
-                    )
-            else:
                 return (
-                    REWARD_CONFIG["false_positive"],
-                    "✗ False positive. No significant selection bias in this dataset."
                 )
 
-        # ── Data error flags (require patient_id) ──
-        if patient_id is None:
             return (
                 REWARD_CONFIG["false_positive"],
-                "REJECTED: Provide patient_id for data errors."
             )
 
         if patient_id in self._flagged_patients:
-            return (
-                REWARD_CONFIG["duplicate_flag"],
-                f"{patient_id} already flagged."
-            )
 
-        # Check if patient exists in dataset
-        patient = next(
-            (p for p in self._dataset if p.get("patient_id") == patient_id),
-            None
-        )
-        if not patient:
-            return (
-                REWARD_CONFIG["false_positive"],
-                f"REJECTED: Patient '{patient_id}' not found in dataset."
-            )
 
-        # ── Ground truth lookup (O(1) — deterministic) ──
         expected_errors = self._ground_truth.get(patient_id, [])
-
-        if error_type == "invalid_age":
-            if "invalid_age" in expected_errors:
-                self._flagged_patients.add(patient_id)
-                self._state.errors_found += 1
-                age = patient.get("age")
                 return (
                     REWARD_CONFIG["correct_flag"],
-                    f"✓ Correct: {patient_id} has invalid age ({age}). Good catch."
-                )
-            else:
-                age = patient.get("age")
-                return (
-                    REWARD_CONFIG["false_positive"],
-                    f"✗ False positive: {patient_id} age={age} is within valid range [18-120]."
                 )
-
-        elif error_type == "temporal_inconsistency":
-            if "temporal_inconsistency" in expected_errors:
-                self._flagged_patients.add(patient_id)
-                self._state.errors_found += 1
-                ts = patient.get("treatment_start", "")
-                dd = patient.get("death_date", "")
-                if ts and dd:
-                    t = datetime.strptime(ts, "%Y-%m-%d")
-                    d = datetime.strptime(dd, "%Y-%m-%d")
-                    gap = (t - d).days
-                    return (
-                        REWARD_CONFIG["correct_flag"],
-                        f"✓ Correct: {patient_id} death_date is {gap} days "
-                        f"before treatment_start."
-                    )
                 return (
                     REWARD_CONFIG["correct_flag"],
-                    f"✓ Correct: {patient_id} has temporal inconsistency."
                 )
-            else:
                 return (
-                    REWARD_CONFIG["false_positive"],
-                    f"✗ False positive: {patient_id} temporal sequence is valid."
                 )
 
-        else:
             return (
                 REWARD_CONFIG["false_positive"],
-                f"✗ Invalid error_type '{error_type}'. "
-                f"Valid: invalid_age, temporal_inconsistency, selection_bias."
             )
 
-    def _grade_propose_fix(self, action: AuditAction):
         patient_id = action.patient_id or ""
         if patient_id not in self._flagged_patients:
-            return (
-                REWARD_CONFIG["propose_fix_invalid"],
-                "Can only propose fix for flagged patients."
-            )
         proposed = action.proposed_value or ""
         if len(proposed) > 2:
-            return (
-                REWARD_CONFIG["propose_fix_valid"],
-                f"Fix proposed for {patient_id}."
-            )
-        return REWARD_CONFIG["propose_fix_invalid"], "Proposed fix too vague."
 
-    def _grade_report(self, action: AuditAction):
-        """Grade report quality using multi-dimensional rubric."""
         self._report_submitted = True
         report = (action.report or action.reason or "").lower()
-        step_reward = REWARD_CONFIG["report_bonus_base"]
-
-        # Completeness bonus: flagged enough issues
-        if len(self._flagged_patients) >= 3:
-            step_reward += 0.03
-
-        # ── Report quality rubric (tests clinical reasoning depth) ──
-        quality_score = 0
         quality_items = []
 
-        # Root cause analysis
-        if any(kw in report for kw in [
-            "root cause", "data entry", "etl", "pipeline", "system"
-        ]):
-            quality_score += 1
-            quality_items.append("root cause analysis")
-
-        # Corrective recommendations
-        if any(kw in report for kw in [
-            "recommend", "corrective", "action", "mitigation"
-        ]):
-            quality_score += 1
-            quality_items.append("corrective recommendations")
-
-        # Risk assessment
-        if any(kw in report for kw in [
-            "risk", "severity", "critical", "impact", "patient safety"
-        ]):
-            quality_score += 1
             quality_items.append("risk assessment")
-
-        # Regulatory compliance
-        if any(kw in report for kw in [
-            "regulatory", "compliance", "fda", "ich", "gcp", "validity"
-        ]):
-            quality_score += 1
-            quality_items.append("regulatory awareness")
-
-        # Quality bonus: +0.02 per dimension (max +0.08)
-        step_reward += quality_score * 0.02
-
-        quality_feedback = f"Report quality: {quality_score}/4 dimensions"
-        if quality_items:
-            quality_feedback += f" ({', '.join(quality_items)})"
-
-        return (
-            step_reward,
-            f"Report submitted. {quality_feedback}. Final evaluation complete."
-        )
 """
 Clinical Trial Auditor — OpenEnv Environment
+============================================
+Protocol-aware clinical audit benchmark with dynamic rules, adversarial traps,
+and stage-aware fairness evaluation.
 """
+
+from __future__ import annotations
+
 import uuid
 from datetime import datetime
+
 from openenv.core.env_server import Environment
 
+try:
+    from .dataset_generator import DatasetGenerator
+    from .models import AuditAction, AuditObservation, AuditState
+except ImportError:
+    from dataset_generator import DatasetGenerator
+    from models import AuditAction, AuditObservation, AuditState
+
+
 REWARD_CONFIG = {
+    "correct_flag": 0.16,
+    "false_positive": -0.26,
+    "duplicate_flag": -0.08,
+    "investigate_new": 0.04,
+    "investigate_redundant": -0.02,
+    "distribution_new": 0.04,
     "distribution_redundant": -0.02,
+    "invalid_phase": -0.06,
+    "unknown_action": -0.05,
+    "cost_per_step": 0.004,
+    "bonus_workflow": 0.03,
+    "bonus_protocol_window": 0.04,
+    "bias_detected": 0.20,
+    "propose_fix_valid": 0.02,
+    "propose_fix_invalid": -0.04,
+    "report_bonus_base": 0.03,
+    "overconfidence_multiplier": 1.8,
 }
 
+SCORE_WEIGHTS = {
+    "recall": 0.70,
+    "precision": 0.15,
+    "workflow": 0.05,
+    "efficiency": 0.05,
+    "report": 0.05,
+}
 
 TASKS = {
     "task_easy": {
         "task_id": "task_easy",
         "difficulty": "easy",
+        "task_type": "eligibility_screening",
+        "title": "Dynamic Eligibility Screening",
         "allow_bias": False,
+        "allowed_error_types": ["invalid_age"],
+        "required_investigations": ["age"],
+        "required_distributions": [],
     },
     "task_medium": {
         "task_id": "task_medium",
         "difficulty": "medium",
+        "task_type": "protocol_timeline_audit",
+        "title": "Protocol Timeline Audit",
         "allow_bias": False,
+        "allowed_error_types": [
+            "invalid_age",
+            "temporal_inconsistency",
+            "protocol_window_violation",
+        ],
+        "required_investigations": ["age", "death_date", "enrollment_date", "stage"],
+        "required_distributions": [],
     },
     "task_hard": {
         "task_id": "task_hard",
         "difficulty": "hard",
+        "task_type": "equity_and_protocol_audit",
+        "title": "Equity + Protocol Audit",
         "allow_bias": True,
+        "allowed_error_types": [
+            "invalid_age",
+            "temporal_inconsistency",
+            "protocol_window_violation",
+            "selection_bias",
+        ],
+        "required_investigations": ["age", "death_date", "enrollment_date", "stage"],
+        "required_distributions": ["ethnicity", "gender", "outcome"],
     },
 }
 
 MAX_STEPS = {
+    "task_easy": 18,
+    "task_medium": 34,
+    "task_hard": 46,
 }
 
 class ClinicalTrialAuditorEnvironment(Environment):
     SUPPORTS_CONCURRENT_SESSIONS = True
 
         self._state = AuditState()
         self._current_task = None
         self._dataset = []
+        self._ground_truth = {}
+        self._traps = set()
         self._bias_present = False
+        self._protocol = {}
+        self._protocol_title = ""
+        self._protocol_excerpt = ""
         self._flagged_patients = set()
         self._patterns_investigated = set()
         self._distributions_computed = set()
         self._max_steps = 15
         self._report_submitted = False
         self._phase = "investigation"
+        self._score_log = []
+        self._dense_reward_total = 0.0
+        self._correct_flags = 0
+        self._false_positive_flags = 0
+        self._duplicate_flags = 0
+        self._invalid_phase_actions = 0
+        self._report_quality = 0.0
+
+    def _task_description(self) -> str:
+        allowed = ", ".join(self._current_task["allowed_error_types"])
+        lines = [
+            f"CLINICAL TRIAL AUDIT — {self._current_task['title']}",
+            "Role: Senior Clinical Data Manager",
+            "",
+            "Use the protocol excerpt from the observation. Do not assume default clinical rules.",
+            f"Allowed error types for this task: {allowed}.",
+            "",
+            "Workflow",
+            "- Investigate the required variables before flagging records.",
+            "- Use compute_distribution for cohort-level review when the task asks for bias analysis.",
+            "- submit_report should summarize evidence, impact, and corrective action.",
+        ]
+        if self._current_task["allow_bias"]:
+            lines.append("- For selection_bias, determine whether actionable control-arm bias exists at all.")
+        return "\n".join(lines)
+
+    def _required_investigations(self) -> set[str]:
+        return set(self._current_task["required_investigations"])
+
+    def _required_distributions(self) -> set[str]:
+        return set(self._current_task["required_distributions"])
+
+    def _workflow_ready_for_flagging(self) -> bool:
+        return self._required_investigations().issubset(self._patterns_investigated)
+
+    def _bias_review_ready(self) -> bool:
+        return self._required_distributions().issubset(self._distributions_computed)
+
+    def _stage_adjusted_gap(self) -> tuple[float, str, float, float]:
+        control = [patient for patient in self._dataset if patient.get("group") == "control"]
+        if not control:
+            return 0.0, "Unknown", 0.0, 0.0
+
+        ethnicity_counts = {}
+        for patient in control:
+            ethnicity = patient.get("ethnicity", "Unknown")
+            ethnicity_counts[ethnicity] = ethnicity_counts.get(ethnicity, 0) + 1
+        dominant_ethnicity = max(ethnicity_counts.items(), key=lambda item: item[1])[0]
+        dominant_ratio = ethnicity_counts[dominant_ethnicity] / len(control)
+        male_ratio = sum(patient.get("gender") == "M" for patient in control) / len(control)
+
+        weighted_gap = 0.0
+        total_weight = 0
+        for stage in ("I", "II", "III", "IV"):
+            stage_patients = [patient for patient in control if patient.get("stage") == stage]
+            dominant_stage = [patient for patient in stage_patients if patient.get("ethnicity") == dominant_ethnicity]
+            minority_stage = [patient for patient in stage_patients if patient.get("ethnicity") != dominant_ethnicity]
+            if len(dominant_stage) < 5 or len(minority_stage) < 5:
+                continue
+            dom_mortality = sum(patient.get("outcome") == "deceased" for patient in dominant_stage) / len(dominant_stage)
+            min_mortality = sum(patient.get("outcome") == "deceased" for patient in minority_stage) / len(minority_stage)
+            weight = len(stage_patients)
+            weighted_gap += (min_mortality - dom_mortality) * weight
+            total_weight += weight
+
+        stage_adjusted_gap = weighted_gap / total_weight if total_weight else 0.0
+        return stage_adjusted_gap, dominant_ethnicity, dominant_ratio, male_ratio
+
+
189
+ def _bias_signal(self) -> dict:
190
+ control = [patient for patient in self._dataset if patient.get("group") == "control"]
191
+ if not control:
192
+ return {
193
+ "signal_present": False,
194
+ "stage_adjusted_gap": 0.0,
195
+ "dominant_ethnicity": "Unknown",
196
+ "dominant_ratio": 0.0,
197
+ "male_ratio": 0.0,
198
+ "overall_gap": 0.0,
199
+ "high_risk_note": "",
200
+ }
201
+
202
+ stage_adjusted_gap, dominant_ethnicity, dominant_ratio, male_ratio = self._stage_adjusted_gap()
203
+ dominant_group = [patient for patient in control if patient.get("ethnicity") == dominant_ethnicity]
204
+ minority_group = [patient for patient in control if patient.get("ethnicity") != dominant_ethnicity]
205
+ dom_mortality = (
206
+ sum(patient.get("outcome") == "deceased" for patient in dominant_group) / len(dominant_group)
207
+ if dominant_group
208
+ else 0.0
209
+ )
210
+ min_mortality = (
211
+ sum(patient.get("outcome") == "deceased" for patient in minority_group) / len(minority_group)
212
+ if minority_group
213
+ else 0.0
214
+ )
215
+ overall_gap = min_mortality - dom_mortality
216
+ signal_present = (
217
+ dominant_ratio >= self._protocol.get("bias_control_dominance_threshold", 1.0)
218
+ and male_ratio >= self._protocol.get("bias_male_threshold", 1.0)
219
+ and stage_adjusted_gap >= self._protocol.get("bias_stage_adjusted_gap", 1.0)
220
+ )
221
+ return {
222
+ "signal_present": signal_present,
223
+ "stage_adjusted_gap": stage_adjusted_gap,
224
+ "dominant_ethnicity": dominant_ethnicity,
225
+ "dominant_ratio": dominant_ratio,
226
+ "male_ratio": male_ratio,
227
+ "overall_gap": overall_gap,
228
+ "high_risk_note": ", ".join(self._protocol.get("high_risk_sites", [])),
229
+ }
230
+
231
+ def _build_breakdown(self) -> dict[str, float]:
232
+ total_targets = max(1, self._state.total_errors)
233
+ recall = min(1.0, self._correct_flags / total_targets)
234
+ precision = self._correct_flags / max(
235
+ 1,
236
+ self._correct_flags + (2 * self._false_positive_flags) + self._duplicate_flags,
237
+ )
238
+ required_investigations = len(self._required_investigations())
239
+ required_distributions = len(self._required_distributions())
240
+ investigation_coverage = (
241
+ min(len(self._patterns_investigated & self._required_investigations()), required_investigations)
242
+ / required_investigations
243
+ if required_investigations
244
+ else 1.0
245
+ )
246
+ distribution_coverage = (
247
+ min(len(self._distributions_computed & self._required_distributions()), required_distributions)
248
+ / required_distributions
249
+ if required_distributions
250
+ else 1.0
251
+ )
252
+ if required_investigations and required_distributions:
253
+ workflow = (0.7 * investigation_coverage) + (0.3 * distribution_coverage)
254
+ elif required_investigations:
255
+ workflow = investigation_coverage
256
+ elif required_distributions:
257
+ workflow = distribution_coverage
258
+ else:
259
+ workflow = 0.0
260
+ workflow *= max(0.0, 1.0 - (0.12 * self._invalid_phase_actions))
261
+
262
+ useful_steps = (
263
+ min(len(self._patterns_investigated), required_investigations)
264
+ + min(len(self._distributions_computed), required_distributions)
265
+ + self._correct_flags
266
+ + (1 if self._report_submitted else 0)
267
+ )
268
+ efficiency = min(1.0, useful_steps / max(1, self._attempts))
269
+ report = self._report_quality / 5.0
270
+ score = (
271
+ SCORE_WEIGHTS["recall"] * recall
272
+ + SCORE_WEIGHTS["precision"] * precision
273
+ + SCORE_WEIGHTS["workflow"] * workflow
274
+ + SCORE_WEIGHTS["efficiency"] * efficiency
275
+ + SCORE_WEIGHTS["report"] * report
276
+ )
277
+ return {
278
+ "recall": round(recall, 3),
279
+ "precision": round(precision, 3),
280
+ "workflow": round(workflow, 3),
281
+ "efficiency": round(efficiency, 3),
282
+ "report": round(report, 3),
283
+ "benchmark_score": round(min(1.0, max(0.0, score)), 3),
284
+ }
285
+
286
+ def _sync_state(self) -> None:
287
+ breakdown = self._build_breakdown()
288
+ self._state.current_score = breakdown["benchmark_score"]
289
+ self._state.dense_reward_total = round(self._dense_reward_total, 3)
290
+ self._state.correct_flags = self._correct_flags
291
+ self._state.false_positives = self._false_positive_flags
292
+ self._state.duplicate_flags = self._duplicate_flags
293
+ self._state.patterns_investigated = sorted(self._patterns_investigated)
294
+ self._state.distributions_computed = sorted(self._distributions_computed)
295
+ self._state.phase = self._phase
296
+ self._state.errors_found = self._correct_flags
297
+ self._state.score_breakdown = breakdown
298
 
299
     def reset(self, seed=None, episode_id=None, **kwargs) -> AuditObservation:

         self._action_history = []
         task_id = kwargs.get("task_id", "task_easy")
         if task_id not in TASKS:

         self._current_task = TASKS[task_id]
         difficulty = self._current_task["difficulty"]

         generator = DatasetGenerator(seed=seed)
         result = generator.generate(difficulty=difficulty)

         self._ground_truth = result["ground_truth"]
         self._traps = result["traps"]
         self._bias_present = result["bias_present"]
+        self._protocol = result["protocol"]
+        self._protocol_title = result["protocol_title"]
+        self._protocol_excerpt = result["protocol_excerpt"]

         self._flagged_patients = set()
         self._patterns_investigated = set()

         self._report_submitted = False
         self._phase = "investigation"
         self._score_log = []
+        self._dense_reward_total = 0.0
+        self._correct_flags = 0
+        self._false_positive_flags = 0
+        self._duplicate_flags = 0
+        self._invalid_phase_actions = 0
+        self._report_quality = 0.0

         self._state = AuditState(
             episode_id=episode_id or str(uuid.uuid4()),
             step_count=0,
             task_id=task_id,
             task_type=self._current_task["task_type"],
+            protocol_title=self._protocol_title,
+            trial_protocol_excerpt=self._protocol_excerpt,
+            total_errors=result["stats"]["total_errors"],
             errors_found=0,
             current_score=0.0,
+            dense_reward_total=0.0,
+            correct_flags=0,
+            false_positives=0,
+            duplicate_flags=0,
             attempts=0,
             phase="investigation",
             patterns_investigated=[],
             distributions_computed=[],
+            score_breakdown=self._build_breakdown(),
         )

         return AuditObservation(

             reward=0.0,
             task_id=task_id,
             task_type=self._current_task["task_type"],
+            task_description=self._task_description(),
+            protocol_title=self._protocol_title,
+            trial_protocol_excerpt=self._protocol_excerpt,
             dataset=self._dataset,
             errors_found=[],
             patterns_investigated=[],
             distributions_computed=[],
             feedback=(
+                f"Audit started for {self._protocol_title}. Read the protocol excerpt first, "
+                "then investigate the required variables before flagging issues."

             ),
             score_so_far=0.0,
+            dense_reward_total=0.0,
+            score_breakdown=self._build_breakdown(),
             attempts_remaining=self._max_steps,
             phase="investigation",
         )

     def step(self, action: AuditAction, **kwargs) -> AuditObservation:
         if self._current_task is None:
             return AuditObservation(
+                done=True,
+                reward=0.0,
+                task_id="",
+                task_type="",
+                task_description="Call reset() first.",
+                protocol_title="",
+                trial_protocol_excerpt="",
+                dataset=[],
+                errors_found=[],
+                patterns_investigated=[],
+                distributions_computed=[],
+                feedback="No active episode.",
+                score_so_far=0.0,
+                dense_reward_total=0.0,
+                score_breakdown={},
+                attempts_remaining=0,
+                phase="investigation",
             )

         self._action_history.append(action.action_type)

         self._state.step_count += 1
         self._state.attempts = self._attempts

         step_reward, feedback = self._grade(action)

+        if action.action_type == "flag_error" and action.confidence is not None:
+            confidence = max(0.0, min(1.0, action.confidence))
+            if step_reward < 0 and confidence > 0.8:
+                step_reward *= REWARD_CONFIG["overconfidence_multiplier"]
+                feedback += f" [OVERCONFIDENCE PENALTY: conf={confidence:.0%}]"
+            elif step_reward > 0:
+                step_reward *= max(0.65, confidence)

+        step_reward -= REWARD_CONFIG["cost_per_step"] * self._attempts
+        self._dense_reward_total += step_reward
+
+        if self._workflow_ready_for_flagging():
+            self._phase = "flagging"

         done = self._report_submitted or self._attempts >= self._max_steps
+        self._sync_state()
+        self._score_log.append(
+            {
+                "step": self._attempts,
+                "action": action.action_type,
+                "reward": round(step_reward, 3),
+                "dense_reward_total": round(self._dense_reward_total, 3),
+                "benchmark_score": self._state.current_score,
+            }
+        )

         return AuditObservation(
             done=done,
+            reward=round(step_reward, 3),
             task_id=self._current_task["task_id"],
             task_type=self._current_task["task_type"],
+            task_description=self._task_description(),
+            protocol_title=self._protocol_title,
+            trial_protocol_excerpt=self._protocol_excerpt,
             dataset=self._dataset,
+            errors_found=sorted(self._flagged_patients),
+            patterns_investigated=sorted(self._patterns_investigated),
+            distributions_computed=sorted(self._distributions_computed),
             feedback=feedback,
             score_so_far=self._state.current_score,
+            dense_reward_total=self._state.dense_reward_total,
+            score_breakdown=self._state.score_breakdown,
             attempts_remaining=max(0, self._max_steps - self._attempts),
             phase=self._phase,
         )

     def state(self) -> AuditState:
         return self._state

+    def _grade(self, action: AuditAction) -> tuple[float, str]:
+        if self._phase == "investigation" and action.action_type in {"flag_error", "submit_report"}:
+            if not self._workflow_ready_for_flagging():
+                self._invalid_phase_actions += 1
+                return (
+                    REWARD_CONFIG["invalid_phase"],
+                    "PHASE BLOCKED: Investigate the required variables before flagging or reporting.",
+                )

+        if action.action_type == "submit_report" and not self._flagged_patients:
+            self._invalid_phase_actions += 1

             return (
                 REWARD_CONFIG["invalid_phase"],
+                "PHASE BLOCKED: Flag at least one issue before submitting the report.",

             )

         if action.action_type == "investigate_pattern":
             return self._grade_investigate(action)
+        if action.action_type == "compute_distribution":
             return self._grade_distribution(action)
+        if action.action_type == "flag_error":
             return self._grade_flag(action)
+        if action.action_type == "propose_fix":
             return self._grade_propose_fix(action)
+        if action.action_type == "submit_report":
             return self._grade_report(action)

+        return (
+            REWARD_CONFIG["unknown_action"],
+            "REJECTED: Unknown action. Valid actions are investigate_pattern, compute_distribution, "
+            "flag_error, propose_fix, submit_report.",
+        )

+    def _grade_investigate(self, action: AuditAction) -> tuple[float, str]:
+        variable = action.variable or ""
         valid_vars = {
+            "age",
+            "gender",
+            "ethnicity",
+            "treatment_start",
+            "death_date",
+            "outcome",
+            "treatment_site",
+            "group",
+            "stage",
+            "trial_phase",
+            "drug",
+            "country",
+            "enrollment_date",
         }

         if variable not in valid_vars:
+            return REWARD_CONFIG["unknown_action"], f"REJECTED: Unknown variable '{variable}'."

         if variable in self._patterns_investigated:
             return (
                 REWARD_CONFIG["investigate_redundant"],
+                f"Already investigated '{variable}'. Move to another variable or flag a finding.",
             )

         self._patterns_investigated.add(variable)

         if variable == "age":
+            ages = [patient["age"] for patient in self._dataset if patient.get("age") is not None]
+            null_count = sum(patient.get("age") is None for patient in self._dataset)
+            feedback = (
+                f"Age profile: min={min(ages) if ages else 'NA'}, max={max(ages) if ages else 'NA'}, "
+                f"null_count={null_count}, protocol_range={self._protocol['age_min']}-{self._protocol['age_max']}."
+            )
+        elif variable == "death_date":
+            non_null = [patient for patient in self._dataset if patient.get("death_date")]
+            feedback = f"death_date present for {len(non_null)} patients. Compare against treatment_start."
+        elif variable == "enrollment_date":
+            delays = [
+                (datetime.strptime(patient["treatment_start"], "%Y-%m-%d") - datetime.strptime(patient["enrollment_date"], "%Y-%m-%d")).days
+                for patient in self._dataset
+            ]
+            feedback = (
+                f"Enrollment-to-treatment delays: min={min(delays)}, max={max(delays)}, "
+                f"standard_window={self._protocol['treatment_window_days']} days."
+            )
+        elif variable == "stage":
+            counts = {stage: 0 for stage in ("I", "II", "III", "IV")}
+            for patient in self._dataset:
+                counts[patient["stage"]] = counts.get(patient["stage"], 0) + 1
+            feedback = f"Stage distribution: {counts}. Stage IV has a longer treatment-start window."
         else:
             counts = {}
+            for patient in self._dataset:
+                value = str(patient.get(variable, "None"))
+                counts[value] = counts.get(value, 0) + 1
+            top_counts = dict(sorted(counts.items(), key=lambda item: -item[1])[:8])
+            feedback = f"{variable} distribution snapshot: {top_counts}."
+
+        reward = REWARD_CONFIG["investigate_new"]
+        if variable in {"enrollment_date", "stage"}:
+            reward += REWARD_CONFIG["bonus_protocol_window"] / 2
+        return reward, f"Investigated '{variable}': {feedback}"
+
+    def _grade_distribution(self, action: AuditAction) -> tuple[float, str]:

         variable = action.variable or ""
         if not variable:
             return REWARD_CONFIG["unknown_action"], "REJECTED: Variable cannot be empty."

         if variable in self._distributions_computed:
             return (
                 REWARD_CONFIG["distribution_redundant"],
+                f"Distribution for '{variable}' already computed.",
             )

         self._distributions_computed.add(variable)

+        control = [patient for patient in self._dataset if patient.get("group") == "control"]
         if variable == "ethnicity":
+            counts = {}
+            for patient in control:
+                counts[patient["ethnicity"]] = counts.get(patient["ethnicity"], 0) + 1
+            total = len(control) or 1
+            feedback = ", ".join(
+                f"{key}={value} ({(value / total) * 100:.0f}%)"
+                for key, value in sorted(counts.items(), key=lambda item: -item[1])
+            )
+            message = f"Control-arm ethnicity distribution: {feedback}."

         elif variable == "gender":
+            male = sum(patient.get("gender") == "M" for patient in control)
+            total = len(control) or 1
+            message = (
+                f"Control-arm gender mix: male={male}/{total} ({(male / total) * 100:.0f}%), "
+                f"female={total - male}/{total} ({((total - male) / total) * 100:.0f}%)."
+            )
+        elif variable == "outcome":
+            deceased = sum(patient.get("outcome") == "deceased" for patient in control)
+            total = len(control) or 1
+            message = (
+                f"Control-arm outcomes: deceased={deceased}/{total} ({(deceased / total) * 100:.0f}%), "
+                f"survived={total - deceased}/{total} ({((total - deceased) / total) * 100:.0f}%)."
+            )
         else:
+            message = f"Distribution computed for '{variable}'."

+        return REWARD_CONFIG["distribution_new"], message

+    def _grade_flag(self, action: AuditAction) -> tuple[float, str]:

         error_type = action.error_type or ""
+        if error_type not in self._current_task["allowed_error_types"]:
+            self._false_positive_flags += 1
+            return (
+                REWARD_CONFIG["false_positive"],
+                f"✗ Invalid error_type '{error_type}' for this task.",
+            )

         if error_type == "selection_bias":
             if not self._current_task["allow_bias"]:
+                self._false_positive_flags += 1
+                return REWARD_CONFIG["false_positive"], "✗ Bias review is not part of this task."
+            if not self._bias_review_ready():
+                self._invalid_phase_actions += 1
                 return (
+                    REWARD_CONFIG["invalid_phase"],
+                    "PHASE BLOCKED: Compute ethnicity, gender, and outcome distributions before flagging bias.",
                 )

             if "BIAS_FLAG" in self._flagged_patients:
+                self._duplicate_flags += 1
+                return REWARD_CONFIG["duplicate_flag"], "Bias already flagged."

+            signal = self._bias_signal()
+            if self._bias_present and signal["signal_present"]:
+                self._flagged_patients.add("BIAS_FLAG")
+                self._correct_flags += 1

                 return (
+                    REWARD_CONFIG["bias_detected"],
+                    "✓ Correct. Control-arm bias confirmed: "
+                    f"{signal['dominant_ethnicity']}={signal['dominant_ratio']:.0%}, "
+                    f"male={signal['male_ratio']:.0%}, "
+                    f"stage-adjusted mortality gap={signal['stage_adjusted_gap']:.0%}.",
                 )

+            self._false_positive_flags += 1
             return (
                 REWARD_CONFIG["false_positive"],
+                "✗ False positive. Current data show either no actionable bias or only a confounded "
+                f"high-risk cohort at {signal['high_risk_note']}. "
+                f"Overall gap={signal['overall_gap']:.0%}, stage-adjusted gap={signal['stage_adjusted_gap']:.0%}.",
             )

+        patient_id = action.patient_id
+        if not patient_id:
+            self._false_positive_flags += 1
+            return REWARD_CONFIG["false_positive"], "REJECTED: patient_id is required for record-level errors."
         if patient_id in self._flagged_patients:
+            self._duplicate_flags += 1
+            return REWARD_CONFIG["duplicate_flag"], f"{patient_id} already flagged."

+        patient = next((row for row in self._dataset if row.get("patient_id") == patient_id), None)
+        if patient is None:
+            self._false_positive_flags += 1
+            return REWARD_CONFIG["false_positive"], f"REJECTED: Patient '{patient_id}' not found."

         expected_errors = self._ground_truth.get(patient_id, [])
+        if error_type in expected_errors:
+            self._flagged_patients.add(patient_id)
+            self._correct_flags += 1
+            if error_type == "invalid_age":
                 return (
                     REWARD_CONFIG["correct_flag"],
+                    f"✓ Correct: {patient_id} age={patient.get('age')} violates protocol range "
+                    f"{self._protocol['age_min']}-{self._protocol['age_max']}.",
                 )
+            if error_type == "temporal_inconsistency":
+                treatment_start = datetime.strptime(patient["treatment_start"], "%Y-%m-%d")
+                death_date = datetime.strptime(patient["death_date"], "%Y-%m-%d")
+                gap = (treatment_start - death_date).days
                 return (
                     REWARD_CONFIG["correct_flag"],
+                    f"✓ Correct: {patient_id} death_date is {gap} days before treatment_start.",
+                )
+            if error_type == "protocol_window_violation":
+                enrollment = datetime.strptime(patient["enrollment_date"], "%Y-%m-%d")
+                treatment_start = datetime.strptime(patient["treatment_start"], "%Y-%m-%d")
+                delay = (treatment_start - enrollment).days
+                allowed = (
+                    self._protocol["stage_iv_treatment_window_days"]
+                    if patient["stage"] == "IV"
+                    else self._protocol["treatment_window_days"]
                 )
                 return (
+                    REWARD_CONFIG["correct_flag"] + REWARD_CONFIG["bonus_protocol_window"] / 2,
+                    f"✓ Correct: {patient_id} started treatment after {delay} days; protocol allows only {allowed}.",
                 )

+        self._false_positive_flags += 1
+        if error_type == "invalid_age":
+            return (
+                REWARD_CONFIG["false_positive"],
+                f"✗ False positive: {patient_id} age={patient.get('age')} is valid for protocol range "
+                f"{self._protocol['age_min']}-{self._protocol['age_max']}.",
+            )
+        if error_type == "temporal_inconsistency":
+            return (
+                REWARD_CONFIG["false_positive"],
+                f"✗ False positive: {patient_id} has a valid death/treatment ordering.",
+            )
+        if error_type == "protocol_window_violation":
+            enrollment = datetime.strptime(patient["enrollment_date"], "%Y-%m-%d")
+            treatment_start = datetime.strptime(patient["treatment_start"], "%Y-%m-%d")
+            delay = (treatment_start - enrollment).days
+            allowed = (
+                self._protocol["stage_iv_treatment_window_days"]
+                if patient["stage"] == "IV"
+                else self._protocol["treatment_window_days"]
+            )
             return (
                 REWARD_CONFIG["false_positive"],
+                f"✗ False positive: {patient_id} started treatment after {delay} days, which is valid for stage "
+                f"{patient['stage']} (allowed {allowed}).",
             )

+        return REWARD_CONFIG["false_positive"], f"✗ Invalid error_type '{error_type}'."
+
+    def _grade_propose_fix(self, action: AuditAction) -> tuple[float, str]:
         patient_id = action.patient_id or ""
         if patient_id not in self._flagged_patients:
+            return REWARD_CONFIG["propose_fix_invalid"], "Can only propose a fix for a flagged patient."

         proposed = action.proposed_value or ""
         if len(proposed) > 2:
+            return REWARD_CONFIG["propose_fix_valid"], f"Fix proposed for {patient_id}."
+        return REWARD_CONFIG["propose_fix_invalid"], "Proposed fix is too vague."

+    def _grade_report(self, action: AuditAction) -> tuple[float, str]:
         self._report_submitted = True
         report = (action.report or action.reason or "").lower()
+        quality = 0

         quality_items = []

+        if any(keyword in report for keyword in ["protocol", "eligibility", "inclusion", "excerpt"]):
+            quality += 1
+            quality_items.append("protocol grounding")
+        if any(keyword in report for keyword in ["root cause", "data entry", "pipeline", "system", "site process"]):
+            quality += 1
+            quality_items.append("root cause")
+        if any(keyword in report for keyword in ["recommend", "corrective", "action", "mitigation"]):
+            quality += 1
+            quality_items.append("corrective action")
+        if any(keyword in report for keyword in ["risk", "severity", "impact", "patient safety"]):
+            quality += 1
             quality_items.append("risk assessment")
+        if any(keyword in report for keyword in ["bias", "stage-adjusted", "fairness", "control arm", "equity"]):
+            quality += 1
+            quality_items.append("fairness reasoning")
+
+        self._report_quality = float(quality)
+        reward = REWARD_CONFIG["report_bonus_base"] + (0.015 * quality)
+        return reward, (
+            f"Report submitted. Quality {quality}/5"
+            + (f" ({', '.join(quality_items)})" if quality_items else "")
+            + "."
+        )

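The grader's headline number is the weighted sum computed in `_build_breakdown` above. The recipe can be sketched standalone as below; note the weight values here are illustrative assumptions, since the actual `SCORE_WEIGHTS` constant is defined elsewhere in the repository and does not appear in this commit.

```python
# Sketch of the _build_breakdown scoring recipe shown in the diff above.
# These weights are assumed for illustration; the real SCORE_WEIGHTS
# mapping lives outside this commit.
SCORE_WEIGHTS = {
    "recall": 0.35,
    "precision": 0.25,
    "workflow": 0.20,
    "efficiency": 0.10,
    "report": 0.10,
}


def benchmark_score(recall: float, precision: float, workflow: float,
                    efficiency: float, report: float) -> float:
    """Weighted sum of the five audit components, clamped to [0.0, 1.0]."""
    score = (
        SCORE_WEIGHTS["recall"] * recall
        + SCORE_WEIGHTS["precision"] * precision
        + SCORE_WEIGHTS["workflow"] * workflow
        + SCORE_WEIGHTS["efficiency"] * efficiency
        + SCORE_WEIGHTS["report"] * report
    )
    # Clamp before rounding, mirroring round(min(1.0, max(0.0, score)), 3)
    # in the diff, so the benchmark score always stays in [0.0, 1.0].
    return round(min(1.0, max(0.0, score)), 3)


print(benchmark_score(1.0, 1.0, 1.0, 1.0, 1.0))  # 1.0 for a perfect audit
print(benchmark_score(0.8, 0.5, 0.7, 1.0, 0.6))
```

Because the false-positive count enters the precision denominator with double weight, over-flagging drags the composite score down faster than missed findings of equal count.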
server/dataset_generator.py CHANGED
@@ -1,34 +1,24 @@
 """
 Procedural Adversarial Clinical Trial Data Engine
-==================================================
-Generates statistically rigorous, adversarial patient datasets for each episode.
-
-Design philosophy:
-- Every reset() → unique dataset → no memorization possible
-- Controlled error injection with known ground truth
-- Adversarial traps that punish shallow reasoning
-- Seed-based reproducibility for deterministic judging
-- Pure stdlib (no numpy) → minimal Docker image
-
-Architecture layers:
-1. Base Patient Generator — realistic demographics via statistical distributions
-2. Error Injector — controlled % of age/temporal/missing violations
-3. Bias Injector — demographic skew + outcome disparity in control group
-4. Trap Injector — boundary-valid, near-temporal, fake-pattern distractors
-5. Ground Truth Tracker — records every injected error for deterministic grading
 """

-import random
-import math
 import hashlib
 from datetime import datetime, timedelta
 from typing import Optional


-# ═══════════════════════════════════════════════════════════════════════════
-# REFERENCE DATA — Realistic clinical trial metadata pools
-# ═══════════════════════════════════════════════════════════════════════════
-
 HOSPITAL_SITES = [
     ("Metro General Hospital", "US"),
     ("Cleveland Oncology Institute", "US"),
@@ -37,8 +27,8 @@ HOSPITAL_SITES = [
     ("MD Anderson Cancer Center", "US"),
     ("AIIMS Delhi", "India"),
     ("Tata Memorial Hospital", "India"),
-    ("Charité Berlin", "Germany"),
-    ("Hospital Clínic Barcelona", "Spain"),
     ("Tokyo Medical University", "Japan"),
     ("Seoul National University Hospital", "South Korea"),
     ("Royal Marsden Hospital", "UK"),
@@ -47,98 +37,97 @@ HOSPITAL_SITES = [
     ("Peter MacCallum Cancer Centre", "Australia"),
 ]

-# Sites considered "rural" or underrepresented for bias analysis
 RURAL_SITES = {
-    "AIIMS Delhi", "Tata Memorial Hospital",
     "Howard University Hospital",

 }

-ETHNICITIES = ["White", "Black", "Hispanic", "Asian", "Native American", "Pacific Islander"]

 GENDERS = ["M", "F"]
 STAGES = ["I", "II", "III", "IV"]
 DRUGS_TREATMENT = ["ImmunoVax-7", "OncoShield-X", "TargetCure-3"]
-DRUGS_CONTROL = ["Placebo"]

-# Date range for the trial
 TRIAL_START = datetime(2022, 6, 1)
 TRIAL_END = datetime(2025, 3, 1)

-# ═══════════════════════════════════════════════════════════════════════════
-# DIFFICULTY CONFIGURATIONS
-# ═══════════════════════════════════════════════════════════════════════════

 DIFFICULTY_CONFIGS = {
     "easy": {
         "dataset_size": 300,
-        "age_error_rate": 0.03,        # 3% of patients have invalid ages
-        "temporal_error_rate": 0.0,    # No temporal errors in easy
-        "missing_data_rate": 0.01,     # 1% missing age
-        "bias_intensity": 0.0,         # No bias in easy
-        "num_boundary_traps": 5,       # Valid edge-case ages
         "num_temporal_traps": 0,
-        "num_distractor_deceased": 4,  # Valid deceased patients

         "num_fake_bias_distractors": 0,
-        "mortality_rate": 0.12,        # 12% overall mortality
-        "control_ratio": 0.50,         # 50/50 control/treatment
-        "task_type": "syntactic_cleaning",
-        "allow_bias": False,
     },
     "medium": {
-        "dataset_size": 500,
-        "age_error_rate": 0.03,
-        "temporal_error_rate": 0.03,   # 3% temporal violations
-        "missing_data_rate": 0.015,
-        "bias_intensity": 0.0,
-        "num_boundary_traps": 6,
-        "num_temporal_traps": 3,       # Near-temporal valid cases
-        "num_distractor_deceased": 5,

         "num_fake_bias_distractors": 0,
-        "mortality_rate": 0.15,
         "control_ratio": 0.50,
-        "task_type": "temporal_consistency",
-        "allow_bias": False,
     },
     "hard": {
-        "dataset_size": 800,
-        "age_error_rate": 0.025,
-        "temporal_error_rate": 0.025,
-        "missing_data_rate": 0.01,
-        "bias_intensity": 0.80,        # Strong bias
-        "num_boundary_traps": 8,
-        "num_temporal_traps": 4,

         "num_distractor_deceased": 8,
-        "num_fake_bias_distractors": 5,  # Fake patterns that look biased but aren't
-        "mortality_rate": 0.18,
         "control_ratio": 0.50,
-        "task_type": "comprehensive_audit",
-        "allow_bias": True,
     },
 }


-# ═══════════════════════════════════════════════════════════════════════════
-# DATASET GENERATOR
-# ═══════════════════════════════════════════════════════════════════════════
-
 class DatasetGenerator:
-    """
-    Procedural adversarial clinical trial data engine.
-
-    Generates statistically rigorous patient datasets with:
-    - Configurable size (300-1000+ patients)
-    - Controlled error injection (age, temporal, missing data)
-    - Controllable bias intensity (representation + outcome disparity)
-    - Adversarial traps (boundary-valid, near-temporal, fake patterns)
-    - Seed-based reproducibility (same seed → identical dataset)
-
-    Usage:
-        gen = DatasetGenerator(seed=42)
-        result = gen.generate(difficulty="hard")
-        dataset = result["dataset"]            # List[dict] — patient records
-        ground_truth = result["ground_truth"]  # Dict[str, List[str]] — {pid: [error_types]}
-        traps = result["traps"]                # Set[str] — valid-but-suspicious pids
-        bias_present = result["bias_present"]  # bool
-    """

     def __init__(self, seed: Optional[int] = None):
         self.seed = seed
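The reproducibility guarantee in the removed docstring (same seed, identical dataset) comes down to giving each generator its own `random.Random` instance. A minimal self-contained sketch follows; `TinyGenerator` and its `draw_ages` helper are hypothetical stand-ins for `DatasetGenerator`, while the `rng` attribute name and the `gauss(58, 12)` age model mirror the real code shown in this diff.

```python
import random


class TinyGenerator:
    """Hypothetical toy stand-in for DatasetGenerator's seeding contract."""

    def __init__(self, seed=None):
        # A per-instance Random keeps episodes deterministic under a seed
        # and independent of the global random module's state.
        self.rng = random.Random(seed)

    def draw_ages(self, n):
        # Truncated normal, as in _generate_age: mean 58, std 12,
        # resampling until the value falls in a plausible range.
        ages = []
        while len(ages) < n:
            age = int(self.rng.gauss(58, 12))
            if 18 <= age <= 100:
                ages.append(age)
        return ages


a = TinyGenerator(seed=42).draw_ages(5)
b = TinyGenerator(seed=42).draw_ages(5)
print(a == b)  # True: the same seed replays the same patient stream
```

This is what lets a judge regenerate the exact episode dataset from `(seed, difficulty)` alone and grade flags against known ground truth.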
@@ -151,58 +140,122 @@ class DatasetGenerator:
151
  self._patient_counter += 1
152
  return f"P{self._patient_counter:04d}"
153
 
 
 
 
154
  def _random_date(self, start: datetime, end: datetime) -> datetime:
155
- """Generate a random date between start and end."""
156
  delta = (end - start).days
157
  if delta <= 0:
158
  return start
159
  return start + timedelta(days=self.rng.randint(0, delta))
160
 
161
- def _generate_age(self) -> int:
162
- """Generate a realistic age using truncated normal distribution."""
163
- # Clinical trial typical age: mean=58, std=12
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
164
  while True:
165
- age = int(self.rng.gauss(58, 12))
166
- if 18 <= age <= 100:
167
  return age
168
 
169
  def _select_ethnicity(self, bias_mode: str = "neutral") -> str:
170
- """
171
- Select ethnicity with configurable distribution.
172
- bias_mode: "neutral" | "white_dominant" | "diverse"
173
- """
174
- if bias_mode == "white_dominant":
175
- weights = [0.78, 0.06, 0.06, 0.05, 0.03, 0.02]
176
- elif bias_mode == "diverse":
177
- weights = [0.30, 0.20, 0.20, 0.15, 0.10, 0.05]
178
- else: # neutral — matches US clinical trial demographics
179
- weights = [0.55, 0.15, 0.15, 0.10, 0.03, 0.02]
180
-
181
  return self.rng.choices(ETHNICITIES, weights=weights, k=1)[0]
182
 
183
- def _generate_base_patient(self, group: str, ethnicity: str = None,
184
- bias_mode: str = "neutral") -> dict:
185
- """Generate a single valid patient record."""
 
 
 
 
 
 
 
 
186
  pid = self._next_pid()
187
  site, country = self.rng.choice(HOSPITAL_SITES)
188
- gender = self.rng.choice(GENDERS)
189
- eth = ethnicity or self._select_ethnicity(bias_mode)
190
- age = self._generate_age()
191
- stage = self.rng.choices(STAGES, weights=[0.25, 0.30, 0.25, 0.20], k=1)[0]
192
-
193
- enrollment_date = self._random_date(TRIAL_START, TRIAL_END - timedelta(days=180))
194
- treatment_start = enrollment_date + timedelta(days=self.rng.randint(7, 30))
195
-
196
- if group == "treatment":
197
- drug = self.rng.choice(DRUGS_TREATMENT)
198
- else:
199
- drug = "Placebo"
200
-
201
- patient = {
202
  "patient_id": pid,
203
  "age": age,
204
- "gender": gender,
205
- "ethnicity": eth,
206
  "group": group,
207
  "treatment_start": treatment_start.strftime("%Y-%m-%d"),
208
  "death_date": None,
@@ -210,459 +263,422 @@ class DatasetGenerator:
210
  "treatment_site": site,
211
  "stage": stage,
212
  "trial_phase": "Phase III",
213
- "drug": drug,
214
  "enrollment_date": enrollment_date.strftime("%Y-%m-%d"),
215
  "country": country,
216
  }
217
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
218
  return patient
219
 
220
- def _apply_mortality(self, patient: dict, mortality_rate: float) -> dict:
221
- """Randomly apply mortality with valid timeline."""
222
- if self.rng.random() < mortality_rate:
223
- treatment_start = datetime.strptime(patient["treatment_start"], "%Y-%m-%d")
224
- # Death occurs 1-720 days after treatment start
225
- days_to_death = self.rng.randint(1, 720)
226
- death_date = treatment_start + timedelta(days=days_to_death)
227
- # Cap at trial end
228
- if death_date > TRIAL_END + timedelta(days=365):
229
- death_date = TRIAL_END + timedelta(days=self.rng.randint(1, 180))
230
-
231
- patient["death_date"] = death_date.strftime("%Y-%m-%d")
232
- patient["outcome"] = "deceased"
233
- return patient
 
 
 
 
234
 
235
- # ── Error Injectors ───────────────────────────────────────────────
 
236
 
237
- def _inject_age_errors(self, patients: list[dict], error_rate: float,
238
- missing_rate: float) -> list[dict]:
239
- """Inject invalid age values into random patients."""
240
- n_age_errors = max(3, int(len(patients) * error_rate))
241
- n_missing = max(1, int(len(patients) * missing_rate))
242
 
243
- # Select random indices for age errors (avoid overlap)
 
 
244
  available = list(range(len(patients)))
245
  self.rng.shuffle(available)
246
 
247
- # Invalid age errors
248
- invalid_ages = []
249
- for _ in range(n_age_errors):
250
- error_kind = self.rng.choice([
251
- "negative", "extreme_high", "sentinel", "just_over"
252
- ])
253
- if error_kind == "negative":
254
- invalid_ages.append(self.rng.choice([-1, -5, -10, -3, -15]))
255
- elif error_kind == "extreme_high":
256
- invalid_ages.append(self.rng.choice([150, 200, 250, 300, 500]))
257
- elif error_kind == "sentinel":
258
- invalid_ages.append(self.rng.choice([999, 9999, 0, -999]))
259
- elif error_kind == "just_over":
260
- invalid_ages.append(self.rng.choice([121, 122, 125, 130, 17, 16, 15]))
261
-
262
- for i, invalid_age in enumerate(invalid_ages):
263
- if i >= len(available):
264
- break
265
- idx = available[i]
266
- patients[idx]["age"] = invalid_age
267
- pid = patients[idx]["patient_id"]
268
- self._ground_truth.setdefault(pid, []).append("invalid_age")
269
-
270
- # Missing age (None)
271
- offset = len(invalid_ages)
272
- for j in range(n_missing):
273
- if offset + j >= len(available):
274
- break
275
- idx = available[offset + j]
276
- patients[idx]["age"] = None
277
- pid = patients[idx]["patient_id"]
278
- self._ground_truth.setdefault(pid, []).append("invalid_age")
279
 
280
- return patients
 
 
 
281
 
282
-    def _inject_temporal_errors(self, patients: list[dict],
-                                error_rate: float) -> list[dict]:
-        """Inject temporal violations: death_date before treatment_start."""
-        n_errors = max(3, int(len(patients) * error_rate))
-
-        # Only inject into patients who have death dates or can have one added
-        candidates = []
-        for i, p in enumerate(patients):
-            pid = p["patient_id"]
-            # Don't stack errors on patients already with age errors
-            if pid not in self._ground_truth:
-                candidates.append(i)
 
         self.rng.shuffle(candidates)
 
-        for k in range(min(n_errors, len(candidates))):
-            idx = candidates[k]
-            p = patients[idx]
-            treatment_start = datetime.strptime(p["treatment_start"], "%Y-%m-%d")
-
-            # Death date 15-365 days BEFORE treatment start (clear violation)
-            gap_days = self.rng.randint(15, 365)
-            death_date = treatment_start - timedelta(days=gap_days)
-
-            p["death_date"] = death_date.strftime("%Y-%m-%d")
-            p["outcome"] = "deceased"
-
-            pid = p["patient_id"]
-            self._ground_truth.setdefault(pid, []).append("temporal_inconsistency")
 
         return patients
 
-    def _inject_bias(self, patients: list[dict], intensity: float) -> list[dict]:
-        """
-        Inject multi-dimensional selection bias into the control group.
-
-        Bias structure (mirrors real SEER findings):
-        1. Representation: White patients dominate control group (>75%)
-        2. Outcome disparity: Minority control patients have higher mortality
-        3. Gender imbalance: Males overrepresented in control
-        4. Site bias: Minorities underrepresented at major sites
-        """
-        if intensity <= 0:
-            return patients
-
-        control_patients = [p for p in patients if p["group"] == "control"]
-        treatment_patients = [p for p in patients if p["group"] == "treatment"]
-
-        if not control_patients:
-            return patients
-
-        # ── Layer 1: Representation bias ──
-        # Force >75% of control to be White
-        target_white_ratio = 0.75 + (intensity * 0.10)  # 0.75-0.85
-        n_control = len(control_patients)
-        n_white_target = int(n_control * target_white_ratio)
-        n_white_current = sum(1 for p in control_patients if p["ethnicity"] == "White")
-
-        # Convert some non-White control patients to White
-        non_white_control = [p for p in control_patients if p["ethnicity"] != "White"]
-        to_convert = max(0, n_white_target - n_white_current)
-        self.rng.shuffle(non_white_control)
-        for i in range(min(to_convert, len(non_white_control))):
-            non_white_control[i]["ethnicity"] = "White"
-
-        # ── Layer 2: Gender imbalance in control ──
-        # Force >65% male in control
-        target_male_ratio = 0.65 + (intensity * 0.10)
-        n_male_target = int(n_control * target_male_ratio)
-        n_male_current = sum(1 for p in control_patients if p["gender"] == "M")
-        female_control = [p for p in control_patients if p["gender"] == "F"]
-        to_convert_gender = max(0, n_male_target - n_male_current)
-        self.rng.shuffle(female_control)
-        for i in range(min(to_convert_gender, len(female_control))):
-            female_control[i]["gender"] = "M"
-
-        # ── Layer 3: Outcome disparity ──
-        # Minority patients in control → higher mortality (>60%)
-        minority_control = [
-            p for p in control_patients
-            if p["ethnicity"] != "White" and p["patient_id"] not in self._ground_truth
         ]
-        target_minority_mortality = 0.60 + (intensity * 0.15)
-        n_minority_dead = int(len(minority_control) * target_minority_mortality)
-
-        for i, p in enumerate(minority_control):
-            if i < n_minority_dead:
-                if p["outcome"] != "deceased":
-                    treatment_start = datetime.strptime(p["treatment_start"], "%Y-%m-%d")
-                    death_date = treatment_start + timedelta(
-                        days=self.rng.randint(30, 365)
-                    )
-                    p["death_date"] = death_date.strftime("%Y-%m-%d")
-                    p["outcome"] = "deceased"
-
-        # ── Layer 4: White control patients → low mortality ──
-        white_control = [
-            p for p in control_patients
-            if p["ethnicity"] == "White" and p["patient_id"] not in self._ground_truth
         ]
-        # Keep White mortality low
-        target_white_survival = 0.85
-        n_white_alive = int(len(white_control) * target_white_survival)
-        for i, p in enumerate(white_control):
-            if i < n_white_alive:
-                p["death_date"] = None
-                p["outcome"] = "survived"
-
-        # ── Layer 5: Rural minority underrepresentation ──
-        for p in minority_control:
-            if p["treatment_site"] in RURAL_SITES:
-                # Move some to major sites (reducing rural minority visibility)
-                if self.rng.random() < intensity * 0.5:
-                    major_sites = [
-                        s for s in HOSPITAL_SITES
-                        if s[0] not in RURAL_SITES
-                    ]
-                    new_site = self.rng.choice(major_sites)
-                    p["treatment_site"] = new_site[0]
-                    p["country"] = new_site[1]
-
         return patients
 
-    # ── Trap Injectors ────────────────────────────────────────────────
-
-    def _inject_boundary_traps(self, patients: list[dict], n_traps: int) -> list[dict]:
-        """
-        Inject boundary-valid ages that trap naive agents.
-        Ages like 18, 19, 120 are VALID but suspicious.
-        """
-        boundary_ages = [18, 19, 20, 90, 92, 95, 96, 100, 105, 110, 115, 118, 119, 120, 120]
-        self.rng.shuffle(boundary_ages)  # Randomize which traps appear
         available = [
-            i for i, p in enumerate(patients)
             if p["patient_id"] not in self._ground_truth
-            and p["age"] is not None and 25 <= p["age"] <= 85
         ]
         self.rng.shuffle(available)
 
-        for k in range(min(n_traps, len(available), len(boundary_ages))):
-            idx = available[k]
-            patients[idx]["age"] = boundary_ages[k]
-            self._traps.add(patients[idx]["patient_id"])
-
         return patients
 
-    def _inject_temporal_traps(self, patients: list[dict], n_traps: int) -> list[dict]:
-        """
-        Inject near-temporal valid cases: death 1-3 days AFTER treatment start.
-        These are VALID but look like errors to careless agents.
-        """
         available = [
-            i for i, p in enumerate(patients)
             if p["patient_id"] not in self._ground_truth
-            and p["death_date"] is None
             and p["patient_id"] not in self._traps
         ]
         self.rng.shuffle(available)
-
-        for k in range(min(n_traps, len(available))):
-            idx = available[k]
-            p = patients[idx]
-            treatment_start = datetime.strptime(p["treatment_start"], "%Y-%m-%d")
-            # Death 1-3 days AFTER treatment — valid but suspicious
-            gap = self.rng.randint(1, 3)
-            death_date = treatment_start + timedelta(days=gap)
-            p["death_date"] = death_date.strftime("%Y-%m-%d")
-            p["outcome"] = "deceased"
-            p["stage"] = "IV"  # Make it medically plausible (late-stage)
-            self._traps.add(p["patient_id"])
-
         return patients
 
-    def _inject_fake_bias_distractors(self, patients: list[dict],
-                                      n_distractors: int) -> list[dict]:
-        """
-        Inject patterns that LOOK like bias but aren't.
-        E.g., treatment group with demographic skew (doesn't matter for bias detection
-        since only control group bias is relevant).
-        """
-        treatment_patients = [
-            i for i, p in enumerate(patients)
             if p["group"] == "treatment"
             and p["patient_id"] not in self._ground_truth
             and p["patient_id"] not in self._traps
         ]
-        self.rng.shuffle(treatment_patients)
-
-        for k in range(min(n_distractors, len(treatment_patients))):
-            idx = treatment_patients[k]
-            # Make treatment group look skewed (irrelevant for bias detection)
-            patients[idx]["ethnicity"] = "White"
-            patients[idx]["gender"] = "M"
-            self._traps.add(patients[idx]["patient_id"])
-
         return patients
 
-    def _inject_distractor_deceased(self, patients: list[dict],
-                                    n_distractors: int) -> list[dict]:
-        """
-        Add deceased patients with perfectly valid timelines.
-        These are NOT errors; this tests whether the agent over-flags deceased patients.
-        """
-        available = [
-            i for i, p in enumerate(patients)
-            if p["patient_id"] not in self._ground_truth
-            and p["death_date"] is None
-            and p["patient_id"] not in self._traps
         ]
-        self.rng.shuffle(available)
-
-        for k in range(min(n_distractors, len(available))):
-            idx = available[k]
-            p = patients[idx]
-            treatment_start = datetime.strptime(p["treatment_start"], "%Y-%m-%d")
-            # Death 30-540 days after treatment (clearly valid)
-            days = self.rng.randint(30, 540)
-            death_date = treatment_start + timedelta(days=days)
-            p["death_date"] = death_date.strftime("%Y-%m-%d")
-            p["outcome"] = "deceased"
-            self._traps.add(p["patient_id"])
 
-        return patients
 
-    # ── Main Generator ────────────────────────────────────────────────
 
     def generate(self, difficulty: str = "easy") -> dict:
-        """
-        Generate a complete adversarial dataset for the given difficulty.
-
-        Returns:
-            {
-                "dataset": List[dict],                 # Patient records
-                "ground_truth": Dict[str, List[str]],  # {pid: [error_types]}
-                "traps": Set[str],                     # Valid-but-suspicious pids
-                "bias_present": bool,                  # Whether bias was injected
-                "config": dict,                        # Generation parameters
-                "stats": dict,                         # Summary statistics
-            }
-        """
         config = DIFFICULTY_CONFIGS.get(difficulty, DIFFICULTY_CONFIGS["easy"])
         self._ground_truth = {}
         self._traps = set()
         self._patient_counter = 0
 
-        n = config["dataset_size"]
-        n_control = int(n * config["control_ratio"])
-        n_treatment = n - n_control
 
-        # ── Step 1: Generate base patients ──
         patients = []
-
-        # Determine bias mode for control group
-        control_bias_mode = "white_dominant" if config["bias_intensity"] > 0 else "neutral"
-
         for _ in range(n_control):
-            p = self._generate_base_patient("control", bias_mode=control_bias_mode)
-            p = self._apply_mortality(p, config["mortality_rate"])
-            patients.append(p)
 
         for _ in range(n_treatment):
-            p = self._generate_base_patient("treatment", bias_mode="diverse")
-            p = self._apply_mortality(p, config["mortality_rate"])
-            patients.append(p)
-
-        # ── Step 2: Inject errors ──
-        patients = self._inject_age_errors(
-            patients, config["age_error_rate"], config["missing_data_rate"]
-        )
 
         if config["temporal_error_rate"] > 0:
-            patients = self._inject_temporal_errors(
-                patients, config["temporal_error_rate"]
-            )
-
-        # ── Step 3: Inject bias (hard only) ──
-        if config["bias_intensity"] > 0:
-            patients = self._inject_bias(patients, config["bias_intensity"])
 
-        # ── Step 4: Inject adversarial traps ──
-        patients = self._inject_boundary_traps(patients, config["num_boundary_traps"])
 
         if config["num_temporal_traps"] > 0:
-            patients = self._inject_temporal_traps(
-                patients, config["num_temporal_traps"]
-            )
-
         if config["num_fake_bias_distractors"] > 0:
-            patients = self._inject_fake_bias_distractors(
-                patients, config["num_fake_bias_distractors"]
-            )
 
-        patients = self._inject_distractor_deceased(
-            patients, config["num_distractor_deceased"]
-        )
-
-        # ── Step 5: Shuffle dataset ──
         self.rng.shuffle(patients)
 
-        # ── Step 6: Compute summary stats ──
-        n_age_errors = sum(
-            1 for errs in self._ground_truth.values()
-            if "invalid_age" in errs
-        )
-        n_temporal_errors = sum(
-            1 for errs in self._ground_truth.values()
-            if "temporal_inconsistency" in errs
-        )
-        total_errors = n_age_errors + n_temporal_errors
-        if config["bias_intensity"] > 0:
-            total_errors += 1  # bias counts as 1 error
-
         stats = {
             "total_patients": len(patients),
-            "total_errors": total_errors,
-            "age_errors": n_age_errors,
-            "temporal_errors": n_temporal_errors,
-            "bias_present": config["bias_intensity"] > 0,
             "num_traps": len(self._traps),
             "control_count": sum(1 for p in patients if p["group"] == "control"),
             "treatment_count": sum(1 for p in patients if p["group"] == "treatment"),
         }
 
         return {
             "dataset": patients,
             "ground_truth": dict(self._ground_truth),
             "traps": set(self._traps),
-            "bias_present": config["bias_intensity"] > 0,
             "config": config,
             "stats": stats,
         }
 
-# ═══════════════════════════════════════════════════════════════════════════
-# STANDALONE TEST
-# ═══════════════════════════════════════════════════════════════════════════
-
 if __name__ == "__main__":
     print("=" * 60)
     print(" Dataset Generator — Validation Test")
     print("=" * 60)
 
-    for diff in ["easy", "medium", "hard"]:
-        gen = DatasetGenerator(seed=42)
-        result = gen.generate(difficulty=diff)
         stats = result["stats"]
-        print(f"\n  {diff.upper()}:")
         print(f"  Patients: {stats['total_patients']}")
-        print(f"  Errors: {stats['total_errors']} "
-              f"(age={stats['age_errors']}, temporal={stats['temporal_errors']}, "
-              f"bias={'yes' if stats['bias_present'] else 'no'})")
         print(f"  Traps: {stats['num_traps']}")
         print(f"  Control: {stats['control_count']}")
         print(f"  Treatment: {stats['treatment_count']}")
 
-        # Verify reproducibility
-        gen2 = DatasetGenerator(seed=42)
-        result2 = gen2.generate(difficulty=diff)
-        assert result["dataset"] == result2["dataset"], "REPRODUCIBILITY FAILED!"
-        assert result["ground_truth"] == result2["ground_truth"], "GROUND TRUTH MISMATCH!"
-        print(f"  ✓ Seed reproducibility verified")
-
-        # Verify ground truth
-        for pid, errors in result["ground_truth"].items():
-            patient = next(p for p in result["dataset"] if p["patient_id"] == pid)
-            for err in errors:
-                if err == "invalid_age":
                     age = patient.get("age")
-                    assert age is None or age < 18 or age > 120, \
-                        f"Ground truth says {pid} invalid age but age={age}"
-                elif err == "temporal_inconsistency":
-                    ts = datetime.strptime(patient["treatment_start"], "%Y-%m-%d")
-                    dd = datetime.strptime(patient["death_date"], "%Y-%m-%d")
-                    assert dd < ts, \
-                        f"Ground truth says {pid} temporal error but dates are valid"
-        print(f"  Ground truth integrity verified")
-
-    # Verify different seeds produce different datasets
-    gen_a = DatasetGenerator(seed=1)
-    gen_b = DatasetGenerator(seed=2)
-    result_a = gen_a.generate("easy")
-    result_b = gen_b.generate("easy")
-    assert result_a["dataset"] != result_b["dataset"], "Different seeds same data!"
-    print(f"\n  ✓ Different seeds produce different datasets")
 
     print(f"\n{'=' * 60}")
-    print(f"  ALL TESTS PASSED")
     print(f"{'=' * 60}")
 
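The deleted self-test above asserts the generator's core contract: a fixed seed reproduces the dataset and ground truth exactly, while different seeds diverge. That property comes from routing every random draw through one seeded `random.Random` instance (the generator's `self.rng`). A minimal, self-contained sketch of the same pattern (the function name here is illustrative, not from the repository):

```python
import random


def make_records(seed: int, n: int = 5) -> list[int]:
    # One RNG per build: every draw flows through this instance, so a
    # fixed seed replays the identical sequence of choices.
    rng = random.Random(seed)
    return [rng.randint(0, 100) for _ in range(n)]


a = make_records(seed=42)
b = make_records(seed=42)
c = make_records(seed=7)
assert a == b  # same seed, identical dataset
assert a != c  # different seeds diverge
```

Anything that bypasses the shared instance (e.g. module-level `random.shuffle` or wall-clock timestamps) would silently break both reproducibility assertions.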
 """
 Procedural Adversarial Clinical Trial Data Engine
+=================================================
+Generates seeded, protocol-driven clinical trial datasets for OpenEnv episodes.
+
+This generator is intentionally benchmark-oriented:
+- each episode samples a different protocol excerpt and hidden rule set
+- age eligibility is protocol-specific, not a fixed 18-120 shortcut
+- treatment scheduling uses stage-aware exceptions to create valid edge cases
+- hard episodes alternate between true bias and confounded "looks bad" cohorts
+- all labels remain deterministic and reproducible from the seed
 """
 
+from __future__ import annotations
+
 import hashlib
+import random
 from datetime import datetime, timedelta
 from typing import Optional
 
 
 HOSPITAL_SITES = [
     ("Metro General Hospital", "US"),
     ("Cleveland Oncology Institute", "US"),
...
     ("MD Anderson Cancer Center", "US"),
     ("AIIMS Delhi", "India"),
     ("Tata Memorial Hospital", "India"),
+    ("Charite Berlin", "Germany"),
+    ("Hospital Clinic Barcelona", "Spain"),
     ("Tokyo Medical University", "Japan"),
     ("Seoul National University Hospital", "South Korea"),
     ("Royal Marsden Hospital", "UK"),
...
     ("Peter MacCallum Cancer Centre", "Australia"),
 ]
 
 RURAL_SITES = {
+    "AIIMS Delhi",
     "Howard University Hospital",
+    "Tata Memorial Hospital",
 }
 
+ETHNICITIES = [
+    "White",
+    "Black",
+    "Hispanic",
+    "Asian",
+    "Native American",
+    "Pacific Islander",
+]
 GENDERS = ["M", "F"]
 STAGES = ["I", "II", "III", "IV"]
 DRUGS_TREATMENT = ["ImmunoVax-7", "OncoShield-X", "TargetCure-3"]
 
 TRIAL_START = datetime(2022, 6, 1)
 TRIAL_END = datetime(2025, 3, 1)
 
+BASE_STAGE_MORTALITY = {
+    "I": 0.04,
+    "II": 0.08,
+    "III": 0.16,
+    "IV": 0.32,
+}
+
+AGE_RULESETS = {
+    "easy": [(35, 75), (40, 80), (45, 85)],
+    "medium": [(18, 75), (21, 80), (30, 85), (40, 90)],
+    "hard": [(18, 75), (21, 80), (30, 85), (35, 85), (40, 90)],
+}
+
+WINDOW_RULESETS = {
+    "easy": [21, 24, 28],
+    "medium": [18, 21, 24, 28],
+    "hard": [14, 18, 21, 24],
+}
 
 DIFFICULTY_CONFIGS = {
     "easy": {
         "dataset_size": 300,
+        "age_error_rate": 0.020,
+        "missing_age_rate": 0.007,
+        "temporal_error_rate": 0.0,
+        "protocol_window_error_rate": 0.0,
+        "num_boundary_traps": 8,
         "num_temporal_traps": 0,
+        "num_window_traps": 0,
+        "num_distractor_deceased": 5,
         "num_fake_bias_distractors": 0,
+        "bias_probability": 0.0,
+        "control_ratio": 0.50,
+        "task_type": "eligibility_screening",
     },
     "medium": {
+        "dataset_size": 480,
+        "age_error_rate": 0.018,
+        "missing_age_rate": 0.007,
+        "temporal_error_rate": 0.012,
+        "protocol_window_error_rate": 0.015,
+        "num_boundary_traps": 10,
+        "num_temporal_traps": 4,
+        "num_window_traps": 5,
+        "num_distractor_deceased": 6,
         "num_fake_bias_distractors": 0,
+        "bias_probability": 0.0,
         "control_ratio": 0.50,
+        "task_type": "protocol_timeline_audit",
     },
     "hard": {
+        "dataset_size": 720,
+        "age_error_rate": 0.017,
+        "missing_age_rate": 0.006,
+        "temporal_error_rate": 0.010,
+        "protocol_window_error_rate": 0.014,
+        "num_boundary_traps": 12,
+        "num_temporal_traps": 5,
+        "num_window_traps": 7,
         "num_distractor_deceased": 8,
+        "num_fake_bias_distractors": 8,
+        "bias_probability": 0.58,
         "control_ratio": 0.50,
+        "task_type": "equity_and_protocol_audit",
     },
 }
 
 
 class DatasetGenerator:
+    """Seeded benchmark dataset generator."""
 
     def __init__(self, seed: Optional[int] = None):
         self.seed = seed
...
         self._patient_counter += 1
         return f"P{self._patient_counter:04d}"
 
+    def _mark_error(self, patient_id: str, error_type: str) -> None:
+        self._ground_truth.setdefault(patient_id, []).append(error_type)
+
     def _random_date(self, start: datetime, end: datetime) -> datetime:
         delta = (end - start).days
         if delta <= 0:
             return start
         return start + timedelta(days=self.rng.randint(0, delta))
 
+    def _build_protocol(self, difficulty: str, config: dict) -> dict:
+        age_min, age_max = self.rng.choice(AGE_RULESETS[difficulty])
+        treatment_window = self.rng.choice(WINDOW_RULESETS[difficulty])
+        stage_iv_window = treatment_window + self.rng.choice([7, 10, 14])
+        high_risk_sites = self.rng.sample(sorted(RURAL_SITES), k=2 if difficulty == "hard" else 1)
+        dominant_threshold = self.rng.choice([0.68, 0.70, 0.72]) if difficulty == "hard" else 0.0
+        male_threshold = self.rng.choice([0.56, 0.60, 0.63]) if difficulty == "hard" else 0.0
+        adjusted_gap = self.rng.choice([0.12, 0.15, 0.18]) if difficulty == "hard" else 0.0
+        bias_present = difficulty == "hard" and self.rng.random() < config["bias_probability"]
+        protocol_key = (
+            f"{difficulty}|{age_min}|{age_max}|{treatment_window}|"
+            f"{stage_iv_window}|{'/'.join(high_risk_sites)}|{bias_present}"
+        )
+        protocol_id = hashlib.sha1(protocol_key.encode("utf-8")).hexdigest()[:8].upper()
+        protocol_title = f"ONCO-AX-{difficulty[:1].upper()}{protocol_id}"
+
+        lines = [
+            f"TRIAL PROTOCOL EXCERPT — {protocol_title}",
+            "",
+            "Eligibility",
+            f"- Participants must be age {age_min}-{age_max} inclusive on enrollment.",
+            "- Missing age is a protocol violation.",
+            "",
+            "Treatment Scheduling",
+            f"- Treatment must begin within {treatment_window} days of enrollment.",
+            (
+                f"- Stage IV exception: treatment may begin within "
+                f"{stage_iv_window} days of enrollment."
+            ),
+            "- death_date must never precede treatment_start.",
+            "- Do not assume a generic 18-120 range; this excerpt overrides defaults.",
+        ]
+
+        if difficulty == "hard":
+            lines.extend(
+                [
+                    "",
+                    "Equity Review",
+                    "- Selection bias concerns control-arm composition, not treatment-arm skew.",
+                    "- Compare mortality within stage strata before escalating a bias concern.",
+                    (
+                        f"- Stage IV patients at {', '.join(high_risk_sites)} are a known "
+                        "high-risk outreach cohort; elevated mortality there is not standalone "
+                        "evidence of unfair sampling."
+                    ),
+                    (
+                        f"- Escalate bias only when control-arm dominance exceeds "
+                        f"{int(dominant_threshold * 100)}%, male share exceeds "
+                        f"{int(male_threshold * 100)}%, and stage-adjusted mortality gap "
+                        f"exceeds {int(adjusted_gap * 100)} percentage points."
+                    ),
+                ]
+            )
+
+        return {
+            "protocol_id": protocol_id,
+            "protocol_title": protocol_title,
+            "excerpt": "\n".join(lines),
+            "age_min": age_min,
+            "age_max": age_max,
+            "treatment_window_days": treatment_window,
+            "stage_iv_treatment_window_days": stage_iv_window,
+            "high_risk_sites": high_risk_sites,
+            "bias_control_dominance_threshold": dominant_threshold,
+            "bias_male_threshold": male_threshold,
+            "bias_stage_adjusted_gap": adjusted_gap,
+            "bias_present": bias_present,
+        }
+
+    def _generate_age(self, protocol: dict) -> int:
         while True:
+            age = int(self.rng.gauss(58, 11))
+            if protocol["age_min"] <= age <= protocol["age_max"]:
                 return age
 
     def _select_ethnicity(self, bias_mode: str = "neutral") -> str:
+        if bias_mode == "diverse":
+            weights = [0.28, 0.19, 0.20, 0.18, 0.10, 0.05]
+        elif bias_mode == "white_dominant":
+            weights = [0.68, 0.08, 0.08, 0.08, 0.05, 0.03]
+        else:
+            weights = [0.50, 0.16, 0.15, 0.12, 0.04, 0.03]
         return self.rng.choices(ETHNICITIES, weights=weights, k=1)[0]
 
+    def _base_delay(self, stage: str, protocol: dict) -> int:
+        max_window = (
+            protocol["stage_iv_treatment_window_days"]
+            if stage == "IV"
+            else protocol["treatment_window_days"]
+        )
+        lower = 5 if max_window >= 10 else 1
+        upper = max(lower, max_window - 2)
+        return self.rng.randint(lower, upper)
+
+    def _generate_base_patient(self, group: str, protocol: dict, bias_mode: str = "neutral") -> dict:
         pid = self._next_pid()
         site, country = self.rng.choice(HOSPITAL_SITES)
+        stage = self.rng.choices(STAGES, weights=[0.24, 0.28, 0.28, 0.20], k=1)[0]
+        age = self._generate_age(protocol)
+        enrollment_end = TRIAL_END - timedelta(days=150)
+        enrollment_date = self._random_date(TRIAL_START, enrollment_end)
+        treatment_start = enrollment_date + timedelta(days=self._base_delay(stage, protocol))
+        return {
             "patient_id": pid,
             "age": age,
+            "gender": self.rng.choice(GENDERS),
+            "ethnicity": self._select_ethnicity(bias_mode),
             "group": group,
             "treatment_start": treatment_start.strftime("%Y-%m-%d"),
             "death_date": None,
...
             "treatment_site": site,
             "stage": stage,
             "trial_phase": "Phase III",
+            "drug": self.rng.choice(DRUGS_TREATMENT) if group == "treatment" else "Placebo",
             "enrollment_date": enrollment_date.strftime("%Y-%m-%d"),
             "country": country,
         }
 
+    def _mortality_rate(self, patient: dict, protocol: dict) -> float:
+        rate = BASE_STAGE_MORTALITY.get(patient["stage"], 0.10)
+        if patient["treatment_site"] in protocol["high_risk_sites"] and patient["stage"] == "IV":
+            rate += 0.16
+        if patient["group"] == "treatment":
+            rate *= 0.92
+        return max(0.02, min(0.82, rate))
+
+    def _set_deceased(self, patient: dict, min_days: int, max_days: int) -> None:
+        treatment_start = datetime.strptime(patient["treatment_start"], "%Y-%m-%d")
+        days_to_death = self.rng.randint(min_days, max_days)
+        death_date = treatment_start + timedelta(days=days_to_death)
+        patient["death_date"] = death_date.strftime("%Y-%m-%d")
+        patient["outcome"] = "deceased"
+
+    def _apply_mortality(self, patient: dict, protocol: dict) -> dict:
+        if self.rng.random() < self._mortality_rate(patient, protocol):
+            self._set_deceased(patient, min_days=3, max_days=540)
         return patient
 
+    def _apply_target_mortality(self, cohort: list[dict], target_rate: float) -> None:
+        if not cohort:
+            return
+        self.rng.shuffle(cohort)
+        target_count = int(round(len(cohort) * max(0.0, min(1.0, target_rate))))
+        for index, patient in enumerate(cohort):
+            if index < target_count:
+                self._set_deceased(patient, min_days=10, max_days=420)
+            else:
+                patient["death_date"] = None
+                patient["outcome"] = "survived"
+
+    def _allowed_treatment_window(self, patient: dict, protocol: dict) -> int:
+        return (
+            protocol["stage_iv_treatment_window_days"]
+            if patient.get("stage") == "IV"
+            else protocol["treatment_window_days"]
+        )
 
+    def _enrollment_date(self, patient: dict) -> datetime:
+        return datetime.strptime(patient["enrollment_date"], "%Y-%m-%d")
 
+    def _treatment_date(self, patient: dict) -> datetime:
+        return datetime.strptime(patient["treatment_start"], "%Y-%m-%d")
 
+    def _inject_age_errors(self, patients: list[dict], protocol: dict, config: dict) -> list[dict]:
+        n_invalid = max(3, int(len(patients) * config["age_error_rate"]))
+        n_missing = max(1, int(len(patients) * config["missing_age_rate"]))
         available = list(range(len(patients)))
         self.rng.shuffle(available)
 
+        low_values = [protocol["age_min"] - 1, protocol["age_min"] - 2, max(0, protocol["age_min"] - 5), -1]
+        high_values = [protocol["age_max"] + 1, protocol["age_max"] + 2, protocol["age_max"] + 5, 999]
 
+        for offset in range(min(n_invalid, len(available))):
+            patient = patients[available[offset]]
+            patient["age"] = self.rng.choice(low_values + high_values)
+            self._mark_error(patient["patient_id"], "invalid_age")
 
+        start = min(n_invalid, len(available))
+        for offset in range(start, min(start + n_missing, len(available))):
+            patient = patients[available[offset]]
+            patient["age"] = None
+            self._mark_error(patient["patient_id"], "invalid_age")
 
+        return patients
 
+    def _inject_temporal_errors(self, patients: list[dict], config: dict) -> list[dict]:
+        n_errors = max(3, int(len(patients) * config["temporal_error_rate"]))
+        candidates = [p for p in patients if p["patient_id"] not in self._ground_truth]
         self.rng.shuffle(candidates)
 
+        for patient in candidates[:n_errors]:
+            treatment_start = self._treatment_date(patient)
+            death_date = treatment_start - timedelta(days=self.rng.randint(10, 240))
+            patient["death_date"] = death_date.strftime("%Y-%m-%d")
+            patient["outcome"] = "deceased"
+            self._mark_error(patient["patient_id"], "temporal_inconsistency")
 
+        return patients
 
+    def _inject_protocol_window_errors(
+        self,
+        patients: list[dict],
+        protocol: dict,
+        config: dict,
+    ) -> list[dict]:
+        n_errors = max(3, int(len(patients) * config["protocol_window_error_rate"]))
+        candidates = [p for p in patients if p["patient_id"] not in self._ground_truth]
+        self.rng.shuffle(candidates)
 
+        for patient in candidates[:n_errors]:
+            allowed_days = self._allowed_treatment_window(patient, protocol)
+            enrollment = self._enrollment_date(patient)
+            violation_days = allowed_days + self.rng.randint(2, 18)
+            patient["treatment_start"] = (enrollment + timedelta(days=violation_days)).strftime("%Y-%m-%d")
+            if patient["death_date"]:
+                death_date = datetime.strptime(patient["death_date"], "%Y-%m-%d")
+                treatment_start = self._treatment_date(patient)
+                if death_date <= treatment_start:
+                    self._set_deceased(patient, min_days=20, max_days=320)
+            self._mark_error(patient["patient_id"], "protocol_window_violation")
 
         return patients
 
+    def _inject_boundary_traps(self, patients: list[dict], protocol: dict, n_traps: int) -> list[dict]:
+        valid_ages = [
+            protocol["age_min"],
+            protocol["age_min"] + 1,
+            protocol["age_min"] + 2,
+            protocol["age_max"] - 2,
+            protocol["age_max"] - 1,
+            protocol["age_max"],
         ]
+        available = [
+            p
+            for p in patients
+            if p["patient_id"] not in self._ground_truth and p["age"] is not None
         ]
+        self.rng.shuffle(available)
+        for patient, age in zip(available[:n_traps], valid_ages * max(1, n_traps)):
+            patient["age"] = age
+            self._traps.add(patient["patient_id"])
         return patients
 
+    def _inject_temporal_traps(self, patients: list[dict], n_traps: int) -> list[dict]:
         available = [
+            p
+            for p in patients
             if p["patient_id"] not in self._ground_truth
+            and p["patient_id"] not in self._traps
+            and p["death_date"] is None
         ]
         self.rng.shuffle(available)
+        for patient in available[:n_traps]:
+            patient["stage"] = "IV"
+            self._set_deceased(patient, min_days=1, max_days=3)
+            self._traps.add(patient["patient_id"])
+        return patients
 
+    def _inject_window_traps(self, patients: list[dict], protocol: dict, n_traps: int) -> list[dict]:
+        available = [
+            p
+            for p in patients
+            if p["patient_id"] not in self._ground_truth and p["patient_id"] not in self._traps
+        ]
+        self.rng.shuffle(available)
+        for patient in available[:n_traps]:
+            enrollment = self._enrollment_date(patient)
+            if self.rng.random() < 0.55:
+                patient["stage"] = "IV"
+            allowed_days = self._allowed_treatment_window(patient, protocol)
+            trap_delay = max(1, allowed_days - self.rng.choice([0, 1]))
+            patient["treatment_start"] = (enrollment + timedelta(days=trap_delay)).strftime("%Y-%m-%d")
+            if patient["death_date"]:
+                death_date = datetime.strptime(patient["death_date"], "%Y-%m-%d")
+                if death_date <= self._treatment_date(patient):
+                    self._set_deceased(patient, min_days=12, max_days=240)
+            self._traps.add(patient["patient_id"])
         return patients
 
+    def _inject_distractor_deceased(self, patients: list[dict], n_distractors: int) -> list[dict]:
         available = [
+            p
+            for p in patients
             if p["patient_id"] not in self._ground_truth
             and p["patient_id"] not in self._traps
+            and p["death_date"] is None
         ]
         self.rng.shuffle(available)
+        for patient in available[:n_distractors]:
+            self._set_deceased(patient, min_days=30, max_days=520)
+            self._traps.add(patient["patient_id"])
         return patients
 
+    def _inject_fake_bias_distractors(self, patients: list[dict], n_distractors: int) -> list[dict]:
+        treatment_group = [
+            p
+            for p in patients
             if p["group"] == "treatment"
             and p["patient_id"] not in self._ground_truth
             and p["patient_id"] not in self._traps
         ]
+        self.rng.shuffle(treatment_group)
+        for patient in treatment_group[:n_distractors]:
+            patient["ethnicity"] = "White"
+            patient["gender"] = "M"
+            if self.rng.random() < 0.5:
+                patient["stage"] = "IV"
+                self._set_deceased(patient, min_days=15, max_days=180)
+            self._traps.add(patient["patient_id"])
         return patients
 
+    def _inject_selection_bias(self, patients: list[dict], protocol: dict) -> None:
+        control = [
+            p
+            for p in patients
+            if p["group"] == "control" and p["patient_id"] not in self._ground_truth
+        ]
+        if not control:
+            return
+
+        target_dom_ratio = protocol["bias_control_dominance_threshold"] + self.rng.uniform(0.06, 0.12)
+        dominant_target = int(len(control) * min(0.86, target_dom_ratio))
+        white_control = [p for p in control if p["ethnicity"] == "White"]
+        non_white_control = [p for p in control if p["ethnicity"] != "White"]
+        needed = max(0, dominant_target - len(white_control))
+        self.rng.shuffle(non_white_control)
+        for patient in non_white_control[:needed]:
+            patient["ethnicity"] = "White"
+
+        target_male_ratio = protocol["bias_male_threshold"] + self.rng.uniform(0.05, 0.10)
+        male_target = int(len(control) * min(0.82, target_male_ratio))
+        male_control = [p for p in control if p["gender"] == "M"]
+        female_control = [p for p in control if p["gender"] == "F"]
+        needed_male = max(0, male_target - len(male_control))
+        self.rng.shuffle(female_control)
+        for patient in female_control[:needed_male]:
+            patient["gender"] = "M"
+
+        dominant = [p for p in control if p["ethnicity"] == "White"]
+        minority = [p for p in control if p["ethnicity"] != "White"]
+        for stage in STAGES:
+            stage_majority = [p for p in dominant if p["stage"] == stage]
+            stage_minority = [p for p in minority if p["stage"] == stage]
+            if not stage_majority or not stage_minority:
+                continue
+            base = BASE_STAGE_MORTALITY[stage]
+            self._apply_target_mortality(stage_majority, max(0.02, base - 0.03))
+            self._apply_target_mortality(stage_minority, min(0.82, base + 0.18))
+
+    def _inject_confounder_cohort(self, patients: list[dict], protocol: dict) -> None:
+        control = [
+            p
+            for p in patients
+            if p["group"] == "control" and p["patient_id"] not in self._ground_truth
+        ]
+        if not control:
+            return
+
+        minority = [p for p in control if p["ethnicity"] != "White"]
+        white = [p for p in control if p["ethnicity"] == "White"]
+        self.rng.shuffle(minority)
+        self.rng.shuffle(white)
+
+        minority_shift = max(8, len(control) // 18)
+        white_shift = max(4, len(control) // 30)
+
+        for patient in minority[:minority_shift]:
+            patient["stage"] = "IV"
+            patient["treatment_site"] = self.rng.choice(protocol["high_risk_sites"])
+            patient["country"] = next(
+                country for site, country in HOSPITAL_SITES if site == patient["treatment_site"]
+            )
+
+        for patient in white[:white_shift]:
+            patient["stage"] = "IV"
+            patient["treatment_site"] = self.rng.choice(protocol["high_risk_sites"])
+            patient["country"] = next(
+                country for site, country in HOSPITAL_SITES if site == patient["treatment_site"]
+            )
+
+        stage_iv_control = [p for p in control if p["stage"] == "IV"]
+        stage_iv_minority = [p for p in stage_iv_control if p["ethnicity"] != "White"]
+        stage_iv_white = [p for p in stage_iv_control if p["ethnicity"] == "White"]
+        self._apply_target_mortality(stage_iv_minority, 0.66)
+        self._apply_target_mortality(stage_iv_white, 0.63)
 
     def generate(self, difficulty: str = "easy") -> dict:
         config = DIFFICULTY_CONFIGS.get(difficulty, DIFFICULTY_CONFIGS["easy"])
         self._ground_truth = {}
         self._traps = set()
         self._patient_counter = 0
 
+        protocol = self._build_protocol(difficulty, config)
+        n_patients = config["dataset_size"]
+        n_control = int(n_patients * config["control_ratio"])
+        n_treatment = n_patients - n_control
 
         patients = []
         for _ in range(n_control):
+            patient = self._generate_base_patient("control", protocol, bias_mode="neutral")
+            patients.append(self._apply_mortality(patient, protocol))
 
         for _ in range(n_treatment):
+            patient = self._generate_base_patient("treatment", protocol, bias_mode="diverse")
+            patients.append(self._apply_mortality(patient, protocol))
 
+        patients = self._inject_age_errors(patients, protocol, config)
         if config["temporal_error_rate"] > 0:
+            patients = self._inject_temporal_errors(patients, config)
+        if config["protocol_window_error_rate"] > 0:
+            patients = self._inject_protocol_window_errors(patients, protocol, config)
 
+        if difficulty == "hard":
+            if protocol["bias_present"]:
+                self._inject_selection_bias(patients, protocol)
+            else:
+                self._inject_confounder_cohort(patients, protocol)
 
+        patients = self._inject_boundary_traps(patients, protocol, config["num_boundary_traps"])
         if config["num_temporal_traps"] > 0:
+            patients = self._inject_temporal_traps(patients, config["num_temporal_traps"])
+        if config["num_window_traps"] > 0:
+            patients = self._inject_window_traps(patients, protocol, config["num_window_traps"])
+        patients = self._inject_distractor_deceased(patients, config["num_distractor_deceased"])
         if config["num_fake_bias_distractors"] > 0:
+            patients = self._inject_fake_bias_distractors(patients, config["num_fake_bias_distractors"])
 
         self.rng.shuffle(patients)
 
         stats = {
             "total_patients": len(patients),
+            "age_errors": sum("invalid_age" in errs for errs in self._ground_truth.values()),
+            "temporal_errors": sum("temporal_inconsistency" in errs for errs in self._ground_truth.values()),
+            "protocol_window_errors": sum("protocol_window_violation" in errs for errs in self._ground_truth.values()),
+            "bias_present": protocol["bias_present"],
+            "bias_mode": "true_bias" if protocol["bias_present"] else ("confounded_no_bias" if difficulty == "hard" else "none"),
             "num_traps": len(self._traps),
             "control_count": sum(1 for p in patients if p["group"] == "control"),
             "treatment_count": sum(1 for p in patients if p["group"] == "treatment"),
+            "protocol_title": protocol["protocol_title"],
         }
+        stats["total_errors"] = (
+            stats["age_errors"]
+            + stats["temporal_errors"]
+            + stats["protocol_window_errors"]
+            + (1 if protocol["bias_present"] else 0)
+        )
 
         return {
             "dataset": patients,
             "ground_truth": dict(self._ground_truth),
             "traps": set(self._traps),
+            "bias_present": protocol["bias_present"],
+            "protocol": protocol,
+            "protocol_excerpt": protocol["excerpt"],
+            "protocol_title": protocol["protocol_title"],
             "config": config,
             "stats": stats,
         }
 
 
 if __name__ == "__main__":
     print("=" * 60)
     print(" Dataset Generator — Validation Test")
     print("=" * 60)
 
+    for difficulty in ["easy", "medium", "hard"]:
+        generator = DatasetGenerator(seed=42)
+        result = generator.generate(difficulty=difficulty)
         stats = result["stats"]
+        protocol = result["protocol"]
+        print(f"\n {difficulty.upper()}:")
+        print(f"   Protocol: {stats['protocol_title']}")
         print(f"   Patients: {stats['total_patients']}")
+        print(
+            f"   Errors: {stats['total_errors']} "
+            f"(age={stats['age_errors']}, temporal={stats['temporal_errors']}, "
+            f"window={stats['protocol_window_errors']}, bias={stats['bias_mode']})"
+        )
         print(f"   Traps: {stats['num_traps']}")
         print(f"   Control: {stats['control_count']}")
         print(f"   Treatment: {stats['treatment_count']}")
+        print(
+            f"   Rules: age={protocol['age_min']}-{protocol['age_max']} | "
+            f"start<={protocol['treatment_window_days']}d | "
+            f"stage IV<={protocol['stage_iv_treatment_window_days']}d"
+        )
 
+        generator_2 = DatasetGenerator(seed=42)
+        result_2 = generator_2.generate(difficulty=difficulty)
+        assert result["dataset"] == result_2["dataset"], "REPRODUCIBILITY FAILED!"
+        assert result["ground_truth"] == result_2["ground_truth"], "GROUND TRUTH MISMATCH!"
+        assert result["protocol_excerpt"] == result_2["protocol_excerpt"], "PROTOCOL MISMATCH!"
+        print("   ✓ Seed reproducibility verified")
+
+        for patient_id, errors in result["ground_truth"].items():
+            patient = next(p for p in result["dataset"] if p["patient_id"] == patient_id)
+            for error in errors:
+                if error == "invalid_age":
                     age = patient.get("age")
+                    assert age is None or age < protocol["age_min"] or age > protocol["age_max"], (
+                        f"Ground truth says {patient_id} invalid age but age={age}"
+                    )
+                elif error == "temporal_inconsistency":
+                    treatment_start = datetime.strptime(patient["treatment_start"], "%Y-%m-%d")
+                    death_date = datetime.strptime(patient["death_date"], "%Y-%m-%d")
+                    assert death_date < treatment_start, (
+                        f"Ground truth says {patient_id} temporal error but dates are valid"
+                    )
+                elif error == "protocol_window_violation":
+                    enrollment = datetime.strptime(patient["enrollment_date"], "%Y-%m-%d")
+                    treatment_start = datetime.strptime(patient["treatment_start"], "%Y-%m-%d")
+                    allowed = (
+                        protocol["stage_iv_treatment_window_days"]
+                        if patient["stage"] == "IV"
+                        else protocol["treatment_window_days"]
+                    )
+                    assert (treatment_start - enrollment).days > allowed, (
+                        f"Ground truth says {patient_id} window error but delay is valid"
+                    )
+        print("   ✓ Ground truth integrity verified")
+
+    generator_a = DatasetGenerator(seed=1)
+    generator_b = DatasetGenerator(seed=2)
+    result_a = generator_a.generate("easy")
+    result_b = generator_b.generate("easy")
+    assert result_a["dataset"] != result_b["dataset"], "Different seeds generated identical datasets!"
+    assert result_a["protocol_excerpt"] != result_b["protocol_excerpt"], "Different seeds generated identical protocols!"
+    print("\n ✓ Different seeds produce different datasets")
     print(f"\n{'=' * 60}")
+    print(" ALL TESTS PASSED")
     print(f"{'=' * 60}")
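The validation harness above hinges on one contract: a `DatasetGenerator` built with the same seed must replay the identical dataset, while different seeds must diverge. A minimal dependency-free sketch of that contract, assuming the standard per-instance-RNG pattern (`TinyGenerator` and its fields are illustrative stand-ins, not the repo's `DatasetGenerator`):

```python
import random


class TinyGenerator:
    """Illustrative stand-in for a seeded dataset generator."""

    def __init__(self, seed: int):
        # Each instance owns its own RNG, so episodes never share state
        # and module-level random.* calls can't break reproducibility.
        self.rng = random.Random(seed)

    def generate(self, n: int = 5) -> list[dict]:
        # Every draw goes through self.rng, so a fixed seed
        # replays the exact same dataset.
        return [
            {"patient_id": f"P{i:03d}", "age": self.rng.randint(18, 85)}
            for i in range(n)
        ]


a = TinyGenerator(seed=42).generate()
b = TinyGenerator(seed=42).generate()
c = TinyGenerator(seed=7).generate()
assert a == b   # same seed -> identical dataset
assert a != c   # different seed -> divergent dataset
```

The same design shows up in the real generator: all stochastic choices (`self.rng.shuffle`, `self.rng.choice`, `self.rng.uniform`) route through the instance RNG, which is what makes the `seed=42` reproducibility assertions in the `__main__` block hold.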
server/models.py CHANGED
@@ -2,6 +2,7 @@ from typing import Optional, List, Dict, Any
 from pydantic import Field
 from openenv.core.env_server import Action, Observation, State
 
+
 class AuditAction(Action):
     action_type: str = "flag_error"
     patient_id: Optional[str] = None
@@ -12,30 +13,43 @@ class AuditAction(Action):
     report: Optional[str] = None
     confidence: Optional[float] = None  # 0.0-1.0: agent's confidence in this action
 
+
 class AuditObservation(Observation):
     done: bool = False
     reward: float = 0.0
     task_id: str = ""
     task_type: str = ""
     task_description: str = ""
+    protocol_title: str = ""
+    trial_protocol_excerpt: str = ""
     dataset: List[Dict[str, Any]] = Field(default_factory=list)
     errors_found: List[str] = Field(default_factory=list)
     patterns_investigated: List[str] = Field(default_factory=list)
     distributions_computed: List[str] = Field(default_factory=list)
     feedback: Optional[str] = None
     score_so_far: float = 0.0
+    dense_reward_total: float = 0.0
+    score_breakdown: Dict[str, float] = Field(default_factory=dict)
     attempts_remaining: int = 15
     phase: str = "investigation"
 
+
 class AuditState(State):
     episode_id: str = ""
     step_count: int = 0
     task_id: str = ""
     task_type: str = ""
+    protocol_title: str = ""
+    trial_protocol_excerpt: str = ""
     total_errors: int = 0
     errors_found: int = 0
     current_score: float = 0.0
+    dense_reward_total: float = 0.0
+    correct_flags: int = 0
+    false_positives: int = 0
+    duplicate_flags: int = 0
     attempts: int = 0
     phase: str = "investigation"
+    score_breakdown: Dict[str, float] = Field(default_factory=dict)
     patterns_investigated: List[str] = Field(default_factory=list)
-    distributions_computed: List[str] = Field(default_factory=list)
+    distributions_computed: List[str] = Field(default_factory=list)
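The list- and dict-valued fields on these models all use `Field(default_factory=...)` rather than a literal `[]`/`{}` default, so each episode's observation and state get their own fresh containers. Pydantic's `Field(default_factory=...)` mirrors the stdlib `dataclasses` pattern; a dependency-free sketch of why that matters (`AuditObservationSketch` is a hypothetical simplified stand-in, using `dataclasses` instead of pydantic):

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AuditObservationSketch:
    # field(default_factory=...) here plays the role of pydantic's
    # Field(default_factory=...) in AuditObservation.
    score_so_far: float = 0.0
    score_breakdown: Dict[str, float] = field(default_factory=dict)
    errors_found: List[str] = field(default_factory=list)


obs_a = AuditObservationSketch()
obs_b = AuditObservationSketch()
obs_a.errors_found.append("invalid_age")

# Each instance gets a fresh list/dict, so mutating one
# observation never leaks into another episode's observation.
assert obs_b.errors_found == []
assert obs_a.score_breakdown is not obs_b.score_breakdown
```

A shared mutable default (e.g. `errors_found: List[str] = []` in a plain class) would leak flags across episodes; the factory pattern rules that out by construction.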