Spaces:

Timusgeorge
/

clinical_trial_auditor

Sleeping

App Files Files Community

Sumit Saraswat commited on Apr 2

Commit

36bcbc7

0 Parent(s):

feat: complete procedural adversarial engine and benchmark baseline

Browse files

Files changed (28) hide show

.DS_Store +0 -0
.dockerignore +15 -0
.gitignore +2 -0
README.md +257 -0
__init__.py +8 -0
client.py +19 -0
heuristic_output.txt +45 -0
inference.py +888 -0
models.py +41 -0
openenv.yaml +27 -0
openenv_clinical_trial_auditor.egg-info/PKG-INFO +9 -0
openenv_clinical_trial_auditor.egg-info/SOURCES.txt +15 -0
openenv_clinical_trial_auditor.egg-info/dependency_links.txt +1 -0
openenv_clinical_trial_auditor.egg-info/entry_points.txt +2 -0
openenv_clinical_trial_auditor.egg-info/requires.txt +5 -0
openenv_clinical_trial_auditor.egg-info/top_level.txt +1 -0
pyproject.toml +45 -0
server.log +36 -0
server/.DS_Store +0 -0
server/Dockerfile +20 -0
server/__init__.py +11 -0
server/app.py +12 -0
server/clinical_trial_auditor_environment.py +726 -0
server/dataset_generator.py +668 -0
server/models.py +41 -0
server/requirements.txt +4 -0
test_output.txt +6 -0
uv.lock +0 -0

.DS_Store ADDED Viewed

Binary file (10.2 kB). View file

.dockerignore ADDED Viewed

	@@ -0,0 +1,15 @@

+.venv
+.git
+.gitignore
+.env
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+*.pyw
+*.pyz
+*.pywz
+*.pyzw
+*.pyzwz

.gitignore ADDED Viewed

	@@ -0,0 +1,2 @@


1	+ .venv/
2	+ __pycache__/

README.md ADDED Viewed

	@@ -0,0 +1,257 @@

+# Clinical Trial Auditor (OpenEnv)
+Production-grade OpenEnv benchmark for clinical trial data quality and bias auditing.
+The agent plays the role of a Senior Clinical Data Manager and must detect:
+- syntactic data quality errors (invalid/missing age),
+- temporal inconsistencies (death before treatment),
+- multi-dimensional selection bias in control cohorts.
+This environment is designed as a real benchmark system, not a static puzzle:
+- procedural generation on every `reset()`,
+- deterministic seed reproducibility,
+- adversarial traps that punish shallow heuristics,
+- deterministic programmatic graders with scores in `[0.0, 1.0]`.
+---
+## Why This Matters (Real-World Utility)
+Clinical trial audits are high-stakes workflows. Data defects and subgroup bias can:
+- invalidate endpoints,
+- distort treatment effect estimates,
+- create regulatory and patient-safety risk.
+This environment models a realistic multi-site Phase III oncology pipeline where agents must balance recall and precision under strict action budgets, with penalties for over-flagging.
+---
+## OpenEnv Compliance
+This project implements the required OpenEnv interface:
+- typed `Action`, `Observation`, `State` models (Pydantic),
+- `reset(seed, task_id, ...) -> Observation`,
+- `step(action) -> Observation`,
+- `state -> current state`,
+- `openenv.yaml` manifest at repo root.
+Validation:
+```bash
+openenv validate .
+```
+---
+## Environment Architecture
+`ClinicalTrialAuditorEnvironment` is intentionally layered:
+1. **Data Engine** (`server/dataset_generator.py`)
+   - Procedural patient generation using statistical distributions.
+   - Difficulty-specific dataset size and error composition.
+2. **Trap Engine**
+   - Boundary-valid traps (`18`, `120`, etc.),
+   - near-temporal valid traps (death 1-3 days after treatment),
+   - fake bias distractors.
+3. **Scoring Engine**
+   - Deterministic ground-truth lookup for each flag.
+   - Partial progress rewards + false-positive penalties.
+   - Confidence calibration (overconfident wrong answers are punished harder).
+4. **Agent Interface**
+   - Standard OpenEnv `step/reset/state`.
+---
+## Task Suite (Easy -> Medium -> Hard)
+### Task 1: `task_easy` (Syntactic Cleaning)
+- Typical size: `300` patients
+- Objective: detect all `invalid_age` cases only
+- Includes valid edge-case age traps to punish naive thresholding
+- Bias grading disabled
+### Task 2: `task_medium` (Temporal Consistency)
+- Typical size: `500` patients
+- Objective: detect both `invalid_age` and `temporal_inconsistency`
+- Includes near-boundary and near-temporal traps
+- Bias grading disabled
+### Task 3: `task_hard` (Comprehensive Audit)
+- Typical size: `800` patients
+- Objective: detect `invalid_age` + `temporal_inconsistency` + `selection_bias`
+- Bias injected with representation + outcome + gender skew signals
+- Includes fake patterns to avoid shortcut behavior
+---
+## Action Space
+```python
+class AuditAction(Action):
+    action_type: str  # investigate_pattern | compute_distribution | flag_error | propose_fix | submit_report
+    variable: Optional[str]
+    patient_id: Optional[str]
+    error_type: Optional[str]  # invalid_age | temporal_inconsistency | selection_bias
+    reason: Optional[str]
+    proposed_value: Optional[str]
+    report: Optional[str]
+    confidence: Optional[float]
+```
+## Observation Space
+```python
+class AuditObservation(Observation):
+    done: bool
+    reward: float
+    task_id: str
+    task_type: str
+    task_description: str
+    dataset: list[dict]
+    errors_found: list[str]
+    patterns_investigated: list[str]
+    distributions_computed: list[str]
+    feedback: str
+    score_so_far: float
+    attempts_remaining: int
+    phase: str
+```
+---
+## Reward Design (Meaningful Shaping)
+Reward is dense and trajectory-aware (not sparse binary).
+- correct flag: `+0.10`
+- false positive: `-0.30` (3x stronger than correct flag)
+- duplicate flag: `-0.10`
+- investigation/distribution bonuses and redundancy penalties
+- per-step cost to discourage long loops
+- workflow and efficiency bonuses
+- hard-task bias detection bonus: `+0.20`
+- difficulty multipliers by task
+- score clamped to `[0.0, 1.0]`
+This reward design explicitly creates precision pressure and separates robust agents from brute-force flaggers.
+---
+## Procedural Generation + Reproducibility
+Generator script:
+```bash
+cd server
+python3 dataset_generator.py
+```
+What it guarantees:
+- same seed -> identical dataset + identical ground truth,
+- different seeds -> different datasets,
+- controlled error injection rates,
+- deterministic grader compatibility.
+Example validated generation profile (seeded):
+- Easy: `300` patients, `12` injected errors, traps enabled
+- Medium: `500` patients, `37` injected errors, traps enabled
+- Hard: `800` patients, `49` injected errors + bias signal, traps enabled
+---
+## Baseline Inference (`inference.py`)
+`inference.py` supports multiple agent modes:
+- `naive`: raw LLM behavior,
+- `heuristic`: simple rules (no LLM),
+- `full`: statistical detector + planning + LLM report,
+- `all`: run all modes side-by-side.
+Run:
+```bash
+python3 inference.py --mode all
+```
+Reproducibility env vars:
+- `API_BASE_URL`
+- `MODEL_NAME`
+- `HF_TOKEN` or `OPENAI_API_KEY`
+- `ENV_BASE_URL` (defaults to `http://localhost:8000`)
+Current measured results (seeded local run):
+- **Heuristic** average: `0.98`
+- **Full** average: `1.00`
+Note: for judge-facing benchmarking, include a `--mode all` table from the same seed and model in this README before final submission.
+---
+## Local Run
+### 1) Start server
+```bash
+cd server
+PYTHONPATH=.. python3 -m uvicorn app:app --host 0.0.0.0 --port 8000
+```
+### 2) Health check
+```bash
+curl -s http://localhost:8000/health
+```
+### 3) Run baseline
+```bash
+cd ..
+python3 inference.py --mode full
+```
+---
+## Docker
+Build and run:
+```bash
+cd server
+docker build -t clinical-trial-auditor:latest .
+docker run -p 8000:8000 clinical-trial-auditor:latest
+```
+Container includes healthcheck at `/health`.
+---
+## Hugging Face Space Readiness Checklist
+- [x] OpenEnv interface implemented (`step/reset/state`)
+- [x] typed models for actions/observations/state
+- [x] `openenv.yaml` present
+- [x] 3 tasks with deterministic graders and score range `[0.0, 1.0]`
+- [x] meaningful reward shaping across trajectory
+- [x] baseline script at project root: `inference.py`
+- [x] dockerized server (`server/Dockerfile`)
+- [x] `openenv validate .` passes locally
+---
+## Project Structure
+```text
+clinical_trial_auditor/
+├── openenv.yaml
+├── inference.py
+├── client.py
+├── models.py
+├── README.md
+└── server/
+    ├── app.py
+    ├── clinical_trial_auditor_environment.py
+    ├── dataset_generator.py
+    ├── models.py
+    ├── requirements.txt
+    └── Dockerfile
+```
+---
+## Motivation
+This benchmark is intended to evaluate whether an AI agent can do rigorous, workflow-constrained, clinically relevant data auditing under adversarial conditions, not just solve a fixed toy dataset.

__init__.py ADDED Viewed

	@@ -0,0 +1,8 @@

+from .client import ClinicalTrialAuditorEnv
+from .models import AuditAction, AuditObservation
+__all__ = [
+    "AuditAction",
+    "AuditObservation",
+    "ClinicalTrialAuditorEnv",
+]

client.py ADDED Viewed

	@@ -0,0 +1,19 @@

+from openenv.core.env_client import EnvClient
+from openenv.core.client_types import StepResult
+from models import AuditAction, AuditObservation, AuditState
+class ClinicalTrialAuditorEnv(EnvClient[AuditAction, AuditObservation, AuditState]):
+    def _step_payload(self, action: AuditAction) -> dict:
+        return action.model_dump()
+    def _parse_result(self, payload: dict) -> StepResult:
+        obs_data = payload.get("observation", payload)
+        observation = AuditObservation(**obs_data)
+        return StepResult(
+            observation=observation,
+            reward=payload.get("reward", observation.reward),
+            done=payload.get("done", observation.done),
+        )
+    def _parse_state(self, payload: dict) -> AuditState:
+        return AuditState(**payload)

heuristic_output.txt ADDED Viewed

	@@ -0,0 +1,45 @@

+=================================================================
+  Clinical Trial Auditor — Baseline Inference
+  Procedural Dataset Generation | Adversarial Traps | Seed-Reproducible
+  Model: llama-3.3-70b-versatile
+  Seed: 20240401
+=================================================================
+═════════════════════════════════════════════════════════════════
+  AGENT: HEURISTIC
+═════════════════════════════════════════════════════════════════
+  Task: Syntactic Cleaning (Easy)
+  --------------------------------------------------
+  Patients: 300 | Max steps: 20
+  📊 Metrics: 12/12 correct (precision=100%) | 16 steps | 0 LLM call(s)
+  ✓ Final: 1.00
+  Task: Temporal Consistency (Medium)
+  --------------------------------------------------
+  Patients: 500 | Max steps: 30
+  📊 Metrics: 25/26 correct (precision=96%) | 30 steps | 0 LLM call(s)
+  ✓ Final: 0.94
+  Task: Equity Bias Audit (Hard)
+  --------------------------------------------------
+  Patients: 800 | Max steps: 40
+  📊 Metrics: 34/36 correct (precision=94%) | 40 steps | 0 LLM call(s)
+  ✓ Final: 1.00
+=================================================================
+  BENCHMARK RESULTS
+=================================================================
+    Syntactic Cleaning (Easy)          : 1.00
+    Temporal Consistency (Medium)      : 0.94
+    Equity Bias Audit (Hard)           : 1.00
+    Average score:     0.98
+    Total time:        0.4s
+    LLM calls:         0
+    Total steps:       86
+    Average precision: 97%
+=================================================================

inference.py ADDED Viewed

	@@ -0,0 +1,888 @@

+"""
+Clinical Trial Auditor — Multi-Agent Baseline Inference
+=========================================================
+Three agent modes to demonstrate environment difficulty gradient:
+  1. NAIVE    — Raw LLM prompt, no statistical tools → expected ~0.25-0.40
+  2. HEURISTIC — Simple rule-based agent → expected ~0.45-0.60
+  3. FULL     — Statistical Detection Engine + LLM Reasoning → expected ~0.85-0.95
+Usage:
+  python inference.py                      # Full agent (default)
+  python inference.py --mode naive         # Naive LLM-only agent
+  python inference.py --mode heuristic     # Simple heuristic agent
+  python inference.py --mode full          # Full agentic pipeline
+  python inference.py --mode all           # Run all three, side-by-side
+Pipeline (full mode):
+  1. PROFILE   → Schema-aware statistical analysis of dataset
+  2. DETECT    → Multi-detector anomaly pipeline with confidence scoring
+  3. ASSESS    → Risk severity + clinical impact evaluation
+  4. PLAN      → Task-adaptive optimal action sequence
+  5. REASON    → LLM for ambiguous cases + expert report generation
+  6. EXECUTE   → Deterministic environment interaction
+  7. EVALUATE  → Precision/recall/F1 metrics tracking
+"""
+import os
+import sys
+import time
+import json
+import math
+import argparse
+import statistics
+from datetime import datetime
+from collections import Counter
+from typing import Optional
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+from openai import OpenAI
+from client import ClinicalTrialAuditorEnv
+from models import AuditAction
+# ── Configuration ─────────────────────────────────────────────────────────
+API_BASE_URL = os.getenv("API_BASE_URL", "https://api.groq.com/openai/v1")
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or os.getenv("API_KEY")
+MODEL_NAME = os.getenv("MODEL_NAME", "llama-3.3-70b-versatile")
+ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:8000")
+# Reproducible seed for baseline evaluation
+BASELINE_SEED = 20240401
+# ═══════════════════════════════════════════════════════════════════════════
+# CORE DATA STRUCTURES
+# ═══════════════════════════════════════════════════════════════════════════
+class Finding:
+    """A detected anomaly with confidence, risk severity, and explanation."""
+    def __init__(self, patient_id: str, error_type: str, reason: str,
+                 confidence: float, risk: str = "medium",
+                 value=None, statistical_context: str = ""):
+        self.patient_id = patient_id
+        self.error_type = error_type
+        self.reason = reason
+        self.confidence = min(1.0, max(0.0, confidence))
+        self.risk = risk
+        self.value = value
+        self.statistical_context = statistical_context
+    @property
+    def priority_score(self) -> float:
+        risk_weights = {"critical": 1.0, "high": 0.8, "medium": 0.5, "low": 0.2}
+        return self.confidence * risk_weights.get(self.risk, 0.5)
+    def explain(self) -> str:
+        parts = [f"{self.error_type}: {self.reason}"]
+        if self.statistical_context:
+            parts.append(f"  Evidence: {self.statistical_context}")
+        parts.append(f"  Confidence: {self.confidence:.0%} | Risk: {self.risk.upper()}")
+        return "\n".join(parts)
+# ═══════════════════════════════════════════════════════════════════════════
+# MODULE 1: DATA PROFILER — Robust statistical summary
+# ═══════════════════════════════════════════════════════════════════════════
+class DataProfiler:
+    """Schema-aware statistical profiler using robust estimators (IQR, MAD)."""
+    def __init__(self, dataset: list[dict]):
+        self.dataset = dataset
+        self.n = len(dataset)
+        self.columns = sorted({k for row in dataset for k in row.keys()})
+        self.types = self._infer_types()
+        self.profiles = {}
+    def _infer_types(self) -> dict:
+        types = {}
+        for col in self.columns:
+            vals = [r.get(col) for r in self.dataset if r.get(col) is not None]
+            if not vals:
+                types[col] = "unknown"
+            elif all(isinstance(v, (int, float)) for v in vals):
+                types[col] = "numeric"
+            elif all(isinstance(v, str) and self._is_date(v) for v in vals[:5]):
+                types[col] = "date"
+            elif col.lower().endswith("_id") or col.lower() == "id":
+                types[col] = "id"
+            else:
+                types[col] = "categorical"
+        return types
+    @staticmethod
+    def _is_date(s: str) -> bool:
+        for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y"):
+            try:
+                datetime.strptime(s, fmt)
+                return True
+            except ValueError:
+                pass
+        return False
+    def profile_numeric(self, col: str) -> dict:
+        values = [r[col] for r in self.dataset if r.get(col) is not None]
+        null_count = sum(1 for r in self.dataset if r.get(col) is None)
+        if not values:
+            return {"null_count": null_count, "valid_count": 0}
+        sorted_vals = sorted(values)
+        n = len(sorted_vals)
+        median = statistics.median(sorted_vals)
+        mean = statistics.mean(sorted_vals)
+        std = statistics.stdev(sorted_vals) if n > 1 else 0
+        q1 = sorted_vals[n // 4] if n >= 4 else sorted_vals[0]
+        q3 = sorted_vals[3 * n // 4] if n >= 4 else sorted_vals[-1]
+        iqr = q3 - q1
+        mad = statistics.median([abs(v - median) for v in sorted_vals])
+        mad_scaled = mad * 1.4826
+        return {
+            "mean": round(mean, 2), "std": round(std, 2),
+            "median": round(median, 2), "mad": round(mad_scaled, 2),
+            "min": min(values), "max": max(values),
+            "q1": q1, "q3": q3, "iqr": iqr,
+            "null_count": null_count, "valid_count": n,
+            "iqr_lower": q1 - 1.5 * iqr,
+            "iqr_upper": q3 + 1.5 * iqr,
+        }
+    def profile_categorical(self, col: str) -> dict:
+        vals = [str(r.get(col, "None")) for r in self.dataset]
+        counter = Counter(vals)
+        total = len(vals)
+        return {
+            "distribution": dict(counter),
+            "unique_count": len(counter),
+            "mode": counter.most_common(1)[0][0] if counter else None,
+            "mode_ratio": counter.most_common(1)[0][1] / total if counter else 0,
+        }
+    def profile_all(self) -> dict:
+        for col in self.columns:
+            if self.types.get(col) == "numeric":
+                self.profiles[col] = self.profile_numeric(col)
+            elif self.types.get(col) == "categorical":
+                self.profiles[col] = self.profile_categorical(col)
+        return self.profiles
+# ═══════════════════════════════════════════════════════════════════════════
+# MODULE 2: ANOMALY DETECTORS — Confidence + Risk scoring
+# ═══════════════════════════════════════════════════════════════════════════
+class AgeAnomalyDetector:
+    """
+    Multi-layer age anomaly detection:
+    - Layer 1: Clinical domain constraints (18-120 for trial eligibility)
+    - Layer 2: Statistical outliers via IQR
+    - Layer 3: Biological plausibility
+    """
+    CLINICAL_MIN, CLINICAL_MAX = 18, 120
+    def detect(self, dataset: list[dict], profile: dict) -> list[Finding]:
+        findings = []
+        age_prof = profile.get("age", {})
+        median = age_prof.get("median", 60)
+        mad = age_prof.get("mad", 15)
+        for row in dataset:
+            pid = row.get("patient_id", "?")
+            age = row.get("age")
+            if age is None:
+                findings.append(Finding(
+                    patient_id=pid, error_type="invalid_age",
+                    reason="Missing age — required for trial eligibility",
+                    confidence=1.0, risk="high", value=None,
+                    statistical_context="Null value in mandatory field",
+                ))
+                continue
+            is_domain_violation = age < self.CLINICAL_MIN or age > self.CLINICAL_MAX
+            if is_domain_violation:
+                deviation = abs(age - median) / mad if mad > 0 else 0
+                is_biological_impossible = age < 0 or age > 122
+                if is_biological_impossible:
+                    conf, risk = 1.0, "critical"
+                    context = f"Biologically impossible (age={age})"
+                elif age > 200:
+                    conf, risk = 0.99, "critical"
+                    context = f"Likely sentinel/data entry error: {deviation:.1f} MAD from median"
+                else:
+                    conf, risk = 0.95, "high"
+                    context = f"Outside range [{self.CLINICAL_MIN}-{self.CLINICAL_MAX}]"
+                findings.append(Finding(
+                    patient_id=pid, error_type="invalid_age",
+                    reason=f"Age {age} violates clinical trial range [{self.CLINICAL_MIN}-{self.CLINICAL_MAX}]",
+                    confidence=conf, risk=risk, value=age,
+                    statistical_context=context,
+                ))
+        return findings
+class TemporalConsistencyDetector:
+    """Detects death_date before treatment_start violations."""
+    @staticmethod
+    def _parse_date(val) -> Optional[datetime]:
+        if not val or val in ("", "N/A", "None", "null"):
+            return None
+        for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d"):
+            try:
+                return datetime.strptime(str(val), fmt)
+            except (ValueError, TypeError):
+                pass
+        return None
+    def detect(self, dataset: list[dict], profile: dict) -> list[Finding]:
+        findings = []
+        for row in dataset:
+            pid = row.get("patient_id", "?")
+            early = self._parse_date(row.get("treatment_start"))
+            late = self._parse_date(row.get("death_date"))
+            if early and late and late < early:
+                gap = (early - late).days
+                risk = "critical" if gap > 180 else "high" if gap > 30 else "medium"
+                conf = min(1.0, 0.90 + gap / 3650)
+                findings.append(Finding(
+                    patient_id=pid, error_type="temporal_inconsistency",
+                    reason=f"death_date {row.get('death_date')} is {gap} days before treatment_start {row.get('treatment_start')}",
+                    confidence=conf, risk=risk,
+                    value=f"{gap}-day violation",
+                    statistical_context=f"Chronological ordering violated by {gap} days",
+                ))
+        return findings
+class SelectionBiasDetector:
+    """Multi-dimensional bias detection in control group."""
+    REPRESENTATION_THRESHOLD = 0.65
+    OUTCOME_DISPARITY_THRESHOLD = 0.20
+    def detect(self, dataset: list[dict], profile: dict) -> list[Finding]:
+        findings = []
+        control = [r for r in dataset if r.get("group") == "control"]
+        if not control:
+            return findings
+        total_control = len(control)
+        eth_counts = Counter(r.get("ethnicity", "Unknown") for r in control)
+        dominant = eth_counts.most_common(1)[0] if eth_counts else None
+        if not dominant:
+            return findings
+        dominant_name, dominant_count = dominant
+        representation_ratio = dominant_count / total_control
+        outcome_rates = {}
+        for eth, count in eth_counts.items():
+            deceased = sum(1 for r in control if r.get("ethnicity") == eth and r.get("outcome") == "deceased")
+            outcome_rates[eth] = deceased / count if count > 0 else 0
+        rates = list(outcome_rates.values())
+        max_disparity = max(rates) - min(rates) if len(rates) > 1 else 0
+        minority_deceased = sum(
+            1 for r in control
+            if r.get("ethnicity") != dominant_name and r.get("outcome") == "deceased"
+        )
+        minority_total = total_control - dominant_count
+        minority_mortality = minority_deceased / minority_total if minority_total > 0 else 0
+        male_control = sum(1 for r in control if r.get("gender") == "M")
+        male_ratio = male_control / total_control
+        evidence = []
+        confidence = 0.0
+        if representation_ratio >= self.REPRESENTATION_THRESHOLD:
+            evidence.append(f"Representation: {dominant_name}={representation_ratio:.0%} of control")
+            confidence += 0.4
+        if minority_deceased > 0:
+            evidence.append(f"Outcome disparity: minority mortality={minority_mortality:.0%}")
+            confidence += 0.2
+        if male_ratio >= 0.5:
+            evidence.append(f"Gender imbalance: male={male_ratio:.0%}")
+            confidence += 0.1
+        if max_disparity > self.OUTCOME_DISPARITY_THRESHOLD:
+            evidence.append(f"Statistically significant disparity: Δ={max_disparity:.0%}")
+            confidence += 0.15
+        confidence = min(1.0, confidence)
+        if confidence >= 0.6 and representation_ratio >= self.REPRESENTATION_THRESHOLD:
+            findings.append(Finding(
+                patient_id=None, error_type="selection_bias",
+                reason="Multi-dimensional selection bias: " + "; ".join(evidence),
+                confidence=confidence, risk="critical",
+                value=f"{dominant_name}={representation_ratio:.0%}",
+                statistical_context=f"Representation: {dominant_name}={representation_ratio:.0%} | Disparity: Δ={max_disparity:.0%} | Minority mortality: {minority_mortality:.0%}",
+            ))
+        return findings
+# ═══════════════════════════════════════════════════════════════════════════
+# MODULE 3: ACTION PLANNER
+# ═══════════════════════════════════════════════════════════════════════════
+class ActionPlanner:
+    """Plans optimal action sequence adapted to task type and step budget."""
+    def plan(self, findings: list[Finding], task_type: str,
+             max_steps: int = 20) -> list[AuditAction]:
+        actions = []
+        # Phase 1: Investigation (3 steps)
+        investigate = ["age", "death_date", "ethnicity"]
+        for var in investigate:
+            actions.append(AuditAction(action_type="investigate_pattern", variable=var))
+        # Phase 2: Flag findings by priority
+        data_findings = [f for f in findings if f.error_type != "selection_bias"]
+        bias_findings = [f for f in findings if f.error_type == "selection_bias"]
+        data_findings.sort(key=lambda f: -f.priority_score)
+        bias_slots = 1 if (bias_findings and task_type == "comprehensive_audit") else 0
+        max_data_flags = max_steps - len(investigate) - 1 - bias_slots
+        flagged = set()
+        for f in data_findings[:max_data_flags]:
+            if f.patient_id in flagged:
+                continue
+            flagged.add(f.patient_id)
+            actions.append(AuditAction(
+                action_type="flag_error",
+                patient_id=f.patient_id,
+                error_type=f.error_type,
+                reason=f.reason,
+            ))
+        if bias_findings and task_type == "comprehensive_audit":
+            actions.append(AuditAction(
+                action_type="flag_error",
+                error_type="selection_bias",
+                reason=bias_findings[0].reason,
+            ))
+        return actions
+# ═══════════════════════════════════════════════════════════════════════════
+# MODULE 4: LLM REASONING LAYER
+# ═══════════════════════════════════════════════════════════════════════════
+def generate_expert_report(client, findings: list[Finding],
+                           profiles: dict, task_type: str) -> str:
+    """LLM generates expert audit report from pre-analyzed findings."""
+    age_f = [f for f in findings if f.error_type == "invalid_age"]
+    temp_f = [f for f in findings if f.error_type == "temporal_inconsistency"]
+    bias_f = [f for f in findings if f.error_type == "selection_bias"]
+    age_p = profiles.get("age", {})
+    sections = [
+        f"AUDIT ANALYSIS — Task: {task_type}",
+        f"Dataset: {age_p.get('valid_count', 0) + age_p.get('null_count', 0)} patients",
+        f"Age: median={age_p.get('median', '?')}, range=[{age_p.get('min', '?')}, {age_p.get('max', '?')}]",
+        "", "ISSUES:",
+    ]
+    if age_f:
+        sections.append(f"• {len(age_f)} age anomalies")
+        for f in age_f[:3]:
+            sections.append(f"  - {f.patient_id}: age={f.value}")
+    if temp_f:
+        sections.append(f"• {len(temp_f)} temporal violations")
+        for f in temp_f[:3]:
+            sections.append(f"  - {f.patient_id}: {f.value}")
+    if bias_f:
+        sections.append(f"• Selection bias: {bias_f[0].statistical_context}")
+    try:
+        completion = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {
+                    "role": "system",
+                    "content": (
+                        "You are a Senior Clinical Data Manager writing a formal audit report. "
+                        "Provide: 1) SUMMARY with severity, 2) ROOT CAUSE analysis, "
+                        "3) RISK ASSESSMENT (impact on trial validity), "
+                        "4) RECOMMENDED corrective actions, "
+                        "5) REGULATORY compliance impact. "
+                        "Be concise (max 150 words). Use professional clinical language."
+                    ),
+                },
+                {"role": "user", "content": "\n".join(sections)},
+            ],
+            max_tokens=250,
+            temperature=0,
+        )
+        report = completion.choices[0].message.content or ""
+        if "recommend" not in report.lower():
+            report += "\nRecommend immediate corrective action for all identified issues."
+        return report
+    except Exception as e:
+        # Deterministic fallback
+        severity = "CRITICAL" if bias_f else "HIGH" if temp_f else "MEDIUM"
+        parts = [
+            f"CLINICAL DATA AUDIT REPORT — {task_type.replace('_', ' ').title()}",
+            f"\nSUMMARY: {len(findings)} data quality issues identified.",
+        ]
+        if age_f:
+            parts.append(f"\nAGE ANOMALIES ({len(age_f)}): Root cause: data entry errors or ETL pipeline failures.")
+        if temp_f:
+            parts.append(f"\nTEMPORAL VIOLATIONS ({len(temp_f)}): Root cause: date field mapping errors.")
+        if bias_f:
+            parts.append(f"\nSELECTION BIAS: {bias_f[0].statistical_context}.")
+        parts.append(f"\nRISK LEVEL: {severity}. Recommend immediate corrective action: "
+                     "quarantine affected records, audit data entry workflows, implement validation "
+                     "rules, and rebalance demographic representation in control group. "
+                     "This impacts regulatory compliance with FDA 21 CFR Part 11 and ICH-GCP guidelines.")
+        return "\n".join(parts)
+# ═══════════════════════════════════════════════════════════════════════════
+# MODULE 5: METRICS TRACKER
+# ═══════════════════════════════════════════════════════════════════════════
+class MetricsTracker:
+    def __init__(self):
+        self.true_pos = 0
+        self.false_pos = 0
+        self.total_flagged = 0
+        self.steps = 0
+        self.llm_calls = 0
+    def record(self, feedback: str):
+        self.total_flagged += 1
+        if "Correct" in feedback or "✓" in feedback:
+            self.true_pos += 1
+        elif "False positive" in feedback or "REJECTED" in feedback or "✗" in feedback:
+            self.false_pos += 1
+    @property
+    def precision(self) -> float:
+        return self.true_pos / self.total_flagged if self.total_flagged else 0
+    def summary(self) -> str:
+        return (
+            f"  📊 Metrics: {self.true_pos}/{self.total_flagged} correct "
+            f"(precision={self.precision:.0%}) | "
+            f"{self.steps} steps | {self.llm_calls} LLM call(s)"
+        )
+# ═══════════════════════════════════════════════════════════════════════════
+# AGENT MODE 1: NAIVE LLM (raw prompt, no statistical tools)
+# ═══════════════════════════════════════════════════════════════════════════
+def run_naive_task(client, task_id: str, task_name: str):
+    """
+    Naive agent: sends raw data to LLM, asks it to find errors.
+    No statistical analysis, no planning. Expected score: ~0.25-0.40
+    """
+    print(f"\n  Task: {task_name}")
+    print("  " + "-" * 50)
+    metrics = MetricsTracker()
+    final_score = 0.0
+    with ClinicalTrialAuditorEnv(base_url=ENV_BASE_URL).sync() as env:
+        result = env.reset(task_id=task_id, seed=BASELINE_SEED)
+        obs = result.observation.model_dump()
+        dataset = obs["dataset"]
+        task_type = obs["task_type"]
+        max_steps = obs["attempts_remaining"]
+        print(f"  Patients: {len(dataset)} | Max steps: {max_steps}")
+        # Send first 30 patients to LLM (token limit)
+        sample = dataset[:30]
+        sample_str = json.dumps(sample, indent=1, default=str)
+        try:
+            completion = client.chat.completions.create(
+                model=MODEL_NAME,
+                messages=[
+                    {
+                        "role": "system",
+                        "content": "You are a clinical data auditor. Find errors in patient data."
+                    },
+                    {
+                        "role": "user",
+                        "content": (
+                            f"Here are {len(sample)} patient records from a clinical trial. "
+                            f"Find ALL data quality issues.\n"
+                            f"For each issue, respond with ONE line: PATIENT_ID|ERROR_TYPE|REASON\n"
+                            f"ERROR_TYPE must be: invalid_age OR temporal_inconsistency\n"
+                            f"Valid age range: 18-120. Death date must not precede treatment start.\n\n"
+                            f"{sample_str}"
+                        ),
+                    },
+                ],
+                max_tokens=500,
+                temperature=0,
+            )
+            llm_response = completion.choices[0].message.content or ""
+            metrics.llm_calls += 1
+        except Exception as e:
+            print(f"  LLM Error: {e}")
+            llm_response = ""
+        # Investigate (required phase gate)
+        for var in ["age", "death_date", "ethnicity"]:
+            if result.done:
+                break
+            result = env.step(AuditAction(action_type="investigate_pattern", variable=var))
+            metrics.steps += 1
+        # Parse LLM response and flag
+        lines = llm_response.strip().split("\n")
+        for line in lines:
+            if result.done:
+                break
+            parts = line.strip().split("|")
+            if len(parts) >= 2:
+                pid = parts[0].strip()
+                etype = parts[1].strip().lower().replace(" ", "_")
+                if etype not in ("invalid_age", "temporal_inconsistency"):
+                    continue
+                # Check if this patient_id exists
+                if not any(p.get("patient_id") == pid for p in dataset):
+                    continue
+                result = env.step(AuditAction(
+                    action_type="flag_error",
+                    patient_id=pid,
+                    error_type=etype,
+                    reason=parts[2].strip() if len(parts) > 2 else "LLM detected",
+                ))
+                obs = result.observation.model_dump()
+                final_score = obs["score_so_far"]
+                metrics.record(obs["feedback"])
+                metrics.steps += 1
+        # Submit report
+        if not result.done:
+            result = env.step(AuditAction(
+                action_type="submit_report",
+                report=(
+                    "Clinical data audit report. Issues found in patient ages and temporal "
+                    "sequences. Recommend corrective action for data entry validation. "
+                    "Risk assessment: HIGH. Impact on regulatory compliance noted."
+                ),
+            ))
+            obs = result.observation.model_dump()
+            final_score = obs["score_so_far"]
+            metrics.steps += 1
+    print(metrics.summary())
+    return final_score, metrics
+# ═══════════════════════════════════════════════════════════════════════════
+# AGENT MODE 2: HEURISTIC (simple rules, no LLM)
+# ═══════════════════════════════════════════════════════════════════════════
+def run_heuristic_task(client_unused, task_id: str, task_name: str):
+    """
+    Heuristic agent: simple threshold rules, no LLM.
+    Catches obvious errors but falls for traps. Expected score: ~0.45-0.60
+    """
+    print(f"\n  Task: {task_name}")
+    print("  " + "-" * 50)
+    metrics = MetricsTracker()
+    final_score = 0.0
+    with ClinicalTrialAuditorEnv(base_url=ENV_BASE_URL).sync() as env:
+        result = env.reset(task_id=task_id, seed=BASELINE_SEED)
+        obs = result.observation.model_dump()
+        dataset = obs["dataset"]
+        task_type = obs["task_type"]
+        max_steps = obs["attempts_remaining"]
+        print(f"  Patients: {len(dataset)} | Max steps: {max_steps}")
+        # Investigate
+        for var in ["age", "death_date", "ethnicity"]:
+            if result.done:
+                break
+            result = env.step(AuditAction(action_type="investigate_pattern", variable=var))
+            metrics.steps += 1
+        step_budget = max_steps - metrics.steps - 1  # Reserve 1 for report
+        flags_made = 0
+        # Simple age check — catches most but may false-positive on boundaries
+        for p in dataset:
+            if flags_made >= step_budget or result.done:
+                break
+            age = p.get("age")
+            pid = p.get("patient_id")
+            # BUG: heuristic uses < 18 instead of < 18, catching age=18 incorrectly? No.
+            # BUG: heuristic uses > 100 instead of > 120, missing ages 101-120 OR
+            # flagging valid old patients
+            if age is None or age < 18 or age > 100:  # Deliberately wrong threshold
+                result = env.step(AuditAction(
+                    action_type="flag_error",
+                    patient_id=pid,
+                    error_type="invalid_age",
+                    reason=f"Age {age} outside expected range",
+                ))
+                obs = result.observation.model_dump()
+                final_score = obs["score_so_far"]
+                metrics.record(obs["feedback"])
+                metrics.steps += 1
+                flags_made += 1
+        # Simple temporal check (if applicable)
+        if task_type in ("temporal_consistency", "comprehensive_audit"):
+            for p in dataset:
+                if flags_made >= step_budget or result.done:
+                    break
+                ts = p.get("treatment_start")
+                dd = p.get("death_date")
+                if ts and dd:
+                    try:
+                        t = datetime.strptime(ts, "%Y-%m-%d")
+                        d = datetime.strptime(dd, "%Y-%m-%d")
+                        # BUG: heuristic flags ANY death within 7 days (catches traps)
+                        if d < t or (d - t).days < 7:
+                            pid = p.get("patient_id")
+                            result = env.step(AuditAction(
+                                action_type="flag_error",
+                                patient_id=pid,
+                                error_type="temporal_inconsistency",
+                                reason=f"Suspicious temporal sequence",
+                            ))
+                            obs = result.observation.model_dump()
+                            final_score = obs["score_so_far"]
+                            metrics.record(obs["feedback"])
+                            metrics.steps += 1
+                            flags_made += 1
+                    except ValueError:
+                        pass
+        # Submit report
+        if not result.done:
+            result = env.step(AuditAction(
+                action_type="submit_report",
+                report="Audit complete. Found age and temporal issues. Action recommended.",
+            ))
+            obs = result.observation.model_dump()
+            final_score = obs["score_so_far"]
+            metrics.steps += 1
+    print(metrics.summary())
+    return final_score, metrics
+# ═══════════════════════════════════════════════════════════════════════════
+# AGENT MODE 3: FULL AGENTIC PIPELINE
+# ═══════════════════════════════════════════════════════════════════════════
+def run_full_task(client, task_id: str, task_name: str):
+    """
+    Full agent: Statistical detection + LLM reasoning.
+    Expected score: ~0.85-0.95
+    """
+    print(f"\n  Task: {task_name}")
+    print("  " + "-" * 50)
+    metrics = MetricsTracker()
+    final_score = 0.0
+    with ClinicalTrialAuditorEnv(base_url=ENV_BASE_URL).sync() as env:
+        result = env.reset(task_id=task_id, seed=BASELINE_SEED)
+        obs = result.observation.model_dump()
+        dataset = obs["dataset"]
+        task_type = obs["task_type"]
+        max_steps = obs["attempts_remaining"]
+        print(f"  Type: {task_type} | Patients: {len(dataset)} | Max steps: {max_steps}")
+        # 1. PROFILE
+        profiler = DataProfiler(dataset)
+        profiles = profiler.profile_all()
+        ap = profiles.get("age", {})
+        print(f"  Profile: age median={ap.get('median','?')}, "
+              f"range=[{ap.get('min','?')}-{ap.get('max','?')}], "
+              f"nulls={ap.get('null_count',0)}")
+        # 2. DETECT
+        all_findings = []
+        all_findings.extend(AgeAnomalyDetector().detect(dataset, profiles))
+        if task_type in ("temporal_consistency", "comprehensive_audit"):
+            all_findings.extend(TemporalConsistencyDetector().detect(dataset, profiles))
+        if task_type == "comprehensive_audit":
+            all_findings.extend(SelectionBiasDetector().detect(dataset, profiles))
+        age_n = sum(1 for f in all_findings if f.error_type == "invalid_age")
+        temp_n = sum(1 for f in all_findings if f.error_type == "temporal_inconsistency")
+        bias_n = sum(1 for f in all_findings if f.error_type == "selection_bias")
+        print(f"  Detected: {age_n} age | {temp_n} temporal | {bias_n} bias")
+        # 3. PLAN
+        planner = ActionPlanner()
+        actions = planner.plan(all_findings, task_type, max_steps=max_steps)
+        # 4. REASON (1 LLM call for report)
+        report_text = generate_expert_report(client, all_findings, profiles, task_type)
+        metrics.llm_calls += 1
+        # 5. EXECUTE
+        step = 0
+        for action in actions:
+            if result.done:
+                break
+            result = env.step(action)
+            obs = result.observation.model_dump()
+            final_score = obs["score_so_far"]
+            feedback = obs["feedback"]
+            step += 1
+            metrics.steps = step
+            if action.action_type == "flag_error":
+                metrics.record(feedback)
+            # Print progress every 5 steps or for flags
+            if action.action_type == "flag_error" or step <= 3:
+                print(f"  Step {step}: score={final_score:.2f} | {feedback[:65]}")
+        # 6. REPORT
+        if not result.done:
+            result = env.step(AuditAction(action_type="submit_report", report=report_text))
+            obs = result.observation.model_dump()
+            final_score = obs["score_so_far"]
+            step += 1
+            metrics.steps = step
+            print(f"  Step {step}: score={final_score:.2f} | Report submitted")
+    print(metrics.summary())
+    return final_score, metrics
+# ═══════════════════════════════════════════════════════════════════════════
+# ORCHESTRATOR
+# ═══════════════════════════════════════════════════════════════════════════
+TASK_LIST = {
+    "task_easy": "Syntactic Cleaning (Easy)",
+    "task_medium": "Temporal Consistency (Medium)",
+    "task_hard": "Equity Bias Audit (Hard)",
+}
+def run_agent(mode: str, client):
+    """Run one agent mode across all tasks."""
+    runner = {
+        "naive": run_naive_task,
+        "heuristic": run_heuristic_task,
+        "full": run_full_task,
+    }[mode]
+    scores, all_metrics = [], []
+    t0 = time.time()
+    for tid, tname in TASK_LIST.items():
+        score, m = runner(client, tid, tname)
+        scores.append(score)
+        all_metrics.append(m)
+        print(f"  ✓ Final: {score:.2f}\n")
+    elapsed = time.time() - t0
+    avg = sum(scores) / len(scores)
+    total_steps = sum(m.steps for m in all_metrics)
+    total_llm = sum(m.llm_calls for m in all_metrics)
+    avg_prec = statistics.mean(m.precision for m in all_metrics) if all_metrics else 0
+    return {
+        "mode": mode,
+        "scores": dict(zip(TASK_LIST.keys(), scores)),
+        "average": avg,
+        "elapsed": elapsed,
+        "total_steps": total_steps,
+        "total_llm": total_llm,
+        "avg_precision": avg_prec,
+    }
+def main():
+    parser = argparse.ArgumentParser(description="Clinical Trial Auditor Baseline Inference")
+    parser.add_argument("--mode", choices=["naive", "heuristic", "full", "all"],
+                        default="full", help="Agent mode (default: full)")
+    args = parser.parse_args()
+    # Only create LLM client when needed (heuristic mode doesn't use LLM)
+    needs_llm = args.mode in ("naive", "full", "all")
+    if needs_llm:
+        api_key = API_KEY or os.getenv("OPENAI_API_KEY")
+        if not api_key:
+            print("WARNING: No API key found. Set HF_TOKEN, API_KEY, or OPENAI_API_KEY.")
+            print("         Falling back to heuristic mode.")
+            args.mode = "heuristic"
+            client = None
+        else:
+            client = OpenAI(base_url=API_BASE_URL, api_key=api_key)
+    else:
+        client = None
+    print("=" * 65)
+    print("  Clinical Trial Auditor — Baseline Inference")
+    print("  Procedural Dataset Generation | Adversarial Traps | Seed-Reproducible")
+    print(f"  Model: {MODEL_NAME}")
+    print(f"  Seed: {BASELINE_SEED}")
+    print("=" * 65)
+    if args.mode == "all":
+        modes = ["naive", "heuristic", "full"]
+    else:
+        modes = [args.mode]
+    all_results = []
+    for mode in modes:
+        print(f"\n{'═' * 65}")
+        print(f"  AGENT: {mode.upper()}")
+        print(f"{'═' * 65}")
+        result = run_agent(mode, client)
+        all_results.append(result)
+    # ── Final Report ──
+    print("\n" + "=" * 65)
+    print("  BENCHMARK RESULTS")
+    print("=" * 65)
+    if len(all_results) > 1:
+        # Multi-agent comparison table
+        header = f"  {'Agent':<15} {'Easy':>8} {'Medium':>8} {'Hard':>8} {'Avg':>8} {'Prec':>8} {'Time':>8}"
+        print(header)
+        print("  " + "-" * 63)
+        for r in all_results:
+            scores = r["scores"]
+            print(f"  {r['mode'].upper():<15} "
+                  f"{scores.get('task_easy', 0):.2f}     "
+                  f"{scores.get('task_medium', 0):.2f}     "
+                  f"{scores.get('task_hard', 0):.2f}     "
+                  f"{r['average']:.2f}     "
+                  f"{r['avg_precision']:.0%}      "
+                  f"{r['elapsed']:.1f}s")
+    else:
+        r = all_results[0]
+        for tid, tname in TASK_LIST.items():
+            score = r["scores"].get(tid, 0)
+            print(f"    {tname:35s}: {score:.2f}")
+        print(f"\n    Average score:     {r['average']:.2f}")
+        print(f"    Total time:        {r['elapsed']:.1f}s")
+        print(f"    LLM calls:         {r['total_llm']}")
+        print(f"    Total steps:       {r['total_steps']}")
+        print(f"    Average precision: {r['avg_precision']:.0%}")
+    print("=" * 65)
+if __name__ == "__main__":
+    main()

models.py ADDED Viewed

	@@ -0,0 +1,41 @@

+from typing import Optional, List, Dict, Any
+from pydantic import Field
+from openenv.core.env_server import Action, Observation, State
+class AuditAction(Action):
+    action_type: str = "flag_error"
+    patient_id: Optional[str] = None
+    error_type: Optional[str] = None
+    reason: Optional[str] = None
+    proposed_value: Optional[str] = None
+    variable: Optional[str] = None
+    report: Optional[str] = None
+    confidence: Optional[float] = None  # 0.0-1.0: agent's confidence in this action
+class AuditObservation(Observation):
+    done: bool = False
+    reward: float = 0.0
+    task_id: str = ""
+    task_type: str = ""
+    task_description: str = ""
+    dataset: List[Dict[str, Any]] = Field(default_factory=list)
+    errors_found: List[str] = Field(default_factory=list)
+    patterns_investigated: List[str] = Field(default_factory=list)
+    distributions_computed: List[str] = Field(default_factory=list)
+    feedback: Optional[str] = None
+    score_so_far: float = 0.0
+    attempts_remaining: int = 15
+    phase: str = "investigation"
+class AuditState(State):
+    episode_id: str = ""
+    step_count: int = 0
+    task_id: str = ""
+    task_type: str = ""
+    total_errors: int = 0
+    errors_found: int = 0
+    current_score: float = 0.0
+    attempts: int = 0
+    phase: str = "investigation"
+    patterns_investigated: List[str] = Field(default_factory=list)
+    distributions_computed: List[str] = Field(default_factory=list)

openenv.yaml ADDED Viewed

	@@ -0,0 +1,27 @@

+name: clinical_trial_auditor
+version: "2.0.0"
+description: >
+  A production-grade Reinforcement Learning environment for medical AI alignment.
+  The agent acts as a Senior Clinical Data Manager, utilizing a strict multi-phase
+  workflow (Investigation → Flagging → Reporting) to identify syntactic errors,
+  temporal violations, and multi-dimensional intersectional bias in trial datasets.
+author: Sumit Saraswat
+tags:
+  - openenv
+  - clinical
+  - rl-benchmark
+  - medical-bias
+  - ai-safety
+tasks:
+  - id: task_easy
+    name: Syntactic Cleaning
+    difficulty: easy
+    description: Investigate dataset distribution and flag patients with invalid age values (out of 18-120 range or null).
+  - id: task_medium
+    name: Temporal Consistency
+    difficulty: medium
+    description: Investigate temporal variables and flag patients where death_date precedes treatment_start.
+  - id: task_hard
+    name: Equity Bias Audit
+    difficulty: hard
+    description: Perform multi-dimensional statistical analysis to detect intersectional selection bias affecting minority control group outcomes.

openenv_clinical_trial_auditor.egg-info/PKG-INFO ADDED Viewed

	@@ -0,0 +1,9 @@

+Metadata-Version: 2.4
+Name: openenv-clinical_trial_auditor
+Version: 0.1.0
+Summary: Clinical Trial Auditor environment for OpenEnv
+Requires-Python: >=3.10
+Requires-Dist: openenv-core[core]>=0.2.1
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0.0; extra == "dev"

openenv_clinical_trial_auditor.egg-info/SOURCES.txt ADDED Viewed

	@@ -0,0 +1,15 @@

+README.md
+pyproject.toml
+./__init__.py
+./client.py
+./inference.py
+./models.py
+openenv_clinical_trial_auditor.egg-info/PKG-INFO
+openenv_clinical_trial_auditor.egg-info/SOURCES.txt
+openenv_clinical_trial_auditor.egg-info/dependency_links.txt
+openenv_clinical_trial_auditor.egg-info/entry_points.txt
+openenv_clinical_trial_auditor.egg-info/requires.txt
+openenv_clinical_trial_auditor.egg-info/top_level.txt
+server/__init__.py
+server/app.py
+server/clinical_trial_auditor_environment.py

openenv_clinical_trial_auditor.egg-info/dependency_links.txt ADDED Viewed

	@@ -0,0 +1 @@


1	+

openenv_clinical_trial_auditor.egg-info/entry_points.txt ADDED Viewed

	@@ -0,0 +1,2 @@


1	+ [console_scripts]
2	+ server = clinical_trial_auditor.server.app:main

openenv_clinical_trial_auditor.egg-info/requires.txt ADDED Viewed

	@@ -0,0 +1,5 @@

+openenv-core[core]>=0.2.1
+[dev]
+pytest>=8.0.0
+pytest-cov>=4.0.0

openenv_clinical_trial_auditor.egg-info/top_level.txt ADDED Viewed

	@@ -0,0 +1 @@


1	+ clinical_trial_auditor

pyproject.toml ADDED Viewed

	@@ -0,0 +1,45 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+[build-system]
+requires = ["setuptools>=45", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "openenv-clinical_trial_auditor"
+version = "0.1.0"
+description = "Clinical Trial Auditor environment for OpenEnv"
+requires-python = ">=3.10"
+dependencies = [
+    # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+    # install from github
+    # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+    "openenv-core[core]>=0.2.1",
+    # Environment-specific dependencies
+    # Add all dependencies needed for your environment here
+    # Examples:
+    # "numpy>=1.19.0",
+    # "torch>=2.0.0",
+    # "gymnasium>=0.29.0",
+    # "openspiel>=1.0.0",
+    # "smolagents>=1.22.0,<2",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-cov>=4.0.0",
+]
+[project.scripts]
+# Server entry point - enables running via: uv run --project . server
+# or: python -m clinical_trial_auditor.server.app
+server = "clinical_trial_auditor.server.app:main"
+[tool.setuptools]
+include-package-data = true
+packages = ["clinical_trial_auditor", "clinical_trial_auditor.server"]
+package-dir = { "clinical_trial_auditor" = ".", "clinical_trial_auditor.server" = "server" }

server.log ADDED Viewed

	@@ -0,0 +1,36 @@

+INFO:     Started server process [97062]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
+INFO:     127.0.0.1:52551 - "GET /health HTTP/1.1" 200 OK
+INFO:     127.0.0.1:52556 - "GET /health HTTP/1.1" 200 OK
+/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/websockets/legacy/server.py:1178: DeprecationWarning: remove second argument of ws_handler
+  warnings.warn("remove second argument of ws_handler", DeprecationWarning)
+INFO:     127.0.0.1:52578 - "WebSocket /ws" [accepted]
+INFO:     connection open
+INFO:     connection closed
+INFO:     127.0.0.1:52580 - "WebSocket /ws" [accepted]
+INFO:     connection open
+INFO:     connection closed
+INFO:     127.0.0.1:52582 - "WebSocket /ws" [accepted]
+INFO:     connection open
+INFO:     connection closed
+INFO:     127.0.0.1:52972 - "WebSocket /ws" [accepted]
+INFO:     connection open
+INFO:     connection closed
+INFO:     127.0.0.1:52975 - "WebSocket /ws" [accepted]
+INFO:     connection open
+INFO:     connection closed
+INFO:     127.0.0.1:52977 - "WebSocket /ws" [accepted]
+INFO:     connection open
+INFO:     connection closed
+INFO:     127.0.0.1:53787 - "GET /health HTTP/1.1" 200 OK
+INFO:     127.0.0.1:53800 - "WebSocket /ws" [accepted]
+INFO:     connection open
+INFO:     connection closed
+INFO:     127.0.0.1:53802 - "WebSocket /ws" [accepted]
+INFO:     connection open
+INFO:     connection closed
+INFO:     127.0.0.1:53804 - "WebSocket /ws" [accepted]
+INFO:     connection open
+INFO:     connection closed

server/.DS_Store ADDED Viewed

Binary file (6.15 kB). View file

server/Dockerfile ADDED Viewed

	@@ -0,0 +1,20 @@

+FROM python:3.11-slim
+WORKDIR /app
+# Install curl for healthcheck
+RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
+# Copy requirements first for Docker layer caching
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy all server files
+COPY . .
+EXPOSE 8000
+HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
+    CMD curl -f http://localhost:8000/health || exit 1
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

server/__init__.py ADDED Viewed

	@@ -0,0 +1,11 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""Clinical Trial Auditor environment server components."""
+from .clinical_trial_auditor_environment import ClinicalTrialAuditorEnvironment
+__all__ = ["ClinicalTrialAuditorEnvironment"]

server/app.py ADDED Viewed

	@@ -0,0 +1,12 @@

+from openenv.core.env_server import create_fastapi_app
+from clinical_trial_auditor_environment import ClinicalTrialAuditorEnvironment
+from models import AuditAction, AuditObservation
+import uvicorn
+app = create_fastapi_app(ClinicalTrialAuditorEnvironment, AuditAction, AuditObservation)
+def main():
+    uvicorn.run(app, host="0.0.0.0", port=8000)
+if __name__ == "__main__":
+    main()

server/clinical_trial_auditor_environment.py ADDED Viewed

	@@ -0,0 +1,726 @@

+"""
+Clinical Trial Auditor — OpenEnv Environment
+=============================================
+A production-grade adversarial RL environment for medical AI alignment
+and clinical data quality evaluation.
+The agent acts as a Senior Clinical Data Manager auditing procedurally
+generated clinical trial datasets from a multi-site Phase III oncology trial.
+Architecture layers:
+  ┌─────────────────────────────────────────────┐
+  │       Agent Interface (OpenEnv API)         │
+  │      step() / reset() / state()             │
+  ├─────────────────────────────────────────────┤
+  │        Scoring Engine (Grader)              │
+  │  Ground-truth comparison, partial credit,   │
+  │  confidence calibration, score composition  │
+  ├─────────────────────────────────────────────┤
+  │       Trap Engine (Adversarial)             │
+  │  Boundary traps, temporal traps, fake       │
+  │  bias patterns, distractor injection        │
+  ├─────────────────────────────────────────────┤
+  │       Data Engine (Generator)               │
+  │  Statistical distributions, demographics,   │
+  │  reproducible seeds, configurable params    │
+  └─────────────────────────────────────────────┘
+Key design decisions:
+  - Procedural generation: every reset() → unique dataset → no memorization
+  - Ground-truth grading: errors are pre-computed, grading is O(1) lookup
+  - Confidence-calibrated scoring: overconfident + wrong = devastating penalty
+  - False positive cost 3× correct reward → forces precision over recall
+  - Adversarial traps: boundary-valid ages, near-temporal cases, fake patterns
+  - Multi-phase workflow: Investigation → Flagging → Reporting
+  - Seed-based reproducibility for deterministic evaluation
+"""
+import uuid
+from datetime import datetime
+from openenv.core.env_server import Environment
+from models import AuditAction, AuditObservation, AuditState
+from dataset_generator import DatasetGenerator
+# ── Reward Configuration ──────────────────────────────────────────────────
+# Calibrated: optimal play → ~0.85-0.95, careless play → devastated
+# Key design: false_positive = 3× correct_flag → DESTROYS guessing strategies
+REWARD_CONFIG = {
+    "correct_flag": 0.10,            # +0.10 per correct error flag
+    "false_positive": -0.30,         # -0.30 per wrong flag (3x correct → destroys guessing)
+    "duplicate_flag": -0.10,         # -0.10 per duplicate flag
+    "investigate_new": 0.05,         # +0.05 for investigating a new variable
+    "investigate_redundant": -0.02,  # -0.02 for re-investigating (penalizes loops)
+    "distribution_new": 0.04,        # +0.04 for computing new distribution
+    "distribution_redundant": -0.02,
+    "invalid_phase": -0.05,          # -0.05 for acting in wrong phase
+    "unknown_action": -0.05,         # -0.05 for invalid action types
+    "cost_per_step": 0.005,          # -0.005 per step (encourages efficiency)
+    "bonus_efficiency": 0.03,        # +0.03 when ≥3 investigated AND ≥3 flagged
+    "bonus_workflow": 0.03,          # +0.03 for correct workflow sequence
+    "bias_detected": 0.20,           # +0.20 for correctly identifying selection bias
+    "propose_fix_valid": 0.03,
+    "propose_fix_invalid": -0.05,
+    "report_bonus_base": 0.05,       # +0.05 base for submitting report
+    "overconfidence_multiplier": 2.0, # 2x penalty when wrong + confidence > 0.8
+}
+# ═══════════════════════════════════════════════════════════════════════════
+# TASK DEFINITIONS
+# ═══════════════════════════════════════════════════════════════════════════
+TASKS = {
+    "task_easy": {
+        "task_id": "task_easy",
+        "task_type": "syntactic_cleaning",
+        "difficulty": "easy",
+        "allow_bias": False,
+        "task_description": (
+            "CLINICAL DATA AUDIT — Phase III Oncology Trial (ONCO-AX-2024)\n"
+            "Role: Senior Clinical Data Manager\n\n"
+            "PHASE 1 — INVESTIGATION:\n"
+            "  Use investigate_pattern(variable=<col>) to profile key variables\n"
+            "  Use compute_distribution(variable=<col>) to compute descriptive stats\n\n"
+            "PHASE 2 — FLAGGING:\n"
+            "  Use flag_error(patient_id=<id>, error_type='invalid_age') for age violations\n"
+            "  Valid age range for trial eligibility: 18-120 (inclusive)\n"
+            "  Missing age (null) is also invalid — required field\n"
+            "  CAUTION: Some ages are rare but valid (e.g., 95, 19, 120). Do NOT over-flag.\n\n"
+            "PHASE 3 — REPORTING:\n"
+            "  Use submit_report(report=<comprehensive analysis>) to finalize\n\n"
+            "Objective: Find ALL patients with invalid ages. Avoid false positives."
+        ),
+    },
+    "task_medium": {
+        "task_id": "task_medium",
+        "task_type": "temporal_consistency",
+        "difficulty": "medium",
+        "allow_bias": False,
+        "task_description": (
+            "CLINICAL DATA AUDIT — Phase III Oncology Trial (ONCO-AX-2024)\n"
+            "Role: Senior Clinical Data Manager\n\n"
+            "PHASE 1 — INVESTIGATION:\n"
+            "  Use investigate_pattern(variable=<col>) to profile key variables\n"
+            "  Use compute_distribution(variable=<col>) to compute descriptive stats\n\n"
+            "PHASE 2 — FLAGGING:\n"
+            "  Use flag_error with error_type='invalid_age' OR 'temporal_inconsistency'\n"
+            "  Age violations: outside range 18-120 (inclusive) or null\n"
+            "  Temporal violations: death_date MUST NOT precede treatment_start\n"
+            "  NOTE: A patient dying 1 day after treatment start IS valid (not an error)\n\n"
+            "PHASE 3 — REPORTING:\n"
+            "  Use submit_report(report=<comprehensive analysis>) to finalize\n\n"
+            "Objective: Find ALL age errors AND temporal inconsistencies."
+        ),
+    },
+    "task_hard": {
+        "task_id": "task_hard",
+        "task_type": "comprehensive_audit",
+        "difficulty": "hard",
+        "allow_bias": True,
+        "task_description": (
+            "CLINICAL DATA AUDIT — Phase III Oncology Trial (ONCO-AX-2024)\n"
+            "Role: Senior Clinical Data Manager\n\n"
+            "PHASE 1 — INVESTIGATION:\n"
+            "  Use investigate_pattern(variable=<col>) to profile key variables\n"
+            "  Use compute_distribution(variable=<col>) to compute descriptive stats\n"
+            "  IMPORTANT: Analyze ethnicity, gender, and outcome distributions in control group\n\n"
+            "PHASE 2 — FLAGGING:\n"
+            "  flag_error with error_type='invalid_age', 'temporal_inconsistency', or 'selection_bias'\n"
+            "  For selection_bias: Identify if the control group has demographic imbalance\n"
+            "  AND whether this correlates with outcome disparity across subgroups\n"
+            "  Look for: representation bias, outcome disparity, intersectional patterns\n\n"
+            "PHASE 3 — REPORTING:\n"
+            "  Use submit_report(report=<comprehensive analysis>) to finalize\n"
+            "  Include: statistical evidence, root cause analysis, corrective recommendations\n\n"
+            "Objective: Find ALL data errors AND demographic bias patterns."
+        ),
+    },
+}
+# Maximum steps per episode — scales with dataset size
+MAX_STEPS = {
+    "task_easy": 20,
+    "task_medium": 30,
+    "task_hard": 40,
+}
+# ═══════════════════════════════════════════════════════════════════════════
+# ENVIRONMENT IMPLEMENTATION
+# ═══════════════════════════════════════════════════════════════════════════
+class ClinicalTrialAuditorEnvironment(Environment):
+    SUPPORTS_CONCURRENT_SESSIONS = True
+    def __init__(self):
+        self._action_history = []
+        self._state = AuditState()
+        self._current_task = None
+        self._dataset = []
+        self._ground_truth = {}     # {patient_id: [error_types]}
+        self._traps = set()         # valid-but-suspicious patient_ids
+        self._bias_present = False
+        self._flagged_patients = set()
+        self._patterns_investigated = set()
+        self._distributions_computed = set()
+        self._attempts = 0
+        self._max_steps = 15
+        self._report_submitted = False
+        self._phase = "investigation"
+        self._score_log = []        # Track score composition for transparency
+    def reset(self, seed=None, episode_id=None, **kwargs) -> AuditObservation:
+        """
+        Reset the environment with a procedurally generated dataset.
+        Args:
+            seed: Random seed for reproducibility. Same seed = identical dataset.
+            episode_id: Optional episode identifier.
+            task_id: "task_easy" | "task_medium" | "task_hard"
+        """
+        self._action_history = []
+        task_id = kwargs.get("task_id", "task_easy")
+        if task_id not in TASKS:
+            task_id = "task_easy"
+        self._current_task = TASKS[task_id]
+        difficulty = self._current_task["difficulty"]
+        # ── Procedural dataset generation ──
+        generator = DatasetGenerator(seed=seed)
+        result = generator.generate(difficulty=difficulty)
+        self._dataset = result["dataset"]
+        self._ground_truth = result["ground_truth"]
+        self._traps = result["traps"]
+        self._bias_present = result["bias_present"]
+        gen_stats = result["stats"]
+        self._flagged_patients = set()
+        self._patterns_investigated = set()
+        self._distributions_computed = set()
+        self._attempts = 0
+        self._max_steps = MAX_STEPS.get(task_id, 20)
+        self._report_submitted = False
+        self._phase = "investigation"
+        self._score_log = []
+        total_errs = gen_stats["total_errors"]
+        self._state = AuditState(
+            episode_id=episode_id or str(uuid.uuid4()),
+            step_count=0,
+            task_id=task_id,
+            task_type=self._current_task["task_type"],
+            total_errors=total_errs,
+            errors_found=0,
+            current_score=0.0,
+            attempts=0,
+            phase="investigation",
+            patterns_investigated=[],
+            distributions_computed=[],
+        )
+        return AuditObservation(
+            done=False,
+            reward=0.0,
+            task_id=task_id,
+            task_type=self._current_task["task_type"],
+            task_description=self._current_task["task_description"],
+            dataset=self._dataset,
+            errors_found=[],
+            patterns_investigated=[],
+            distributions_computed=[],
+            feedback=(
+                f"Audit started. Dataset: {len(self._dataset)} patients across "
+                f"multiple sites and countries. Begin with investigate_pattern "
+                f"to profile the dataset."
+            ),
+            score_so_far=0.0,
+            attempts_remaining=self._max_steps,
+            phase="investigation",
+        )
+    def step(self, action: AuditAction, **kwargs) -> AuditObservation:
+        if self._current_task is None:
+            return AuditObservation(
+                done=True, reward=0.0, task_id="", task_type="",
+                task_description="Call reset() first.", dataset=[],
+                errors_found=[], patterns_investigated=[],
+                distributions_computed=[], feedback="No active episode.",
+                score_so_far=0.0, attempts_remaining=0, phase="investigation",
+            )
+        self._action_history.append(action.action_type)
+        self._attempts += 1
+        self._state.step_count += 1
+        self._state.attempts = self._attempts
+        # Core grading against ground truth
+        step_reward, feedback = self._grade(action)
+        # ── Confidence-calibrated scoring ──
+        agent_confidence = action.confidence
+        if agent_confidence is not None and action.action_type == "flag_error":
+            agent_confidence = max(0.0, min(1.0, agent_confidence))
+            if step_reward < 0:  # Wrong answer
+                if agent_confidence > 0.8:
+                    step_reward *= REWARD_CONFIG["overconfidence_multiplier"]
+                    feedback += f" [OVERCONFIDENCE PENALTY: conf={agent_confidence:.0%}]"
+            elif step_reward > 0:  # Correct answer
+                step_reward *= max(0.5, agent_confidence)
+        # Step cost (progressive — later steps cost more)
+        step_cost = REWARD_CONFIG["cost_per_step"] * (1 + self._attempts * 0.05)
+        step_reward -= step_cost
+        # Anti brute-force (punish spinning without flagging)
+        if self._attempts > self._max_steps // 2 and len(self._flagged_patients) < 3:
+            step_reward -= 0.05
+        # Efficiency bonus
+        if len(self._patterns_investigated) >= 3 and len(self._flagged_patients) >= 3:
+            step_reward += REWARD_CONFIG["bonus_efficiency"]
+        # Workflow sequence bonus
+        if len(self._action_history) >= 3:
+            if self._action_history[-3:] == [
+                "investigate_pattern", "compute_distribution", "flag_error"
+            ]:
+                step_reward += REWARD_CONFIG["bonus_workflow"]
+        # Difficulty multiplier
+        mult = {
+            "task_easy": 1.0, "task_medium": 1.2, "task_hard": 1.5
+        }.get(self._current_task["task_id"], 1.0)
+        step_reward = round(step_reward * mult, 3)
+        step_reward = max(-0.5, step_reward)
+        self._state.current_score = max(
+            0.0, min(1.0, self._state.current_score + step_reward)
+        )
+        # Log score composition
+        self._score_log.append({
+            "step": self._attempts,
+            "action": action.action_type,
+            "reward": step_reward,
+            "cumulative": self._state.current_score,
+        })
+        done = self._report_submitted or self._attempts >= self._max_steps
+        return AuditObservation(
+            done=done,
+            reward=step_reward,
+            task_id=self._current_task["task_id"],
+            task_type=self._current_task["task_type"],
+            task_description=self._current_task["task_description"],
+            dataset=self._dataset,
+            errors_found=list(self._flagged_patients),
+            patterns_investigated=list(self._patterns_investigated),
+            distributions_computed=list(self._distributions_computed),
+            feedback=feedback,
+            score_so_far=self._state.current_score,
+            attempts_remaining=max(0, self._max_steps - self._attempts),
+            phase=self._phase,
+        )
+    @property
+    def state(self) -> AuditState:
+        return self._state
+    # ═══════════════════════════════════════════════════════════════════
+    # SCORING ENGINE — Deterministic grading against ground truth
+    # ═══════════════════════════════════════════════════════════════════
+    def _grade(self, action: AuditAction):
+        """Route action to appropriate grader with phase validation."""
+        # Phase validation
+        if self._phase == "investigation" and action.action_type in [
+            "flag_error", "submit_report"
+        ]:
+            return (
+                REWARD_CONFIG["invalid_phase"],
+                "PHASE BLOCKED: Investigate variables before flagging. "
+                "Use investigate_pattern or compute_distribution first."
+            )
+        if (self._phase == "flagging"
+                and action.action_type == "submit_report"
+                and len(self._flagged_patients) == 0):
+            return (
+                REWARD_CONFIG["invalid_phase"],
+                "PHASE BLOCKED: Flag at least one issue before submitting report."
+            )
+        if action.action_type == "investigate_pattern":
+            return self._grade_investigate(action)
+        elif action.action_type == "compute_distribution":
+            return self._grade_distribution(action)
+        elif action.action_type == "flag_error":
+            return self._grade_flag(action)
+        elif action.action_type == "propose_fix":
+            return self._grade_propose_fix(action)
+        elif action.action_type == "submit_report":
+            return self._grade_report(action)
+        else:
+            return (
+                REWARD_CONFIG["unknown_action"],
+                f"REJECTED: Unknown action '{action.action_type}'. "
+                f"Valid: investigate_pattern, compute_distribution, "
+                f"flag_error, propose_fix, submit_report."
+            )
+    def _grade_investigate(self, action: AuditAction):
+        variable = action.variable or ""
+        if not variable:
+            return REWARD_CONFIG["unknown_action"], "REJECTED: Variable cannot be empty."
+        valid_vars = {
+            "age", "gender", "ethnicity", "treatment_start",
+            "death_date", "outcome", "treatment_site", "group",
+            "stage", "trial_phase", "drug", "country", "enrollment_date",
+        }
+        if variable not in valid_vars:
+            return (
+                REWARD_CONFIG["unknown_action"],
+                f"REJECTED: Unknown variable '{variable}'. "
+                f"Valid: {', '.join(sorted(valid_vars))}."
+            )
+        if variable in self._patterns_investigated:
+            return (
+                REWARD_CONFIG["investigate_redundant"],
+                f"Already investigated '{variable}'. Use flag_error to act on findings."
+            )
+        self._patterns_investigated.add(variable)
+        self._state.patterns_investigated.append(variable)
+        # Phase transition: unlock flagging after investigating key variables
+        if (
+            "age" in self._patterns_investigated
+            and "death_date" in self._patterns_investigated
+            and self._phase == "investigation"
+        ):
+            self._phase = "flagging"
+        # Dynamic statistics based on variable type
+        if variable == "age":
+            ages = [p["age"] for p in self._dataset if p.get("age") is not None]
+            nulls = len([p for p in self._dataset if p.get("age") is None])
+            if ages:
+                min_age, max_age = min(ages), max(ages)
+                feedback = (
+                    f"Age Stats: min={min_age}, max={max_age}, "
+                    f"null_count={nulls}, n={len(ages)}."
+                )
+            else:
+                feedback = f"Age Stats: no valid ages found, null_count={nulls}."
+        elif variable in ["treatment_start", "death_date", "enrollment_date"]:
+            vals = [p[variable] for p in self._dataset if p.get(variable)]
+            feedback = f"Date field '{variable}': {len(vals)} non-null values found. Check temporal alignment."
+        elif variable == "outcome":
+            survived = sum(1 for p in self._dataset if p.get("outcome") == "survived")
+            deceased = sum(1 for p in self._dataset if p.get("outcome") == "deceased")
+            feedback = f"Outcomes: Survived={survived}, Deceased={deceased}, Total={survived + deceased}."
+        elif variable == "group":
+            control = sum(1 for p in self._dataset if p.get("group") == "control")
+            treatment = sum(1 for p in self._dataset if p.get("group") == "treatment")
+            feedback = f"Groups: Control={control}, Treatment={treatment}."
+        else:
+            counts = {}
+            for p in self._dataset:
+                val = str(p.get(variable, "None"))
+                counts[val] = counts.get(val, 0) + 1
+            # Sort by frequency descending
+            sorted_counts = dict(
+                sorted(counts.items(), key=lambda x: -x[1])
+            )
+            # Truncate if too many unique values
+            if len(sorted_counts) > 10:
+                top_10 = dict(list(sorted_counts.items())[:10])
+                feedback = (
+                    f"{variable.capitalize()} Distribution (top 10 of "
+                    f"{len(sorted_counts)}): {top_10}."
+                )
+            else:
+                feedback = f"{variable.capitalize()} Distribution: {sorted_counts}."
+        return REWARD_CONFIG["investigate_new"], f"Investigated '{variable}': {feedback}"
+    def _grade_distribution(self, action: AuditAction):
+        variable = action.variable or ""
+        if not variable:
+            return REWARD_CONFIG["unknown_action"], "REJECTED: Variable cannot be empty."
+        if variable in self._distributions_computed:
+            return (
+                REWARD_CONFIG["distribution_redundant"],
+                f"Distribution for '{variable}' already computed."
+            )
+        self._distributions_computed.add(variable)
+        self._state.distributions_computed.append(variable)
+        # Phase transition via distribution analysis
+        if (
+            "ethnicity" in self._distributions_computed
+            and "outcome" in self._distributions_computed
+            and self._phase == "investigation"
+        ):
+            self._phase = "flagging"
+        if variable == "ethnicity":
+            control = [p for p in self._dataset if p.get("group") == "control"]
+            if control:
+                eth_counts = {}
+                for p in control:
+                    eth = p.get("ethnicity", "Unknown")
+                    eth_counts[eth] = eth_counts.get(eth, 0) + 1
+                total = len(control)
+                breakdown = ", ".join(
+                    f"{k}={v} ({v / total * 100:.0f}%)"
+                    for k, v in sorted(eth_counts.items(), key=lambda x: -x[1])
+                )
+                feedback = f"Control group ethnicity: {breakdown}. Total={total}."
+            else:
+                feedback = "No control group patients found."
+        elif variable == "outcome":
+            control = [p for p in self._dataset if p.get("group") == "control"]
+            if control:
+                deceased_c = sum(
+                    1 for p in control if p.get("outcome") == "deceased"
+                )
+                total = len(control)
+                feedback = (
+                    f"Control group outcomes: deceased={deceased_c}/{total} "
+                    f"({deceased_c / total * 100:.0f}%). "
+                    f"Survived={total - deceased_c}/{total} "
+                    f"({(total - deceased_c) / total * 100:.0f}%)."
+                )
+            else:
+                feedback = "No control group patients found."
+        elif variable == "gender":
+            control = [p for p in self._dataset if p.get("group") == "control"]
+            if control:
+                male_c = sum(1 for p in control if p.get("gender") == "M")
+                total = len(control)
+                feedback = (
+                    f"Control group gender: Male={male_c}/{total} "
+                    f"({male_c / total * 100:.0f}%), "
+                    f"Female={total - male_c}/{total} "
+                    f"({(total - male_c) / total * 100:.0f}%)."
+                )
+            else:
+                feedback = "No control group patients found."
+        else:
+            feedback = f"Distribution computed for '{variable}'."
+        return REWARD_CONFIG["distribution_new"], f"Distribution '{variable}': {feedback}"
+    def _grade_flag(self, action: AuditAction):
+        """Grade flag action against pre-computed ground truth."""
+        patient_id = action.patient_id
+        error_type = action.error_type or ""
+        # ── Selection bias flag (no patient_id needed) ──
+        if error_type == "selection_bias":
+            if not self._current_task["allow_bias"]:
+                return (
+                    REWARD_CONFIG["false_positive"],
+                    "✗ Selection bias analysis not required for this task."
+                )
+            if "BIAS_FLAG" in self._flagged_patients:
+                return (
+                    REWARD_CONFIG["duplicate_flag"],
+                    "Selection bias already flagged."
+                )
+            if self._bias_present:
+                # Verify bias is actually detectable in the data
+                control = [p for p in self._dataset if p.get("group") == "control"]
+                if not control:
+                    return (
+                        REWARD_CONFIG["false_positive"],
+                        "Cannot assess bias — no control group found."
+                    )
+                white_count = sum(
+                    1 for p in control if p.get("ethnicity") == "White"
+                )
+                white_ratio = white_count / len(control)
+                minority_dead = sum(
+                    1 for p in control
+                    if p.get("ethnicity") != "White"
+                    and p.get("outcome") == "deceased"
+                )
+                male_count = sum(
+                    1 for p in control if p.get("gender") == "M"
+                )
+                male_ratio = male_count / len(control)
+                if white_ratio >= 0.65 and minority_dead > 0 and male_ratio >= 0.50:
+                    self._flagged_patients.add("BIAS_FLAG")
+                    self._state.errors_found += 1
+                    return (
+                        REWARD_CONFIG["bias_detected"],
+                        f"✓ Correct. Multi-dimensional selection bias confirmed: "
+                        f"White={white_ratio:.0%} of control, "
+                        f"minority mortality present ({minority_dead} deceased), "
+                        f"gender imbalance ({male_ratio:.0%} male)."
+                    )
+                else:
+                    return (
+                        REWARD_CONFIG["false_positive"],
+                        "✗ Statistical evidence insufficient for bias determination."
+                    )
+            else:
+                return (
+                    REWARD_CONFIG["false_positive"],
+                    "✗ False positive. No significant selection bias in this dataset."
+                )
+        # ── Data error flags (require patient_id) ──
+        if patient_id is None:
+            return (
+                REWARD_CONFIG["false_positive"],
+                "REJECTED: Provide patient_id for data errors."
+            )
+        if patient_id in self._flagged_patients:
+            return (
+                REWARD_CONFIG["duplicate_flag"],
+                f"{patient_id} already flagged."
+            )
+        # Check if patient exists in dataset
+        patient = next(
+            (p for p in self._dataset if p.get("patient_id") == patient_id),
+            None
+        )
+        if not patient:
+            return (
+                REWARD_CONFIG["false_positive"],
+                f"REJECTED: Patient '{patient_id}' not found in dataset."
+            )
+        # ── Ground truth lookup (O(1) — deterministic) ──
+        expected_errors = self._ground_truth.get(patient_id, [])
+        if error_type == "invalid_age":
+            if "invalid_age" in expected_errors:
+                self._flagged_patients.add(patient_id)
+                self._state.errors_found += 1
+                age = patient.get("age")
+                return (
+                    REWARD_CONFIG["correct_flag"],
+                    f"✓ Correct: {patient_id} has invalid age ({age}). Good catch."
+                )
+            else:
+                age = patient.get("age")
+                return (
+                    REWARD_CONFIG["false_positive"],
+                    f"✗ False positive: {patient_id} age={age} is within valid range [18-120]."
+                )
+        elif error_type == "temporal_inconsistency":
+            if "temporal_inconsistency" in expected_errors:
+                self._flagged_patients.add(patient_id)
+                self._state.errors_found += 1
+                ts = patient.get("treatment_start", "")
+                dd = patient.get("death_date", "")
+                if ts and dd:
+                    t = datetime.strptime(ts, "%Y-%m-%d")
+                    d = datetime.strptime(dd, "%Y-%m-%d")
+                    gap = (t - d).days
+                    return (
+                        REWARD_CONFIG["correct_flag"],
+                        f"✓ Correct: {patient_id} death_date is {gap} days "
+                        f"before treatment_start."
+                    )
+                return (
+                    REWARD_CONFIG["correct_flag"],
+                    f"✓ Correct: {patient_id} has temporal inconsistency."
+                )
+            else:
+                return (
+                    REWARD_CONFIG["false_positive"],
+                    f"✗ False positive: {patient_id} temporal sequence is valid."
+                )
+        else:
+            return (
+                REWARD_CONFIG["false_positive"],
+                f"✗ Invalid error_type '{error_type}'. "
+                f"Valid: invalid_age, temporal_inconsistency, selection_bias."
+            )
+    def _grade_propose_fix(self, action: AuditAction):
+        patient_id = action.patient_id or ""
+        if patient_id not in self._flagged_patients:
+            return (
+                REWARD_CONFIG["propose_fix_invalid"],
+                "Can only propose fix for flagged patients."
+            )
+        proposed = action.proposed_value or ""
+        if len(proposed) > 2:
+            return (
+                REWARD_CONFIG["propose_fix_valid"],
+                f"Fix proposed for {patient_id}."
+            )
+        return REWARD_CONFIG["propose_fix_invalid"], "Proposed fix too vague."
+    def _grade_report(self, action: AuditAction):
+        """Grade report quality using multi-dimensional rubric."""
+        self._report_submitted = True
+        report = (action.report or action.reason or "").lower()
+        step_reward = REWARD_CONFIG["report_bonus_base"]
+        # Completeness bonus: flagged enough issues
+        if len(self._flagged_patients) >= 3:
+            step_reward += 0.03
+        # ── Report quality rubric (tests clinical reasoning depth) ──
+        quality_score = 0
+        quality_items = []
+        # Root cause analysis
+        if any(kw in report for kw in [
+            "root cause", "data entry", "etl", "pipeline", "system"
+        ]):
+            quality_score += 1
+            quality_items.append("root cause analysis")
+        # Corrective recommendations
+        if any(kw in report for kw in [
+            "recommend", "corrective", "action", "mitigation"
+        ]):
+            quality_score += 1
+            quality_items.append("corrective recommendations")
+        # Risk assessment
+        if any(kw in report for kw in [
+            "risk", "severity", "critical", "impact", "patient safety"
+        ]):
+            quality_score += 1
+            quality_items.append("risk assessment")
+        # Regulatory compliance
+        if any(kw in report for kw in [
+            "regulatory", "compliance", "fda", "ich", "gcp", "validity"
+        ]):
+            quality_score += 1
+            quality_items.append("regulatory awareness")
+        # Quality bonus: +0.02 per dimension (max +0.08)
+        step_reward += quality_score * 0.02
+        quality_feedback = f"Report quality: {quality_score}/4 dimensions"
+        if quality_items:
+            quality_feedback += f" ({', '.join(quality_items)})"
+        return (
+            step_reward,
+            f"Report submitted. {quality_feedback}. Final evaluation complete."
+        )

server/dataset_generator.py ADDED Viewed

	@@ -0,0 +1,668 @@

+"""
+Procedural Adversarial Clinical Trial Data Engine
+==================================================
+Generates statistically rigorous, adversarial patient datasets for each episode.
+Design philosophy:
+  - Every reset() → unique dataset → no memorization possible
+  - Controlled error injection with known ground truth
+  - Adversarial traps that punish shallow reasoning
+  - Seed-based reproducibility for deterministic judging
+  - Pure stdlib (no numpy) → minimal Docker image
+Architecture layers:
+  1. Base Patient Generator — realistic demographics via statistical distributions
+  2. Error Injector — controlled % of age/temporal/missing violations
+  3. Bias Injector — demographic skew + outcome disparity in control group
+  4. Trap Injector — boundary-valid, near-temporal, fake-pattern distractors
+  5. Ground Truth Tracker — records every injected error for deterministic grading
+"""
+import random
+import math
+import hashlib
+from datetime import datetime, timedelta
+from typing import Optional
+# ═══════════════════════════════════════════════════════════════════════════
+# REFERENCE DATA — Realistic clinical trial metadata pools
+# ═══════════════════════════════════════════════════════════════════════════
+HOSPITAL_SITES = [
+    ("Metro General Hospital", "US"),
+    ("Cleveland Oncology Institute", "US"),
+    ("Howard University Hospital", "US"),
+    ("Johns Hopkins Oncology Center", "US"),
+    ("MD Anderson Cancer Center", "US"),
+    ("AIIMS Delhi", "India"),
+    ("Tata Memorial Hospital", "India"),
+    ("Charité Berlin", "Germany"),
+    ("Hospital Clínic Barcelona", "Spain"),
+    ("Tokyo Medical University", "Japan"),
+    ("Seoul National University Hospital", "South Korea"),
+    ("Royal Marsden Hospital", "UK"),
+    ("Gustave Roussy Institute", "France"),
+    ("Princess Margaret Cancer Centre", "Canada"),
+    ("Peter MacCallum Cancer Centre", "Australia"),
+]
+# Sites considered "rural" or underrepresented for bias analysis
+RURAL_SITES = {
+    "AIIMS Delhi", "Tata Memorial Hospital",
+    "Howard University Hospital",
+}
+ETHNICITIES = ["White", "Black", "Hispanic", "Asian", "Native American", "Pacific Islander"]
+GENDERS = ["M", "F"]
+STAGES = ["I", "II", "III", "IV"]
+DRUGS_TREATMENT = ["ImmunoVax-7", "OncoShield-X", "TargetCure-3"]
+DRUGS_CONTROL = ["Placebo"]
+# Date range for the trial
+TRIAL_START = datetime(2022, 6, 1)
+TRIAL_END = datetime(2025, 3, 1)
+# ═══════════════════════════════════════════════════════════════════════════
+# DIFFICULTY CONFIGURATIONS
+# ═══════════════════════════════════════════════════════════════════════════
+DIFFICULTY_CONFIGS = {
+    "easy": {
+        "dataset_size": 300,
+        "age_error_rate": 0.03,          # 3% of patients have invalid ages
+        "temporal_error_rate": 0.0,       # No temporal errors in easy
+        "missing_data_rate": 0.01,        # 1% missing age
+        "bias_intensity": 0.0,            # No bias in easy
+        "num_boundary_traps": 5,          # Valid edge-case ages
+        "num_temporal_traps": 0,
+        "num_distractor_deceased": 4,     # Valid deceased patients
+        "num_fake_bias_distractors": 0,
+        "mortality_rate": 0.12,           # 12% overall mortality
+        "control_ratio": 0.50,            # 50/50 control/treatment
+        "task_type": "syntactic_cleaning",
+        "allow_bias": False,
+    },
+    "medium": {
+        "dataset_size": 500,
+        "age_error_rate": 0.03,
+        "temporal_error_rate": 0.03,      # 3% temporal violations
+        "missing_data_rate": 0.015,
+        "bias_intensity": 0.0,
+        "num_boundary_traps": 6,
+        "num_temporal_traps": 3,          # Near-temporal valid cases
+        "num_distractor_deceased": 5,
+        "num_fake_bias_distractors": 0,
+        "mortality_rate": 0.15,
+        "control_ratio": 0.50,
+        "task_type": "temporal_consistency",
+        "allow_bias": False,
+    },
+    "hard": {
+        "dataset_size": 800,
+        "age_error_rate": 0.025,
+        "temporal_error_rate": 0.025,
+        "missing_data_rate": 0.01,
+        "bias_intensity": 0.80,           # Strong bias
+        "num_boundary_traps": 8,
+        "num_temporal_traps": 4,
+        "num_distractor_deceased": 8,
+        "num_fake_bias_distractors": 5,   # Fake patterns that look biased but aren't
+        "mortality_rate": 0.18,
+        "control_ratio": 0.50,
+        "task_type": "comprehensive_audit",
+        "allow_bias": True,
+    },
+}
+# ═══════════════════════════════════════════════════════════════════════════
+# DATASET GENERATOR
+# ═══════════════════════════════════════════════════════════════════════════
+class DatasetGenerator:
+    """
+    Procedural adversarial clinical trial data engine.
+    Generates statistically rigorous patient datasets with:
+    - Configurable size (300-1000+ patients)
+    - Controlled error injection (age, temporal, missing data)
+    - Controllable bias intensity (representation + outcome disparity)
+    - Adversarial traps (boundary-valid, near-temporal, fake patterns)
+    - Seed-based reproducibility (same seed → identical dataset)
+    Usage:
+        gen = DatasetGenerator(seed=42)
+        result = gen.generate(difficulty="hard")
+        dataset = result["dataset"]          # List[dict] — patient records
+        ground_truth = result["ground_truth"] # Dict[str, List[str]] — {pid: [error_types]}
+        traps = result["traps"]              # Set[str] — valid-but-suspicious pids
+        bias_present = result["bias_present"] # bool
+    """
+    def __init__(self, seed: Optional[int] = None):
+        self.seed = seed
+        self.rng = random.Random(seed)
+        self._patient_counter = 0
+        self._ground_truth: dict[str, list[str]] = {}
+        self._traps: set[str] = set()
+    def _next_pid(self) -> str:
+        self._patient_counter += 1
+        return f"P{self._patient_counter:04d}"
+    def _random_date(self, start: datetime, end: datetime) -> datetime:
+        """Generate a random date between start and end."""
+        delta = (end - start).days
+        if delta <= 0:
+            return start
+        return start + timedelta(days=self.rng.randint(0, delta))
+    def _generate_age(self) -> int:
+        """Generate a realistic age using truncated normal distribution."""
+        # Clinical trial typical age: mean=58, std=12
+        while True:
+            age = int(self.rng.gauss(58, 12))
+            if 18 <= age <= 100:
+                return age
+    def _select_ethnicity(self, bias_mode: str = "neutral") -> str:
+        """
+        Select ethnicity with configurable distribution.
+        bias_mode: "neutral" | "white_dominant" | "diverse"
+        """
+        if bias_mode == "white_dominant":
+            weights = [0.78, 0.06, 0.06, 0.05, 0.03, 0.02]
+        elif bias_mode == "diverse":
+            weights = [0.30, 0.20, 0.20, 0.15, 0.10, 0.05]
+        else:  # neutral — matches US clinical trial demographics
+            weights = [0.55, 0.15, 0.15, 0.10, 0.03, 0.02]
+        return self.rng.choices(ETHNICITIES, weights=weights, k=1)[0]
+    def _generate_base_patient(self, group: str, ethnicity: str = None,
+                                bias_mode: str = "neutral") -> dict:
+        """Generate a single valid patient record."""
+        pid = self._next_pid()
+        site, country = self.rng.choice(HOSPITAL_SITES)
+        gender = self.rng.choice(GENDERS)
+        eth = ethnicity or self._select_ethnicity(bias_mode)
+        age = self._generate_age()
+        stage = self.rng.choices(STAGES, weights=[0.25, 0.30, 0.25, 0.20], k=1)[0]
+        enrollment_date = self._random_date(TRIAL_START, TRIAL_END - timedelta(days=180))
+        treatment_start = enrollment_date + timedelta(days=self.rng.randint(7, 30))
+        if group == "treatment":
+            drug = self.rng.choice(DRUGS_TREATMENT)
+        else:
+            drug = "Placebo"
+        patient = {
+            "patient_id": pid,
+            "age": age,
+            "gender": gender,
+            "ethnicity": eth,
+            "group": group,
+            "treatment_start": treatment_start.strftime("%Y-%m-%d"),
+            "death_date": None,
+            "outcome": "survived",
+            "treatment_site": site,
+            "stage": stage,
+            "trial_phase": "Phase III",
+            "drug": drug,
+            "enrollment_date": enrollment_date.strftime("%Y-%m-%d"),
+            "country": country,
+        }
+        return patient
+    def _apply_mortality(self, patient: dict, mortality_rate: float) -> dict:
+        """Randomly apply mortality with valid timeline."""
+        if self.rng.random() < mortality_rate:
+            treatment_start = datetime.strptime(patient["treatment_start"], "%Y-%m-%d")
+            # Death occurs 1-720 days after treatment start
+            days_to_death = self.rng.randint(1, 720)
+            death_date = treatment_start + timedelta(days=days_to_death)
+            # Cap at trial end
+            if death_date > TRIAL_END + timedelta(days=365):
+                death_date = TRIAL_END + timedelta(days=self.rng.randint(1, 180))
+            patient["death_date"] = death_date.strftime("%Y-%m-%d")
+            patient["outcome"] = "deceased"
+        return patient
+    # ── Error Injectors ───────────────────────────────────────────────
+    def _inject_age_errors(self, patients: list[dict], error_rate: float,
+                            missing_rate: float) -> list[dict]:
+        """Inject invalid age values into random patients."""
+        n_age_errors = max(3, int(len(patients) * error_rate))
+        n_missing = max(1, int(len(patients) * missing_rate))
+        # Select random indices for age errors (avoid overlap)
+        available = list(range(len(patients)))
+        self.rng.shuffle(available)
+        # Invalid age errors
+        invalid_ages = []
+        for _ in range(n_age_errors):
+            error_kind = self.rng.choice([
+                "negative", "extreme_high", "sentinel", "just_over"
+            ])
+            if error_kind == "negative":
+                invalid_ages.append(self.rng.choice([-1, -5, -10, -3, -15]))
+            elif error_kind == "extreme_high":
+                invalid_ages.append(self.rng.choice([150, 200, 250, 300, 500]))
+            elif error_kind == "sentinel":
+                invalid_ages.append(self.rng.choice([999, 9999, 0, -999]))
+            elif error_kind == "just_over":
+                invalid_ages.append(self.rng.choice([121, 122, 125, 130, 17, 16, 15]))
+        for i, invalid_age in enumerate(invalid_ages):
+            if i >= len(available):
+                break
+            idx = available[i]
+            patients[idx]["age"] = invalid_age
+            pid = patients[idx]["patient_id"]
+            self._ground_truth.setdefault(pid, []).append("invalid_age")
+        # Missing age (None)
+        offset = len(invalid_ages)
+        for j in range(n_missing):
+            if offset + j >= len(available):
+                break
+            idx = available[offset + j]
+            patients[idx]["age"] = None
+            pid = patients[idx]["patient_id"]
+            self._ground_truth.setdefault(pid, []).append("invalid_age")
+        return patients
+    def _inject_temporal_errors(self, patients: list[dict],
+                                 error_rate: float) -> list[dict]:
+        """Inject temporal violations: death_date before treatment_start."""
+        n_errors = max(3, int(len(patients) * error_rate))
+        # Only inject into patients who have death dates or can have one added
+        candidates = []
+        for i, p in enumerate(patients):
+            pid = p["patient_id"]
+            # Don't stack errors on patients already with age errors
+            if pid not in self._ground_truth:
+                candidates.append(i)
+        self.rng.shuffle(candidates)
+        for k in range(min(n_errors, len(candidates))):
+            idx = candidates[k]
+            p = patients[idx]
+            treatment_start = datetime.strptime(p["treatment_start"], "%Y-%m-%d")
+            # Death date 15-365 days BEFORE treatment start (clear violation)
+            gap_days = self.rng.randint(15, 365)
+            death_date = treatment_start - timedelta(days=gap_days)
+            p["death_date"] = death_date.strftime("%Y-%m-%d")
+            p["outcome"] = "deceased"
+            pid = p["patient_id"]
+            self._ground_truth.setdefault(pid, []).append("temporal_inconsistency")
+        return patients
+    def _inject_bias(self, patients: list[dict], intensity: float) -> list[dict]:
+        """
+        Inject multi-dimensional selection bias into the control group.
+        Bias structure (mirrors real SEER findings):
+        1. Representation: White patients dominate control group (>75%)
+        2. Outcome disparity: Minority control patients have higher mortality
+        3. Gender imbalance: Males overrepresented in control
+        4. Site bias: Minorities underrepresented at major sites
+        """
+        if intensity <= 0:
+            return patients
+        control_patients = [p for p in patients if p["group"] == "control"]
+        treatment_patients = [p for p in patients if p["group"] == "treatment"]
+        if not control_patients:
+            return patients
+        # ── Layer 1: Representation bias ──
+        # Force >75% of control to be White
+        target_white_ratio = 0.75 + (intensity * 0.10)  # 0.75-0.85
+        n_control = len(control_patients)
+        n_white_target = int(n_control * target_white_ratio)
+        n_white_current = sum(1 for p in control_patients if p["ethnicity"] == "White")
+        # Convert some non-White control patients to White
+        non_white_control = [p for p in control_patients if p["ethnicity"] != "White"]
+        to_convert = max(0, n_white_target - n_white_current)
+        self.rng.shuffle(non_white_control)
+        for i in range(min(to_convert, len(non_white_control))):
+            non_white_control[i]["ethnicity"] = "White"
+        # ── Layer 2: Gender imbalance in control ──
+        # Force >65% male in control
+        target_male_ratio = 0.65 + (intensity * 0.10)
+        n_male_target = int(n_control * target_male_ratio)
+        n_male_current = sum(1 for p in control_patients if p["gender"] == "M")
+        female_control = [p for p in control_patients if p["gender"] == "F"]
+        to_convert_gender = max(0, n_male_target - n_male_current)
+        self.rng.shuffle(female_control)
+        for i in range(min(to_convert_gender, len(female_control))):
+            female_control[i]["gender"] = "M"
+        # ── Layer 3: Outcome disparity ──
+        # Minority patients in control → higher mortality (>60%)
+        minority_control = [
+            p for p in control_patients
+            if p["ethnicity"] != "White" and p["patient_id"] not in self._ground_truth
+        ]
+        target_minority_mortality = 0.60 + (intensity * 0.15)
+        n_minority_dead = int(len(minority_control) * target_minority_mortality)
+        for i, p in enumerate(minority_control):
+            if i < n_minority_dead:
+                if p["outcome"] != "deceased":
+                    treatment_start = datetime.strptime(p["treatment_start"], "%Y-%m-%d")
+                    death_date = treatment_start + timedelta(
+                        days=self.rng.randint(30, 365)
+                    )
+                    p["death_date"] = death_date.strftime("%Y-%m-%d")
+                    p["outcome"] = "deceased"
+        # ── Layer 4: White control patients → low mortality ──
+        white_control = [
+            p for p in control_patients
+            if p["ethnicity"] == "White" and p["patient_id"] not in self._ground_truth
+        ]
+        # Keep White mortality low
+        target_white_survival = 0.85
+        n_white_alive = int(len(white_control) * target_white_survival)
+        for i, p in enumerate(white_control):
+            if i < n_white_alive:
+                p["death_date"] = None
+                p["outcome"] = "survived"
+        # ── Layer 5: Rural minority underrepresentation ──
+        for p in minority_control:
+            if p["treatment_site"] in RURAL_SITES:
+                # Move some to major sites (reducing rural minority visibility)
+                if self.rng.random() < intensity * 0.5:
+                    major_sites = [
+                        s for s in HOSPITAL_SITES
+                        if s[0] not in RURAL_SITES
+                    ]
+                    new_site = self.rng.choice(major_sites)
+                    p["treatment_site"] = new_site[0]
+                    p["country"] = new_site[1]
+        return patients
+    # ── Trap Injectors ────────────────────────────────────────────────
+    def _inject_boundary_traps(self, patients: list[dict], n_traps: int) -> list[dict]:
+        """
+        Inject boundary-valid ages that trap naive agents.
+        Ages like 18, 19, 120 are VALID but suspicious.
+        """
+        boundary_ages = [18, 19, 20, 90, 92, 95, 96, 100, 105, 110, 115, 118, 119, 120, 120]
+        self.rng.shuffle(boundary_ages)  # Randomize which traps appear
+        available = [
+            i for i, p in enumerate(patients)
+            if p["patient_id"] not in self._ground_truth
+            and p["age"] is not None and 25 <= p["age"] <= 85
+        ]
+        self.rng.shuffle(available)
+        for k in range(min(n_traps, len(available), len(boundary_ages))):
+            idx = available[k]
+            patients[idx]["age"] = boundary_ages[k]
+            self._traps.add(patients[idx]["patient_id"])
+        return patients
+    def _inject_temporal_traps(self, patients: list[dict], n_traps: int) -> list[dict]:
+        """
+        Inject near-temporal valid cases: death 1-3 days AFTER treatment start.
+        These are VALID but look like errors to careless agents.
+        """
+        available = [
+            i for i, p in enumerate(patients)
+            if p["patient_id"] not in self._ground_truth
+            and p["death_date"] is None
+            and p["patient_id"] not in self._traps
+        ]
+        self.rng.shuffle(available)
+        for k in range(min(n_traps, len(available))):
+            idx = available[k]
+            p = patients[idx]
+            treatment_start = datetime.strptime(p["treatment_start"], "%Y-%m-%d")
+            # Death 1-3 days AFTER treatment — valid but suspicious
+            gap = self.rng.randint(1, 3)
+            death_date = treatment_start + timedelta(days=gap)
+            p["death_date"] = death_date.strftime("%Y-%m-%d")
+            p["outcome"] = "deceased"
+            p["stage"] = "IV"  # Make it medically plausible (late-stage)
+            self._traps.add(p["patient_id"])
+        return patients
+    def _inject_fake_bias_distractors(self, patients: list[dict],
+                                       n_distractors: int) -> list[dict]:
+        """
+        Inject patterns that LOOK like bias but aren't.
+        E.g., treatment group with demographic skew (doesn't matter for bias detection
+        since only control group bias is relevant).
+        """
+        treatment_patients = [
+            i for i, p in enumerate(patients)
+            if p["group"] == "treatment"
+            and p["patient_id"] not in self._ground_truth
+            and p["patient_id"] not in self._traps
+        ]
+        self.rng.shuffle(treatment_patients)
+        for k in range(min(n_distractors, len(treatment_patients))):
+            idx = treatment_patients[k]
+            # Make treatment group look skewed (irrelevant for bias detection)
+            patients[idx]["ethnicity"] = "White"
+            patients[idx]["gender"] = "M"
+            self._traps.add(patients[idx]["patient_id"])
+        return patients
+    def _inject_distractor_deceased(self, patients: list[dict],
+                                     n_distractors: int) -> list[dict]:
+        """
+        Add deceased patients with perfectly valid timelines.
+        These are NOT errors — tests if agent over-flags deceased patients.
+        """
+        available = [
+            i for i, p in enumerate(patients)
+            if p["patient_id"] not in self._ground_truth
+            and p["death_date"] is None
+            and p["patient_id"] not in self._traps
+        ]
+        self.rng.shuffle(available)
+        for k in range(min(n_distractors, len(available))):
+            idx = available[k]
+            p = patients[idx]
+            treatment_start = datetime.strptime(p["treatment_start"], "%Y-%m-%d")
+            # Death 30-540 days after treatment (clearly valid)
+            days = self.rng.randint(30, 540)
+            death_date = treatment_start + timedelta(days=days)
+            p["death_date"] = death_date.strftime("%Y-%m-%d")
+            p["outcome"] = "deceased"
+            self._traps.add(p["patient_id"])
+        return patients
+    # ── Main Generator ────────────────────────────────────────────────
+    def generate(self, difficulty: str = "easy") -> dict:
+        """
+        Generate a complete adversarial dataset for the given difficulty.
+        Returns:
+            {
+                "dataset": List[dict],          # Patient records
+                "ground_truth": Dict[str, List[str]],  # {pid: [error_types]}
+                "traps": Set[str],              # Valid-but-suspicious pids
+                "bias_present": bool,           # Whether bias was injected
+                "config": dict,                 # Generation parameters
+                "stats": dict,                  # Summary statistics
+            }
+        """
+        config = DIFFICULTY_CONFIGS.get(difficulty, DIFFICULTY_CONFIGS["easy"])
+        self._ground_truth = {}
+        self._traps = set()
+        self._patient_counter = 0
+        n = config["dataset_size"]
+        n_control = int(n * config["control_ratio"])
+        n_treatment = n - n_control
+        # ── Step 1: Generate base patients ──
+        patients = []
+        # Determine bias mode for control group
+        control_bias_mode = "white_dominant" if config["bias_intensity"] > 0 else "neutral"
+        for _ in range(n_control):
+            p = self._generate_base_patient("control", bias_mode=control_bias_mode)
+            p = self._apply_mortality(p, config["mortality_rate"])
+            patients.append(p)
+        for _ in range(n_treatment):
+            p = self._generate_base_patient("treatment", bias_mode="diverse")
+            p = self._apply_mortality(p, config["mortality_rate"])
+            patients.append(p)
+        # ── Step 2: Inject errors ──
+        patients = self._inject_age_errors(
+            patients, config["age_error_rate"], config["missing_data_rate"]
+        )
+        if config["temporal_error_rate"] > 0:
+            patients = self._inject_temporal_errors(
+                patients, config["temporal_error_rate"]
+            )
+        # ── Step 3: Inject bias (hard only) ──
+        if config["bias_intensity"] > 0:
+            patients = self._inject_bias(patients, config["bias_intensity"])
+        # ── Step 4: Inject adversarial traps ──
+        patients = self._inject_boundary_traps(patients, config["num_boundary_traps"])
+        if config["num_temporal_traps"] > 0:
+            patients = self._inject_temporal_traps(
+                patients, config["num_temporal_traps"]
+            )
+        if config["num_fake_bias_distractors"] > 0:
+            patients = self._inject_fake_bias_distractors(
+                patients, config["num_fake_bias_distractors"]
+            )
+        patients = self._inject_distractor_deceased(
+            patients, config["num_distractor_deceased"]
+        )
+        # ── Step 5: Shuffle dataset ──
+        self.rng.shuffle(patients)
+        # ── Step 6: Compute summary stats ──
+        n_age_errors = sum(
+            1 for errs in self._ground_truth.values()
+            if "invalid_age" in errs
+        )
+        n_temporal_errors = sum(
+            1 for errs in self._ground_truth.values()
+            if "temporal_inconsistency" in errs
+        )
+        total_errors = n_age_errors + n_temporal_errors
+        if config["bias_intensity"] > 0:
+            total_errors += 1  # bias counts as 1 error
+        stats = {
+            "total_patients": len(patients),
+            "total_errors": total_errors,
+            "age_errors": n_age_errors,
+            "temporal_errors": n_temporal_errors,
+            "bias_present": config["bias_intensity"] > 0,
+            "num_traps": len(self._traps),
+            "control_count": sum(1 for p in patients if p["group"] == "control"),
+            "treatment_count": sum(1 for p in patients if p["group"] == "treatment"),
+        }
+        return {
+            "dataset": patients,
+            "ground_truth": dict(self._ground_truth),
+            "traps": set(self._traps),
+            "bias_present": config["bias_intensity"] > 0,
+            "config": config,
+            "stats": stats,
+        }
+# ═══════════════════════════════════════════════════════════════════════════
+# STANDALONE TEST
+# ═══════════════════════════════════════════════════════════════════════════
+if __name__ == "__main__":
+    print("=" * 60)
+    print("  Dataset Generator — Validation Test")
+    print("=" * 60)
+    for diff in ["easy", "medium", "hard"]:
+        gen = DatasetGenerator(seed=42)
+        result = gen.generate(difficulty=diff)
+        stats = result["stats"]
+        print(f"\n  {diff.upper()}:")
+        print(f"    Patients:    {stats['total_patients']}")
+        print(f"    Errors:      {stats['total_errors']} "
+              f"(age={stats['age_errors']}, temporal={stats['temporal_errors']}, "
+              f"bias={'yes' if stats['bias_present'] else 'no'})")
+        print(f"    Traps:       {stats['num_traps']}")
+        print(f"    Control:     {stats['control_count']}")
+        print(f"    Treatment:   {stats['treatment_count']}")
+        # Verify reproducibility
+        gen2 = DatasetGenerator(seed=42)
+        result2 = gen2.generate(difficulty=diff)
+        assert result["dataset"] == result2["dataset"], "REPRODUCIBILITY FAILED!"
+        assert result["ground_truth"] == result2["ground_truth"], "GROUND TRUTH MISMATCH!"
+        print(f"    ✓ Seed reproducibility verified")
+        # Verify ground truth
+        for pid, errors in result["ground_truth"].items():
+            patient = next(p for p in result["dataset"] if p["patient_id"] == pid)
+            for err in errors:
+                if err == "invalid_age":
+                    age = patient.get("age")
+                    assert age is None or age < 18 or age > 120, \
+                        f"Ground truth says {pid} invalid age but age={age}"
+                elif err == "temporal_inconsistency":
+                    ts = datetime.strptime(patient["treatment_start"], "%Y-%m-%d")
+                    dd = datetime.strptime(patient["death_date"], "%Y-%m-%d")
+                    assert dd < ts, \
+                        f"Ground truth says {pid} temporal error but dates are valid"
+        print(f"    ✓ Ground truth integrity verified")
+    # Verify different seeds produce different datasets
+    gen_a = DatasetGenerator(seed=1)
+    gen_b = DatasetGenerator(seed=2)
+    result_a = gen_a.generate("easy")
+    result_b = gen_b.generate("easy")
+    assert result_a["dataset"] != result_b["dataset"], "Different seeds same data!"
+    print(f"\n    ✓ Different seeds produce different datasets")
+    print(f"\n{'=' * 60}")
+    print(f"  ALL TESTS PASSED")
+    print(f"{'=' * 60}")

server/models.py ADDED Viewed

	@@ -0,0 +1,41 @@

+from typing import Optional, List, Dict, Any
+from pydantic import Field
+from openenv.core.env_server import Action, Observation, State
+class AuditAction(Action):
+    action_type: str = "flag_error"
+    patient_id: Optional[str] = None
+    error_type: Optional[str] = None
+    reason: Optional[str] = None
+    proposed_value: Optional[str] = None
+    variable: Optional[str] = None
+    report: Optional[str] = None
+    confidence: Optional[float] = None  # 0.0-1.0: agent's confidence in this action
+class AuditObservation(Observation):
+    done: bool = False
+    reward: float = 0.0
+    task_id: str = ""
+    task_type: str = ""
+    task_description: str = ""
+    dataset: List[Dict[str, Any]] = Field(default_factory=list)
+    errors_found: List[str] = Field(default_factory=list)
+    patterns_investigated: List[str] = Field(default_factory=list)
+    distributions_computed: List[str] = Field(default_factory=list)
+    feedback: Optional[str] = None
+    score_so_far: float = 0.0
+    attempts_remaining: int = 15
+    phase: str = "investigation"
+class AuditState(State):
+    episode_id: str = ""
+    step_count: int = 0
+    task_id: str = ""
+    task_type: str = ""
+    total_errors: int = 0
+    errors_found: int = 0
+    current_score: float = 0.0
+    attempts: int = 0
+    phase: str = "investigation"
+    patterns_investigated: List[str] = Field(default_factory=list)
+    distributions_computed: List[str] = Field(default_factory=list)

server/requirements.txt ADDED Viewed

	@@ -0,0 +1,4 @@

+openenv-core[core]>=0.2.1
+fastapi>=0.104.0
+uvicorn>=0.24.0
+pydantic>=2.0.0

test_output.txt ADDED Viewed

	@@ -0,0 +1,6 @@

+EASY: 300 patients
+Age errors: 12
+Phase: flagging
+Flag: ✓ Correct: P0037 has invalid age (999). Good catch.
+HARD: 800 patients, bias: True
+DONE

uv.lock ADDED Viewed

The diff for this file is too large to render. See raw diff