Spaces:

Cooked4riyal
/

EntropyEnv

Running

App Files Files Community

immortalindeed commited on 12 days ago

Commit

4ec75cf

0 Parent(s):

first commit

Browse files

Files changed (32) hide show

.gitignore +16 -0
Dockerfile +26 -0
README.md +379 -0
inference.py +341 -0
openenv.yaml +59 -0
pyproject.toml +19 -0
server/__init__.py +1 -0
server/app.py +631 -0
server/benchmark_store.py +69 -0
server/datasets/__init__.py +1 -0
server/datasets/clinical_cases.py +180 -0
server/datasets/dependency_cases.py +280 -0
server/datasets/security_cases.py +211 -0
server/debug_panel.html +1196 -0
server/demo_agent.py +140 -0
server/graders/__init__.py +1 -0
server/graders/base_grader.py +79 -0
server/graders/clinical_grader.py +154 -0
server/graders/dependency_grader.py +243 -0
server/graders/security_grader.py +99 -0
server/models/__init__.py +1 -0
server/models/clinical_models.py +19 -0
server/models/dependency_models.py +22 -0
server/models/security_models.py +23 -0
server/router.py +322 -0
server/session.py +41 -0
server/validation/__init__.py +1 -0
server/validation/validator.py +282 -0
server/web_ui.py +365 -0
tests/test_endpoints.py +131 -0
tests/test_grader_variance.py +138 -0
validate-submission.sh +70 -0

.gitignore ADDED Viewed

	@@ -0,0 +1,16 @@

+# ── Never commit these ──
+.env
+venv/
+__pycache__/
+*.pyc
+*.egg-info/
+multi_agent_dev_tools_env.egg-info/
+dist/
+build/
+test_output.txt
+# ── Internal / personal files ──
+unnecessary/
+results/
+.pytest_cache/
+uv.lock

Dockerfile ADDED Viewed

	@@ -0,0 +1,26 @@

+FROM python:3.10-slim
+# ── OpenEnv labels (required for HF Space tagging) ──
+LABEL org.opencontainers.image.title="multi-agent-dev-tools-env"
+LABEL org.opencontainers.image.description="Multi-Agent Dev Tools RL Environment"
+LABEL openenv="true"
+WORKDIR /app
+# Install dependencies
+COPY pyproject.toml .
+RUN pip install --no-cache-dir . 2>/dev/null || pip install --no-cache-dir \
+    fastapi uvicorn pydantic openai requests packaging gradio python-dotenv
+# Copy project files
+COPY . .
+# Expose port 7860 (HuggingFace Spaces standard)
+EXPOSE 7860
+# Health check for HF Spaces
+HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/')" || exit 1
+# Start the server
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]

README.md ADDED Viewed

	@@ -0,0 +1,379 @@

+# 🛠️ Multi-Agent Dev Tools Environment
+> A multi-domain RL environment for training and evaluating AI agents on **real-world developer and clinical tasks**.
+> Built for the **Scaler × Meta × PyTorch × Hugging Face OpenEnv Hackathon 2026**.
+---
+## 💡 Why This Environment?
+Most existing RL benchmarks test agents on **static, single-turn tasks** — classify this image, answer this question. But real developer workflows are **multi-turn, iterative, and require revision**:
+- A security reviewer doesn't just find a bug — they **identify → propose a fix → revise after feedback**
+- A DevOps engineer doesn't just flag outdated packages — they **resolve version conflicts across an entire dependency graph**
+- A clinical coordinator doesn't just spot missing steps — they **prioritize by urgency and plan a dependency-safe recovery**
+**No existing RL environment tests agents on this full identify → act → revise cycle.** This environment fills that gap by providing 9 tasks across 3 real-world domains with progressive difficulty, rich partial-credit scoring, and iterative multi-turn episodes.
+**Who would use this?** Teams training AI coding assistants (code review bots), dependency management agents (Dependabot-like systems), and clinical decision support systems.
+---
+## 🎯 What Is This?
+This is a **training gym for AI agents** — not the agent itself.
+Think of it like a driving test course: you build the course, and different AI "drivers" take the test.
+An AI agent connects to this environment via API, receives a **task** (e.g., "find the vulnerability in this code"), sends back an **action** (its answer), and gets a **reward score** (0.0 – 1.0) based on how good the answer is.
+```
+                    POST /reset
+AI Agent  ────────────────────────►  This Environment
+                                     │
+                                     ├── Picks a task case
+                                     ├── Returns: observation (the problem)
+          ◄────────────────────────  │
+                                     │
+                    POST /step       │
+          ────────────────────────►  │
+                                     ├── Validates the action (3 stages)
+                                     ├── Grades it (domain-specific grader)
+          ◄────────────────────────  ├── Returns: reward + done + next observation
+                                     │
+             (repeat until done)     │
+```
+---
+## 🏗️ Three Domains, Nine Tasks
+### 🔒 Domain 1: MCP Security Auditing
+Agents must identify vulnerabilities in code snippets, propose fixes, and iteratively revise based on reviewer feedback.
+| Task | Difficulty | Subtype | Max Steps | Threshold | Actions |
+|------|-----------|---------|-----------|-----------|---------|
+| `sec_easy` | Easy | `single` | 4 | 0.80 | `identify_vulnerability` |
+| `sec_medium` | Medium | `multi` | 6 | 0.75 | `identify` → `propose_fix` → `revise_fix` |
+| `sec_hard` | Hard | `adversarial` | 8 | 0.70 | `identify` → `propose_fix` → `revise_fix` (reviewer) |
+**Dataset:** 10 ground-truth cases covering SQL injection, XSS, IDOR, hardcoded secrets, missing auth, JWT misuse, path traversal, SSRF.
+### 📦 Domain 2: PyTorch Migration Time-Machine
+Agents must detect deprecated APIs, resolve version conflicts, and fix `torch.compile` graph-break patterns.
+| Task | Difficulty | Subtype | Max Steps | Threshold | Actions |
+|------|-----------|---------|-----------|-----------|---------|
+| `dep_easy` | Easy | `flag` | 4 | 0.80 | `flag_outdated` |
+| `dep_medium` | Medium | `resolve` | 6 | 0.75 | `resolve_conflict` |
+| `dep_hard` | Hard | `migrate` | 8 | 0.70 | `migrate_api` / `validate_tree` |
+**Dataset:** 10 ground-truth cases covering Variable, cuda(), DataParallel, ONNX export, torch.compile graph-breaks.
+### 🏥 Domain 3: Clinical Workflow Chaos Simulator
+Agents must detect missing steps in hospital workflows, rank them by priority, and plan dependency-ordered recovery sequences.
+| Task | Difficulty | Max Steps | Threshold | Actions |
+|------|-----------|-----------|-----------|---------|
+| `cli_easy` | Easy | 4 | 0.80 | `detect_gap` |
+| `cli_medium` | Medium | 6 | 0.75 | `detect_gap` → `rank_issues` |
+| `cli_hard` | Hard | 6 | 0.70 | `detect_gap` → `rank_issues` → `order_steps` |
+**Dataset:** 10 ground-truth cases covering surgery prep, ER triage, chemotherapy, cardiac emergency, blood transfusion.
+---
+## 📊 Observation & Action Spaces
+### Observation Space
+Every observation includes these core fields:
+| Field | Type | Description |
+|-------|------|-------------|
+| `task_type` | `str` | Domain: `security`, `dependency`, or `clinical` |
+| `task_id` | `str` | Task identifier (e.g., `sec_easy`) |
+| `task_subtype` | `str` | Variant: `single`, `multi`, `flag`, `resolve`, `migrate` |
+| `task_description` | `str` | Human-readable problem description |
+| `available_actions` | `list[dict]` | Valid actions with parameter specs |
+| `turn` | `int` | Current step number |
+| `done` | `bool` | Whether episode has ended |
+Domain-specific fields are added (e.g., `code_snippet` for security, `compatibility_matrix` for dependency, `events` and `dependency_graph` for clinical).
+### Action Space
+Actions are JSON objects with `action_type` and domain-specific parameters:
+```json
+{"action_type": "identify_vulnerability", "vuln_type": "sql_injection", "cvss_score": 8.5, "severity": "critical", "affected_line": 3}
+{"action_type": "propose_fix", "fix_code": "db.execute(query, (param,))", "explanation": "Use parameterized queries"}
+{"action_type": "flag_outdated", "packages": {"torch": "1.9.0"}, "deprecated_api": "torch.autograd.Variable", "replacement": "plain tensor"}
+{"action_type": "detect_gap", "missing_steps": ["pre_op_consent"], "risk_level": "critical"}
+```
+---
+## 📊 Scoring System
+### Two-Layer Grading Architecture
+**Layer 1: `base_grader.py`** — Universal reward pipeline applied to ALL domains:
+```
+reward = safe_score(correctness + repetition_penalty + harmful_penalty + efficiency_bonus)
+```
+| Component | Formula | Range |
+|-----------|---------|-------|
+| `compute_correctness()` | Domain-specific (see below) | 0.0 – 1.0 |
+| `repetition_penalty` | −0.15 × count(same action in last 3 turns) | −0.45 – 0.0 |
+| `harmful_output_penalty` | −0.30 if forbidden pattern detected | −0.30 – 0.0 |
+| `efficiency_bonus` | +0.10 if `correctness >= 0.8` and early finish | 0.0 – 0.10 |
+| `safe_score()` | `clamp(score, 0.0, 1.0)` | 0.0 – 1.0 |
+**Layer 2: Domain-specific graders:**
+#### Security Grader
+| Action | Component | Weight |
+|--------|-----------|--------|
+| `identify_vulnerability` | vuln_type match | ×0.45 |
+| `identify_vulnerability` | CVSS in range (partial: ±3.0) | ×0.30 |
+| `identify_vulnerability` | severity match (adjacent: ×0.40) | ×0.25 |
+| `propose_fix` | token coverage + identifier preserved (floor: 0.25) | up to 1.15 |
+| `revise_fix` | feedback keyword coverage − regression (floor: 0.20) | 0.0 – 1.0 |
+#### Dependency Grader
+| Action | Formula |
+|--------|---------|
+| `flag_outdated` | F1 × 0.55 + deprecated_api_match × 0.45 |
+| `resolve_conflict` | valid_pkgs / conflict_count + tree_bonus(0.15) − downgrade(0.10) |
+| `migrate_api` | order_score × 0.30 + completeness × 0.40 + fix_quality × 0.30 |
+#### Clinical Grader
+| Action | Formula |
+|--------|---------|
+| `detect_gap` | F1(predicted, expected) × 0.65 + risk_match × 0.35 |
+| `rank_issues` | completeness × 0.40 + NDCG@k × 0.60 |
+| `order_steps` | order_violations × 0.40 + completeness × 0.40 + efficiency × 0.20 |
+### GRPO Training Signal Quality
+This environment is specifically designed for **Group Relative Policy Optimization**:
+- **Smooth reward ramp** — Scores transition smoothly from 0.0 → 1.0, never binary
+- **Partial credit everywhere** — F1 scoring, NDCG ranking, adjacent-severity credit
+- **Progressive penalty learning** — Schema penalty (−0.20), repetition (−0.15), harmful (−0.30)
+- **Efficiency bonus** — Agents learn to solve faster by finishing early
+- **Floor scores** — Valid workflow attempts always get minimum credit (0.20–0.25)
+---
+## 🔐 Validation (3 Stages)
+Every action goes through 3-stage validation before reaching the grader:
+1. **Schema** — Required fields present? Correct types? (Auto-casts `"8.5"` → `8.5`)
+2. **Domain** — Is `vuln_type` in the valid set? Is `cvss_score` in [0, 10]?
+3. **Consistency** — Is `revise_fix` called after `propose_fix`? No identical repeats?
+If validation fails, the agent gets a **rich feedback observation** (not just 0.0):
+```json
+{
+  "validation_failed": true,
+  "error_type": "domain_error",
+  "message": "cvss_score 12.5 out of range",
+  "hint": "cvss_score must be a float between 0.0 and 10.0",
+  "available_actions": ["identify_vulnerability", "propose_fix", "revise_fix"]
+}
+```
+---
+## 🏛️ Architecture
+```
+project-root/
+├── inference.py                # Baseline agent (OpenAI-compatible, spec-compliant logs)
+├── openenv.yaml                # OpenEnv manifest (9 tasks declared)
+├── pyproject.toml              # Python package config with openenv-core dependency
+├── Dockerfile                  # Docker build for HF Spaces (port 7860)
+├── server/
+│   ├── app.py                  # FastAPI endpoints: /, /reset, /step, /state, /debug
+│   ├── router.py               # Central dispatcher: observations, done conditions, score_details
+│   ├── session.py              # In-memory session state management
+│   ├── benchmark_store.py      # Persistent JSON results store (survives restarts)
+│   ├── demo_agent.py           # Rule-based demo agent for Gradio UI
+│   ├── web_ui.py               # Gradio UI with task runner and history
+│   ├── debug_panel.html        # Interactive HTML debug panel
+│   ├── validation/
+│   │   └── validator.py        # 3-stage validation: Schema → Domain → Consistency
+│   ├── graders/
+│   │   ├── base_grader.py      # safe_score, grade_dynamic, penalties, bonuses
+│   │   ├── security_grader.py  # Vuln detection, fix quality, feedback coverage
+│   │   ├── dependency_grader.py # F1 scoring, version checking, graph ordering
+│   │   └── clinical_grader.py  # F1, NDCG ranking, dependency-violation counting
+│   └── datasets/
+│       ├── security_cases.py   # 10 cases: SQL injection, XSS, IDOR, SSRF, etc.
+│       ├── dependency_cases.py # 10 cases: Variable, cuda(), DataParallel, graph-breaks
+│       └── clinical_cases.py   # 10 cases: surgery prep, ER triage, chemo, cardiac
+└── results/
+    └── run_history.json        # Persistent benchmark results (auto-created)
+```
+---
+## 📡 API Endpoints
+| Method | Path | Description |
+|--------|------|-------------|
+| `GET /` | Health check | Returns status, task list, spec version |
+| `POST /reset` | Start episode | `{"task_id": "sec_easy"}` → `{episode_id, observation}` |
+| `POST /step` | Submit action | `{episode_id, action_type, ...}` → `{reward, done, observation}` |
+| `GET /state` | Query state | `?episode_id=xxx` → `{step_count, done, reward_acc}` |
+| `GET /debug` | Debug panel | Interactive HTML benchmark runner |
+| `GET /web` | Gradio UI | Full task browser with run history |
+### Step Response Format
+```json
+{
+  "episode_id": "uuid-string",
+  "step_count": 2,
+  "reward": 0.75,
+  "done": false,
+  "observation": {
+    "task_type": "security",
+    "task_id": "sec_easy",
+    "task_subtype": "single",
+    "task_description": "Identify the SQL injection vulnerability...",
+    "turn": 1,
+    "done": false,
+    "available_actions": [...]
+  },
+  "score_details": {
+    "vuln_type_match": 1.0,
+    "cvss_in_range": 1.0,
+    "severity_match": 0.0
+  }
+}
+```
+---
+## 🚀 Setup & Running
+### Prerequisites
+- Python 3.10+
+- `pip install fastapi uvicorn openai requests packaging gradio python-dotenv`
+### Running Locally
+```bash
+# 1. Start the environment server
+cd multi-agent-dev-tools-env
+uvicorn server.app:app --host 0.0.0.0 --port 7860
+# 2. Run baseline inference (in another terminal)
+export API_BASE_URL="https://router.huggingface.co/v1"
+export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
+export HF_TOKEN="your_token_here"
+export ENV_URL="http://localhost:7860"
+python inference.py
+```
+### Docker
+```bash
+docker build -t multi-agent-dev-tools-env .
+docker run -p 7860:7860 multi-agent-dev-tools-env
+```
+### Deploy to Hugging Face Spaces
+```bash
+huggingface-cli login
+openenv push --repo-id <username>/multi-agent-dev-tools-env
+```
+---
+## 📝 Mandatory Log Format
+The `inference.py` emits structured stdout logs matching the spec exactly:
+```
+[START] task=sec_easy env=multi-agent-dev-tools-env model=Qwen/Qwen2.5-72B-Instruct
+[STEP] step=1 action=identify_vulnerability reward=0.85 done=false error=null
+[STEP] step=2 action=propose_fix reward=1.00 done=true error=null
+[END] success=true steps=2 score=1.00 rewards=0.85,1.00
+```
+### Environment Variables (Required)
+| Variable | Description | Example |
+|----------|-------------|---------|
+| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
+| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
+| `HF_TOKEN` | API key / HF token | `hf_xxxxx` or `gsk_xxxxx` |
+| `ENV_URL` | Environment URL | `http://localhost:7860` |
+---
+## 📈 Baseline Scores
+Tested with multiple model families for universal compatibility:
+| Model | Family | Parameters | Average Score |
+|-------|--------|------------|---------------|
+| Llama 3.3 70B | Meta | 70B | **0.97** |
+| Qwen3-32B | Alibaba | 32B | **0.99** |
+| DeepSeek V3.2 | DeepSeek | MoE | **0.96** |
+The environment provides smooth reward gradients that enable GRPO training of smaller models (8B+).
+---
+## 🔧 Key Design Decisions
+1. **Data-driven done conditions** — `completion_threshold` and `required_sequence` stored per case
+2. **Universal model compatibility** — Strips `<think>`, `<reasoning>`, `<antThinking>` etc.
+3. **Type-casting validator** — Auto-converts `"8.5"` → `8.5` before rejecting
+4. **Floor scores** — Valid workflow attempts always get minimum credit
+5. **Deterministic case selection** — `hash(episode_id) % len(cases)` for reproducibility
+6. **Compatibility matrix separation** — Prevents context truncation for large observations
+7. **Patch-level version fuzzy** — `2.1.1` matches `2.1.0` by major.minor
+8. **Hallucination filter** — `_score_rank` filters step IDs not in `available_steps`
+9. **Persistent results** — `benchmark_store.py` writes to disk, survives restarts
+10. **Robust dependency fallback** — Works without `packaging` module via manual version parsing
+---
+## ☑️ Compliance Checklist
+### Phase 1: Automated Validation (Pass/Fail)
+- [x] HF Space deploys and responds to `GET /`
+- [x] `openenv.yaml` present with all 9 task IDs
+- [x] `POST /reset` returns `episode_id` + `observation` for all 9 tasks
+- [x] `POST /step` returns `reward` (float, 0.0–1.0) + `done` (bool) + `observation`
+- [x] `GET /state` returns episode state
+- [x] All endpoints return HTTP 200 (never 500)
+- [x] `Dockerfile` at project root, builds cleanly
+- [x] `inference.py` at project root, runs under 20 min
+- [x] `openenv validate` passes
+### Phase 2: Agentic Evaluation (Scored)
+- [x] Observations include `task_type`, `task_subtype`, `task_description`, `available_actions`
+- [x] Partial credit graders (F1, NDCG, weighted sub-scores) — not binary
+- [x] Score variance across 9 tasks (varied difficulty = varied scores)
+- [x] `score_details` in step response for grading transparency
+- [x] `safe_score()` clamps all rewards to [0.0, 1.0]
+### Phase 3: Human Review
+- [x] 3 real-world domains (security, dependency, clinical)
+- [x] Multi-turn iterative workflows (identify → fix → revise)
+- [x] Rich validation hints for agent learning
+- [x] Debug panel with benchmark runner UI
+- [x] GRPO-compatible reward shaping

inference.py ADDED Viewed

	@@ -0,0 +1,341 @@

+# inference.py  <-- MUST be at project root
+# Mandatory baseline inference script for OpenEnv hackathon.
+# Uses OpenAI-compatible client for HuggingFace Inference API.
+#
+# STDOUT FORMAT (mandatory — any deviation causes scoring failure):
+#   [START] task=<task_name> env=<benchmark> model=<model_name>
+#   [STEP]  step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+#   [END]   success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...,rn>
+#
+# Universal model compatibility:
+#   Strips <think>, <thinking>, <reasoning>, <reflection>, <thought>, <antThinking>
+#   Handles unclosed thinking tags, markdown fences, prose before/after JSON
+#   Type coercion for string→float, string→list, etc.
+import os
+import re
+import json
+import textwrap
+import requests
+from openai import OpenAI
+try:
+    from dotenv import load_dotenv
+    load_dotenv()
+except ImportError:
+    pass
+# ── Mandatory environment variables (spec-required names) ──
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME   = os.getenv("MODEL_NAME",   "Qwen/Qwen2.5-72B-Instruct")
+HF_TOKEN     = os.getenv("HF_TOKEN") or os.getenv("API_KEY") or ""
+ENV_URL      = os.getenv("ENV_URL",       "http://localhost:7860")
+MAX_STEPS   = 8
+TEMPERATURE = 0.1
+MAX_TOKENS  = 400
+BENCHMARK   = "multi-agent-dev-tools-env"
+TASKS = [
+    "sec_easy", "sec_medium", "sec_hard",
+    "dep_easy", "dep_medium", "dep_hard",
+    "cli_easy", "cli_medium", "cli_hard",
+]
+# ── Generic System Prompt (works for ALL LLMs) ──
+SYSTEM_PROMPT = textwrap.dedent("""\
+You are an autonomous multi-domain analyst agent inside an RL environment.
+YOUR JOB:
+1. Read the observation — it contains task_type, task_subtype, task_description,
+   available_actions (with parameter specs), and domain-specific data.
+2. Choose the correct action from available_actions.
+3. Respond with ONLY a valid JSON object. No markdown fences. No prose. No thinking tags.
+DOMAIN RULES:
+- security: Workflow is ALWAYS: identify_vulnerability → propose_fix → revise_fix (if feedback)
+  vuln_type MUST be one of: sql_injection|xss|idor|hardcoded_secret|missing_auth|jwt_misuse|path_traversal|ssrf|rate_limit_missing|xxe
+  severity MUST be: critical|high|medium|low. cvss_score: 0.0-10.0 (float).
+  NEVER call identify_vulnerability twice. After identify, ALWAYS call propose_fix next.
+- dependency:
+  task_subtype=flag    → flag_outdated   (find deprecated packages/APIs)
+  task_subtype=resolve → resolve_conflict (pick compatible versions from compatibility_matrix)
+  task_subtype=migrate → migrate_api     (fix ALL graph-break IDs, include code_changes for each)
+- clinical: ALWAYS follow this order: detect_gap → rank_issues → order_steps
+  Use ONLY step IDs from observation.available_steps.
+  risk_level MUST be: critical|high|medium|low
+  If dependency_graph is present, ensure prerequisites come BEFORE dependent steps.
+EXACT FORMAT EXAMPLES — copy field names exactly:
+{"action_type": "identify_vulnerability", "vuln_type": "sql_injection", "cvss_score": 8.5, "severity": "critical", "affected_line": 3}
+{"action_type": "propose_fix", "fix_code": "db.execute(query, (param,))", "explanation": "Use parameterized query to prevent SQL injection"}
+{"action_type": "revise_fix", "fix_code": "cursor.execute(sql, values)", "addressed_feedback": "Used parameterized queries and added input validation"}
+{"action_type": "flag_outdated", "packages": {"torch": "1.9.0"}, "deprecated_api": "torch.autograd.Variable", "replacement": "plain tensor"}
+{"action_type": "resolve_conflict", "packages": {"torch": "2.1.0", "numpy": "1.24.0"}, "reasoning": "torch 2.1 requires numpy >=1.24"}
+{"action_type": "migrate_api", "completed_items": ["break_001", "break_002", "break_003"], "code_changes": {"break_001": "use torch.where", "break_002": "use tensor.shape[0]", "break_003": "use .detach().numpy()"}}
+{"action_type": "detect_gap", "missing_steps": ["pre_op_consent"], "risk_level": "critical"}
+{"action_type": "rank_issues", "priority_order": ["resolve_insurance", "pre_op_consent", "book_specialist"]}
+{"action_type": "order_steps", "recovery_steps": ["resolve_insurance", "complete_pre_op", "book_specialist", "schedule_surgery"]}
+CRITICAL: Output ONLY the JSON object. Nothing before or after it.
+""")
+def build_user_prompt(step_num: int, obs: dict, history: list) -> str:
+    """Build a focused user prompt from observation and history.
+    Works with ALL models — keeps context compact to avoid truncation.
+    """
+    task_type = obs.get("task_type", "unknown")
+    task_id   = obs.get("task_id", "unknown")
+    task_sub  = obs.get("task_subtype", "")
+    parts = [f"Step {step_num} | task_type={task_type} | task_id={task_id} | subtype={task_sub}"]
+    # History summary — short to avoid confusing models
+    if history:
+        used = [h["action_type"] for h in history]
+        last = history[-1]
+        parts.append(f"Actions used so far: {used}")
+        parts.append(f"Last reward: {last['reward']:.2f}")
+        if last["reward"] == 0.0:
+            parts.append("WARNING: Last action scored 0.0 — it was wrong or invalid. Do NOT repeat it.")
+        elif last["reward"] < 0.4:
+            parts.append(f"WARNING: Low score ({last['reward']:.2f}). Try a better approach.")
+    # Validation failure — show prominently
+    if obs.get("validation_failed"):
+        parts.append(f"\nACTION VALIDATION FAILED!")
+        parts.append(f"Error: {obs.get('message', 'unknown error')}")
+        hint = obs.get("hint", obs.get("available_actions", ""))
+        parts.append(f"Hint: {hint}")
+        parts.append("Fix your JSON and try again with a VALID action.")
+    # Reviewer feedback for security tasks
+    if obs.get("reviewer_feedback"):
+        parts.append(f"\nREVIEWER FEEDBACK (address this in your revise_fix):")
+        parts.append(obs["reviewer_feedback"])
+    # Full observation — separate compat matrix to avoid truncation
+    obs_copy = dict(obs)
+    compat = obs_copy.pop("compatibility_matrix", None)
+    obs_text = json.dumps(obs_copy, default=str)
+    if len(obs_text) > 1800:
+        obs_text = obs_text[:1800] + "..."
+    parts.append(f"\nObservation:\n{obs_text}")
+    if compat:
+        parts.append(f"\nCompatibility Matrix (use this to choose correct versions):\n{json.dumps(compat, indent=2)}")
+    # Next action hint — helps ALL models stay on track
+    if task_type == "security":
+        used_types = [h["action_type"] for h in history]
+        if not history or "identify_vulnerability" not in used_types:
+            parts.append("\nNEXT ACTION: identify_vulnerability")
+        elif "propose_fix" not in used_types:
+            parts.append("\nNEXT ACTION: propose_fix")
+        else:
+            parts.append("\nNEXT ACTION: revise_fix (address the reviewer_feedback)")
+    elif task_type == "clinical":
+        used_types = [h["action_type"] for h in history]
+        if "detect_gap" not in used_types:
+            parts.append("\nNEXT ACTION: detect_gap")
+        elif "rank_issues" not in used_types:
+            parts.append("\nNEXT ACTION: rank_issues (use the step IDs from available_steps)")
+        elif "order_steps" not in used_types:
+            parts.append("\nNEXT ACTION: order_steps (respect dependency_graph ordering)")
+    parts.append("\nOutput ONLY a single JSON object:")
+    return "\n".join(parts)
+def parse_action(raw_text: str) -> dict:
+    """Parse LLM response into action dict.
+    Universal compatibility — handles ALL known model output patterns:
+    - Qwen3/DeepSeek R1: <think>...</think>{json}
+    - QwQ: <reasoning>...</reasoning>{json}
+    - Gemini: <thought>...</thought>{json}
+    - Claude: <antThinking>...</antThinking>{json}
+    - Mistral/Mixtral: plain prose before JSON
+    - All models: ```json fences, unclosed tags, nested JSON
+    """
+    text = raw_text.strip()
+    # Strip ALL known reasoning/thinking blocks (closed and unclosed)
+    for tag in ["think", "thinking", "reasoning", "reflection", "thought", "antThinking"]:
+        open_tag = f"<{tag}>"
+        close_tag = f"</{tag}>"
+        if open_tag in text:
+            if close_tag in text:
+                # Normal case: strip everything between tags
+                text = text.split(close_tag)[-1].strip()
+            else:
+                # Unclosed tag: take everything after the open tag and find JSON
+                text = text.split(open_tag)[-1].strip()
+    # Strip markdown code fences
+    if "```json" in text:
+        text = text.split("```json")[1].split("```")[0].strip()
+    elif "```" in text:
+        parts = text.split("```")
+        if len(parts) >= 3:
+            text = parts[1].strip()
+    # Find first JSON object if text has prose before/after
+    if not text.startswith("{"):
+        start = text.find("{")
+        if start >= 0:
+            end = text.rfind("}")
+            if end > start:
+                text = text[start:end + 1]
+    try:
+        return json.loads(text)
+    except (json.JSONDecodeError, TypeError):
+        pass
+    # Regex fallback: find outermost JSON object (handles nested braces)
+    match = re.search(r"\{(?:[^{}]|\{[^{}]*\})*\}", text, re.DOTALL)
+    if match:
+        try:
+            return json.loads(match.group())
+        except (json.JSONDecodeError, TypeError):
+            pass
+    return {"action_type": "error", "raw": text[:100]}
+def run_task(client: OpenAI, task_id: str) -> float:
+    """Run a single task through the environment. Returns score in [0, 1]."""
+    # Reset environment
+    resp = requests.post(f"{ENV_URL}/reset", json={"task_id": task_id}, timeout=30)
+    data = resp.json()
+    if "error" in data and not data.get("episode_id"):
+        # ── MANDATORY: [START] line even on error ──
+        print(f"[START] task={task_id} env={BENCHMARK} model={MODEL_NAME}", flush=True)
+        print(f"[END] success=false steps=0 score=0.00 rewards=", flush=True)
+        return 0.0
+    episode_id = data.get("episode_id", "unknown")
+    obs = data.get("observation", data)
+    # ── MANDATORY [START] — exact spec format ──
+    print(f"[START] task={task_id} env={BENCHMARK} model={MODEL_NAME}", flush=True)
+    rewards = []
+    history = []
+    step_num = 0
+    last_error = None
+    for step_num in range(1, MAX_STEPS + 1):
+        user_prompt = build_user_prompt(step_num, obs, history)
+        error_msg = None
+        try:
+            reply = client.chat.completions.create(
+                model=MODEL_NAME,
+                messages=[
+                    {"role": "system", "content": SYSTEM_PROMPT},
+                    {"role": "user",   "content": user_prompt},
+                ],
+                temperature=TEMPERATURE,
+                max_tokens=MAX_TOKENS,
+            )
+            response_text = (reply.choices[0].message.content or "").strip()
+        except Exception as e:
+            error_msg = str(e)
+            response_text = '{"action_type": "error"}'
+        action = parse_action(response_text)
+        action_type = action.get("action_type", "unknown")
+        action["episode_id"] = episode_id
+        try:
+            step_resp = requests.post(f"{ENV_URL}/step", json=action, timeout=30)
+            step_data = step_resp.json()
+        except Exception as e:
+            error_msg = str(e)
+            # ── MANDATORY [STEP] line on connection error ──
+            print(f"[STEP] step={step_num} action={action_type} reward=0.00 done=true error={error_msg}", flush=True)
+            rewards.append(0.0)
+            break
+        reward = float(step_data.get("reward", 0.0))
+        done   = bool(step_data.get("done", False))
+        obs    = step_data.get("observation", step_data)
+        step_error = step_data.get("error") or error_msg
+        last_error = step_error
+        rewards.append(reward)
+        history.append({"step": step_num, "action_type": action_type, "reward": reward, "done": done})
+        # Show 'invalid' for validation failures
+        display_action = action_type
+        if obs.get("validation_failed"):
+            display_action = "invalid"
+        # ── MANDATORY [STEP] — exact spec format ──
+        error_val = step_error if step_error else "null"
+        print(f"[STEP] step={step_num} action={display_action} reward={reward:.2f} done={str(done).lower()} error={error_val}", flush=True)
+        if done:
+            break
+    # Score = max(rewards) — agent's best single-step performance, clamped to [0, 1]
+    score = round(min(max(max(rewards) if rewards else 0.0, 0.0), 1.0), 2)
+    success = score > 0.0
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    # ── MANDATORY [END] — exact spec format ──
+    print(f"[END] success={str(success).lower()} steps={step_num} score={score:.2f} rewards={rewards_str}", flush=True)
+    return score
+def main() -> None:
+    """Run all 9 tasks and report final scores."""
+    if not HF_TOKEN:
+        print("ERROR: Set HF_TOKEN or API_KEY environment variable.", flush=True)
+        return
+    client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
+    # Health check
+    try:
+        health = requests.get(f"{ENV_URL}/", timeout=10, headers={"Accept": "application/json"})
+        health_data = health.json()
+        print(f"Environment: {health_data.get('env', 'unknown')} | Tasks: {health_data.get('tasks', 0)}", flush=True)
+    except Exception as e:
+        print(f"ERROR: Cannot connect to environment at {ENV_URL}: {e}", flush=True)
+        return
+    scores = {}
+    for task_id in TASKS:
+        try:
+            scores[task_id] = run_task(client, task_id)
+        except Exception as e:
+            print(f"[START] task={task_id} env={BENCHMARK} model={MODEL_NAME}", flush=True)
+            print(f"[END] success=false steps=0 score=0.00 rewards=", flush=True)
+            scores[task_id] = 0.0
+    avg = round(sum(scores.values()) / max(len(scores), 1), 2)
+    print(f"\n✅ All tasks complete! Average: {avg:.2f}", flush=True)
+    # Final scores JSON — evaluator may parse this
+    print(json.dumps({"final_scores": scores}), flush=True)
+    # Persist results to disk
+    try:
+        from server.benchmark_store import append_result
+        append_result(MODEL_NAME, MODEL_NAME, scores)
+        print(f"💾 Results saved (avg: {avg:.4f})", flush=True)
+    except Exception:
+        pass
+if __name__ == "__main__":
+    main()

openenv.yaml ADDED Viewed

	@@ -0,0 +1,59 @@

+spec_version: 1
+name: multi-agent-dev-tools-env
+description: >
+  A multi-domain RL environment for training AI agents on real-world developer
+  and clinical tasks. Covers MCP security auditing, PyTorch migration debugging,
+  and clinical workflow chaos recovery. 9 tasks across 3 domains with graded
+  difficulty (easy/medium/hard).
+type: environment
+runtime: docker
+port: 7860
+# Action and Observation spaces use typed Pydantic models
+# See server/models/ for full definitions
+tasks:
+  - id: sec_easy
+    name: Single vulnerability classification
+    difficulty: easy
+    description: Identify vulnerability type, CVSS score, and severity from a tool-call snippet.
+  - id: sec_medium
+    name: Vulnerability identification + fix proposal
+    difficulty: medium
+    description: Identify the vulnerability and propose a secure code fix.
+  - id: sec_hard
+    name: Adversarial patch defense with reviewer feedback
+    difficulty: hard
+    description: Identify, fix, and iteratively revise based on reviewer feedback.
+  - id: dep_easy
+    name: PyTorch 1.x deprecated API detection
+    difficulty: easy
+    description: Flag outdated packages and deprecated API usage.
+  - id: dep_medium
+    name: Version conflict chain resolution
+    difficulty: medium
+    description: Resolve version conflicts using compatibility matrix constraints.
+  - id: dep_hard
+    name: torch.compile graph-break hunter
+    difficulty: hard
+    description: Fix torch.compile graph-break patterns in dependency order.
+  - id: cli_easy
+    name: Single workflow gap detection
+    difficulty: easy
+    description: Detect missing steps in a clinical workflow and assess risk.
+  - id: cli_medium
+    name: Multi-gap priority ranking
+    difficulty: medium
+    description: Detect gaps and rank them by clinical priority.
+  - id: cli_hard
+    name: Dependency-ordered recovery planning
+    difficulty: hard
+    description: Plan a dependency-safe recovery sequence for a disrupted clinical workflow.

pyproject.toml ADDED Viewed

	@@ -0,0 +1,19 @@

+[project]
+name = "multi-agent-dev-tools-env"
+version = "1.0.0"
+requires-python = ">=3.10"
+dependencies = [
+    "fastapi>=0.110.0",
+    "uvicorn>=0.29.0",
+    "pydantic>=2.0.0",
+    "openai>=1.0.0",
+    "requests>=2.31.0",
+    "packaging>=24.0",
+    "pytest>=8.0.0",
+    "gradio>=4.0.0",
+    "python-dotenv>=1.0.0",
+    "openenv-core>=0.2.0"
+]
+[project.scripts]
+server = "server.app:main"

server/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


1	+ # server package

server/app.py ADDED Viewed

	@@ -0,0 +1,631 @@

+# server/app.py
+# FastAPI server with endpoints. ALL return HTTP 200 ALWAYS.
+# Endpoints: GET /, GET /debug, POST /reset, POST /step, GET /state, POST /inference
+import os
+import sys
+import json
+import random
+import uuid
+import subprocess
+from fastapi import FastAPI, Request
+from fastapi.responses import JSONResponse, HTMLResponse
+from .session import create_session, SESSIONS, TASK_TYPE_MAP, SessionState
+from .router import route_step, build_initial_obs
+from .validation.validator import validate_action
+from .datasets.security_cases import SECURITY_CASES
+from .datasets.dependency_cases import DEPENDENCY_CASES
+from .datasets.clinical_cases import CLINICAL_CASES
+app = FastAPI(title='Multi-Agent Dev Tools Environment')
+# ── Load Debug Panel HTML ──
+_DEBUG_HTML_PATH = os.path.join(os.path.dirname(__file__), 'debug_panel.html')
+def _load_debug_html() -> str:
+    try:
+        with open(_DEBUG_HTML_PATH, 'r', encoding='utf-8') as f:
+            return f.read()
+    except FileNotFoundError:
+        return '<h1>Debug panel not found. Place debug_panel.html in server/ directory.</h1>'
+_DEBUG_HTML = _load_debug_html()
+# ── Mount Gradio UI ──
+try:
+    from .web_ui import build_ui
+    import gradio as gr
+    ui_app = build_ui()
+    app = gr.mount_gradio_app(app, ui_app, path='/web')
+except Exception as e:
+    import traceback
+    print(f'[WARNING] Gradio UI not mounted: {e}')
+    traceback.print_exc()
+# ── Dataset Loader ──
+DATASETS = {
+    'sec_easy': SECURITY_CASES.get('sec_easy', []),
+    'sec_medium': SECURITY_CASES.get('sec_medium', []),
+    'sec_hard': SECURITY_CASES.get('sec_hard', []),
+    'dep_easy': DEPENDENCY_CASES.get('dep_easy', []),
+    'dep_medium': DEPENDENCY_CASES.get('dep_medium', []),
+    'dep_hard': DEPENDENCY_CASES.get('dep_hard', []),
+    'cli_easy': CLINICAL_CASES.get('cli_easy', []),
+    'cli_medium': CLINICAL_CASES.get('cli_medium', []),
+    'cli_hard': CLINICAL_CASES.get('cli_hard', []),
+}
+# Per-domain max steps (must match grader config)
+DOMAIN_MAX_STEPS = {'security': 8, 'dependency': 8, 'clinical': 6}
+def load_case(task_id: str, episode_id: str = '') -> dict:
+    """Load a deterministic case for reproducibility.
+    Same episode_id always gets same case (judges can re-run and match)."""
+    cases = DATASETS.get(task_id, [])
+    if not cases:
+        return {}
+    idx = hash(episode_id) % len(cases)
+    return cases[idx]
+# build_initial_obs is imported from router.py — single source of truth for observations
+# ═══════════════════════════════════════════════════════════
+# ENDPOINTS — All return HTTP 200 ALWAYS
+# ═══════════════════════════════════════════════════════════
+@app.get('/')
+async def health(request: Request):
+    """Health check + debug panel. Returns HTML for browsers, JSON for automated scripts."""
+    try:
+        accept = request.headers.get('accept', '')
+        if 'text/html' in accept:
+            return HTMLResponse(content=_DEBUG_HTML, status_code=200)
+        return {
+            'status': 'ok',
+            'env': 'Multi-Agent Real-World Ecosystem',
+            'domains': ['security', 'pytorch', 'clinical'],
+            'tasks': 9,
+            'task_ids': [
+                'sec_easy', 'sec_medium', 'sec_hard',
+                'dep_easy', 'dep_medium', 'dep_hard',
+                'cli_easy', 'cli_medium', 'cli_hard',
+            ],
+            'spec': 'OpenEnv v1',
+        }
+    except Exception as e:
+        return JSONResponse(status_code=200, content={'status': 'error', 'error': str(e)})
+@app.post('/reset')
+async def reset(request: Request):
+    """Create a new episode for a task. Returns episode_id + initial observation."""
+    try:
+        body = await request.json()
+        task_id = body.get('task_id', 'sec_easy')
+        if task_id not in TASK_TYPE_MAP:
+            return JSONResponse(status_code=200, content={
+                'error': f'Unknown task_id: {task_id}',
+                'observation': {},
+                'done': True,
+            })
+        ep_id = str(uuid.uuid4())
+        task_case = load_case(task_id, ep_id)
+        session = create_session(task_id, task_case)
+        session.episode_id = ep_id
+        SESSIONS[session.episode_id] = session
+        # Cleanup old done sessions to prevent memory leaks on HF Spaces
+        done_ids = [eid for eid, s in SESSIONS.items() if s.done]
+        for eid in done_ids:
+            del SESSIONS[eid]
+        obs = build_initial_obs(session)
+        return {
+            'episode_id': session.episode_id,
+            'observation': obs,
+        }
+    except Exception as e:
+        return JSONResponse(status_code=200, content={
+            'error': str(e),
+            'observation': {},
+            'done': True,
+            'reward': 0.0,
+        })
+@app.post('/step')
+async def step(request: Request):
+    """Submit an action for an episode. Returns reward + next observation."""
+    try:
+        body = await request.json()
+        ep_id = body.get('episode_id')
+        session = SESSIONS.get(ep_id)
+        if not session:
+            return JSONResponse(status_code=200, content={
+                'reward': 0.0,
+                'done': True,
+                'error': 'unknown episode_id',
+                'observation': {},
+            })
+        if session.done:
+            return JSONResponse(status_code=200, content={
+                'reward': 0.0,
+                'done': True,
+                'observation': {'message': 'Episode already complete.'},
+            })
+        # Run pre-action validation
+        valid, val_obs = validate_action(body, session)
+        if not valid:
+            last_r = 0.0
+            if session.history:
+                last_r = session.history[-1].get('reward', 0.0)
+            return {
+                'reward': last_r,
+                'done': False,
+                'observation': val_obs,
+            }
+        # Route to grader
+        result = route_step(session, body)
+        # Update session state
+        session.step_count += 1
+        session.last_actions.append(body.get('action_type', 'unknown'))
+        session.history.append(body)
+        session.reward_acc += result.get('reward', 0.0)
+        session.done = result.get('done', False)
+        # Enrich observation with strategic context
+        step_obs = result.get('observation', {})
+        step_obs['task_type'] = session.task_type
+        step_obs['task_id'] = session.task_id
+        step_obs['step_count'] = session.step_count
+        task_max = DOMAIN_MAX_STEPS.get(session.task_type, 8)
+        step_obs['max_steps'] = task_max
+        step_obs['previous_reward'] = round(float(result.get('reward', 0.0)), 4)
+        step_obs['steps_remaining'] = max(0, task_max - session.step_count)
+        step_obs['reward_so_far'] = round(session.reward_acc, 4)
+        step_obs['trajectory_score'] = round(
+            session.reward_acc / max(session.step_count, 1), 4
+        )
+        # Turn guidance — tell agent what to do next
+        last_action = body.get('action_type', '')
+        if session.task_type == 'security':
+            if last_action == 'identify_vulnerability':
+                step_obs['next_expected_action'] = 'propose_fix'
+                step_obs['guidance'] = 'Vulnerability identified. Now propose a fix using propose_fix.'
+            elif last_action == 'propose_fix':
+                step_obs['next_expected_action'] = 'revise_fix'
+                step_obs['guidance'] = 'Fix proposed. If reviewer_feedback is present, use revise_fix.'
+        elif session.task_type == 'clinical':
+            if last_action == 'detect_gap':
+                step_obs['next_expected_action'] = 'rank_issues'
+                step_obs['guidance'] = 'Gaps detected. Now rank issues by priority using rank_issues.'
+            elif last_action == 'rank_issues':
+                step_obs['next_expected_action'] = 'order_steps'
+                step_obs['guidance'] = 'Issues ranked. Now create recovery plan using order_steps.'
+        # Cleanup session if done
+        if session.done:
+            SESSIONS.pop(session.episode_id, None)
+        return {
+            'reward': round(float(result.get('reward', 0.0)), 4),
+            'done': bool(result.get('done', False)),
+            'observation': step_obs,
+        }
+    except Exception as e:
+        return JSONResponse(status_code=200, content={
+            'reward': 0.0,
+            'done': True,
+            'error': str(e),
+            'observation': {},
+        })
+@app.get('/state')
+async def state(episode_id: str = ''):
+    """Get current state of an episode."""
+    try:
+        session = SESSIONS.get(episode_id)
+        if not session:
+            return {
+                'episode_id': episode_id,
+                'step_count': 0,
+                'done': True,
+            }
+        return {
+            'episode_id': session.episode_id,
+            'step_count': session.step_count,
+            'active_domain': session.task_type,
+            'reward_acc': round(session.reward_acc, 4),
+            'done': session.done,
+        }
+    except Exception as e:
+        return JSONResponse(status_code=200, content={'error': str(e)})
+# ═══════════════════════════════════════════════════════════
+# DEBUG PANEL — guaranteed HTML endpoint
+# ═══════════════════════════════════════════════════════════
+@app.get('/debug', response_class=HTMLResponse)
+async def debug_panel():
+    """Always serves the debug panel HTML regardless of Accept header."""
+    try:
+        html = _load_debug_html()  # Reload from disk each time for development
+        return HTMLResponse(content=html, status_code=200)
+    except Exception as e:
+        return HTMLResponse(content=f'<h1>Error loading debug panel: {e}</h1>', status_code=200)
+# ═══════════════════════════════════════════════════════════
+# INFERENCE — run inference.py from browser
+# ═══════════════════════════════════════════════════════════
+@app.post('/inference')
+async def run_inference(request: Request):
+    """Runs inference.py as a subprocess and returns parsed scores."""
+    try:
+        env_vars = os.environ.copy()
+        env_vars['ENV_URL'] = env_vars.get('ENV_URL', 'http://localhost:7860')
+        # Find inference.py at project root (one level up from server/)
+        inference_path = os.path.join(
+            os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
+            'inference.py'
+        )
+        if not os.path.exists(inference_path):
+            return JSONResponse(status_code=200, content={
+                'error': 'inference.py not found at project root',
+                'path_checked': inference_path,
+            })
+        result = subprocess.run(
+            [sys.executable, inference_path],
+            capture_output=True, text=True, timeout=1200,  # 20 min max
+            env=env_vars
+        )
+        stdout = result.stdout or ''
+        stderr = result.stderr or ''
+        # Parse [END] lines for scores
+        logs = []
+        final_scores = {}
+        for line in stdout.splitlines():
+            line = line.strip()
+            if not line:
+                continue
+            logs.append(line)
+            if line.startswith('[END]'):
+                parts = {}
+                for token in line.split():
+                    if '=' in token:
+                        k, v = token.split('=', 1)
+                        parts[k] = v
+                task_id = parts.get('task_id', '')
+                total_reward = parts.get('total_reward', '0')
+                if task_id:
+                    try:
+                        final_scores[task_id] = float(total_reward)
+                    except ValueError:
+                        final_scores[task_id] = 0.0
+        # Also try final JSON summary line
+        for line in reversed(stdout.splitlines()):
+            line = line.strip()
+            if line.startswith('{') and 'final_scores' in line:
+                try:
+                    parsed = json.loads(line)
+                    if 'final_scores' in parsed:
+                        final_scores = parsed['final_scores']
+                except Exception:
+                    pass
+                break
+        avg = (
+            round(sum(final_scores.values()) / len(final_scores), 4)
+            if final_scores else 0.0
+        )
+        return JSONResponse(status_code=200, content={
+            'status': 'ok' if result.returncode == 0 else 'completed_with_errors',
+            'final_scores': final_scores,
+            'average_score': avg,
+            'logs': logs[-50:],
+            'stderr': stderr[-500:] if stderr else '',
+            'returncode': result.returncode,
+        })
+    except subprocess.TimeoutExpired:
+        return JSONResponse(status_code=200, content={
+            'error': 'inference.py timed out after 20 minutes',
+            'final_scores': {},
+        })
+    except Exception as e:
+        return JSONResponse(status_code=200, content={
+            'error': str(e),
+            'final_scores': {},
+        })
+# ═══════════════════════════════════════════════════════════
+# BENCHMARK RUNNER — run from the UI with custom API keys
+# ═══════════════════════════════════════════════════════════
+TASK_IDS = [
+    'sec_easy', 'sec_medium', 'sec_hard',
+    'dep_easy', 'dep_medium', 'dep_hard',
+    'cli_easy', 'cli_medium', 'cli_hard',
+]
+def _parse_llm_response(raw_text: str) -> str:
+    """Strip thinking blocks and markdown from LLM response. Universal model compat."""
+    text = raw_text.strip()
+    # Strip ALL known reasoning/thinking blocks (closed and unclosed)
+    for tag in ['think', 'thinking', 'reasoning', 'reflection', 'thought', 'antThinking']:
+        open_tag = f'<{tag}>'
+        close_tag = f'</{tag}>'
+        if open_tag in text:
+            if close_tag in text:
+                text = text.split(close_tag)[-1].strip()
+            else:
+                text = text.split(open_tag)[-1].strip()
+    # Strip markdown fences
+    if '```json' in text:
+        text = text.split('```json')[1].split('```')[0].strip()
+    elif '```' in text:
+        parts = text.split('```')
+        if len(parts) >= 3:
+            text = parts[1].strip()
+    # Find JSON object
+    if not text.startswith('{'):
+        start = text.find('{')
+        if start >= 0:
+            end = text.rfind('}')
+            if end > start:
+                text = text[start:end + 1]
+    return text
+def _run_single_task_inline(task_id, api_base, api_key, model_id, system_prompt):
+    """Run one task against the local server. Yields dict events."""
+    import re
+    import requests as req
+    logs = []
+    try:
+        from openai import OpenAI
+        client = OpenAI(base_url=api_base, api_key=api_key)
+    except Exception as e:
+        msg = f'[ERROR] OpenAI client init failed: {e}'
+        logs.append(msg)
+        yield {'type': 'log', 'level': 'err', 'msg': msg}
+        yield {'type': 'task_done', 'task_id': task_id, 'score': 0.0, 'logs': logs}
+        return
+    # Reset
+    try:
+        resp = req.post('http://localhost:7860/reset', json={'task_id': task_id}, timeout=30)
+        data = resp.json()
+    except Exception as e:
+        msg = f'[ERROR] Reset failed: {e}'
+        logs.append(msg)
+        yield {'type': 'log', 'level': 'err', 'msg': msg}
+        yield {'type': 'task_done', 'task_id': task_id, 'score': 0.0, 'logs': logs}
+        return
+    ep_id = data.get('episode_id', 'unknown')
+    obs = data.get('observation', data)
+    msg = f'[START] task={task_id} env=multi-agent-dev-tools-env model={model_id}'
+    logs.append(msg)
+    yield {'type': 'log', 'level': 'info', 'msg': msg}
+    messages = [{'role': 'system', 'content': system_prompt}]
+    rewards = []
+    history = []
+    done = False
+    max_steps = 8
+    while not done and len(rewards) < max_steps:
+        step_num = len(rewards) + 1
+        # Build focused prompt with history context
+        obs_text = json.dumps(obs, default=str)
+        if len(obs_text) > 1500:
+            obs_text = obs_text[:1500] + '...'
+        user_parts = [f'Step {step_num} | Observation:']
+        if history:
+            user_parts.append(f'Previous actions: {[h["action_type"] for h in history]}')
+            if history[-1]['reward'] == 0.0:
+                user_parts.append('WARNING: Last action scored 0.0 — do NOT repeat it.')
+        user_parts.append(obs_text)
+        user_parts.append('Output ONLY a single JSON object:')
+        messages.append({'role': 'user', 'content': '\n'.join(user_parts)})
+        try:
+            reply = client.chat.completions.create(
+                model=model_id, messages=messages, max_tokens=400, temperature=0.1
+            )
+            agent_text = (reply.choices[0].message.content or '').strip()
+        except Exception as e:
+            agent_text = '{"action_type":"invalid"}'
+            msg = f'[WARN] API error: {str(e)[:100]}'
+            logs.append(msg)
+            yield {'type': 'log', 'level': 'warn', 'msg': msg}
+        # Universal think-block + markdown stripping
+        raw = _parse_llm_response(agent_text)
+        messages.append({'role': 'assistant', 'content': raw})
+        if len(messages) > 12:
+            messages = [messages[0]] + messages[-10:]
+        try:
+            action = json.loads(raw)
+        except Exception:
+            # Regex fallback
+            match = re.search(r'\{(?:[^{}]|\{[^{}]*\})*\}', raw, re.DOTALL)
+            if match:
+                try:
+                    action = json.loads(match.group())
+                except Exception:
+                    action = {'action_type': 'invalid'}
+            else:
+                action = {'action_type': 'invalid'}
+        # Step
+        try:
+            step_resp = req.post('http://localhost:7860/step', json={
+                'episode_id': ep_id, **action
+            }, timeout=30)
+            step_data = step_resp.json()
+        except Exception as e:
+            msg = f'[ERROR] Step failed: {e}'
+            logs.append(msg)
+            yield {'type': 'log', 'level': 'err', 'msg': msg}
+            break
+        reward = float(step_data.get('reward', 0.0))
+        done = bool(step_data.get('done', False))
+        obs = step_data.get('observation', step_data)
+        rewards.append(reward)
+        atype = action.get('action_type', '?')
+        display_action = atype
+        if obs.get('validation_failed'):
+            display_action = 'invalid'
+        history.append({'action_type': atype, 'reward': reward})
+        error_val = step_data.get('error', 'null') or 'null'
+        msg = f'[STEP] step={step_num} action={display_action} reward={reward:.2f} done={str(done).lower()} error={error_val}'
+        logs.append(msg)
+        yield {'type': 'log', 'level': 'info', 'msg': msg}
+    # Score = max(rewards) — same logic as inference.py
+    score = round(max(rewards) if rewards else 0.0, 2)
+    score = min(max(score, 0.0), 1.0)
+    success = score > 0.0
+    rewards_str = ','.join(f'{r:.2f}' for r in rewards)
+    msg = f'[END] success={str(success).lower()} steps={len(rewards)} score={score:.2f} rewards={rewards_str}'
+    logs.append(msg)
+    yield {'type': 'log', 'level': 'ok', 'msg': msg}
+    yield {'type': 'task_done', 'task_id': task_id, 'score': score, 'logs': logs}
+@app.post('/benchmark/run')
+def run_benchmark(body: dict):
+    """Run all 9 tasks with a given model config. Streams results via SSE."""
+    from datetime import datetime
+    from fastapi.responses import StreamingResponse
+    from .benchmark_store import append_result
+    import json
+    model_name = body.get('model_name', 'Unknown Model')
+    model_id = body.get('model_id', '')
+    api_base = body.get('api_base', '')
+    api_key = body.get('api_key', '')
+    if not model_id or not api_base or not api_key:
+        return JSONResponse(status_code=200, content={'error': 'missing_fields'})
+    system_prompt = body.get('system_prompt', '') or BENCHMARK_SYSTEM_PROMPT
+    def event_stream():
+        scores = {}
+        all_logs = []
+        for task_id in TASK_IDS:
+            for event in _run_single_task_inline(task_id, api_base, api_key, model_id, system_prompt):
+                if event.get('type') == 'log':
+                    all_logs.append(event['msg'])
+                elif event.get('type') == 'task_done':
+                    scores[task_id] = event['score']
+                yield f"data: {json.dumps(event)}\n\n"
+        avg = round(sum(scores.values()) / len(scores), 4) if scores else 0.0
+        result = {
+            'model_name': model_name,
+            'model_id': model_id,
+            'api_base': api_base,
+            'scores': scores,
+            'average': avg,
+            'timestamp': datetime.now().isoformat(),
+            'logs': all_logs,
+        }
+        # Persist to disk via benchmark_store
+        try:
+            append_result(model_name, model_id, scores)
+        except Exception:
+            pass
+        yield f"data: {json.dumps({'type': 'done', 'result': result})}\n\n"
+    return StreamingResponse(event_stream(), media_type="text/event-stream")
+@app.get('/benchmark/results')
+async def get_benchmark_results():
+    """Return all saved benchmark results (persisted to disk)."""
+    from .benchmark_store import get_all
+    results = get_all()
+    return JSONResponse(status_code=200, content={
+        'results': results,
+        'count': len(results),
+    })
+@app.post('/benchmark/clear')
+async def clear_benchmark_results():
+    """Clear all saved benchmark results."""
+    from .benchmark_store import _save
+    _save([])
+    return JSONResponse(status_code=200, content={'status': 'cleared'})
+# Default system prompt for benchmark
+BENCHMARK_SYSTEM_PROMPT = '''You are a multi-domain analyst agent. Each observation has a task_type field.
+Read it. Respond ONLY with a single valid JSON object. No prose, no markdown, no explanation.
+IF task_type == 'security':
+  Turn 1 ALWAYS: {"action_type":"identify_vulnerability","vuln_type":"sql_injection","cvss_score":9.1,"severity":"critical"}
+  Turn 2 ALWAYS: {"action_type":"propose_fix","fix_code":"db.execute(sql, (param,))","explanation":"Use parameterized query"}
+  Turn 3+ (reviewer_feedback present): {"action_type":"revise_fix","fix_code":"<fixed code>","addressed_feedback":"<COPY feedback verbatim>"}
+IF task_type == 'dependency':
+  task_subtype=flag: {"action_type":"flag_outdated","packages":{"torch":"1.9.0"},"deprecated_api":"torch.autograd.Variable","replacement":"plain tensor"}
+  task_subtype=resolve: READ compatibility_matrix. {"action_type":"resolve_conflict","packages":{"torch":"2.1.0","numpy":"1.24.0"},"reasoning":"..."}
+  task_subtype=migrate: {"action_type":"migrate_api","completed_items":["break_001"],"code_changes":{"break_001":"torch.where"}}
+IF task_type == 'clinical':
+  Turn 1: {"action_type":"detect_gap","missing_steps":["step1","step2"],"risk_level":"critical"}
+  Turn 2: {"action_type":"rank_issues","priority_order":["most_urgent","least_urgent"]}
+  Turn 3: {"action_type":"order_steps","recovery_steps":["first","second","last"]}
+ALWAYS: Output ONLY a single JSON object. Follow guidance and next_expected_action.
+'''
+def main():
+    import uvicorn
+    uvicorn.run("server.app:app", host="0.0.0.0", port=7860, reload=False)
+if __name__ == "__main__":
+    main()

server/benchmark_store.py ADDED Viewed

	@@ -0,0 +1,69 @@

+# server/benchmark_store.py
+# Persists benchmark results to disk so they survive server restarts.
+# Used by both inference.py (CLI) and web_ui.py (frontend).
+import json
+import os
+from datetime import datetime
+from typing import List, Dict
+_STORE_PATH = os.path.join(
+    os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
+    'results', 'run_history.json'
+)
+os.makedirs(os.path.dirname(_STORE_PATH), exist_ok=True)
+def _load() -> List[Dict]:
+    """Load all benchmark results from disk."""
+    if not os.path.exists(_STORE_PATH):
+        return []
+    try:
+        with open(_STORE_PATH, 'r', encoding='utf-8') as f:
+            data = json.load(f)
+            return data if isinstance(data, list) else []
+    except (json.JSONDecodeError, IOError):
+        return []
+def _save(results: List[Dict]) -> None:
+    """Save all benchmark results to disk."""
+    try:
+        with open(_STORE_PATH, 'w', encoding='utf-8') as f:
+            json.dump(results, f, indent=2, default=str)
+    except IOError as e:
+        print(f"[benchmark_store] WARNING: Could not save results: {e}")
+def append_result(model: str, model_id: str, scores: Dict[str, float]) -> Dict:
+    """Add a new benchmark result and persist to disk. Returns the saved entry."""
+    avg = round(sum(scores.values()) / max(len(scores), 1), 4)
+    entry = {
+        'model': model,
+        'model_id': model_id,
+        'scores': scores,
+        'avg': avg,
+        'type': 'full_run',
+        'timestamp': datetime.utcnow().isoformat(),
+    }
+    results = _load()
+    results.append(entry)
+    _save(results)
+    return entry
+def get_all() -> List[Dict]:
+    """Return all benchmark results, newest first."""
+    results = _load()
+    return sorted(results, key=lambda x: x.get('timestamp', ''), reverse=True)
+def get_leaderboard() -> List[Dict]:
+    """Return deduplicated leaderboard: best score per model_id."""
+    results = _load()
+    best: Dict[str, Dict] = {}
+    for r in results:
+        mid = r.get('model_id', r.get('model', 'unknown'))
+        if mid not in best or r.get('avg', 0) > best[mid].get('avg', 0):
+            best[mid] = r
+    return sorted(best.values(), key=lambda x: x.get('avg', 0), reverse=True)

server/datasets/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


1	+ # server/datasets package

server/datasets/clinical_cases.py ADDED Viewed

	@@ -0,0 +1,180 @@

+# server/datasets/clinical_cases.py
+# Ground truth cases for Clinical Workflow Chaos Simulator tasks.
+# Covers: gap detection, priority ranking, dependency-ordered recovery planning.
+CLINICAL_CASES = {
+    'cli_easy': [
+        {
+            'case_id': 'cli_easy_001',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['detect_gap']},
+            'patient_id': 'P101',
+            'patient_events': ['admission', 'surgery_scheduled', 'surgery_performed'],
+            'events': ['admission', 'surgery_scheduled', 'surgery_performed'],
+            'expected_missing_steps': ['pre_op_consent'],
+            'expected_risk': 'critical',
+            'available_steps': ['pre_op_consent', 'blood_work', 'anesthesia_consult'],
+            'task_description': 'A patient is scheduled for surgery but the pre-operative checklist is incomplete. Identify the missing step and assess the risk level.',
+        },
+        {
+            'case_id': 'cli_easy_002',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['detect_gap']},
+            'patient_id': 'P102',
+            'patient_events': ['admission', 'diagnosis', 'medication_prescribed', 'discharge'],
+            'events': ['admission', 'diagnosis', 'medication_prescribed', 'discharge'],
+            'expected_missing_steps': ['allergy_check'],
+            'expected_risk': 'high',
+            'available_steps': ['allergy_check', 'follow_up_scheduled', 'lab_results_reviewed'],
+            'task_description': 'Find the missing safety check in this medication workflow.',
+        },
+        {
+            'case_id': 'cli_easy_003',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['detect_gap']},
+            'patient_id': 'P103',
+            'patient_events': ['er_admission', 'triage', 'treatment', 'discharge'],
+            'events': ['er_admission', 'triage', 'treatment', 'discharge'],
+            'expected_missing_steps': ['insurance_verification'],
+            'expected_risk': 'medium',
+            'available_steps': ['insurance_verification', 'attending_consult', 'social_work_referral'],
+            'task_description': 'Identify the missing administrative step in this ER workflow.',
+        },
+        {
+            'case_id': 'cli_easy_004',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['detect_gap']},
+            'patient_id': 'P104',
+            'patient_events': ['admission', 'ct_scan_ordered', 'ct_scan_performed', 'diagnosis'],
+            'events': ['admission', 'ct_scan_ordered', 'ct_scan_performed', 'diagnosis'],
+            'expected_missing_steps': ['contrast_allergy_screen'],
+            'expected_risk': 'high',
+            'available_steps': ['contrast_allergy_screen', 'kidney_function_test', 'radiologist_review'],
+            'task_description': 'Find the missing safety step before this contrast CT scan.',
+        },
+        {
+            'case_id': 'cli_easy_005',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['detect_gap']},
+            'patient_id': 'P105',
+            'patient_events': ['admission', 'blood_transfusion_ordered', 'transfusion_started'],
+            'events': ['admission', 'blood_transfusion_ordered', 'transfusion_started'],
+            'expected_missing_steps': ['blood_type_crossmatch'],
+            'expected_risk': 'critical',
+            'available_steps': ['blood_type_crossmatch', 'consent_form', 'vital_signs_baseline'],
+            'task_description': 'Find the critical missing step before blood transfusion.',
+        },
+    ],
+    'cli_medium': [
+        {
+            'case_id': 'cli_medium_001',
+            'completion_threshold': 0.75,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 2, 'required_sequence': ['detect_gap', 'rank_issues']},
+            'patient_id': 'P201',
+            'patient_events': ['admission', 'surgery_planned', 'insurance_denied', 'specialist_unavailable'],
+            'events': ['admission', 'surgery_planned', 'insurance_denied', 'specialist_unavailable'],
+            'expected_missing_steps': ['resolve_insurance', 'pre_op_consent', 'book_specialist'],
+            'expected_risk': 'critical',
+            'priority_order': ['resolve_insurance', 'pre_op_consent', 'book_specialist'],
+            'available_steps': ['resolve_insurance', 'pre_op_consent', 'book_specialist', 'schedule_surgery'],
+            'dependency_graph': {
+                'schedule_surgery': ['resolve_insurance', 'pre_op_consent', 'book_specialist'],
+                'pre_op_consent': [],
+                'book_specialist': [],
+                'resolve_insurance': [],
+            },
+            'task_description': 'Multiple steps are missing in this surgical patient workflow. Detect all gaps and rank them by clinical priority.',
+        },
+        {
+            'case_id': 'cli_medium_002',
+            'completion_threshold': 0.75,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 2, 'required_sequence': ['detect_gap', 'rank_issues']},
+            'patient_id': 'P202',
+            'patient_events': ['er_admission', 'triage_level_2', 'medication_given'],
+            'events': ['er_admission', 'triage_level_2', 'medication_given'],
+            'expected_missing_steps': ['allergy_check', 'attending_notification', 'vital_signs_check'],
+            'expected_risk': 'high',
+            'priority_order': ['allergy_check', 'vital_signs_check', 'attending_notification'],
+            'available_steps': ['allergy_check', 'attending_notification', 'vital_signs_check', 'lab_order'],
+            'dependency_graph': {
+                'allergy_check': [],
+                'vital_signs_check': [],
+                'attending_notification': [],
+                'lab_order': ['vital_signs_check'],
+            },
+            'task_description': 'Multiple safety steps were skipped in this ER case. Find and rank them.',
+        },
+        {
+            'case_id': 'cli_medium_003',
+            'completion_threshold': 0.75,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 2, 'required_sequence': ['detect_gap', 'rank_issues']},
+            'patient_id': 'P203',
+            'patient_events': ['admission', 'chemo_ordered', 'chemo_started', 'adverse_reaction'],
+            'events': ['admission', 'chemo_ordered', 'chemo_started', 'adverse_reaction'],
+            'expected_missing_steps': ['baseline_labs', 'oncologist_approval', 'dose_verification'],
+            'expected_risk': 'critical',
+            'priority_order': ['oncologist_approval', 'dose_verification', 'baseline_labs'],
+            'available_steps': ['baseline_labs', 'oncologist_approval', 'dose_verification', 'pharmacy_review'],
+            'dependency_graph': {
+                'oncologist_approval': [],
+                'dose_verification': ['oncologist_approval'],
+                'baseline_labs': [],
+                'pharmacy_review': ['dose_verification'],
+            },
+            'task_description': 'Critical chemotherapy workflow violations. Find all gaps and prioritize.',
+        },
+    ],
+    'cli_hard': [
+        {
+            'case_id': 'cli_hard_001',
+            'completion_threshold': 0.70,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 3, 'required_sequence': ['detect_gap', 'rank_issues', 'order_steps']},
+            'patient_id': 'P301',
+            'patient_events': ['surgery_planned', 'insurance_denied', 'pre_op_test_skipped'],
+            'events': ['surgery_planned', 'insurance_denied', 'pre_op_test_skipped'],
+            'expected_missing_steps': ['resolve_insurance', 'complete_pre_op', 'book_specialist', 'schedule_surgery'],
+            'expected_risk': 'critical',
+            'priority_order': ['resolve_insurance', 'complete_pre_op', 'book_specialist', 'schedule_surgery'],
+            'dependency_graph': {
+                'schedule_surgery': ['resolve_insurance', 'complete_pre_op', 'book_specialist'],
+                'complete_pre_op': ['resolve_insurance'],
+                'book_specialist': [],
+                'resolve_insurance': [],
+            },
+            'required_steps': ['resolve_insurance', 'complete_pre_op', 'book_specialist', 'schedule_surgery'],
+            'available_steps': ['resolve_insurance', 'complete_pre_op', 'book_specialist', 'schedule_surgery'],
+            'task_description': 'A complex surgical patient has multiple workflow failures. Detect all gaps, rank by priority, and plan a dependency-ordered recovery sequence that respects prerequisite constraints.',
+        },
+        {
+            'case_id': 'cli_hard_002',
+            'completion_threshold': 0.70,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 3, 'required_sequence': ['detect_gap', 'rank_issues', 'order_steps']},
+            'patient_id': 'P302',
+            'patient_events': ['cardiac_event', 'icu_admission', 'multiple_failures_detected'],
+            'events': ['cardiac_event', 'icu_admission', 'multiple_failures_detected'],
+            'expected_missing_steps': ['stabilize_vitals', 'cardiology_consult', 'imaging_ordered', 'medication_review', 'family_notification'],
+            'expected_risk': 'critical',
+            'priority_order': ['stabilize_vitals', 'cardiology_consult', 'imaging_ordered', 'medication_review', 'family_notification'],
+            'dependency_graph': {
+                'family_notification': ['stabilize_vitals'],
+                'medication_review': ['cardiology_consult', 'imaging_ordered'],
+                'imaging_ordered': ['stabilize_vitals'],
+                'cardiology_consult': ['stabilize_vitals'],
+                'stabilize_vitals': [],
+            },
+            'required_steps': ['stabilize_vitals', 'cardiology_consult', 'imaging_ordered', 'medication_review', 'family_notification'],
+            'available_steps': ['stabilize_vitals', 'cardiology_consult', 'imaging_ordered', 'medication_review', 'family_notification'],
+            'task_description': 'Complex cardiac emergency recovery plan. Multiple dependency chains. Medication review needs both cardiology consult AND imaging. Respect ALL prerequisites.',
+        },
+    ],
+}

server/datasets/dependency_cases.py ADDED Viewed

	@@ -0,0 +1,280 @@

+# server/datasets/dependency_cases.py
+# Ground truth cases for PyTorch Migration Time-Machine tasks.
+# Covers: deprecated API detection, version conflict resolution, graph-break fixing.
+DEPENDENCY_CASES = {
+    'dep_easy': [
+        {
+            'case_id': 'dep_easy_001',
+            'task_subtype': 'flag',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
+            'expected_outdated_packages': ['torch'],
+            'expected_deprecated_api': 'torch.autograd.Variable',
+            'replacement': 'plain tensor (remove Variable wrapper)',
+            'code_snippet': '''import torch
+from torch.autograd import Variable
+x = Variable(torch.randn(3, 4), requires_grad=True)
+y = Variable(torch.randn(3, 4))
+z = x + y''',
+            'task_description': 'Identify outdated PyTorch packages and deprecated APIs in this legacy training script.',
+        },
+        {
+            'case_id': 'dep_easy_002',
+            'task_subtype': 'flag',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
+            'expected_outdated_packages': ['torch'],
+            'expected_deprecated_api': 'tensor.data.numpy()',
+            'replacement': 'tensor.detach().numpy()',
+            'code_snippet': '''import torch
+model = torch.nn.Linear(10, 5)
+x = torch.randn(1, 10)
+output = model(x)
+result = output.data.numpy()  # deprecated''',
+            'task_description': 'Find deprecated tensor conversion API in this code.',
+        },
+        {
+            'case_id': 'dep_easy_003',
+            'task_subtype': 'flag',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
+            'expected_outdated_packages': ['torch'],
+            'expected_deprecated_api': 'model.cuda()',
+            'replacement': 'model.to(device)',
+            'code_snippet': '''import torch
+model = torch.nn.Sequential(
+    torch.nn.Linear(784, 128),
+    torch.nn.ReLU(),
+    torch.nn.Linear(128, 10)
+)
+model.cuda()  # deprecated device placement
+x = torch.randn(1, 784).cuda()''',
+            'task_description': 'Detect deprecated device placement API in this model code.',
+        },
+        {
+            'case_id': 'dep_easy_004',
+            'task_subtype': 'flag',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
+            'expected_outdated_packages': ['torch'],
+            'expected_deprecated_api': 'torch.onnx.export',
+            'replacement': 'torch.onnx.dynamo_export',
+            'code_snippet': '''import torch
+model = torch.nn.Linear(10, 5)
+dummy = torch.randn(1, 10)
+torch.onnx.export(model, dummy, "model.onnx",
+                  opset_version=11)''',
+            'task_description': 'Find the deprecated ONNX export API in this code.',
+        },
+        {
+            'case_id': 'dep_easy_005',
+            'task_subtype': 'flag',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['flag_outdated']},
+            'expected_outdated_packages': ['torch'],
+            'expected_deprecated_api': 'torch.nn.DataParallel',
+            'replacement': 'torch.nn.parallel.DistributedDataParallel or FSDP',
+            'code_snippet': '''import torch
+import torch.nn as nn
+model = nn.Linear(100, 10)
+model = nn.DataParallel(model)  # deprecated
+model.cuda()''',
+            'task_description': 'Find deprecated parallelism API in this training code.',
+        },
+    ],
+    'dep_medium': [
+        {
+            'case_id': 'dep_medium_001',
+            'task_subtype': 'resolve',
+            'completion_threshold': 0.75,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['resolve_conflict']},
+            'conflict_packages': ['torch', 'numpy'],
+            'compatibility_matrix': {
+                'torch': {
+                    '2.1.0': {'numpy': '>=1.24,<2.0'},
+                    '2.0.0': {'numpy': '>=1.22,<1.26'},
+                    '1.13.0': {'numpy': '>=1.19,<1.25'},
+                },
+                'numpy': {
+                    '1.26.0': {},
+                    '1.24.0': {},
+                    '1.22.0': {},
+                    '1.19.0': {},
+                    '1.16.0': {},
+                },
+            },
+            'requirements': {'torch': '1.9.0', 'numpy': '1.16.0'},
+            'code_snippet': '''# requirements.txt
+torch==1.9.0
+numpy==1.16.0
+torchvision==0.10.0''',
+            'task_description': 'Resolve the version conflict between torch and numpy. Find compatible versions using the compatibility matrix.',
+        },
+        {
+            'case_id': 'dep_medium_002',
+            'task_subtype': 'resolve',
+            'completion_threshold': 0.75,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['resolve_conflict']},
+            'conflict_packages': ['torch', 'numpy', 'torchvision'],
+            'compatibility_matrix': {
+                'torch': {
+                    '2.2.0': {'numpy': '>=1.24,<2.0', 'torchvision': '>=0.17'},
+                    '2.1.0': {'numpy': '>=1.24,<2.0', 'torchvision': '>=0.16'},
+                    '2.0.0': {'numpy': '>=1.22,<1.26', 'torchvision': '>=0.15'},
+                },
+                'numpy': {
+                    '1.26.0': {},
+                    '1.24.0': {},
+                    '1.22.0': {},
+                },
+                'torchvision': {
+                    '0.17.0': {'torch': '>=2.2'},
+                    '0.16.0': {'torch': '>=2.1'},
+                    '0.15.0': {'torch': '>=2.0'},
+                },
+            },
+            'requirements': {'torch': '1.12.0', 'numpy': '1.21.0', 'torchvision': '0.13.0'},
+            'code_snippet': '''# requirements.txt
+torch==1.12.0
+numpy==1.21.0
+torchvision==0.13.0
+# CUDA 11.7''',
+            'task_description': 'Resolve three-way conflict between PyTorch, NumPy, and TorchVision.',
+        },
+        {
+            'case_id': 'dep_medium_003',
+            'task_subtype': 'resolve',
+            'completion_threshold': 0.75,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['resolve_conflict']},
+            'conflict_packages': ['torch', 'transformers'],
+            'compatibility_matrix': {
+                'torch': {
+                    '2.1.0': {'transformers': '>=4.35'},
+                    '2.0.0': {'transformers': '>=4.30'},
+                },
+                'transformers': {
+                    '4.37.0': {'torch': '>=2.0'},
+                    '4.35.0': {'torch': '>=2.0'},
+                    '4.30.0': {'torch': '>=1.13'},
+                },
+            },
+            'requirements': {'torch': '1.11.0', 'transformers': '4.20.0'},
+            'code_snippet': '''# requirements.txt
+torch==1.11.0
+transformers==4.20.0''',
+            'task_description': 'Resolve conflict between PyTorch and Transformers library versions.',
+        },
+    ],
+    'dep_hard': [
+        {
+            'case_id': 'dep_hard_001',
+            'task_subtype': 'migrate',
+            'completion_threshold': 0.70,
+            'max_steps': 8,
+            'done_conditions': {'min_actions': 2, 'required_sequence': ['migrate_api']},
+            'graph_breaks': ['break_001', 'break_002', 'break_003'],
+            'checklist_dependency_graph': {
+                'break_003': ['break_001', 'break_002'],
+                'break_002': ['break_001'],
+                'break_001': [],
+            },
+            'correct_fix_map': {
+                'break_001': 'torch.where',
+                'break_002': 'tensor.shape[0]',
+                'break_003': '.detach().numpy()',
+            },
+            'code_snippet': '''import torch
+@torch.compile
+def forward(x):
+    # break_001: data-dependent control flow
+    if x.item() > 0.5:
+        x = x * 2
+    # break_002: Python builtin on tensor
+    batch_size = len(x)
+    # break_003: numpy conversion inside compile
+    result = x.numpy()
+    return result''',
+            'break_descriptions': [
+                'break_001: line 6 — data-dependent control flow: if x.item() > 0.5',
+                'break_002: line 9 — Python builtin on tensor: len(x)',
+                'break_003: line 12 — numpy inside compiled function: x.numpy()',
+            ],
+            'graph_break_report': [
+                'break_001: line 6 — data-dependent control flow: if x.item() > 0.5',
+                'break_002: line 9 — Python builtin on tensor: len(x)',
+                'break_003: line 12 — numpy inside compiled function: x.numpy()',
+            ],
+            'task_description': 'This PyTorch model uses torch.compile but has multiple graph-break patterns. Fix them in dependency order.',
+        },
+        {
+            'case_id': 'dep_hard_002',
+            'task_subtype': 'migrate',
+            'completion_threshold': 0.70,
+            'max_steps': 8,
+            'done_conditions': {'min_actions': 2, 'required_sequence': ['migrate_api']},
+            'graph_breaks': ['break_a', 'break_b', 'break_c', 'break_d'],
+            'checklist_dependency_graph': {
+                'break_d': ['break_b', 'break_c'],
+                'break_c': ['break_a'],
+                'break_b': ['break_a'],
+                'break_a': [],
+            },
+            'correct_fix_map': {
+                'break_a': 'torch.where',
+                'break_b': 'tensor.shape[0]',
+                'break_c': 'torch.tensor',
+                'break_d': '.detach()',
+            },
+            'code_snippet': '''import torch
+@torch.compile(fullgraph=True)
+def training_step(model, x, labels):
+    # break_a: data-dependent branch
+    if x.max().item() > 1.0:
+        x = x / x.max()
+    # break_b: Python len() on tensor
+    n_samples = len(x)
+    # break_c: Python list to tensor inside compile
+    weights = torch.FloatTensor([1.0, 2.0, 3.0])
+    # break_d: in-place operation on leaf tensor
+    x += 0.1  # in-place modification
+    output = model(x)
+    loss = torch.nn.functional.cross_entropy(output, labels)
+    return loss''',
+            'break_descriptions': [
+                'break_a: line 6 — data-dependent: if x.max().item() > 1.0',
+                'break_b: line 10 — Python builtin: len(x)',
+                'break_c: line 13 — legacy constructor: torch.FloatTensor()',
+                'break_d: line 16 — in-place op on leaf: x += 0.1',
+            ],
+            'graph_break_report': [
+                'break_a: line 6 — data-dependent: if x.max().item() > 1.0',
+                'break_b: line 10 — Python builtin: len(x)',
+                'break_c: line 13 — legacy constructor: torch.FloatTensor()',
+                'break_d: line 16 — in-place op on leaf: x += 0.1',
+            ],
+            'task_description': 'Fix all 4 graph-break patterns in this compiled training step. Dependencies must be resolved in order.',
+        },
+    ],
+}

server/datasets/security_cases.py ADDED Viewed

	@@ -0,0 +1,211 @@

+# server/datasets/security_cases.py
+# Ground truth cases for MCP Security Sandbox tasks.
+# Graders compare agent answers against these cases.
+SECURITY_CASES = {
+    'sec_easy': [
+        {
+            'case_id': 'sec_easy_001',
+            'task_subtype': 'single',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['identify_vulnerability']},
+            'expected_vuln_type': 'sql_injection',
+            'cvss_range': [7.5, 9.8],
+            'expected_severity': 'critical',
+            'required_fix_tokens': ['?', 'execute', 'param'],
+            'must_preserve_identifier': 'execute_query',
+            'current_feedback_keywords': ['sql', 'injection'],
+            'original_vuln_pattern': "query + user_input",
+            'tool_call': 'db.execute("SELECT * FROM users WHERE name = \'" + user_input + "\'")',
+            'task_description': 'Identify the SQL injection vulnerability in this MCP database tool call and propose a fix.',
+        },
+        {
+            'case_id': 'sec_easy_002',
+            'task_subtype': 'single',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['identify_vulnerability']},
+            'expected_vuln_type': 'xss',
+            'cvss_range': [4.0, 6.5],
+            'expected_severity': 'medium',
+            'required_fix_tokens': ['escape', 'html'],
+            'must_preserve_identifier': 'render_html',
+            'current_feedback_keywords': ['xss', 'script'],
+            'original_vuln_pattern': '<script>',
+            'tool_call': 'render_html(f"<div>{user_comment}</div>")',
+            'task_description': 'Scan this HTML rendering call for XSS vulnerability.',
+        },
+        {
+            'case_id': 'sec_easy_003',
+            'task_subtype': 'single',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['identify_vulnerability']},
+            'expected_vuln_type': 'hardcoded_secret',
+            'cvss_range': [5.0, 7.5],
+            'expected_severity': 'high',
+            'required_fix_tokens': ['env', 'os'],
+            'must_preserve_identifier': 'connect_db',
+            'current_feedback_keywords': ['hardcoded', 'password'],
+            'original_vuln_pattern': 'password="admin123"',
+            'tool_call': 'connect_db(host="prod-db.internal", password="admin123", user="root")',
+            'task_description': 'Scan this database connection call for hardcoded credentials.',
+        },
+        {
+            'case_id': 'sec_easy_004',
+            'task_subtype': 'single',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['identify_vulnerability']},
+            'expected_vuln_type': 'path_traversal',
+            'cvss_range': [6.0, 8.5],
+            'expected_severity': 'high',
+            'required_fix_tokens': ['basename', 'path'],
+            'must_preserve_identifier': 'read_file',
+            'current_feedback_keywords': ['path', 'traversal', '..'],
+            'original_vuln_pattern': '../../../etc/passwd',
+            'tool_call': 'read_file(filepath=user_input)  # user sends "../../etc/passwd"',
+            'task_description': 'Scan this file read call for path traversal vulnerability.',
+        },
+        {
+            'case_id': 'sec_easy_005',
+            'task_subtype': 'single',
+            'completion_threshold': 0.80,
+            'max_steps': 4,
+            'done_conditions': {'min_actions': 1, 'required_sequence': ['identify_vulnerability']},
+            'expected_vuln_type': 'missing_auth',
+            'cvss_range': [7.0, 9.0],
+            'expected_severity': 'critical',
+            'required_fix_tokens': ['auth', 'token', 'verify'],
+            'must_preserve_identifier': 'delete_user',
+            'current_feedback_keywords': ['authentication', 'missing'],
+            'original_vuln_pattern': 'no auth check',
+            'tool_call': 'delete_user(user_id=request.params["id"])  # no authentication check',
+            'task_description': 'Scan this admin action call for missing authentication.',
+        },
+    ],
+    'sec_medium': [
+        {
+            'case_id': 'sec_medium_001',
+            'task_subtype': 'multi',
+            'completion_threshold': 0.75,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 3, 'required_sequence': ['identify_vulnerability', 'propose_fix', 'revise_fix']},
+            'expected_vuln_type': 'sql_injection',
+            'cvss_range': [7.5, 9.8],
+            'expected_severity': 'critical',
+            'required_fix_tokens': ['?', 'execute', 'param'],
+            'must_preserve_identifier': 'search_products',
+            'current_feedback_keywords': ['sql', 'union', 'injection'],
+            'original_vuln_pattern': 'query + ',
+            'reviewer_feedback': 'Your fix prevents basic injection but UNION-based attacks are still possible. Use parameterized queries with bound parameters.',
+            'additional_vulns': [
+                {'vuln_type': 'xss', 'severity': 'medium', 'line': 15},
+                {'vuln_type': 'missing_auth', 'severity': 'high', 'line': 3},
+            ],
+            'tool_call': '''def search_products(query, user_id):
+    # No auth check on line 3
+    results = db.execute("SELECT * FROM products WHERE name LIKE '%" + query + "%'")
+    return f"<div>{results}</div>"  # XSS on line 15''',
+            'task_description': 'This MCP tool has multiple security issues. Identify the most critical vulnerability, propose a fix, and revise based on reviewer feedback.',
+        },
+        {
+            'case_id': 'sec_medium_002',
+            'task_subtype': 'multi',
+            'completion_threshold': 0.75,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 3, 'required_sequence': ['identify_vulnerability', 'propose_fix', 'revise_fix']},
+            'expected_vuln_type': 'ssrf',
+            'cvss_range': [6.5, 9.0],
+            'expected_severity': 'high',
+            'required_fix_tokens': ['allowlist', 'url'],
+            'must_preserve_identifier': 'fetch_url',
+            'current_feedback_keywords': ['ssrf', 'internal', 'url'],
+            'original_vuln_pattern': 'requests.get(user_url)',
+            'reviewer_feedback': 'Fix addresses external URLs but internal network requests (169.254.x.x, localhost) are still allowed. Add an allowlist or block private IP ranges.',
+            'additional_vulns': [
+                {'vuln_type': 'hardcoded_secret', 'severity': 'medium', 'line': 2},
+            ],
+            'tool_call': '''def fetch_url(user_url):
+    API_KEY = "sk-secret-key-12345"  # hardcoded secret
+    resp = requests.get(user_url, headers={"Authorization": API_KEY})
+    return resp.text''',
+            'task_description': 'Scan this URL fetcher for all vulnerabilities including SSRF and credential exposure.',
+        },
+        {
+            'case_id': 'sec_medium_003',
+            'task_subtype': 'multi',
+            'completion_threshold': 0.75,
+            'max_steps': 6,
+            'done_conditions': {'min_actions': 3, 'required_sequence': ['identify_vulnerability', 'propose_fix', 'revise_fix']},
+            'expected_vuln_type': 'idor',
+            'cvss_range': [5.5, 8.0],
+            'expected_severity': 'high',
+            'required_fix_tokens': ['user_id', 'owner'],
+            'must_preserve_identifier': 'get_document',
+            'current_feedback_keywords': ['idor', 'authorization', 'object'],
+            'original_vuln_pattern': 'doc_id from request',
+            'reviewer_feedback': 'Fix checks user ownership but does not validate the requesting user session. Add session-based auth check.',
+            'additional_vulns': [
+                {'vuln_type': 'rate_limit_missing', 'severity': 'medium', 'line': 1},
+            ],
+            'tool_call': '''@app.get("/documents/{doc_id}")  # no rate limiting
+def get_document(doc_id: int, request: Request):
+    # No check if requesting user owns this document
+    return db.documents.find_one({"_id": doc_id})''',
+            'task_description': 'Scan this document access endpoint for authorization and rate limiting issues.',
+        },
+    ],
+    'sec_hard': [
+        {
+            'case_id': 'sec_hard_001',
+            'task_subtype': 'adversarial',
+            'completion_threshold': 0.70,
+            'max_steps': 8,
+            'done_conditions': {'min_actions': 3, 'required_sequence': ['identify_vulnerability', 'propose_fix', 'revise_fix']},
+            'expected_vuln_type': 'sql_injection',
+            'cvss_range': [8.0, 10.0],
+            'expected_severity': 'critical',
+            'required_fix_tokens': ['?', 'execute', 'param'],
+            'must_preserve_identifier': 'send_email',
+            'current_feedback_keywords': ['still vulnerable', 'sql still present', 'parameterize'],
+            'original_vuln_pattern': "query + user_input",
+            'reviewer_feedback': 'Fix still allows SQL injection through the search parameter. Use parameterized queries.',
+            'reviewer_feedback_sequence': [
+                'Fix still allows SQL injection through the search parameter. Use parameterized queries.',
+                'Better, but concatenation on line 5 still vulnerable. Also add input validation.',
+            ],
+            'tool_call': '''def send_email(recipient, subject, body):
+    query = "INSERT INTO emails (to_addr, subject, body) VALUES ('" + recipient + "', '" + subject + "', '" + body + "')"
+    db.execute(query)
+    log(f"Email sent to {recipient}")
+    return {"status": "sent", "to": recipient}''',
+            'task_description': 'Rewrite this email tool call to remove SQL injection while preserving ALL business logic. A reviewer will evaluate your fix.',
+        },
+        {
+            'case_id': 'sec_hard_002',
+            'task_subtype': 'adversarial',
+            'completion_threshold': 0.70,
+            'max_steps': 8,
+            'done_conditions': {'min_actions': 3, 'required_sequence': ['identify_vulnerability', 'propose_fix', 'revise_fix']},
+            'expected_vuln_type': 'jwt_misuse',
+            'cvss_range': [7.5, 9.5],
+            'expected_severity': 'critical',
+            'required_fix_tokens': ['verify', 'algorithm', 'HS256', 'secret'],
+            'must_preserve_identifier': 'authenticate_user',
+            'current_feedback_keywords': ['jwt', 'algorithm', 'none attack', 'verify signature'],
+            'original_vuln_pattern': 'algorithms=["none"]',
+            'reviewer_feedback': 'Fix does not prevent algorithm=none attack. Explicitly set allowed algorithms.',
+            'reviewer_feedback_sequence': [
+                'Fix does not prevent algorithm=none attack. Explicitly set allowed algorithms.',
+                'Algorithm is set but secret key is still derived from user input. Use server secret.',
+            ],
+            'tool_call': '''def authenticate_user(token):
+    payload = jwt.decode(token, options={"verify_signature": False})
+    user_id = payload.get("user_id")
+    return get_user(user_id)''',
+            'task_description': 'Rewrite this JWT authentication to prevent algorithm confusion attacks while preserving user lookup logic.',
+        },
+    ],
+}

server/debug_panel.html ADDED Viewed

	@@ -0,0 +1,1196 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+<title>OpenEnv Debug Panel — Multi-Agent Ecosystem</title>
+<link rel="preconnect" href="https://fonts.googleapis.com">
+<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700;800&family=JetBrains+Mono:wght@400;600&display=swap" rel="stylesheet">
+<style>
+  *{box-sizing:border-box;margin:0;padding:0}
+  :root{
+    --bg:#0d1017;--surface:#151822;--surface2:#1c2030;--border:#262d40;
+    --blue:#4f8ef7;--green:#22c55e;--amber:#f59e0b;--red:#ef4444;--purple:#a855f7;--cyan:#22d3ee;
+    --text:#e2e8f0;--muted:#6b7a94;--mono:'JetBrains Mono','Fira Code',monospace;
+  }
+  body{background:var(--bg);color:var(--text);font-family:'Inter','Segoe UI',sans-serif;font-size:14px;height:100vh;overflow:hidden}
+  /* ── Header ── */
+  .header{background:linear-gradient(135deg,#131828 0%,#1a2040 100%);border-bottom:1px solid var(--border);padding:12px 20px;display:flex;align-items:center;gap:14px;flex-shrink:0}
+  .header-logo{display:flex;align-items:center;gap:10px}
+  .logo-dot{width:10px;height:10px;border-radius:50%;animation:pulse 2s infinite}
+  .logo-dot.green{background:var(--green);box-shadow:0 0 8px var(--green)}
+  .logo-dot.err{background:var(--red);box-shadow:0 0 8px var(--red)}
+  @keyframes pulse{0%,100%{opacity:1}50%{opacity:.5}}
+  .header h1{font-size:16px;font-weight:700;color:#fff;white-space:nowrap}
+  .badge{padding:3px 10px;border-radius:99px;font-size:10px;font-weight:600;background:#1e3a5f;color:var(--blue);border:1px solid #2563eb33}
+  /* ── Full Layout ── */
+  .layout{display:grid;grid-template-columns:280px 1fr;height:calc(100vh - 50px)}
+  .sidebar{background:var(--surface);border-right:1px solid var(--border);overflow-y:auto;padding:12px;display:flex;flex-direction:column;gap:10px}
+  .main{display:flex;flex-direction:column;overflow:hidden;min-height:0}
+  /* ── Cards ── */
+  .card{background:var(--surface2);border:1px solid var(--border);border-radius:8px;overflow:hidden}
+  .card-hdr{padding:8px 12px;border-bottom:1px solid var(--border);font-size:11px;font-weight:600;color:var(--muted);text-transform:uppercase;letter-spacing:.04em;display:flex;align-items:center;gap:6px;background:var(--surface)}
+  .card-body{padding:10px}
+  /* ── Domain tabs ── */
+  .domain-tabs{display:flex;gap:3px;background:var(--bg);border-radius:6px;padding:3px}
+  .domain-tab{flex:1;padding:6px 0;border:none;border-radius:5px;cursor:pointer;font-size:11px;font-weight:600;color:var(--muted);background:transparent;transition:all .2s}
+  .domain-tab.active{color:#fff}
+  .domain-tab[data-domain="security"].active{background:#1e1a2e;color:var(--purple);box-shadow:0 0 0 1px #a855f744}
+  .domain-tab[data-domain="pytorch"].active{background:#1a2a1a;color:var(--green);box-shadow:0 0 0 1px #22c55e44}
+  .domain-tab[data-domain="clinical"].active{background:#1a2030;color:var(--cyan);box-shadow:0 0 0 1px #22d3ee44}
+  /* ── Task list ── */
+  .task-list{display:flex;flex-direction:column;gap:3px}
+  .task-btn{padding:7px 10px;border:1px solid var(--border);border-radius:6px;background:transparent;color:var(--text);cursor:pointer;text-align:left;display:flex;align-items:center;gap:8px;transition:all .15s;font-size:12px}
+  .task-btn:hover{border-color:var(--blue);background:#1e254033}
+  .task-btn.active{border-color:var(--blue);background:#1e2540;color:#fff}
+  .task-btn .diff{font-size:9px;font-weight:700;padding:2px 7px;border-radius:99px;margin-left:auto}
+  .diff-easy{background:#14532d33;color:var(--green);border:1px solid #22c55e44}
+  .diff-medium{background:#78350f33;color:var(--amber);border:1px solid #f59e0b44}
+  .diff-hard{background:#7f1d1d33;color:var(--red);border:1px solid #ef444444}
+  /* ── Form elements ── */
+  label{display:block;font-size:10px;color:var(--muted);font-weight:600;text-transform:uppercase;letter-spacing:.04em;margin-bottom:4px}
+  input,select,textarea{width:100%;background:var(--bg);border:1px solid var(--border);border-radius:5px;padding:7px 9px;color:var(--text);font-size:12px;font-family:inherit;outline:none;transition:border .15s}
+  input:focus,select:focus,textarea:focus{border-color:var(--blue)}
+  textarea{resize:vertical;font-family:var(--mono);font-size:11px;min-height:60px}
+  .field{margin-bottom:8px}
+  /* ── Buttons ── */
+  .btn{padding:7px 14px;border:none;border-radius:6px;cursor:pointer;font-size:12px;font-weight:600;transition:all .15s;display:inline-flex;align-items:center;gap:5px}
+  .btn-primary{background:var(--blue);color:#fff}
+  .btn-primary:hover{background:#3b7de8}
+  .btn-success{background:#166534;color:var(--green);border:1px solid #22c55e44}
+  .btn-success:hover{background:#14532d}
+  .btn-danger{background:#7f1d1d;color:var(--red);border:1px solid #ef444444}
+  .btn-ghost{background:transparent;color:var(--muted);border:1px solid var(--border);font-size:11px}
+  .btn-ghost:hover{color:var(--text);border-color:var(--text)}
+  .btn:disabled{opacity:.4;cursor:not-allowed}
+  /* ── Top bar ── */
+  .main-topbar{padding:8px 16px;border-bottom:1px solid var(--border);display:flex;align-items:center;gap:10px;flex-wrap:wrap;background:var(--surface);flex-shrink:0}
+  .info-chip{background:var(--bg);border:1px solid var(--border);border-radius:5px;padding:4px 8px;font-size:10px;white-space:nowrap}
+  .info-chip span{color:var(--muted);margin-right:3px}
+  .info-chip strong{color:var(--text)}
+  /* ── Main content: 3 rows ── */
+  .content-area{display:flex;flex-direction:column;flex:1;overflow:hidden;min-height:0}
+  /* Row 1: Observation + Reward (flexible) */
+  .obs-reward-area{display:grid;grid-template-columns:1fr 340px;flex:1;overflow:hidden;min-height:0;border-bottom:1px solid var(--border)}
+  /* Row 2: Action builder (auto height, scrollable) */
+  .action-section{border-bottom:1px solid var(--border);background:var(--surface);padding:10px 16px;max-height:220px;overflow-y:auto;flex-shrink:0}
+  .action-tabs{display:flex;gap:3px;flex-wrap:wrap}
+  .action-tab{padding:4px 10px;border:1px solid var(--border);border-radius:5px;cursor:pointer;font-size:10px;font-weight:600;color:var(--muted);background:transparent}
+  .action-tab.active{border-color:var(--blue);color:var(--blue);background:#1e2540}
+  .action-fields{display:none;grid-template-columns:1fr 1fr;gap:8px}
+  .action-fields.visible{display:grid}
+  .action-fields .full{grid-column:1/-1}
+  /* Row 3: Step log (fixed 160px) */
+  .step-log{background:var(--bg);border-top:1px solid var(--border);overflow-y:auto;padding:8px 12px;font-family:var(--mono);font-size:11px;line-height:1.7;height:160px;flex-shrink:0}
+  .log-line{display:flex;gap:8px;align-items:baseline}
+  .log-time{color:var(--muted);flex-shrink:0;min-width:52px}
+  .log-tag{flex-shrink:0;font-weight:700;min-width:56px}
+  .log-tag.start{color:var(--blue)}
+  .log-tag.step{color:var(--amber)}
+  .log-tag.end{color:var(--green)}
+  .log-tag.error{color:var(--red)}
+  .log-tag.info{color:var(--purple)}
+  .log-msg{color:var(--text);word-break:break-all}
+  /* ── JSON viewer ── */
+  .json-view{background:var(--bg);font-family:var(--mono);font-size:11px;line-height:1.5;overflow-y:auto;padding:12px;white-space:pre-wrap;word-break:break-all;flex:1}
+  .json-key{color:#93c5fd}
+  .json-str{color:#86efac}
+  .json-num{color:#fbbf24}
+  .json-bool{color:#f87171}
+  .json-null{color:var(--muted)}
+  /* ── Reward ── */
+  .reward-section{padding:12px;overflow-y:auto;background:var(--surface)}
+  .reward-display{text-align:center;padding:10px 0}
+  .reward-number{font-size:42px;font-weight:800;font-family:var(--mono);line-height:1}
+  .reward-bar-wrap{margin:8px 0;height:8px;background:var(--border);border-radius:99px;overflow:hidden}
+  .reward-bar{height:100%;border-radius:99px;transition:width .5s ease;background:linear-gradient(90deg,var(--green),#84cc16)}
+  .reward-label{font-size:10px;color:var(--muted)}
+  .breakdown-item{display:flex;justify-content:space-between;align-items:center;padding:4px 0;border-bottom:1px solid var(--border);font-size:11px}
+  .breakdown-item:last-child{border:none}
+  .breakdown-val.pos{color:var(--green)}
+  .breakdown-val.neg{color:var(--red)}
+  /* ── Task meta ── */
+  .task-meta{background:var(--bg);border:1px solid var(--border);border-radius:6px;padding:8px 10px;font-size:11px;line-height:1.6;color:var(--muted)}
+  .task-meta strong{color:var(--text);display:block;margin-bottom:3px;font-size:12px}
+  /* ── Inference runner ── */
+  .inference-panel{background:var(--surface2);border:1px solid var(--border);border-radius:8px;padding:10px;margin-top:4px}
+  .inference-progress{display:flex;gap:4px;flex-wrap:wrap;margin:6px 0}
+  .task-chip{padding:2px 6px;border-radius:4px;font-size:9px;font-weight:700;border:1px solid var(--border);color:var(--muted)}
+  .task-chip.running{border-color:var(--amber);color:var(--amber);animation:pulse 1s infinite}
+  .task-chip.done{border-color:var(--green);color:var(--green)}
+  .task-chip.fail{border-color:var(--red);color:var(--red)}
+  /* ── Status indicator ── */
+  .status-dot{width:8px;height:8px;border-radius:50%;display:inline-block;flex-shrink:0}
+  /* ── Responsive ── */
+  @media(max-width:900px){
+    .layout{grid-template-columns:1fr;grid-template-rows:auto 1fr}
+    .sidebar{border-right:none;border-bottom:1px solid var(--border);max-height:260px;flex-direction:row;flex-wrap:wrap;overflow-x:auto}
+    .obs-reward-area{grid-template-columns:1fr}
+  }
+  /* ── Page Navigation ── */
+  .page-tabs{display:flex;gap:2px;background:var(--bg);border-radius:6px;padding:2px;margin-left:16px}
+  .page-tab{padding:5px 14px;border:none;border-radius:5px;cursor:pointer;font-size:11px;font-weight:600;color:var(--muted);background:transparent;transition:all .2s}
+  .page-tab.active{color:#fff;background:var(--blue);box-shadow:0 0 12px #4f8ef733}
+  .page-tab:hover:not(.active){color:var(--text);background:var(--surface2)}
+  .page{display:none;height:calc(100vh - 50px);overflow:hidden}
+  .page.visible{display:flex;flex-direction:column}
+  /* ── Benchmark Page ── */
+  .bench-layout{display:grid;grid-template-columns:360px 1fr;height:100%;overflow:hidden}
+  .bench-sidebar{background:var(--surface);border-right:1px solid var(--border);padding:16px;overflow-y:auto}
+  .bench-main{display:flex;flex-direction:column;overflow:hidden}
+  .bench-card{background:var(--surface2);border:1px solid var(--border);border-radius:10px;overflow:hidden;margin-bottom:12px}
+  .bench-card-hdr{padding:10px 14px;border-bottom:1px solid var(--border);font-size:12px;font-weight:700;color:var(--text);display:flex;align-items:center;gap:8px;background:linear-gradient(135deg,var(--surface) 0%,var(--surface2) 100%)}
+  .bench-card-body{padding:12px}
+  .preset-row{display:flex;gap:4px;flex-wrap:wrap;margin-bottom:10px}
+  .preset-btn{padding:4px 10px;border:1px solid var(--border);border-radius:5px;cursor:pointer;font-size:10px;font-weight:600;color:var(--muted);background:transparent;transition:all .15s}
+  .preset-btn:hover{border-color:var(--blue);color:var(--blue)}
+  .preset-btn.active{border-color:var(--blue);background:#1e2540;color:var(--blue)}
+  .bench-field{margin-bottom:10px}
+  .bench-field label{font-size:10px;color:var(--muted);font-weight:600;text-transform:uppercase;letter-spacing:.04em;margin-bottom:4px;display:block}
+  .bench-field input,.bench-field select{width:100%;background:var(--bg);border:1px solid var(--border);border-radius:6px;padding:8px 10px;color:var(--text);font-size:12px;font-family:inherit;outline:none;transition:border .15s}
+  .bench-field input:focus{border-color:var(--blue)}
+  .bench-field input[type=password]{font-family:var(--mono);letter-spacing:2px}
+  .run-btn{width:100%;padding:10px;border:none;border-radius:8px;cursor:pointer;font-size:13px;font-weight:700;color:#fff;background:linear-gradient(135deg,#4f8ef7 0%,#a855f7 100%);transition:all .2s;display:flex;align-items:center;justify-content:center;gap:8px}
+  .run-btn:hover{transform:translateY(-1px);box-shadow:0 4px 20px #4f8ef744}
+  .run-btn:disabled{opacity:.5;cursor:not-allowed;transform:none;box-shadow:none}
+  .run-btn.running{background:linear-gradient(135deg,#f59e0b 0%,#ef4444 100%);animation:pulse 1.5s infinite}
+  /* ── Results Table ── */
+  .results-area{flex:1;overflow-y:auto;padding:16px;background:var(--bg)}
+  .results-table{width:100%;border-collapse:collapse;font-size:12px}
+  .results-table th{padding:8px 10px;text-align:left;font-size:10px;font-weight:700;color:var(--muted);text-transform:uppercase;letter-spacing:.04em;border-bottom:2px solid var(--border);position:sticky;top:0;background:var(--bg);z-index:1}
+  .results-table td{padding:6px 10px;border-bottom:1px solid var(--border)}
+  .results-table tr:hover{background:var(--surface2)}
+  .score-cell{font-family:var(--mono);font-weight:700;font-size:12px}
+  .score-high{color:var(--green)}
+  .score-mid{color:var(--amber)}
+  .score-low{color:var(--red)}
+  .avg-cell{font-size:14px;font-weight:800}
+  /* ── Bar Chart ── */
+  .chart-container{padding:16px;border-top:1px solid var(--border);background:var(--surface);flex-shrink:0;max-height:280px;overflow-y:auto}
+  .chart-bar-row{display:flex;align-items:center;gap:8px;margin-bottom:6px}
+  .chart-label{width:120px;font-size:11px;font-weight:600;color:var(--text);text-align:right;flex-shrink:0;white-space:nowrap;overflow:hidden;text-overflow:ellipsis}
+  .chart-bar-bg{flex:1;height:22px;background:var(--bg);border-radius:4px;overflow:hidden;border:1px solid var(--border)}
+  .chart-bar-fill{height:100%;border-radius:3px;transition:width .8s ease;display:flex;align-items:center;padding:0 6px;font-size:10px;font-weight:700;color:#fff;white-space:nowrap;min-width:0}
+  /* ── Benchmark Log ── */
+  .bench-log{background:var(--bg);border-top:1px solid var(--border);height:200px;overflow-y:auto;padding:8px 12px;font-family:var(--mono);font-size:11px;line-height:1.6;flex-shrink:0}
+  .bench-log .log-warn{color:var(--amber)}
+  .bench-log .log-err{color:var(--red)}
+  .bench-log .log-ok{color:var(--green)}
+  .bench-log .log-info{color:var(--blue)}
+  /* ── Empty State ── */
+  .empty-state{display:flex;flex-direction:column;align-items:center;justify-content:center;height:100%;color:var(--muted);gap:12px}
+  .empty-state .icon{font-size:48px;opacity:.3}
+  .empty-state p{font-size:13px;text-align:center;max-width:260px;line-height:1.5}
+</style>
+</head>
+<body>
+<!-- ── HEADER ── -->
+<div class="header">
+  <div class="header-logo">
+    <div class="logo-dot green" id="status-dot"></div>
+    <h1>OpenEnv Debug Panel</h1>
+    <span class="badge">Multi-Agent Ecosystem</span>
+  </div>
+  <div style="display:flex;gap:8px;margin-left:auto;align-items:center">
+    <div class="page-tabs">
+      <button class="page-tab active" onclick="switchPage('debug')" id="ptab-debug">🔧 Debug</button>
+      <button class="page-tab" onclick="switchPage('benchmark')" id="ptab-benchmark">📊 Benchmark</button>
+    </div>
+    <span class="badge" style="background:#1a2a1a;color:var(--green);border-color:#22c55e33">Security · PyTorch · Clinical</span>
+    <span id="server-status" style="font-size:10px;color:var(--muted)">Checking...</span>
+  </div>
+</div>
+<!-- ══ PAGE: DEBUG ══ -->
+<div class="page visible" id="page-debug">
+<!-- ── LAYOUT ── -->
+<div class="layout">
+  <!-- SIDEBAR -->
+  <div class="sidebar">
+    <!-- Domain Selector -->
+    <div class="card">
+      <div class="card-hdr">🎯 Domain</div>
+      <div class="card-body" style="padding:6px">
+        <div class="domain-tabs">
+          <button class="domain-tab active" data-domain="security" onclick="switchDomain('security')">Security</button>
+          <button class="domain-tab" data-domain="pytorch" onclick="switchDomain('pytorch')">PyTorch</button>
+          <button class="domain-tab" data-domain="clinical" onclick="switchDomain('clinical')">Clinical</button>
+        </div>
+      </div>
+    </div>
+    <!-- Task Selector -->
+    <div class="card">
+      <div class="card-hdr">📋 Tasks</div>
+      <div class="card-body" style="padding:6px">
+        <div class="task-list" id="task-list"></div>
+      </div>
+    </div>
+    <!-- Task Info -->
+    <div class="card">
+      <div class="card-hdr">ℹ️ Task Info</div>
+      <div class="card-body">
+        <div class="task-meta" id="task-meta">Select a task to see details.</div>
+      </div>
+    </div>
+    <!-- Run Full Inference -->
+    <div class="inference-panel">
+      <div style="font-size:11px;font-weight:700;color:var(--text);margin-bottom:6px">⚡ Full Inference Run</div>
+      <div style="font-size:10px;color:var(--muted);margin-bottom:8px">Runs all 9 tasks via /inference endpoint.</div>
+      <button class="btn btn-success" style="width:100%;font-size:11px" onclick="runFullInference()" id="inf-btn">▶ Run All 9 Tasks</button>
+      <div class="inference-progress" id="inf-progress" style="display:none"></div>
+      <div id="inf-scores" style="margin-top:6px;font-family:var(--mono);font-size:10px"></div>
+    </div>
+  </div>
+  <!-- MAIN PANEL -->
+  <div class="main">
+    <!-- Top bar -->
+    <div class="main-topbar">
+      <div style="display:flex;gap:8px;flex:1;flex-wrap:wrap">
+        <div class="info-chip"><span>Task:</span><strong id="chip-task">—</strong></div>
+        <div class="info-chip"><span>Episode:</span><strong id="chip-episode" style="font-family:var(--mono);font-size:9px">—</strong></div>
+        <div class="info-chip"><span>Step:</span><strong id="chip-step">0</strong></div>
+        <div class="info-chip"><span>Reward:</span><strong id="chip-reward" style="color:var(--green)">0.0000</strong></div>
+        <div class="info-chip"><span>Done:</span><strong id="chip-done">—</strong></div>
+      </div>
+      <div style="display:flex;gap:6px">
+        <button class="btn btn-primary" onclick="doReset()" id="btn-reset">⟳ Reset</button>
+        <button class="btn btn-success" onclick="doStep()" id="btn-step" disabled>▶ Step</button>
+        <button class="btn btn-ghost" onclick="clearLog()">🗑 Clear</button>
+      </div>
+    </div>
+    <!-- Content area: 3 flex rows -->
+    <div class="content-area">
+      <!-- ROW 1: Observation + Reward -->
+      <div class="obs-reward-area">
+        <!-- Observation -->
+        <div style="display:flex;flex-direction:column;overflow:hidden;border-right:1px solid var(--border)">
+          <div class="card-hdr">📥 Observation</div>
+          <div class="json-view" id="obs-view">
+            <span style="color:var(--muted)">Press Reset to load the first observation...</span>
+          </div>
+        </div>
+        <!-- Reward -->
+        <div style="display:flex;flex-direction:column;overflow:hidden">
+          <div class="card-hdr">🏆 Reward</div>
+          <div class="reward-section">
+            <div class="reward-display">
+              <div class="reward-number" id="reward-num" style="color:var(--muted)">—</div>
+              <div class="reward-bar-wrap"><div class="reward-bar" id="reward-bar" style="width:0%"></div></div>
+              <div class="reward-label" id="reward-label">No reward yet</div>
+            </div>
+            <div id="reward-breakdown"></div>
+            <div id="step-result-raw" style="margin-top:6px"></div>
+          </div>
+        </div>
+      </div>
+      <!-- ROW 2: Action builder -->
+      <div class="action-section">
+        <div style="display:flex;align-items:center;gap:8px;margin-bottom:8px">
+          <div style="font-size:11px;font-weight:700;color:var(--text)">⚡ Build Action</div>
+          <div class="action-tabs" id="action-tabs"></div>
+          <button class="btn btn-ghost" style="margin-left:auto" onclick="toggleRawJson()">{ } Raw JSON</button>
+        </div>
+        <div id="action-fields-container"></div>
+        <div id="raw-json-area" style="display:none">
+          <div class="field">
+            <label>Raw JSON Action</label>
+            <textarea id="raw-action" rows="3" placeholder='{"action_type":"identify_vulnerability","vuln_type":"sql_injection","cvss_score":7.5,"severity":"high"}'></textarea>
+          </div>
+        </div>
+      </div>
+    </div>
+    <!-- ROW 3: Step log (outside content-area, fixed height) -->
+    <div class="step-log" id="step-log">
+      <div class="log-line"><span class="log-tag info">INFO</span><span class="log-msg">Debug panel ready. Select a task and press Reset to start.</span></div>
+    </div>
+  </div>
+</div>
+<script>
+// ═══════════════════════════════════════════════
+// DATA
+// ═══════════════════════════════════════════════
+const TASKS = {
+  security: [
+    { id:'sec_easy',   label:'Injection Detection',    diff:'easy',   desc:'Identify whether a tool-call has a vulnerability. Return vuln_type, cvss_score, severity.', actions:['identify_vulnerability'] },
+    { id:'sec_medium', label:'Multi-Vuln Scan',        diff:'medium', desc:'Scan a code module for multiple vulnerabilities, then propose fixes.', actions:['identify_vulnerability','propose_fix'] },
+    { id:'sec_hard',   label:'Auto-Sanitize + Review', diff:'hard',   desc:'Identify, fix, and revise code based on reviewer feedback. Multi-turn.', actions:['identify_vulnerability','propose_fix','revise_fix'] },
+  ],
+  pytorch: [
+    { id:'dep_easy',   label:'Deprecation Mapper',     diff:'easy',   desc:'Detect deprecated PyTorch 1.x APIs and flag with replacements.', actions:['flag_outdated'] },
+    { id:'dep_medium', label:'Dependency Resolver',    diff:'medium', desc:'Resolve version conflicts using a compatibility matrix.', actions:['resolve_conflict'] },
+    { id:'dep_hard',   label:'Graph-Break Hunter',     diff:'hard',   desc:'Find and fix torch.compile breaking patterns.', actions:['migrate_api'] },
+  ],
+  clinical: [
+    { id:'cli_easy',   label:'Gap Detection',          diff:'easy',   desc:'Identify missing mandatory steps before a procedure.', actions:['detect_gap'] },
+    { id:'cli_medium', label:'Priority Recovery',      diff:'medium', desc:'Detect gaps then rank clinical issues by urgency.', actions:['detect_gap','rank_issues'] },
+    { id:'cli_hard',   label:'Full Re-plan',           diff:'hard',   desc:'Detect, rank, and reorder recovery steps respecting dependencies.', actions:['detect_gap','rank_issues','order_steps'] },
+  ]
+};
+const ACTION_SCHEMAS = {
+  identify_vulnerability: {
+    label: 'Identify Vuln',
+    fields: [
+      { key:'vuln_type', label:'Vulnerability Type', type:'select', options:['sql_injection','xss','idor','hardcoded_secret','missing_auth','jwt_misuse','path_traversal','ssrf','rate_limit_missing','xxe'] },
+      { key:'cvss_score', label:'CVSS Score (0–10)', type:'number', placeholder:'7.5', min:0, max:10, step:0.1 },
+      { key:'severity', label:'Severity', type:'select', options:['critical','high','medium','low','info'] },
+      { key:'affected_line', label:'Affected Line', type:'number', placeholder:'3' },
+    ]
+  },
+  propose_fix: {
+    label: 'Propose Fix',
+    fields: [
+      { key:'fix_code', label:'Fixed Code', type:'textarea', placeholder:'db.execute(sql, (param,))', full:true },
+      { key:'explanation', label:'Explanation', type:'textarea', placeholder:'Use parameterized queries', full:true },
+    ]
+  },
+  revise_fix: {
+    label: 'Revise Fix',
+    fields: [
+      { key:'fix_code', label:'Revised Code', type:'textarea', placeholder:'Complete corrected code', full:true },
+      { key:'addressed_feedback', label:'Addressed Feedback', type:'textarea', placeholder:'Paste reviewer_feedback here', full:true },
+    ]
+  },
+  flag_outdated: {
+    label: 'Flag Outdated',
+    fields: [
+      { key:'packages_json', label:'Outdated Packages (JSON)', type:'textarea', placeholder:'{"torch": "1.9.0", "numpy": "1.21.0"}', full:true },
+      { key:'deprecated_api', label:'Deprecated API', type:'text', placeholder:'torch.autograd.Variable' },
+      { key:'replacement', label:'Replacement', type:'text', placeholder:'plain tensor' },
+    ]
+  },
+  resolve_conflict: {
+    label: 'Resolve Conflict',
+    fields: [
+      { key:'packages_json', label:'Resolved Packages (JSON)', type:'textarea', placeholder:'{"torch":"2.1.0","numpy":"1.24.3"}', full:true },
+      { key:'reasoning', label:'Reasoning', type:'textarea', placeholder:'torch 2.1 requires numpy>=1.24', full:true },
+    ]
+  },
+  migrate_api: {
+    label: 'Migrate API',
+    fields: [
+      { key:'completed_items_json', label:'Completed Break IDs (JSON)', type:'textarea', placeholder:'["break_001"]', full:true },
+      { key:'code_changes_json', label:'Code Changes (JSON)', type:'textarea', placeholder:'{"break_001":"use torch.where"}', full:true },
+    ]
+  },
+  detect_gap: {
+    label: 'Detect Gap',
+    fields: [
+      { key:'missing_steps_json', label:'Missing Steps (JSON array)', type:'textarea', placeholder:'["pre_op_consent","blood_test"]', full:true },
+      { key:'risk_level', label:'Risk Level', type:'select', options:['critical','high','medium','low'] },
+    ]
+  },
+  rank_issues: {
+    label: 'Rank Issues',
+    fields: [
+      { key:'priority_order_json', label:'Priority Order (highest first)', type:'textarea', placeholder:'["blood_test","pre_op_consent"]', full:true },
+    ]
+  },
+  order_steps: {
+    label: 'Order Steps',
+    fields: [
+      { key:'recovery_steps_json', label:'Recovery Steps (ordered)', type:'textarea', placeholder:'["specialist","alt_treatment","post_op"]', full:true },
+    ]
+  }
+};
+// ═══════════════════════════════════════════════
+// STATE
+// ═══════════════════════════════════════════════
+let state = {
+  domain: 'security',
+  task: TASKS.security[0],
+  episodeId: null,
+  step: 0,
+  totalReward: 0,
+  done: false,
+  currentAction: 'identify_vulnerability',
+  rawMode: false
+};
+// ═══════════════════════════════════════════════
+// INIT
+// ═══════════════════════════════════════════════
+function init() {
+  renderTaskList();
+  selectTask(state.task);
+  checkServerHealth();
+  setInterval(checkServerHealth, 15000);
+}
+// ═══════════════════════════════════════════════
+// DOMAIN / TASK
+// ═══════════════════════════════════════════════
+function switchDomain(domain) {
+  state.domain = domain;
+  state.task = TASKS[domain][0];
+  document.querySelectorAll('.domain-tab').forEach(t => t.classList.toggle('active', t.dataset.domain === domain));
+  renderTaskList();
+  selectTask(state.task);
+}
+function renderTaskList() {
+  const list = document.getElementById('task-list');
+  list.innerHTML = '';
+  TASKS[state.domain].forEach(task => {
+    const btn = document.createElement('button');
+    btn.className = 'task-btn' + (task.id === state.task.id ? ' active' : '');
+    btn.innerHTML = `<span>${task.label}</span><span class="diff diff-${task.diff}">${task.diff.toUpperCase()}</span>`;
+    btn.onclick = () => selectTask(task);
+    list.appendChild(btn);
+  });
+}
+function selectTask(task) {
+  state.task = task;
+  state.episodeId = null;
+  state.step = 0;
+  state.totalReward = 0;
+  state.done = false;
+  document.querySelectorAll('.task-btn').forEach(b => b.classList.toggle('active', b.querySelector('span').textContent === task.label));
+  document.getElementById('task-meta').innerHTML = `<strong>${task.label} (${task.id})</strong>${task.desc}<br><br><span style="color:var(--blue)">Actions:</span> ${task.actions.join(' → ')}`;
+  document.getElementById('chip-task').textContent = task.id;
+  document.getElementById('chip-episode').textContent = '—';
+  document.getElementById('chip-step').textContent = '0';
+  document.getElementById('chip-reward').textContent = '0.0000';
+  document.getElementById('chip-done').textContent = '—';
+  document.getElementById('obs-view').innerHTML = '<span style="color:var(--muted)">Press Reset to start this task...</span>';
+  document.getElementById('reward-num').textContent = '—';
+  document.getElementById('reward-num').style.color = 'var(--muted)';
+  document.getElementById('reward-bar').style.width = '0%';
+  document.getElementById('reward-label').textContent = 'No reward yet';
+  document.getElementById('reward-breakdown').innerHTML = '';
+  document.getElementById('step-result-raw').innerHTML = '';
+  document.getElementById('btn-step').disabled = true;
+  document.getElementById('btn-step').textContent = '▶ Step';
+  state.currentAction = task.actions[0];
+  renderActionTabs();
+  renderActionFields();
+  log('info', `Selected: ${task.id} | ${task.label}`);
+}
+// ═══════════════════════════════════════════════
+// ACTION BUILDER
+// ═══════════════════════════════════════════════
+// Pre-built examples for each action type (shown when fields are empty)
+const ACTION_EXAMPLES = {
+  identify_vulnerability: {
+    action_type: 'identify_vulnerability',
+    vuln_type: 'sql_injection',
+    cvss_score: 8.5,
+    severity: 'critical',
+  },
+  propose_fix: {
+    action_type: 'propose_fix',
+    fix_code: 'db.execute("SELECT * FROM users WHERE name = ?", (user_input,))',
+    explanation: 'Use parameterized query to prevent SQL injection',
+  },
+  revise_fix: {
+    action_type: 'revise_fix',
+    fix_code: 'db.execute("SELECT * FROM users WHERE name = ?", (sanitize(user_input),))',
+    addressed_feedback: 'Added input validation on top of parameterized query',
+  },
+  flag_outdated: {
+    action_type: 'flag_outdated',
+    packages: { torch: '1.9.0' },
+    deprecated_api: 'torch.autograd.Variable',
+    replacement: 'plain tensor (remove Variable wrapper)',
+  },
+  resolve_conflict: {
+    action_type: 'resolve_conflict',
+    packages: { torch: '2.1.0', numpy: '1.24.0' },
+    reasoning: 'torch 2.1 requires numpy>=1.24 per compatibility matrix',
+  },
+  migrate_api: {
+    action_type: 'migrate_api',
+    completed_items: ['break_001', 'break_002', 'break_003'],
+    code_changes: {
+      break_001: 'use torch.where instead of if x.item()',
+      break_002: 'use tensor.shape[0] instead of len(x)',
+      break_003: 'use x.detach().numpy() outside compiled fn',
+    },
+  },
+  detect_gap: {
+    action_type: 'detect_gap',
+    missing_steps: ['pre_op_consent', 'blood_work'],
+    risk_level: 'critical',
+  },
+  rank_issues: {
+    action_type: 'rank_issues',
+    priority_order: ['resolve_insurance', 'pre_op_consent', 'book_specialist'],
+  },
+  order_steps: {
+    action_type: 'order_steps',
+    recovery_steps: ['resolve_insurance', 'book_specialist', 'complete_pre_op', 'schedule_surgery'],
+  },
+};
+function renderActionTabs() {
+  const tabs = document.getElementById('action-tabs');
+  tabs.innerHTML = '';
+  state.task.actions.forEach(a => {
+    const t = document.createElement('button');
+    t.className = 'action-tab' + (a === state.currentAction ? ' active' : '');
+    t.textContent = ACTION_SCHEMAS[a]?.label || a;
+    t.onclick = () => { state.currentAction = a; renderActionTabs(); renderActionFields(); syncRawJson(); };
+    tabs.appendChild(t);
+  });
+}
+function renderActionFields() {
+  const container = document.getElementById('action-fields-container');
+  const schema = ACTION_SCHEMAS[state.currentAction];
+  if (!schema) { container.innerHTML = '<div style="color:var(--muted);font-size:11px">No schema.</div>'; return; }
+  container.innerHTML = '';
+  const grid = document.createElement('div');
+  grid.className = 'action-fields visible';
+  schema.fields.forEach(f => {
+    const wrap = document.createElement('div');
+    wrap.className = 'field' + (f.full ? ' full' : '');
+    const lbl = document.createElement('label');
+    lbl.textContent = f.label;
+    wrap.appendChild(lbl);
+    let el;
+    if (f.type === 'select') {
+      el = document.createElement('select');
+      el.id = 'af-' + f.key;
+      f.options.forEach(o => { const op = document.createElement('option'); op.value = op.textContent = o; el.appendChild(op); });
+      el.addEventListener('change', syncRawJson);
+    } else if (f.type === 'textarea') {
+      el = document.createElement('textarea');
+      el.id = 'af-' + f.key;
+      el.placeholder = f.placeholder || '';
+      el.rows = 2;
+      el.addEventListener('input', syncRawJson);
+    } else {
+      el = document.createElement('input');
+      el.type = f.type || 'text';
+      el.id = 'af-' + f.key;
+      el.placeholder = f.placeholder || '';
+      if (f.min !== undefined) el.min = f.min;
+      if (f.max !== undefined) el.max = f.max;
+      if (f.step !== undefined) el.step = f.step;
+      el.addEventListener('input', syncRawJson);
+    }
+    wrap.appendChild(el);
+    grid.appendChild(wrap);
+  });
+  container.appendChild(grid);
+  // Set initial raw JSON
+  syncRawJson();
+}
+function buildAction() {
+  if (state.rawMode) {
+    try { return JSON.parse(document.getElementById('raw-action').value); }
+    catch(e) { log('error', 'Invalid JSON: ' + e.message); return null; }
+  }
+  return _buildActionFromFields();
+}
+function _buildActionFromFields() {
+  const schema = ACTION_SCHEMAS[state.currentAction];
+  const action = { action_type: state.currentAction };
+  schema.fields.forEach(f => {
+    const el = document.getElementById('af-' + f.key);
+    if (!el) return;
+    let val = el.value.trim();
+    if (!val) return;
+    if (f.key.endsWith('_json')) {
+      try { action[f.key.replace('_json','')] = JSON.parse(val); }
+      catch(e) { action[f.key.replace('_json','')] = val; }
+    } else if (f.type === 'number') {
+      action[f.key] = parseFloat(val);
+    } else {
+      action[f.key] = val;
+    }
+  });
+  return action;
+}
+function syncRawJson() {
+  const action = _buildActionFromFields();
+  // If form is mostly empty, show the example instead
+  const fieldCount = Object.keys(action).length;
+  const display = fieldCount <= 1 ? ACTION_EXAMPLES[state.currentAction] || action : action;
+  document.getElementById('raw-action').value = JSON.stringify(display, null, 2);
+}
+function toggleRawJson() {
+  state.rawMode = !state.rawMode;
+  document.getElementById('raw-json-area').style.display = state.rawMode ? 'block' : 'none';
+  document.getElementById('action-fields-container').style.display = state.rawMode ? 'none' : 'block';
+  if (state.rawMode) syncRawJson();
+}
+// ═══════════════════════════════════════════════
+// API CALLS
+// ═══════════════════════════════════════════════
+async function doReset() {
+  const btn = document.getElementById('btn-reset');
+  btn.disabled = true; btn.textContent = '⟳ Resetting...';
+  try {
+    log('start', `[START] task_id=${state.task.id}`);
+    const res = await fetch('/reset', {
+      method:'POST', headers:{'Content-Type':'application/json'},
+      body: JSON.stringify({ task_id: state.task.id })
+    });
+    const data = await res.json();
+    if (data.error) throw new Error(data.error);
+    state.episodeId = data.episode_id;
+    state.step = 0; state.totalReward = 0; state.done = false;
+    document.getElementById('chip-episode').textContent = (state.episodeId||'').slice(0,8)+'…';
+    document.getElementById('chip-step').textContent = '0';
+    document.getElementById('chip-reward').textContent = '0.0000';
+    document.getElementById('chip-done').textContent = 'false';
+    renderObs(data.observation || data);
+    document.getElementById('btn-step').disabled = false;
+    document.getElementById('btn-step').textContent = '▶ Step';
+    log('info', `Episode: ${state.episodeId}`);
+  } catch(e) {
+    log('error', 'Reset failed: ' + e.message);
+  } finally {
+    btn.disabled = false; btn.textContent = '⟳ Reset';
+  }
+}
+async function doStep() {
+  if (!state.episodeId) { log('error', 'No episode. Press Reset first.'); return; }
+  if (state.done) { log('info', 'Done. Press Reset for new episode.'); return; }
+  const action = buildAction();
+  if (!action) return;
+  action.episode_id = state.episodeId;
+  const btn = document.getElementById('btn-step');
+  btn.disabled = true; btn.textContent = '▶ Stepping...';
+  try {
+    const res = await fetch('/step', {
+      method:'POST', headers:{'Content-Type':'application/json'},
+      body: JSON.stringify(action)
+    });
+    const data = await res.json();
+    const reward = typeof data.reward === 'number' ? data.reward : 0;
+    const done = data.done === true || data.done === 'True';
+    state.step++; state.totalReward += reward; state.done = done;
+    document.getElementById('chip-step').textContent = state.step;
+    document.getElementById('chip-reward').textContent = state.totalReward.toFixed(4);
+    document.getElementById('chip-done').textContent = String(done);
+    document.getElementById('chip-done').style.color = done ? 'var(--green)' : 'var(--muted)';
+    renderObs(data.observation || data);
+    renderReward(reward, data);
+    // Auto-switch to next expected action if provided
+    const nextAction = (data.observation || {}).next_expected_action;
+    if (nextAction && ACTION_SCHEMAS[nextAction] && state.task.actions.includes(nextAction)) {
+      state.currentAction = nextAction;
+      renderActionTabs();
+      renderActionFields();
+    }
+    log('step', `[STEP] step=${state.step} action=${action.action_type} reward=${reward.toFixed(4)} done=${done}`);
+    if (done) {
+      log('end', `[END] task_id=${state.task.id} total_reward=${state.totalReward.toFixed(4)} steps=${state.step}`);
+      btn.disabled = true; btn.textContent = '✓ Done';
+    }
+  } catch(e) {
+    log('error', 'Step failed: ' + e.message);
+  } finally {
+    if (!state.done) { btn.disabled = false; btn.textContent = '▶ Step'; }
+  }
+}
+// ═══════════════════════════════════════════════
+// RENDER
+// ═══════════════════════════════════════════════
+function renderObs(obs) {
+  document.getElementById('obs-view').innerHTML = syntaxHighlight(JSON.stringify(obs, null, 2));
+}
+function renderReward(reward, data) {
+  const r = Math.max(0, Math.min(1, reward));
+  const color = r >= 0.7 ? 'var(--green)' : r >= 0.4 ? 'var(--amber)' : 'var(--red)';
+  document.getElementById('reward-num').textContent = reward.toFixed(4);
+  document.getElementById('reward-num').style.color = color;
+  document.getElementById('reward-bar').style.width = (r*100)+'%';
+  document.getElementById('reward-bar').style.background = r >= 0.7 ? 'linear-gradient(90deg,#16a34a,#22c55e)' : r >= 0.4 ? 'linear-gradient(90deg,#b45309,#f59e0b)' : 'linear-gradient(90deg,#991b1b,#ef4444)';
+  document.getElementById('reward-label').textContent = r >= 0.7 ? '✓ Good' : r >= 0.4 ? '⚠ Partial' : r > 0 ? '✗ Low' : '✗ Zero';
+  const bd = document.getElementById('reward-breakdown');
+  const breakdown = data.reward_breakdown || data.breakdown || null;
+  if (breakdown && typeof breakdown === 'object') {
+    bd.innerHTML = '<div style="font-size:10px;font-weight:700;color:var(--muted);text-transform:uppercase;margin:8px 0 4px">Breakdown</div>';
+    Object.entries(breakdown).forEach(([k,v]) => {
+      const pos = v >= 0;
+      bd.innerHTML += `<div class="breakdown-item"><span>${k.replace(/_/g,' ')}</span><span class="breakdown-val ${pos?'pos':'neg'}">${pos?'+':''}${typeof v==='number'?v.toFixed(4):v}</span></div>`;
+    });
+  } else bd.innerHTML = '';
+  const raw = document.getElementById('step-result-raw');
+  const filtered = {...data}; delete filtered.observation;
+  raw.innerHTML = '<div style="font-size:10px;color:var(--muted);margin-top:6px;font-family:var(--mono);white-space:pre-wrap;max-height:120px;overflow-y:auto">' + syntaxHighlight(JSON.stringify(filtered, null, 2)) + '</div>';
+}
+function syntaxHighlight(json) {
+  return json
+    .replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;')
+    .replace(/("(\\u[a-zA-Z0-9]{4}|\\[^u]|[^\\"])*"(\s*:)?|\b(true|false|null)\b|-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?)/g, m => {
+      let cls = 'json-num';
+      if (/^"/.test(m)) cls = /:$/.test(m) ? 'json-key' : 'json-str';
+      else if (/true|false/.test(m)) cls = 'json-bool';
+      else if (/null/.test(m)) cls = 'json-null';
+      return `<span class="${cls}">${m}</span>`;
+    });
+}
+// ═══════════════════════════════════════════════
+// LOG
+// ═══════════════════════════════════════════════
+function log(type, msg) {
+  const logEl = document.getElementById('step-log');
+  const line = document.createElement('div');
+  line.className = 'log-line';
+  const now = new Date();
+  const t = `${String(now.getHours()).padStart(2,'0')}:${String(now.getMinutes()).padStart(2,'0')}:${String(now.getSeconds()).padStart(2,'0')}`;
+  const tagMap = {start:'START',step:'STEP',end:'END',error:'ERROR',info:'INFO'};
+  line.innerHTML = `<span class="log-time">${t}</span><span class="log-tag ${type}">[${tagMap[type]||type.toUpperCase()}]</span><span class="log-msg">${msg}</span>`;
+  logEl.appendChild(line);
+  logEl.scrollTop = logEl.scrollHeight;
+}
+function clearLog() {
+  document.getElementById('step-log').innerHTML = '';
+  log('info', 'Log cleared.');
+}
+// ═══════════════════════════════════════════════
+// FULL INFERENCE
+// ═══════════════════════════════════════════════
+async function runFullInference() {
+  const btn = document.getElementById('inf-btn');
+  btn.disabled = true; btn.textContent = '⏳ Running...';
+  const prog = document.getElementById('inf-progress');
+  const scores = document.getElementById('inf-scores');
+  prog.style.display = 'flex'; prog.innerHTML = '';
+  scores.innerHTML = '';
+  const allTasks = ['sec_easy','sec_medium','sec_hard','dep_easy','dep_medium','dep_hard','cli_easy','cli_medium','cli_hard'];
+  allTasks.forEach(t => { prog.innerHTML += `<span class="task-chip" id="chip-inf-${t}">${t}</span>`; });
+  log('info', 'Starting full inference via /inference...');
+  try {
+    const res = await fetch('/inference', { method:'POST', headers:{'Content-Type':'application/json'}, body:'{}' });
+    const data = await res.json();
+    if (data.error) { log('error', 'Inference error: ' + data.error); return; }
+    const final = data.final_scores || {};
+    allTasks.forEach(t => {
+      const chip = document.getElementById('chip-inf-'+t);
+      const sc = final[t];
+      if (sc !== undefined) {
+        chip.classList.add(sc > 0.3 ? 'done' : 'fail');
+        chip.textContent = `${t}: ${typeof sc==='number'?sc.toFixed(3):sc}`;
+      } else chip.classList.add('fail');
+    });
+    const avg = data.average_score || 0;
+    scores.innerHTML = `<div style="padding:6px;background:var(--bg);border-radius:4px;border:1px solid var(--border)"><span style="font-size:10px;color:var(--muted)">Average: </span><strong style="color:var(--green)">${avg.toFixed ? avg.toFixed(4) : avg}</strong></div>`;
+    log('end', `Inference done. Average: ${avg}`);
+  } catch(e) {
+    log('error', 'Inference failed: ' + e.message);
+  } finally {
+    btn.disabled = false; btn.textContent = '▶ Run All 9 Tasks';
+  }
+}
+// ═══════════════════════════════════════════════
+// HEALTH CHECK — uses /reset OPTIONS or simple GET
+// ═══════════════════════════════════════════════
+async function checkServerHealth() {
+  try {
+    const res = await fetch('/', {
+      headers: { 'Accept': 'application/json' },
+      signal: AbortSignal.timeout(3000)
+    });
+    if (res.ok) {
+      document.getElementById('status-dot').className = 'logo-dot green';
+      document.getElementById('server-status').textContent = 'Server online';
+      document.getElementById('server-status').style.color = 'var(--green)';
+    } else throw new Error('not ok');
+  } catch(e) {
+    document.getElementById('status-dot').className = 'logo-dot err';
+    document.getElementById('server-status').textContent = 'Server unreachable';
+    document.getElementById('server-status').style.color = 'var(--red)';
+  }
+}
+init();
+</script>
+</div><!-- end page-debug -->
+<!-- ══ PAGE: BENCHMARK ══ -->
+<div class="page" id="page-benchmark">
+<div class="bench-layout">
+  <!-- Benchmark Sidebar -->
+  <div class="bench-sidebar">
+    <div class="bench-card">
+      <div class="bench-card-hdr">🔑 API Configuration</div>
+      <div class="bench-card-body">
+        <label style="font-size:10px;color:var(--muted);margin-bottom:6px;display:block">Quick Presets</label>
+        <div class="preset-row">
+          <button class="preset-btn" onclick="applyPreset('groq')">⚡ Groq</button>
+          <button class="preset-btn" onclick="applyPreset('openrouter')">🌐 OpenRouter</button>
+          <button class="preset-btn" onclick="applyPreset('huggingface')">🤗 HuggingFace</button>
+          <button class="preset-btn" onclick="applyPreset('custom')">✏️ Custom</button>
+        </div>
+        <div class="bench-field">
+          <label>API Base URL</label>
+          <input type="text" id="bench-api-base" placeholder="https://api.groq.com/openai/v1" />
+        </div>
+        <div class="bench-field">
+          <label>API Key</label>
+          <input type="password" id="bench-api-key" placeholder="sk-..." />
+        </div>
+        <div class="bench-field">
+          <label>Model Display Name</label>
+          <input type="text" id="bench-model-name" placeholder="Llama-3.3-70B" />
+        </div>
+        <div class="bench-field">
+          <label>Model ID</label>
+          <input type="text" id="bench-model-id" placeholder="llama-3.3-70b-versatile" />
+        </div>
+      </div>
+    </div>
+    <button class="run-btn" id="bench-run-btn" onclick="runBenchmark()">
+      🚀 Run Benchmark (9 Tasks)
+    </button>
+    <div class="bench-card" style="margin-top:12px">
+      <div class="bench-card-hdr">📊 Run History
+        <button class="btn-ghost" style="margin-left:auto;font-size:9px;padding:2px 6px" onclick="clearResults()">Clear All</button>
+      </div>
+      <div class="bench-card-body" id="bench-history" style="max-height:200px;overflow-y:auto">
+        <div style="color:var(--muted);font-size:11px;text-align:center;padding:12px">No runs yet. Configure a model above and run.</div>
+      </div>
+    </div>
+    <div class="bench-card">
+      <div class="bench-card-hdr">ℹ️ Tips</div>
+      <div class="bench-card-body" style="font-size:11px;color:var(--muted);line-height:1.5">
+        <p>• <strong>Groq</strong> — Fast, free tier, use llama-3.3-70b-versatile</p>
+        <p>• <strong>OpenRouter</strong> — Many models, free tier has rate limits</p>
+        <p>• <strong>HuggingFace</strong> — Use your HF token with router.huggingface.co/v1</p>
+        <p style="margin-top:6px;color:var(--amber)">⚠️ Free tier models may hit rate limits on 9 tasks</p>
+      </div>
+    </div>
+  </div>
+  <!-- Benchmark Main -->
+  <div class="bench-main">
+    <!-- Results Table -->
+    <div class="results-area" id="bench-results">
+      <div class="empty-state">
+        <div class="icon">📊</div>
+        <p>Run a benchmark to see results here. Configure your API key and model on the left, then click Run.</p>
+      </div>
+    </div>
+    <!-- Comparison Chart -->
+    <div class="chart-container" id="bench-chart" style="display:none">
+      <div style="font-size:11px;font-weight:700;color:var(--muted);text-transform:uppercase;letter-spacing:.04em;margin-bottom:10px">Model Comparison — Average Score</div>
+      <div id="chart-bars"></div>
+    </div>
+    <!-- Log -->
+    <div class="bench-log" id="bench-log">
+      <div style="color:var(--muted)">Benchmark logs will appear here...</div>
+    </div>
+  </div>
+</div>
+</div><!-- end page-benchmark -->
+<script>
+// ══════════════════════════════════
+// PAGE SWITCHING
+// ══════════════════════════════════
+function switchPage(page) {
+  document.querySelectorAll('.page').forEach(p => p.classList.remove('visible'));
+  document.querySelectorAll('.page-tab').forEach(t => t.classList.remove('active'));
+  document.getElementById('page-' + page).classList.add('visible');
+  document.getElementById('ptab-' + page).classList.add('active');
+  if (page === 'benchmark') loadBenchResults();
+}
+// ══════════════════════════════════
+// API PRESETS
+// ══════════════════════════════════
+const PRESETS = {
+  groq: { base: 'https://api.groq.com/openai/v1', models: ['llama-3.3-70b-versatile','mixtral-8x7b-32768','gemma2-9b-it'], default_name: 'Llama-3.3-70B', default_id: 'llama-3.3-70b-versatile' },
+  openrouter: { base: 'https://openrouter.ai/api/v1', models: ['nvidia/nemotron-3-super-120b-a12b:free','qwen/qwen3.6-plus:free','deepseek/deepseek-r1:free'], default_name: 'Nemotron-120B', default_id: 'nvidia/nemotron-3-super-120b-a12b:free' },
+  huggingface: { base: 'https://router.huggingface.co/v1', models: ['Qwen/Qwen2.5-72B-Instruct','meta-llama/Llama-3.1-70B-Instruct'], default_name: 'Qwen-2.5-72B', default_id: 'Qwen/Qwen2.5-72B-Instruct' },
+  custom: { base: '', models: [], default_name: '', default_id: '' },
+};
+function applyPreset(name) {
+  document.querySelectorAll('.preset-btn').forEach(b => b.classList.remove('active'));
+  event.target.classList.add('active');
+  const p = PRESETS[name];
+  document.getElementById('bench-api-base').value = p.base;
+  document.getElementById('bench-model-name').value = p.default_name;
+  document.getElementById('bench-model-id').value = p.default_id;
+  if (name !== 'custom') document.getElementById('bench-api-key').focus();
+}
+// ══════════════════════════════════
+// RUN BENCHMARK
+// ══════════════════════════════════
+let benchRunning = false;
+async function runBenchmark() {
+  if (benchRunning) return;
+  const apiBase = document.getElementById('bench-api-base').value.trim();
+  const apiKey = document.getElementById('bench-api-key').value.trim();
+  const modelName = document.getElementById('bench-model-name').value.trim() || 'Unknown';
+  const modelId = document.getElementById('bench-model-id').value.trim();
+  if (!apiBase || !apiKey || !modelId) {
+    alert('Please fill in API Base URL, API Key, and Model ID');
+    return;
+  }
+  benchRunning = true;
+  const btn = document.getElementById('bench-run-btn');
+  btn.disabled = true;
+  btn.classList.add('running');
+  btn.innerHTML = '⏳ Running 9 tasks...';
+  const logEl = document.getElementById('bench-log');
+  logEl.innerHTML = '';
+  benchLog('info', `Starting benchmark: ${modelName} (${modelId})`);
+  benchLog('info', `API: ${apiBase}`);
+  benchLog('info', `Running 9 tasks... This may take 2-5 minutes.`);
+  try {
+    const res = await fetch('/benchmark/run', {
+      method: 'POST',
+      headers: {'Content-Type': 'application/json'},
+      body: JSON.stringify({
+        model_name: modelName,
+        model_id: modelId,
+        api_base: apiBase,
+        api_key: apiKey,
+      })
+    });
+    if (res.headers.get('content-type').includes('application/json')) {
+      const data = await res.json();
+      if (data.error) benchLog('err', 'Error: ' + data.error);
+      throw new Error('Benchmark failed to start');
+    }
+    const reader = res.body.getReader();
+    const decoder = new TextDecoder();
+    let done = false;
+    let buffer = '';
+    while (!done) {
+      const { value, done: readerDone } = await reader.read();
+      done = readerDone;
+      if (value) {
+        buffer += decoder.decode(value, { stream: true });
+        let parts = buffer.split('\n\n');
+        buffer = parts.pop();
+        for (const part of parts) {
+          if (part.startsWith('data: ')) {
+            try {
+              const event = JSON.parse(part.substring(6));
+              if (event.type === 'log') {
+                 benchLog(event.level, event.msg);
+              } else if (event.type === 'task_done') {
+                 benchLog('info', `🎯 Task ${event.task_id} completed with score: ${event.score.toFixed(4)}`);
+              } else if (event.type === 'done') {
+                 benchLog('ok', `✅ All tasks complete! Average: ${event.result.average}`);
+                 renderResults();
+                 renderChart();
+              }
+            } catch(e) {}
+          }
+        }
+      }
+    }
+  } catch(e) {
+    benchLog('err', 'Execution error: ' + e.message);
+  } finally {
+    benchRunning = false;
+    btn.disabled = false;
+    btn.classList.remove('running');
+    btn.innerHTML = '🚀 Run Benchmark (9 Tasks)';
+  }
+}
+function benchLog(type, msg) {
+  const logEl = document.getElementById('bench-log');
+  const cls = type === 'err' ? 'log-err' : type === 'warn' ? 'log-warn' : type === 'ok' ? 'log-ok' : 'log-info';
+  const time = new Date().toLocaleTimeString('en-US',{hour12:false,hour:'2-digit',minute:'2-digit',second:'2-digit'});
+  logEl.innerHTML += `<div class="${cls}"><span style="color:var(--muted)">${time}</span> ${msg}</div>`;
+  logEl.scrollTop = logEl.scrollHeight;
+}
+// ══════════════════════════════════
+// RESULTS RENDERING
+// ══════════════════════════════════
+const BENCH_TASKS = ['sec_easy','sec_medium','sec_hard','dep_easy','dep_medium','dep_hard','cli_easy','cli_medium','cli_hard'];
+const BENCH_COLORS = ['#4f8ef7','#a855f7','#22c55e','#f59e0b','#ef4444','#22d3ee','#f472b6','#84cc16','#fb923c'];
+async function loadBenchResults() {
+  try {
+    const res = await fetch('/benchmark/results');
+    const data = await res.json();
+    if (data.results && data.results.length > 0) {
+      renderResults(data.results);
+      renderChart(data.results);
+      renderHistory(data.results);
+    }
+  } catch(e) {}
+}
+function renderResults(results) {
+  if (!results) {
+    fetch('/benchmark/results').then(r=>r.json()).then(d => { if(d.results) renderResults(d.results); });
+    return;
+  }
+  if (results.length === 0) return;
+  const el = document.getElementById('bench-results');
+  let html = '<table class="results-table"><thead><tr><th>Model</th>';
+  BENCH_TASKS.forEach(t => html += `<th>${t.replace('_',' ').toUpperCase()}</th>`);
+  html += '<th>AVG</th><th>Time</th></tr></thead><tbody>';
+  results.forEach((r, i) => {
+    html += `<tr>`;
+    html += `<td style="font-weight:700;color:${BENCH_COLORS[i % BENCH_COLORS.length]}">${r.model_name}</td>`;
+    BENCH_TASKS.forEach(t => {
+      const s = r.scores[t] || 0;
+      const cls = s >= 0.8 ? 'score-high' : s >= 0.4 ? 'score-mid' : 'score-low';
+      html += `<td class="score-cell ${cls}">${s.toFixed(2)}</td>`;
+    });
+    const avgCls = r.average >= 0.7 ? 'score-high' : r.average >= 0.4 ? 'score-mid' : 'score-low';
+    html += `<td class="score-cell avg-cell ${avgCls}">${r.average.toFixed(3)}</td>`;
+    const ts = new Date(r.timestamp);
+    html += `<td style="font-size:10px;color:var(--muted)">${ts.toLocaleTimeString()}</td>`;
+    html += '</tr>';
+  });
+  html += '</tbody></table>';
+  el.innerHTML = html;
+}
+function renderChart(results) {
+  if (!results) {
+    fetch('/benchmark/results').then(r=>r.json()).then(d => { if(d.results) renderChart(d.results); });
+    return;
+  }
+  if (results.length === 0) return;
+  const container = document.getElementById('bench-chart');
+  container.style.display = 'block';
+  const bars = document.getElementById('chart-bars');
+  let html = '';
+  results.forEach((r, i) => {
+    const pct = Math.round(r.average * 100);
+    const color = BENCH_COLORS[i % BENCH_COLORS.length];
+    const gradient = `linear-gradient(90deg, ${color}88, ${color})`;
+    html += `<div class="chart-bar-row">
+      <div class="chart-label">${r.model_name}</div>
+      <div class="chart-bar-bg">
+        <div class="chart-bar-fill" style="width:${pct}%;background:${gradient}">${r.average.toFixed(3)}</div>
+      </div>
+    </div>`;
+  });
+  bars.innerHTML = html;
+}
+function renderHistory(results) {
+  const el = document.getElementById('bench-history');
+  if (!results || results.length === 0) {
+    el.innerHTML = '<div style="color:var(--muted);font-size:11px;text-align:center;padding:12px">No runs yet.</div>';
+    return;
+  }
+  let html = '';
+  results.forEach((r, i) => {
+    const avgCls = r.average >= 0.7 ? 'score-high' : r.average >= 0.4 ? 'score-mid' : 'score-low';
+    const ts = new Date(r.timestamp);
+    html += `<div style="display:flex;align-items:center;gap:8px;padding:6px 0;border-bottom:1px solid var(--border);font-size:11px">
+      <span style="color:${BENCH_COLORS[i % BENCH_COLORS.length]};font-weight:700">${r.model_name}</span>
+      <span class="score-cell ${avgCls}" style="margin-left:auto">${r.average.toFixed(3)}</span>
+      <span style="color:var(--muted);font-size:9px">${ts.toLocaleTimeString()}</span>
+    </div>`;
+  });
+  el.innerHTML = html;
+}
+async function clearResults() {
+  if (!confirm('Clear all benchmark results?')) return;
+  await fetch('/benchmark/clear', {method:'POST'});
+  document.getElementById('bench-results').innerHTML = '<div class="empty-state"><div class="icon">📊</div><p>No results. Run a benchmark to see data.</p></div>';
+  document.getElementById('bench-chart').style.display = 'none';
+  document.getElementById('bench-history').innerHTML = '<div style="color:var(--muted);font-size:11px;text-align:center;padding:12px">No runs yet.</div>';
+}
+</script>
+</body>
+</html>

server/demo_agent.py ADDED Viewed

	@@ -0,0 +1,140 @@

+# server/demo_agent.py
+# Simple rule-based demo agent for the Gradio UI.
+# Uses hardcoded heuristics to show the environment works without calling a real LLM.
+def demo_action(obs):
+    """Generate a simple action based on observation. Used by the UI demo."""
+    task_type = obs.get('task_type', '')
+    task_id = obs.get('task_id', '')
+    turn = obs.get('turn', 0)
+    if task_type == 'security':
+        return _security_action(obs, task_id, turn)
+    elif task_type == 'dependency':
+        return _dependency_action(obs, task_id, turn)
+    elif task_type == 'clinical':
+        return _clinical_action(obs, task_id, turn)
+    else:
+        return {'action_type': 'invalid'}
+def _security_action(obs, task_id, turn):
+    if turn == 0:
+        tool_call = obs.get('tool_call', '')
+        # Simple heuristic to detect common vulnerability types
+        vuln_type = 'sql_injection'
+        severity = 'critical'
+        cvss = 8.5
+        if 'script' in tool_call.lower() or 'xss' in tool_call.lower():
+            vuln_type = 'xss'
+            severity = 'medium'
+            cvss = 5.0
+        elif 'password' in tool_call.lower() or 'secret' in tool_call.lower():
+            vuln_type = 'hardcoded_secret'
+            severity = 'high'
+            cvss = 6.5
+        elif 'jwt' in tool_call.lower() or 'token' in tool_call.lower():
+            vuln_type = 'jwt_misuse'
+            severity = 'critical'
+            cvss = 8.0
+        elif 'path' in tool_call.lower() or '..' in tool_call:
+            vuln_type = 'path_traversal'
+            severity = 'high'
+            cvss = 7.0
+        elif 'auth' in tool_call.lower() and 'no' in tool_call.lower():
+            vuln_type = 'missing_auth'
+            severity = 'critical'
+            cvss = 8.5
+        return {
+            'action_type': 'identify_vulnerability',
+            'vuln_type': vuln_type,
+            'cvss_score': cvss,
+            'severity': severity,
+            'affected_line': 1,
+        }
+    elif 'reviewer_feedback' in obs:
+        return {
+            'action_type': 'revise_fix',
+            'fix_code': 'sanitize_input(parameterized_query)',
+            'addressed_feedback': obs.get('reviewer_feedback', 'fixed the issue'),
+        }
+    else:
+        return {
+            'action_type': 'propose_fix',
+            'fix_code': 'use parameterized query with ? placeholder',
+            'explanation': 'Replace string concatenation with parameterized queries',
+        }
+def _dependency_action(obs, task_id, turn):
+    task_subtype = obs.get('task_subtype', 'flag')
+    if task_subtype == 'flag':
+        return {
+            'action_type': 'flag_outdated',
+            'packages': {'torch': '1.9.0'},
+            'deprecated_api': 'torch.autograd.Variable',
+            'replacement': 'plain tensor',
+        }
+    elif task_subtype == 'resolve':
+        return {
+            'action_type': 'resolve_conflict',
+            'packages': {'torch': '2.1.0', 'numpy': '1.24.0'},
+            'reasoning': 'PyTorch 2.1 requires NumPy 1.24+',
+        }
+    else:  # migrate
+        return {
+            'action_type': 'migrate_api',
+            'completed_items': ['break_001', 'break_002'],
+            'code_changes': {
+                'break_001': 'torch.where(condition, x*2, x)',
+                'break_002': 'x.shape[0]',
+            },
+        }
+def _clinical_action(obs, task_id, turn):
+    available_steps = obs.get('available_steps', [])
+    if turn == 0:
+        return {
+            'action_type': 'detect_gap',
+            'missing_steps': available_steps[:2] if available_steps else ['unknown_step'],
+            'risk_level': 'critical',
+        }
+    elif turn == 1:
+        return {
+            'action_type': 'rank_issues',
+            'priority_order': available_steps[:3] if available_steps else ['unknown_step'],
+        }
+    else:
+        dep_graph = obs.get('dependency_graph', {})
+        # Simple topological sort attempt
+        ordered = _simple_topo_sort(available_steps, dep_graph)
+        return {
+            'action_type': 'order_steps',
+            'recovery_steps': ordered,
+        }
+def _simple_topo_sort(steps, dep_graph):
+    """Simple topological sort for dependency ordering."""
+    if not dep_graph:
+        return steps
+    result = []
+    remaining = set(steps)
+    for _ in range(len(steps) + 1):
+        if not remaining:
+            break
+        for step in list(remaining):
+            prereqs = dep_graph.get(step, [])
+            if all(p in result for p in prereqs):
+                result.append(step)
+                remaining.remove(step)
+                break
+    # Add any unresolved steps
+    result.extend(remaining)
+    return result

server/graders/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


1	+ # server/graders package

server/graders/base_grader.py ADDED Viewed

	@@ -0,0 +1,79 @@

+# server/graders/base_grader.py
+# Core grading utilities used by ALL domain graders.
+# Contains: safe_score (Bug 1 fix), penalty functions, grade_dynamic entry point.
+from typing import Dict, Any, List, Callable
+def safe_score(raw) -> float:
+    """Always clamp to [0.0, 1.0]. Never crash. Handles None, str, out-of-range."""
+    if raw is None:
+        return 0.0                    # BUG 1 FIX — must be first line
+    try:
+        return round(max(0.0, min(1.0, float(raw))), 4)
+    except (TypeError, ValueError):
+        return 0.0
+def repetition_penalty(action_type: str, last_actions: List[str], window: int = 3) -> float:
+    """Penalise repeating the same action type in the last N steps."""
+    count = last_actions[-window:].count(action_type)
+    return -0.15 * count
+def invalid_action_penalty(action_type: str, valid_actions: List[str]) -> float:
+    """Penalise actions not in the valid set for this domain."""
+    return -0.20 if action_type not in valid_actions else 0.0
+def harmful_output_penalty(action: Dict, forbidden_patterns: List[str]) -> float:
+    """Penalise destructive patterns like 'os.remove' or 'drop table'."""
+    action_str = str(action).lower()
+    for p in forbidden_patterns:
+        if p.lower() in action_str:
+            return -0.30
+    return 0.0
+def efficiency_bonus(step_count: int, max_steps: int, done: bool) -> float:
+    """Reward finishing early (before half the max steps)."""
+    return 0.10 if done and step_count < max_steps // 2 else 0.0
+def grade_dynamic(
+    action: Dict[str, Any],
+    session,
+    compute_correctness_fn: Callable,
+    valid_actions: List[str],
+    forbidden_patterns: List[str] = None,
+    max_steps: int = 8
+) -> float:
+    """Full reward pipeline. Entry point for all domain graders.
+    Pipeline: invalid check → repetition → correctness → harmful → efficiency → clamp
+    """
+    if forbidden_patterns is None:
+        forbidden_patterns = []
+    action_type = action.get('action_type', 'unknown')
+    # Penalties
+    inv  = invalid_action_penalty(action_type, valid_actions)
+    rep  = repetition_penalty(action_type, session.last_actions)
+    harm = harmful_output_penalty(action, forbidden_patterns)
+    # If action type is invalid, skip the grader entirely
+    if inv < 0:
+        return safe_score(inv + rep)
+    # Core correctness score from domain-specific grader
+    correctness = compute_correctness_fn(action, session.task_case)
+    # Efficiency bonus — session.done is always False at this point (set by router
+    # AFTER grade() returns), so use correctness >= 0.8 as proxy for "solved well"
+    eff = efficiency_bonus(session.step_count + 1, max_steps, correctness is not None and correctness >= 0.8)
+    # Combine and clamp
+    raw = correctness + rep + harm + eff
+    return safe_score(raw)

server/graders/clinical_grader.py ADDED Viewed

	@@ -0,0 +1,154 @@

+# server/graders/clinical_grader.py
+# Grader for Clinical Workflow Chaos Simulator tasks (cli_easy, cli_medium, cli_hard).
+# Bug 2 FIXED: propose_recovery is NOT in VALID_ACTIONS.
+# Uses NDCG ranking and dependency violation counting.
+import math
+from typing import Dict, List
+from .base_grader import grade_dynamic, safe_score
+# Bug 2 FIX: propose_recovery is NOT here — it has no grader branch
+VALID_ACTIONS = ['detect_gap', 'rank_issues', 'order_steps']
+FORBIDDEN = []
+RISK_ORDER = ['low', 'medium', 'high', 'critical']
+def _adj_risk(predicted, target):
+    """Check if risk level is off by exactly one level (partial credit)."""
+    try:
+        return abs(RISK_ORDER.index(predicted) - RISK_ORDER.index(target)) == 1
+    except ValueError:
+        return False
+def _f1(predicted: List, expected: List) -> float:
+    """Compute F1 score between predicted and expected lists."""
+    if not expected:
+        return 1.0 if not predicted else 0.0
+    if not predicted:
+        return 0.0
+    p_s = set(str(x).strip() for x in predicted)
+    e_s = set(str(x).strip() for x in expected)
+    tp = len(p_s & e_s)
+    prec = tp / len(p_s) if p_s else 0.0
+    rec = tp / len(e_s) if e_s else 0.0
+    return round(2 * prec * rec / max(prec + rec, 0.001), 4)
+def _ndcg(predicted: List, ideal: List, k: int = None) -> float:
+    """NDCG@k: rewards getting highest-priority items ranked first.
+    If ideal = ['insurance_auth', 'pre_op_consent', 'book_specialist']:
+      - Getting 'insurance_auth' first is worth more than getting it last.
+      - Each position is worth less than the previous (logarithmic discount).
+      - NDCG=1.0 means perfect ranking. NDCG=0.0 means completely reversed.
+    """
+    if not ideal:
+        return 1.0
+    if k is None:
+        k = len(ideal)
+    def dcg(order):
+        score = 0.0
+        for i, item in enumerate(order[:k]):
+            if item in ideal:
+                relevance = len(ideal) - ideal.index(item)
+                score += relevance / math.log2(i + 2)
+        return score
+    ideal_dcg = dcg(ideal)
+    return round(dcg(predicted) / ideal_dcg, 4) if ideal_dcg > 0 else 0.0
+def _count_violations(proposed: List, dep_graph: Dict) -> int:
+    """Count steps where a prerequisite appears AFTER the step needing it."""
+    violations = 0
+    for i, step in enumerate(proposed):
+        for prereq in dep_graph.get(step, []):
+            if prereq not in proposed[:i]:
+                violations += 1
+    return violations
+def _score_detect(action: Dict, case: Dict) -> float:
+    """Score gap detection (cli_easy). F1 on missing steps + risk level match."""
+    exp = case.get('expected_missing_steps', [])
+    pred = action.get('missing_steps', [])
+    # Normalize to lists
+    if isinstance(exp, str):
+        exp = [exp]
+    if isinstance(pred, str):
+        pred = [pred]
+    # F1 on missing step detection (65% weight)
+    step_score = _f1(pred, exp)
+    # Risk level match: exact or adjacent (35% weight)
+    er = case.get('expected_risk', '')
+    pr = action.get('risk_level', '')
+    risk_score = 1.0 if pr == er else (0.5 if _adj_risk(pr, er) else 0.0)
+    return 0.65 * step_score + 0.35 * risk_score
+def _score_rank(action: Dict, case: Dict) -> float:
+    """Score priority ranking (cli_medium). Completeness + NDCG ordering."""
+    ideal = case.get('priority_order', [])
+    predicted = action.get('priority_order', [])
+    if not ideal:
+        return 0.5
+    # Filter predicted to only include valid step IDs (prevents hallucinated IDs from scoring)
+    valid_ids = set(case.get('available_steps', []))
+    if valid_ids:
+        predicted = [p for p in predicted if p in valid_ids]
+    # Completeness: are all items present? (40% weight)
+    completeness = _f1(predicted, ideal)
+    # Ranking quality: NDCG (60% weight)
+    ranking = _ndcg(predicted, ideal)
+    return 0.40 * completeness + 0.60 * ranking
+def _score_order(action: Dict, case: Dict) -> float:
+    """Score dependency-ordered recovery (cli_hard). Order + completeness + efficiency."""
+    dep_graph = case.get('dependency_graph', {})
+    required = case.get('required_steps', [])
+    proposed = action.get('recovery_steps', [])
+    if not proposed:
+        return 0.0
+    # Dependency violations: -0.25 each (40% weight)
+    viol = _count_violations(proposed, dep_graph)
+    order = max(0.0, 1.0 - viol * 0.25)
+    # Completeness: F1 against required steps (40% weight)
+    completeness = _f1(proposed, required)
+    # Efficiency: penalize extra unnecessary steps (20% weight)
+    extra = max(0, len(proposed) - len(required))
+    efficiency = max(0.0, 1.0 - extra * 0.10)
+    return safe_score(order * 0.40 + completeness * 0.40 + efficiency * 0.20)
+def compute_correctness(action: Dict, case: Dict) -> float:
+    """Route to correct scoring function based on action_type."""
+    atype = action.get('action_type')
+    if atype == 'detect_gap':
+        return _score_detect(action, case)
+    if atype == 'rank_issues':
+        return _score_rank(action, case)
+    if atype == 'order_steps':
+        return _score_order(action, case)
+    return None
+def grade(action: Dict, session) -> float:
+    """Entry point called by router. Runs full reward pipeline."""
+    return grade_dynamic(action, session, compute_correctness, VALID_ACTIONS, FORBIDDEN, max_steps=6)

server/graders/dependency_grader.py ADDED Viewed

	@@ -0,0 +1,243 @@

+# server/graders/dependency_grader.py
+# Grader for PyTorch Migration Time-Machine tasks (dep_easy, dep_medium, dep_hard).
+# Covers: deprecated API detection, version conflict resolution, graph-break fixing.
+from typing import Dict
+from .base_grader import grade_dynamic, safe_score
+try:
+    from packaging.version import Version
+    from packaging.specifiers import SpecifierSet
+    _HAS_PACKAGING = True
+except ImportError:
+    _HAS_PACKAGING = False
+VALID_ACTIONS = ['flag_outdated', 'resolve_conflict', 'migrate_api', 'validate_tree']
+FORBIDDEN = []
+def _normalize_ver(v: str) -> str:
+    """Normalize version: '2.1' → '2.1.0', '1' → '1.0.0'."""
+    parts = str(v).strip().split('.')
+    while len(parts) < 3:
+        parts.append('0')
+    return '.'.join(parts[:3])
+def _parse_version_tuple(v: str) -> tuple:
+    """Parse '2.1.0' into (2, 1, 0). Robust fallback when packaging is unavailable."""
+    try:
+        parts = _normalize_ver(v).split('.')
+        return tuple(int(p) for p in parts[:3])
+    except (ValueError, AttributeError):
+        return (0, 0, 0)
+def _simple_version_check(ver_str: str, constraint: str) -> bool:
+    """Check if ver_str satisfies a constraint like '>=1.24,<2.0' WITHOUT packaging.
+    Handles: >=, <=, >, <, ==, != and comma-separated constraints.
+    """
+    ver = _parse_version_tuple(ver_str)
+    parts = [c.strip() for c in constraint.split(',') if c.strip()]
+    for part in parts:
+        if part.startswith('>='):
+            if ver < _parse_version_tuple(part[2:]):
+                return False
+        elif part.startswith('<='):
+            if ver > _parse_version_tuple(part[2:]):
+                return False
+        elif part.startswith('!='):
+            if ver == _parse_version_tuple(part[2:]):
+                return False
+        elif part.startswith('>'):
+            if ver <= _parse_version_tuple(part[1:]):
+                return False
+        elif part.startswith('<'):
+            if ver >= _parse_version_tuple(part[1:]):
+                return False
+        elif part.startswith('=='):
+            if ver != _parse_version_tuple(part[2:]):
+                return False
+        else:
+            # Bare version string — treat as ==
+            if ver != _parse_version_tuple(part):
+                return False
+    return True
+def _f1(predicted, expected):
+    """Compute F1 score between predicted and expected sets."""
+    if not expected:
+        return 1.0 if not predicted else 0.0
+    if not predicted:
+        return 0.0
+    pred_s = set(str(p).strip() for p in predicted)
+    exp_s = set(str(e).strip() for e in expected)
+    tp = len(pred_s & exp_s)
+    p = tp / len(pred_s) if pred_s else 0.0
+    r = tp / len(exp_s) if exp_s else 0.0
+    return round(2 * p * r / max(p + r, 0.001), 4)
+def _downgrades(proposed: Dict, case: Dict) -> int:
+    """Count unnecessary version downgrades (dep_medium penalty)."""
+    reqs = case.get('requirements', {})
+    count = 0
+    for pkg, ver in proposed.items():
+        if pkg in reqs:
+            try:
+                if _HAS_PACKAGING:
+                    if Version(_normalize_ver(ver)) < Version(_normalize_ver(reqs[pkg])):
+                        count += 1
+                else:
+                    if _parse_version_tuple(ver) < _parse_version_tuple(reqs[pkg]):
+                        count += 1
+            except Exception:
+                pass
+    return count
+def _score_flag(action: Dict, case: Dict) -> float:
+    """Score deprecated API detection (dep_easy)."""
+    exp = set(case.get('expected_outdated_packages', []))
+    flagged = set(action.get('packages', {}).keys())
+    # F1 on package detection (55% weight)
+    p = len(flagged & exp) / max(len(flagged), 1)
+    r = len(flagged & exp) / max(len(exp), 1)
+    f1 = 2 * p * r / max(p + r, 0.001)
+    # Deprecated API match (45% weight) — fuzzy for model variations
+    expected_api = case.get('expected_deprecated_api', '')
+    actual_api = action.get('deprecated_api', '') or ''
+    if actual_api == expected_api:
+        dep_ok = 1.0
+    elif expected_api and expected_api.split('.')[-1] in actual_api:
+        dep_ok = 0.7  # last segment match e.g. "Variable" in "autograd.Variable"
+    elif expected_api and any(p in actual_api for p in expected_api.split('.')):
+        dep_ok = 0.4  # partial segment match
+    else:
+        dep_ok = 0.0
+    return f1 * 0.55 + dep_ok * 0.45
+def _score_resolve(action: Dict, case: Dict) -> float:
+    """Score version conflict resolution (dep_medium). Cross-checks compatibility matrix constraints."""
+    compat = case.get('compatibility_matrix', {})
+    proposed = action.get('packages', {})
+    conflict_pkgs = case.get('conflict_packages', [])
+    # Count valid proposed versions WITH cross-constraint checking
+    valid = 0
+    for pkg, ver in proposed.items():
+        if pkg not in compat:
+            continue
+        norm_ver = _normalize_ver(ver)
+        # Try exact match first, then normalized
+        pkg_versions = compat[pkg]
+        matched_ver = None
+        if ver in pkg_versions:
+            matched_ver = ver
+        elif norm_ver in pkg_versions:
+            matched_ver = norm_ver
+        else:
+            for k in pkg_versions:
+                if _normalize_ver(k) == norm_ver:
+                    matched_ver = k
+                    break
+        # Patch-level fuzzy: match major.minor only (e.g. "2.1.1" → "2.1.0")
+        if not matched_ver:
+            norm_major_minor = '.'.join(norm_ver.split('.')[:2])
+            for k in pkg_versions:
+                if '.'.join(_normalize_ver(k).split('.')[:2]) == norm_major_minor:
+                    matched_ver = k
+                    break
+        if not matched_ver:
+            continue
+        # Check cross-dependency constraints using packaging or fallback
+        deps = pkg_versions[matched_ver]
+        cross_ok = True
+        if isinstance(deps, dict):
+            for dep_pkg, constraint in deps.items():
+                if dep_pkg in proposed:
+                    dep_ver = _normalize_ver(proposed[dep_pkg])
+                    try:
+                        if _HAS_PACKAGING:
+                            if Version(dep_ver) not in SpecifierSet(constraint):
+                                cross_ok = False
+                                break
+                        else:
+                            if not _simple_version_check(dep_ver, constraint):
+                                cross_ok = False
+                                break
+                    except Exception:
+                        pass
+        if cross_ok:
+            valid += 1
+    base = valid / max(len(conflict_pkgs), 1)
+    bonus = 0.15 if valid == len(conflict_pkgs) else 0.0
+    down = _downgrades(proposed, case) * 0.10
+    return safe_score(base + bonus - down)
+def _score_migrate(action: Dict, case: Dict) -> float:
+    """Score graph-break migration (dep_hard). Checks coverage, order, fix quality."""
+    checklist = case.get('graph_breaks', [])       # list of break IDs
+    dep_graph = case.get('checklist_dependency_graph', {})
+    completed = action.get('completed_items', [])
+    fix_map = case.get('correct_fix_map', {})      # break_id -> required_token
+    if not checklist:
+        return 0.5
+    # Early exit: if agent submitted nothing, score is 0
+    if not completed:
+        return 0.0
+    # Dependency order violations
+    viol = sum(
+        1 for item in completed
+        for pre in dep_graph.get(item, [])
+        if pre not in completed
+    )
+    order_score = max(0.0, 1.0 - viol * 0.20)
+    # Checklist coverage
+    covered = [b for b in checklist if b in completed]
+    completeness = len(covered) / max(len(checklist), 1)
+    # Fix quality: does each fix contain the required token?
+    fix_qs = []
+    for b in covered:
+        if b not in fix_map:
+            continue
+        expected_token = fix_map[b].lower()
+        actual_fix = str(action.get('code_changes', {}).get(b, '')).lower()
+        if expected_token in actual_fix or actual_fix in expected_token:
+            fix_qs.append(1.0)
+        else:
+            fix_qs.append(0.6)  # Generous partial credit
+    fix_quality = sum(fix_qs) / max(len(fix_qs), 1) if fix_qs else 0.0
+    return safe_score(order_score * 0.30 + completeness * 0.40 + fix_quality * 0.30)
+def compute_correctness(action: Dict, case: Dict) -> float:
+    """Route to correct scoring function based on action_type."""
+    atype = action.get('action_type')
+    if atype == 'flag_outdated':
+        return _score_flag(action, case)
+    if atype == 'resolve_conflict':
+        return _score_resolve(action, case)
+    if atype in ('migrate_api', 'validate_tree'):
+        return _score_migrate(action, case)
+    return None
+def grade(action: Dict, session) -> float:
+    """Entry point called by router. Runs full reward pipeline."""
+    return grade_dynamic(action, session, compute_correctness, VALID_ACTIONS, FORBIDDEN, max_steps=8)

server/graders/security_grader.py ADDED Viewed

	@@ -0,0 +1,99 @@

+# server/graders/security_grader.py
+# Grader for MCP Security Sandbox tasks (sec_easy, sec_medium, sec_hard).
+# Bug 4 FIXED: _score_identify does NOT early-return on wrong vuln_type.
+from typing import Dict
+from .base_grader import grade_dynamic, safe_score
+VALID_ACTIONS = ['identify_vulnerability', 'propose_fix', 'revise_fix']
+FORBIDDEN = ['os.remove', 'shutil.rmtree', 'drop table', 'delete from']
+SEV_ORDER = ['low', 'medium', 'high', 'critical']
+def _adj_sev(predicted, target):
+    """Check if severity is off by exactly one level (partial credit)."""
+    try:
+        return abs(SEV_ORDER.index(predicted) - SEV_ORDER.index(target)) == 1
+    except ValueError:
+        return False
+def _score_identify(action: Dict, case: Dict) -> float:
+    """Score vulnerability identification. Bug 4 FIX: always score all 3 components."""
+    # Detection: correct vuln_type? (45% weight)
+    det = 1.0 if action.get('vuln_type') == case.get('expected_vuln_type', '') else 0.0
+    # BUG 4 FIX: do NOT early-return here. Always score CVSS and severity.
+    # This gives the agent partial credit even when vuln_type is wrong.
+    # CVSS: within expected range? (30% weight)
+    lo, hi = case.get('cvss_range', [0.0, 10.0])
+    v = float(action.get('cvss_score', -1))
+    cvss = 1.0 if lo <= v <= hi else (0.5 if abs(v - (lo + hi) / 2) <= 3.0 else 0.0)
+    # Severity: exact match or adjacent? (25% weight)
+    s, es = action.get('severity', ''), case.get('expected_severity', '')
+    sev = 1.0 if s == es else (0.4 if _adj_sev(s, es) else 0.0)
+    return det * 0.45 + cvss * 0.30 + sev * 0.25
+def _score_propose(action: Dict, case: Dict) -> float:
+    """Score proposed fix. Checks token coverage and identifier preservation."""
+    tokens = case.get('required_fix_tokens', [])
+    if isinstance(tokens, dict):
+        tokens = tokens.get(case.get('expected_vuln_type', ''), [])
+    # Safety: flatten to list of strings only
+    tokens = [t for t in tokens if isinstance(t, str)]
+    fix = action.get('fix_code', '')
+    if not fix:
+        return 0.0
+    # Token coverage: allow missing 1 token to still get full score
+    if not tokens:
+        coverage = 0.5
+    else:
+        divisor = max(1, len(tokens) - 1)
+        coverage = min(1.0, sum(1 for t in tokens if t.lower() in fix.lower()) / divisor)
+    # Identifier preservation: did the fix keep the key function name?
+    key_id = case.get('must_preserve_identifier', '')
+    preservation = 0.15 if key_id and key_id in fix else 0.0
+    # Floor: any non-empty fix_code gets at least 0.25 (agent showed correct workflow)
+    return max(0.25, safe_score(coverage + preservation))
+def _score_revise(action: Dict, case: Dict) -> float:
+    """Score revised fix after reviewer feedback. Checks coverage and regression."""
+    kw = case.get('current_feedback_keywords', [])
+    addressed = action.get('addressed_feedback', '')
+    fix = action.get('fix_code', '')
+    # Feedback keyword coverage: allow missing 1 keyword
+    divisor = max(1, len(kw) - 1)
+    cov = min(1.0, sum(1 for k in kw if k.lower() in addressed.lower()) / divisor)
+    # Regression check: does the fix_code still contain the original vulnerability? (-20%)
+    reg = 0.20 if case.get('original_vuln_pattern', '') in fix else 0.0
+    # Floor: any non-empty addressed_feedback gets at least 0.20
+    return max(0.20, safe_score(cov - reg))
+def compute_correctness(action: Dict, case: Dict) -> float:
+    """Route to correct scoring function based on action_type."""
+    atype = action.get('action_type')
+    if atype == 'identify_vulnerability':
+        return _score_identify(action, case)
+    if atype == 'propose_fix':
+        return _score_propose(action, case)
+    if atype == 'revise_fix':
+        return _score_revise(action, case)
+    return None  # safe_score(None) = 0.0
+def grade(action: Dict, session) -> float:
+    """Entry point called by router. Runs full reward pipeline."""
+    return grade_dynamic(action, session, compute_correctness, VALID_ACTIONS, FORBIDDEN, max_steps=8)

server/models/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


1	+ # server/models package

server/models/clinical_models.py ADDED Viewed

	@@ -0,0 +1,19 @@

+# server/models/clinical_models.py
+from pydantic import BaseModel, Field
+from typing import List
+class DetectGap(BaseModel):
+    action_type: str = 'detect_gap'
+    missing_steps: List[str] = Field(..., description='IDs of missing workflow steps')
+    risk_level: str = Field(..., description='critical|high|medium|low')
+class RankIssues(BaseModel):
+    action_type: str = 'rank_issues'
+    priority_order: List[str] = Field(..., description='step IDs from highest to lowest priority')
+class OrderSteps(BaseModel):
+    action_type: str = 'order_steps'
+    recovery_steps: List[str] = Field(..., description='step IDs in dependency-safe execution order')

server/models/dependency_models.py ADDED Viewed

	@@ -0,0 +1,22 @@

+# server/models/dependency_models.py
+from pydantic import BaseModel, Field
+from typing import Dict, List, Optional
+class FlagOutdated(BaseModel):
+    action_type: str = 'flag_outdated'
+    packages: Dict[str, str] = Field(..., description='package_name: current_version')
+    deprecated_api: Optional[str] = None
+    replacement: Optional[str] = None
+class ResolveConflict(BaseModel):
+    action_type: str = 'resolve_conflict'
+    packages: Dict[str, str] = Field(..., description='package_name: proposed_version')
+    reasoning: str = Field(..., max_length=100)
+class MigrateApi(BaseModel):
+    action_type: str = 'migrate_api'
+    completed_items: List[str] = Field(..., description='list of break_ids fixed')
+    code_changes: Dict[str, str] = Field(..., description='break_id: fix summary')

server/models/security_models.py ADDED Viewed

	@@ -0,0 +1,23 @@

+# server/models/security_models.py
+from pydantic import BaseModel, Field
+from typing import Optional
+class IdentifyVulnerability(BaseModel):
+    action_type: str = 'identify_vulnerability'
+    vuln_type: str = Field(..., description='Type of vulnerability detected')
+    cvss_score: float = Field(..., ge=0.0, le=10.0)
+    severity: str = Field(..., description='critical|high|medium|low')
+    affected_line: int = Field(..., ge=1)
+class ProposeFix(BaseModel):
+    action_type: str = 'propose_fix'
+    fix_code: str = Field(..., max_length=500)
+    explanation: str = Field(..., max_length=200)
+class ReviseFix(BaseModel):
+    action_type: str = 'revise_fix'
+    fix_code: str = Field(..., max_length=500)
+    addressed_feedback: str = Field(..., max_length=200)

server/router.py ADDED Viewed

	@@ -0,0 +1,322 @@

+# server/router.py
+# Central dispatcher. Routes validated actions to the correct domain grader.
+# Returns rich observations with task_subtype, score_details, and data-driven done conditions.
+from typing import Dict
+from .session import SessionState
+from .graders import security_grader, dependency_grader, clinical_grader
+# Map domain names to their grader modules
+GRADERS = {
+    'security': security_grader,
+    'dependency': dependency_grader,
+    'clinical': clinical_grader,
+}
+def route_step(session: SessionState, action: Dict) -> Dict:
+    """Route a validated action to the correct grader and return enriched result."""
+    grader = GRADERS.get(session.task_type)
+    if not grader:
+        return {
+            'reward': 0.0,
+            'done': True,
+            'observation': {'error': f'Unknown task_type: {session.task_type}'},
+        }
+    # Run the domain grader
+    reward = grader.grade(action, session)
+    # Check if episode is done (data-driven from case)
+    case = session.task_case
+    max_steps = case.get('max_steps', 8)
+    done = _check_done(session, action, reward, max_steps)
+    # Build the next observation (rich, self-describing)
+    obs = _build_step_obs(session, action, reward, done)
+    # Score breakdown for debugging and UI
+    score_details = _compute_score_details(action, session)
+    return {
+        'episode_id': session.episode_id,
+        'step_count': session.step_count + 1,
+        'reward': round(float(reward), 4),
+        'done': bool(done),
+        'observation': obs,
+        'score_details': score_details,
+    }
+def _check_done(session: SessionState, action: Dict, reward: float, max_steps: int) -> bool:
+    """Data-driven done condition from case definition.
+    Three triggers (from OpenEnv tech ref Section 7.2):
+    1. required_sequence complete (all required action types performed)
+    2. reward >= completion_threshold
+    3. max steps reached
+    """
+    next_step = session.step_count + 1
+    case = session.task_case
+    # Always done if max steps reached
+    if next_step >= max_steps:
+        return True
+    # Check minimum actions before allowing completion by threshold
+    done_conditions = case.get('done_conditions', {})
+    min_actions = done_conditions.get('min_actions', 1)
+    if next_step < min_actions:
+        return False
+    # Completion threshold from case
+    threshold = case.get('completion_threshold', 0.85)
+    if reward >= threshold:
+        return True
+    # Required sequence check — only end if agent is actually scoring well
+    # This prevents premature termination when all action types are done but rewards are 0.0
+    required_seq = done_conditions.get('required_sequence', [])
+    if required_seq and reward >= 0.3:
+        all_actions = session.last_actions + [action.get('action_type', '')]
+        seq_complete = all(a in all_actions for a in required_seq)
+        if seq_complete:
+            return True
+    return False
+def build_initial_obs(session: SessionState) -> dict:
+    """Build the initial observation returned by /reset.
+    CRITICAL: Every observation MUST include task_type, task_subtype,
+    task_description, and available_actions with params.
+    """
+    case = session.task_case
+    task_type = session.task_type
+    task_id = session.task_id
+    obs = {
+        'task_type': task_type,
+        'task_id': task_id,
+        'task_subtype': case.get('task_subtype', 'standard'),
+        'task_description': case.get('task_description', ''),
+        'turn': 0,
+        'done': False,
+    }
+    if task_type == 'security':
+        obs['code_snippet'] = case.get('tool_call', '')
+        obs['reviewer_feedback'] = None
+        obs['available_actions'] = [
+            {'name': 'identify_vulnerability',
+             'params': ['vuln_type:str', 'cvss_score:float', 'severity:str', 'affected_line:int']},
+            {'name': 'propose_fix',
+             'params': ['fix_code:str', 'explanation:str']},
+            {'name': 'revise_fix',
+             'params': ['fix_code:str', 'addressed_feedback:str']},
+        ]
+    elif task_type == 'dependency':
+        obs['code_snippet'] = case.get('code_snippet', '')
+        subtype = case.get('task_subtype', '')
+        if subtype == 'flag':
+            obs['requirements'] = case.get('requirements', {})
+            obs['available_actions'] = [
+                {'name': 'flag_outdated',
+                 'params': ['packages:dict', 'deprecated_api:str|null', 'replacement:str|null']},
+            ]
+        elif subtype == 'resolve':
+            obs['conflict_packages'] = case.get('conflict_packages', [])
+            obs['compatibility_matrix'] = case.get('compatibility_matrix', {})
+            obs['current_requirements'] = case.get('requirements', {})
+            obs['compatibility_hint'] = 'Check torch 2.x compatibility with numpy and cuda-toolkit versions'
+            obs['available_actions'] = [
+                {'name': 'resolve_conflict',
+                 'params': ['packages:dict', 'reasoning:str']},
+            ]
+        elif subtype == 'migrate':
+            obs['graph_break_report'] = case.get('graph_break_report', case.get('break_descriptions', []))
+            obs['available_actions'] = [
+                {'name': 'migrate_api',
+                 'params': ['completed_items:list', 'code_changes:dict']},
+                {'name': 'validate_tree',
+                 'params': ['completed_items:list']},
+            ]
+    elif task_type == 'clinical':
+        obs['patient_id'] = case.get('patient_id', '')
+        obs['events'] = case.get('events', case.get('patient_events', []))
+        obs['available_steps'] = case.get('available_steps', [])
+        if task_id in ('cli_medium', 'cli_hard'):
+            obs['dependency_graph'] = case.get('dependency_graph', {})
+        obs['available_actions'] = [
+            {'name': 'detect_gap',
+             'params': ['missing_steps:list', 'risk_level:str']},
+            {'name': 'rank_issues',
+             'params': ['priority_order:list']},
+            {'name': 'order_steps',
+             'params': ['recovery_steps:list']},
+        ]
+    return obs
+def _build_step_obs(session: SessionState, action: Dict, reward: float, done: bool) -> Dict:
+    """Build observation returned after each step().
+    Always includes: task_type, task_id, task_subtype, turn, done.
+    Includes domain-specific data so generic agents can navigate.
+    """
+    case = session.task_case
+    task_type = session.task_type
+    obs = {
+        'task_type': task_type,
+        'task_id': session.task_id,
+        'task_subtype': case.get('task_subtype', 'standard'),
+        'turn': session.step_count + 1,
+        'done': done,
+        'last_reward': round(reward, 4),
+    }
+    if done:
+        obs['message'] = 'Episode complete.'
+        return obs
+    if task_type == 'security':
+        obs['task_description'] = case.get('task_description', '')
+        obs['code_snippet'] = case.get('tool_call', '')
+        atype = action.get('action_type', '')
+        # Provide reviewer feedback after propose_fix (for medium/hard)
+        if atype == 'propose_fix':
+            fb = case.get('reviewer_feedback', '')
+            if fb:
+                obs['reviewer_feedback'] = fb
+        elif atype == 'revise_fix':
+            # For hard tasks with feedback sequence
+            fb_seq = case.get('reviewer_feedback_sequence', [])
+            if fb_seq:
+                fb_idx = min(len(session.history), len(fb_seq) - 1)
+                if fb_idx >= 0:
+                    obs['reviewer_feedback'] = fb_seq[fb_idx]
+        obs['available_actions'] = [
+            {'name': 'identify_vulnerability',
+             'params': ['vuln_type:str', 'cvss_score:float', 'severity:str', 'affected_line:int']},
+            {'name': 'propose_fix',
+             'params': ['fix_code:str', 'explanation:str']},
+            {'name': 'revise_fix',
+             'params': ['fix_code:str', 'addressed_feedback:str']},
+        ]
+    elif task_type == 'dependency':
+        obs['task_description'] = case.get('task_description', '')
+        obs['code_snippet'] = case.get('code_snippet', '')
+        subtype = case.get('task_subtype', '')
+        if subtype == 'migrate':
+            obs['graph_break_report'] = case.get('graph_break_report', case.get('break_descriptions', []))
+            obs['available_actions'] = [
+                {'name': 'migrate_api', 'params': ['completed_items:list', 'code_changes:dict']},
+                {'name': 'validate_tree', 'params': ['completed_items:list']},
+            ]
+        elif subtype == 'resolve':
+            obs['conflict_packages'] = case.get('conflict_packages', [])
+            obs['available_actions'] = [
+                {'name': 'resolve_conflict', 'params': ['packages:dict', 'reasoning:str']},
+            ]
+        else:
+            obs['available_actions'] = [
+                {'name': 'flag_outdated',
+                 'params': ['packages:dict', 'deprecated_api:str|null', 'replacement:str|null']},
+            ]
+    elif task_type == 'clinical':
+        obs['task_description'] = case.get('task_description', '')
+        obs['patient_id'] = case.get('patient_id', '')
+        obs['events'] = case.get('events', case.get('patient_events', []))
+        obs['available_steps'] = case.get('available_steps', [])
+        if session.task_id in ('cli_medium', 'cli_hard'):
+            obs['dependency_graph'] = case.get('dependency_graph', {})
+        obs['available_actions'] = [
+            {'name': 'detect_gap', 'params': ['missing_steps:list', 'risk_level:str']},
+            {'name': 'rank_issues', 'params': ['priority_order:list']},
+            {'name': 'order_steps', 'params': ['recovery_steps:list']},
+        ]
+    return obs
+def _compute_score_details(action: Dict, session: SessionState) -> Dict[str, float]:
+    """Compute per-component score breakdown for UI display and judge transparency."""
+    atype = action.get('action_type', '')
+    case = session.task_case
+    details = {}
+    if session.task_type == 'security':
+        if atype == 'identify_vulnerability':
+            details['vuln_type_match'] = 1.0 if action.get('vuln_type') == case.get('expected_vuln_type') else 0.0
+            lo, hi = case.get('cvss_range', [0, 10])
+            try:
+                v = float(action.get('cvss_score', -1))
+                details['cvss_in_range'] = 1.0 if lo <= v <= hi else (0.5 if abs(v - (lo + hi) / 2) <= 3.0 else 0.0)
+            except (TypeError, ValueError):
+                details['cvss_in_range'] = 0.0
+            details['severity_match'] = 1.0 if action.get('severity') == case.get('expected_severity') else 0.0
+        elif atype == 'propose_fix':
+            tokens = case.get('required_fix_tokens', [])
+            if isinstance(tokens, dict):
+                tokens = tokens.get(case.get('expected_vuln_type', ''), [])
+            tokens = [t for t in tokens if isinstance(t, str)]
+            fix = action.get('fix_code', '')
+            details['token_coverage'] = sum(1 for t in tokens if t.lower() in fix.lower()) / max(len(tokens), 1) if fix else 0.0
+            key_id = case.get('must_preserve_identifier', '')
+            details['id_preserved'] = 1.0 if key_id and key_id in fix else 0.0
+        elif atype == 'revise_fix':
+            kws = case.get('current_feedback_keywords', [])
+            addressed = action.get('addressed_feedback', '')
+            details['feedback_addressed'] = sum(1 for kw in kws if kw.lower() in addressed.lower()) / max(len(kws), 1) if addressed else 0.0
+            orig = case.get('original_vuln_pattern', '')
+            fix = action.get('fix_code', '')
+            details['vuln_removed'] = 1.0 if orig and orig not in fix else 0.3
+    elif session.task_type == 'dependency':
+        if atype == 'flag_outdated':
+            expected = set(case.get('expected_outdated_packages', []))
+            provided = set(action.get('packages', {}).keys())
+            if expected:
+                tp = len(expected & provided)
+                p = tp / max(len(provided), 1)
+                r = tp / max(len(expected), 1)
+                details['pkg_f1'] = round(2 * p * r / max(p + r, 0.001), 4)
+            details['api_match'] = 1.0 if action.get('deprecated_api') == case.get('expected_deprecated_api') else 0.0
+        elif atype == 'resolve_conflict':
+            proposed = action.get('packages', {})
+            conflict = case.get('conflict_packages', [])
+            details['packages_proposed'] = len(proposed)
+            details['conflict_count'] = len(conflict)
+        elif atype in ('migrate_api', 'validate_tree'):
+            checklist = case.get('graph_breaks', [])
+            completed = action.get('completed_items', [])
+            details['items_completed'] = len(completed)
+            details['total_items'] = len(checklist)
+    elif session.task_type == 'clinical':
+        if atype == 'detect_gap':
+            expected = set(case.get('expected_missing_steps', []))
+            provided = set(action.get('missing_steps', []))
+            if expected:
+                tp = len(expected & provided)
+                p = tp / max(len(provided), 1)
+                r = tp / max(len(expected), 1)
+                details['step_f1'] = round(2 * p * r / max(p + r, 0.001), 4)
+            details['risk_match'] = 1.0 if action.get('risk_level') == case.get('expected_risk') else 0.0
+        elif atype == 'rank_issues':
+            expected = case.get('priority_order', [])
+            provided = action.get('priority_order', [])
+            details['ranking_overlap'] = len(set(expected) & set(provided)) / max(len(expected), 1) if expected else 0.0
+        elif atype == 'order_steps':
+            expected = case.get('required_steps', case.get('expected_missing_steps', []))
+            provided = action.get('recovery_steps', [])
+            details['steps_overlap'] = len(set(expected) & set(provided)) / max(len(expected), 1) if expected else 0.0
+    return details

server/session.py ADDED Viewed

	@@ -0,0 +1,41 @@

+# server/session.py
+# Foundation module — everything depends on this.
+# Manages episode state, task-to-domain mapping, and in-memory session storage.
+from dataclasses import dataclass, field
+from typing import List, Dict, Any
+import uuid
+@dataclass
+class SessionState:
+    """Holds all data for a single episode (one run of one task)."""
+    episode_id: str = field(default_factory=lambda: str(uuid.uuid4()))
+    task_type: str = ''        # 'security' | 'dependency' | 'clinical'
+    task_id: str = ''          # e.g. 'sec_easy'
+    task_case: Dict[str, Any] = field(default_factory=dict)   # ground truth — NEVER shared with agent
+    history: List[Dict] = field(default_factory=list)          # all past actions
+    last_actions: List[str] = field(default_factory=list)      # action_type strings for repetition penalty
+    step_count: int = 0
+    reward_acc: float = 0.0
+    done: bool = False
+# Maps each of the 9 task IDs to its domain
+TASK_TYPE_MAP = {
+    'sec_easy': 'security',   'sec_medium': 'security',   'sec_hard': 'security',
+    'dep_easy': 'dependency', 'dep_medium': 'dependency', 'dep_hard': 'dependency',
+    'cli_easy': 'clinical',   'cli_medium': 'clinical',   'cli_hard': 'clinical',
+}
+# In-memory store for all active sessions
+SESSIONS: Dict[str, SessionState] = {}
+def create_session(task_id: str, task_case: Dict) -> SessionState:
+    """Create a new session for a given task. Returns the SessionState object."""
+    s = SessionState()
+    s.task_id = task_id
+    s.task_type = TASK_TYPE_MAP.get(task_id, 'unknown')
+    s.task_case = task_case
+    return s

server/validation/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


1	+ # server/validation package

server/validation/validator.py ADDED Viewed

	@@ -0,0 +1,282 @@

+# server/validation/validator.py
+# 3-stage pre-action validation: Schema → Domain → Consistency.
+# IMPORTANT: Validator should HELP agents, not trap them.
+# - Auto-coerce types where possible (string "8.5" → float 8.5)
+# - Only hard-reject truly unrecoverable actions (wrong domain)
+# - Silently truncate oversized fields instead of rejecting
+# - Rich hints so agent can self-correct on next step
+from typing import Dict, Tuple
+VALID_VULN_TYPES = {
+    'sql_injection', 'xss', 'idor', 'hardcoded_secret', 'missing_auth',
+    'jwt_misuse', 'path_traversal', 'ssrf', 'rate_limit_missing', 'xxe'
+}
+VALID_SEVERITIES = {'critical', 'high', 'medium', 'low'}
+VALID_RISK_LEVELS = {'critical', 'high', 'medium', 'low'}
+# Which actions belong to which domain
+DOMAIN_ACTIONS = {
+    'security':   {'identify_vulnerability', 'propose_fix', 'revise_fix'},
+    'dependency': {'flag_outdated', 'resolve_conflict', 'migrate_api', 'validate_tree'},
+    'clinical':   {'detect_gap', 'rank_issues', 'order_steps'},
+}
+# Required fields and their types for each action
+ACTION_SCHEMAS = {
+    'identify_vulnerability': {
+        'vuln_type': str,
+        'cvss_score': (int, float),
+        'severity': str,
+    },
+    'propose_fix': {
+        'fix_code': str,
+        'explanation': str,
+    },
+    'revise_fix': {
+        'fix_code': str,
+        'addressed_feedback': str,
+    },
+    'flag_outdated': {
+        'packages': dict,
+        # deprecated_api and replacement are optional — handled below
+    },
+    'resolve_conflict': {
+        'packages': dict,
+        'reasoning': str,
+    },
+    'migrate_api': {
+        'completed_items': list,
+        'code_changes': dict,
+    },
+    'validate_tree': {
+        'completed_items': list,
+    },
+    'detect_gap': {
+        'missing_steps': list,
+        'risk_level': str,
+    },
+    'rank_issues': {
+        'priority_order': list,
+    },
+    'order_steps': {
+        'recovery_steps': list,
+    },
+}
+# Fields that are optional (won't cause hard rejection if missing)
+OPTIONAL_FIELDS = {
+    'flag_outdated': {'deprecated_api', 'replacement'},
+    'identify_vulnerability': {'affected_line'},
+}
+def _coerce(action: Dict, schema: Dict) -> Dict:
+    """Try to coerce field types before validating. Modifies action in-place.
+    This is critical for model compatibility — different LLMs output
+    numbers as strings, lists as comma-separated strings, etc.
+    """
+    for field, expected_type in schema.items():
+        if field not in action:
+            continue
+        val = action[field]
+        # Already correct type
+        if isinstance(val, expected_type):
+            continue
+        # Try coercions
+        try:
+            target = expected_type[0] if isinstance(expected_type, tuple) else expected_type
+            if target == float:
+                action[field] = float(val)
+            elif target == int:
+                action[field] = int(val)
+            elif target == str and not isinstance(val, str):
+                action[field] = str(val)
+            elif target == list and isinstance(val, str):
+                # Try JSON parse first, then comma split
+                try:
+                    import json as _j
+                    parsed = _j.loads(val)
+                    if isinstance(parsed, list):
+                        action[field] = parsed
+                except Exception:
+                    action[field] = [x.strip(' "\'') for x in val.strip('[]').split(',') if x.strip()]
+            elif target == dict and isinstance(val, str):
+                import json as _j
+                action[field] = _j.loads(val)
+        except Exception:
+            pass  # Leave as-is; domain check will catch real problems
+    return action
+def validate_action(action: Dict, session) -> Tuple[bool, Dict]:
+    """3-stage validation. Returns (is_valid, feedback_observation).
+    Philosophy: be lenient on format (coerce types), strict on cross-domain actions.
+    An action in the wrong domain = hard reject.
+    An action with slightly wrong types = coerce and pass through.
+    """
+    atype = action.get('action_type', '')
+    # ── Stage 1: Is this a known action type? ──
+    all_valid = set(ACTION_SCHEMAS.keys())
+    if atype not in all_valid:
+        return False, _fb(
+            'invalid_action_type',
+            f'Unknown action_type: {repr(atype)}',
+            session,
+            hint=f'Valid actions for {session.task_type}: {sorted(DOMAIN_ACTIONS.get(session.task_type, []))}',
+        )
+    # ── Cross-domain check FIRST (before coercion) ──
+    domain_valid = DOMAIN_ACTIONS.get(session.task_type, set())
+    if atype not in domain_valid:
+        return False, _fb(
+            'wrong_domain_action',
+            f'{repr(atype)} is not valid for task_type={repr(session.task_type)}',
+            session,
+            hint=f'Valid actions: {sorted(domain_valid)}',
+        )
+    # ── Coerce types before schema check (be helpful to all models) ──
+    schema = ACTION_SCHEMAS.get(atype, {})
+    action = _coerce(action, schema)
+    # ── Stage 2: Check required fields are present ──
+    optional = OPTIONAL_FIELDS.get(atype, set())
+    required_fields = [f for f in schema if f not in optional]
+    missing = [f for f in required_fields if f not in action]
+    if missing:
+        return False, _fb(
+            'missing_fields',
+            f'Missing required fields: {missing}',
+            session,
+            hint=f'Required for {atype}: {required_fields}',
+        )
+    # ── Stage 3: Domain value validation ──
+    errs = _domain_check(action, atype)
+    if errs:
+        return False, _fb(
+            'domain_error',
+            f'Invalid field values: {errs}',
+            session,
+            hint=_domain_hint(atype, errs),
+        )
+    # ── Stage 4: Consistency check ──
+    cons = _consistency_check(action, atype, session)
+    if cons:
+        return False, _fb('consistency_error', cons['message'], session, hint=cons['hint'])
+    return True, {}
+def _domain_check(action: Dict, atype: str) -> list:
+    """Check values are within allowed ranges/enums. Returns list of error dicts."""
+    errors = []
+    if atype == 'identify_vulnerability':
+        vt = action.get('vuln_type', '')
+        if vt not in VALID_VULN_TYPES:
+            errors.append({'field': 'vuln_type', 'value': vt, 'allowed': sorted(VALID_VULN_TYPES)})
+        try:
+            cvss = float(action.get('cvss_score', -1))
+            if not (0.0 <= cvss <= 10.0):
+                errors.append({'field': 'cvss_score', 'value': cvss, 'allowed': '0.0 to 10.0'})
+        except (TypeError, ValueError):
+            errors.append({'field': 'cvss_score', 'value': action.get('cvss_score'), 'allowed': '0.0 to 10.0'})
+        sev = action.get('severity', '')
+        if sev not in VALID_SEVERITIES:
+            errors.append({'field': 'severity', 'value': sev, 'allowed': sorted(VALID_SEVERITIES)})
+    elif atype in ('propose_fix', 'revise_fix'):
+        fix = action.get('fix_code', '')
+        if len(fix) > 2000:
+            # Silently truncate instead of rejecting — don't penalize verbose agents
+            action['fix_code'] = fix[:2000]
+    elif atype == 'detect_gap':
+        rl = action.get('risk_level', '')
+        if rl not in VALID_RISK_LEVELS:
+            errors.append({'field': 'risk_level', 'value': rl, 'allowed': sorted(VALID_RISK_LEVELS)})
+    elif atype == 'resolve_conflict':
+        pkgs = action.get('packages', {})
+        if not isinstance(pkgs, dict) or len(pkgs) == 0:
+            errors.append({'field': 'packages', 'issue': 'must be a non-empty dict of {package: version}'})
+    elif atype == 'migrate_api':
+        items = action.get('completed_items', [])
+        changes = action.get('code_changes', {})
+        if not isinstance(items, list) or len(items) == 0:
+            errors.append({'field': 'completed_items', 'issue': 'must be a non-empty list of break IDs'})
+        if not isinstance(changes, dict):
+            errors.append({'field': 'code_changes', 'issue': 'must be a dict of {break_id: fix_description}'})
+    return errors
+def _domain_hint(atype: str, errors: list) -> str:
+    """Generate a helpful hint for domain errors."""
+    fields = [e.get('field', '') for e in errors]
+    if 'vuln_type' in fields:
+        return "vuln_type must be one of: sql_injection, xss, idor, hardcoded_secret, missing_auth, jwt_misuse, path_traversal, ssrf, rate_limit_missing, xxe"
+    if 'severity' in fields:
+        return "severity must be one of: critical, high, medium, low"
+    if 'risk_level' in fields:
+        return "risk_level must be one of: critical, high, medium, low"
+    if 'cvss_score' in fields:
+        return "cvss_score must be a float between 0.0 and 10.0"
+    return f"Check field values for: {fields}"
+def _consistency_check(action: Dict, atype: str, session) -> dict:
+    """Check that action makes sense given session history."""
+    hist_types = [h.get('action_type') for h in session.history]
+    if atype == 'revise_fix' and 'propose_fix' not in hist_types:
+        return {
+            'message': 'Cannot call revise_fix before propose_fix',
+            'hint': 'Call propose_fix first, then revise_fix if you get reviewer feedback'
+        }
+    if atype == 'rank_issues' and 'detect_gap' not in hist_types:
+        return {
+            'message': 'Cannot call rank_issues before detect_gap',
+            'hint': 'Call detect_gap first, then rank_issues'
+        }
+    if atype == 'order_steps' and 'detect_gap' not in hist_types:
+        return {
+            'message': 'Cannot call order_steps before detect_gap',
+            'hint': 'Call detect_gap first, then rank_issues, then order_steps'
+        }
+    # Reject identical resolve_conflict proposals (infinite loop prevention)
+    if atype == 'resolve_conflict':
+        for prev in session.history:
+            if (prev.get('action_type') == 'resolve_conflict' and
+                    prev.get('packages') == action.get('packages', {})):
+                return {
+                    'message': 'Identical version proposal already submitted — this combination was rejected',
+                    'hint': 'Try different package versions. Check the compatibility_matrix in the observation.'
+                }
+    return {}
+def _fb(error_type: str, message: str, session, **kwargs) -> Dict:
+    """Build a feedback observation for validation failures."""
+    obs = {
+        'validation_failed': True,
+        'error_type': error_type,
+        'message': message,
+        'turn': session.step_count,
+        'task_type': session.task_type,
+        'task_id': getattr(session, 'task_id', ''),
+        'available_actions': sorted(DOMAIN_ACTIONS.get(session.task_type, [])),
+    }
+    obs.update(kwargs)
+    return obs

server/web_ui.py ADDED Viewed

	@@ -0,0 +1,365 @@

+# server/web_ui.py
+# Gradio UI with task descriptions, how-it-works, model performance tracking.
+import os
+import gradio as gr
+import requests
+import json
+import time
+from datetime import datetime
+ENV_URL = 'http://localhost:7860'
+RESULTS_FILE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'results', 'run_history.json')
+os.makedirs(os.path.dirname(RESULTS_FILE), exist_ok=True)
+# ── Task info for the UI ──
+TASK_INFO = {
+    'sec_easy': {
+        'name': '🔒 Security — Easy',
+        'desc': 'Identify a single vulnerability in a code snippet.\nThe agent must classify the vulnerability type (e.g., SQL injection, XSS), estimate the CVSS score, and determine severity.',
+        'domain': 'Security (MCP Sandbox)',
+        'example': '{"action_type":"identify_vulnerability","vuln_type":"sql_injection","cvss_score":9.1,"severity":"critical","affected_line":3}',
+    },
+    'sec_medium': {
+        'name': '🔒 Security — Medium',
+        'desc': 'Identify a vulnerability AND propose a secure code fix.\nThe agent performs vulnerability identification on turn 1, then proposes a fix on turn 2.',
+        'domain': 'Security (MCP Sandbox)',
+        'example': 'Turn 1: identify_vulnerability → Turn 2: propose_fix with fix_code',
+    },
+    'sec_hard': {
+        'name': '🔒 Security — Hard',
+        'desc': 'Identify → Fix → Revise based on reviewer feedback.\nMulti-turn: the agent must iteratively improve its fix when a reviewer provides feedback.',
+        'domain': 'Security (MCP Sandbox)',
+        'example': 'Turn 1: identify → Turn 2: propose_fix → Turn 3+: revise_fix (with reviewer feedback)',
+    },
+    'dep_easy': {
+        'name': '📦 Dependency — Easy',
+        'desc': 'Flag outdated packages and deprecated API usage.\nThe agent scans code for old package versions and deprecated function calls.',
+        'domain': 'PyTorch Migration',
+        'example': '{"action_type":"flag_outdated","packages":{"torch":"1.7.0"},"deprecated_api":"torch.no_grad","replacement":"torch.inference_mode"}',
+    },
+    'dep_medium': {
+        'name': '📦 Dependency — Medium',
+        'desc': 'Resolve version conflicts using a compatibility matrix.\nThe agent must propose compatible versions that satisfy cross-package constraints.',
+        'domain': 'PyTorch Migration',
+        'example': '{"action_type":"resolve_conflict","packages":{"torch":"2.1.0","numpy":"1.24.0"},"reasoning":"torch 2.1 requires numpy >= 1.24"}',
+    },
+    'dep_hard': {
+        'name': '📦 Dependency — Hard',
+        'desc': 'Fix torch.compile graph-break patterns in dependency order.\nThe agent must fix multiple graph-break issues in the correct order based on their dependencies.',
+        'domain': 'PyTorch Migration',
+        'example': '{"action_type":"migrate_api","completed_items":["break_1"],"code_changes":{"break_1":"replaced torch.no_grad with inference_mode"}}',
+    },
+    'cli_easy': {
+        'name': '🏥 Clinical — Easy',
+        'desc': 'Detect missing steps in a clinical workflow and assess risk.\nThe agent identifies which required steps are missing from a patient workflow.',
+        'domain': 'Clinical Workflow Recovery',
+        'example': '{"action_type":"detect_gap","missing_steps":["insurance_auth","pre_op_consent"],"risk_level":"critical"}',
+    },
+    'cli_medium': {
+        'name': '🏥 Clinical — Medium',
+        'desc': 'Detect gaps AND rank them by clinical priority.\nThe agent must both find missing steps and rank them by importance.',
+        'domain': 'Clinical Workflow Recovery',
+        'example': 'Turn 1: detect_gap → Turn 2: rank_issues with priority_order list',
+    },
+    'cli_hard': {
+        'name': '🏥 Clinical — Hard',
+        'desc': 'Plan a dependency-ordered recovery sequence.\nThe agent must respect the dependency graph when ordering recovery steps.',
+        'domain': 'Clinical Workflow Recovery',
+        'example': 'insurance_auth → pre_op_consent → specialist → surgery (respecting dependencies)',
+    },
+}
+def _load_history():
+    if os.path.exists(RESULTS_FILE):
+        try:
+            with open(RESULTS_FILE, 'r') as f:
+                return json.load(f)
+        except Exception:
+            return []
+    return []
+def _save_run(run_data):
+    history = _load_history()
+    history.append(run_data)
+    with open(RESULTS_FILE, 'w') as f:
+        json.dump(history, f, indent=2)
+def get_task_info(task_id):
+    """Return description for selected task."""
+    info = TASK_INFO.get(task_id, {})
+    return (
+        f"### {info.get('name', task_id)}\n\n"
+        f"**Domain:** {info.get('domain', '?')}\n\n"
+        f"{info.get('desc', '')}\n\n"
+        f"**Example action:**\n```json\n{info.get('example', '')}\n```"
+    )
+def run_single_task(task_id: str):
+    """Run a single task with the demo agent."""
+    from .demo_agent import demo_action
+    logs = []
+    rewards = []
+    r = requests.post(f'{ENV_URL}/reset', json={'task_id': task_id}, timeout=30).json()
+    ep_id = r.get('episode_id', '')
+    obs = r.get('observation', r)
+    logs.append(f'[START] task={task_id} episode={ep_id[:12]}...')
+    done = False
+    step = 0
+    while not done and step < 8:
+        action = demo_action(obs)
+        action['episode_id'] = ep_id
+        sr = requests.post(f'{ENV_URL}/step', json=action, timeout=30).json()
+        reward = sr.get('reward', 0.0)
+        done = sr.get('done', False)
+        obs = sr.get('observation', sr)
+        rewards.append(round(reward, 4))
+        atype = action.get('action_type', '?')
+        logs.append(f'  Step {step + 1}: action={atype}  reward={reward:.4f}  done={done}')
+        step += 1
+    total = round(sum(rewards), 4)
+    logs.append(f'[END] total_reward={total}  steps={step}')
+    return '\n'.join(logs), rewards, total
+def run_task_ui(task_id: str, model_name: str):
+    """Run a single task and return display outputs."""
+    if not model_name.strip():
+        model_name = 'Demo Agent (rule-based)'
+    log_str, rewards, total = run_single_task(task_id)
+    reward_lines = ['Reward per step:']
+    for i, r in enumerate(rewards):
+        bar = '█' * int(r * 20)
+        reward_lines.append(f'  Step {i + 1}: {bar}  {r:.4f}')
+    reward_str = '\n'.join(reward_lines)
+    info = TASK_INFO.get(task_id, {})
+    domain = info.get('domain', 'Unknown')
+    difficulty = task_id.split('_')[1].upper()
+    score = min(max(total / max(len(rewards), 1), 0), 1)
+    score_md = f'''### ✅ Results
+| Field | Value |
+|-------|-------|
+| **Model** | `{model_name}` |
+| **Task** | `{task_id}` |
+| **Domain** | {domain} |
+| **Difficulty** | {difficulty} |
+| **Score** | **{score:.4f}** |
+| **Total Reward** | {total:.4f} |
+| **Steps** | {len(rewards)} |
+'''
+    _save_run({
+        'model': model_name, 'task_id': task_id, 'domain': domain,
+        'total_reward': total, 'score': round(score, 4),
+        'steps': len(rewards), 'timestamp': datetime.now().isoformat(),
+    })
+    return log_str, reward_str, score_md
+def run_all_tasks_ui(model_name: str):
+    """Run all 9 tasks and return a performance dashboard."""
+    if not model_name.strip():
+        model_name = 'Demo Agent (rule-based)'
+    tasks = list(TASK_INFO.keys())
+    all_logs = []
+    all_scores = {}
+    for task_id in tasks:
+        log_str, rewards, total = run_single_task(task_id)
+        all_logs.append(log_str)
+        score = min(max(total / max(len(rewards), 1), 0), 1)
+        all_scores[task_id] = round(score, 4)
+    full_log = '\n\n'.join(all_logs)
+    sec = [all_scores[t] for t in tasks if t.startswith('sec')]
+    dep = [all_scores[t] for t in tasks if t.startswith('dep')]
+    cli = [all_scores[t] for t in tasks if t.startswith('cli')]
+    rows = []
+    for task_id, score in all_scores.items():
+        info = TASK_INFO.get(task_id, {})
+        bar = '█' * int(min(score, 1.0) * 15)
+        rows.append(f'| `{task_id}` | {info.get("domain", "?")} | {bar} | **{score:.4f}** |')
+    avg = sum(all_scores.values()) / 9
+    sec_avg = sum(sec) / 3
+    dep_avg = sum(dep) / 3
+    cli_avg = sum(cli) / 3
+    dashboard = f'''## 📊 Model Performance Dashboard
+**Model:** `{model_name}`
+**Time:** {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
+### Per-Task Scores
+| Task | Domain | Performance | Score |
+|------|--------|-------------|-------|
+{chr(10).join(rows)}
+### Domain Averages
+| Domain | Avg Score | Rating |
+|--------|-----------|--------|
+| 🔒 Security | {sec_avg:.4f} | {"🟢 Excellent" if sec_avg > 0.7 else "🟡 Good" if sec_avg > 0.4 else "🔴 Needs Work"} |
+| 📦 PyTorch Migration | {dep_avg:.4f} | {"🟢 Excellent" if dep_avg > 0.7 else "🟡 Good" if dep_avg > 0.4 else "🔴 Needs Work"} |
+| 🏥 Clinical Workflow | {cli_avg:.4f} | {"🟢 Excellent" if cli_avg > 0.7 else "🟡 Good" if cli_avg > 0.4 else "🔴 Needs Work"} |
+### Overall: **{avg:.4f}**
+'''
+    _save_run({
+        'model': model_name, 'type': 'full_run', 'scores': all_scores,
+        'avg': round(avg, 4), 'timestamp': datetime.now().isoformat(),
+    })
+    return full_log, dashboard
+def show_history():
+    history = _load_history()
+    if not history:
+        return 'No runs yet. Run a task first!'
+    lines = ['## 📜 Run History\n']
+    for i, run in enumerate(reversed(history[-10:])):
+        ts = run.get('timestamp', '?')[:19]
+        model = run.get('model', '?')
+        if run.get('type') == 'full_run':
+            avg = run.get('avg', 0)
+            lines.append(f'**#{len(history) - i}** | `{ts}` | `{model}` | All 9 tasks | Avg: **{avg:.4f}**')
+        else:
+            task = run.get('task_id', '?')
+            score = run.get('score', 0)
+            lines.append(f'**#{len(history) - i}** | `{ts}` | `{model}` | `{task}` | Score: **{score:.4f}**')
+    return '\n\n'.join(lines)
+def build_ui():
+    with gr.Blocks(title='Multi-Agent Dev Tools Env', theme=gr.themes.Soft()) as demo:
+        gr.Markdown('''# 🛠️ Multi-Agent Dev Tools Environment
+**A multi-domain RL environment for training AI agents on real-world tasks.**
+This environment tests AI agents across **3 domains** with **9 tasks** of increasing difficulty.
+Agents receive observations (problems), send actions (answers), and get reward scores (0.0 – 1.0).
+''')
+        with gr.Tab('🎯 Single Task'):
+            with gr.Row():
+                task_dd = gr.Dropdown(
+                    choices=list(TASK_INFO.keys()),
+                    value='sec_easy',
+                    label='🎯 Select Task',
+                )
+                model_input = gr.Textbox(
+                    label='🤖 Model Name',
+                    value='Demo Agent (rule-based)',
+                    placeholder='e.g. Qwen/Qwen2.5-72B-Instruct',
+                )
+                run_btn = gr.Button('▶️ Run Task', variant='primary', scale=1)
+            task_info_md = gr.Markdown(get_task_info('sec_easy'))
+            task_dd.change(fn=get_task_info, inputs=[task_dd], outputs=[task_info_md])
+            with gr.Row():
+                logs_box = gr.Textbox(label='📋 Episode Log', lines=10)
+                rewards_box = gr.Textbox(label='📊 Reward History', lines=10)
+            score_md = gr.Markdown('*Results will appear after running a task...*')
+            run_btn.click(
+                fn=run_task_ui,
+                inputs=[task_dd, model_input],
+                outputs=[logs_box, rewards_box, score_md],
+            )
+        with gr.Tab('🏆 Run All 9 Tasks'):
+            gr.Markdown('Run all 9 tasks at once and see a full performance dashboard with domain averages.')
+            with gr.Row():
+                model_all = gr.Textbox(
+                    label='🤖 Model Name',
+                    value='Demo Agent (rule-based)',
+                )
+                run_all_btn = gr.Button('🚀 Run All 9 Tasks', variant='primary')
+            all_logs = gr.Textbox(label='📋 Full Run Log', lines=12)
+            dashboard_md = gr.Markdown('*Dashboard will appear after running all tasks...*')
+            run_all_btn.click(
+                fn=run_all_tasks_ui,
+                inputs=[model_all],
+                outputs=[all_logs, dashboard_md],
+            )
+        with gr.Tab('📜 Run History'):
+            history_md = gr.Markdown('Click refresh to see past runs.')
+            refresh_btn = gr.Button('🔄 Refresh History')
+            refresh_btn.click(fn=show_history, outputs=[history_md])
+        with gr.Tab('📖 How It Works'):
+            gr.Markdown('''## How This Environment Works
+### Overview
+This is a **training gym for AI agents**. You build an agent, connect it to this environment
+via the API, and it gets scored on how well it solves real-world tasks.
+### The Flow
+```
+1. Agent calls POST /reset with a task_id → Gets an observation (the problem)
+2. Agent analyzes the observation and sends POST /step with its action
+3. Environment validates the action and grades it
+4. Returns a reward score (0.0 – 1.0) and the next observation
+5. Repeat until the episode ends (done=true) or max steps reached
+```
+### Three Domains
+| Domain | Tasks | What Agents Do |
+|--------|-------|---------------|
+| 🔒 **Security** | sec_easy, sec_medium, sec_hard | Identify vulnerabilities, propose fixes, revise based on feedback |
+| 📦 **Dependency** | dep_easy, dep_medium, dep_hard | Flag outdated packages, resolve conflicts, fix graph-breaks |
+| 🏥 **Clinical** | cli_easy, cli_medium, cli_hard | Detect workflow gaps, rank by priority, plan recovery |
+### Reward Signals
+- Scores range from **0.0** (completely wrong) to **1.0** (perfect)
+- Partial credit is awarded for partially correct answers
+- Invalid or malformed actions receive lower scores
+- The environment provides feedback on validation failures to help agents improve
+### API Endpoints
+| Method | Path | Description |
+|--------|------|-------------|
+| `GET /` | Health check | Returns status and task count |
+| `POST /reset` | Start episode | `{"task_id":"sec_easy"}` → observation |
+| `POST /step` | Submit action | `{action_type, ...}` → reward + next observation |
+| `GET /state` | Get state | Query current episode state |
+### Getting Started
+```python
+import requests
+# Start an episode
+resp = requests.post("http://localhost:7860/reset", json={"task_id": "sec_easy"})
+data = resp.json()
+episode_id = data["episode_id"]
+observation = data["observation"]
+# Send an action
+action = {"episode_id": episode_id, "action_type": "identify_vulnerability", ...}
+result = requests.post("http://localhost:7860/step", json=action)
+print(result.json())  # {"reward": 0.85, "done": true, "observation": {...}}
+```
+''')
+    return demo

tests/test_endpoints.py ADDED Viewed

	@@ -0,0 +1,131 @@

+# tests/test_endpoints.py
+# Basic endpoint tests for the environment.
+# Run: python -m pytest tests/ -v
+import requests
+import pytest
+BASE_URL = 'http://localhost:7860'
+def test_health_check():
+    """GET / should return 200 with status ok."""
+    r = requests.get(f'{BASE_URL}/')
+    assert r.status_code == 200
+    data = r.json()
+    assert data['status'] == 'ok'
+    assert data['tasks'] == 9
+def test_reset_valid_task():
+    """POST /reset with valid task_id should return episode_id and observation."""
+    r = requests.post(f'{BASE_URL}/reset', json={'task_id': 'sec_easy'})
+    assert r.status_code == 200
+    data = r.json()
+    assert 'episode_id' in data
+    assert 'observation' in data
+    assert data['observation']['task_type'] == 'security'
+def test_reset_all_tasks():
+    """POST /reset should work for all 9 task IDs."""
+    tasks = [
+        'sec_easy', 'sec_medium', 'sec_hard',
+        'dep_easy', 'dep_medium', 'dep_hard',
+        'cli_easy', 'cli_medium', 'cli_hard',
+    ]
+    for task_id in tasks:
+        r = requests.post(f'{BASE_URL}/reset', json={'task_id': task_id})
+        assert r.status_code == 200
+        data = r.json()
+        assert 'episode_id' in data, f'No episode_id for {task_id}'
+        assert 'observation' in data, f'No observation for {task_id}'
+def test_reset_invalid_task():
+    """POST /reset with invalid task_id should still return 200."""
+    r = requests.post(f'{BASE_URL}/reset', json={'task_id': 'nonexistent'})
+    assert r.status_code == 200
+def test_step_valid_action():
+    """POST /step with valid action should return reward and observation."""
+    # Reset first
+    r = requests.post(f'{BASE_URL}/reset', json={'task_id': 'sec_easy'})
+    ep_id = r.json()['episode_id']
+    # Step
+    action = {
+        'episode_id': ep_id,
+        'action_type': 'identify_vulnerability',
+        'vuln_type': 'sql_injection',
+        'cvss_score': 9.1,
+        'severity': 'critical',
+        'affected_line': 1,
+    }
+    r = requests.post(f'{BASE_URL}/step', json=action)
+    assert r.status_code == 200
+    data = r.json()
+    assert 'reward' in data
+    assert 'done' in data
+    assert 'observation' in data
+    assert 0.0 <= data['reward'] <= 1.0
+def test_step_invalid_episode():
+    """POST /step with invalid episode_id should return 200 with done=True."""
+    r = requests.post(f'{BASE_URL}/step', json={
+        'episode_id': 'nonexistent',
+        'action_type': 'identify_vulnerability',
+    })
+    assert r.status_code == 200
+    data = r.json()
+    assert data['done'] is True
+def test_state_endpoint():
+    """GET /state should return episode info."""
+    r = requests.post(f'{BASE_URL}/reset', json={'task_id': 'sec_easy'})
+    ep_id = r.json()['episode_id']
+    r = requests.get(f'{BASE_URL}/state', params={'episode_id': ep_id})
+    assert r.status_code == 200
+    data = r.json()
+    assert data['episode_id'] == ep_id
+    assert data['done'] is False
+def test_reward_range():
+    """Rewards should always be in [0.0, 1.0]."""
+    tasks = ['sec_easy', 'dep_easy', 'cli_easy']
+    for task_id in tasks:
+        r = requests.post(f'{BASE_URL}/reset', json={'task_id': task_id})
+        ep_id = r.json()['episode_id']
+        # Send an invalid action
+        r = requests.post(f'{BASE_URL}/step', json={
+            'episode_id': ep_id,
+            'action_type': 'invalid_action_type',
+        })
+        data = r.json()
+        assert 0.0 <= data['reward'] <= 1.0, f'Reward out of range for {task_id}'
+def test_step_enriched_observation():
+    """Step observations should include task context fields."""
+    r = requests.post(f'{BASE_URL}/reset', json={'task_id': 'sec_easy'})
+    ep_id = r.json()['episode_id']
+    action = {
+        'episode_id': ep_id,
+        'action_type': 'identify_vulnerability',
+        'vuln_type': 'sql_injection',
+        'cvss_score': 9.1,
+        'severity': 'critical',
+        'affected_line': 1,
+    }
+    r = requests.post(f'{BASE_URL}/step', json=action)
+    obs = r.json()['observation']
+    assert 'task_type' in obs
+    assert 'max_steps' in obs
+    assert 'steps_remaining' in obs

tests/test_grader_variance.py ADDED Viewed

	@@ -0,0 +1,138 @@

+# tests/test_grader_variance.py
+# Phase 2 of judging runs a variance check. If all graders return the same score
+# for different quality answers, the submission is DISQUALIFIED.
+# Run: python -m pytest tests/test_grader_variance.py -v
+import sys
+sys.path.insert(0, '.')
+from server.graders.base_grader import safe_score
+from server.graders.security_grader import compute_correctness as sec_cc
+from server.graders.dependency_grader import compute_correctness as dep_cc
+from server.graders.clinical_grader import compute_correctness as cli_cc
+# ── Security Case for Testing ──
+SEC_CASE = {
+    'expected_vuln_type': 'sql_injection',
+    'cvss_range': [7.5, 9.8],
+    'expected_severity': 'critical',
+    'required_fix_tokens': ['?', 'parameterized'],
+    'current_feedback_keywords': ['sql', 'injection'],
+    'original_vuln_pattern': 'query+',
+}
+def test_sec_identify_variance():
+    """Security grader must return 3+ different scores for different quality answers."""
+    perfect = {
+        'action_type': 'identify_vulnerability',
+        'vuln_type': 'sql_injection',
+        'cvss_score': 8.5,
+        'severity': 'critical',
+        'affected_line': 1,
+    }
+    partial = {
+        'action_type': 'identify_vulnerability',
+        'vuln_type': 'xss',           # wrong vuln_type
+        'cvss_score': 8.5,            # but correct CVSS
+        'severity': 'critical',       # and correct severity
+        'affected_line': 1,
+    }
+    wrong = {
+        'action_type': 'identify_vulnerability',
+        'vuln_type': 'xss',           # wrong everything
+        'cvss_score': 2.0,
+        'severity': 'low',
+        'affected_line': 1,
+    }
+    s1 = safe_score(sec_cc(perfect, SEC_CASE))
+    s2 = safe_score(sec_cc(partial, SEC_CASE))
+    s3 = safe_score(sec_cc(wrong, SEC_CASE))
+    assert len({round(s, 2) for s in [s1, s2, s3]}) >= 3, f'No variance: {s1},{s2},{s3}'
+    assert s1 > s2 > s3, f'Wrong ordering: {s1},{s2},{s3}'
+    print(f'  Security identify variance: {s1:.4f} > {s2:.4f} > {s3:.4f} PASS')
+def test_dep_resolve_variance():
+    """Dependency grader must return different scores for different quality answers."""
+    case = {
+        'conflict_packages': ['torch', 'numpy'],
+        'compatibility_matrix': {
+            'torch': {'2.1.0': {'numpy': '>=1.24'}, '1.9.0': {}},
+            'numpy': {'1.24.0': {}, '1.16.0': {}},
+        },
+        'requirements': {'torch': '1.9.0', 'numpy': '1.16.0'},
+    }
+    full = {'action_type': 'resolve_conflict', 'packages': {'torch': '2.1.0', 'numpy': '1.24.0'}, 'reasoning': 'ok'}
+    part = {'action_type': 'resolve_conflict', 'packages': {'torch': '2.1.0', 'numpy': '1.16.0'}, 'reasoning': 'ok'}
+    empty = {'action_type': 'resolve_conflict', 'packages': {}, 'reasoning': 'ok'}
+    s1 = safe_score(dep_cc(full, case))
+    s2 = safe_score(dep_cc(part, case))
+    s3 = safe_score(dep_cc(empty, case))
+    assert s1 > s2 >= s3, f'No variance: {s1},{s2},{s3}'
+    print(f'  Dependency resolve variance: {s1:.4f} > {s2:.4f} >= {s3:.4f} PASS')
+def test_cli_order_variance():
+    """Clinical grader must return different scores for correct vs violated dependency order."""
+    case = {
+        'dependency_graph': {
+            'schedule_surgery': ['resolve_insurance', 'complete_pre_op'],
+            'complete_pre_op': ['resolve_insurance'],
+            'resolve_insurance': [],
+        },
+        'required_steps': ['resolve_insurance', 'complete_pre_op', 'schedule_surgery'],
+    }
+    correct = {
+        'action_type': 'order_steps',
+        'recovery_steps': ['resolve_insurance', 'complete_pre_op', 'schedule_surgery'],
+    }
+    violated = {
+        'action_type': 'order_steps',
+        'recovery_steps': ['schedule_surgery', 'complete_pre_op', 'resolve_insurance'],
+    }
+    partial = {
+        'action_type': 'order_steps',
+        'recovery_steps': ['resolve_insurance', 'complete_pre_op'],
+    }
+    s1 = safe_score(cli_cc(correct, case))
+    s2 = safe_score(cli_cc(violated, case))
+    s3 = safe_score(cli_cc(partial, case))
+    assert s1 > s2, f'Violation not penalised: correct={s1}, violated={s2}'
+    assert s1 > s3, f'Completeness not rewarded: correct={s1}, partial={s3}'
+    print(f'  Clinical order variance: {s1:.4f} > violated:{s2:.4f}, partial:{s3:.4f} PASS')
+def test_safe_score_none():
+    """Bug 1 fix: safe_score(None) must return 0.0, not crash."""
+    assert safe_score(None) == 0.0
+    assert safe_score(1.5) == 1.0
+    assert safe_score(-0.5) == 0.0
+    assert safe_score('bad') == 0.0
+    print('  safe_score(None) guard: PASS')
+def test_clinical_valid_actions():
+    """Bug 2 fix: propose_recovery must NOT be in clinical VALID_ACTIONS."""
+    from server.graders.clinical_grader import VALID_ACTIONS
+    assert 'propose_recovery' not in VALID_ACTIONS, 'Bug 2 still present!'
+    assert set(VALID_ACTIONS) == {'detect_gap', 'rank_issues', 'order_steps'}
+    print('  Clinical VALID_ACTIONS (Bug 2): PASS')
+if __name__ == '__main__':
+    test_safe_score_none()
+    test_clinical_valid_actions()
+    test_sec_identify_variance()
+    test_dep_resolve_variance()
+    test_cli_order_variance()
+    print('\nALL VARIANCE TESTS PASSED ✅')

validate-submission.sh ADDED Viewed

	@@ -0,0 +1,70 @@

+#!/usr/bin/env bash
+#
+# validate-submission.sh — OpenEnv Submission Validator
+# Usage: ./validate-submission.sh <ping_url> [repo_dir]
+set -uo pipefail
+DOCKER_BUILD_TIMEOUT=600
+if [ -t 1 ]; then
+  RED='\033[0;31m'
+  GREEN='\033[0;32m'
+  YELLOW='\033[1;33m'
+  BOLD='\033[1m'
+  NC='\033[0m'
+else
+  RED='' GREEN='' YELLOW='' BOLD='' NC=''
+fi
+PING_URL="${1:-}"
+REPO_DIR="${2:-.}"
+if [ -z "$PING_URL" ]; then
+  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
+  exit 1
+fi
+REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"
+PING_URL="${PING_URL%/}"
+PASS=0
+log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
+pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
+fail() { log "${RED}FAILED${NC} -- $1"; }
+printf "\n${BOLD}========================================${NC}\n"
+printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+log "Repo:     $REPO_DIR"
+log "Ping URL: $PING_URL"
+# Step 1: Ping
+log "${BOLD}Step 1/3: Pinging HF Space${NC}"
+HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" -X POST \
+  -H "Content-Type: application/json" -d '{}' \
+  "$PING_URL/reset" --max-time 30 2>/dev/null || printf "000")
+if [ "$HTTP_CODE" = "200" ]; then
+  pass "HF Space is live"
+else
+  fail "HF Space /reset returned HTTP $HTTP_CODE"
+fi
+# Step 2: Docker build
+log "${BOLD}Step 2/3: Docker build${NC}"
+if command -v docker &>/dev/null; then
+  docker build "$REPO_DIR" && pass "Docker build succeeded" || fail "Docker build failed"
+else
+  fail "docker not found"
+fi
+# Step 3: openenv validate
+log "${BOLD}Step 3/3: openenv validate${NC}"
+if command -v openenv &>/dev/null; then
+  (cd "$REPO_DIR" && openenv validate) && pass "openenv validate passed" || fail "openenv validate failed"
+else
+  fail "openenv not found"
+fi
+printf "\n${BOLD}========================================${NC}\n"
+printf "${GREEN}${BOLD}  $PASS/3 checks passed${NC}\n"
+printf "${BOLD}========================================${NC}\n"