Elliot89 committed
Commit d5fc8a7 · 0 Parent(s)

feat: SRE incident response OpenEnv v0.1.0
Files changed (12)
  1. .gitignore +20 -0
  2. Dockerfile +27 -0
  3. README.md +155 -0
  4. graders.py +294 -0
  5. inference.py +204 -0
  6. openenv.yaml +53 -0
  7. requirements.txt +6 -0
  8. server/__init__.py +0 -0
  9. server/app.py +196 -0
  10. server/environment.py +309 -0
  11. server/models.py +71 -0
  12. tasks.py +664 -0
.gitignore ADDED
__pycache__/
*.pyc
*.pyo
.Python
*.egg-info/
dist/
build/
.env
.venv/
venv/
env/
*.log
.DS_Store
.pytest_cache/
.mypy_cache/
.ruff_cache/
tests/
conftest.py
pyproject.toml
*.sqlite
Dockerfile ADDED
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install Python dependencies first (maximizes Docker layer cache)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY tasks.py .
COPY graders.py .
COPY openenv.yaml .
COPY inference.py .
COPY server/ ./server/

# Ensure server is a Python package
RUN touch server/__init__.py

# HuggingFace Spaces requires port 7860
EXPOSE 7860

# Add project root to Python path
ENV PYTHONPATH=/app

# Start the FastAPI server
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md ADDED
---
title: SRE Incident Response OpenEnv
emoji: 🚨
colorFrom: red
colorTo: blue
sdk: docker
app_port: 7860
tags:
- openenv
- sre
- devops
- incident-response
- real-world
- agentic
---

# SRE Incident Response — OpenEnv Environment

An OpenEnv environment for training and evaluating AI agents on Site Reliability Engineering (SRE) tasks. Agents handle realistic production incident scenarios: triaging alerts, identifying root causes through log/metric correlation, and executing remediation runbooks to resolve cascading failures.

## Why This Environment

Every cloud company employs SREs who respond to production incidents under time pressure. This environment simulates the decision loop an on-call SRE follows:

1. **Triage** — Read the alert payload, assess blast radius, classify severity (P1–P4)
2. **Investigate** — Query logs, metrics, dependency graphs, and recent deploys
3. **Diagnose** — Correlate signals across services to find the root cause
4. **Remediate** — Execute the correct runbook steps in the right order
5. **Document** — Submit a resolution summary for post-incident review

Scenarios include cascading DB failures, CDN cache storms, OOM kills, and BGP network partitions — all modeled on real production incident patterns.

## Tasks

| Task ID | Difficulty | Max Steps | Description |
|---|---|---|---|
| `alert_classification` | Easy | 3 | Classify alert severity (P1–P4) from metrics and symptoms |
| `root_cause_analysis` | Medium | 10 | Trace logs/metrics/deps to find the root-cause service and failure mode |
| `remediation_planning` | Hard | 15 | Diagnose, remediate, and document a full incident resolution |

Each task has two scenarios:

| Scenario | Incident Type |
|---|---|
| AC-001 | Cascading DB connection pool exhaustion (postgres → auth → api-gateway) |
| AC-002 | CDN cache invalidation storm (misconfigured purge → 40× origin traffic) |
| RCA-001 | Postgres OOM kill by a runaway analytics query |
| RCA-002 | BGP route withdrawal → AZ network partition → 61% checkout failures |
| RP-001 | Full OOM remediation (stop job → restart DB → restore services) |
| RP-002 | Full BGP remediation (restore routes → roll back config → verify recovery) |

## Action Space

**Diagnostic:**
```json
{"action_type": "query_logs", "parameters": {"service": "postgres-db"}}
{"action_type": "check_metrics", "parameters": {"service": "auth-service"}}
{"action_type": "check_dependencies", "parameters": {"service": "api-gateway"}}
{"action_type": "check_recent_deploys", "parameters": {"service": "analytics-service"}}
{"action_type": "check_service_status", "parameters": {"service": "payment-service"}}
```

**Remediation:**
```json
{"action_type": "restart_service", "parameters": {"service": "postgres-db"}}
{"action_type": "rollback_deploy", "parameters": {"service": "network-infra", "target_version": "previous"}}
{"action_type": "scale_service", "parameters": {"service": "image-service", "replicas": 10}}
{"action_type": "disable_feature_flag", "parameters": {"flag": "full_history_export"}}
{"action_type": "execute_runbook_step", "parameters": {"runbook_action": "restore_bgp_routes"}}
```

**Submission:**
```json
{"action_type": "submit_severity", "parameters": {"severity": "P1", "service": "postgres-db"}}
{"action_type": "submit_root_cause", "parameters": {"service": "analytics-service", "failure_mode": "unbounded query OOM"}}
{"action_type": "submit_resolution", "parameters": {"summary": "Disabled analytics cron job, restarted postgres-db..."}}
```

## Observation Space

Each step returns:

| Field | Type | Description |
|---|---|---|
| `episode_id` | string | Unique episode UUID |
| `task_id` | string | Active task |
| `scenario_id` | string | Scenario identifier (e.g. `AC-001`) |
| `step_count` / `max_steps` | int | Current step and step budget |
| `incident_summary` | string | Plain-text incident description |
| `alert` | dict | Alert payload with severity, affected services, and symptoms |
| `available_actions` | list | Valid action types for this task |
| `queried_data` | dict | All tool responses gathered so far |
| `cumulative_reward` | float | Running reward total |
| `done` | bool | Episode terminal flag |
| `feedback` | string | Per-step feedback string |

## Reward Function

| Event | Reward |
|---|---|
| Query known service (first time) | +0.05 |
| Query known service (repeat) | +0.01 |
| Query unknown service | -0.05 |
| Correct remediation action | +0.10 |
| Wrong remediation action | -0.10 |
| Step past halfway (non-submit) | -0.02 |
| Timeout without submission | -0.10 |
| Grader score (on terminal step) | 0.0–1.0 |

## API Endpoints

| Method | Path | Description |
|---|---|---|
| GET | `/health` | `{"status": "ok", "version": "0.1.0"}` |
| POST | `/reset?task_id=...&scenario_index=...` | Start a new episode |
| POST | `/step` | Submit an action (JSON body) |
| GET | `/state` | Full current episode state |
| GET | `/tasks` | All tasks with schemas |
| GET | `/grader` | Score the current episode (0.0–1.0) |
| POST | `/baseline` | Run `inference.py` and return scores |

## Setup

```bash
# Local development
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860

# Docker
docker build -t sre-incident-env .
docker run -p 7860:7860 \
  -e API_BASE_URL="https://api.groq.com/openai/v1" \
  -e MODEL_NAME="llama-3.1-8b-instant" \
  -e HF_TOKEN="your_api_key" \
  sre-incident-env

# Run baseline inference
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
export HF_TOKEN="your_groq_key"
python inference.py
```

## Baseline Scores

Using `llama-3.1-8b-instant` via Groq:

| Task | Score |
|---|---|
| `alert_classification` | ~0.75 |
| `root_cause_analysis` | ~0.35 |
| `remediation_planning` | ~0.20 |
| **overall** | **~0.43** |

*Run `python inference.py` to reproduce.*
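To make the query-side rows of the reward table concrete, they can be replayed over a sample action trace. This is an illustrative sketch only, not the environment's actual reward code; `KNOWN_SERVICES` and the example trace are made up:

```python
# Illustrative reward accounting for the diagnostic-query events in the
# table above. Simplified sketch: KNOWN_SERVICES is a stand-in for the
# scenario's real service list.
KNOWN_SERVICES = {"postgres-db", "auth-service", "api-gateway"}

def trace_reward(actions, max_steps=10):
    """Apply per-event rewards to a list of (action_type, service) pairs."""
    total, seen = 0.0, set()
    for i, (action_type, service) in enumerate(actions, start=1):
        if action_type.startswith(("query_", "check_")):
            if service not in KNOWN_SERVICES:
                total -= 0.05          # query unknown service
            elif service in seen:
                total += 0.01          # repeat query of a known service
            else:
                total += 0.05          # first query of a known service
                seen.add(service)
        if i > max_steps // 2 and not action_type.startswith("submit_"):
            total -= 0.02              # step past halfway without submitting
    return round(total, 4)

print(trace_reward([
    ("query_logs", "postgres-db"),     # +0.05
    ("query_logs", "postgres-db"),     # +0.01
    ("check_metrics", "unknown-svc"),  # -0.05
]))  # → 0.01
```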
graders.py ADDED
"""
graders.py — Deterministic graders for all 3 SRE Incident Response tasks.

Public API:
    grade(task_id, state, scenario) -> {"total": float, "breakdown": dict, "feedback": str}

All scores are in [0.0, 1.0]. Graders are deterministic and reproducible.
"""

from __future__ import annotations


def grade(task_id: str, state: dict, scenario: dict) -> dict:
    """
    Entry point. Routes to the correct task grader.

    Args:
        task_id: One of alert_classification, root_cause_analysis, remediation_planning
        state: Current episode state dict from IncidentEnvironment
        scenario: The scenario dict that was loaded for this episode

    Returns:
        {
            "total": float in [0.0, 1.0],
            "breakdown": dict of sub-scores,
            "feedback": human-readable string
        }
    """
    graders = {
        "alert_classification": _grade_alert_classification,
        "root_cause_analysis": _grade_root_cause_analysis,
        "remediation_planning": _grade_remediation_planning,
    }
    if task_id not in graders:
        return {"total": 0.0, "breakdown": {}, "feedback": f"Unknown task_id: {task_id}"}
    return graders[task_id](state, scenario)


# ── Task 1: Alert Classification ────────────────────────────────────────────

def _grade_alert_classification(state: dict, scenario: dict) -> dict:
    """
    Scoring:
        1.0  — exact severity match
        0.5  — adjacent severity (one level off)
        0.25 — two levels off
        0.0  — opposite end or no submission
    """
    action_history = state.get("action_history", [])
    correct = scenario.get("correct_severity", "P1")
    adjacent = scenario.get("adjacent_severities", [])

    submitted_severity = None
    for action in action_history:
        if action.get("action_type") == "submit_severity":
            submitted_severity = (
                action.get("parameters", {}).get("severity", "")
                .upper()
                .strip()
            )
            break

    if not submitted_severity:
        return {
            "total": 0.0,
            "breakdown": {"severity_match": 0.0, "submitted": False},
            "feedback": "No severity submitted — score 0.0",
        }

    severity_order = ["P1", "P2", "P3", "P4"]

    if submitted_severity == correct:
        score = 1.0
        feedback = f"Exact match: {submitted_severity} == {correct}"
    elif submitted_severity in adjacent:
        score = 0.5
        feedback = f"Adjacent severity: submitted {submitted_severity}, correct {correct}"
    else:
        # Distance-based fallback; keeps the docstring contract even when the
        # scenario's adjacent_severities list is incomplete
        try:
            dist = abs(severity_order.index(submitted_severity) - severity_order.index(correct))
        except ValueError:
            dist = 4
        if dist == 1:
            score = 0.5
        elif dist == 2:
            score = 0.25
        else:
            score = 0.0
        feedback = f"Wrong severity: submitted {submitted_severity}, correct {correct} (dist={dist})"

    return {
        "total": score,
        "breakdown": {
            "submitted_severity": submitted_severity,
            "correct_severity": correct,
            "severity_match": score,
        },
        "feedback": feedback,
    }


# ── Task 2: Root Cause Analysis ─────────────────────────────────────────────

def _grade_root_cause_analysis(state: dict, scenario: dict) -> dict:
    """
    Scoring:
        Base score (0.0–0.6):
            0.6  — correct service AND correct failure_mode
            0.35 — correct service only
            0.0  — wrong service
        Efficiency bonus (0.0–0.4):
            Based on how many unique relevant services were queried before submitting.
            More targeted = higher bonus (penalises random querying).
    """
    action_history = state.get("action_history", [])
    correct_rc = scenario.get("correct_root_cause", {})
    correct_service = correct_rc.get("service", "").lower().strip()
    correct_mode = correct_rc.get("failure_mode", "").lower().strip()
    known_services = {s.lower() for s in scenario.get("known_services", set())}

    # Find the submit_root_cause action
    submitted_service = ""
    submitted_mode = ""
    submit_step = None
    for action in action_history:
        if action.get("action_type") == "submit_root_cause":
            params = action.get("parameters", {})
            submitted_service = params.get("service", "").lower().strip()
            submitted_mode = params.get("failure_mode", "").lower().strip()
            submit_step = action.get("step", len(action_history))
            break

    if not submitted_service:
        return {
            "total": 0.0,
            "breakdown": {"base": 0.0, "efficiency": 0.0, "submitted": False},
            "feedback": "No root cause submitted — score 0.0",
        }

    # Base score
    service_match = submitted_service == correct_service
    mode_keywords = [w for w in correct_mode.split() if len(w) > 3]
    if mode_keywords:
        mode_match = service_match and any(kw in submitted_mode for kw in mode_keywords)
    else:
        mode_match = service_match

    if mode_match:
        base = 0.6
        base_feedback = f"Correct service ({submitted_service}) + failure mode matched"
    elif service_match:
        base = 0.35
        base_feedback = f"Correct service ({submitted_service}) but failure mode unclear"
    else:
        base = 0.0
        base_feedback = f"Wrong service: submitted '{submitted_service}', correct '{correct_service}'"

    # Efficiency bonus — only awarded if the service was correct
    efficiency = 0.0
    if service_match and submit_step is not None:
        diagnostic_actions = {"query_logs", "check_metrics", "check_dependencies",
                              "check_recent_deploys", "check_service_status"}
        queried = {
            a.get("parameters", {}).get("service", "").lower()
            for a in action_history[:submit_step]
            if a.get("action_type") in diagnostic_actions
        }
        relevant_queried = queried & known_services
        # Reward querying relevant services efficiently:
        # full bonus for querying 2-3 key services, less for spraying all services
        total_queries = sum(
            1 for a in action_history[:submit_step]
            if a.get("action_type") in diagnostic_actions
        )
        if total_queries > 0:
            precision = len(relevant_queried) / max(total_queries, 1)
            efficiency = round(min(0.4, precision * 0.4 + min(len(relevant_queried), 3) * 0.05), 4)

    total = round(min(1.0, base + efficiency), 4)

    return {
        "total": total,
        "breakdown": {
            "base": base,
            "efficiency_bonus": efficiency,
            "service_match": service_match,
            "mode_match": mode_match,
            "submitted_service": submitted_service,
            "correct_service": correct_service,
        },
        "feedback": f"{base_feedback} | efficiency bonus: {efficiency:.2f} | total: {total:.2f}",
    }


# ── Task 3: Remediation Planning ────────────────────────────────────────────

def _grade_remediation_planning(state: dict, scenario: dict) -> dict:
    """
    Scoring:
        Resolution base (0.0 or 0.6):
            0.6 — submit_resolution with a non-empty summary after ≥1 investigation action
        Efficiency bonus (0.0–0.3):
            Fraction of correct remediation actions executed (from correct_remediation_sequence)
        Wrong action penalty (up to -0.15):
            -0.05 per wrong action (capped at -0.15)
        Summary quality bonus (0.0–0.1):
            +0.1 if the summary contains ≥3 resolution keywords from the scenario
    """
    action_history = state.get("action_history", [])
    correct_seq = scenario.get("correct_remediation_sequence", [])
    wrong_actions_map = scenario.get("wrong_actions", {})
    resolution_keywords = scenario.get("resolution_keywords", [])

    diagnostic_actions = {"query_logs", "check_metrics", "check_dependencies",
                          "check_recent_deploys", "check_service_status"}
    remediation_actions = {"restart_service", "rollback_deploy", "scale_service",
                           "disable_feature_flag", "clear_cache", "execute_runbook_step"}

    # Find submit_resolution
    submitted_summary = ""
    for action in action_history:
        if action.get("action_type") == "submit_resolution":
            submitted_summary = action.get("parameters", {}).get("summary", "")
            break

    investigation_count = sum(
        1 for a in action_history
        if a.get("action_type") in diagnostic_actions | remediation_actions
    )

    if not submitted_summary or investigation_count < 1:
        return {
            "total": 0.0,
            "breakdown": {"base": 0.0, "efficiency": 0.0, "penalty": 0.0, "summary": 0.0},
            "feedback": "No resolution submitted or no investigation — score 0.0",
        }

    base = 0.6

    # Efficiency bonus — which correct actions were executed?
    executed_action_keys = set()
    for a in action_history:
        at = a.get("action_type", "")
        svc = a.get("parameters", {}).get("service", "")
        flag = a.get("parameters", {}).get("flag", "")
        step_action = a.get("parameters", {}).get("runbook_action", "")
        target = a.get("parameters", {}).get("target", "")
        # Build key variants that match the correct_remediation_sequence format
        executed_action_keys.add(at)
        if svc:
            executed_action_keys.add(f"{at}:{svc}")
        if flag:
            executed_action_keys.add(f"{at}:{flag}")
        if step_action:
            executed_action_keys.add(f"execute_runbook_step:{step_action}")
        if target:
            executed_action_keys.add(f"execute_runbook_step:{target}")

    matched = sum(1 for key in correct_seq if key in executed_action_keys)
    efficiency = round((matched / len(correct_seq)) * 0.3, 4) if correct_seq else 0.0

    # Wrong action penalty
    wrong_count = 0
    for a in action_history:
        at = a.get("action_type", "")
        svc = a.get("parameters", {}).get("service", "")
        key1 = at
        key2 = f"{at}:{svc}"
        if key1 in wrong_actions_map or key2 in wrong_actions_map:
            wrong_count += 1
    penalty = round(min(0.15, wrong_count * 0.05), 4)

    # Summary quality bonus (case-insensitive keyword match)
    summary_lower = submitted_summary.lower()
    keyword_hits = sum(1 for kw in resolution_keywords if kw.lower() in summary_lower)
    summary_bonus = 0.1 if keyword_hits >= 3 else 0.05 if keyword_hits >= 1 else 0.0

    total = round(max(0.0, min(1.0, base + efficiency - penalty + summary_bonus)), 4)

    return {
        "total": total,
        "breakdown": {
            "base": base,
            "efficiency_bonus": efficiency,
            "wrong_action_penalty": -penalty,
            "summary_bonus": summary_bonus,
            "correct_actions_matched": matched,
            "correct_actions_total": len(correct_seq),
            "wrong_actions_count": wrong_count,
            "summary_keywords_hit": keyword_hits,
        },
        "feedback": (
            f"base={base} | efficiency={efficiency:.2f} ({matched}/{len(correct_seq)} correct actions) "
            f"| penalty=-{penalty:.2f} | summary_bonus={summary_bonus:.2f} | total={total:.2f}"
        ),
    }
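The severity-distance fallback used by the alert-classification grader reduces to a small pure function. A simplified standalone sketch (it ignores any per-scenario `adjacent_severities` list and scores purely by distance in the P1-P4 ordering):

```python
# Simplified severity-distance scoring: exact match = 1.0, one level off
# = 0.5, two levels off = 0.25, otherwise 0.0. A sketch of the rule, not
# the grader's full logic.
SEVERITY_ORDER = ["P1", "P2", "P3", "P4"]

def severity_score(submitted: str, correct: str) -> float:
    try:
        dist = abs(SEVERITY_ORDER.index(submitted.upper().strip())
                   - SEVERITY_ORDER.index(correct.upper().strip()))
    except ValueError:  # unrecognized severity label
        return 0.0
    return {0: 1.0, 1: 0.5, 2: 0.25}.get(dist, 0.0)

print(severity_score("p2", "P1"))  # → 0.5
```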
inference.py ADDED
"""
inference.py — OpenEnv Hackathon baseline inference script.

Required env vars (set in HF Space secrets or .env):
    API_BASE_URL   OpenAI-compatible LLM endpoint
    MODEL_NAME     Model identifier
    HF_TOKEN       API key for the LLM endpoint

Runs the agent against all 3 tasks × 2 scenarios each.
The final stdout line is valid JSON — required by the hackathon validator.

Usage:
    export API_BASE_URL="https://api.groq.com/openai/v1"
    export MODEL_NAME="llama-3.1-8b-instant"
    export HF_TOKEN="gsk_your_key_here"
    python inference.py
"""

from __future__ import annotations

import json
import os
import sys

import requests
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

# ── Config from env vars (hackathon required names) ──────────────────────────
API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.groq.com/openai/v1")
MODEL_NAME = os.environ.get("MODEL_NAME", "llama-3.1-8b-instant")
HF_TOKEN = os.environ.get("HF_TOKEN", "")
ENV_BASE_URL = os.environ.get("ENV_BASE_URL", "http://localhost:7860")

if not HF_TOKEN:
    print("[WARN] HF_TOKEN is not set — LLM calls will fail.", file=sys.stderr)

client = OpenAI(api_key=HF_TOKEN, base_url=API_BASE_URL)

# ── System prompt ─────────────────────────────────────────────────────────────
SYSTEM_PROMPT = """You are an expert Site Reliability Engineer (SRE) responding to a live production incident.

You receive an incident observation as JSON. Respond with ONLY a single valid JSON action object — no markdown, no explanation.

Available action_types and their parameters:
Diagnostic (gather info):
{"action_type": "query_logs", "parameters": {"service": "<name>"}}
{"action_type": "check_metrics", "parameters": {"service": "<name>"}}
{"action_type": "check_dependencies", "parameters": {"service": "<name>"}}
{"action_type": "check_recent_deploys", "parameters": {"service": "<name>"}}
{"action_type": "check_service_status", "parameters": {"service": "<name>"}}

Remediation (fix the issue):
{"action_type": "restart_service", "parameters": {"service": "<name>"}}
{"action_type": "rollback_deploy", "parameters": {"service": "<name>", "target_version": "previous"}}
{"action_type": "scale_service", "parameters": {"service": "<name>", "replicas": 5}}
{"action_type": "disable_feature_flag", "parameters": {"flag": "<flag_name>"}}
{"action_type": "clear_cache", "parameters": {"service": "<name>"}}
{"action_type": "execute_runbook_step", "parameters": {"runbook_action": "<action>", "target": "<name>"}}

Submission (end the episode — choose ONE based on task):
{"action_type": "submit_severity", "parameters": {"severity": "P1|P2|P3|P4", "service": "<root_cause_service>"}}
{"action_type": "submit_root_cause", "parameters": {"service": "<root_cause>", "failure_mode": "<what_went_wrong>"}}
{"action_type": "submit_resolution", "parameters": {"summary": "<full description of what happened and what you did>"}}

Strategy by task:
alert_classification (max 3 steps): Query 1-2 services for evidence, then submit_severity.
root_cause_analysis (max 10 steps): Query logs/metrics/deps for multiple services, trace the failure chain, then submit_root_cause.
remediation_planning (max 15 steps): Investigate, execute fix actions, then submit_resolution with a detailed summary.

Output ONLY the JSON object. Nothing else."""


def _format_obs(obs: dict) -> str:
    parts = [
        f"TASK: {obs.get('task_id')} | Step {obs.get('step_count')}/{obs.get('max_steps')}",
        f"INCIDENT: {obs.get('incident_summary', '')}",
    ]
    alert = obs.get("alert", {})
    if alert:
        parts.append("ALERT:\n" + json.dumps(alert, indent=2))
    if obs.get("available_actions"):
        parts.append(f"AVAILABLE ACTIONS: {obs['available_actions']}")
    if obs.get("queried_data"):
        parts.append("DATA GATHERED:\n" + json.dumps(obs["queried_data"], indent=2))
    parts.append(f"LAST REWARD: {obs.get('cumulative_reward', 0.0)}")
    parts.append(f"FEEDBACK: {obs.get('feedback', '')}")
    return "\n\n".join(parts)


def _parse_action(text: str) -> dict:
    text = text.strip()
    # Strip markdown code fences if present
    if text.startswith("```"):
        lines = [line for line in text.splitlines() if not line.startswith("```")]
        text = "\n".join(lines).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}") + 1
        if start != -1 and end > start:
            return json.loads(text[start:end])
        raise


def _run_episode(task_id: str, scenario_index: int) -> float:
    r = requests.post(
        f"{ENV_BASE_URL}/reset",
        params={"task_id": task_id, "scenario_index": scenario_index},
        timeout=30,
    )
    r.raise_for_status()
    obs = r.json()

    messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    for _step in range(obs.get("max_steps", 10)):
        messages.append({"role": "user", "content": _format_obs(obs)})

        response = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            temperature=0.0,
            max_tokens=256,
        )
        raw = response.choices[0].message.content or ""
        messages.append({"role": "assistant", "content": raw})

        try:
            action = _parse_action(raw)
        except Exception as e:
            print(f"  [WARN] parse failed at step {_step + 1}: {e}", file=sys.stderr)
            # Graceful fallback per task
            if task_id == "alert_classification":
                action = {"action_type": "submit_severity",
                          "parameters": {"severity": "P2", "service": "unknown"}}
            elif task_id == "root_cause_analysis":
                action = {"action_type": "submit_root_cause",
                          "parameters": {"service": "unknown", "failure_mode": "unknown"}}
            else:
                action = {"action_type": "submit_resolution",
                          "parameters": {"summary": "Unable to determine root cause."}}

        step_r = requests.post(
            f"{ENV_BASE_URL}/step",
            json=action,
            headers={"Content-Type": "application/json"},
            timeout=30,
        )
        step_r.raise_for_status()
        result = step_r.json()
        obs = result["observation"]

        if result.get("done"):
            break

    # Get the final grader score
    g = requests.get(f"{ENV_BASE_URL}/grader", timeout=30)
    g.raise_for_status()
    return g.json().get("total", 0.0)


def main():
    runs = [
        ("alert_classification", 0),
        ("alert_classification", 1),
        ("root_cause_analysis", 0),
        ("root_cause_analysis", 1),
        ("remediation_planning", 0),
        ("remediation_planning", 1),
    ]

    results: dict[str, list[float]] = {}

    print(f"{'Task':<30} {'Scenario':>8} {'Score':>8}")
    print("-" * 52)

    for task_id, scenario_index in runs:
        try:
            score = _run_episode(task_id, scenario_index)
        except Exception as e:
            print(f"  [ERROR] {task_id} s{scenario_index}: {e}", file=sys.stderr)
            score = 0.0

        label = f"{task_id} [s{scenario_index}]"
        print(f"{label:<30} {scenario_index:>8} {score:>8.4f}")
        results.setdefault(task_id, []).append(score)

    print("-" * 52)
    summary = {task: round(sum(v) / len(v), 4) for task, v in results.items()}
    summary["overall"] = round(sum(summary.values()) / len(summary), 4)

    print("\nBaseline Summary:")
    for k, v in summary.items():
        print(f"  {k:<30}: {v:.4f}")

    # The final line must be valid JSON — parsed by the /baseline endpoint
    print(json.dumps(summary))


if __name__ == "__main__":
    main()
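The fence-stripping logic in `_parse_action` handles the three output shapes small models commonly produce: bare JSON, fenced JSON, and JSON embedded in prose. A standalone copy of the same approach, for experimenting outside the episode loop:

```python
import json

def parse_action(text: str) -> dict:
    """Extract a JSON action object from raw LLM output, tolerating
    markdown code fences and surrounding prose."""
    text = text.strip()
    if text.startswith("```"):
        # Drop the fence lines, keep the payload between them
        text = "\n".join(
            line for line in text.splitlines() if not line.startswith("```")
        ).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span
        start, end = text.find("{"), text.rfind("}") + 1
        if start != -1 and end > start:
            return json.loads(text[start:end])
        raise

action = parse_action('```json\n{"action_type": "query_logs", "parameters": {"service": "db"}}\n```')
print(action["action_type"])  # → query_logs
```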
openenv.yaml ADDED
name: sre-incident-response
version: "0.1.0"
app_port: 7860
description: >
  OpenEnv environment simulating SRE/DevOps on-call incident response.
  An AI agent classifies alert severity, performs root cause analysis
  through log/metric queries, and executes remediation actions to
  resolve production incidents end-to-end.
author: Elliot89
license: MIT
tags:
  - openenv
  - sre
  - devops
  - incident-response
  - real-world
  - agentic

tasks:
  - id: alert_classification
    name: "Task 1: Alert Severity Classification"
    difficulty: easy
    max_steps: 3
    score_range: [0.0, 1.0]
    description: >
      Classify incoming alert severity (P1–P4) using diagnostic tools.

  - id: root_cause_analysis
    name: "Task 2: Root Cause Analysis"
    difficulty: medium
    max_steps: 10
    score_range: [0.0, 1.0]
    description: >
      Trace a live incident through logs/metrics/dependencies
      to identify the exact root-cause service and failure mode.

  - id: remediation_planning
    name: "Task 3: Incident Remediation"
    difficulty: hard
    max_steps: 15
    score_range: [0.0, 1.0]
    description: >
      Fully resolve a production incident: diagnose, remediate,
      and submit a documented resolution summary.

endpoints:
  health: "GET /health"
  reset: "POST /reset"
  step: "POST /step"
  state: "GET /state"
  tasks: "GET /tasks"
  grader: "GET /grader"
  baseline: "POST /baseline"
requirements.txt ADDED
fastapi==0.115.0
uvicorn[standard]==0.30.6
pydantic==2.9.2
openai==1.51.0
requests==2.32.3
python-dotenv==1.0.1
server/__init__.py ADDED
File without changes
server/app.py ADDED
@@ -0,0 +1,196 @@
+ """
+ server/app.py — FastAPI server exposing the OpenEnv HTTP interface.
+
+ Endpoints:
+     GET  /health
+     GET  /
+     POST /reset?task_id=...&scenario_index=...
+     POST /step
+     GET  /state
+     GET  /tasks
+     GET  /grader
+     POST /baseline
+ """
+
+ from __future__ import annotations
+
+ import json
+ import os
+ import subprocess
+ import sys
+
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+ from fastapi import FastAPI, HTTPException, Query
+ from fastapi.middleware.cors import CORSMiddleware
+
+ from server.models import Action
+ from server.environment import IncidentEnvironment
+ from tasks import list_tasks, ALL_TASKS
+
+ _PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+
+ app = FastAPI(
+     title="SRE Incident Response — OpenEnv",
+     version="0.1.0",
+     description="OpenEnv environment for training AI agents on SRE incident response tasks.",
+ )
+
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ env = IncidentEnvironment()
+
+
+ # ── Health / root ────────────────────────────────────────────────────────────
+
+ @app.get("/health")
+ def health():
+     return {"status": "ok", "version": "0.1.0"}
+
+
+ @app.get("/")
+ def root():
+     return {
+         "name": "SRE Incident Response OpenEnv",
+         "version": "0.1.0",
+         "docs": "/docs",
+         "health": "/health",
+         "tasks": "/tasks",
+     }
+
+
+ # ── Core OpenEnv endpoints ───────────────────────────────────────────────────
+
+ @app.post("/reset")
+ def reset(
+     task_id: str = Query(default="alert_classification"),
+     scenario_index: int = Query(default=0),
+ ):
+     """Start a new episode. Returns the initial observation."""
+     try:
+         obs = env.reset(task_id=task_id, scenario_index=scenario_index)
+         return obs.model_dump()
+     except ValueError as e:
+         raise HTTPException(status_code=400, detail=str(e))
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=str(e))
+
+
+ @app.post("/step")
+ def step(action: Action):
+     """Submit an action. Returns observation, reward, done, info."""
+     try:
+         obs, reward, done, info = env.step(action)
+         return {
+             "observation": obs.model_dump(),
+             "reward": reward.model_dump(),
+             "done": done,
+             "info": info,
+         }
+     except RuntimeError as e:
+         raise HTTPException(status_code=400, detail=str(e))
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=str(e))
+
+
+ @app.get("/state")
+ def state():
+     """Return the full current episode state."""
+     try:
+         return env.state().model_dump()
+     except RuntimeError as e:
+         raise HTTPException(status_code=400, detail=str(e))
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=str(e))
+
+
+ @app.get("/tasks")
+ def tasks():
+     """Return all available tasks with descriptions and action schemas."""
+     return {
+         "tasks": list_tasks(),
+         "total": len(ALL_TASKS),
+         "action_schema": {
+             "diagnostic": [
+                 {"action_type": "query_logs", "parameters": {"service": "string"}},
+                 {"action_type": "check_metrics", "parameters": {"service": "string"}},
+                 {"action_type": "check_dependencies", "parameters": {"service": "string"}},
+                 {"action_type": "check_recent_deploys", "parameters": {"service": "string"}},
+                 {"action_type": "check_service_status", "parameters": {"service": "string"}},
+             ],
+             "remediation": [
+                 {"action_type": "restart_service", "parameters": {"service": "string"}},
+                 {"action_type": "rollback_deploy", "parameters": {"service": "string", "target_version": "string"}},
+                 {"action_type": "scale_service", "parameters": {"service": "string", "replicas": "int"}},
+                 {"action_type": "disable_feature_flag", "parameters": {"flag": "string"}},
+                 {"action_type": "clear_cache", "parameters": {"service": "string"}},
+                 {"action_type": "execute_runbook_step", "parameters": {"runbook_action": "string", "target": "string"}},
+             ],
+             "submission": [
+                 {"action_type": "submit_severity", "parameters": {"severity": "P1|P2|P3|P4", "service": "string"}},
+                 {"action_type": "submit_root_cause", "parameters": {"service": "string", "failure_mode": "string"}},
+                 {"action_type": "submit_resolution", "parameters": {"summary": "string"}},
+             ],
+         },
+     }
+
+
+ @app.get("/grader")
+ def grader():
+     """Run the grader on the current episode. Returns score in [0.0, 1.0]."""
+     try:
+         s = env.state()
+         from graders import grade
+         result = grade(s.task_id, s.model_dump(), env._scenario)
+         return {
+             "total": result["total"],
+             "breakdown": result["breakdown"],
+             "feedback": result["feedback"],
+             "task_id": s.task_id,
+             "scenario_id": s.scenario_id,
+             "steps_used": s.step_count,
+             "done": s.done,
+         }
+     except RuntimeError as e:
+         raise HTTPException(status_code=400, detail=str(e))
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=str(e))
+
+
+ @app.post("/baseline")
+ def baseline():
+     """Run inference.py and return the JSON score summary."""
+     script = os.path.join(_PROJECT_ROOT, "inference.py")
+     if not os.path.exists(script):
+         raise HTTPException(status_code=500, detail="inference.py not found in project root")
+     try:
+         result = subprocess.run(
+             [sys.executable, script],
+             capture_output=True,
+             text=True,
+             timeout=1200,
+             cwd=_PROJECT_ROOT,
+             env={**os.environ, "ENV_BASE_URL": "http://localhost:7860"},
+         )
+     except subprocess.TimeoutExpired:
+         raise HTTPException(status_code=500, detail="inference.py timed out (>20 min)")
+
+     if result.returncode != 0:
+         raise HTTPException(status_code=500, detail=result.stderr[-2000:])
+
+     lines = result.stdout.strip().splitlines()
+     last_line = lines[-1] if lines else ""
+     try:
+         return json.loads(last_line)
+     except Exception:
+         return {"raw_output": result.stdout[-3000:]}
+
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run("server.app:app", host="0.0.0.0", port=7860, reload=False)
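The `/baseline` handler depends on an implicit contract with `inference.py`: the final line of its stdout must be the JSON score summary, and anything that is not valid JSON falls back to a raw-output payload. A standalone sketch of that parsing step, using a hypothetical `parse_baseline_output` helper that mirrors the handler's logic:

```python
import json


def parse_baseline_output(stdout: str) -> dict:
    """Extract the score summary: the last stdout line must be JSON."""
    lines = stdout.strip().splitlines()
    last_line = lines[-1] if lines else ""
    try:
        return json.loads(last_line)
    except json.JSONDecodeError:
        # Fall back to the tail of the raw output, as the endpoint does.
        return {"raw_output": stdout[-3000:]}
```

Any logging `inference.py` emits before that final line is therefore harmless, but the summary itself must stay on one line.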
server/environment.py ADDED
@@ -0,0 +1,309 @@
+ """
+ server/environment.py — Core OpenEnv environment for SRE Incident Response.
+
+ Implements the full OpenEnv interface:
+     reset(task_id, scenario_index) -> Observation
+     step(action) -> (Observation, Reward, done, info)
+     state() -> EpisodeState
+
+ All state is in-memory. Thread-safe via a lock.
+ """
+
+ from __future__ import annotations
+
+ import uuid
+ import threading
+ import sys
+ import os
+
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+ from tasks import ALL_TASKS, get_task, get_scenario
+ from graders import grade
+ from server.models import Action, ActionParameters, Observation, Reward, EpisodeState
+
+ # ── Action type sets ─────────────────────────────────────────────────────────
+
+ _DIAGNOSTIC = {
+     "query_logs", "check_metrics", "check_dependencies",
+     "check_recent_deploys", "check_service_status",
+ }
+
+ _REMEDIATION = {
+     "restart_service", "rollback_deploy", "scale_service",
+     "disable_feature_flag", "clear_cache", "execute_runbook_step",
+ }
+
+ _SUBMIT = {
+     "submit_severity", "submit_root_cause", "submit_resolution",
+ }
+
+ # ── Reward constants ─────────────────────────────────────────────────────────
+
+ R_QUERY_KNOWN_FIRST = +0.05
+ R_QUERY_KNOWN_REPEAT = +0.01
+ R_QUERY_UNKNOWN = -0.05
+ R_REMEDIATION_GOOD = +0.10
+ R_REMEDIATION_WRONG = -0.10
+ R_STEP_PAST_HALF = -0.02
+ R_TIMEOUT = -0.10
+ R_UNKNOWN_ACTION = -0.03
+
+
+ class IncidentEnvironment:
+     """
+     OpenEnv environment for SRE Incident Response.
+     One instance handles one episode at a time.
+     """
+
+     def __init__(self):
+         self._lock = threading.Lock()
+         self._s: dict = {}
+         self._scenario: dict = {}
+         self._task_def: dict = {}
+         self._ready = False
+
+     # ── Public OpenEnv API ───────────────────────────────────────────────────
+
+     def reset(self, task_id: str, scenario_index: int = 0) -> Observation:
+         """Start a fresh episode. Returns the initial Observation."""
+         with self._lock:
+             task_def = get_task(task_id)
+             scenario = get_scenario(task_id, scenario_index)
+
+             self._task_def = task_def
+             self._scenario = scenario
+             self._s = {
+                 "episode_id": str(uuid.uuid4()),
+                 "task_id": task_id,
+                 "scenario_id": scenario["scenario_id"],
+                 "step_count": 0,
+                 "max_steps": task_def["max_steps"],
+                 "action_history": [],
+                 "queried_data": {},
+                 "queried_keys": set(),  # tracks (action_type, service) for repeat detection
+                 "submitted": False,
+                 "resolved": False,
+                 "done": False,
+                 "cumulative_reward": 0.0,
+                 "feedback": f"Episode started. {scenario['description']}",
+             }
+             self._ready = True
+             return self._build_obs()
+
+     def step(self, action: Action) -> tuple[Observation, Reward, bool, dict]:
+         """Process one agent action. Returns (Observation, Reward, done, info)."""
+         with self._lock:
+             if not self._ready:
+                 raise RuntimeError("Call reset() before step().")
+
+             s = self._s
+             if s["done"]:
+                 obs = self._build_obs()
+                 return obs, Reward(value=0.0, reason="episode already done",
+                                    cumulative=s["cumulative_reward"]), True, {}
+
+             s["step_count"] += 1
+             step_num = s["step_count"]
+             max_steps = s["max_steps"]
+             at = action.action_type
+             params = action.parameters
+
+             # Record action
+             s["action_history"].append({
+                 "action_type": at,
+                 "parameters": params.model_dump(exclude_none=True),
+                 "step": step_num,
+             })
+
+             # ── Compute step reward ──────────────────────────────────────────
+             r = 0.0
+             fb: list[str] = []
+
+             # Efficiency penalty past halfway
+             if step_num > max_steps // 2:
+                 r += R_STEP_PAST_HALF
+                 fb.append("efficiency penalty")
+
+             if at in _DIAGNOSTIC:
+                 r, fb = self._handle_diagnostic(at, params, r, fb)
+             elif at in _REMEDIATION:
+                 r, fb = self._handle_remediation(at, params, r, fb)
+             elif at in _SUBMIT:
+                 r, fb, terminal = self._handle_submit(at, params, r, fb)
+                 if terminal:
+                     s["done"] = True
+             else:
+                 r += R_UNKNOWN_ACTION
+                 fb.append(f"unknown action_type '{at}'")
+
+             # Timeout
+             if step_num >= max_steps and not s["done"]:
+                 r += R_TIMEOUT
+                 fb.append("timeout — no submission made")
+                 s["done"] = True
+
+             # Run grader on terminal step
+             if s["done"]:
+                 result = grade(s["task_id"], s, self._scenario)
+                 s["cumulative_reward"] = round(
+                     s["cumulative_reward"] + result["total"], 4
+                 )
+                 fb.append(f"grader → {result['feedback']}")
+
+             s["cumulative_reward"] = round(s["cumulative_reward"] + r, 4)
+             s["feedback"] = " | ".join(fb) if fb else "ok"
+
+             reward_obj = Reward(
+                 value=round(r, 4),
+                 reason=s["feedback"],
+                 cumulative=s["cumulative_reward"],
+             )
+             return self._build_obs(), reward_obj, s["done"], {"step": step_num, "feedback": s["feedback"]}
+
+     def state(self) -> EpisodeState:
+         """Return the full current episode state."""
+         with self._lock:
+             if not self._ready:
+                 raise RuntimeError("No active episode — call reset() first.")
+             s = self._s
+             return EpisodeState(
+                 episode_id=s["episode_id"],
+                 task_id=s["task_id"],
+                 scenario_id=s["scenario_id"],
+                 step_count=s["step_count"],
+                 max_steps=s["max_steps"],
+                 action_history=s["action_history"],
+                 queried_data=dict(s["queried_data"]),
+                 submitted=s["submitted"],
+                 resolved=s["resolved"],
+                 done=s["done"],
+                 cumulative_reward=s["cumulative_reward"],
+                 feedback=s["feedback"],
+             )
+
+     # ── Action handlers ──────────────────────────────────────────────────────
+
+     def _handle_diagnostic(
+         self, at: str, params: ActionParameters, r: float, fb: list[str]
+     ) -> tuple[float, list[str]]:
+         s = self._s
+         service = (params.service or "").lower().strip()
+         known = {sv.lower() for sv in self._scenario.get("known_services", set())}
+         tool_data = self._scenario.get("tool_responses", {}).get(at, {})
+         query_key = (at, service)
+
+         if service and service in known:
+             if query_key not in s["queried_keys"]:
+                 r += R_QUERY_KNOWN_FIRST
+                 fb.append(f"queried {service} (+{R_QUERY_KNOWN_FIRST})")
+                 s["queried_keys"].add(query_key)
+             else:
+                 r += R_QUERY_KNOWN_REPEAT
+                 fb.append(f"re-queried {service} (+{R_QUERY_KNOWN_REPEAT})")
+
+             result = tool_data.get(service, f"No data for '{service}'.")
+             s["queried_data"].setdefault(at, {})[service] = result
+
+         elif service:
+             r += R_QUERY_UNKNOWN
+             fb.append(f"unknown service '{service}' ({R_QUERY_UNKNOWN})")
+         else:
+             fb.append(f"{at}: no service specified")
+
+         return r, fb
+
+     def _handle_remediation(
+         self, at: str, params: ActionParameters, r: float, fb: list[str]
+     ) -> tuple[float, list[str]]:
+         s = self._s
+         service = (params.service or "").lower().strip()
+         flag = (params.flag or "").lower().strip()
+         runbook_action = (params.runbook_action or "").lower().strip()
+         target = (params.target or "").lower().strip()
+
+         # Build lookup keys
+         keys_to_check = {at}
+         if service:
+             keys_to_check.add(f"{at}:{service}")
+         if flag:
+             keys_to_check.add(f"{at}:{flag}")
+         if runbook_action:
+             keys_to_check.add(f"execute_runbook_step:{runbook_action}")
+         if target:
+             keys_to_check.add(f"execute_runbook_step:{target}")
+
+         wrong_map = self._scenario.get("wrong_actions", {})
+         rem_data = self._scenario.get("remediation_data", {})
+
+         is_wrong = any(k in wrong_map for k in keys_to_check)
+
+         if is_wrong:
+             r += R_REMEDIATION_WRONG
+             reason = next((wrong_map[k] for k in keys_to_check if k in wrong_map), "wrong action")
+             fb.append(f"wrong: {at} — {str(reason)[:80]}")
+         else:
+             r += R_REMEDIATION_GOOD
+             fb.append(f"executed {at}" + (f" on {service}" if service else ""))
+             # Store remediation result if available
+             at_data = rem_data.get(at, {})
+             result = (
+                 at_data.get(service)
+                 or at_data.get(flag)
+                 or at_data.get(runbook_action)
+                 or at_data.get(target)
+                 or "action executed"
+             )
+             s["queried_data"].setdefault(at, {})[service or flag or runbook_action or at] = result
+
+         return r, fb
+
+     def _handle_submit(
+         self, at: str, params: ActionParameters, r: float, fb: list[str]
+     ) -> tuple[float, list[str], bool]:
+         s = self._s
+         s["submitted"] = True
+
+         if at == "submit_severity":
+             severity = (params.severity or "").upper()
+             fb.append(f"submitted severity: {severity}")
+
+         elif at == "submit_root_cause":
+             svc = params.service or ""
+             mode = params.failure_mode or ""
+             fb.append(f"submitted root cause: service={svc}, failure_mode={mode}")
+
+         elif at == "submit_resolution":
+             summary = params.summary or ""
+             diag_rem_count = sum(
+                 1 for a in s["action_history"]
+                 if a.get("action_type") in _DIAGNOSTIC | _REMEDIATION
+             )
+             if summary.strip() and diag_rem_count >= 1:
+                 s["resolved"] = True
+                 fb.append("resolution submitted — incident resolved")
+             else:
+                 fb.append("resolution submitted (insufficient investigation)")
+
+         return r, fb, True  # always terminal
+
+     # ── Build observation ────────────────────────────────────────────────────
+
+     def _build_obs(self) -> Observation:
+         s = self._s
+         sc = self._scenario
+         td = self._task_def
+         return Observation(
+             episode_id=s["episode_id"],
+             task_id=s["task_id"],
+             scenario_id=s["scenario_id"],
+             step_count=s["step_count"],
+             max_steps=s["max_steps"],
+             incident_summary=sc.get("incident_summary", sc.get("description", "")),
+             alert=sc.get("alert", {}),
+             available_actions=td.get("available_actions", []),
+             queried_data=dict(s["queried_data"]),
+             cumulative_reward=s["cumulative_reward"],
+             done=s["done"],
+             feedback=s["feedback"],
+         )
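The per-step shaping in `_handle_diagnostic` can be traced in isolation. A simplified sketch using the reward constants above; it keys repeat detection on the service name only, whereas the real environment keys on `(action_type, service)`:

```python
# Reward constants mirrored from server/environment.py
R_QUERY_KNOWN_FIRST = +0.05
R_QUERY_KNOWN_REPEAT = +0.01
R_QUERY_UNKNOWN = -0.05
R_STEP_PAST_HALF = -0.02


def diagnostic_reward(step_num: int, max_steps: int, service: str,
                      known: set[str], queried: set[str]) -> float:
    """Per-step reward for a diagnostic action under the env's rules."""
    r = 0.0
    if step_num > max_steps // 2:  # efficiency penalty past halfway
        r += R_STEP_PAST_HALF
    if service in known:
        # First query of a service pays more than a repeat.
        r += R_QUERY_KNOWN_REPEAT if service in queried else R_QUERY_KNOWN_FIRST
    else:
        r += R_QUERY_UNKNOWN
    return round(r, 4)
```

So an early first-time query nets +0.05, while re-querying a known service after the halfway mark nets -0.01 (+0.01 query, -0.02 efficiency).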
server/models.py ADDED
@@ -0,0 +1,71 @@
+ """
+ server/models.py — Typed Pydantic models for the OpenEnv interface.
+
+ OpenEnv requires three typed models: Action, Observation, Reward.
+ """
+
+ from __future__ import annotations
+ from pydantic import BaseModel, Field
+
+
+ class ActionParameters(BaseModel):
+     """Flexible parameter bag — different action types use different fields."""
+     service: str | None = None
+     severity: str | None = None
+     failure_mode: str | None = None
+     summary: str | None = None
+     target_version: str | None = None
+     replicas: int | None = None
+     flag: str | None = None
+     runbook_action: str | None = None
+     target: str | None = None
+     reasoning: str | None = None
+
+     model_config = {"extra": "allow"}
+
+
+ class Action(BaseModel):
+     """An action submitted by the agent to the environment."""
+     action_type: str
+     parameters: ActionParameters = Field(default_factory=ActionParameters)
+
+     model_config = {"extra": "allow"}
+
+
+ class Observation(BaseModel):
+     """Observation returned after reset() or step()."""
+     episode_id: str
+     task_id: str
+     scenario_id: str
+     step_count: int
+     max_steps: int
+     incident_summary: str
+     alert: dict
+     available_actions: list[str]
+     queried_data: dict
+     cumulative_reward: float
+     done: bool
+     feedback: str
+
+
+ class Reward(BaseModel):
+     """Reward signal returned after each step()."""
+     value: float
+     reason: str
+     cumulative: float
+
+
+ class EpisodeState(BaseModel):
+     """Full episode state returned by GET /state."""
+     episode_id: str
+     task_id: str
+     scenario_id: str
+     step_count: int
+     max_steps: int
+     action_history: list[dict]
+     queried_data: dict
+     submitted: bool
+     resolved: bool
+     done: bool
+     cumulative_reward: float
+     feedback: str
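Because every `ActionParameters` field is optional (and `extra: "allow"` tolerates unknown keys), a client sends only the fields its action type needs and omits the rest. An illustrative `/step` wire payload built with the stdlib; the field names come from the models above, the values are hypothetical:

```python
import json

# A /step request body matching Action / ActionParameters.
action_json = json.dumps({
    "action_type": "submit_root_cause",
    "parameters": {
        "service": "postgres-db",
        "failure_mode": "connection pool exhaustion",
        "reasoning": "auth-service logs show 'too many clients already'",
    },
})

# Omitted optional fields (severity, summary, replicas, ...) simply
# default to None on the server side.
decoded = json.loads(action_json)
```

The `reasoning` field is free text carried along for logging; it does not affect the environment's reward logic.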
tasks.py ADDED
@@ -0,0 +1,664 @@
+ """
+ tasks.py — Task definitions and scenario data for SRE Incident Response OpenEnv.
+
+ Structure:
+     ALL_TASKS : dict[task_id -> task metadata]
+     SCENARIOS : dict[task_id -> list[scenario_dict]]
+
+ Public API:
+     get_task(task_id) -> task metadata dict
+     get_scenario(task_id, index) -> scenario dict
+     list_tasks() -> list of task metadata dicts
+ """
+
+ ALL_TASKS: dict = {
+     "alert_classification": {
+         "id": "alert_classification",
+         "name": "Task 1: Alert Severity Classification",
+         "difficulty": "easy",
+         "max_steps": 3,
+         "score_range": [0.0, 1.0],
+         "description": (
+             "Given an incoming alert with symptoms, affected services, and error rates, "
+             "classify the incident severity as P1 (CRITICAL), P2 (HIGH), P3 (MEDIUM), "
+             "or P4 (LOW). Use available diagnostic tools to gather evidence before submitting."
+         ),
+         "available_actions": [
+             "query_logs", "check_metrics", "check_dependencies",
+             "check_recent_deploys", "submit_severity",
+         ],
+         "submission_action": "submit_severity",
+     },
+     "root_cause_analysis": {
+         "id": "root_cause_analysis",
+         "name": "Task 2: Root Cause Analysis",
+         "difficulty": "medium",
+         "max_steps": 10,
+         "score_range": [0.0, 1.0],
+         "description": (
+             "An active incident is in progress. Use diagnostic tools to query logs, "
+             "metrics, dependencies, and recent deploys across services. Identify the "
+             "exact root cause service and failure mode, then submit your findings."
+         ),
+         "available_actions": [
+             "query_logs", "check_metrics", "check_dependencies",
+             "check_recent_deploys", "check_service_status", "submit_root_cause",
+         ],
+         "submission_action": "submit_root_cause",
+     },
+     "remediation_planning": {
+         "id": "remediation_planning",
+         "name": "Task 3: Incident Remediation",
+         "difficulty": "hard",
+         "max_steps": 15,
+         "score_range": [0.0, 1.0],
+         "description": (
+             "A production incident requires full resolution. Diagnose the root cause, "
+             "execute the correct remediation sequence (restart, rollback, scale, drain), "
+             "then submit a resolution summary. Scored on investigation quality, "
+             "remediation correctness, efficiency, and documentation."
+         ),
+         "available_actions": [
+             "query_logs", "check_metrics", "check_dependencies",
+             "check_recent_deploys", "check_service_status",
+             "restart_service", "rollback_deploy", "scale_service",
+             "disable_feature_flag", "clear_cache", "execute_runbook_step",
+             "submit_resolution",
+         ],
+         "submission_action": "submit_resolution",
+     },
+ }
+
+ # ---------------------------------------------------------------------------
+ # Scenario data
+ # Each scenario has:
+ #   scenario_id, description, incident_summary, alert, known_services,
+ #   tool_responses, correct_severity, correct_root_cause, correct_remediation,
+ #   wrong_actions
+ # ---------------------------------------------------------------------------
+
+ SCENARIOS: dict = {
+
+     # ── ALERT CLASSIFICATION ─────────────────────────────────────────────────
+
+     "alert_classification": [
+
+         # Scenario 0: DB connection pool exhaustion cascading up
+         {
+             "scenario_id": "AC-001",
+             "description": (
+                 "Cascading failure: postgres-db connection pool exhausted, "
+                 "causing auth-service timeouts, which is blocking api-gateway requests."
+             ),
+             "incident_summary": (
+                 "P1 ALERT — api-gateway 5xx rate 78%, auth-service timeout rate 94%, "
+                 "postgres-db connection pool at 100% (500/500). "
+                 "Checkout flow completely down. Revenue impact: $12k/min."
+             ),
+             "alert": {
+                 "id": "ALT-20240315-001",
+                 "title": "CRITICAL: api-gateway error rate 78%",
+                 "severity_fired": "P1",
+                 "affected_services": ["api-gateway", "auth-service", "postgres-db"],
+                 "symptoms": [
+                     "api-gateway: HTTP 503 rate 78% (up from baseline 0.1%)",
+                     "auth-service: connection timeout 94% of requests",
+                     "postgres-db: connection pool 500/500 (100% utilized)",
+                     "checkout flow: completely unavailable",
+                     "Active user sessions: 0 new logins succeeding",
+                 ],
+                 "error_rate": 0.78,
+                 "duration_minutes": 4,
+                 "revenue_impact_per_min": 12000,
+             },
+             "known_services": {"api-gateway", "auth-service", "postgres-db"},
+             "tool_responses": {
+                 "query_logs": {
+                     "api-gateway": (
+                         "2024-03-15T10:04:12Z ERROR upstream connect error or disconnect/reset "
+                         "before headers. reset reason: connection timeout — auth-service:8080\n"
+                         "2024-03-15T10:04:13Z ERROR 503 Service Unavailable — upstream: auth-service"
+                     ),
+                     "auth-service": (
+                         "2024-03-15T10:04:10Z ERROR pq: sorry, too many clients already\n"
+                         "2024-03-15T10:04:11Z ERROR dial tcp postgres-db:5432: connect: "
+                         "connection refused — pool exhausted"
+                     ),
+                     "postgres-db": (
+                         "2024-03-15T10:03:58Z LOG connection received: host=auth-service\n"
+                         "2024-03-15T10:04:00Z FATAL remaining connection slots are reserved "
+                         "for non-replication superuser connections\n"
+                         "2024-03-15T10:04:01Z LOG max_connections=500 currently active=500"
+                     ),
+                 },
+                 "check_metrics": {
+                     "api-gateway": "HTTP 5xx rate: 78% | Latency p99: 30s (timeout) | RPS: 1200",
+                     "auth-service": "Error rate: 94% | DB connection wait: 28s | Active conns: 0",
+                     "postgres-db": "Connections: 500/500 (100%) | Query queue depth: 847 | CPU: 98%",
+                 },
+                 "check_dependencies": {
+                     "api-gateway": "Depends on: auth-service [DEGRADED], product-service [OK]",
+                     "auth-service": "Depends on: postgres-db [CRITICAL], redis-session [OK]",
+                     "postgres-db": "No upstream dependencies",
+                 },
+                 "check_recent_deploys": {
+                     "api-gateway": "Last deploy: 3 days ago — no recent changes",
+                     "auth-service": "Last deploy: 47 mins ago — added connection pool monitoring",
+                     "postgres-db": "Last deploy: 12 days ago — no recent changes",
+                 },
+             },
+             "correct_severity": "P1",
+             "adjacent_severities": ["P2"],
+             "correct_root_cause": {
+                 "service": "postgres-db",
+                 "failure_mode": "connection pool exhaustion",
+             },
+             "correct_remediation": [
+                 "restart_service:auth-service",
+                 "execute_runbook_step:increase_max_connections",
+                 "scale_service:postgres-db",
+             ],
+             "wrong_actions": {
+                 "rollback_deploy": "Rolling back auth-service monitoring changes won't fix pool exhaustion",
+                 "restart_service:api-gateway": "api-gateway is a victim, not the cause",
+                 "clear_cache": "Cache is not related to DB connection pool exhaustion",
+             },
+         },
+
+         # Scenario 1: CDN cache invalidation storm
+         {
+             "scenario_id": "AC-002",
+             "description": (
+                 "CDN cache invalidation storm: a misconfigured cache purge script ran "
+                 "against all product images, sending 40x normal traffic to origin."
+             ),
+             "incident_summary": (
+                 "P2 ALERT — product-service origin traffic spike 4000%, "
+                 "image-service CPU 95%, CDN cache hit rate dropped from 94% to 3%. "
+                 "Site slow but partially functional. Latency p99: 18s."
+             ),
+             "alert": {
+                 "id": "ALT-20240315-002",
+                 "title": "HIGH: product-service origin traffic anomaly",
+                 "severity_fired": "P2",
+                 "affected_services": ["cdn-edge", "product-service", "image-service"],
+                 "symptoms": [
+                     "CDN cache hit rate: 3% (normal: 94%)",
+                     "product-service: origin RPS 48,000 (normal: 1,200)",
+                     "image-service: CPU 95%, latency p99 18s",
+                     "User-facing: product pages loading slowly, some images timing out",
+                     "No complete outage — checkout still working",
+                 ],
+                 "error_rate": 0.15,
+                 "duration_minutes": 8,
+                 "revenue_impact_per_min": 800,
+             },
+             "known_services": {"cdn-edge", "product-service", "image-service"},
+             "tool_responses": {
+                 "query_logs": {
+                     "cdn-edge": (
+                         "2024-03-15T10:22:00Z INFO cache MISS ratio: 97% (last 5min)\n"
+                         "2024-03-15T10:20:11Z WARN mass cache invalidation event detected "
+                         "— 2.1M keys purged by purge-job-prod\n"
+                         "2024-03-15T10:20:10Z INFO purge request from 10.0.1.45 — pattern: /*"
+                     ),
+                     "product-service": (
+                         "2024-03-15T10:22:05Z WARN request queue depth: 12,400\n"
+                         "2024-03-15T10:22:06Z ERROR timeout fetching image from image-service"
+                     ),
+                     "image-service": (
+                         "2024-03-15T10:22:00Z WARN CPU throttling engaged\n"
+                         "2024-03-15T10:22:01Z ERROR worker pool exhausted — dropping requests"
+                     ),
+                 },
+                 "check_metrics": {
+                     "cdn-edge": "Cache hit rate: 3% | Purge events last hour: 1 (mass) | Origin RPS: 48k",
+                     "product-service": "Origin RPS: 48,000 (normal 1,200) | Queue depth: 12,400",
+                     "image-service": "CPU: 95% | Worker pool: 0 free / 200 | Latency p99: 18s",
+                 },
+                 "check_dependencies": {
+                     "cdn-edge": "Origin: product-service [OVERLOADED]",
+                     "product-service": "Depends on: image-service [DEGRADED], postgres-db [OK]",
+                     "image-service": "Depends on: object-storage [OK] — no upstream issues",
+                 },
+                 "check_recent_deploys": {
+                     "cdn-edge": "Cronjob purge-job-prod modified 2 hours ago — pattern changed from /images/* to /*",
+                     "product-service": "Last deploy: 5 days ago",
+                     "image-service": "Last deploy: 2 days ago",
+                 },
+             },
+             "correct_severity": "P2",
+             "adjacent_severities": ["P1", "P3"],
+             "correct_root_cause": {
+                 "service": "cdn-edge",
+                 "failure_mode": "mass cache invalidation / misconfigured purge job",
+             },
+             "correct_remediation": [
+                 "disable_feature_flag:purge-job-prod",
+                 "execute_runbook_step:warm_cdn_cache",
+                 "scale_service:image-service",
+             ],
+             "wrong_actions": {
+                 "restart_service:image-service": "Restarting won't fix the CDN cache miss storm at source",
+                 "rollback_deploy:product-service": "product-service has no recent changes",
+                 "restart_service:cdn-edge": "Restarting CDN edge nodes will make cache miss rate worse temporarily",
+             },
+         },
+     ],
+
+     # ── ROOT CAUSE ANALYSIS ──────────────────────────────────────────────────
+
+     "root_cause_analysis": [
+
+         # Scenario 0: Postgres OOM killed by runaway analytics query
+         {
+             "scenario_id": "RCA-001",
+             "description": (
+                 "postgres-db was OOM-killed by the Linux kernel after a runaway analytics "
+                 "query consumed all available memory, taking down all dependent services."
+             ),
+             "incident_summary": (
+                 "Multiple services down: api-gateway 503, auth-service failing, "
+                 "order-service unable to write. postgres-db restarting repeatedly. "
+                 "Root cause is upstream — needs investigation."
+             ),
+             "alert": {
+                 "id": "ALT-RCA-001",
+                 "title": "CRITICAL: postgres-db repeated restarts, all dependents degraded",
+                 "severity_fired": "P1",
+                 "affected_services": ["api-gateway", "auth-service", "order-service", "postgres-db"],
+                 "symptoms": [
+                     "postgres-db: restarted 4 times in 12 minutes",
+                     "auth-service: connection refused errors 100%",
+                     "order-service: write failures 100%",
+                     "api-gateway: 503 on all authenticated routes",
+                 ],
+                 "error_rate": 0.95,
+                 "duration_minutes": 14,
+             },
+             "known_services": {
+                 "api-gateway", "auth-service", "order-service",
+                 "postgres-db", "analytics-service", "redis-session",
+             },
+             "tool_responses": {
+                 "query_logs": {
+                     "postgres-db": (
+                         "2024-03-16T02:11:00Z LOG database system was shut down at 2024-03-16 02:10:58\n"
+                         "2024-03-16T02:10:58Z FATAL Out of Memory: Kill process 1847 (postgres) "
+                         "score 982 or sacrifice child\n"
+                         "2024-03-16T02:10:30Z LOG process 1847 still running query started "
+                         "2024-03-16 01:58:00: SELECT * FROM events JOIN user_sessions JOIN orders "
+                         "JOIN products — no LIMIT clause"
+                     ),
+                     "analytics-service": (
+                         "2024-03-16T01:58:00Z INFO starting scheduled report: full_history_export\n"
+                         "2024-03-16T02:10:55Z ERROR query killed by OOM — report failed\n"
+                         "2024-03-16T01:58:01Z WARN query has no LIMIT — estimated rows: 847M"
+                     ),
+                     "auth-service": (
+                         "2024-03-16T02:11:05Z ERROR connect ECONNREFUSED postgres-db:5432\n"
+                         "2024-03-16T02:11:06Z ERROR all retries exhausted"
+                     ),
+                     "api-gateway": (
+                         "2024-03-16T02:11:10Z ERROR upstream auth-service: 503 Service Unavailable"
+                     ),
+                     "order-service": (
+                         "2024-03-16T02:11:08Z ERROR pq: the database system is starting up"
+                     ),
+                     "redis-session": "No errors — operating normally",
+                 },
+                 "check_metrics": {
+                     "postgres-db": "Memory: 0% free (OOM killed) | Restarts: 4 | Last crash: 2min ago",
+                     "analytics-service": "Memory used: 31GB / 32GB at time of crash | Query runtime: 12min",
+                     "auth-service": "Connection success rate: 0% | DB dependency: CRITICAL",
+                     "api-gateway": "503 rate: 95% | Auth dependency: DOWN",
+                     "order-service": "Write success rate: 0% | DB dependency: RESTARTING",
+                     "redis-session": "Hit rate: 99.2% | Memory: 42% | All normal",
+                 },
+                 "check_dependencies": {
+                     "postgres-db": "Clients: auth-service, order-service, analytics-service, product-service",
+                     "analytics-service": "Depends on: postgres-db [RESTARTING]",
+                     "auth-service": "Depends on: postgres-db [RESTARTING], redis-session [OK]",
+                     "api-gateway": "Depends on: auth-service [DOWN]",
+                     "order-service": "Depends on: postgres-db [RESTARTING]",
+                     "redis-session": "No DB dependency — standalone",
+                 },
+                 "check_recent_deploys": {
+                     "analytics-service": (
+                         "Deploy 6 hours ago: added full_history_export scheduled job — "
+                         "no LIMIT on cross-table JOIN query, runs at 02:00 UTC daily"
+                     ),
+                     "postgres-db": "No recent deploys — last change 3 weeks ago",
+                     "auth-service": "No recent deploys",
+                     "order-service": "No recent deploys",
+                 },
+                 "check_service_status": {
+                     "postgres-db": "Status: RESTARTING | Uptime: 47s | Previous uptime: 14min",
337
+ "analytics-service": "Status: ERROR | Last job: full_history_export FAILED",
338
+ "auth-service": "Status: DOWN | Waiting for postgres-db",
339
+ "api-gateway": "Status: DEGRADED | 95% requests failing",
340
+ },
341
+ },
342
+ "correct_root_cause": {
343
+ "service": "analytics-service",
344
+ "failure_mode": "unbounded query causing OOM kill on postgres-db",
345
+ },
346
+ "correct_remediation": [
347
+ "restart_service:analytics-service",
348
+ "disable_feature_flag:full_history_export",
349
+ "execute_runbook_step:restart_postgres",
350
+ ],
351
+ "wrong_actions": {
352
+ "restart_service:auth-service": "auth-service is a victim, restarting won't fix the DB",
353
+ "restart_service:api-gateway": "api-gateway is a victim downstream",
354
+ "scale_service:postgres-db": "Scaling DB won't prevent OOM if the bad query runs again",
355
+ "rollback_deploy:postgres-db": "postgres-db has no recent deploys",
356
+ },
357
+ },
358
+
359
+         # Scenario 1: Network partition isolating payment-service
+         {
+             "scenario_id": "RCA-002",
+             "description": (
+                 "A BGP route withdrawal caused a network partition isolating the "
+                 "payment-service AZ, resulting in a 61% checkout failure rate."
+             ),
+             "incident_summary": (
+                 "Checkout failures 61%, payment-service unreachable from AZ-2 and AZ-3. "
+                 "fraud-detection-service also unreachable. "
+                 "Partial outage — users in AZ-1 unaffected."
+             ),
+             "alert": {
+                 "id": "ALT-RCA-002",
+                 "title": "HIGH: checkout failure rate 61%, payment-service connectivity loss",
+                 "severity_fired": "P2",
+                 "affected_services": ["order-service", "payment-service", "fraud-detection-service"],
+                 "symptoms": [
+                     "checkout failure rate: 61% (only AZ-2 and AZ-3 affected)",
+                     "payment-service: unreachable from AZ-2, AZ-3",
+                     "fraud-detection-service: timeout from AZ-2, AZ-3",
+                     "AZ-1 users: completely unaffected",
+                     "Network latency AZ-2→AZ-1: infinite (no route)",
+                 ],
+                 "error_rate": 0.61,
+                 "duration_minutes": 9,
+             },
+             "known_services": {
+                 "order-service", "payment-service", "fraud-detection-service",
+                 "postgres-db", "redis-payment-cache", "network-infra",
+             },
+             "tool_responses": {
+                 "query_logs": {
+                     "order-service": (
+                         "2024-03-17T14:32:10Z ERROR connection timeout payment-service:8080 "
+                         "(AZ-2 → AZ-1: no route to host)\n"
+                         "2024-03-17T14:32:11Z ERROR fraud-detection-service: i/o timeout"
+                     ),
+                     "payment-service": (
+                         "2024-03-17T14:31:58Z WARN health check failing from AZ-2 load balancer\n"
+                         "2024-03-17T14:31:59Z INFO all local (AZ-1) requests processing normally"
+                     ),
+                     "fraud-detection-service": (
+                         "2024-03-17T14:32:00Z INFO processing normally within AZ-1\n"
+                         "2024-03-17T14:32:01Z WARN cross-AZ health checks timing out"
+                     ),
+                     "network-infra": (
+                         "2024-03-17T14:31:45Z CRITICAL BGP peer 10.0.2.1 route withdrawal — "
+                         "AZ-2 lost route to AZ-1 CIDR 10.0.1.0/24\n"
+                         "2024-03-17T14:31:45Z CRITICAL BGP peer 10.0.3.1 route withdrawal — "
+                         "AZ-3 lost route to AZ-1 CIDR 10.0.1.0/24"
+                     ),
+                     "postgres-db": "Operating normally — no errors",
+                     "redis-payment-cache": "Operating normally — AZ-1 only traffic, all good",
+                 },
+                 "check_metrics": {
+                     "order-service": "AZ-2 checkout failure: 99% | AZ-1 checkout failure: 0.2% (baseline)",
+                     "payment-service": "AZ-1 traffic: normal | AZ-2/AZ-3 inbound: 0 (blocked by network)",
+                     "fraud-detection-service": "AZ-1 normal | Cross-AZ: 100% timeout",
+                     "network-infra": "BGP sessions AZ-2/AZ-3: DOWN | AZ-1 internal: all UP",
+                     "postgres-db": "All metrics normal",
+                     "redis-payment-cache": "All metrics normal",
+                 },
+                 "check_dependencies": {
+                     "order-service": "Depends on: payment-service [PARTITIONED], fraud-detection-service [PARTITIONED]",
+                     "payment-service": "Depends on: postgres-db [OK], redis-payment-cache [OK]",
+                     "fraud-detection-service": "Depends on: postgres-db [OK]",
+                     "network-infra": "BGP peers: AZ-2 [DOWN], AZ-3 [DOWN], AZ-1 [UP]",
+                 },
+                 "check_recent_deploys": {
+                     "network-infra": (
+                         "Router config change 18 mins ago: updated BGP route advertisement policy — "
+                         "inadvertently withdrew AZ-1 routes from AZ-2/AZ-3 peers"
+                     ),
+                     "payment-service": "No recent deploys",
+                     "order-service": "No recent deploys",
+                 },
+                 "check_service_status": {
+                     "payment-service": "Status: HEALTHY (within AZ-1) | Cross-AZ: UNREACHABLE",
+                     "order-service": "Status: DEGRADED | AZ-2/3 instances failing",
+                     "network-infra": "BGP AZ-2: WITHDRAWN | BGP AZ-3: WITHDRAWN | AZ-1: UP",
+                     "fraud-detection-service": "Status: HEALTHY (within AZ-1) | Cross-AZ: UNREACHABLE",
+                 },
+             },
+             "correct_root_cause": {
+                 "service": "network-infra",
+                 "failure_mode": "BGP route withdrawal causing AZ network partition",
+             },
+             "correct_remediation": [
+                 "execute_runbook_step:restore_bgp_routes",
+                 "rollback_deploy:network-infra",
+             ],
+             "wrong_actions": {
+                 "restart_service:payment-service": "payment-service is healthy — network is the issue",
+                 "restart_service:order-service": "order-service is a victim of the network partition",
+                 "scale_service:payment-service": "Scaling won't fix a network routing problem",
+                 "clear_cache:redis-payment-cache": "Cache is operating normally — not the cause",
+             },
+         },
+     ],
+
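Downstream, the `correct_root_cause` and `known_services` fields of each scenario above drive scoring. A minimal sketch of that check, using a hypothetical `grade_root_cause` helper (the shipped rubric lives in `graders.py` and may weigh partial credit differently):

```python
def grade_root_cause(scenario: dict, proposed_service: str) -> dict:
    """Score a proposed root-cause service against a scenario definition.

    Hypothetical helper for illustration; the real rubric lives in graders.py.
    """
    # Reject services the scenario does not know about.
    if proposed_service not in scenario["known_services"]:
        return {"score": 0.0, "reason": f"unknown service '{proposed_service}'"}
    # Full credit only for the designated root cause, not its victims.
    if proposed_service == scenario["correct_root_cause"]["service"]:
        return {"score": 1.0, "reason": "root cause identified"}
    return {"score": 0.0, "reason": f"'{proposed_service}' is a symptom, not the cause"}


# Pared-down RCA-001-style scenario:
scenario = {
    "known_services": {"postgres-db", "analytics-service", "auth-service"},
    "correct_root_cause": {
        "service": "analytics-service",
        "failure_mode": "unbounded query causing OOM kill on postgres-db",
    },
}
print(grade_root_cause(scenario, "analytics-service")["score"])  # 1.0
```

Blaming a victim service (e.g. `auth-service`) scores 0.0 with a "symptom, not the cause" reason, which mirrors the intent of the `wrong_actions` entries.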
+     # ── REMEDIATION PLANNING ─────────────────────────────────────────────────
+
+     "remediation_planning": [
+
+         # Scenario 0: Postgres OOM — full remediation required
+         {
+             "scenario_id": "RP-001",
+             "description": (
+                 "Full remediation required: analytics-service OOM-killed postgres-db. "
+                 "Must stop the offending job, restart DB, restore services, document."
+             ),
+             "incident_summary": (
+                 "CRITICAL — postgres-db repeatedly OOM-killed by analytics runaway query. "
+                 "auth-service, order-service, api-gateway all down. "
+                 "Requires: stop analytics job, restart postgres, verify service recovery, document."
+             ),
+             "alert": {
+                 "id": "ALT-RP-001",
+                 "title": "CRITICAL: postgres-db OOM killed — full stack down",
+                 "severity_fired": "P1",
+                 "affected_services": ["postgres-db", "auth-service", "order-service", "api-gateway"],
+             },
+             "known_services": {
+                 "postgres-db", "auth-service", "order-service",
+                 "api-gateway", "analytics-service",
+             },
+             "tool_responses": {
+                 "query_logs": {
+                     "postgres-db": (
+                         "FATAL: Out of Memory: Kill process (postgres) — analytics query running 12min with no LIMIT"
+                     ),
+                     "analytics-service": "ERROR: full_history_export job — unbounded JOIN query killed by OOM",
+                     "auth-service": "ERROR: connect ECONNREFUSED postgres-db:5432",
+                     "order-service": "ERROR: pq: the database system is starting up",
+                     "api-gateway": "ERROR: upstream auth-service 503",
+                 },
+                 "check_metrics": {
+                     "postgres-db": "Memory: OOM | Restarts: 4 | Status: RESTARTING",
+                     "analytics-service": "Status: ERROR | Memory spike to 31GB before crash",
+                     "auth-service": "Connection success: 0% | Waiting for DB",
+                     "order-service": "Write success: 0% | Waiting for DB",
+                 },
+                 "check_dependencies": {
+                     "postgres-db": "Clients: auth-service, order-service, analytics-service",
+                     "analytics-service": "Depends on: postgres-db",
+                     "auth-service": "Depends on: postgres-db [DOWN]",
+                     "order-service": "Depends on: postgres-db [DOWN]",
+                 },
+                 "check_recent_deploys": {
+                     "analytics-service": "Deploy 6h ago: added full_history_export cron job — unbounded query",
+                     "postgres-db": "No recent changes",
+                 },
+                 "check_service_status": {
+                     "postgres-db": "RESTARTING | Uptime: 47s",
+                     "analytics-service": "ERROR | Last job failed",
+                     "auth-service": "DOWN",
+                     "order-service": "DOWN",
+                 },
+             },
+             "remediation_data": {
+                 "disable_feature_flag": {
+                     "full_history_export": "Cron job full_history_export disabled — analytics queries halted",
+                 },
+                 "restart_service": {
+                     "postgres-db": "postgres-db restarted cleanly — accepting connections",
+                     "analytics-service": "analytics-service restarted — no active queries",
+                     "auth-service": "auth-service restarted — reconnected to postgres-db successfully",
+                     "order-service": "order-service restarted — write operations resuming",
+                 },
+                 "execute_runbook_step": {
+                     "verify_db_health": "postgres-db connections: 12/500 — healthy",
+                     "check_service_recovery": "auth-service OK, order-service OK, api-gateway OK",
+                 },
+             },
+             "correct_severity": "P1",
+             "correct_root_cause": {
+                 "service": "analytics-service",
+                 "failure_mode": "unbounded query OOM killing postgres-db",
+             },
+             "correct_remediation_sequence": [
+                 "disable_feature_flag:full_history_export",
+                 "restart_service:analytics-service",
+                 "restart_service:postgres-db",
+                 "restart_service:auth-service",
+                 "restart_service:order-service",
+             ],
+             "wrong_actions": {
+                 "rollback_deploy:postgres-db": "postgres-db has no recent deploy to roll back",
+                 "scale_service:postgres-db": "Scaling won't stop the OOM query from running again",
+                 "restart_service:api-gateway": "api-gateway is a downstream victim — fix the DB first",
+             },
+             "resolution_keywords": [
+                 "analytics", "oom", "memory", "postgres", "query", "full_history_export",
+                 "disabled", "restarted", "recovered",
+             ],
+         },
+
+         # Scenario 1: BGP network partition — full remediation
+         {
+             "scenario_id": "RP-002",
+             "description": (
+                 "Full remediation: BGP route withdrawal partitioned AZ-2/AZ-3 from AZ-1 "
+                 "where payment-service runs. Must restore BGP routes, roll back network config."
+             ),
+             "incident_summary": (
+                 "P2 — BGP route withdrawal isolating payment-service from 61% of users. "
+                 "Requires: restore BGP routes, roll back router config, verify checkout recovery."
+             ),
+             "alert": {
+                 "id": "ALT-RP-002",
+                 "title": "HIGH: checkout 61% failure — BGP network partition AZ-2/AZ-3",
+                 "severity_fired": "P2",
+                 "affected_services": ["network-infra", "order-service", "payment-service"],
+             },
+             "known_services": {
+                 "network-infra", "order-service", "payment-service",
+                 "fraud-detection-service", "postgres-db",
+             },
+             "tool_responses": {
+                 "query_logs": {
+                     "network-infra": (
+                         "CRITICAL: BGP route withdrawal — AZ-2/AZ-3 lost route to AZ-1 10.0.1.0/24\n"
+                         "Router config change 18min ago: BGP advertisement policy update"
+                     ),
+                     "order-service": "ERROR: connection timeout payment-service — no route to host",
+                     "payment-service": "INFO: AZ-1 traffic normal | WARN: cross-AZ health checks failing",
+                 },
+                 "check_metrics": {
+                     "network-infra": "BGP AZ-2: DOWN | BGP AZ-3: DOWN | AZ-1: UP",
+                     "order-service": "AZ-2 failure: 99% | AZ-1 failure: 0.2%",
+                     "payment-service": "AZ-1 normal | Cross-AZ inbound: 0",
+                 },
+                 "check_dependencies": {
+                     "order-service": "Depends on: payment-service [PARTITIONED]",
+                     "payment-service": "Depends on: postgres-db [OK]",
+                     "network-infra": "BGP peers: AZ-2 [DOWN], AZ-3 [DOWN]",
+                 },
+                 "check_recent_deploys": {
+                     "network-infra": "Config change 18min ago — BGP policy update withdrew AZ-1 routes",
+                     "payment-service": "No recent deploys",
+                 },
+                 "check_service_status": {
+                     "network-infra": "BGP AZ-2: WITHDRAWN | BGP AZ-3: WITHDRAWN",
+                     "payment-service": "HEALTHY (AZ-1 only) | Cross-AZ: UNREACHABLE",
+                     "order-service": "DEGRADED",
+                 },
+             },
+             "remediation_data": {
+                 "rollback_deploy": {
+                     "network-infra": "Router config rolled back — BGP advertisement policy restored",
+                 },
+                 "execute_runbook_step": {
+                     "restore_bgp_routes": "BGP routes restored — AZ-2/AZ-3 can reach AZ-1",
+                     "verify_checkout_recovery": "Checkout failure rate: 0.3% — incident resolved",
+                 },
+             },
+             "correct_severity": "P2",
+             "correct_root_cause": {
+                 "service": "network-infra",
+                 "failure_mode": "BGP route withdrawal network partition",
+             },
+             "correct_remediation_sequence": [
+                 "execute_runbook_step:restore_bgp_routes",
+                 "rollback_deploy:network-infra",
+                 "execute_runbook_step:verify_checkout_recovery",
+             ],
+             "wrong_actions": {
+                 "restart_service:payment-service": "payment-service is healthy — network is the issue",
+                 "scale_service:payment-service": "Scaling won't fix a routing problem",
+                 "restart_service:order-service": "order-service is a victim",
+                 "clear_cache": "Cache is unrelated to network routing",
+             },
+             "resolution_keywords": [
+                 "bgp", "network", "route", "rollback", "partition", "restored",
+                 "az-1", "az-2", "az-3", "checkout",
+             ],
+         },
+     ],
+ }
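The `correct_remediation_sequence` lists above are order-sensitive: disabling the runaway job must precede the postgres restart, and restoring BGP routes precedes the verification step. One way to sketch that ordering check (a hypothetical helper; the shipped scoring lives in `graders.py`):

```python
def remediation_in_order(proposed: list, required: list) -> bool:
    """Check that every required action appears in `proposed`, in order.

    Hypothetical helper for illustration. Extra actions between required
    steps are tolerated; missing or out-of-order steps fail.
    """
    # Membership tests against an iterator consume it, so each required
    # step must be found strictly after the previous one.
    it = iter(proposed)
    return all(step in it for step in required)


# RP-001: killing the runaway job must precede the postgres restart.
required = ["disable_feature_flag:full_history_export", "restart_service:postgres-db"]
print(remediation_in_order(
    ["disable_feature_flag:full_history_export",
     "restart_service:analytics-service",
     "restart_service:postgres-db"],
    required,
))  # True
```

Swapping the two required steps in the proposed list would return `False`, even though both actions are present.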
+
+
+ # ---------------------------------------------------------------------------
+ # Public API
+ # ---------------------------------------------------------------------------
+
+ def get_task(task_id: str) -> dict:
+     if task_id not in ALL_TASKS:
+         raise ValueError(f"Unknown task_id '{task_id}'. Valid: {list(ALL_TASKS)}")
+     return ALL_TASKS[task_id]
+
+
+ def get_scenario(task_id: str, index: int) -> dict:
+     if task_id not in SCENARIOS:
+         raise ValueError(f"No scenarios for task_id '{task_id}'.")
+     scenarios = SCENARIOS[task_id]
+     if index < 0 or index >= len(scenarios):
+         raise ValueError(
+             f"Scenario index {index} out of range for task '{task_id}' "
+             f"(has {len(scenarios)} scenarios: 0–{len(scenarios)-1})."
+         )
+     return scenarios[index]
+
+
+ def list_tasks() -> list:
+     return list(ALL_TASKS.values())
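Callers typically enumerate every scenario through these helpers. A sketch of that loop with a pared-down stand-in for `SCENARIOS` (the real dict is defined above; `iter_scenarios` is a hypothetical convenience, not part of the module):

```python
# Stand-in for the full scenario bank defined in tasks.py:
SCENARIOS = {
    "root_cause_analysis": [{"scenario_id": "RCA-001"}, {"scenario_id": "RCA-002"}],
    "remediation_planning": [{"scenario_id": "RP-001"}, {"scenario_id": "RP-002"}],
}


def iter_scenarios():
    """Yield (task_id, index, scenario) for every scenario in the bank."""
    for task_id, scenarios in SCENARIOS.items():
        for index, scenario in enumerate(scenarios):
            yield task_id, index, scenario


ids = [s["scenario_id"] for _, _, s in iter_scenarios()]
print(ids)  # ['RCA-001', 'RCA-002', 'RP-001', 'RP-002']
```

An evaluation harness could feed each `(task_id, index)` pair to `get_scenario` and reset the environment between runs.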