Elliot89 committed on
Commit a50dd28 · 1 Parent(s): f97cc91

Initial commit (from agent)

Files changed (13)
  1. Dockerfile +3 -5
  2. README.md +201 -6
  3. graders.py +111 -24
  4. inference.py +118 -149
  5. openenv.yaml +11 -8
  6. pyproject.toml +10 -1
  7. requirements.txt +3 -1
  8. server/__init__.py +2 -0
  9. server/app.py +486 -86
  10. server/environment.py +207 -195
  11. server/models.py +70 -11
  12. tasks.py +857 -371
  13. uv.lock +0 -0
Dockerfile CHANGED
@@ -1,14 +1,12 @@
  FROM python:3.11-slim
 
- COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
-
  WORKDIR /app
 
- COPY pyproject.toml uv.lock ./
- RUN uv sync --frozen --no-dev
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
 
  COPY . .
 
  EXPOSE 7860
 
- CMD ["uv", "run", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
+ CMD ["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,10 +1,205 @@
  ---
- title: Cloud Incident Response Openenv
- emoji: 🌍
- colorFrom: indigo
- colorTo: indigo
+ title: Cloud Incident Response OpenEnv
+ emoji: 🚨
+ colorFrom: red
+ colorTo: yellow
  sdk: docker
+ app_port: 7860
  pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ tags:
+ - openenv
+ - sre
+ - cloud
+ - incident-response
+ - devops
+ - real-world
+ - agentic
+ ---
+
+ # Cloud Incident Response — OpenEnv Environment
+
+ An OpenEnv environment for training and evaluating AI agents on **cloud SRE incident response** — the real-world on-call workflow that engineers at every cloud company perform daily.
+
+ Distinct from Kubernetes operations environments: this focuses on **cross-service cascading failures** in distributed microservice architectures — connection pool exhaustion, CDN cache storms, OOM kills, and BGP network partitions.
+
+ ## Why This Environment
+
+ Every cloud company employs SREs who respond to production incidents under time pressure with incomplete information. This environment simulates the exact decision loop:
+
+ 1. **Triage** — Read alert, assess blast radius, classify severity (P1–P4)
+ 2. **Investigate** — Query logs, metrics, dependencies, recent deploys
+ 3. **Diagnose** — Correlate signals across services to find the root cause
+ 4. **Remediate** — Execute the correct runbook steps in the right sequence
+ 5. **Document** — Submit a resolution summary for post-incident review
+
+ Agents trained here learn the same skills a human SRE uses: service dependency traversal, log correlation, cascading failure analysis, and targeted remediation.
+
+ ## Tasks
+
+ | Task ID | Difficulty | Max Steps | What the Agent Does |
+ |---|---|---|---|
+ | `alert_classification` | Easy | 3 | Classify alert severity (P1–P4) from metrics and symptoms |
+ | `root_cause_analysis` | Medium | 10 | Trace logs/metrics/deps to find root cause service and failure mode |
+ | `remediation_planning` | Hard | 15 | Diagnose, remediate, and document full incident resolution |
+
+ ### Scenarios
+
+ | ID | Incident Type | Root Cause | Failure Pattern |
+ |---|---|---|---|
+ | AC-001 | DB connection pool exhaustion | postgres-db / auth-service deploy | api-gateway → auth-service → postgres-db cascade |
+ | AC-002 | CDN cache invalidation storm | cdn-edge purge cronjob misconfigured | 40× origin traffic spike |
+ | RCA-001 | Postgres OOM kill | analytics-service unbounded query | Kernel OOM → DB crash loop → all dependents down |
+ | RCA-002 | BGP network partition | network-infra config change | Route withdrawal → AZ isolation → 61% checkout failures |
+ | RP-001 | Full OOM remediation | analytics-service | Disable job → restart DB → restore services → document |
+ | RP-002 | Full BGP remediation | network-infra | Restore routes → rollback config → verify recovery → document |
+
+ ## Action Space
+
+ **Diagnostic actions** (gather evidence):
+ ```json
+ {"action_type": "query_logs", "parameters": {"service": "postgres-db"}}
+ {"action_type": "check_metrics", "parameters": {"service": "auth-service"}}
+ {"action_type": "check_dependencies", "parameters": {"service": "api-gateway"}}
+ {"action_type": "check_recent_deploys", "parameters": {"service": "analytics-service"}}
+ {"action_type": "check_service_status", "parameters": {"service": "payment-service"}}
+ ```
+
+ **Remediation actions** (fix the incident):
+ ```json
+ {"action_type": "restart_service", "parameters": {"service": "postgres-db"}}
+ {"action_type": "rollback_deploy", "parameters": {"service": "network-infra", "target_version": "previous"}}
+ {"action_type": "scale_service", "parameters": {"service": "image-service", "replicas": 10}}
+ {"action_type": "disable_feature_flag", "parameters": {"flag": "full_history_export"}}
+ {"action_type": "execute_runbook_step", "parameters": {"runbook_action": "restore_bgp_routes"}}
+ ```
+
+ **Submission actions** (end the episode):
+ ```json
+ {"action_type": "submit_severity", "parameters": {"severity": "P1", "service": "postgres-db"}}
+ {"action_type": "submit_root_cause", "parameters": {"service": "analytics-service", "failure_mode": "unbounded query OOM killing postgres-db"}}
+ {"action_type": "submit_resolution", "parameters": {"summary": "Disabled analytics job, restarted postgres-db..."}}
+ ```
+
+ ## Observation Space
+
+ | Field | Type | Description |
+ |---|---|---|
+ | `episode_id` | string | Unique episode UUID |
+ | `task_id` | string | Active task |
+ | `scenario_id` | string | Scenario (e.g. `AC-001`) |
+ | `step_count` / `max_steps` | int | Current step and budget |
+ | `incident_summary` | string | Plain-text incident description |
+ | `alert` | dict | Alert payload with severity, symptoms, affected services |
+ | `available_actions` | list[str] | Valid action types for this task |
+ | `queried_data` | dict | All tool responses gathered so far |
+ | `known_services` | list[str] | Exact service names to use in actions |
+ | `cumulative_reward` | float | Running reward total |
+ | `done` | bool | Episode terminal flag |
+ | `feedback` | string | Per-step feedback string |
+
+ ## Reward Function
+
+ Dense reward shaping throughout the trajectory:
+
+ | Event | Reward |
+ |---|---|
+ | Query known service (first time) | +0.05 |
+ | Query known service (repeat) | +0.01 |
+ | Query unknown service | −0.05 |
+ | Correct remediation action | +0.10 |
+ | Wrong remediation action | −0.10 |
+ | Step past halfway (non-submit) | −0.02 |
+ | Timeout without submission | −0.10 |
+ | Grader score (terminal step) | 0.0–1.0 |
+
+ **Grader scoring** (deterministic, via `GET /grader`):
+
+ | Task | Scoring Logic |
+ |---|---|
+ | `alert_classification` | 1.0 exact · 0.5 adjacent · 0.25 two-off · 0.0 wrong/none |
+ | `root_cause_analysis` | 0.6 base (svc+mode) + up to 0.4 efficiency bonus |
+ | `remediation_planning` | 0.6 base + 0.3 efficiency − 0.15 wrong penalty + 0.1 summary |
+
+ ## API Endpoints
+
+ | Method | Path | Description |
+ |---|---|---|
+ | GET | `/` | `{"status":"running",...}` — HF Space health |
+ | GET | `/health` | `{"status":"ok","version":"0.1.0"}` |
+ | POST | `/reset?task_id=...&scenario_index=...` | Start new episode |
+ | POST | `/step` | Submit action (JSON body) |
+ | GET | `/state` | Full current episode state |
+ | GET | `/tasks` | All tasks with action schemas |
+ | GET | `/grader` | Score current episode (0.0–1.0) |
+ | POST | `/baseline` | Run inference.py, return scores |
+
+ ## Setup & Usage
+
+ ### Local development
+ ```bash
+ pip install -r requirements.txt
+ uvicorn server.app:app --host 0.0.0.0 --port 7860
+ ```
+
+ ### Docker
+ ```bash
+ docker build -t cloud-incident-env .
+ docker run -p 7860:7860 \
+   -e API_BASE_URL="https://api-inference.huggingface.co/v1" \
+   -e MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
+   -e HF_TOKEN="hf_your_token" \
+   cloud-incident-env
+ ```
+
+ ### Run inference script
+ ```bash
+ export API_BASE_URL="https://api-inference.huggingface.co/v1"
+ export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
+ export HF_TOKEN="hf_your_token"
+ python inference.py
+ ```
+
+ ### Quick API test
+ ```bash
+ # Start new episode
+ curl -X POST "http://localhost:7860/reset?task_id=alert_classification&scenario_index=0"
+
+ # Submit an action
+ curl -X POST http://localhost:7860/step \
+   -H "Content-Type: application/json" \
+   -d '{"action_type":"query_logs","parameters":{"service":"api-gateway"}}'
+
+ # Check score
+ curl http://localhost:7860/grader
+ ```
+
+ ## Baseline Scores
+
+ Using `meta-llama/Llama-3.1-8B-Instruct` via HF Inference API:
+
+ | Task | Scenario 0 | Scenario 1 | Average |
+ |---|---|---|---|
+ | `alert_classification` | ~1.00 | ~0.50 | ~0.75 |
+ | `root_cause_analysis` | ~0.45 | ~0.35 | ~0.40 |
+ | `remediation_planning` | ~0.25 | ~0.20 | ~0.23 |
+ | **overall** | | | **~0.46** |
+
+ *Run `python inference.py` to reproduce.*
+
+ ## Project Structure
 
+ ```
+ .
+ ├── Dockerfile
+ ├── README.md
+ ├── requirements.txt
+ ├── openenv.yaml
+ ├── tasks.py          # Scenario definitions (6 scenarios across 3 tasks)
+ ├── graders.py        # Deterministic graders for all tasks
+ ├── inference.py      # Baseline agent + smart fallback logic
+ └── server/
+     ├── __init__.py
+     ├── app.py          # FastAPI endpoints
+     ├── environment.py  # Core OpenEnv step/reset/state logic
+     └── models.py       # Typed Pydantic models (Action, Observation, Reward)
+ ```
graders.py CHANGED
@@ -5,6 +5,11 @@ Public API:
      grade(task_id, state, scenario) -> {"total": float, "breakdown": dict, "feedback": str}
 
  All scores are in [0.0, 1.0].
+
+ Grading Philosophy:
+ - Easy task: binary-ish — did you get the severity right?
+ - Medium task: partial credit for correct service, bonus for efficiency
+ - Hard task: multi-component — base + efficiency − penalties + summary quality
  """
 
  from __future__ import annotations
@@ -33,6 +38,16 @@ def _svc_match(submitted: str, correct: str) -> bool:
          "auth": "auth-service",
          "api": "api-gateway",
          "api-gw": "api-gateway",
+         "fraud": "fraud-detection-service",
+         "fraud-detection": "fraud-detection-service",
+         "order": "order-service",
+         "orders": "order-service",
+         "image": "image-service",
+         "images": "image-service",
+         "product": "product-service",
+         "products": "product-service",
+         "redis": "redis-session",
+         "redis-cache": "redis-payment-cache",
      }
      return aliases.get(s, s) == c or s == aliases.get(c, c)
 
@@ -45,11 +60,27 @@ def grade(task_id: str, state: dict, scenario: dict) -> dict:
      }
      fn = _graders.get(task_id)
      if fn is None:
-         return {"total": 0.0, "breakdown": {}, "feedback": f"Unknown task_id '{task_id}'"}
+         return {
+             "total": 0.0,
+             "breakdown": {},
+             "feedback": f"Unknown task_id '{task_id}'",
+         }
      return fn(state, scenario)
 
 
- # ── Task 1: Alert Classification ─────────────────────────────────────────────
+ # ── Task 1: Alert Classification (Easy) ──────────────────────────────────────
+ #
+ # Scoring:
+ #   1.0  — exact severity match
+ #   0.5  — adjacent severity (e.g. P1 vs P2)
+ #   0.25 — two levels off (e.g. P1 vs P3)
+ #   0.0  — wrong by 3+ levels or no submission
+ #
+ # This is genuinely EASY: with 3 steps, an agent queries 1–2 services,
+ # reads the error_rate + revenue_impact, and classifies. The data is
+ # unambiguous — the correct answer is clearly derivable from the alert.
+
+
  def _grade_alert_classification(state: dict, scenario: dict) -> dict:
      history = state.get("action_history", [])
      correct = scenario.get("correct_severity", "P1")
@@ -92,7 +123,23 @@ def _grade_alert_classification(state: dict, scenario: dict) -> dict:
      }
 
 
- # ── Task 2: Root Cause Analysis ──────────────────────────────────────────────
+ # ── Task 2: Root Cause Analysis (Medium) ─────────────────────────────────────
+ #
+ # Scoring (total up to 1.0):
+ #   Base (up to 0.6):
+ #     0.60 — correct service AND failure mode keywords match
+ #     0.35 — correct service only
+ #     0.10 — wrong service (partial credit for at least submitting)
+ #   Efficiency bonus (up to 0.4):
+ #     Based on investigation precision: queried relevant services / total queries
+ #     Plus bonus for breadth of investigation (up to 3 unique known services)
+ #
+ # This is genuinely MEDIUM: the root cause is NOT in the alert's
+ # affected_services list. The agent must investigate services outside
+ # the blast radius, correlate log evidence, and identify the upstream
+ # trigger — this requires multi-hop reasoning across 4–6 services.
+
+
  def _grade_root_cause_analysis(state: dict, scenario: dict) -> dict:
      history = state.get("action_history", [])
      correct_rc = scenario.get("correct_root_cause", {})
@@ -101,8 +148,11 @@ def _grade_root_cause_analysis(state: dict, scenario: dict) -> dict:
      known = {s.lower() for s in scenario.get("known_services", set())}
 
      diag_types = {
-         "query_logs", "check_metrics", "check_dependencies",
-         "check_recent_deploys", "check_service_status",
+         "query_logs",
+         "check_metrics",
+         "check_dependencies",
+         "check_recent_deploys",
+         "check_service_status",
      }
 
      sub_svc, sub_mode, sub_step = "", "", len(history)
@@ -139,12 +189,12 @@ def _grade_root_cause_analysis(state: dict, scenario: dict) -> dict:
      efficiency = 0.0
      if svc_match:
          pre_submit = [
-             a for a in history[:sub_step]
+             a
+             for a in history[: sub_step]
              if a.get("action_type") in diag_types
          ]
          queried_svcs = {
-             a.get("parameters", {}).get("service", "").lower()
-             for a in pre_submit
+             a.get("parameters", {}).get("service", "").lower() for a in pre_submit
          }
          relevant = queried_svcs & known
          total_q = len(pre_submit)
@@ -169,7 +219,26 @@ def _grade_root_cause_analysis(state: dict, scenario: dict) -> dict:
      }
 
 
- # ── Task 3: Remediation Planning ─────────────────────────────────────────────
+ # ── Task 3: Remediation Planning (Hard) ──────────────────────────────────────
+ #
+ # Scoring (total up to 1.0):
+ #   Base (0.6 if submitted with any investigation):
+ #     Requires at least 1 diagnostic/remediation action + a summary
+ #   Efficiency bonus (up to 0.3):
+ #     Fraction of correct_remediation_sequence steps matched
+ #   Wrong action penalty (up to -0.15):
+ #     −0.05 per wrong action taken (capped at 3)
+ #   Summary quality bonus (up to 0.10):
+ #     Based on keyword coverage in the resolution summary
+ #
+ # This is genuinely HARD: requires multi-phase execution:
+ #   Phase 1: Diagnose (query logs to confirm root cause)
+ #   Phase 2: Remediate (execute 3–5 specific actions in order)
+ #   Phase 3: Document (write a coherent summary with key details)
+ # Wrong remediation actions actively harm the score. The sequence
+ # matters. The summary must reference specific services and actions.
+
+
  def _grade_remediation_planning(state: dict, scenario: dict) -> dict:
      history = state.get("action_history", [])
      correct_seq = scenario.get("correct_remediation_sequence", [])
@@ -177,10 +246,17 @@ def _grade_remediation_planning(state: dict, scenario: dict) -> dict:
      keywords = scenario.get("resolution_keywords", [])
 
      diag_rem = {
-         "query_logs", "check_metrics", "check_dependencies",
-         "check_recent_deploys", "check_service_status",
-         "restart_service", "rollback_deploy", "scale_service",
-         "disable_feature_flag", "clear_cache", "execute_runbook_step",
+         "query_logs",
+         "check_metrics",
+         "check_dependencies",
+         "check_recent_deploys",
+         "check_service_status",
+         "restart_service",
+         "rollback_deploy",
+         "scale_service",
+         "disable_feature_flag",
+         "clear_cache",
+         "execute_runbook_step",
      }
 
      summary = ""
@@ -195,8 +271,10 @@ def _grade_remediation_planning(state: dict, scenario: dict) -> dict:
          return {
              "total": 0.0,
              "breakdown": {
-                 "base": 0.0, "efficiency": 0.0,
-                 "penalty": 0.0, "summary_bonus": 0.0,
+                 "base": 0.0,
+                 "efficiency": 0.0,
+                 "penalty": 0.0,
+                 "summary_bonus": 0.0,
              },
              "feedback": "No resolution submitted or no investigation — score 0.0",
          }
@@ -212,10 +290,14 @@ def _grade_remediation_planning(state: dict, scenario: dict) -> dict:
          runbook = p.get("runbook_action", "")
          target = p.get("target", "")
          executed.add(at)
-         if svc: executed.add(f"{at}:{svc}")
-         if flag: executed.add(f"{at}:{flag}")
-         if runbook: executed.add(f"execute_runbook_step:{runbook}")
-         if target: executed.add(f"execute_runbook_step:{target}")
+         if svc:
+             executed.add(f"{at}:{svc}")
+         if flag:
+             executed.add(f"{at}:{flag}")
+         if runbook:
+             executed.add(f"execute_runbook_step:{runbook}")
+         if target:
+             executed.add(f"execute_runbook_step:{target}")
 
      def _seq_key_matches(seq_key: str) -> bool:
          if seq_key in executed:
@@ -230,13 +312,18 @@ def _grade_remediation_planning(state: dict, scenario: dict) -> dict:
          return False
 
      matched = sum(1 for k in correct_seq if _seq_key_matches(k))
-     efficiency = round((matched / len(correct_seq)) * 0.3, 4) if correct_seq else 0.0
+     efficiency = (
+         round((matched / len(correct_seq)) * 0.3, 4) if correct_seq else 0.0
+     )
 
      wrong_count = sum(
-         1 for a in history
-         if (a.get("action_type") in wrong_map or
-             f"{a.get('action_type')}:{a.get('parameters', {}).get('service', '')}"
-             in wrong_map)
+         1
+         for a in history
+         if (
+             a.get("action_type") in wrong_map
+             or f"{a.get('action_type')}:{a.get('parameters', {}).get('service', '')}"
+             in wrong_map
+         )
      )
      penalty = round(min(0.15, wrong_count * 0.05), 4)
inference.py CHANGED
@@ -10,6 +10,7 @@ from __future__ import annotations
  import json
  import os
  import sys
 
  import requests
 
@@ -29,8 +30,6 @@ if not HF_TOKEN:
      print("[WARN] No API key set — LLM calls will fail.", file=sys.stderr)
 
  _session = requests.Session()
-
- # Lazy-init OpenAI client to avoid import-time httpx errors
  _client = None
 
 
@@ -42,7 +41,7 @@ def _get_client():
      return _client
 
 
- # ── Which submission action belongs to which task ───────────────────────────
  _TASK_SUBMIT = {
      "alert_classification": "submit_severity",
      "root_cause_analysis": "submit_root_cause",
@@ -66,7 +65,6 @@ _REM_TYPES = frozenset({
  _ALL_VALID = _DIAG_TYPES | _SUBMIT_TYPES | _REM_TYPES
 
 
- # ── System prompt — general SRE strategy, NO scenario answers ───────────────
  SYSTEM_PROMPT = """\
  You are an expert Site Reliability Engineer responding to a production incident.
  Reply with exactly ONE JSON action object. No markdown, no explanation, no extra text.
@@ -86,40 +84,43 @@ VALID ACTIONS:
  {"action_type":"submit_resolution","parameters":{"summary":"<3+ sentence summary>"}}
 
  RULES:
- - Service names MUST exactly match the KNOWN_SERVICES list in the observation.
  - P1 = complete outage OR revenue > $1,000/min. P2 = major degradation.
-   P3 = minor issue. P4 = informational.
- - Root cause = the upstream service that TRIGGERED the cascade. This is often
-   NOT listed in the alert's affected_services. Investigate services not in the
-   alert first.
- - submit_resolution summary must be 3+ sentences: (1) what failed and why,
-   (2) actions you took to fix it, (3) current recovery status.
- - Submit as soon as evidence is clear do NOT waste steps querying more.
 
- TASK-SPECIFIC STRATEGY:
 
  alert_classification (max 3 steps):
-   Query 1-2 affected services for evidence, then submit_severity.
 
  root_cause_analysis (max 10 steps):
-   Investigate services NOT in the alert first (check logs + recent deploys).
-   Look for: OOM kills, BGP withdrawals, config changes, unbounded queries.
-   Submit submit_root_cause with the triggering service and failure mode.
 
  remediation_planning (max 15 steps):
-   1. Query logs to confirm root cause.
-   2. Execute fixes: disable bad jobs, restart crashed services, rollback configs,
-      run runbook steps.
-   3. Submit submit_resolution with a detailed 3-sentence summary.
 
  CRITICAL: Each task has ONE correct submission action:
    alert_classification -> submit_severity
    root_cause_analysis -> submit_root_cause
-   remediation_planning -> submit_resolution
- Do NOT use the wrong submission type for the task."""
 
 
  # ── Helpers ─────────────────────────────────────────────────────────────────
  def _queried_svcs(queried_data: dict) -> set[str]:
      return {
          svc
@@ -130,7 +131,6 @@ def _queried_svcs(queried_data: dict) -> set[str]:
 
 
  def _extract_signals(queried_data: dict) -> list[str]:
-     """Surface key patterns from queried data — shown to LLM."""
      seen: set[str] = set()
      signals: list[str] = []
 
@@ -154,21 +154,25 @@ def _extract_signals(queried_data: dict) -> list[str]:
              _add(f"Cache purge in {svc}")
          if "unbounded" in t or "no limit" in t:
              _add(f"Unbounded query in {svc}")
          if action_type == "check_recent_deploys" and any(
-             x in t for x in ("ago", "change", "update", "added")
          ):
-             snippet = str(data)[:120].replace("\n", " ")
-             _add(f"Recent change in {svc}: {snippet}")
      return signals
 
 
- # ── Message builders ────────────────────────────────────────────────────────
  def _first_obs_msg(obs: dict) -> str:
-     alert = obs.get("alert", {})
-     known = obs.get("known_services", [])
      affected = alert.get("affected_services", [])
-     task_id = obs.get("task_id", "")
-     non_aff = [s for s in known if s not in affected]
 
      lines = [
          "=== NEW INCIDENT ===",
@@ -182,34 +186,37 @@ def _first_obs_msg(obs: dict) -> str:
      if alert.get("title"):
          lines.append(f"  Title: {alert['title']}")
      if affected:
-         lines.append(f"  Directly affected services: {', '.join(affected)}")
      for s in alert.get("symptoms", []):
          lines.append(f"  - {s}")
      for k in ("error_rate", "duration_minutes", "revenue_impact_per_min"):
          if alert.get(k) is not None:
              lines.append(f"  {k}: {alert[k]}")
 
-     lines.append(f"KNOWN_SERVICES (use these EXACT names): {json.dumps(known)}")
 
      if non_aff and task_id in ("root_cause_analysis", "remediation_planning"):
-         lines.append(
-             f" *** These services are NOT in the alert — investigate them "
-             f"for possible root cause: {json.dumps(non_aff)} ***"
-         )
 
      lines.append(f"AVAILABLE ACTIONS: {obs.get('available_actions', [])}")
      lines.append(f"REQUIRED SUBMISSION: {_TASK_SUBMIT.get(task_id, 'unknown')}")
      lines.append("")
-     lines.append("Respond with your first action (JSON only, no markdown):")
      return "\n".join(lines)
 
 
  def _step_msg(obs: dict, prev_queried: dict) -> str:
-     step = obs.get("step_count", 0)
      max_steps = obs.get("max_steps", 10)
-     left = max_steps - step
-     queried = obs.get("queried_data", {})
-     task_id = obs.get("task_id", "")
 
      lines = [
          f"Step {step}/{max_steps} ({left} remaining) | "
@@ -217,7 +224,6 @@ def _step_msg(obs: dict, prev_queried: dict) -> str:
          f"feedback: {obs.get('feedback', '')}",
      ]
 
-     # Show new data received
      new_data = []
      for action_type, services in queried.items():
          prev = prev_queried.get(action_type, {})
@@ -229,35 +235,26 @@ def _step_msg(obs: dict, prev_queried: dict) -> str:
              d = d[:500] + "..."
          new_data.append(f"  [{action_type}][{svc}]: {d}")
      if new_data:
-         lines.append("NEW DATA RECEIVED:")
      lines.extend(new_data)
 
-     # Show extracted signals
      signals = _extract_signals(queried)
      if signals:
-         lines.append("KEY SIGNALS DETECTED:")
      for sig in signals:
          lines.append(f"  *** {sig} ***")
 
-     # Urgency reminders
      if left <= 3:
-         lines.append(
-             f"*** {left} steps remaining — submit "
-             f"{_TASK_SUBMIT.get(task_id, 'your answer')} soon ***"
-         )
      if left <= 1:
-         lines.append(
-             f"!!! LAST STEP — YOU MUST {_TASK_SUBMIT.get(task_id, 'SUBMIT')} NOW !!!"
-         )
 
-     lines.append("Next action (JSON only, no markdown):")
      return "\n".join(lines)
 
 
- # ── Parse LLM output ───────────────────────────────────────────────────────
  def _parse(text: str) -> dict:
      text = text.strip()
-     # Strip markdown code fences
      if text.startswith("`"):
          text = "\n".join(
              ln for ln in text.splitlines() if not ln.startswith("`")
@@ -272,125 +269,92 @@ def _parse(text: str) -> dict:
          raise
 
 
- # ── Fallback — generic, no scenario knowledge ──────────────────────────────
  def _fallback_submit(task_id: str, obs: dict) -> dict:
-     """Minimal correct-type submission. Will score low but won't crash."""
      alert = obs.get("alert", {})
      known = obs.get("known_services", [])
 
      if task_id == "alert_classification":
          rev = alert.get("revenue_impact_per_min", 0) or 0
          err = alert.get("error_rate", 0) or 0
-         sev = "P1" if (rev > 1000 or err > 0.9) else (
-             "P2" if (rev > 100 or err > 0.3) else "P3")
          svc = (alert.get("affected_services") or known or ["unknown"])[0]
-         return {
-             "action_type": "submit_severity",
-             "parameters": {"severity": sev, "service": svc},
-         }
 
      if task_id == "root_cause_analysis":
          svc = known[0] if known else "unknown"
-         return {
-             "action_type": "submit_root_cause",
-             "parameters": {
-                 "service": svc,
-                 "failure_mode": "service failure causing downstream cascade",
-             },
-         }
 
-     # remediation_planning
-     return {
-         "action_type": "submit_resolution",
-         "parameters": {
-             "summary": (
-                 "The incident was investigated through log and metric analysis "
-                 "across affected services. Remediation actions were applied to "
-                 "restore service health. Systems are being monitored for full "
-                 "recovery confirmation."
-             ),
-         },
-     }
 
 
- def _smart_fallback(
-     task_id: str, obs: dict, step: int, max_steps: int
- ) -> dict:
-     """Generic fallback — queries unvisited services, then submits."""
-     known = obs.get("known_services", [])
      queried = obs.get("queried_data", {})
-     left = max_steps - step
-     q_svcs = _queried_svcs(queried)
 
-     # Must submit on final step
      if left <= 1:
          return _fallback_submit(task_id, obs)
 
-     # Alert classification — submit after any query
      if task_id == "alert_classification" and q_svcs:
          return _fallback_submit(task_id, obs)
 
-     # Query next un-queried service
      for svc in known:
          if svc not in q_svcs:
-             return {
-                 "action_type": "query_logs",
-                 "parameters": {"service": svc},
-             }
 
-     # Try check_recent_deploys for unvisited services
      if task_id in ("root_cause_analysis", "remediation_planning"):
          deploy_queried = set(queried.get("check_recent_deploys", {}).keys())
          for svc in known:
              if svc not in deploy_queried:
-                 return {
-                     "action_type": "check_recent_deploys",
-                     "parameters": {"service": svc},
-                 }
 
-     # Everything queried — submit
      return _fallback_submit(task_id, obs)
 
 
- # ── Override — ONLY blocks clearly invalid actions ──────────────────────────
  def _should_override(
      task_id: str, action: dict, obs: dict, step: int, max_steps: int
  ) -> bool:
-     at = action.get("action_type", "")
      params = action.get("parameters", {})
-     left = max_steps - step
-     known = obs.get("known_services", [])
 
-     # 1. Unknown action type
      if at not in _ALL_VALID:
          return True
-
-     # 2. Must submit on last step
      if left <= 0 and at not in _SUBMIT_TYPES:
          return True
 
-     # 3. WRONG submission type for the task
-     #    e.g. submit_severity during remediation_planning
      correct_submit = _TASK_SUBMIT.get(task_id)
      if at in _SUBMIT_TYPES and at != correct_submit:
          return True
 
-     # 4. Service not in known_services (for service-targeted actions)
      svc = (params.get("service") or "").strip()
      if (svc and known
          and at not in ("disable_feature_flag", "execute_runbook_step")
          and svc not in known):
          return True
 
-     # 5. Invalid severity value
      if at == "submit_severity":
          sev = (params.get("severity") or "").upper().strip()
          if sev not in ("P1", "P2", "P3", "P4"):
              return True
 
-     # 6. Empty required fields
      if at == "submit_root_cause":
-         svc = (params.get("service") or "").strip()
          mode = (params.get("failure_mode") or "").strip()
          if not svc or len(mode) < 5:
              return True
@@ -400,14 +364,40 @@ def _should_override(
          if len(summary) < 30:
              return True
 
-     # 7. Remediation action used in alert_classification task
      if task_id == "alert_classification" and at in _REM_TYPES:
          return True
 
      return False
 
 
- # ── Episode runner ──────────────────────────────────────────────────────────
  def _run_episode(task_id: str, scenario_index: int) -> float:
      r = _session.post(
          f"{ENV_BASE_URL}/reset",
@@ -428,24 +418,9 @@ def _run_episode(task_id: str, scenario_index: int) -> float:
      for step_i in range(max_steps):
          current_step = step_i + 1
 
-         # ── Call LLM ─────────────────────────────────────────────────────
-         try:
-             resp = _get_client().chat.completions.create(
-                 model=MODEL_NAME,
-                 messages=messages,
-                 temperature=0.0,
-                 max_tokens=300,
-                 stream=False,
-             )
-             raw = resp.choices[0].message.content or ""
-         except Exception as e:
-             print(f"  [WARN] LLM call failed step {current_step}: {e}",
-                   file=sys.stderr)
444
- raw = ""
445
-
446
  messages.append({"role": "assistant", "content": raw or "{}"})
447
 
448
- # ── Parse ────────────────────────────────────────────────────────
449
  action = None
450
  try:
451
  if raw.strip():
@@ -453,7 +428,6 @@ def _run_episode(task_id: str, scenario_index: int) -> float:
453
  except Exception:
454
  pass
455
 
456
- # ── Fallback / override ──────────────────────────────────────────
457
  if action is None:
458
  action = _smart_fallback(task_id, obs, current_step, max_steps)
459
  print(f" [FALLBACK] step {current_step}: "
@@ -462,15 +436,11 @@ def _run_episode(task_id: str, scenario_index: int) -> float:
462
  old_at = action.get("action_type")
463
  action = _smart_fallback(task_id, obs, current_step, max_steps)
464
  print(f" [OVERRIDE] step {current_step}: "
465
- f"{old_at} -> {action.get('action_type')}",
466
- file=sys.stderr)
467
 
468
- # ── Step ─────────────────────────────────────────────────────────
469
- sr = _session.post(
470
- f"{ENV_BASE_URL}/step", json=action, timeout=30,
471
- )
472
  sr.raise_for_status()
473
- result = sr.json()
474
  new_obs = result["observation"]
475
 
476
  print(
@@ -492,7 +462,6 @@ def _run_episode(task_id: str, scenario_index: int) -> float:
492
  }
493
  obs = new_obs
494
 
495
- # Keep conversation window manageable
496
  if len(messages) > 20:
497
  messages = messages[:2] + messages[-16:]
498
 
@@ -501,15 +470,17 @@ def _run_episode(task_id: str, scenario_index: int) -> float:
501
  return g.json().get("total", 0.0)
502
 
503
 
504
- # ── Entry point ─────────────────────────────────────────────────────────────
505
  def main():
506
  runs = [
507
  ("alert_classification", 0),
508
  ("alert_classification", 1),
 
509
  ("root_cause_analysis", 0),
510
  ("root_cause_analysis", 1),
 
511
  ("remediation_planning", 0),
512
  ("remediation_planning", 1),
 
513
  ]
514
 
515
  results: dict[str, list[float]] = {}
@@ -530,9 +501,7 @@ def main():
530
  results.setdefault(task_id, []).append(score)
531
 
532
  print("-" * 50)
533
- summary = {
534
- t: round(sum(v) / len(v), 4) for t, v in results.items()
535
- }
536
  summary["overall"] = round(sum(summary.values()) / len(summary), 4)
537
 
538
  print("\nScore Summary:")
 
 import json
 import os
 import sys
+import time

 import requests

     print("[WARN] No API key set — LLM calls will fail.", file=sys.stderr)

 _session = requests.Session()
 _client = None

     return _client


+# ── Constants ───────────────────────────────────────────────────────────────
 _TASK_SUBMIT = {
     "alert_classification": "submit_severity",
     "root_cause_analysis": "submit_root_cause",

 _ALL_VALID = _DIAG_TYPES | _SUBMIT_TYPES | _REM_TYPES


 SYSTEM_PROMPT = """\
 You are an expert Site Reliability Engineer responding to a production incident.
 Reply with exactly ONE JSON action object. No markdown, no explanation, no extra text.

 {"action_type":"submit_resolution","parameters":{"summary":"<3+ sentence summary>"}}

 RULES:
+- Service names MUST exactly match the KNOWN_SERVICES list.
 - P1 = complete outage OR revenue > $1,000/min. P2 = major degradation.
+  P3 = minor/partial issue with graceful fallback. P4 = informational.
+- IMPORTANT: check_recent_deploys and check_dependencies require prior
+  investigation. You MUST query_logs or check_metrics on a service BEFORE
+  checking its deploys or dependencies. Otherwise you get limited data.
+- Root cause = the upstream service that TRIGGERED the cascade. Often NOT
+  in the alert's affected_services list.
+- submit_resolution summary: 3+ sentences about what failed, what you did, status.
+- Submit as soon as evidence is clear — do NOT waste steps.

+STRATEGY:

 alert_classification (max 3 steps):
+  Query 1-2 services with logs/metrics, then submit_severity.
+  Check revenue_impact and error_rate carefully. Not all high error rates are P1.

 root_cause_analysis (max 10 steps):
+  1. query_logs or check_metrics on 2-3 services to understand the blast radius
+  2. THEN check_recent_deploys on services that look suspicious
+  3. Look for the service whose deploy/change CAUSED the cascade
+  4. Submit submit_root_cause with service and failure_mode

 remediation_planning (max 15 steps):
+  1. query_logs on affected services to confirm root cause
+  2. Execute remediation actions in logical order
+  3. Verify recovery with check_service_status
+  4. Submit submit_resolution with detailed summary

 CRITICAL: Each task has ONE correct submission action:
   alert_classification -> submit_severity
   root_cause_analysis -> submit_root_cause
+  remediation_planning -> submit_resolution"""


 # ── Helpers ─────────────────────────────────────────────────────────────────
+
 def _queried_svcs(queried_data: dict) -> set[str]:
     return {
         svc


 def _extract_signals(queried_data: dict) -> list[str]:
     seen: set[str] = set()
     signals: list[str] = []

             _add(f"Cache purge in {svc}")
         if "unbounded" in t or "no limit" in t:
             _add(f"Unbounded query in {svc}")
+        if "credential" in t or "password" in t or "authentication failed" in t:
+            _add(f"Credential/auth issue in {svc}")
+        if "requires deeper investigation" in t or "requires initial investigation" in t:
+            _add(f"GATED: {svc} needs logs/metrics first before checking deploys")
         if action_type == "check_recent_deploys" and any(
+            x in t for x in ("ago", "change", "update", "added", "deploy")
         ):
+            if "requires" not in t:  # Don't show gated responses as signals
+                snippet = str(data)[:120].replace("\n", " ")
+                _add(f"Recent change in {svc}: {snippet}")
     return signals


 def _first_obs_msg(obs: dict) -> str:
+    alert = obs.get("alert", {})
+    known = obs.get("known_services", [])
     affected = alert.get("affected_services", [])
+    task_id = obs.get("task_id", "")
+    non_aff = [s for s in known if s not in affected]

     lines = [
         "=== NEW INCIDENT ===",

     if alert.get("title"):
         lines.append(f"  Title: {alert['title']}")
     if affected:
+        lines.append(f"  Directly affected: {', '.join(affected)}")
     for s in alert.get("symptoms", []):
         lines.append(f"  - {s}")
     for k in ("error_rate", "duration_minutes", "revenue_impact_per_min"):
         if alert.get(k) is not None:
             lines.append(f"  {k}: {alert[k]}")

+    lines.append(f"KNOWN_SERVICES: {json.dumps(known)}")

     if non_aff and task_id in ("root_cause_analysis", "remediation_planning"):
+        lines.append(f"  Services NOT in alert (investigate these too): {json.dumps(non_aff)}")

     lines.append(f"AVAILABLE ACTIONS: {obs.get('available_actions', [])}")
     lines.append(f"REQUIRED SUBMISSION: {_TASK_SUBMIT.get(task_id, 'unknown')}")
+
+    if task_id in ("root_cause_analysis", "remediation_planning"):
+        lines.append("")
+        lines.append("NOTE: check_recent_deploys requires prior investigation.")
+        lines.append("You MUST query_logs or check_metrics on a service FIRST.")
+
     lines.append("")
+    lines.append("Respond with your first action (JSON only):")
     return "\n".join(lines)


 def _step_msg(obs: dict, prev_queried: dict) -> str:
+    step = obs.get("step_count", 0)
     max_steps = obs.get("max_steps", 10)
+    left = max_steps - step
+    queried = obs.get("queried_data", {})
+    task_id = obs.get("task_id", "")

     lines = [
         f"Step {step}/{max_steps} ({left} remaining) | "

         f"feedback: {obs.get('feedback', '')}",
     ]

     new_data = []
     for action_type, services in queried.items():
         prev = prev_queried.get(action_type, {})

                 d = d[:500] + "..."
             new_data.append(f"  [{action_type}][{svc}]: {d}")
     if new_data:
+        lines.append("NEW DATA:")
         lines.extend(new_data)

     signals = _extract_signals(queried)
     if signals:
+        lines.append("SIGNALS:")
         for sig in signals:
             lines.append(f"  *** {sig} ***")

     if left <= 3:
+        lines.append(f"*** {left} steps left — submit {_TASK_SUBMIT.get(task_id, '')} soon ***")
     if left <= 1:
+        lines.append(f"!!! LAST STEP — MUST {_TASK_SUBMIT.get(task_id, 'SUBMIT')} NOW !!!")

+    lines.append("Next action (JSON only):")
     return "\n".join(lines)


 def _parse(text: str) -> dict:
     text = text.strip()
     if text.startswith("`"):
         text = "\n".join(
             ln for ln in text.splitlines() if not ln.startswith("`")

         raise


 def _fallback_submit(task_id: str, obs: dict) -> dict:
     alert = obs.get("alert", {})
     known = obs.get("known_services", [])

     if task_id == "alert_classification":
         rev = alert.get("revenue_impact_per_min", 0) or 0
         err = alert.get("error_rate", 0) or 0
+        sev = ("P1" if (rev > 1000 or err > 0.9) else
+               ("P2" if (rev > 100 or err > 0.3) else "P3"))
         svc = (alert.get("affected_services") or known or ["unknown"])[0]
+        return {"action_type": "submit_severity",
+                "parameters": {"severity": sev, "service": svc}}

     if task_id == "root_cause_analysis":
         svc = known[0] if known else "unknown"
+        return {"action_type": "submit_root_cause",
+                "parameters": {"service": svc,
+                               "failure_mode": "service failure causing cascade"}}

+    return {"action_type": "submit_resolution",
+            "parameters": {"summary": (
+                "The incident was investigated through log and metric analysis. "
+                "Remediation actions were applied to restore service health. "
+                "Systems are being monitored for recovery confirmation."
+            )}}


+def _smart_fallback(task_id: str, obs: dict, step: int, max_steps: int) -> dict:
+    known = obs.get("known_services", [])
     queried = obs.get("queried_data", {})
+    left = max_steps - step
+    q_svcs = _queried_svcs(queried)

     if left <= 1:
         return _fallback_submit(task_id, obs)

     if task_id == "alert_classification" and q_svcs:
         return _fallback_submit(task_id, obs)

+    # Query logs on unvisited services first
     for svc in known:
         if svc not in q_svcs:
+            return {"action_type": "query_logs",
+                    "parameters": {"service": svc}}

+    # Then try check_recent_deploys (will now work since we queried logs)
     if task_id in ("root_cause_analysis", "remediation_planning"):
         deploy_queried = set(queried.get("check_recent_deploys", {}).keys())
         for svc in known:
             if svc not in deploy_queried:
+                return {"action_type": "check_recent_deploys",
+                        "parameters": {"service": svc}}

     return _fallback_submit(task_id, obs)


 def _should_override(
     task_id: str, action: dict, obs: dict, step: int, max_steps: int
 ) -> bool:
+    at = action.get("action_type", "")
     params = action.get("parameters", {})
+    left = max_steps - step
+    known = obs.get("known_services", [])

     if at not in _ALL_VALID:
         return True
     if left <= 0 and at not in _SUBMIT_TYPES:
         return True

     correct_submit = _TASK_SUBMIT.get(task_id)
     if at in _SUBMIT_TYPES and at != correct_submit:
         return True

     svc = (params.get("service") or "").strip()
     if (svc and known
             and at not in ("disable_feature_flag", "execute_runbook_step")
             and svc not in known):
         return True

     if at == "submit_severity":
         sev = (params.get("severity") or "").upper().strip()
         if sev not in ("P1", "P2", "P3", "P4"):
             return True

     if at == "submit_root_cause":
+        svc = (params.get("service") or "").strip()
         mode = (params.get("failure_mode") or "").strip()
         if not svc or len(mode) < 5:
             return True

         if len(summary) < 30:
             return True

     if task_id == "alert_classification" and at in _REM_TYPES:
         return True

     return False

+def _llm_call_with_retry(messages: list, max_retries: int = 2) -> str:
+    """Call LLM with retry on rate limit errors."""
+    for attempt in range(max_retries + 1):
+        try:
+            resp = _get_client().chat.completions.create(
+                model=MODEL_NAME,
+                messages=messages,
+                temperature=0.0,
+                max_tokens=300,
+                stream=False,
+            )
+            return resp.choices[0].message.content or ""
+        except Exception as e:
+            err_str = str(e).lower()
+            if "rate_limit" in err_str or "429" in err_str:
+                if attempt < max_retries:
+                    # Linear backoff: wait 10s after the first hit, 20s after the second
+                    wait = 10 * (attempt + 1)
+                    print(f"  [RATE LIMIT] waiting {wait}s (attempt {attempt + 1})",
+                          file=sys.stderr)
+                    time.sleep(wait)
+                    continue
+            if attempt == max_retries:
+                print(f"  [WARN] LLM call failed: {e}", file=sys.stderr)
+                return ""
+    return ""
+
 def _run_episode(task_id: str, scenario_index: int) -> float:
     r = _session.post(
         f"{ENV_BASE_URL}/reset",

     for step_i in range(max_steps):
         current_step = step_i + 1

+        raw = _llm_call_with_retry(messages)
         messages.append({"role": "assistant", "content": raw or "{}"})

         action = None
         try:
             if raw.strip():

         except Exception:
             pass

         if action is None:
             action = _smart_fallback(task_id, obs, current_step, max_steps)
             print(f"  [FALLBACK] step {current_step}: "

             old_at = action.get("action_type")
             action = _smart_fallback(task_id, obs, current_step, max_steps)
             print(f"  [OVERRIDE] step {current_step}: "
+                  f"{old_at} -> {action.get('action_type')}", file=sys.stderr)

+        sr = _session.post(f"{ENV_BASE_URL}/step", json=action, timeout=30)
         sr.raise_for_status()
+        result = sr.json()
         new_obs = result["observation"]

         print(

         }
         obs = new_obs

         if len(messages) > 20:
             messages = messages[:2] + messages[-16:]

     return g.json().get("total", 0.0)


 def main():
     runs = [
         ("alert_classification", 0),
         ("alert_classification", 1),
+        ("alert_classification", 2),
         ("root_cause_analysis", 0),
         ("root_cause_analysis", 1),
+        ("root_cause_analysis", 2),
         ("remediation_planning", 0),
         ("remediation_planning", 1),
+        ("remediation_planning", 2),
     ]

     results: dict[str, list[float]] = {}

         results.setdefault(task_id, []).append(score)

     print("-" * 50)
+    summary = {t: round(sum(v) / len(v), 4) for t, v in results.items()}
     summary["overall"] = round(sum(summary.values()) / len(summary), 4)

     print("\nScore Summary:")
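The fence-stripping done by `_parse` above can be illustrated standalone. This is a sketch mirroring that logic (the function name `parse_action` and the `checkout-api` service are local to this example, not names from the repo): it drops any markdown code-fence lines the model wraps around its reply, then parses the remainder as one JSON action.

```python
import json

def parse_action(text: str) -> dict:
    # Drop code-fence lines (```json ... ```) that models often emit,
    # then parse the remaining body as a single JSON action object.
    text = text.strip()
    if text.startswith("`"):
        text = "\n".join(
            ln for ln in text.splitlines() if not ln.startswith("`")
        )
    return json.loads(text)

raw = '```json\n{"action_type": "query_logs", "parameters": {"service": "checkout-api"}}\n```'
print(parse_action(raw)["action_type"])  # query_logs
```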
openenv.yaml CHANGED
@@ -4,11 +4,11 @@ app_port: 7860
 description: >
   OpenEnv environment simulating real-world cloud SRE on-call incident response.
   Distinct from Kubernetes ops — focuses on cross-service cascading failures,
-  network partitions, OOM kills, and CDN storms across distributed systems.
-  An AI agent classifies alert severity, performs root cause analysis through
-  log/metric/dependency queries, and executes remediation sequences to resolve
-  production incidents end-to-end.
-author: Elliot89
 license: MIT
 tags:
   - openenv
@@ -28,6 +28,7 @@ tasks:
     description: >
       Classify incoming alert severity (P1-P4) by querying
      logs and metrics across affected cloud services.

   - id: root_cause_analysis
     name: "Task 2: Root Cause Analysis"
@@ -37,7 +38,8 @@ tasks:
     description: >
       Trace a live incident through logs, metrics, dependencies,
       and recent deploys to identify the exact root cause service
-      and failure mode across a distributed system.

   - id: remediation_planning
     name: "Task 3: Incident Remediation"
@@ -46,8 +48,9 @@ tasks:
     score_range: [0.0, 1.0]
     description: >
       Fully resolve a production incident end-to-end: diagnose
-      the root cause, execute the correct remediation sequence,
-      and submit a documented resolution summary.

 endpoints:
   health: "GET /health"

 description: >
   OpenEnv environment simulating real-world cloud SRE on-call incident response.
   Distinct from Kubernetes ops — focuses on cross-service cascading failures,
+  network partitions, OOM kills, credential rotation failures, and CDN storms
+  across distributed systems. An AI agent classifies alert severity, performs
+  root cause analysis through log/metric/dependency queries, and executes
+  remediation sequences to resolve production incidents end-to-end.
+author: Einstein_Sidra
 license: MIT
 tags:
   - openenv

     description: >
       Classify incoming alert severity (P1-P4) by querying
       logs and metrics across affected cloud services.
+      Target baseline: 0.75-1.0 with 8B model.

   - id: root_cause_analysis
     name: "Task 2: Root Cause Analysis"

     description: >
       Trace a live incident through logs, metrics, dependencies,
       and recent deploys to identify the exact root cause service
+      and failure mode. Root cause is NOT in the alert.
+      Target baseline: 0.35-0.60 with 8B model.

   - id: remediation_planning
     name: "Task 3: Incident Remediation"

     score_range: [0.0, 1.0]
     description: >
       Fully resolve a production incident end-to-end: diagnose
+      the root cause, execute the correct multi-step remediation
+      sequence, and submit a documented resolution summary.
+      Wrong actions penalized. Target baseline: 0.20-0.45 with 8B model.

 endpoints:
   health: "GET /health"
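The P1/P2/P3 boundaries the alert_classification task works with are stated in inference.py's system prompt and its fallback heuristic. As a rough sketch (thresholds copied from `_fallback_submit`, not from the environment's grader, so treat them as a baseline heuristic only):

```python
def classify_severity(revenue_impact_per_min: float, error_rate: float) -> str:
    # P1: outage-level impact; P2: major degradation; else P3.
    # Thresholds mirror inference.py's fallback, not the grader.
    if revenue_impact_per_min > 1000 or error_rate > 0.9:
        return "P1"
    if revenue_impact_per_min > 100 or error_rate > 0.3:
        return "P2"
    return "P3"

print(classify_severity(1500, 0.05))  # P1
print(classify_severity(50, 0.5))     # P2
print(classify_severity(10, 0.01))    # P3
```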
pyproject.toml CHANGED
@@ -13,4 +13,13 @@ dependencies = [
     "openai>=1.58.0",
     "httpx>=0.27.0,<0.29.0",
     "python-dotenv>=1.0.0",
-]

     "openai>=1.58.0",
     "httpx>=0.27.0,<0.29.0",
     "python-dotenv>=1.0.0",
+    "gradio>=4.0.0,<6.0.0",
+    "openenv-core>=0.2.0",
+]
+
+[project.scripts]
+server = "server.app:main"
+
+[build-system]
+requires = ["setuptools>=68.0"]
+build-backend = "setuptools.build_meta"
requirements.txt CHANGED
@@ -4,4 +4,6 @@ pydantic>=2.0.0
 requests>=2.31.0
 openai>=1.58.0
 httpx>=0.27.0,<0.29.0
-python-dotenv>=1.0.0

 requests>=2.31.0
 openai>=1.58.0
 httpx>=0.27.0,<0.29.0
+python-dotenv>=1.0.0
+gradio>=4.0.0,<6.0.0
+openenv-core>=0.2.0
server/__init__.py CHANGED
@@ -0,0 +1,2 @@
+"""Cloud Incident Response — OpenEnv server package."""
+__version__ = "0.1.0"
server/app.py CHANGED
@@ -1,15 +1,14 @@
 """
-server/app.py — FastAPI server for Cloud Incident Response OpenEnv.
-
-Endpoints:
-  GET  /          JSON health/status (triggers HF Space "Running" badge)
-  GET  /health    Lightweight health check
-  POST /reset     Start new episode
-  POST /step      Submit action
-  GET  /state     Current episode state
-  GET  /tasks     All tasks with action schemas
-  GET  /grader    Score current episode
-  POST /baseline  Run inference.py end-to-end, return score summary
 """

 from __future__ import annotations
@@ -19,26 +18,24 @@ import os
 import subprocess
 import sys

-# Ensure project root is on sys.path regardless of working directory
 sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

 from contextlib import asynccontextmanager
-from fastapi import FastAPI, HTTPException, Query
 from fastapi.middleware.cors import CORSMiddleware

-from server.models import Action
 from server.environment import IncidentEnvironment
-from tasks import list_tasks, ALL_TASKS

 _ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

-# ── Global env instance ──────────────────────────────────────────────────────
 _env: IncidentEnvironment | None = None


 @asynccontextmanager
 async def lifespan(app: FastAPI):
-    """Initialise heavy objects after the server is already accepting requests."""
     global _env
     _env = IncidentEnvironment()
     yield
@@ -46,10 +43,13 @@ async def lifespan(app: FastAPI):

 def _get_env() -> IncidentEnvironment:
     if _env is None:
-        raise HTTPException(
-            status_code=503,
-            detail="Environment initialising — retry in a moment",
-        )
     return _env

@@ -58,7 +58,7 @@ app = FastAPI(
     version="0.1.0",
     description=(
         "OpenEnv environment for training AI agents on cloud SRE incident response. "
-        "Covers cascading failures, OOM kills, CDN storms, and network partitions."
     ),
     lifespan=lifespan,
 )
@@ -71,98 +71,134 @@ app.add_middleware(
 )


-# ── Root plain JSON so HF Space flips badge to Running ─────────────────────
-
-@app.get("/")
-def root():
-    return {
-        "status": "running",
-        "name": "cloud-incident-response",
-        "version": "0.1.0",
-        "description": "OpenEnv environment for cloud SRE incident response",
-        "tasks": ["alert_classification", "root_cause_analysis", "remediation_planning"],
-        "docs": "/docs",
-        "health": "/health",
-    }


-# ── Core endpoints ────────────────────────────────────────────────────────────
-
 @app.get("/health")
 def health():
     return {"status": "ok", "version": "0.1.0"}


 @app.post("/reset")
-def reset(
-    task_id: str = Query(default="alert_classification"),
-    scenario_index: int = Query(default=0),
-):
-    """Start a new episode. Returns the initial observation."""
     env = _get_env()
     try:
         obs = env.reset(task_id=task_id, scenario_index=scenario_index)
         return obs.model_dump()
     except ValueError as e:
-        raise HTTPException(status_code=400, detail=str(e))
     except Exception as e:
-        raise HTTPException(status_code=500, detail=str(e))


 @app.post("/step")
 def step(action: Action):
-    """Submit one action. Returns observation, reward, done, info."""
     env = _get_env()
     try:
         obs, reward, done, info = env.step(action)
         return {
             "observation": obs.model_dump(),
-            "reward": reward.model_dump(),
-            "done": done,
-            "info": info,
         }
     except RuntimeError as e:
-        raise HTTPException(status_code=400, detail=str(e))
     except Exception as e:
-        raise HTTPException(status_code=500, detail=str(e))


 @app.get("/state")
 def state():
-    """Return the full current episode state."""
     env = _get_env()
     try:
         return env.state().model_dump()
     except RuntimeError as e:
-        raise HTTPException(status_code=400, detail=str(e))
     except Exception as e:
-        raise HTTPException(status_code=500, detail=str(e))


 @app.get("/tasks")
 def tasks():
-    """Return all tasks with descriptions and action schemas."""
     return {
         "tasks": list_tasks(),
         "total": len(ALL_TASKS),
         "action_schema": {
             "diagnostic": [
-                {"action_type": "query_logs", "parameters": {"service": "string"}},
-                {"action_type": "check_metrics", "parameters": {"service": "string"}},
-                {"action_type": "check_dependencies", "parameters": {"service": "string"}},
                 {"action_type": "check_recent_deploys", "parameters": {"service": "string"}},
                 {"action_type": "check_service_status", "parameters": {"service": "string"}},
             ],
             "remediation": [
-                {"action_type": "restart_service", "parameters": {"service": "string"}},
-                {"action_type": "rollback_deploy", "parameters": {"service": "string", "target_version": "string"}},
-                {"action_type": "scale_service", "parameters": {"service": "string", "replicas": "int"}},
                 {"action_type": "disable_feature_flag", "parameters": {"flag": "string"}},
-                {"action_type": "clear_cache", "parameters": {"service": "string"}},
-                {"action_type": "execute_runbook_step", "parameters": {"runbook_action": "string", "target": "string"}},
             ],
             "submission": [
-                {"action_type": "submit_severity", "parameters": {"severity": "P1|P2|P3|P4", "service": "string"}},
                 {"action_type": "submit_root_cause", "parameters": {"service": "string", "failure_mode": "string"}},
                 {"action_type": "submit_resolution", "parameters": {"summary": "string"}},
             ],
@@ -172,59 +208,423 @@ def tasks():

 @app.get("/grader")
 def grader():
-    """Score the current episode. Returns total in [0.0, 1.0]."""
     env = _get_env()
     try:
         s = env.state()
         from graders import grade
         result = grade(s.task_id, s.model_dump(), env._scenario)
         return {
-            "total": result["total"],
-            "breakdown": result["breakdown"],
-            "feedback": result["feedback"],
-            "task_id": s.task_id,
             "scenario_id": s.scenario_id,
-            "steps_used": s.step_count,
-            "done": s.done,
         }
     except RuntimeError as e:
-        raise HTTPException(status_code=400, detail=str(e))
     except Exception as e:
-        raise HTTPException(status_code=500, detail=str(e))


 @app.post("/baseline")
 def baseline():
-    """Run inference.py and return the JSON score summary."""
     script = os.path.join(_ROOT, "inference.py")
     if not os.path.exists(script):
-        raise HTTPException(
-            status_code=500,
-            detail="inference.py not found in project root",
-        )
     try:
         result = subprocess.run(
             [sys.executable, script],
-            capture_output=True,
-            text=True,
-            timeout=1200,
-            cwd=_ROOT,
             env={**os.environ, "ENV_BASE_URL": "http://localhost:7860"},
         )
     except subprocess.TimeoutExpired:
-        raise HTTPException(status_code=500, detail="inference.py timed out (>20 min)")

     if result.returncode != 0:
-        raise HTTPException(status_code=500, detail=result.stderr[-2000:])

     lines = result.stdout.strip().splitlines()
-    last = lines[-1] if lines else ""
     try:
         return json.loads(last)
     except Exception:
         return {"raw_output": result.stdout[-3000:]}


-if __name__ == "__main__":
     import uvicorn
-    uvicorn.run("server.app:app", host="0.0.0.0", port=7860, reload=False)
1
  """
2
+ server/app.py — FastAPI + Gradio server for Cloud Incident Response OpenEnv.
3
+
4
+ Endpoints (OpenEnv spec):
5
+ GET /health {"status": "ok"}
6
+ POST /reset → Observation (accepts JSON body or query params)
7
+ POST /step → {"observation": ..., "reward": ..., "done": ..., "info": ...}
8
+ GET /state → EpisodeState
9
+ GET /tasks → task list with action schemas
10
+ GET /grader → grading result for current episode
11
+ POST /baseline run inference.py
 
12
  """
13
 
14
  from __future__ import annotations
 
18
  import subprocess
19
  import sys
20
 
 
21
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
22
 
23
  from contextlib import asynccontextmanager
24
+
25
+ from fastapi import FastAPI, HTTPException, Request
26
  from fastapi.middleware.cors import CORSMiddleware
27
 
 
28
  from server.environment import IncidentEnvironment
29
+ from server.models import Action, ActionParameters
30
+ from tasks import ALL_TASKS, list_tasks
31
 
32
  _ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
33
 
 
34
  _env: IncidentEnvironment | None = None
35
 
36
 
37
  @asynccontextmanager
38
  async def lifespan(app: FastAPI):
 
39
  global _env
40
  _env = IncidentEnvironment()
41
  yield
 
43
 
44
  def _get_env() -> IncidentEnvironment:
45
  if _env is None:
46
+ raise HTTPException(503, "Environment initialising — retry in a moment")
47
+ return _env
48
+
49
+
50
+ def _get_env_direct() -> IncidentEnvironment:
51
+ if _env is None:
52
+ raise RuntimeError("Environment not initialised yet")
53
  return _env
54
 
55
 
 
58
  version="0.1.0",
59
  description=(
60
  "OpenEnv environment for training AI agents on cloud SRE incident response. "
61
+ "Implements step()/reset()/state() API with typed Observation, Action, and Reward models."
62
  ),
63
  lifespan=lifespan,
64
  )
 
71
  )
72
 
73
 
74
+ # ── OpenEnv API Endpoints ─────────────────────────────────────────────────────
 
 
 
 
 
 
 
 
 
 
 
 
75
 
76
 
 
 
77
  @app.get("/health")
78
  def health():
79
+ """Health check endpoint."""
80
  return {"status": "ok", "version": "0.1.0"}
81
 
82
 
83
+ @app.get("/api/info")
84
+ def api_info():
85
+ """Environment metadata."""
86
+ return {
87
+ "status": "running",
88
+ "name": "cloud-incident-response",
89
+ "version": "0.1.0",
90
+ "description": "OpenEnv environment for cloud SRE incident response",
91
+ "tasks": list(ALL_TASKS.keys()),
92
+ "docs": "/docs",
93
+ }
94
+
95
+
96
  @app.post("/reset")
97
+ async def reset(request: Request):
98
+ """Reset the environment and start a new episode.
99
+
100
+ Accepts task_id and scenario_index via:
101
+ - Query parameters: /reset?task_id=...&scenario_index=...
102
+ - JSON body: {"task_id": "...", "scenario_index": 0}
103
+ - Empty body: uses defaults (alert_classification, scenario 0)
104
+
105
+ Returns: Observation dict
106
+ """
107
+ task_id = "alert_classification"
108
+ scenario_index = 0
109
+
110
+ # Parse query params
111
+ qp = request.query_params
112
+ if qp.get("task_id"):
113
+ task_id = qp["task_id"]
114
+ if qp.get("scenario_index"):
115
+ try:
116
+ scenario_index = int(qp["scenario_index"])
117
+ except ValueError:
118
+ pass
119
+
120
+ # Parse JSON body (may be empty {} or have fields)
121
+ try:
122
+ body = await request.json()
123
+ if isinstance(body, dict):
124
+ task_id = body.get("task_id", task_id)
125
+ si = body.get("scenario_index")
126
+ if si is not None:
127
+ scenario_index = int(si)
128
+ except Exception:
129
+ pass # Empty body or non-JSON is fine — use defaults
130
+
131
  env = _get_env()
132
  try:
133
  obs = env.reset(task_id=task_id, scenario_index=scenario_index)
134
  return obs.model_dump()
135
  except ValueError as e:
136
+ raise HTTPException(400, str(e))
137
  except Exception as e:
138
+ raise HTTPException(500, str(e))
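The `/reset` argument handling above (defaults, then query parameters, then JSON body, with body values winning) can be factored into a small pure helper. This is an illustrative sketch; `resolve_reset_args` is not part of the module:

```python
def resolve_reset_args(query: dict, body) -> tuple:
    """Mirror /reset precedence: defaults -> query params -> JSON body."""
    task_id = "alert_classification"
    scenario_index = 0
    if query.get("task_id"):
        task_id = query["task_id"]
    if query.get("scenario_index"):
        try:
            scenario_index = int(query["scenario_index"])
        except ValueError:
            pass  # keep the default on a malformed value, as the endpoint does
    if isinstance(body, dict):  # body may be None when no JSON was sent
        task_id = body.get("task_id", task_id)
        si = body.get("scenario_index")
        if si is not None:
            scenario_index = int(si)
    return task_id, scenario_index
```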
139
 
140
 
141
  @app.post("/step")
142
  def step(action: Action):
143
+ """Take one step in the environment.
144
+
145
+ Accepts: Action JSON body with action_type and parameters
146
+ Returns: {"observation": {...}, "reward": {...}, "done": bool, "info": {...}}
147
+ """
148
  env = _get_env()
149
  try:
150
  obs, reward, done, info = env.step(action)
151
  return {
152
  "observation": obs.model_dump(),
153
+ "reward": reward.model_dump(),
154
+ "done": done,
155
+ "info": info,
156
  }
157
  except RuntimeError as e:
158
+ raise HTTPException(400, str(e))
159
  except Exception as e:
160
+ raise HTTPException(500, str(e))
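A minimal client-side episode loop over the `/reset` and `/step` payloads above might look like this sketch. The transport is injected so any HTTP client works; `reset_fn`, `step_fn`, and `policy` are assumed callables, not part of this repo:

```python
def run_episode(reset_fn, step_fn, policy, max_steps=20):
    """Drive one episode: reset, then step until done or max_steps."""
    obs = reset_fn()  # -> Observation dict
    cumulative = 0.0
    for _ in range(max_steps):
        # step_fn returns the {"observation", "reward", "done", "info"} payload
        result = step_fn(policy(obs))
        obs = result["observation"]
        cumulative = result["reward"]["cumulative"]
        if result["done"]:
            break
    return obs, cumulative
```

With `requests`, `step_fn` could be `lambda a: requests.post(f"{base}/step", json=a).json()` (hypothetical base URL).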
161
 
162
 
163
  @app.get("/state")
164
  def state():
165
+ """Get the current episode state.
166
+
167
+ Returns: EpisodeState dict with full action history and internal state
168
+ """
169
  env = _get_env()
170
  try:
171
  return env.state().model_dump()
172
  except RuntimeError as e:
173
+ raise HTTPException(400, str(e))
174
  except Exception as e:
175
+ raise HTTPException(500, str(e))
176
 
177
 
178
  @app.get("/tasks")
179
  def tasks():
180
+ """List all available tasks with action schemas."""
181
  return {
182
  "tasks": list_tasks(),
183
  "total": len(ALL_TASKS),
184
  "action_schema": {
185
  "diagnostic": [
186
+ {"action_type": "query_logs", "parameters": {"service": "string"}},
187
+ {"action_type": "check_metrics", "parameters": {"service": "string"}},
188
+ {"action_type": "check_dependencies", "parameters": {"service": "string"}},
189
  {"action_type": "check_recent_deploys", "parameters": {"service": "string"}},
190
  {"action_type": "check_service_status", "parameters": {"service": "string"}},
191
  ],
192
  "remediation": [
193
+ {"action_type": "restart_service", "parameters": {"service": "string"}},
194
+ {"action_type": "rollback_deploy", "parameters": {"service": "string", "target_version": "string"}},
195
+ {"action_type": "scale_service", "parameters": {"service": "string", "replicas": "int"}},
196
  {"action_type": "disable_feature_flag", "parameters": {"flag": "string"}},
197
+ {"action_type": "clear_cache", "parameters": {"service": "string"}},
198
+ {"action_type": "execute_runbook_step", "parameters": {"runbook_action": "string"}},
199
  ],
200
  "submission": [
201
+ {"action_type": "submit_severity", "parameters": {"severity": "P1|P2|P3|P4", "service": "string"}},
202
  {"action_type": "submit_root_cause", "parameters": {"service": "string", "failure_mode": "string"}},
203
  {"action_type": "submit_resolution", "parameters": {"summary": "string"}},
204
  ],
 
208
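A `/step` request body matching the `action_schema` above can be assembled with a small helper that drops unset parameters, the way the UI does. Illustrative only; `make_action` is not part of the server:

```python
def make_action(action_type, **parameters):
    """Build an Action payload; None-valued parameters are omitted."""
    return {
        "action_type": action_type,
        "parameters": {k: v for k, v in parameters.items() if v is not None},
    }
```

For example, `make_action("rollback_deploy", service="checkout", target_version="previous")` (service name hypothetical).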
 
209
  @app.get("/grader")
210
  def grader():
211
+ """Grade the current episode. Returns score 0.0-1.0 with breakdown."""
212
  env = _get_env()
213
  try:
214
  s = env.state()
215
  from graders import grade
216
  result = grade(s.task_id, s.model_dump(), env._scenario)
217
  return {
218
+ "total": result["total"],
219
+ "breakdown": result["breakdown"],
220
+ "feedback": result["feedback"],
221
+ "task_id": s.task_id,
222
  "scenario_id": s.scenario_id,
223
+ "steps_used": s.step_count,
224
+ "done": s.done,
225
  }
226
  except RuntimeError as e:
227
+ raise HTTPException(400, str(e))
228
  except Exception as e:
229
+ raise HTTPException(500, str(e))
230
 
231
 
232
  @app.post("/baseline")
233
  def baseline():
234
+ """Run the baseline inference script and return results."""
235
  script = os.path.join(_ROOT, "inference.py")
236
  if not os.path.exists(script):
237
+ raise HTTPException(500, "inference.py not found")
 
 
 
238
  try:
239
  result = subprocess.run(
240
  [sys.executable, script],
241
+ capture_output=True, text=True, timeout=1200, cwd=_ROOT,
242
  env={**os.environ, "ENV_BASE_URL": "http://localhost:7860"},
243
  )
244
  except subprocess.TimeoutExpired:
245
+ raise HTTPException(500, "inference.py timed out (>20 min)")
246
 
247
  if result.returncode != 0:
248
+ raise HTTPException(500, result.stderr[-2000:])
249
 
250
  lines = result.stdout.strip().splitlines()
251
+ last = lines[-1] if lines else ""
252
  try:
253
  return json.loads(last)
254
  except Exception:
255
  return {"raw_output": result.stdout[-3000:]}
256
 
257
 
258
+ # ── Gradio UI ─────────────────────────────────────────────────────────────────
259
+
260
+ import gradio as gr
261
+
262
+ DIFFICULTY_BADGE = {
263
+ "alert_classification": "🟢 Easy",
264
+ "root_cause_analysis": "🟡 Medium",
265
+ "remediation_planning": "🔴 Hard",
266
+ }
267
+
268
+ DIFFICULTY_INFO = {
269
+ "alert_classification": "3 steps · Classify severity P1–P4",
270
+ "root_cause_analysis": "10 steps · Find root cause service + failure mode",
271
+ "remediation_planning": "15 steps · Diagnose, fix, and document",
272
+ }
273
+
274
+ SUBMIT_ACTION = {
275
+ "alert_classification": "submit_severity",
276
+ "root_cause_analysis": "submit_root_cause",
277
+ "remediation_planning": "submit_resolution",
278
+ }
279
+
280
+ _DIAG_ACTIONS = [
281
+ "query_logs", "check_metrics", "check_dependencies",
282
+ "check_recent_deploys", "check_service_status",
283
+ ]
284
+ _REM_ACTIONS = [
285
+ "restart_service", "rollback_deploy", "scale_service",
286
+ "disable_feature_flag", "clear_cache", "execute_runbook_step",
287
+ ]
288
+
289
+
290
+ def _fmt_obs(obs: dict) -> str:
291
+ lines = []
292
+ lines.append(f"### 📋 Scenario `{obs.get('scenario_id', '—')}`\n")
293
+ summary = obs.get("incident_summary", "")
294
+ if summary:
295
+ lines.append(f"> {summary[:600]}\n")
296
+ alert = obs.get("alert", {})
297
+ if alert:
298
+ lines.append("#### 🔔 Alert Details\n")
299
+ if alert.get("title"):
300
+ lines.append(f"**Title:** {alert['title']}\n")
301
+ symptoms = alert.get("symptoms", [])
302
+ if symptoms:
303
+ lines.append("**Symptoms:**")
304
+ for s in symptoms:
305
+ lines.append(f"- {s}")
306
+ lines.append("")
307
+ info_items = []
308
+ if alert.get("error_rate") is not None:
309
+ info_items.append(f"Error Rate: **{alert['error_rate']:.0%}**")
310
+ if alert.get("duration_minutes") is not None:
311
+ info_items.append(f"Duration: **{alert['duration_minutes']} min**")
312
+ if alert.get("revenue_impact_per_min") is not None:
313
+ info_items.append(f"Revenue: **${alert['revenue_impact_per_min']:,.0f}/min**")
314
+ if info_items:
315
+ lines.append(" · ".join(info_items) + "\n")
316
+ known = obs.get("known_services", [])
317
+ if known:
318
+ lines.append(f"#### 🖥️ Known Services\n`{'` · `'.join(known)}`\n")
319
+ task_id = obs.get("task_id", "")
320
+ submit = SUBMIT_ACTION.get(task_id, "")
321
+ if submit:
322
+ diff = DIFFICULTY_INFO.get(task_id, "")
323
+ lines.append(f"#### 📝 Submit: `{submit}`")
324
+ if diff:
325
+ lines.append(f"*{diff}*\n")
326
+ err = obs.get("last_action_error")
327
+ if err:
328
+ lines.append(f"#### ⚠️ Last Action Error\n`{err}`\n")
329
+ qd = obs.get("queried_data", {})
330
+ if qd:
331
+ lines.append("---\n#### 📊 Evidence Collected\n")
332
+ for action_type, services in qd.items():
333
+ if isinstance(services, dict):
334
+ for svc, data in services.items():
335
+ d = str(data)
336
+ if len(d) > 400:
337
+ d = d[:400] + " …"
338
+ lines.append(f"**`[{action_type}]` → `{svc}`**")
339
+ lines.append(f"```\n{d}\n```\n")
340
+ return "\n".join(lines)
341
+
342
+
343
+ def _fmt_state(s: dict) -> str:
344
+ task_id = s.get("task_id", "—")
345
+ diff = DIFFICULTY_BADGE.get(task_id, "")
346
+ done = s.get("done", False)
347
+ status = "🏁 Complete" if done else "⚡ Active"
348
+ step_count = s.get("step_count", 0)
349
+ max_steps = s.get("max_steps", 0)
350
+ cum_reward = s.get("cumulative_reward", 0.0)
351
+ pct = (step_count / max_steps * 100) if max_steps > 0 else 0
352
+ bar_filled = int(pct / 5)
353
+ bar = "█" * bar_filled + "░" * (20 - bar_filled)
354
+
355
+ return (
356
+ f"### {status}\n\n"
357
+ f"| Field | Value |\n|---|---|\n"
358
+ f"| **Task** | `{task_id}` {diff} |\n"
359
+ f"| **Episode** | `{s.get('episode_id', '—')[:12]}…` |\n"
360
+ f"| **Progress** | {step_count}/{max_steps} `{bar}` {pct:.0f}% |\n"
361
+ f"| **Reward** | `{cum_reward:+.4f}` |\n"
362
+ f"| **Submitted** | {'✅' if s.get('submitted') else '❌'} |\n"
363
+ )
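The 20-character progress bar rendered in the state panel above can be factored into a standalone function (illustrative refactor, same arithmetic):

```python
def progress_bar(step_count: int, max_steps: int) -> str:
    """Render a 20-char block bar; each block represents 5% progress."""
    pct = (step_count / max_steps * 100) if max_steps > 0 else 0
    filled = int(pct / 5)
    return "█" * filled + "░" * (20 - filled)
```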
364
+
365
+
366
+ def _fmt_history(action_history: list[dict]) -> str:
367
+ if not action_history:
368
+ return "*No actions yet.*"
369
+ lines = ["| Step | Action | Parameters |", "|:---:|---|---|"]
370
+ for a in action_history:
371
+ step = a.get("step", "?")
372
+ at = a.get("action_type", "?")
373
+ p = a.get("parameters", {})
374
+ p_str = ", ".join(f"`{k}={v}`" for k, v in p.items() if v) or "—"
375
+ icon = "🔍" if at in _DIAG_ACTIONS else ("🔧" if at in _REM_ACTIONS else "📝")
376
+ lines.append(f"| {step} | {icon} `{at}` | {p_str} |")
377
+ return "\n".join(lines)
378
+
379
+
380
+ def _fmt_reward(reward_text: str, grader_result: dict | None = None) -> str:
381
+ lines = [reward_text]
382
+ if grader_result:
383
+ total = grader_result.get("total", 0.0)
384
+ emoji = "🟢" if total >= 0.8 else ("🟡" if total >= 0.5 else "🔴")
385
+ lines.append(f"\n### {emoji} Grader Score: **{total:.4f}** / 1.0\n")
386
+ bd = grader_result.get("breakdown", {})
387
+ if bd:
388
+ lines.append("| Component | Value |\n|---|---|")
389
+ for k, v in bd.items():
390
+ lines.append(f"| {k} | `{v}` |")
391
+ lines.append("")
392
+ fb = grader_result.get("feedback", "")
393
+ if fb:
394
+ lines.append(f"> {fb}")
395
+ return "\n".join(lines)
396
+
397
+
398
+ def _gr_reset(task_id: str, scenario_index: str):
399
+ try:
400
+ env = _get_env_direct()
401
+ obs = env.reset(task_id=task_id, scenario_index=int(scenario_index))
402
+ st = env.state()
403
+ services = obs.known_services
404
+ return (
405
+ _fmt_obs(obs.model_dump()),
406
+ _fmt_state(st.model_dump()),
407
+ _fmt_history([]),
408
+ "✅ Episode started.",
409
+ gr.Dropdown(choices=services, value=services[0] if services else None),
410
+ )
411
+ except Exception as e:
412
+ err = f"❌ **Error:** {e}"
413
+ return (err, err, "", err, gr.Dropdown(choices=[]))
414
+
415
+
416
+ def _gr_step(action_type, service, severity, failure_mode, summary, flag, runbook_action, target_version):
417
+ try:
418
+ env = _get_env_direct()
419
+ params = ActionParameters(
420
+ service=service or None, severity=severity if severity else None,
421
+ failure_mode=failure_mode or None, summary=summary or None,
422
+ flag=flag or None, runbook_action=runbook_action or None,
423
+ target_version=target_version or None,
424
+ )
425
+ action = Action(action_type=action_type, parameters=params)
426
+ obs, reward, done, info = env.step(action)
427
+ st = env.state()
428
+ reward_text = (
429
+ f"### Step Reward: `{reward.score:+.4f}`\n\n"
430
+ f"**Cumulative:** `{reward.cumulative:+.4f}`\n\n"
431
+ f"**Feedback:** {reward.reason}"
432
+ )
433
+ if done:
434
+ reward_text += "\n\n---\n🏁 **EPISODE COMPLETE** — Click **Grade Episode**"
435
+ return (
436
+ _fmt_obs(obs.model_dump()),
437
+ _fmt_state(st.model_dump()),
438
+ _fmt_history(st.action_history),
439
+ reward_text,
440
+ )
441
+ except Exception as e:
442
+ err = f"❌ **Error:** {e}"
443
+ return (err, "", "", err)
444
+
445
+
446
+ def _gr_grade():
447
+ try:
448
+ env = _get_env_direct()
449
+ s = env.state()
450
+ from graders import grade
451
+ result = grade(s.task_id, s.model_dump(), env._scenario)
452
+ return _fmt_reward("### Final Grading", result)
453
+ except Exception as e:
454
+ return f"❌ {e}"
455
+
456
+
457
+ def _gr_state():
458
+ try:
459
+ env = _get_env_direct()
460
+ return _fmt_state(env.state().model_dump())
461
+ except Exception as e:
462
+ return f"❌ {e}"
463
+
464
+
465
+ CUSTOM_CSS = """
466
+ :root, html, body, .gradio-container { color-scheme: light !important; }
467
+ body.dark, html.dark, .dark {
468
+ color-scheme: light !important;
469
+ --body-background-fill: #ffffff !important;
470
+ --background-fill-primary: #ffffff !important;
471
+ --background-fill-secondary: #f8fafc !important;
472
+ }
473
+ .gradio-container {
474
+ background: #ffffff !important;
475
+ max-width: 1500px !important;
476
+ margin: 0 auto !important;
477
+ }
478
+ .env-header {
479
+ display: flex; justify-content: space-between; align-items: center;
480
+ padding: 20px 16px; border-bottom: 2px solid #e2e8f0;
481
+ margin-bottom: 20px; background: linear-gradient(135deg, #f8fafc, #ffffff);
482
+ border-radius: 12px 12px 0 0;
483
+ }
484
+ .env-header-left {
485
+ display: flex; align-items: center; gap: 14px;
486
+ font-size: 1.5rem; font-weight: 800; color: #0f172a;
487
+ }
488
+ .env-header-dot {
489
+ width: 14px; height: 14px; border-radius: 50%;
490
+ background: #22c55e; box-shadow: 0 0 8px rgba(34,197,94,0.4);
491
+ }
492
+ .env-header-right { font-size: 0.9rem; font-weight: 600; color: #94a3b8; text-transform: uppercase; }
493
+ .section-title {
494
+ font-weight: 700; font-size: 0.95rem; color: #1e293b;
495
+ margin: 16px 0 8px; padding: 8px 12px; background: #f1f5f9;
496
+ border-radius: 8px; border-left: 3px solid #3b82f6;
497
+ }
498
+ """
499
+
500
+ FORCE_LIGHT_JS = """
501
+ function() {
502
+ document.body.classList.remove('dark');
503
+ document.documentElement.classList.remove('dark');
504
+ document.documentElement.style.setProperty('color-scheme', 'light');
505
+ }
506
+ """
507
+
508
+ with gr.Blocks(
509
+ title="Cloud Incident Response — OpenEnv",
510
+ css=CUSTOM_CSS, js=FORCE_LIGHT_JS,
511
+ theme=gr.themes.Soft(primary_hue="blue", neutral_hue="slate",
512
+ font=gr.themes.GoogleFont("Inter")),
513
+ ) as demo:
514
+
515
+ gr.HTML("""
516
+ <div class="env-header">
517
+ <div class="env-header-left">
518
+ <span class="env-header-dot"></span> ☁️ Cloud Incident Response
519
+ </div>
520
+ <span class="env-header-right">OpenEnv · v0.1.0</span>
521
+ </div>
522
+ """)
523
+
524
+ with gr.Accordion("📖 How to Use", open=False):
525
+ gr.Markdown("""
526
+ ### Quick Start
527
+ 1. Select **Task** + **Scenario** → Click **🔄 Reset**
528
+ 2. Choose **Action Type** + **Service** → Click **▶️ Execute**
529
+ 3. Repeat: investigate → remediate → submit
530
+ 4. Click **📊 Grade** for final score (0.0–1.0)
531
+
532
+ ### Tasks
533
+ | Task | Difficulty | Steps | Submission |
534
+ |---|---|---|---|
535
+ | `alert_classification` | 🟢 Easy | 3 | `submit_severity` |
536
+ | `root_cause_analysis` | 🟡 Medium | 10 | `submit_root_cause` |
537
+ | `remediation_planning` | 🔴 Hard | 15 | `submit_resolution` |
538
+
539
+ ### Important
540
+ - **Medium/Hard**: `check_recent_deploys` requires prior `query_logs` or `check_metrics` on that service
541
+ - Each action gives immediate reward feedback
542
+ - Wrong remediation actions are penalized
543
+ """)
544
+
545
+ with gr.Row(equal_height=False):
546
+ with gr.Column(scale=2, min_width=380):
547
+ gr.HTML('<div class="section-title">🎯 Episode Setup</div>')
548
+ with gr.Row():
549
+ task_dd = gr.Dropdown(
550
+ choices=[("🟢 Easy — Alert Classification", "alert_classification"),
551
+ ("🟡 Medium — Root Cause Analysis", "root_cause_analysis"),
552
+ ("🔴 Hard — Remediation Planning", "remediation_planning")],
553
+ value="alert_classification", label="Task", scale=2)
554
+ scenario_dd = gr.Dropdown(
555
+ choices=[("Scenario 0", "0"), ("Scenario 1", "1"), ("Scenario 2", "2")],
556
+ value="0", label="Scenario", scale=1)
557
+ reset_btn = gr.Button("🔄 Reset Environment", variant="secondary", size="lg")
558
+
559
+ gr.HTML('<div class="section-title">🎮 Action Controls</div>')
560
+ action_type_dd = gr.Dropdown(
561
+ choices=[("🔍 query_logs", "query_logs"), ("🔍 check_metrics", "check_metrics"),
562
+ ("🔍 check_dependencies", "check_dependencies"),
563
+ ("🔍 check_recent_deploys", "check_recent_deploys"),
564
+ ("🔍 check_service_status", "check_service_status"),
565
+ ("🔧 restart_service", "restart_service"),
566
+ ("🔧 rollback_deploy", "rollback_deploy"),
567
+ ("🔧 scale_service", "scale_service"),
568
+ ("🔧 disable_feature_flag", "disable_feature_flag"),
569
+ ("🔧 clear_cache", "clear_cache"),
570
+ ("🔧 execute_runbook_step", "execute_runbook_step"),
571
+ ("📝 submit_severity", "submit_severity"),
572
+ ("📝 submit_root_cause", "submit_root_cause"),
573
+ ("📝 submit_resolution", "submit_resolution")],
574
+ value="query_logs", label="Action Type")
575
+ service_dd = gr.Dropdown(choices=[], label="Target Service",
576
+ allow_custom_value=True, info="Populated after Reset")
577
+
578
+ with gr.Accordion("📋 Parameters", open=False):
579
+ severity_dd = gr.Dropdown(
580
+ choices=[("—", ""), ("P1 Critical", "P1"), ("P2 High", "P2"),
581
+ ("P3 Medium", "P3"), ("P4 Low", "P4")],
582
+ value="", label="Severity")
583
+ failure_mode_input = gr.Textbox(label="Failure Mode", lines=1,
584
+ placeholder="e.g. unbounded query OOM killing postgres-db")
585
+ summary_input = gr.Textbox(label="Resolution Summary", lines=4,
586
+ placeholder="3+ sentences: what failed, what you did, status")
587
+ flag_input = gr.Textbox(label="Feature Flag", lines=1, placeholder="e.g. full_history_export")
588
+ runbook_input = gr.Textbox(label="Runbook Action", lines=1, placeholder="e.g. restore_bgp_routes")
589
+ target_version_input = gr.Textbox(label="Target Version", lines=1, placeholder="e.g. previous")
590
+
591
+ step_btn = gr.Button("▶️ Execute Action", variant="primary", size="lg")
592
+
593
+ gr.HTML('<div class="section-title">📊 Controls</div>')
594
+ with gr.Row():
595
+ grade_btn = gr.Button("📊 Grade", variant="secondary", size="sm")
596
+ state_btn = gr.Button("📋 State", variant="secondary", size="sm")
597
+
598
+ gr.HTML('<div class="section-title">📌 State</div>')
599
+ state_display = gr.Markdown("### ⏳ Ready\n\nSelect task → Reset → Begin")
600
+
601
+ with gr.Column(scale=3, min_width=480):
602
+ gr.HTML('<div class="section-title">👁️ Observation</div>')
603
+ obs_display = gr.Markdown("### 👋 Welcome\n\nSelect a task and click **Reset** to begin.")
604
+
605
+ gr.HTML('<div class="section-title">📜 History</div>')
606
+ history_display = gr.Markdown("*No actions yet.*")
607
+
608
+ gr.HTML('<div class="section-title">💰 Reward</div>')
609
+ reward_display = gr.Markdown("*Start an episode first.*")
610
+
611
+ reset_btn.click(fn=_gr_reset, inputs=[task_dd, scenario_dd],
612
+ outputs=[obs_display, state_display, history_display, reward_display, service_dd])
613
+ step_btn.click(fn=_gr_step,
614
+ inputs=[action_type_dd, service_dd, severity_dd, failure_mode_input,
615
+ summary_input, flag_input, runbook_input, target_version_input],
616
+ outputs=[obs_display, state_display, history_display, reward_display])
617
+ grade_btn.click(fn=_gr_grade, outputs=[reward_display])
618
+ state_btn.click(fn=_gr_state, outputs=[state_display])
619
+
620
+ app = gr.mount_gradio_app(app, demo, path="/")
621
+
622
+
623
+ def main():
624
+ """Start the OpenEnv server."""
625
  import uvicorn
626
+ uvicorn.run("server.app:app", host="0.0.0.0", port=7860, reload=False)
627
+
628
+
629
+ if __name__ == "__main__":
630
+ main()
server/environment.py CHANGED
@@ -1,28 +1,24 @@
1
  """
2
  server/environment.py — Core OpenEnv environment for Cloud Incident Response.
3
 
4
- Implements the full OpenEnv interface:
5
- reset(task_id, scenario_index) -> Observation
6
- step(action) -> (Observation, Reward, done, info)
7
- state() -> EpisodeState
8
-
9
- All state is in-memory. Thread-safe via a lock.
10
  """
11
 
12
  from __future__ import annotations
13
 
14
- import uuid
15
- import threading
16
- import sys
17
  import os
 
 
 
18
 
19
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
20
 
21
- from tasks import get_task, get_scenario
22
- from graders import grade, _svc_match
23
- from server.models import Action, ActionParameters, Observation, Reward, EpisodeState
24
-
25
- # ── Action type classification ────────────────────────────────────────────────
26
 
27
  _DIAGNOSTIC = frozenset({
28
  "query_logs", "check_metrics", "check_dependencies",
@@ -38,54 +34,81 @@ _SUBMIT = frozenset({
38
  "submit_severity", "submit_root_cause", "submit_resolution",
39
  })
40
 
41
- # ── Reward constants ──────────────────────────────────────────────────────────
42
-
43
- R_QUERY_FIRST = +0.05
44
- R_QUERY_REPEAT = +0.01
45
- R_QUERY_UNKNOWN = -0.05
46
- R_REM_GOOD = +0.10
47
- R_REM_WRONG = -0.10
48
- R_PAST_HALF = -0.02
49
- R_TIMEOUT = -0.10
50
- R_BAD_ACTION = -0.03
51
 
52
 
53
  class IncidentEnvironment:
54
- """
55
- OpenEnv environment for Cloud Incident Response.
56
- One instance handles one episode at a time (thread-safe).
57
- """
58
-
59
- def __init__(self):
60
- self._lock = threading.Lock()
61
- self._s: dict = {}
62
  self._scenario: dict = {}
63
  self._task_def: dict = {}
64
- self._ready = False
65
-
66
- # ── Public OpenEnv API ───────────────────────────────────────────────────
67
 
68
- def reset(self, task_id: str, scenario_index: int = 0) -> Observation:
 
69
  with self._lock:
70
  task_def = get_task(task_id)
71
  scenario = get_scenario(task_id, scenario_index)
72
-
73
  self._task_def = task_def
74
  self._scenario = scenario
75
  self._s = {
76
- "episode_id": str(uuid.uuid4()),
77
- "task_id": task_id,
78
- "scenario_id": scenario["scenario_id"],
79
- "step_count": 0,
80
- "max_steps": task_def["max_steps"],
81
- "action_history": [],
82
- "queried_data": {},
83
- "queried_keys": set(),
84
- "submitted": False,
85
- "resolved": False,
86
- "done": False,
 
 
87
  "cumulative_reward": 0.0,
88
- "feedback": f"Episode started. {scenario['description']}",
 
89
  }
90
  self._ready = True
91
  return self._build_obs()
@@ -94,76 +117,76 @@ class IncidentEnvironment:
94
  with self._lock:
95
  if not self._ready:
96
  raise RuntimeError("Call reset() before step().")
97
-
98
  s = self._s
 
 
99
  if s["done"]:
100
- return (
101
- self._build_obs(),
102
- Reward(value=0.0, reason="episode already done",
103
- cumulative=s["cumulative_reward"]),
104
- True,
105
- {},
106
- )
107
 
108
  s["step_count"] += 1
109
  step_num = s["step_count"]
110
- at = action.action_type
111
- params = action.parameters
 
 
 
112
 
113
  s["action_history"].append({
114
  "action_type": at,
115
- "parameters": params.model_dump(exclude_none=True),
116
- "step": step_num,
117
  })
118
 
119
- r = 0.0
120
  fb: list[str] = []
121
 
122
- # Efficiency penalty after halfway point
123
- if step_num > s["max_steps"] // 2:
124
- r += R_PAST_HALF
125
- fb.append("efficiency penalty")
 
 
 
 
 
 
126
 
127
  if at in _DIAGNOSTIC:
128
- r, fb = self._handle_diagnostic(at, params, r, fb)
129
  elif at in _REMEDIATION:
130
- r, fb = self._handle_remediation(at, params, r, fb)
131
  elif at in _SUBMIT:
132
- r, fb, terminal = self._handle_submit(at, params, r, fb)
133
  if terminal:
134
  s["done"] = True
135
  else:
136
- r += R_BAD_ACTION
137
- fb.append(f"unknown action_type '{at}'")
 
138
 
139
- # Timeout if max steps reached without submission
140
  if step_num >= s["max_steps"] and not s["done"]:
141
- r += R_TIMEOUT
142
- fb.append("timeout — no submission made")
143
  s["done"] = True
144
 
145
- # Apply grader score on terminal step
146
  if s["done"]:
147
  result = grade(s["task_id"], s, self._scenario)
 
148
  s["cumulative_reward"] = round(
149
- s["cumulative_reward"] + r + result["total"], 4
150
- )
151
- fb.append(f"grader={result['feedback']}")
152
  else:
153
  s["cumulative_reward"] = round(s["cumulative_reward"] + r, 4)
154
 
155
  s["feedback"] = " | ".join(fb) if fb else "ok"
156
-
157
- return (
158
- self._build_obs(),
159
- Reward(
160
- value=round(r, 4),
161
- reason=s["feedback"],
162
- cumulative=s["cumulative_reward"],
163
- ),
164
- s["done"],
165
- {"step": step_num, "feedback": s["feedback"]},
166
- )
167
 
168
  def state(self) -> EpisodeState:
169
  with self._lock:
@@ -171,154 +194,143 @@ class IncidentEnvironment:
171
  raise RuntimeError("No active episode — call reset() first.")
172
  s = self._s
173
  return EpisodeState(
174
- episode_id=s["episode_id"],
175
- task_id=s["task_id"],
176
- scenario_id=s["scenario_id"],
177
- step_count=s["step_count"],
178
  max_steps=s["max_steps"],
179
  action_history=list(s["action_history"]),
180
  queried_data=dict(s["queried_data"]),
181
- submitted=s["submitted"],
182
- resolved=s["resolved"],
183
- done=s["done"],
184
- cumulative_reward=s["cumulative_reward"],
185
- feedback=s["feedback"],
186
- )
187
-
188
- # ── Action handlers ──────────────────────────────────────────────────────
189
-
190
- def _handle_diagnostic(
191
- self, at: str, params: ActionParameters, r: float, fb: list[str]
192
- ) -> tuple[float, list[str]]:
193
- s = self._s
194
- service = (params.service or "").lower().strip()
195
- known = {sv.lower() for sv in self._scenario.get("known_services", set())}
196
- tool_data = self._scenario.get("tool_responses", {}).get(at, {})
197
- key = (at, service)
198
-
199
- if service and service in known:
200
- if key not in s["queried_keys"]:
201
- r += R_QUERY_FIRST
202
- fb.append(f"queried {service} (+{R_QUERY_FIRST})")
203
- s["queried_keys"].add(key)
204
- else:
205
- r += R_QUERY_REPEAT
206
- fb.append(f"re-queried {service} (+{R_QUERY_REPEAT})")
207
- result = tool_data.get(service, f"No data available for '{service}'.")
208
- s["queried_data"].setdefault(at, {})[service] = result
209
-
210
- elif service:
211
- r += R_QUERY_UNKNOWN
212
- fb.append(f"unknown service '{service}' ({R_QUERY_UNKNOWN})")
213
  else:
214
- fb.append(f"{at}: no service specified")
 
 
 
215
 
 
 
216
  return r, fb
217
 
218
- def _handle_remediation(
219
- self, at: str, params: ActionParameters, r: float, fb: list[str]
220
- ) -> tuple[float, list[str]]:
221
- s = self._s
222
- service = (params.service or "").lower().strip()
223
- flag = (params.flag or "").lower().strip()
 
 
 
 
224
  runbook = (params.runbook_action or "").lower().strip()
225
- target = (params.target or "").lower().strip()
 
 
 
 
 
 
226
 
227
- # Build candidate keys for wrong-action matching
228
- keys: set[str] = {at}
229
- if service: keys.add(f"{at}:{service}")
230
  if flag: keys.add(f"{at}:{flag}")
231
  if runbook: keys.add(f"execute_runbook_step:{runbook}")
232
  if target: keys.add(f"execute_runbook_step:{target}")
233
 
234
- wrong_map = self._scenario.get("wrong_actions", {})
235
- rem_data = self._scenario.get("remediation_data", {})
236
 
237
- # Check for wrong actions — also use fuzzy service matching for `at:svc` keys
238
  is_wrong = any(k in wrong_map for k in keys)
239
- if not is_wrong and service:
240
- # Try _svc_match against wrong action keys of the form `at:svc`
241
  for wk in wrong_map:
242
  if ":" in wk:
243
  w_at, w_svc = wk.split(":", 1)
244
- if w_at == at and _svc_match(service, w_svc):
245
  is_wrong = True
246
  break
247
 
248
  if is_wrong:
249
- r += R_REM_WRONG
250
- reason = next(
251
- (wrong_map[k] for k in keys if k in wrong_map),
252
- "wrong action for this incident"
253
- )
254
- fb.append(f"wrong action '{at}': {str(reason)[:80]}")
255
  else:
256
- r += R_REM_GOOD
257
- fb.append(f"executed {at}" + (f" on '{service}'" if service else ""))
 
258
  at_data = rem_data.get(at, {})
259
- result = (
260
- at_data.get(service) or at_data.get(flag)
261
- or at_data.get(runbook) or at_data.get(target)
262
- or "action executed successfully"
263
- )
264
- s["queried_data"].setdefault(at, {})[
265
- service or flag or runbook or target or at
266
- ] = result
267
-
268
  return r, fb
269
 
270
- def _handle_submit(
271
- self, at: str, params: ActionParameters, r: float, fb: list[str]
272
- ) -> tuple[float, list[str], bool]:
273
  s = self._s
274
  s["submitted"] = True
 
 
275
 
276
  if at == "submit_severity":
277
- fb.append(f"submitted severity: {(params.severity or '').upper()}")
278
-
279
  elif at == "submit_root_cause":
280
- fb.append(
281
- f"submitted root cause: "
282
- f"service={params.service or ''}, "
283
- f"failure_mode={params.failure_mode or ''}"
284
- )
285
-
286
  elif at == "submit_resolution":
287
- summary = params.summary or ""
288
- inv_count = sum(
289
- 1 for a in s["action_history"]
290
- if a.get("action_type") in _DIAGNOSTIC | _REMEDIATION
291
- )
292
- if summary.strip() and inv_count >= 1:
293
  s["resolved"] = True
294
- fb.append("resolution submitted — incident resolved")
295
  else:
296
- fb.append("resolution submitted — insufficient investigation")
297
-
298
  return r, fb, True
299
 
300
- # ── Build observation ────────────────────────────────────────────────────
301
-
302
- def _build_obs(self) -> Observation:
303
- s = self._s
304
  sc = self._scenario
305
  td = self._task_def
306
-
307
- # Return sorted list of known service names (exact strings agents must use)
308
- known = sorted(sc.get("known_services", set()))
309
-
310
  return Observation(
311
- episode_id=s["episode_id"],
312
- task_id=s["task_id"],
313
- scenario_id=s["scenario_id"],
314
- step_count=s["step_count"],
315
  max_steps=s["max_steps"],
316
  incident_summary=sc.get("incident_summary", sc.get("description", "")),
317
  alert=sc.get("alert", {}),
318
  available_actions=td.get("available_actions", []),
319
  queried_data=dict(s["queried_data"]),
320
  cumulative_reward=s["cumulative_reward"],
321
- done=s["done"],
322
- feedback=s["feedback"],
323
- known_services=known,
324
- )
 
 """
 server/environment.py — Core OpenEnv environment for Cloud Incident Response.

+Difficulty comes from SCENARIO DESIGN, not mechanics:
+    EASY:   3 services, clear metrics, obvious severity
+    MEDIUM: 8 services, root cause NOT in alert, must follow log breadcrumbs
+    HARD:   8 services + 5-7 remediation steps + quality summary + penalties
 """

 from __future__ import annotations

 import os
+import sys
+import threading
+import uuid

 sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

+from graders import _svc_match, grade
+from server.models import Action, ActionParameters, EpisodeState, Observation, Reward
+from tasks import get_scenario, get_task


 _DIAGNOSTIC = frozenset({
     "query_logs", "check_metrics", "check_dependencies",

     "submit_severity", "submit_root_cause", "submit_resolution",
 })

+_TASK_SUBMIT = {
+    "alert_classification": "submit_severity",
+    "root_cause_analysis": "submit_root_cause",
+    "remediation_planning": "submit_resolution",
+}
+
+_REWARD_TABLE = {
+    "easy": {
+        "query_new_svc": +0.04, "query_new_action": +0.02,
+        "query_repeat": -0.03, "query_unknown_svc": -0.06,
+        "query_no_service": -0.04, "rem_good": +0.00,
+        "rem_wrong": -0.08, "rem_no_target": -0.05,
+        "submit_correct": +0.02, "submit_wrong": -0.08,
+        "past_half": -0.04, "timeout": -0.15,
+        "bad_action": -0.05, "exact_repeat": -0.04,
+    },
+    "medium": {
+        "query_new_svc": +0.04, "query_new_action": +0.02,
+        "query_repeat": -0.04, "query_unknown_svc": -0.06,
+        "query_no_service": -0.04, "rem_good": +0.06,
+        "rem_wrong": -0.10, "rem_no_target": -0.06,
+        "submit_correct": +0.02, "submit_wrong": -0.10,
+        "past_half": -0.02, "timeout": -0.15,
+        "bad_action": -0.05, "exact_repeat": -0.05,
+    },
+    "hard": {
+        "query_new_svc": +0.03, "query_new_action": +0.01,
+        "query_repeat": -0.03, "query_unknown_svc": -0.05,
+        "query_no_service": -0.03, "rem_good": +0.06,
+        "rem_wrong": -0.15, "rem_no_target": -0.05,
+        "submit_correct": +0.02, "submit_wrong": -0.12,
+        "past_half": -0.02, "timeout": -0.20,
+        "bad_action": -0.05, "exact_repeat": -0.04,
+    },
+}
+
+_TASK_DIFFICULTY = {
+    "alert_classification": "easy",
+    "root_cause_analysis": "medium",
+    "remediation_planning": "hard",
+}

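The per-difficulty shaping values can be exercised in isolation. A minimal sketch, using the "easy" values copied from `_REWARD_TABLE` above, of how a sequence of step outcomes accumulates into a running total (the `episode_total` helper is illustrative, not part of the environment):

```python
# Sketch: accumulate per-step shaping rewards.
# The values below are copied from _REWARD_TABLE["easy"]; the event
# names mirror the keys step() charges for each outcome.
EASY = {
    "query_new_svc": +0.04, "query_new_action": +0.02,
    "query_repeat": -0.03, "submit_correct": +0.02,
    "timeout": -0.15,
}

def episode_total(events: list[str], table: dict[str, float]) -> float:
    """Sum the shaping reward for a sequence of step outcomes."""
    return round(sum(table[e] for e in events), 4)

# Query two new services, repeat one query, then submit correctly:
events = ["query_new_svc", "query_new_svc", "query_repeat", "submit_correct"]
print(episode_total(events, EASY))  # 0.04 + 0.04 - 0.03 + 0.02 = 0.07
```

Note that the grader bonus added at episode end is on top of these shaping terms, so the cumulative reward is not just this sum.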
 class IncidentEnvironment:
+    def __init__(self) -> None:
+        self._lock = threading.Lock()
+        self._s: dict = {}
         self._scenario: dict = {}
         self._task_def: dict = {}
+        self._ready = False

+    def reset(self, task_id: str = "alert_classification",
+              scenario_index: int = 0) -> Observation:
         with self._lock:
             task_def = get_task(task_id)
             scenario = get_scenario(task_id, scenario_index)
             self._task_def = task_def
             self._scenario = scenario
             self._s = {
+                "episode_id": str(uuid.uuid4()),
+                "task_id": task_id,
+                "scenario_id": scenario["scenario_id"],
+                "step_count": 0,
+                "max_steps": task_def["max_steps"],
+                "action_history": [],
+                "queried_data": {},
+                "queried_keys": set(),
+                "services_queried": set(),
+                "exact_hashes": set(),
+                "submitted": False,
+                "resolved": False,
+                "done": False,
                 "cumulative_reward": 0.0,
+                "feedback": f"Episode started. {scenario['description']}",
+                "last_action_error": None,
             }
             self._ready = True
             return self._build_obs()

     def step(self, action: Action):
         with self._lock:
             if not self._ready:
                 raise RuntimeError("Call reset() before step().")
             s = self._s
+            s["last_action_error"] = None
+
             if s["done"]:
+                return (self._build_obs(),
+                        Reward(score=0.0, reason="episode already done",
+                               cumulative=s["cumulative_reward"]),
+                        True, {})

             s["step_count"] += 1
             step_num = s["step_count"]
+            at = action.action_type
+            params = action.parameters
+            task_id = s["task_id"]
+            diff = _TASK_DIFFICULTY.get(task_id, "medium")
+            rt = _REWARD_TABLE[diff]

             s["action_history"].append({
                 "action_type": at,
+                "parameters": params.model_dump(exclude_none=True),
+                "step": step_num,
             })

+            r = 0.0
             fb: list[str] = []

+            h = f"{at}|{params.model_dump_json(exclude_none=True)}"
+            if h in s["exact_hashes"]:
+                r += rt["exact_repeat"]
+                fb.append(f"exact repeat ({rt['exact_repeat']:+.2f})")
+            s["exact_hashes"].add(h)
+
+            half = max(1, s["max_steps"] // 2)
+            if step_num > half and at not in _SUBMIT:
+                r += rt["past_half"]
+                fb.append(f"past halfway ({rt['past_half']:+.3f})")

             if at in _DIAGNOSTIC:
+                r, fb = self._handle_diagnostic(at, params, r, fb, rt)
             elif at in _REMEDIATION:
+                r, fb = self._handle_remediation(at, params, r, fb, rt, task_id)
             elif at in _SUBMIT:
+                r, fb, terminal = self._handle_submit(at, params, r, fb, rt, task_id)
                 if terminal:
                     s["done"] = True
             else:
+                r += rt["bad_action"]
+                fb.append(f"unknown action '{at}' ({rt['bad_action']:+.2f})")
+                s["last_action_error"] = f"Unknown action type: {at}"

             if step_num >= s["max_steps"] and not s["done"]:
+                r += rt["timeout"]
+                fb.append(f"timeout ({rt['timeout']:+.2f})")
                 s["done"] = True

             if s["done"]:
                 result = grade(s["task_id"], s, self._scenario)
+                grader_score = result["total"]
                 s["cumulative_reward"] = round(
+                    s["cumulative_reward"] + r + grader_score, 4)
+                fb.append(f"grader={grader_score:.3f} ({result['feedback']})")
             else:
                 s["cumulative_reward"] = round(s["cumulative_reward"] + r, 4)

             s["feedback"] = " | ".join(fb) if fb else "ok"
+            return (self._build_obs(),
+                    Reward(score=round(r, 4), reason=s["feedback"],
+                           cumulative=s["cumulative_reward"]),
+                    s["done"],
+                    {"step": step_num, "feedback": s["feedback"]})

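The exact-repeat detection in `step()` hashes the action type together with the serialized parameters. A sketch of the same scheme using a plain dict in place of the Pydantic parameter model (`json.dumps` with `sort_keys` stands in for `model_dump_json(exclude_none=True)`; the helper names here are illustrative):

```python
import json

def make_key(action_type: str, params: dict) -> str:
    # Stand-in for f"{at}|{params.model_dump_json(exclude_none=True)}":
    # drop None values, then serialize deterministically.
    clean = {k: v for k, v in params.items() if v is not None}
    return f"{action_type}|{json.dumps(clean, sort_keys=True)}"

seen: set[str] = set()

def is_exact_repeat(action_type: str, params: dict) -> bool:
    """True if this exact (action, params) pair was already played."""
    key = make_key(action_type, params)
    repeat = key in seen
    seen.add(key)
    return repeat

print(is_exact_repeat("query_logs", {"service": "auth-service"}))  # False
print(is_exact_repeat("query_logs", {"service": "auth-service"}))  # True
print(is_exact_repeat("query_logs", {"service": "postgres-db"}))   # False
```

Because `None`-valued fields are excluded before hashing, two actions that differ only in unset optional parameters count as the same action.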
     def state(self) -> EpisodeState:
         with self._lock:
             if not self._ready:
                 raise RuntimeError("No active episode — call reset() first.")
             s = self._s
             return EpisodeState(
+                episode_id=s["episode_id"], task_id=s["task_id"],
+                scenario_id=s["scenario_id"], step_count=s["step_count"],
                 max_steps=s["max_steps"],
                 action_history=list(s["action_history"]),
                 queried_data=dict(s["queried_data"]),
+                submitted=s["submitted"], resolved=s["resolved"],
+                done=s["done"], cumulative_reward=s["cumulative_reward"],
+                feedback=s["feedback"])
+
+    def _handle_diagnostic(self, at, params, r, fb, rt):
+        s = self._s
+        svc = (params.service or "").lower().strip()
+        known = {v.lower() for v in self._scenario.get("known_services", set())}
+        tool = self._scenario.get("tool_responses", {}).get(at, {})
+        key = (at, svc)
+
+        if not svc:
+            r += rt["query_no_service"]
+            fb.append(f"{at}: no service ({rt['query_no_service']:+.2f})")
+            s["last_action_error"] = f"{at} requires a service parameter"
+            return r, fb
+
+        if svc not in known:
+            r += rt["query_unknown_svc"]
+            fb.append(f"unknown service '{svc}' ({rt['query_unknown_svc']:+.2f})")
+            s["last_action_error"] = f"Unknown service: {svc}"
+            return r, fb
+
+        if key in s["queried_keys"]:
+            r += rt["query_repeat"]
+            fb.append(f"repeat [{at}][{svc}] ({rt['query_repeat']:+.2f})")
+        elif svc in s["services_queried"]:
+            r += rt["query_new_action"]
+            fb.append(f"new action on {svc} ({rt['query_new_action']:+.2f})")
+            s["queried_keys"].add(key)
         else:
+            r += rt["query_new_svc"]
+            fb.append(f"new service {svc} ({rt['query_new_svc']:+.2f})")
+            s["queried_keys"].add(key)
+            s["services_queried"].add(svc)

+        result = tool.get(svc, f"No data available for '{svc}'.")
+        s["queried_data"].setdefault(at, {})[svc] = result
         return r, fb

+    def _handle_remediation(self, at, params, r, fb, rt, task_id):
+        s = self._s
+        if task_id == "alert_classification":
+            r += rt["rem_wrong"]
+            fb.append(f"remediation in easy task ({rt['rem_wrong']:+.2f})")
+            s["last_action_error"] = "Remediation not available in alert_classification"
+            return r, fb
+
+        svc = (params.service or "").lower().strip()
+        flag = (params.flag or "").lower().strip()
         runbook = (params.runbook_action or "").lower().strip()
+        target = (params.target or "").lower().strip()
+
+        if not (svc or flag or runbook or target):
+            r += rt["rem_no_target"]
+            fb.append(f"{at}: no target ({rt['rem_no_target']:+.2f})")
+            s["last_action_error"] = f"{at} requires a target"
+            return r, fb

+        keys = {at}
+        if svc: keys.add(f"{at}:{svc}")
         if flag: keys.add(f"{at}:{flag}")
         if runbook: keys.add(f"execute_runbook_step:{runbook}")
         if target: keys.add(f"execute_runbook_step:{target}")

+        wrong_map = self._scenario.get("wrong_actions", {})
+        rem_data = self._scenario.get("remediation_data", {})

         is_wrong = any(k in wrong_map for k in keys)
+        if not is_wrong and svc:
             for wk in wrong_map:
                 if ":" in wk:
                     w_at, w_svc = wk.split(":", 1)
+                    if w_at == at and _svc_match(svc, w_svc):
                         is_wrong = True
                         break

         if is_wrong:
+            r += rt["rem_wrong"]
+            reason = next((wrong_map[k] for k in keys if k in wrong_map), "wrong")
+            fb.append(f"wrong: {at} {str(reason)[:60]} ({rt['rem_wrong']:+.2f})")
         else:
+            r += rt["rem_good"]
+            tgt = svc or flag or runbook or target
+            fb.append(f"executed {at}:{tgt} ({rt['rem_good']:+.2f})")
             at_data = rem_data.get(at, {})
+            result = (at_data.get(svc) or at_data.get(flag) or at_data.get(runbook)
+                      or at_data.get(target) or "action executed successfully")
+            s["queried_data"].setdefault(at, {})[tgt] = result

         return r, fb

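The wrong-action check above builds a set of lookup keys from whichever parameters were supplied, then tests them against the scenario's `wrong_actions` map. A sketch of just that key construction (the `remediation_keys` helper is illustrative; the `_svc_match` fuzzy fallback from graders.py is omitted here, exact matching only):

```python
def remediation_keys(at, svc="", flag="", runbook="", target=""):
    """Build the lookup keys checked against a scenario's wrong_actions."""
    keys = {at}
    if svc:
        keys.add(f"{at}:{svc}")
    if flag:
        keys.add(f"{at}:{flag}")
    if runbook:
        keys.add(f"execute_runbook_step:{runbook}")
    if target:
        keys.add(f"execute_runbook_step:{target}")
    return keys

# One wrong_actions entry copied from scenario RCA-001 below:
wrong = {"restart_service:auth-service":
         "auth-service is a victim — DB must be fixed first"}
keys = remediation_keys("restart_service", svc="auth-service")
print(any(k in wrong for k in keys))  # True
```

Bare action types are also keys, so a scenario can mark an entire action wrong regardless of its target.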
+    def _handle_submit(self, at, params, r, fb, rt, task_id):
         s = self._s
+        correct = _TASK_SUBMIT.get(task_id, "")
+        if at != correct:
+            r += rt["submit_wrong"]
+            fb.append(f"wrong submit '{at}' (need '{correct}') ({rt['submit_wrong']:+.2f})")
+            s["last_action_error"] = f"Wrong submission type: use {correct}"
+            return r, fb, False
+
         s["submitted"] = True
+        r += rt["submit_correct"]
+        fb.append(f"submitted ({rt['submit_correct']:+.2f})")

         if at == "submit_severity":
+            fb.append(f"severity={(params.severity or '').upper().strip()}")
         elif at == "submit_root_cause":
+            fb.append(f"svc={params.service or ''}, mode={params.failure_mode or ''}")
         elif at == "submit_resolution":
+            summary = params.summary or ""
+            inv = sum(1 for a in s["action_history"]
+                      if a.get("action_type") in _DIAGNOSTIC | _REMEDIATION)
+            if summary.strip() and inv >= 1:
                 s["resolved"] = True
+                fb.append("resolved")
             else:
+                fb.append("insufficient investigation")

         return r, fb, True

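Each task accepts exactly one submission action, routed through the `_TASK_SUBMIT` mapping defined at the top of the module; any other submit type is penalized without ending the episode. The guard in isolation (the `submit_allowed` helper is illustrative; the mapping values are copied from `_TASK_SUBMIT` above):

```python
_TASK_SUBMIT = {
    "alert_classification": "submit_severity",
    "root_cause_analysis": "submit_root_cause",
    "remediation_planning": "submit_resolution",
}

def submit_allowed(task_id: str, action_type: str) -> bool:
    """Mirror _handle_submit's first guard: only the task's own
    submission action terminates the episode."""
    return _TASK_SUBMIT.get(task_id, "") == action_type

print(submit_allowed("root_cause_analysis", "submit_root_cause"))   # True
print(submit_allowed("alert_classification", "submit_resolution"))  # False
```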
+    def _build_obs(self):
+        s = self._s
         sc = self._scenario
         td = self._task_def
         return Observation(
+            episode_id=s["episode_id"], task_id=s["task_id"],
+            scenario_id=s["scenario_id"], step_count=s["step_count"],
             max_steps=s["max_steps"],
             incident_summary=sc.get("incident_summary", sc.get("description", "")),
             alert=sc.get("alert", {}),
             available_actions=td.get("available_actions", []),
             queried_data=dict(s["queried_data"]),
             cumulative_reward=s["cumulative_reward"],
+            done=s["done"], feedback=s["feedback"],
+            known_services=sorted(sc.get("known_services", set())),
+            last_action_error=s.get("last_action_error"))
server/models.py CHANGED
@@ -1,16 +1,21 @@
 """
-server/models.py — Typed Pydantic models for the OpenEnv interface.

-OpenEnv requires three typed models: Action, Observation, Reward.
-All models use Pydantic v2.
 """

 from __future__ import annotations
-from pydantic import BaseModel, Field


 class ActionParameters(BaseModel):
     """Flexible parameter bag — different action types use different fields."""
     service: str | None = None
     severity: str | None = None
     failure_mode: str | None = None
@@ -26,7 +31,13 @@ class ActionParameters(BaseModel):


 class Action(BaseModel):
-    """An action submitted by the agent to the environment."""
     action_type: str
     parameters: ActionParameters = Field(default_factory=ActionParameters)

@@ -34,7 +45,27 @@


 class Observation(BaseModel):
-    """Observation returned after reset() or step()."""
     episode_id: str
     task_id: str
     scenario_id: str
@@ -47,20 +78,48 @@
     cumulative_reward: float
     done: bool
     feedback: str
-    # Explicit list of all valid service names for this scenario.
-    # Agents must use these exact strings in action parameters.
     known_services: list[str] = Field(default_factory=list)


 class Reward(BaseModel):
-    """Reward signal returned after each step()."""
-    value: float
     reason: str
     cumulative: float


 class EpisodeState(BaseModel):
-    """Full episode state returned by GET /state."""
     episode_id: str
     task_id: str
     scenario_id: str
1
  """
2
+ server/models.py — Typed Pydantic v2 models for the OpenEnv interface.
3
 
4
+ Implements the full OpenEnv spec:
5
+ - Action: typed action with parameters
6
+ - Observation: full environment state visible to the agent
7
+ - Reward: score + reason + cumulative (with backward-compatible 'value' alias)
8
+ - EpisodeState: internal state for GET /state
9
  """
10
 
11
  from __future__ import annotations
12
+
13
+ from pydantic import BaseModel, Field, computed_field
14
 
15
 
16
  class ActionParameters(BaseModel):
17
  """Flexible parameter bag — different action types use different fields."""
18
+
19
  service: str | None = None
20
  severity: str | None = None
21
  failure_mode: str | None = None
 
31
 
32
 
33
  class Action(BaseModel):
34
+ """An action submitted by the agent to the environment.
35
+
36
+ Attributes:
37
+ action_type: One of the valid action types (query_logs, check_metrics, etc.)
38
+ parameters: Action-specific parameters
39
+ """
40
+
41
  action_type: str
42
  parameters: ActionParameters = Field(default_factory=ActionParameters)
43
 
 
45
 
46
 
47
  class Observation(BaseModel):
48
+ """Observation returned after reset() or step().
49
+
50
+ Contains all information visible to the agent at this point in the episode.
51
+
52
+ Attributes:
53
+ episode_id: Unique episode UUID
54
+ task_id: Active task identifier
55
+ scenario_id: Current scenario identifier
56
+ step_count: Number of steps taken so far
57
+ max_steps: Maximum steps allowed
58
+ incident_summary: Human-readable incident description
59
+ alert: Alert payload with severity, symptoms, affected services
60
+ available_actions: List of valid action types for this task
61
+ queried_data: All tool responses gathered so far (evidence)
62
+ cumulative_reward: Running reward total
63
+ done: Whether the episode has ended
64
+ feedback: Per-step feedback string
65
+ known_services: Exact service names valid for actions
66
+ last_action_error: Error message if last action was invalid (None if OK)
67
+ """
68
+
69
  episode_id: str
70
  task_id: str
71
  scenario_id: str
 
78
  cumulative_reward: float
79
  done: bool
80
  feedback: str
 
 
81
  known_services: list[str] = Field(default_factory=list)
82
+ last_action_error: str | None = None
83
 
84
 
85
  class Reward(BaseModel):
86
+ """Reward signal returned after each step().
87
+
88
+ Primary field is ``score`` (the actual reward value).
89
+ ``value`` is a computed alias for backward compatibility with OpenEnv validators.
90
+
91
+ Attributes:
92
+ score: The reward value for this step
93
+ reason: Human-readable explanation of the reward
94
+ cumulative: Running total of all rewards in the episode
95
+ """
96
+
97
+ score: float
98
  reason: str
99
  cumulative: float
100
 
101
+ @computed_field
102
+ @property
103
+ def value(self) -> float:
104
+ """Backward-compatible alias for *score*."""
105
+ return self.score
106
+
107
+
108
+ class StepResult(BaseModel):
109
+ """Result returned by POST /step — matches OpenEnv spec."""
110
+
111
+ observation: Observation
112
+ reward: Reward
113
+ done: bool
114
+ info: dict = Field(default_factory=dict)
115
+
116
 
117
  class EpisodeState(BaseModel):
118
+ """Full episode state returned by GET /state.
119
+
120
+ Contains internal bookkeeping not shown to agents directly.
121
+ """
122
+
123
  episode_id: str
124
  task_id: str
125
  scenario_id: str
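The `computed_field` on `Reward` keeps old clients that read `reward.value` working while `score` becomes the primary field. The same aliasing idea with a plain dataclass and property, just to illustrate the mechanism without Pydantic (the `RewardSketch` name is illustrative):

```python
from dataclasses import dataclass

@dataclass
class RewardSketch:
    score: float
    reason: str
    cumulative: float

    @property
    def value(self) -> float:
        # Backward-compatible alias: old readers of .value see .score.
        return self.score

r = RewardSketch(score=0.04, reason="new service", cumulative=0.04)
print(r.value)  # 0.04
```

In the Pydantic version, `@computed_field` additionally includes `value` in serialized output, so JSON consumers of `/step` see both fields.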
tasks.py CHANGED
@@ -1,15 +1,17 @@
 """
 tasks.py — Task and scenario definitions for Cloud Incident Response OpenEnv.

-Covers cross-service cascading failures in distributed cloud systems:
-- DB connection pool exhaustion cascading through service mesh
-- CDN cache invalidation storms causing origin overload
-- OOM kills from runaway analytics queries
-- BGP network partitions isolating availability zones

-Distinct from Kubernetes ops environments — focuses on application-layer
-incident response: log correlation, dependency tracing, and remediation
-across microservice architectures.

 Public API:
     get_task(task_id) -> task metadata dict
@@ -29,9 +31,10 @@ ALL_TASKS: dict = {
         "score_range": [0.0, 1.0],
         "description": (
             "An alert has fired. Query logs and metrics across affected services, "
-            "then classify the incident severity: P1 (CRITICAL — revenue/user impact, "
-            "immediate action), P2 (HIGH — degraded service), P3 (MEDIUM — minor issue), "
-            "P4 (LOW — informational). Submit severity with submit_severity."
         ),
         "available_actions": [
             "query_logs",
@@ -41,7 +44,7 @@ ALL_TASKS: dict = {
             "submit_severity",
         ],
         "submission_action": "submit_severity",
-        "scenarios": 2,
     },
     "root_cause_analysis": {
         "id": "root_cause_analysis",
@@ -50,10 +53,11 @@ ALL_TASKS: dict = {
         "max_steps": 10,
         "score_range": [0.0, 1.0],
         "description": (
-            "A production incident is active. Use diagnostic tools to trace the failure "
-            "chain across services. Query logs, metrics, dependency graphs, and recent "
-            "deploys to identify which service is the root cause and what failure mode "
-            "triggered the cascade. Submit findings with submit_root_cause."
         ),
         "available_actions": [
             "query_logs",
@@ -64,7 +68,7 @@ ALL_TASKS: dict = {
             "submit_root_cause",
         ],
         "submission_action": "submit_root_cause",
-        "scenarios": 2,
     },
     "remediation_planning": {
         "id": "remediation_planning",
@@ -74,10 +78,10 @@ ALL_TASKS: dict = {
         "score_range": [0.0, 1.0],
         "description": (
             "A critical production incident requires full end-to-end resolution. "
-            "Diagnose the root cause, execute the correct remediation sequence "
-            "(disable feature flags, restart services, rollback deploys, run runbook steps), "
-            "then submit a resolution summary. Scored on investigation quality, "
-            "remediation correctness, efficiency, and documentation."
         ),
         "available_actions": [
             "query_logs",
@@ -94,520 +98,831 @@ ALL_TASKS: dict = {
             "submit_resolution",
         ],
         "submission_action": "submit_resolution",
-        "scenarios": 2,
     },
 }

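The module docstring declares `get_task(task_id)` and a scenario lookup as the public API, but their bodies fall outside this diff. A minimal sketch of plausible lookups over the two dicts (hypothetical; the real functions may validate or handle out-of-range indices differently, e.g. by raising instead of wrapping):

```python
# Hypothetical stand-ins for the real ALL_TASKS / SCENARIOS dicts:
ALL_TASKS = {"alert_classification": {"id": "alert_classification"}}
SCENARIOS = {"alert_classification": [{"scenario_id": "AC-001"},
                                      {"scenario_id": "AC-002"}]}

def get_task(task_id: str) -> dict:
    """Return task metadata, failing loudly on unknown task ids."""
    if task_id not in ALL_TASKS:
        raise KeyError(f"unknown task: {task_id}")
    return ALL_TASKS[task_id]

def get_scenario(task_id: str, index: int = 0) -> dict:
    """Return one scenario; wrap out-of-range indices (an assumption)."""
    scenarios = SCENARIOS[task_id]
    return scenarios[index % len(scenarios)]

print(get_scenario("alert_classification", 3)["scenario_id"])  # AC-002
```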
 # ---------------------------------------------------------------------------
-# Scenario data — 3 tasks × 2 scenarios = 6 total episodes
 # ---------------------------------------------------------------------------

 SCENARIOS: dict = {

-    # ── TASK 1: ALERT CLASSIFICATION ────────────────────────────────────────

     "alert_classification": [
-
-        # AC-001: Cascading DB connection pool exhaustion → P1
         {
             "scenario_id": "AC-001",
             "description": (
-                "Cascading failure: postgres-db connection pool exhausted, "
-                "causing auth-service timeouts, blocking api-gateway requests. "
-                "Revenue impact is severe and growing."
             ),
             "incident_summary": (
-                "P1 ALERT — api-gateway 5xx rate 78%, auth-service timeout rate 94%, "
-                "postgres-db connection pool at 100% (500/500). "
-                "Checkout completely down. Revenue impact: $12,000/min."
             ),
             "alert": {
-                "id": "ALT-20240315-001",
-                "title": "CRITICAL: api-gateway error rate spike 78%",
-                "severity_fired": "P1",
                 "affected_services": ["api-gateway", "auth-service", "postgres-db"],
                 "symptoms": [
                     "api-gateway: HTTP 503 rate 78% (baseline: 0.1%)",
                     "auth-service: connection timeout 94% of requests",
-                    "postgres-db: connection pool 500/500 — 100% utilized",
-                    "checkout flow: completely unavailable",
-                    "new user logins: 0% success rate",
                 ],
-                "error_rate": 0.78,
-                "duration_minutes": 4,
-                "revenue_impact_per_min": 12000,
             },
             "known_services": {"api-gateway", "auth-service", "postgres-db"},
             "tool_responses": {
                 "query_logs": {
                     "api-gateway": (
-                        "2024-03-15T10:04:12Z ERROR upstream connect error — "
-                        "reset reason: connection timeout auth-service:8080\n"
-                        "2024-03-15T10:04:13Z ERROR 503 Service Unavailable upstream: auth-service\n"
-                        "2024-03-15T10:04:14Z ERROR circuit breaker OPEN for auth-service"
                     ),
                     "auth-service": (
-                        "2024-03-15T10:04:10Z ERROR pq: sorry, too many clients already\n"
-                        "2024-03-15T10:04:11Z ERROR dial tcp postgres-db:5432: "
-                        "connect: connection refused — pool exhausted (500/500)\n"
-                        "2024-03-15T10:04:12Z ERROR all connection pool slots occupied"
                     ),
                     "postgres-db": (
-                        "2024-03-15T10:03:58Z LOG connection received: host=auth-service\n"
-                        "2024-03-15T10:04:00Z FATAL remaining connection slots reserved "
-                        "for non-replication superuser\n"
-                        "2024-03-15T10:04:01Z LOG max_connections=500 active=500 idle=0"
                     ),
                 },
                 "check_metrics": {
-                    "api-gateway": (
-                        "HTTP 5xx rate: 78% | p99 latency: 30s (timeout) | "
-                        "RPS: 1,200 | circuit_breaker: OPEN"
-                    ),
-                    "auth-service": (
-                        "Error rate: 94% | DB connection wait: 28s | "
-                        "Active connections: 0 | Request queue: 847"
-                    ),
-                    "postgres-db": (
-                        "Connections: 500/500 (100%) | Query queue: 847 | "
-                        "CPU: 98% | Memory: 89% | Active queries: 500"
-                    ),
                 },
                 "check_dependencies": {
-                    "api-gateway": "Depends on: auth-service [CRITICAL], product-service [OK]",
-                    "auth-service": "Depends on: postgres-db [CRITICAL], redis-session [OK]",
-                    "postgres-db": "No upstream dependencies — root level service",
                 },
                 "check_recent_deploys": {
-                    "api-gateway": "Last deploy: 3 days ago — no recent changes",
-                    "auth-service": (
-                        "Last deploy: 47 min ago — PR #2341: "
-                        "increased default connection pool size from 10 to 500"
-                    ),
-                    "postgres-db": "Last deploy: 12 days ago — no recent changes",
                 },
             },
-            "correct_severity": "P1",
             "adjacent_severities": ["P2"],
         },

-        # AC-002: CDN cache invalidation storm → P2
         {
             "scenario_id": "AC-002",
             "description": (
-                "CDN cache invalidation storm: a misconfigured purge cronjob wiped "
-                "all 2.1M cached keys, sending 40× normal traffic to origin. "
-                "Site degraded but not fully down — P2 severity."
             ),
             "incident_summary": (
-                "P2 ALERT — CDN cache hit rate dropped from 94% to 3%, "
-                "product-service origin traffic up 4000%, image-service CPU at 95%. "
-                "Pages loading slowly (p99: 18s). Checkout still working."
             ),
             "alert": {
-                "id": "ALT-20240315-002",
-                "title": "HIGH: CDN cache miss storm — origin overloaded",
-                "severity_fired": "P2",
                 "affected_services": ["cdn-edge", "product-service", "image-service"],
                 "symptoms": [
                     "CDN cache hit rate: 3% (normal: 94%)",
-                    "product-service: origin RPS 48,000 (normal: 1,200)",
                     "image-service: CPU 95%, p99 latency 18s",
-                    "User experience: product pages slow, some images timing out",
-                    "Checkout: still functional (not affected)",
                 ],
-                "error_rate": 0.15,
-                "duration_minutes": 8,
-                "revenue_impact_per_min": 800,
             },
             "known_services": {"cdn-edge", "product-service", "image-service"},
             "tool_responses": {
                 "query_logs": {
                     "cdn-edge": (
-                        "2024-03-15T10:22:00Z INFO cache MISS ratio: 97% (5min window)\n"
-                        "2024-03-15T10:20:11Z WARN mass cache invalidation — "
-                        "2,100,000 keys purged by purge-job-prod\n"
                         "2024-03-15T10:20:10Z INFO purge pattern: /* (ALL keys)"
                     ),
                     "product-service": (
                         "2024-03-15T10:22:05Z WARN request queue depth: 12,400\n"
-                        "2024-03-15T10:22:06Z ERROR timeout fetching from image-service (18s)\n"
-                        "2024-03-15T10:22:07Z WARN worker pool 95% utilized"
                     ),
                     "image-service": (
-                        "2024-03-15T10:22:00Z WARN CPU throttling engaged (95%)\n"
-                        "2024-03-15T10:22:01Z ERROR worker pool exhausted — dropping requests\n"
-                        "2024-03-15T10:22:02Z ERROR OOM risk: memory at 91%"
                     ),
                 },
                 "check_metrics": {
-                    "cdn-edge": (
-                        "Cache hit rate: 3% | Purge events (1h): 1 mass purge | "
-                        "Origin RPS: 48,000 | Bandwidth: 890 Gbps"
                     ),
                     "product-service": (
-                        "Origin RPS: 48,000 (normal: 1,200) | "
-                        "Queue depth: 12,400 | Worker utilization: 95%"
                     ),
-                    "image-service": (
-                        "CPU: 95% | Memory: 91% | "
-                        "Worker pool: 0 free / 200 | p99 latency: 18s"
                     ),
                 },
                 "check_dependencies": {
-                    "cdn-edge": "Origin: product-service [OVERLOADED]",
-                    "product-service": "Depends on: image-service [DEGRADED], postgres-db [OK]",
-                    "image-service": "Depends on: object-storage [OK] — no upstream issues",
                 },
                 "check_recent_deploys": {
-                    "cdn-edge": (
-                        "Cronjob purge-job-prod updated 2h ago — "
-                        "purge pattern changed from /images/* to /* (all keys)"
-                    ),
-                    "product-service": "Last deploy: 5 days ago — no recent changes",
-                    "image-service": "Last deploy: 2 days ago — no recent changes",
                 },
             },
-            "correct_severity": "P2",
-            "adjacent_severities": ["P1", "P3"],
         },
     ],

- # ── TASK 2: ROOT CAUSE ANALYSIS ─────────────────────────────────────────
 
 
 
 
 
 
 
 
 
 
 
277
 
278
  "root_cause_analysis": [
279
 
280
- # RCA-001: Analytics service OOM kills postgres-db
 
 
281
  {
282
  "scenario_id": "RCA-001",
283
  "description": (
284
- "postgres-db was OOM-killed by the Linux kernel after a runaway "
285
- "analytics query with no LIMIT clause consumed all available memory. "
286
- "All downstream services are now failing. analytics-service is the culprit."
287
  ),
288
  "incident_summary": (
289
- "Multiple services down: api-gateway 503, auth-service failing, "
290
- "order-service write failures. postgres-db restarting in a loop. "
291
- "Root cause is upstream trace the failure chain."
292
  ),
293
  "alert": {
294
- "id": "ALT-RCA-001",
295
- "title": "CRITICAL: postgres-db crash loop all dependents down",
296
- "severity_fired": "P1",
297
  "affected_services": [
298
  "api-gateway", "auth-service", "order-service", "postgres-db",
299
  ],
300
  "symptoms": [
301
- "postgres-db: 4 restarts in 12 minutes",
302
- "auth-service: connection refused — 100% failure",
303
  "order-service: all writes failing",
304
- "api-gateway: 503 on all authenticated routes",
305
- "analytics-service: last job failed 12 min ago",
306
  ],
307
- "error_rate": 0.95,
308
  "duration_minutes": 14,
309
  },
310
  "known_services": {
311
  "api-gateway", "auth-service", "order-service",
312
  "postgres-db", "analytics-service", "redis-session",
 
313
  },
314
  "tool_responses": {
315
- "query_logs": {
316
- "postgres-db": (
317
- "2024-03-16T02:11:00Z LOG database system shut down at 02:10:58\n"
318
- "2024-03-16T02:10:58Z FATAL Out of Memory: Kill process 1847 (postgres) "
319
- "score 982 or sacrifice child\n"
320
- "2024-03-16T02:10:30Z LOG process 1847 query running 12min: "
321
- "SELECT * FROM events JOIN user_sessions JOIN orders "
322
- "JOIN products no LIMIT clause, est 847M rows"
323
- ),
324
- "analytics-service": (
325
- "2024-03-16T01:58:00Z INFO starting job: full_history_export\n"
326
- "2024-03-16T01:58:01Z WARN query has no LIMIT estimated 847M rows\n"
327
- "2024-03-16T02:10:55Z ERROR job killed by OOMfull_history_export FAILED"
328
- ),
329
- "auth-service": (
330
- "2024-03-16T02:11:05Z ERROR connect ECONNREFUSED postgres-db:5432\n"
331
- "2024-03-16T02:11:06Z ERROR all retries exhausted — giving up"
332
- ),
333
- "api-gateway": (
334
- "2024-03-16T02:11:10Z ERROR upstream auth-service: 503 Service Unavailable"
335
- ),
336
- "order-service": (
337
- "2024-03-16T02:11:08Z ERROR pq: the database system is starting up"
338
- ),
339
- "redis-session": "No errors operating normally at 99.2% hit rate",
340
- },
 
 
 
 
 
 
 
 
341
  "check_metrics": {
342
  "postgres-db": (
343
- "Memory: OOM killed (0% free at crash) | "
344
- "Restarts: 4 in 12min | Status: RESTARTING"
 
345
  ),
346
  "analytics-service": (
347
- "Memory at crash: 31.2GB / 32GB (97.5%) | "
348
- "Job runtime: 12min 55s | Status: ERROR"
349
  ),
350
- "auth-service": "Connection success: 0% | DB: CRITICAL | Redis: OK",
351
- "api-gateway": "503 rate: 95% | Auth dependency: DOWN",
352
  "order-service": "Write success: 0% | DB: RESTARTING",
353
- "redis-session": "Hit rate: 99.2% | Memory: 42% | Healthy",
 
 
354
  },
355
  "check_dependencies": {
356
  "postgres-db": (
357
- "Clients: auth-service, order-service, analytics-service, product-service"
 
358
  ),
359
  "analytics-service": "Depends on: postgres-db [CRASH LOOP]",
360
- "auth-service": "Depends on: postgres-db [CRASH LOOP], redis-session [OK]",
361
- "api-gateway": "Depends on: auth-service [DOWN]",
362
- "order-service": "Depends on: postgres-db [CRASH LOOP]",
363
- "redis-session": "No DB dependency standalone cache",
 
 
364
  },
365
  "check_recent_deploys": {
366
  "analytics-service": (
367
- "Deploy 6h ago: added full_history_export scheduled job — "
368
- "runs daily at 02:00 UTC, no LIMIT on cross-table JOIN"
 
 
 
 
 
 
369
  ),
370
- "postgres-db": "No deploys in 3 weeks",
371
- "auth-service": "No recent deploys",
372
  "order-service": "No recent deploys",
373
  "redis-session": "No recent deploys",
 
 
 
 
 
 
374
  },
375
  "check_service_status": {
376
- "postgres-db": "RESTARTING | Uptime: 47s | Crash reason: OOM",
377
- "analytics-service": "ERROR | Last job: full_history_export FAILED",
378
- "auth-service": "DOWN | Waiting for postgres-db",
379
- "api-gateway": "DEGRADED | 95% requests failing",
380
- "order-service": "DOWN | Waiting for postgres-db",
381
- "redis-session": "HEALTHY | All normal",
 
 
382
  },
383
  },
384
  "correct_root_cause": {
385
- "service": "analytics-service",
386
  "failure_mode": "unbounded query OOM killing postgres-db",
387
  },
388
  "wrong_actions": {
389
- "restart_service:auth-service": "auth-service is a victim — DB must be fixed first",
390
- "restart_service:api-gateway": "api-gateway is downstream — won't help",
391
- "scale_service:postgres-db": "Scaling won't prevent OOM if the bad query runs again",
392
- "rollback_deploy:postgres-db": "postgres-db has no recent deploys",
 
 
 
 
 
393
  },
394
  },
395
 
396
- # RCA-002: BGP route withdrawal — AZ network partition
 
 
397
  {
398
  "scenario_id": "RCA-002",
399
  "description": (
400
- "A BGP route withdrawal isolated AZ-1 (where payment-service runs) "
401
- "from AZ-2 and AZ-3, causing 61% of checkout requests to fail. "
402
- "Services within AZ-1 are healthy — it is a pure network issue."
403
  ),
404
  "incident_summary": (
405
- "Checkout failure rate 61% AZ-2 and AZ-3 cannot reach payment-service "
406
- "in AZ-1. AZ-1 users unaffected. fraud-detection-service also unreachable "
407
- "cross-AZ. Network infrastructure change 18 min ago."
 
408
  ),
409
  "alert": {
410
- "id": "ALT-RCA-002",
411
- "title": "HIGH: checkout failure 61% cross-AZ connectivity loss",
412
- "severity_fired": "P2",
413
  "affected_services": [
414
  "order-service", "payment-service", "fraud-detection-service",
415
  ],
416
  "symptoms": [
417
- "checkout failure rate: 61% (AZ-2/AZ-3 only)",
418
- "payment-service: unreachable from AZ-2, AZ-3",
419
- "fraud-detection-service: timeout from AZ-2, AZ-3",
420
- "AZ-1 users: 0% failure rate",
421
- "Network: AZ-2/AZ-3 → AZ-1 routing broken",
422
  ],
423
- "error_rate": 0.61,
424
  "duration_minutes": 9,
425
  },
426
  "known_services": {
427
  "order-service", "payment-service", "fraud-detection-service",
428
  "postgres-db", "redis-payment-cache", "network-infra",
 
429
  },
430
  "tool_responses": {
431
- "query_logs": {
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
432
  "order-service": (
433
- "2024-03-17T14:32:10Z ERROR connection timeout payment-service:8080 "
434
- "(AZ-2 to AZ-1: no route to host)\n"
435
- "2024-03-17T14:32:11Z ERROR fraud-detection-service: i/o timeout (30s)"
436
  ),
437
  "payment-service": (
438
- "2024-03-17T14:31:58Z WARN health check from AZ-2 LB failing\n"
439
- "2024-03-17T14:31:59Z INFO AZ-1 local traffic: all normal"
440
  ),
441
  "fraud-detection-service": (
442
- "2024-03-17T14:32:00Z INFO AZ-1 requests: all normal\n"
443
- "2024-03-17T14:32:01Z WARN cross-AZ health probes: 100% timeout"
444
  ),
445
  "network-infra": (
446
- "2024-03-17T14:31:45Z CRITICAL BGP peer 10.0.2.1 route withdrawal "
447
- "AZ-2 lost route to AZ-1 CIDR 10.0.1.0/24\n"
448
- "2024-03-17T14:31:45Z CRITICAL BGP peer 10.0.3.1 route withdrawal — "
449
- "AZ-3 lost route to AZ-1 CIDR 10.0.1.0/24\n"
450
- "2024-03-17T14:31:44Z INFO router config change applied — "
451
- "BGP advertisement policy updated"
452
  ),
453
- "postgres-db": "Operating normally — no errors detected",
454
- "redis-payment-cache": "Operating normally — AZ-1 traffic only, all healthy",
455
  },
456
- "check_metrics": {
457
  "order-service": (
458
- "AZ-2 checkout failure: 99% | AZ-3 checkout failure: 98% | "
459
- "AZ-1 checkout failure: 0.2% (baseline)"
460
  ),
461
- "payment-service": (
462
- "AZ-1 traffic: normal (100% success) | "
463
- "AZ-2/AZ-3 inbound connections: 0 (blocked)"
464
- ),
465
- "fraud-detection-service": (
466
- "AZ-1 processing: normal | "
467
- "Cross-AZ health checks: 100% timeout"
468
  ),
469
  "network-infra": (
470
- "BGP session AZ-2: WITHDRAWN | BGP session AZ-3: WITHDRAWN | "
471
- "AZ-1 internal: all UP | Config change: 18min ago"
472
  ),
473
- "postgres-db": "All metrics normal — no anomalies",
474
- "redis-payment-cache": "All metrics normal — AZ-1 only traffic",
475
  },
476
  "check_dependencies": {
477
  "order-service": (
478
- "Depends on: payment-service [PARTITIONED], "
479
- "fraud-detection-service [PARTITIONED]"
480
  ),
481
- "payment-service": "Depends on: postgres-db [OK], redis-payment-cache [OK]",
482
- "fraud-detection-service": "Depends on: postgres-db [OK]",
483
- "network-infra": "BGP peers: AZ-2 [WITHDRAWN], AZ-3 [WITHDRAWN], AZ-1 [UP]",
484
  },
485
  "check_recent_deploys": {
486
- "network-infra": (
487
- "Router config change 18min agoBGP route advertisement policy update: "
488
- "inadvertently withdrew AZ-1 routes from AZ-2/AZ-3 peers"
489
  ),
490
- "payment-service": "No recent deploys",
491
- "order-service": "No recent deploys",
492
- "fraud-detection-service": "No recent deploys",
493
  },
494
  "check_service_status": {
495
- "payment-service": "HEALTHY within AZ-1 | Cross-AZ: UNREACHABLE",
496
- "order-service": "DEGRADED | AZ-2/AZ-3 instances failing",
497
- "network-infra": "BGP AZ-2: WITHDRAWN | BGP AZ-3: WITHDRAWN | AZ-1: UP",
498
- "fraud-detection-service": "HEALTHY within AZ-1 | Cross-AZ: UNREACHABLE",
499
- "postgres-db": "HEALTHY",
500
- "redis-payment-cache": "HEALTHY",
501
  },
502
  },
503
  "correct_root_cause": {
504
- "service": "network-infra",
505
- "failure_mode": "BGP route withdrawal causing AZ network partition",
506
  },
507
  "wrong_actions": {
508
- "restart_service:payment-service": "payment-service is healthy restarting won't fix routing",
509
- "restart_service:order-service": "order-service is a victim of the partition",
510
- "scale_service:payment-service": "Scaling won't fix a BGP routing issue",
511
- "clear_cache:redis-payment-cache": "Cache is healthy — not the cause",
512
  },
513
  },
514
  ],
515
 
516
- # ── TASK 3: REMEDIATION PLANNING ────────────────────────────────────────
517
 
518
  "remediation_planning": [
519
 
520
- # RP-001: Full OOM remediation — disable cron, restart cascade
521
  {
522
  "scenario_id": "RP-001",
523
  "description": (
524
- "Full remediation: analytics-service OOM-killed postgres-db with an "
525
- "unbounded query. Must disable the offending job, restart postgres, "
526
- "restore all downstream services, and document the resolution."
527
  ),
528
  "incident_summary": (
529
- "CRITICAL — postgres-db in OOM crash loop. auth-service, order-service, "
530
- "api-gateway all down. analytics-service caused it with unbounded query. "
531
- "Required actions: disable job, restart postgres, restore services, document."
532
  ),
533
  "alert": {
534
- "id": "ALT-RP-001",
535
- "title": "CRITICAL: postgres-db OOM crash loop — full stack down",
536
- "severity_fired": "P1",
537
  "affected_services": [
538
- "postgres-db", "analytics-service",
539
- "auth-service", "order-service", "api-gateway",
540
  ],
541
  },
542
  "known_services": {
543
  "postgres-db", "auth-service", "order-service",
544
- "api-gateway", "analytics-service",
545
  },
546
  "tool_responses": {
547
  "query_logs": {
548
  "postgres-db": (
549
- "FATAL: Out of Memory: Kill process (postgres) — "
550
- "analytics query running 12min with no LIMIT"
551
  ),
552
  "analytics-service": (
553
- "ERROR: full_history_export unbounded JOIN, 847M rows, killed by OOM"
554
  ),
555
- "auth-service": "ERROR: connect ECONNREFUSED postgres-db:5432",
556
- "order-service": "ERROR: pq: the database system is starting up",
557
- "api-gateway": "ERROR: upstream auth-service 503",
558
  },
559
  "check_metrics": {
560
- "postgres-db": "Memory: OOM | Restarts: 4 | Status: CRASH LOOP",
561
- "analytics-service": "Memory spike: 31GB/32GB | Status: ERROR",
562
- "auth-service": "Connection success: 0% | Waiting for DB",
563
- "order-service": "Write success: 0% | Waiting for DB",
564
- "api-gateway": "503 rate: 95% | Auth: DOWN",
565
  },
566
  "check_dependencies": {
567
- "postgres-db": "Clients: auth-service, order-service, analytics-service",
568
  "analytics-service": "Depends on: postgres-db [CRASH LOOP]",
569
- "auth-service": "Depends on: postgres-db [CRASH LOOP]",
570
- "order-service": "Depends on: postgres-db [CRASH LOOP]",
571
  },
572
  "check_recent_deploys": {
573
  "analytics-service": (
574
- "Deploy 6h ago: full_history_export job — "
575
- "unbounded cross-table JOIN query"
576
  ),
577
- "postgres-db": "No recent changes",
578
  },
579
  "check_service_status": {
580
- "postgres-db": "CRASH LOOP | OOM kill | Uptime: 47s",
581
- "analytics-service": "ERROR | Last job failed",
582
- "auth-service": "DOWN",
583
- "order-service": "DOWN",
584
- "api-gateway": "DEGRADED",
585
  },
586
  },
587
  "remediation_data": {
588
  "disable_feature_flag": {
589
  "full_history_export": (
590
  "Cron job full_history_export DISABLED — "
591
- "no more unbounded queries will run"
592
  ),
593
  },
594
  "restart_service": {
595
- "postgres-db": (
596
- "postgres-db restarted cleanly — "
597
- "accepting connections (12/500 active)"
598
- ),
599
- "analytics-service": (
600
- "analytics-service restartedno active queries"
601
- ),
602
- "auth-service": "auth-service restarted — reconnected to postgres-db OK",
603
- "order-service": "order-service restarted — writes resuming normally",
604
  },
605
  "execute_runbook_step": {
606
- "verify_db_health": (
607
- "postgres-db: connections 12/500, CPU 12%, Memory 34% — healthy"
608
- ),
609
  "check_service_recovery": (
610
- "auth-service OK | order-service OK | api-gateway OK"
611
  ),
612
  },
613
  },
@@ -617,122 +932,294 @@ SCENARIOS: dict = {
617
  "restart_service:postgres-db",
618
  "restart_service:auth-service",
619
  "restart_service:order-service",
620
  ],
621
  "wrong_actions": {
622
- "rollback_deploy:postgres-db": (
623
- "postgres-db has no recent deploy to roll back"
624
- ),
625
- "scale_service:postgres-db": (
626
- "Scaling won't prevent the OOM query from running again"
627
- ),
628
- "restart_service:api-gateway": (
629
- "api-gateway is downstream fix the DB first"
630
- ),
631
  },
632
  "resolution_keywords": [
633
  "analytics", "oom", "memory", "postgres", "query",
634
- "full_history_export", "disabled", "restarted", "recovered",
635
  ],
636
  },
637
 
638
- # RP-002: Full BGP remediation — restore routes, rollback config, verify
639
  {
640
  "scenario_id": "RP-002",
641
  "description": (
642
- "Full remediation: BGP route withdrawal partitioned AZ-2/AZ-3 from "
643
- "AZ-1 where payment-service runs. Must restore BGP routes, roll back "
644
- "the router config change, verify checkout recovery, and document."
645
  ),
646
  "incident_summary": (
647
- "P2 BGP partition isolating payment-service from 61% of users. "
648
- "Router config change 18min ago is the cause. "
649
- "Required: restore BGP routes, rollback network config, verify recovery."
650
  ),
651
  "alert": {
652
- "id": "ALT-RP-002",
653
- "title": "HIGH: checkout 61% failure BGP AZ partition",
654
- "severity_fired": "P2",
655
- "affected_services": ["network-infra", "order-service", "payment-service"],
656
  },
657
  "known_services": {
658
  "network-infra", "order-service", "payment-service",
659
  "fraud-detection-service", "postgres-db",
660
  },
661
  "tool_responses": {
662
  "query_logs": {
663
  "network-infra": (
664
- "CRITICAL: BGP route withdrawal — "
665
- "AZ-2/AZ-3 lost route to AZ-1 10.0.1.0/24\n"
666
- "Router config change 18min ago: BGP policy updated"
667
- ),
668
- "order-service": (
669
- "ERROR: connection timeout payment-service — no route to host"
670
- ),
671
- "payment-service": (
672
- "INFO: AZ-1 traffic normal | "
673
- "WARN: cross-AZ health checks failing"
674
- ),
675
- "fraud-detection-service": (
676
- "WARN: cross-AZ health probes 100% timeout | AZ-1 traffic: normal"
677
  ),
678
  "postgres-db": "Operating normally",
679
  },
680
  "check_metrics": {
681
- "network-infra": "BGP AZ-2: WITHDRAWN | BGP AZ-3: WITHDRAWN | AZ-1: UP",
682
- "order-service": "AZ-2 failure: 99% | AZ-1 failure: 0.2%",
683
- "payment-service": "AZ-1: normal | Cross-AZ inbound: 0",
684
- "fraud-detection-service": "AZ-1: normal | Cross-AZ: 0",
685
- "postgres-db": "All normal",
686
  },
687
  "check_dependencies": {
688
- "order-service": "Depends on: payment-service [PARTITIONED]",
689
- "payment-service": "Depends on: postgres-db [OK]",
690
- "network-infra": "BGP peers: AZ-2 [WITHDRAWN], AZ-3 [WITHDRAWN]",
691
  },
692
  "check_recent_deploys": {
693
  "network-infra": (
694
- "Config change 18min ago — BGP policy update "
695
- "accidentally withdrew AZ-1 routes"
696
  ),
697
  "payment-service": "No recent deploys",
698
- "order-service": "No recent deploys",
699
  },
700
  "check_service_status": {
701
- "network-infra": "BGP AZ-2: WITHDRAWN | BGP AZ-3: WITHDRAWN",
702
- "payment-service": "HEALTHY (AZ-1) | Cross-AZ: UNREACHABLE",
703
- "order-service": "DEGRADED",
704
  },
705
  },
706
  "remediation_data": {
707
  "rollback_deploy": {
708
- "network-infra": (
709
- "Router config rolled back — "
710
- "BGP advertisement policy restored to previous version"
711
- ),
712
  },
713
  "execute_runbook_step": {
714
- "restore_bgp_routes": (
715
- "BGP routes restored — AZ-2/AZ-3 can now reach AZ-1 10.0.1.0/24"
716
- ),
717
- "verify_checkout_recovery": (
718
- "Checkout failure rate: 0.3% — incident fully resolved"
719
- ),
720
  },
721
  },
722
  "correct_remediation_sequence": [
723
  "execute_runbook_step:restore_bgp_routes",
724
  "rollback_deploy:network-infra",
725
  "execute_runbook_step:verify_checkout_recovery",
726
  ],
727
  "wrong_actions": {
728
- "restart_service:payment-service": "payment-service is healthy — network is the issue",
729
- "scale_service:payment-service": "Scaling won't fix BGP routing",
730
- "restart_service:order-service": "order-service is a victim",
731
- "clear_cache": "Cache is unrelated to network routing",
732
  },
733
  "resolution_keywords": [
734
  "bgp", "network", "route", "rollback", "partition",
735
- "restored", "az-1", "az-2", "az-3", "checkout", "withdrawal",
736
  ],
737
  },
738
  ],
@@ -746,8 +1233,7 @@ SCENARIOS: dict = {
746
  def get_task(task_id: str) -> dict:
747
  if task_id not in ALL_TASKS:
748
  raise ValueError(
749
- f"Unknown task_id '{task_id}'. "
750
- f"Valid: {list(ALL_TASKS.keys())}"
751
  )
752
  return ALL_TASKS[task_id]
753
 
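The accessor in the hunk above is a plain dict lookup that fails fast with the list of valid ids. A minimal, self-contained sketch of the same pattern (the registry entries here are illustrative stand-ins, not the real task table built from `TASKS` and `SCENARIOS`):

```python
# Stand-in registry; the real ALL_TASKS maps the three task ids shown in this file.
ALL_TASKS = {
    "alert_classification": {"submission_action": "submit_severity"},
    "root_cause_analysis": {"max_steps": 10, "submission_action": "submit_root_cause"},
    "remediation_planning": {"submission_action": "submit_resolution"},
}

def get_task(task_id: str) -> dict:
    # Unknown ids raise ValueError naming the valid choices, as in the hunk above.
    if task_id not in ALL_TASKS:
        raise ValueError(f"Unknown task_id '{task_id}'. Valid: {list(ALL_TASKS.keys())}")
    return ALL_TASKS[task_id]

print(get_task("root_cause_analysis")["submission_action"])  # submit_root_cause
```

A harness would typically call this once per episode to read `max_steps` and the expected `submission_action` before stepping the environment.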
1
  """
2
  tasks.py — Task and scenario definitions for Cloud Incident Response OpenEnv.
3
 
4
+ Difficulty calibration targets:
5
+ EASY → 8B: 0.75-1.0, 70B: 0.85-1.0
6
+ MEDIUM → 8B: 0.30-0.50, 70B: 0.45-0.65
7
+ HARD → 8B: 0.15-0.35, 70B: 0.30-0.50
8
 
9
+ Design principles for genuine difficulty:
10
+ EASY: Alert metrics are clear. Only trick is P2-vs-P3 ambiguity.
11
+ MEDIUM: Root cause buried. 8-10 known services. Multiple red herrings.
12
+ incident_summary does NOT hint at root cause. Must investigate 4+ services.
13
+ HARD: Same diagnosis challenge + 5-7 step remediation sequence +
14
+ 10+ known services (many wrong choices) + quality summary required.
15
 
16
  Public API:
17
  get_task(task_id) -> task metadata dict
 
31
  "score_range": [0.0, 1.0],
32
  "description": (
33
  "An alert has fired. Query logs and metrics across affected services, "
34
+ "then classify the incident severity: P1 (CRITICAL — complete outage or "
35
+ "revenue >$1,000/min), P2 (HIGH — major degradation affecting most users), "
36
+ "P3 (MEDIUMpartial/minor issue with graceful fallback), "
37
+ "P4 (LOW — informational). Submit with submit_severity."
38
  ),
39
  "available_actions": [
40
  "query_logs",
 
44
  "submit_severity",
45
  ],
46
  "submission_action": "submit_severity",
47
+ "scenarios": 3,
48
  },
49
  "root_cause_analysis": {
50
  "id": "root_cause_analysis",
 
53
  "max_steps": 10,
54
  "score_range": [0.0, 1.0],
55
  "description": (
56
+ "A production incident is active with multiple services showing errors. "
57
+ "Use diagnostic tools to trace the failure chain. The root cause may be "
58
+ "any service in the system not necessarily one showing errors. "
59
+ "Query logs, metrics, dependencies, and recent deploys across ALL "
60
+ "available services to find the true trigger. Submit with submit_root_cause."
61
  ),
62
  "available_actions": [
63
  "query_logs",
 
68
  "submit_root_cause",
69
  ],
70
  "submission_action": "submit_root_cause",
71
+ "scenarios": 3,
72
  },
73
  "remediation_planning": {
74
  "id": "remediation_planning",
 
78
  "score_range": [0.0, 1.0],
79
  "description": (
80
  "A critical production incident requires full end-to-end resolution. "
81
+ "Diagnose the root cause among many services, execute the correct "
82
+ "remediation sequence (order matters wrong actions are penalized), "
83
+ "then submit a detailed resolution summary. Scored on diagnosis quality, "
84
+ "remediation correctness, action efficiency, and documentation."
85
  ),
86
  "available_actions": [
87
  "query_logs",
 
98
  "submit_resolution",
99
  ],
100
  "submission_action": "submit_resolution",
101
+ "scenarios": 3,
102
  },
103
  }
104
 
105
  # ---------------------------------------------------------------------------
106
+ # Scenario data — 3 tasks × 3 scenarios = 9 total episodes
107
  # ---------------------------------------------------------------------------
108
 
109
  SCENARIOS: dict = {
110
 
111
+ # ══════════════════════════════════════════════════════════════════════
112
+ # TASK 1: ALERT CLASSIFICATION (EASY)
113
+ # Target: 8B→0.75-1.0, 70B→0.85-1.0
114
+ #
115
+ # AC-001: Clear P1 (78% errors, $12k/min) — should be trivial
116
+ # AC-002: Ambiguous P2 (degraded but working, $800/min)
117
+ # AC-003: Trap P3 (45% errors but zero revenue impact, graceful fallback)
118
+ # ══════════════════════════════════════════════════════════════════════
119
 
120
  "alert_classification": [
121
+ # AC-001: Clear P1
 
122
  {
123
  "scenario_id": "AC-001",
124
  "description": (
125
+ "Cascading failure across multiple services. "
126
+ "Assess severity based on user and revenue impact."
127
  ),
128
  "incident_summary": (
129
+ "Alert fired: api-gateway reporting elevated error rates. "
130
+ "Multiple downstream services affected. "
131
+ "Assess the severity of this incident."
132
  ),
133
  "alert": {
134
+ "id": "ALT-20240315-001",
135
+ "title": "api-gateway error rate elevated",
136
+ "severity_fired": "UNCLASSIFIED",
137
  "affected_services": ["api-gateway", "auth-service", "postgres-db"],
138
  "symptoms": [
139
  "api-gateway: HTTP 503 rate 78% (baseline: 0.1%)",
140
  "auth-service: connection timeout 94% of requests",
141
+ "postgres-db: connection pool 500/500 utilized",
142
+ "checkout flow: unavailable",
143
+ "user logins: failing",
144
  ],
145
+ "error_rate": 0.78,
146
+ "duration_minutes": 4,
147
+ "revenue_impact_per_min": 12000,
148
  },
149
  "known_services": {"api-gateway", "auth-service", "postgres-db"},
150
  "tool_responses": {
151
  "query_logs": {
152
  "api-gateway": (
153
+ "2024-03-15T10:04:12Z ERROR upstream timeout auth-service:8080\n"
154
+ "2024-03-15T10:04:13Z ERROR 503 Service Unavailable\n"
155
+ "2024-03-15T10:04:14Z ERROR circuit breaker OPEN"
 
  ),
157
  "auth-service": (
158
+ "2024-03-15T10:04:10Z ERROR too many clients already\n"
159
+ "2024-03-15T10:04:11Z ERROR connection pool exhausted (500/500)"
160
  ),
161
  "postgres-db": (
162
+ "2024-03-15T10:04:00Z FATAL remaining slots reserved for superuser\n"
163
+ "2024-03-15T10:04:01Z LOG max_connections=500 active=500"
164
  ),
165
  },
166
  "check_metrics": {
167
+ "api-gateway": "5xx rate: 78% | p99: 30s | circuit_breaker: OPEN",
168
+ "auth-service": "Error rate: 94% | DB wait: 28s | Queue: 847",
169
+ "postgres-db": "Connections: 500/500 (100%) | CPU: 98% | Memory: 89%",
170
  },
171
  "check_dependencies": {
172
+ "api-gateway": "Depends on: auth-service [CRITICAL]",
173
+ "auth-service": "Depends on: postgres-db [CRITICAL]",
174
+ "postgres-db": "No upstream dependencies",
175
  },
176
  "check_recent_deploys": {
177
+ "api-gateway": "No recent changes",
178
+ "auth-service": "Deploy 47 min ago — connection pool size change",
179
+ "postgres-db": "No recent changes",
180
  },
181
  },
182
+ "correct_severity": "P1",
183
  "adjacent_severities": ["P2"],
184
  },
185
 
186
+ # AC-002: Ambiguous P2 — degraded but not down
187
  {
188
  "scenario_id": "AC-002",
189
  "description": (
190
+ "Service degradation affecting page load times. "
191
+ "Core transaction flows still operational. "
192
+ "Assess severity carefully."
193
  ),
194
  "incident_summary": (
195
+ "Alert fired: CDN cache performance degraded. "
196
+ "Origin servers under increased load. "
197
+ "Assess the severity of this incident."
198
  ),
199
  "alert": {
200
+ "id": "ALT-20240315-002",
201
+ "title": "CDN cache performance anomaly detected",
202
+ "severity_fired": "UNCLASSIFIED",
203
  "affected_services": ["cdn-edge", "product-service", "image-service"],
204
  "symptoms": [
205
  "CDN cache hit rate: 3% (normal: 94%)",
206
+ "product-service: elevated origin traffic",
207
  "image-service: CPU 95%, p99 latency 18s",
208
+ "Product pages: loading slowly",
209
+ "Checkout: still functional",
210
  ],
211
+ "error_rate": 0.15,
212
+ "duration_minutes": 8,
213
+ "revenue_impact_per_min": 800,
214
  },
215
  "known_services": {"cdn-edge", "product-service", "image-service"},
216
  "tool_responses": {
217
  "query_logs": {
218
  "cdn-edge": (
219
+ "2024-03-15T10:22:00Z INFO cache MISS ratio: 97%\n"
220
+ "2024-03-15T10:20:11Z WARN mass cache invalidation — 2.1M keys purged\n"
221
  "2024-03-15T10:20:10Z INFO purge pattern: /* (ALL keys)"
222
  ),
223
  "product-service": (
224
  "2024-03-15T10:22:05Z WARN request queue depth: 12,400\n"
225
+ "2024-03-15T10:22:06Z ERROR timeout from image-service\n"
226
+ "2024-03-15T10:22:07Z WARN worker pool 95%"
227
  ),
228
  "image-service": (
229
+ "2024-03-15T10:22:00Z WARN CPU throttling 95%\n"
230
+ "2024-03-15T10:22:01Z ERROR worker pool exhausted\n"
231
+ "2024-03-15T10:22:02Z WARN memory at 91%"
232
  ),
233
  },
234
  "check_metrics": {
235
+ "cdn-edge": "Cache hit: 3% | Origin RPS: 48,000 | Bandwidth: 890 Gbps",
236
+ "product-service": "Origin RPS: 48k (norm: 1.2k) | Queue: 12,400",
237
+ "image-service": "CPU: 95% | Memory: 91% | p99: 18s",
238
+ },
239
+ "check_dependencies": {
240
+ "cdn-edge": "Origin: product-service [OVERLOADED]",
241
+ "product-service": "Depends on: image-service [DEGRADED]",
242
+ "image-service": "Depends on: object-storage [OK]",
243
+ },
244
+ "check_recent_deploys": {
245
+ "cdn-edge": "Cronjob updated 2h ago — purge pattern changed",
246
+ "product-service": "No recent changes",
247
+ "image-service": "No recent changes",
248
+ },
249
+ },
250
+ "correct_severity": "P2",
251
+ "adjacent_severities": ["P1", "P3"],
252
+ },
253
+
254
+ # AC-003: P3 trap — high error rate but zero impact
255
+ {
256
+ "scenario_id": "AC-003",
257
+ "description": (
258
+ "Internal service reporting elevated errors. "
259
+ "Determine actual user and business impact. "
260
+ "Not all high error rates are critical."
261
+ ),
262
+ "incident_summary": (
263
+ "Alert fired: recommendation-service error rate elevated to 45%. "
264
+ "Assess the severity based on actual user and business impact."
265
+ ),
266
+ "alert": {
267
+ "id": "ALT-20240315-003",
268
+ "title": "recommendation-service error rate 45%",
269
+ "severity_fired": "UNCLASSIFIED",
270
+ "affected_services": ["recommendation-service", "product-service"],
271
+ "symptoms": [
272
+ "recommendation-service: error rate 45% (baseline: 2%)",
273
+ "product-service: using fallback recommendation logic",
274
+ "User experience: default recommendations shown",
275
+ "Checkout: fully functional",
276
+ "Revenue: no measurable change",
277
+ ],
278
+ "error_rate": 0.45,
279
+ "duration_minutes": 22,
280
+ "revenue_impact_per_min": 0,
281
+ },
282
+ "known_services": {"recommendation-service", "product-service", "redis-reco-cache"},
283
+ "tool_responses": {
284
+ "query_logs": {
285
+ "recommendation-service": (
286
+ "2024-03-15T09:48:00Z ERROR model inference timeout (>5s)\n"
287
+ "2024-03-15T09:48:01Z WARN ML model server overloaded\n"
288
+ "2024-03-15T09:48:02Z INFO fallback: returning default recommendations"
289
  ),
290
  "product-service": (
291
+ "2024-03-15T09:48:05Z INFO recommendation-service returned defaults\n"
292
+ "2024-03-15T09:48:06Z INFO serving page with default recs — no user impact"
293
  ),
294
+ "redis-reco-cache": "Operating normally — cache hit rate 88%",
295
+ },
296
+ "check_metrics": {
297
+ "recommendation-service": (
298
+ "Error rate: 45% | Fallback rate: 45% | "
299
+ "Model server: OVERLOADED | User impact: NONE (graceful)"
300
  ),
301
+ "product-service": (
302
+ "Error rate: 0.1% (normal) | Checkout: 100% | Revenue: unchanged"
303
+ ),
304
+ "redis-reco-cache": "Hit rate: 88% | Memory: 34% | HEALTHY",
305
  },
306
  "check_dependencies": {
307
+ "recommendation-service": "Depends on: ML model server [SLOW]",
308
+ "product-service": "Depends on: recommendation-service [DEGRADED has fallback]",
309
+ "redis-reco-cache": "No dependencies",
310
  },
311
  "check_recent_deploys": {
312
+ "recommendation-service": "Model update 3h ago — new model v2.4",
313
+ "product-service": "No recent changes",
314
+ "redis-reco-cache": "No recent changes",
315
  },
316
  },
317
+ "correct_severity": "P3",
318
+ "adjacent_severities": ["P2", "P4"],
319
  },
320
  ],
321
 
322
+ # ══════════════════════════════════════════════════════════════════════
323
+ # TASK 2: ROOT CAUSE ANALYSIS (MEDIUM)
324
+ # Target: 8B→0.30-0.50, 70B→0.45-0.65
325
+ #
326
+ # KEY DESIGN RULES:
327
+ # 1. Root cause service NEVER in affected_services
328
+ # 2. incident_summary describes SYMPTOMS only, no hints
329
+ # 3. 8-10 known_services (many to investigate)
330
+ # 4. Red herring deploys on non-root-cause services
331
+ # 5. Root cause only findable via check_recent_deploys + query_logs
332
+ # on the specific service — not from looking at victims
333
+ # ══════════════════════════════════════════════════════════════════════
334
 
335
  "root_cause_analysis": [
336
 
337
+ # RCA-001: analytics-service OOM kills postgres-db
338
+ # 8 known services. Root cause: analytics-service.
339
+ # Red herrings: auth-service deploy (cosmetic), redis healthy
340
  {
341
  "scenario_id": "RCA-001",
342
  "description": (
343
+ "Multiple services reporting failures. Database appears to be "
344
+ "the epicenter but the true trigger may be elsewhere."
 
345
  ),
346
  "incident_summary": (
347
+ "Multiple services are failing. postgres-db is in a crash loop. "
348
+ "auth-service, order-service, and api-gateway are all reporting errors. "
349
+ "Investigate all available services to find what triggered this cascade."
350
  ),
351
  "alert": {
352
+ "id": "ALT-RCA-001",
353
+ "title": "Multiple service failuresdatabase crash loop",
354
+ "severity_fired": "P1",
355
  "affected_services": [
356
  "api-gateway", "auth-service", "order-service", "postgres-db",
357
  ],
358
  "symptoms": [
359
+ "postgres-db: crash loop — 4 restarts in 12 minutes",
360
+ "auth-service: 100% connection failures",
361
  "order-service: all writes failing",
362
+ "api-gateway: 503 on authenticated routes",
363
  ],
364
+ "error_rate": 0.95,
365
  "duration_minutes": 14,
366
  },
367
  "known_services": {
368
  "api-gateway", "auth-service", "order-service",
369
  "postgres-db", "analytics-service", "redis-session",
370
+ "product-service", "notification-service",
371
  },
372
  "tool_responses": {
373
+ # In RCA-001, replace the query_logs section:
374
+ "query_logs": {
375
+ "postgres-db": (
376
+ "2024-03-16T02:11:00Z LOG database system shut down\n"
377
+ "2024-03-16T02:10:58Z FATAL terminated by kernel OOM killer\n"
378
+ "2024-03-16T02:10:30Z LOG long-running query from "
379
+ "analytics-service consuming all available memory "
380
+ "running for 12 minutes, no LIMIT clause"
381
+ ),
382
+ "analytics-service": (
383
+ "2024-03-16T01:58:00Z INFO starting scheduled job: full_history_export\n"
384
+ "2024-03-16T01:58:01Z DEBUG executing: SELECT * FROM events "
385
+ "JOIN user_sessions ON ... JOIN orders ON ... no LIMIT\n"
386
+ "2024-03-16T01:58:02Z WARN query plan estimates 847M row scan\n"
387
+ "2024-03-16T02:10:55Z ERROR job terminated — connection to database lost"
388
+ ),
389
+ "auth-service": (
390
+ "2024-03-16T02:11:05Z ERROR connect ECONNREFUSED postgres-db:5432\n"
391
+ "2024-03-16T02:11:06Z ERROR all retries exhausted"
392
+ ),
393
+ "api-gateway": (
394
+ "2024-03-16T02:11:10Z ERROR upstream auth-service: 503"
395
+ ),
396
+ "order-service": (
397
+ "2024-03-16T02:11:08Z ERROR pq: database system is starting up"
398
+ ),
399
+ "redis-session": "No errors — operating normally",
400
+ "product-service": (
401
+ "2024-03-16T02:11:12Z WARN DB queries failing — serving cached data"
402
+ ),
403
+ "notification-service": (
404
+ "2024-03-16T02:11:15Z ERROR cannot send — user lookup failed"
405
+ ),
406
+ },
407
  "check_metrics": {
408
  "postgres-db": (
409
+ "Memory: peaked at 31.8GB/32GB before kill | "
410
+ "Restarts: 4 in 12min | Status: RESTARTING | "
411
+ "Heaviest client: 10.0.5.47"
412
  ),
413
  "analytics-service": (
414
+ "Last job: FAILED | Memory during job: 28GB | "
415
+ "IP: 10.0.5.47 | CPU: idle (job terminated)"
416
  ),
417
+ "auth-service": "Connections: 0% success | Queued requests: 1,200",
418
+ "api-gateway": "503 rate: 95% | Auth: DOWN",
419
  "order-service": "Write success: 0% | DB: RESTARTING",
420
+ "redis-session": "Hit rate: 99.2% | Memory: 42% | HEALTHY",
421
+ "product-service": "Serving cached data | DB queries: 100% failing",
422
+ "notification-service": "Queue backlog: 8,400 | DB: DOWN",
423
  },
424
  "check_dependencies": {
425
  "postgres-db": (
426
+ "Clients: auth-service, order-service, analytics-service, "
427
+ "product-service, notification-service"
428
  ),
429
  "analytics-service": "Depends on: postgres-db [CRASH LOOP]",
430
+ "auth-service": "Depends on: postgres-db [CRASH LOOP], redis-session [OK]",
431
+ "api-gateway": "Depends on: auth-service [DOWN], product-service [DEGRADED]",
432
+ "order-service": "Depends on: postgres-db [CRASH LOOP]",
433
+ "redis-session": "Standalone cacheno DB dependency",
434
+ "product-service": "Depends on: postgres-db [CRASH LOOP — using cache]",
435
+ "notification-service": "Depends on: postgres-db [CRASH LOOP]",
436
  },
437
  "check_recent_deploys": {
438
  "analytics-service": (
439
+ "Deploy 6h ago: added scheduled data export job — "
440
+ "runs daily at 02:00 UTC. Change includes cross-table "
441
+ "JOIN query without LIMIT clause"
442
+ ),
443
+ "postgres-db": "No deploys in 3 weeks",
444
+ "auth-service": (
445
+ "Deploy 2h ago: updated structured logging format. "
446
+ "No functional changes, no query changes, no connection changes."
447
  ),
448
  "order-service": "No recent deploys",
449
  "redis-session": "No recent deploys",
450
+ "api-gateway": "No recent deploys",
451
+ "product-service": (
452
+ "Deploy 3 days ago: added product image lazy loading. "
453
+ "No DB changes."
454
+ ),
455
+ "notification-service": "No recent deploys",
456
  },
457
  "check_service_status": {
458
+ "postgres-db": "RESTARTING | Uptime: 47s | Last crash: OOM",
459
+ "analytics-service": "ERROR | Last job: FAILED 12min ago",
460
+ "auth-service": "DOWN | Blocked on postgres-db",
461
+ "api-gateway": "DEGRADED | 95% errors",
462
+ "order-service": "DOWN | Blocked on postgres-db",
463
+ "redis-session": "HEALTHY | 99.2% hit rate",
464
+ "product-service": "DEGRADED | Cache fallback active",
465
+ "notification-service": "DEGRADED | Queue backlog 8,400",
466
  },
467
  },
468
  "correct_root_cause": {
469
+ "service": "analytics-service",
470
  "failure_mode": "unbounded query OOM killing postgres-db",
471
  },
472
  "wrong_actions": {
473
+ "restart_service:auth-service": "victim — DB must be fixed first",
474
+ "restart_service:api-gateway": "downstream — won't help",
475
+ "restart_service:order-service": "victim won't help",
476
+ "scale_service:postgres-db": "won't prevent OOM from bad query",
477
+ "rollback_deploy:postgres-db": "no recent deploys",
478
+ "rollback_deploy:auth-service": "auth deploy was cosmetic only",
479
+ "rollback_deploy:product-service": "product deploy unrelated",
480
+ "restart_service:redis-session": "redis is healthy",
481
+ "restart_service:notification-service": "victim — won't help",
482
  },
483
  },
484
 
485
+ # RCA-002: network-infra BGP withdrawal
486
+ # 8 known services. Root cause: network-infra.
487
+ # Red herrings: payment-service looks down, postgres-db exists
488
  {
489
  "scenario_id": "RCA-002",
490
  "description": (
491
+ "Checkout failures concentrated in specific availability zones. "
492
+ "Some services appear unreachable while others work fine."
493
  ),
494
  "incident_summary": (
495
+ "Checkout failure rate has spiked to 61%. payment-service and "
496
+ "fraud-detection-service are unreachable from some parts of the "
497
+ "infrastructure but appear healthy from others. Multiple services "
498
+ "to investigate. Find the root cause."
499
  ),
500
  "alert": {
501
+ "id": "ALT-RCA-002",
502
+ "title": "Checkout failurespartial service unreachability",
503
+ "severity_fired": "P2",
504
  "affected_services": [
505
  "order-service", "payment-service", "fraud-detection-service",
506
  ],
507
  "symptoms": [
508
+ "checkout failure rate: 61%",
509
+ "payment-service: intermittently unreachable",
510
+ "fraud-detection-service: intermittently unreachable",
511
+ "failures appear zone-specific",
 
512
  ],
513
+ "error_rate": 0.61,
514
  "duration_minutes": 9,
515
  },
516
  "known_services": {
517
  "order-service", "payment-service", "fraud-detection-service",
518
  "postgres-db", "redis-payment-cache", "network-infra",
519
+ "cdn-edge", "api-gateway",
520
  },
521
  "tool_responses": {
+ "query_logs": {
+ "order-service": (
+ "2024-03-17T14:32:10Z ERROR connection timeout "
+ "payment-service:8080 — no route to host\n"
+ "2024-03-17T14:32:11Z ERROR fraud-detection-service: i/o timeout\n"
+ "2024-03-17T14:32:12Z WARN failures only from AZ-2/AZ-3, "
+ "AZ-1 traffic normal — possible network-infra issue"
+ ),
+ "payment-service": (
+ "2024-03-17T14:31:58Z WARN health check from external LB failing\n"
+ "2024-03-17T14:31:59Z INFO local AZ-1 traffic: all normal\n"
+ "2024-03-17T14:32:00Z INFO processing requests normally (local only)"
+ ),
+ "fraud-detection-service": (
+ "2024-03-17T14:32:00Z INFO local requests: processing normally\n"
+ "2024-03-17T14:32:01Z WARN external health probes: 100% timeout"
+ ),
+ "network-infra": (
+ "2024-03-17T14:31:45Z CRITICAL BGP session 10.0.2.1 DOWN — "
+ "routes to 10.0.1.0/24 withdrawn from peer\n"
+ "2024-03-17T14:31:45Z CRITICAL BGP session 10.0.3.1 DOWN — "
+ "routes to 10.0.1.0/24 withdrawn from peer\n"
+ "2024-03-17T14:31:44Z INFO configuration change applied — "
+ "export filter policy updated"
+ ),
+ "postgres-db": "Operating normally — no errors",
+ "redis-payment-cache": "Operating normally — all healthy",
+ "cdn-edge": "Operating normally — cache serving fine",
+ "api-gateway": (
+ "2024-03-17T14:32:15Z ERROR some backend routes timing out\n"
+ "2024-03-17T14:32:16Z INFO AZ-1 backends: responding normally"
+ ),
+ },
+ "check_metrics": {
  "order-service": (
+ "Failure rate varies by source AZ: "
+ "AZ-1: 0.2% | AZ-2: 99% | AZ-3: 98%"
  ),
  "payment-service": (
+ "Internal processing: 100% success | "
+ "Inbound from AZ-2: 0 connections | Inbound from AZ-3: 0 connections | "
+ "Inbound from AZ-1: normal"
  ),
  "fraud-detection-service": (
+ "Internal: normal | External probes: 100% timeout"
  ),
  "network-infra": (
+ "BGP sessions: AZ-1 internal UP | "
+ "AZ-2→AZ-1: WITHDRAWN | AZ-3→AZ-1: WITHDRAWN | "
+ "Last change: 18min ago"
  ),
+ "postgres-db": "All metrics normal",
+ "redis-payment-cache": "All metrics normal",
+ "cdn-edge": "Cache hit: 91% | Normal operation",
+ "api-gateway": "Mixed — AZ-1 OK, AZ-2/AZ-3 partial failures",
  },
+ "check_dependencies": {
  "order-service": (
+ "Depends on: payment-service [PARTIAL], "
+ "fraud-detection-service [PARTIAL]"
  ),
+ "payment-service": "Depends on: postgres-db [OK], redis-payment-cache [OK]",
+ "fraud-detection-service": "Depends on: postgres-db [OK]",
+ "network-infra": (
+ "BGP peers: AZ-2 [WITHDRAWN], AZ-3 [WITHDRAWN], AZ-1 [UP]"
  ),
+ "postgres-db": "All connections healthy",
+ "redis-payment-cache": "All connections healthy",
+ "cdn-edge": "No issues",
+ "api-gateway": "Depends on: multiple backends [MIXED]",
+ },
+ "check_recent_deploys": {
  "network-infra": (
+ "Router configuration change 18min ago modified BGP "
+ "export filter policy. Change accidentally removed AZ-1 "
+ "prefix 10.0.1.0/24 from advertisements to AZ-2 and AZ-3 peers."
+ ),
+ "payment-service": "No recent deploys",
+ "order-service": "No recent deploys",
+ "fraud-detection-service": "No recent deploys",
+ "postgres-db": (
+ "Minor config change 5 days ago — increased shared_buffers. "
+ "No issues since."
+ ),
+ "redis-payment-cache": "No recent deploys",
+ "cdn-edge": "No recent deploys",
+ "api-gateway": (
+ "Deploy 1 day ago — added request tracing headers. "
+ "No routing changes."
  ),
+ },
+ "check_service_status": {
+ "payment-service": "HEALTHY (local) | Cross-AZ: UNREACHABLE",
+ "order-service": "DEGRADED | Partial failures",
+ "network-infra": "BGP AZ-2: WITHDRAWN | BGP AZ-3: WITHDRAWN",
+ "fraud-detection-service": "HEALTHY (local) | Cross-AZ: UNREACHABLE",
+ "postgres-db": "HEALTHY",
+ "redis-payment-cache": "HEALTHY",
+ "cdn-edge": "HEALTHY",
+ "api-gateway": "DEGRADED | Mixed backend status",
+ },
+ },
+ "correct_root_cause": {
+ "service": "network-infra",
+ "failure_mode": "BGP route withdrawal causing AZ network partition",
+ },
+ "wrong_actions": {
+ "restart_service:payment-service": "healthy — network issue",
+ "restart_service:order-service": "victim",
+ "scale_service:payment-service": "won't fix routing",
+ "clear_cache:redis-payment-cache": "cache is healthy",
+ "restart_service:api-gateway": "victim of routing issue",
+ "rollback_deploy:api-gateway": "deploy was unrelated tracing headers",
+ "rollback_deploy:postgres-db": "config change was 5 days ago, unrelated",
+ "restart_service:cdn-edge": "CDN is healthy",
+ },
+ },
+
+ # RCA-003: config-service credential rotation bug
+ # 8 known services. Root cause: config-service.
+ # Red herrings: user-service had a recent deploy, postgres-db stressed
+ {
+ "scenario_id": "RCA-003",
+ "description": (
+ "Multiple services experiencing database authentication failures. "
+ "The database itself may not be the problem."
+ ),
+ "incident_summary": (
+ "Several services are reporting database authentication failures. "
+ "postgres-db connection pool is saturated. user-service and "
+ "notification-service are down. api-gateway error rate elevated. "
+ "Investigate all services to find what triggered this."
+ ),
+ "alert": {
+ "id": "ALT-RCA-003",
+ "title": "Multiple services — database authentication failures",
+ "severity_fired": "P2",
+ "affected_services": [
+ "api-gateway", "user-service", "notification-service", "postgres-db",
+ ],
+ "symptoms": [
+ "user-service: FATAL password authentication failed",
+ "notification-service: FATAL password authentication failed",
+ "api-gateway: 503 rate 62%",
+ "postgres-db: connection pool 490/500",
+ ],
+ "error_rate": 0.62,
+ "duration_minutes": 7,
+ },
+ "known_services": {
+ "api-gateway", "user-service", "notification-service",
+ "postgres-db", "config-service", "redis-session",
+ "order-service", "product-service",
+ },
+ "tool_responses": {
+ "query_logs": {
+ "user-service": (
+ "2024-03-18T08:14:00Z FATAL password authentication failed "
+ "for user 'app_user'\n"
+ "2024-03-18T08:14:01Z ERROR DB credentials rejected — "
+ "credentials were pushed by config-service at 08:12:00Z\n"
+ "2024-03-18T08:14:02Z WARN config-service credential rotation "
+ "may have sent wrong credentials"
+ ),
+ "notification-service": (
+ "2024-03-18T08:14:05Z FATAL password authentication failed\n"
+ "2024-03-18T08:14:06Z WARN credentials from config-service "
+ "push at 08:12:00Z appear to be stale/invalid"
+ ),
+ "api-gateway": (
+ "2024-03-18T08:14:10Z ERROR upstream user-service: 503\n"
+ "2024-03-18T08:14:11Z ERROR upstream notification-service: 503"
+ ),
+ "postgres-db": (
+ "2024-03-18T08:14:00Z LOG auth failure from 10.0.3.x\n"
+ "2024-03-18T08:14:00Z LOG auth failure from 10.0.4.x\n"
+ "2024-03-18T08:14:01Z LOG 490/500 slots used by failed auth retries"
+ ),
+ "config-service": (
+ "2024-03-18T08:12:00Z INFO secrets rotation job executed\n"
+ "2024-03-18T08:12:01Z WARN rotation referenced PREVIOUS "
+ "credential set instead of generating new — template bug "
+ "in version v3.2.1\n"
+ "2024-03-18T08:12:02Z INFO pushed credentials to: "
+ "user-service, notification-service, order-service"
+ ),
+ "redis-session": "Operating normally",
+ "order-service": (
+ "2024-03-18T08:14:20Z WARN received credential push from "
+ "config-service but have not restarted — still using old valid creds"
+ ),
+ "product-service": "Operating normally — using original credentials",
+ },
+ "check_metrics": {
+ "user-service": "DB auth: 100% failure | HTTP 503: 100%",
+ "notification-service": "DB auth: 100% failure | HTTP 503: 100%",
+ "api-gateway": "503 rate: 62% | Some upstreams DOWN",
+ "postgres-db": (
+ "Connections: 490/500 | Auth failures/s: 80 | "
+ "Valid connections: 10 | DB itself: HEALTHY"
+ ),
+ "config-service": (
+ "Status: HEALTHY | Last push: 7min ago | "
+ "Type: secrets_rotation | Result: COMPLETED"
+ ),
+ "redis-session": "All normal",
+ "order-service": "Using old credentials — still working",
+ "product-service": "All normal — unaffected",
  },
  "check_dependencies": {
+ "user-service": (
+ "Depends on: postgres-db [AUTH FAIL], "
+ "config-service [credential source]"
+ ),
+ "notification-service": (
+ "Depends on: postgres-db [AUTH FAIL], "
+ "config-service [credential source]"
+ ),
+ "api-gateway": "Depends on: user-service [DOWN], notification-service [DOWN]",
+ "postgres-db": "No upstream dependencies — DB is healthy",
+ "config-service": (
+ "Provides: credentials to user-service, "
+ "notification-service, order-service"
+ ),
+ "redis-session": "Standalone",
  "order-service": (
+ "Depends on: postgres-db [OK — old creds], "
+ "config-service [pending push]"
  ),
+ "product-service": "Depends on: postgres-db [OK — original creds]",
  },
  "check_recent_deploys": {
+ "config-service": (
+ "Deploy 2h ago: version v3.2.1 — updated secrets rotation "
+ "job template. Bug: rotation references previous credential "
+ "set instead of generating new credentials."
  ),
+ "user-service": (
+ "Deploy 4h ago: added new profile API endpoint. "
+ "No database or credential changes."
+ ),
+ "notification-service": "No recent deploys",
+ "postgres-db": "No recent deploys",
+ "api-gateway": "No recent deploys",
+ "redis-session": "No recent deploys",
+ "order-service": (
+ "Deploy 1 day ago: updated order confirmation email template. "
+ "No DB changes."
+ ),
+ "product-service": "No recent deploys",
  },
  "check_service_status": {
+ "user-service": "DOWN | DB auth failures",
+ "notification-service": "DOWN | DB auth failures",
+ "api-gateway": "DEGRADED | 62% error rate",
+ "postgres-db": "STRESSED but HEALTHY | 490/500 connections (failed auths)",
+ "config-service": "HEALTHY | Last rotation: 7min ago (completed)",
+ "redis-session": "HEALTHY",
+ "order-service": "HEALTHY | Old credentials still valid",
+ "product-service": "HEALTHY",
  },
  },
  "correct_root_cause": {
+ "service": "config-service",
+ "failure_mode": "secrets rotation pushed stale credentials to downstream services",
  },
  "wrong_actions": {
+ "restart_service:user-service": "will retry with same bad credentials",
+ "restart_service:notification-service": "same bad credentials",
+ "restart_service:postgres-db": "DB is healthy — client creds are bad",
+ "scale_service:postgres-db": "connections are failed auths",
+ "rollback_deploy:user-service": "user-service deploy was unrelated",
+ "rollback_deploy:order-service": "order-service deploy was unrelated",
+ "restart_service:api-gateway": "downstream — fix upstream first",
  },
  },
  ],

+ # ══════════════════════════════════════════════════════════════════════
+ # TASK 3: REMEDIATION PLANNING (HARD)
+ # Target: 8B→0.15-0.35, 70B→0.30-0.50
+ #
+ # KEY DESIGN RULES:
+ # 1. Same diagnostic challenge as medium
+ # 2. 5-7 step remediation sequence required
+ # 3. 8-10 known services = many wrong choices
+ # 4. Wrong actions carry -0.05 penalty each (up to -0.15)
+ # 5. Summary must hit 3+ keywords for bonus
+ # 6. incident_summary does NOT reveal root cause
+ # ══════════════════════════════════════════════════════════════════════

  "remediation_planning": [

+ # RP-001: OOM remediation — 6-step sequence, 8 services
  {
  "scenario_id": "RP-001",
  "description": (
+ "Full incident remediation required. Multiple services down. "
+ "Diagnose the root cause, execute fixes in the correct order, "
+ "and document your resolution."
  ),
  "incident_summary": (
+ "CRITICAL — postgres-db is crash-looping. auth-service, order-service, "
+ "and api-gateway are all down. notification-service queue backing up. "
+ "Diagnose the root cause, fix it, restore all services, and document."
  ),
  "alert": {
+ "id": "ALT-RP-001",
+ "title": "CRITICAL: database crash loop — multiple services down",
+ "severity_fired": "P1",
  "affected_services": [
+ "postgres-db", "auth-service", "order-service", "api-gateway",
  ],
  },
  "known_services": {
  "postgres-db", "auth-service", "order-service",
+ "api-gateway", "analytics-service", "redis-session",
+ "product-service", "notification-service",
  },
  "tool_responses": {
  "query_logs": {
  "postgres-db": (
+ "FATAL: terminated by kernel OOM killer — "
+ "query from client 10.0.5.47 running 12min consuming all memory"
  ),
  "analytics-service": (
+ "INFO: starting job full_history_export\n"
+ "WARN: query plan: 847M rows, cross-table JOIN, no LIMIT\n"
+ "ERROR: job terminated — database connection lost"
  ),
+ "auth-service": "ERROR: connect ECONNREFUSED postgres-db:5432",
+ "order-service": "ERROR: pq: database system is starting up",
+ "api-gateway": "ERROR: upstream auth-service 503",
+ "redis-session": "Operating normally",
+ "product-service": "WARN: DB failing — serving cached data",
+ "notification-service": "ERROR: user lookup failed — queuing",
  },
  "check_metrics": {
+ "postgres-db": "OOM killed | Restarts: 4 | Heaviest client: 10.0.5.47",
+ "analytics-service": "Job FAILED | Memory peak: 31GB/32GB | IP: 10.0.5.47",
+ "auth-service": "0% DB success | Queue: 1,200",
+ "order-service": "0% write success",
+ "api-gateway": "503 rate: 95%",
+ "redis-session": "HEALTHY | 99.2% hit rate",
+ "product-service": "Cache fallback active",
+ "notification-service": "Queue: 8,400 messages backed up",
  },
  "check_dependencies": {
+ "postgres-db": (
+ "Clients: auth-service, order-service, analytics-service, "
+ "product-service, notification-service"
+ ),
  "analytics-service": "Depends on: postgres-db [CRASH LOOP]",
+ "auth-service": "Depends on: postgres-db [CRASH LOOP], redis-session [OK]",
+ "api-gateway": "Depends on: auth-service [DOWN]",
+ "order-service": "Depends on: postgres-db [CRASH LOOP]",
+ "redis-session": "Standalone",
+ "product-service": "Depends on: postgres-db [CRASH LOOP — cache fallback]",
+ "notification-service": "Depends on: postgres-db [CRASH LOOP]",
  },
  "check_recent_deploys": {
  "analytics-service": (
+ "Deploy 6h ago: added scheduled export job — "
+ "cross-table JOIN without LIMIT clause"
  ),
+ "postgres-db": "No deploys in 3 weeks",
+ "auth-service": "Deploy 2h ago: logging format only — no functional changes",
+ "order-service": "No recent deploys",
+ "product-service": "Deploy 3 days ago: image lazy loading — no DB changes",
+ "notification-service": "No recent deploys",
  },
  "check_service_status": {
+ "postgres-db": "CRASH LOOP | OOM | Uptime: 47s",
+ "analytics-service": "ERROR | Job FAILED",
+ "auth-service": "DOWN",
+ "order-service": "DOWN",
+ "api-gateway": "DEGRADED | 95% errors",
+ "redis-session": "HEALTHY",
+ "product-service": "DEGRADED | Cache fallback",
+ "notification-service": "DEGRADED | Queue backlog",
  },
  },
  "remediation_data": {
  "disable_feature_flag": {
  "full_history_export": (
  "Cron job full_history_export DISABLED — "
+ "unbounded query will not execute again"
  ),
  },
  "restart_service": {
+ "postgres-db": "postgres-db restarted — accepting connections (12/500)",
+ "analytics-service": "analytics-service restarted — idle",
+ "auth-service": "auth-service restarted — connected to postgres-db OK",
+ "order-service": "order-service restarted — writes resuming",
+ "api-gateway": "api-gateway restarted — routing recovered",
+ "product-service": "product-service — switched from cache to live DB",
+ "notification-service": "notification-service — draining queue",
  },
  "execute_runbook_step": {
+ "verify_db_health": "postgres-db: 12/500 connections, CPU 12%, Memory 34% — healthy",
  "check_service_recovery": (
  "auth OK | order OK | api-gateway OK | product OK | notification DRAINING"
  ),
  },
  },
  "correct_remediation_sequence": [
  "disable_feature_flag:full_history_export",
  "restart_service:analytics-service",
  "restart_service:postgres-db",
  "restart_service:auth-service",
  "restart_service:order-service",
+ "execute_runbook_step:verify_db_health",
  ],
  "wrong_actions": {
+ "rollback_deploy:postgres-db": "no recent deploy",
+ "scale_service:postgres-db": "won't prevent OOM",
+ "restart_service:api-gateway": "downstream — fix DB stack first",
+ "rollback_deploy:auth-service": "cosmetic deploy only",
+ "clear_cache:redis-session": "healthy — not related",
+ "restart_service:redis-session": "healthy — not related",
+ "rollback_deploy:product-service": "unrelated deploy",
+ "restart_service:notification-service": "will recover once DB is up",
  },
  "resolution_keywords": [
  "analytics", "oom", "memory", "postgres", "query",
+ "full_history_export", "disabled", "restarted",
+ "recovered", "unbounded", "crash", "kill",
  ],
  },

+ # RP-002: BGP remediation — 4-step sequence, 8 services
  {
  "scenario_id": "RP-002",
  "description": (
+ "Full incident remediation required. Checkout failures affecting "
+ "most users. Diagnose, fix, verify, and document."
  ),
  "incident_summary": (
+ "Checkout failure rate 61%. payment-service unreachable from most "
+ "of the infrastructure. Some services report no issues. "
+ "Diagnose the root cause, execute remediation, verify recovery, "
+ "and document the resolution."
  ),
  "alert": {
+ "id": "ALT-RP-002",
+ "title": "Checkout failures — partial service unreachability",
+ "severity_fired": "P2",
+ "affected_services": ["order-service", "payment-service"],
  },
  "known_services": {
  "network-infra", "order-service", "payment-service",
  "fraud-detection-service", "postgres-db",
+ "redis-payment-cache", "cdn-edge", "api-gateway",
  },
  "tool_responses": {
  "query_logs": {
  "network-infra": (
+ "CRITICAL: BGP peer 10.0.2.1 route withdrawal — "
+ "routes to 10.0.1.0/24 removed\n"
+ "CRITICAL: BGP peer 10.0.3.1 route withdrawal — "
+ "routes to 10.0.1.0/24 removed\n"
+ "INFO: configuration change applied — export filter updated"
  ),
+ "order-service": "ERROR: timeout payment-service — no route to host",
+ "payment-service": "INFO: local traffic normal | WARN: external health failing",
+ "fraud-detection-service": "WARN: cross-AZ probes timeout | Local: OK",
  "postgres-db": "Operating normally",
+ "redis-payment-cache": "Operating normally",
+ "cdn-edge": "Operating normally",
+ "api-gateway": "ERROR: some backend routes timing out",
  },
  "check_metrics": {
+ "network-infra": (
+ "BGP AZ-2→AZ-1: WITHDRAWN | AZ-3→AZ-1: WITHDRAWN | "
+ "AZ-1 internal: UP | Last change: 18min ago"
+ ),
+ "order-service": "AZ-1: 0.2% fail | AZ-2: 99% fail | AZ-3: 98% fail",
+ "payment-service": "Internal: 100% success | External: 0 inbound from AZ-2/3",
+ "fraud-detection-service": "Local: normal | External: timeout",
+ "postgres-db": "All normal",
+ "redis-payment-cache": "All normal",
+ "cdn-edge": "Cache: 91% hit | Normal",
+ "api-gateway": "Mixed — AZ-1 OK, AZ-2/3 partial failures",
  },
  "check_dependencies": {
+ "order-service": "Depends on: payment-service [PARTIAL], fraud-detection [PARTIAL]",
+ "payment-service": "Depends on: postgres-db [OK], redis-payment-cache [OK]",
+ "network-infra": "BGP: AZ-2 [WITHDRAWN], AZ-3 [WITHDRAWN]",
+ "fraud-detection-service": "Depends on: postgres-db [OK]",
+ "postgres-db": "All healthy",
+ "redis-payment-cache": "All healthy",
+ "cdn-edge": "No issues",
+ "api-gateway": "Mixed backends",
  },
  "check_recent_deploys": {
  "network-infra": (
+ "Config change 18min ago — BGP export filter modified, "
+ "accidentally removed AZ-1 prefix from AZ-2/AZ-3 ads"
  ),
  "payment-service": "No recent deploys",
+ "order-service": "No recent deploys",
+ "fraud-detection-service": "No recent deploys",
+ "postgres-db": "Minor change 5 days ago — increased shared_buffers",
+ "redis-payment-cache": "No recent deploys",
+ "cdn-edge": "No recent deploys",
+ "api-gateway": "Deploy 1 day ago — tracing headers, no routing changes",
  },
  "check_service_status": {
+ "network-infra": "BGP AZ-2: WITHDRAWN | BGP AZ-3: WITHDRAWN",
+ "payment-service": "HEALTHY (local) | Cross-AZ: UNREACHABLE",
+ "order-service": "DEGRADED",
+ "fraud-detection-service": "HEALTHY (local) | Cross-AZ: UNREACHABLE",
+ "postgres-db": "HEALTHY",
+ "redis-payment-cache": "HEALTHY",
+ "cdn-edge": "HEALTHY",
+ "api-gateway": "DEGRADED",
  },
  },
  "remediation_data": {
  "rollback_deploy": {
+ "network-infra": "Router config rolled back — BGP policy restored",
  },
  "execute_runbook_step": {
+ "restore_bgp_routes": "BGP routes restored — AZ-2/3 can reach AZ-1",
+ "verify_checkout_recovery": "Checkout failure: 0.3% — resolved",
+ "verify_cross_az_connectivity": "AZ-2→AZ-1: OK | AZ-3→AZ-1: OK",
  },
  },
  "correct_remediation_sequence": [
  "execute_runbook_step:restore_bgp_routes",
  "rollback_deploy:network-infra",
+ "execute_runbook_step:verify_cross_az_connectivity",
  "execute_runbook_step:verify_checkout_recovery",
  ],
  "wrong_actions": {
+ "restart_service:payment-service": "healthy — network issue",
+ "scale_service:payment-service": "won't fix routing",
+ "restart_service:order-service": "victim",
+ "clear_cache:redis-payment-cache": "unrelated",
+ "restart_service:cdn-edge": "healthy",
+ "restart_service:fraud-detection-service": "healthy locally",
+ "restart_service:api-gateway": "victim of routing",
+ "rollback_deploy:api-gateway": "deploy was unrelated",
+ "rollback_deploy:postgres-db": "change was 5 days ago",
  },
  "resolution_keywords": [
  "bgp", "network", "route", "rollback", "partition",
+ "restored", "az-1", "az-2", "az-3", "checkout",
+ "withdrawal", "config", "advertisement", "export",
+ ],
+ },
+
+ # RP-003: Credential rotation remediation — 7-step sequence, 8 services
+ {
+ "scenario_id": "RP-003",
+ "description": (
+ "Full incident remediation required. Multiple services failing "
+ "database authentication. Diagnose, fix, verify, and document."
+ ),
+ "incident_summary": (
+ "Multiple services reporting database authentication failures. "
+ "postgres-db connection pool near capacity with failed auth attempts. "
+ "user-service and notification-service are down. api-gateway degraded. "
+ "Diagnose the root cause, execute remediation, and document."
+ ),
+ "alert": {
+ "id": "ALT-RP-003",
+ "title": "Multiple services — DB authentication failures",
+ "severity_fired": "P2",
+ "affected_services": [
+ "user-service", "notification-service", "api-gateway",
+ ],
+ },
+ "known_services": {
+ "api-gateway", "user-service", "notification-service",
+ "postgres-db", "config-service", "redis-session",
+ "order-service", "product-service",
+ },
+ "tool_responses": {
+ "query_logs": {
+ "user-service": (
+ "FATAL: password authentication failed for user 'app_user'\n"
+ "ERROR: DB credentials rejected\n"
+ "WARN: credentials last refreshed at 08:12:00Z"
+ ),
+ "notification-service": (
+ "FATAL: password authentication failed\n"
+ "WARN: credentials from 08:12:00Z appear stale"
+ ),
+ "api-gateway": (
+ "ERROR: upstream user-service 503\n"
+ "ERROR: upstream notification-service 503"
+ ),
+ "postgres-db": (
+ "LOG: auth failure from 10.0.3.x (user-service)\n"
+ "LOG: auth failure from 10.0.4.x (notification-service)\n"
+ "LOG: 490/500 slots used by failed auth retries"
+ ),
+ "config-service": (
+ "INFO: secrets rotation executed at 08:12:00Z\n"
+ "WARN: rotation used PREVIOUS credential set — "
+ "template bug in v3.2.1\n"
+ "INFO: pushed to: user-service, notification-service, order-service"
+ ),
+ "redis-session": "Operating normally",
+ "order-service": (
+ "WARN: received credential push at 08:12:00Z — "
+ "not applied yet, still using old valid credentials"
+ ),
+ "product-service": "Operating normally — using original credentials",
+ },
+ "check_metrics": {
+ "user-service": "DB auth: 100% failure | HTTP 503: 100%",
+ "notification-service": "DB auth: 100% failure | HTTP 503: 100%",
+ "api-gateway": "503 rate: 62%",
+ "postgres-db": "Connections: 490/500 | Auth failures/s: 80 | DB: HEALTHY",
+ "config-service": "HEALTHY | Last push: 7min ago | Type: secrets_rotation",
+ "redis-session": "All normal",
+ "order-service": "HEALTHY | Using old (valid) credentials",
+ "product-service": "HEALTHY | Unaffected",
+ },
+ "check_dependencies": {
+ "user-service": "Depends on: postgres-db [AUTH FAIL], config-service [creds]",
+ "notification-service": "Depends on: postgres-db [AUTH FAIL], config-service [creds]",
+ "api-gateway": "Depends on: user-service [DOWN], notification-service [DOWN]",
+ "postgres-db": "No upstream — DB itself is healthy",
+ "config-service": "Provides credentials to: user-svc, notification-svc, order-svc",
+ "redis-session": "Standalone",
+ "order-service": "Depends on: postgres-db [OK — old creds]",
+ "product-service": "Depends on: postgres-db [OK — original creds]",
+ },
+ "check_recent_deploys": {
+ "config-service": (
+ "Deploy 2h ago: v3.2.1 — updated secrets rotation template. "
+ "Bug: references previous credential set instead of generating new."
+ ),
+ "user-service": "Deploy 4h ago: profile endpoint — no DB changes",
+ "notification-service": "No recent deploys",
+ "postgres-db": "No recent deploys",
+ "api-gateway": "No recent deploys",
+ "redis-session": "No recent deploys",
+ "order-service": "Deploy 1 day ago: email template — no DB changes",
+ "product-service": "No recent deploys",
+ },
+ "check_service_status": {
+ "user-service": "DOWN | DB auth failures",
+ "notification-service": "DOWN | DB auth failures",
+ "api-gateway": "DEGRADED | 62%",
+ "postgres-db": "STRESSED | 490/500 connections (failed auths)",
+ "config-service": "HEALTHY | Rotation completed",
+ "redis-session": "HEALTHY",
+ "order-service": "HEALTHY | Old creds valid",
+ "product-service": "HEALTHY",
+ },
+ },
+ "remediation_data": {
+ "rollback_deploy": {
+ "config-service": "config-service rolled back to v3.2.0 — bug removed",
+ },
+ "execute_runbook_step": {
+ "trigger_credential_rotation": (
+ "Correct credentials generated and pushed to "
+ "user-service, notification-service, order-service"
+ ),
+ "verify_db_connectivity": (
+ "user-service: DB OK | notification-service: DB OK | "
+ "order-service: DB OK | postgres-db: 45/500 connections"
+ ),
+ "verify_api_recovery": "api-gateway 503 rate: 0.1% — recovered",
+ },
+ "restart_service": {
+ "user-service": "user-service restarted — DB auth OK with correct creds",
+ "notification-service": "notification-service restarted — DB auth OK",
+ "order-service": "order-service restarted — using correct credentials",
+ },
+ },
+ "correct_remediation_sequence": [
+ "rollback_deploy:config-service",
+ "execute_runbook_step:trigger_credential_rotation",
+ "restart_service:user-service",
+ "restart_service:notification-service",
+ "restart_service:order-service",
+ "execute_runbook_step:verify_db_connectivity",
+ "execute_runbook_step:verify_api_recovery",
+ ],
+ "wrong_actions": {
+ "restart_service:postgres-db": "DB is healthy — problem is credentials",
+ "scale_service:postgres-db": "connections are failed auths",
+ "restart_service:api-gateway": "downstream — fix auth first",
+ "rollback_deploy:user-service": "deploy was unrelated",
+ "rollback_deploy:order-service": "deploy was unrelated",
+ "clear_cache:redis-session": "healthy",
+ "restart_service:product-service": "healthy",
+ "restart_service:redis-session": "healthy",
+ },
+ "resolution_keywords": [
+ "config", "credential", "rotation", "stale", "password",
+ "authentication", "rollback", "config-service", "v3.2.1",
+ "restarted", "recovered", "push", "secrets", "template",
  ],
  },
  ],

  def get_task(task_id: str) -> dict:
  if task_id not in ALL_TASKS:
  raise ValueError(
+ f"Unknown task_id '{task_id}'. Valid: {list(ALL_TASKS.keys())}"
  )
  return ALL_TASKS[task_id]

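The TASK 3 design rules commented above (a -0.05 penalty per wrong action, capped at -0.15, plus a bonus when the written summary hits 3+ resolution keywords) can be sketched as a small scorer. This is a hypothetical illustration, not the repo's graders.py implementation — the function name, signature, base-credit formula, and 0.1 bonus size are assumptions.

```python
def score_remediation(actions, correct_sequence, wrong_actions, summary, keywords):
    """Hypothetical scorer for a remediation episode (names are illustrative).

    Applies the commented design rules: each wrong action costs -0.05
    (capped at -0.15 total) and a summary that mentions 3+ resolution
    keywords earns a bonus.
    """
    # Credit for executing the required steps in order (subsequence match).
    matched = 0
    for action in actions:
        if matched < len(correct_sequence) and action == correct_sequence[matched]:
            matched += 1
    base = matched / len(correct_sequence)

    # Wrong actions: -0.05 each, capped at -0.15 (rule 4 above).
    penalty = min(0.15, 0.05 * sum(1 for a in actions if a in wrong_actions))

    # Keyword bonus: summary must hit 3+ keywords (rule 5 above).
    hits = sum(1 for kw in keywords if kw in summary.lower())
    bonus = 0.1 if hits >= 3 else 0.0

    return max(0.0, min(1.0, base - penalty + bonus))
```

Under this sketch, executing half of a two-step sequence with no wrong actions scores 0.5, while four wrong actions alone bottom out at 0.0 because of the penalty cap.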
uv.lock CHANGED
The diff for this file is too large to render. See raw diff