Elliot89 committed
Commit 8151d99 · 1 Parent(s): a2c102b

Prepare project for push: update files

Files changed (9):
  1. README.md +155 -273
  2. graders.py +32 -4
  3. inference.py +189 -30
  4. openenv.yaml +9 -2
  5. requirements.txt +1 -1
  6. server/app.py +9 -1
  7. server/environment.py +1 -1
  8. tasks.py +36 -29
  9. uv.lock +3 -3
README.md CHANGED
@@ -18,362 +18,244 @@ tags:
18
 
19
  # ☁️ Cloud Incident Response — OpenEnv Environment
20
 
21
- An OpenEnv environment for training and evaluating AI agents on **cloud SRE incident response** — the real-world on-call workflow that engineers perform daily at every cloud company.
22
 
23
- Distinct from Kubernetes operations environments: this focuses on **cross-service cascading failures** in distributed microservice architectures — OOM kills from runaway analytics queries, BGP network partitions isolating availability zones, and credential rotation bugs pushing stale secrets to production services.
24
 
25
  ---
26
 
27
- ## OpenEnv Interface
28
 
29
- This environment implements the **full OpenEnv specification** with typed Pydantic models:
30
 
31
- | Method | Endpoint | Input | Returns |
32
- |---|---|---|---|
33
- | `POST` | `/reset` | `{"task_id": "...", "scenario_index": 0}` or `{}` | `Observation` |
34
- | `POST` | `/step` | `Action` JSON body | `{observation, reward, done, info}` |
35
- | `GET` | `/state` | | `EpisodeState` |
36
- | `GET` | `/health` | | `{"status": "ok"}` |
37
- | `GET` | `/tasks` | | Task list + action schemas |
38
- | `GET` | `/grader` | — | Score 0.0–1.0 with breakdown |
39
- | `POST` | `/baseline` | | Run inference.py, return scores |
40
-
41
- ### Typed Models
42
-
43
- ```python
44
- # Action — submitted by the agent
45
- Action {
46
- action_type: str, # e.g. "query_logs", "restart_service", "submit_severity"
47
- parameters: {
48
- service?: str, # Target service name
49
- severity?: str, # P1|P2|P3|P4 (for submit_severity)
50
- failure_mode?: str, # Root cause description (for submit_root_cause)
51
- summary?: str, # Resolution summary (for submit_resolution)
52
- flag?: str, # Feature flag name (for disable_feature_flag)
53
- runbook_action?: str, # Runbook step (for execute_runbook_step)
54
- target_version?: str, # Deploy version (for rollback_deploy)
55
- }
56
- }
57
-
58
- # Observation — returned to the agent
59
- Observation {
60
- episode_id: str, # Unique episode UUID
61
- task_id: str, # Active task
62
- scenario_id: str, # Current scenario (e.g. "AC-001")
63
- step_count: int, # Steps taken so far
64
- max_steps: int, # Budget (3, 10, or 15)
65
- incident_summary: str, # Plain-text incident description
66
- alert: dict, # Alert payload: title, symptoms, error_rate, revenue_impact
67
- available_actions: [str], # Valid action types for this task
68
- queried_data: dict, # All evidence gathered so far
69
- known_services: [str], # Valid service names for actions
70
- cumulative_reward: float, # Running reward total
71
- done: bool, # Episode complete flag
72
- feedback: str, # Per-step reward explanation
73
- last_action_error: str?, # Error from last action (null if OK)
74
- }
75
-
76
- # Reward — returned after each step
77
- Reward {
78
- score: float, # Step reward value
79
- value: float, # Alias for score (backward compatibility)
80
- reason: str, # Human-readable explanation
81
- cumulative: float, # Running total
82
- }
83
- ```
84
 
85
  ---
86
 
87
- ## Tasks (3 Difficulty Levels, 9 Scenarios)
88
 
89
- | Task ID | Difficulty | Max Steps | Scenarios | What the Agent Does |
90
- |---|---|---|---|---|
91
- | `alert_classification` | 🟢 Easy | 3 | 3 | Classify alert severity P1–P4 from metrics and symptoms |
92
- | `root_cause_analysis` | 🟡 Medium | 10 | 3 | Trace failure chain across 8 services to find root cause |
93
- | `remediation_planning` | 🔴 Hard | 15 | 3 | Diagnose + execute multi-step remediation + document resolution |
94
 
95
- ### Scenario Details
96
 
97
- | ID | Incident | Root Cause | Challenge |
98
- |---|---|---|---|
99
- | AC-001 | DB connection pool exhaustion | — | Clear P1: 78% errors, $12k/min |
100
- | AC-002 | CDN cache invalidation storm | — | Ambiguous P2: degraded but checkout works |
101
- | AC-003 | Recommendation engine errors | — | Trap P3: 45% errors but zero revenue impact |
102
- | RCA-001 | Postgres OOM crash loop | analytics-service (unbounded query) | Root cause NOT in alert, 8 services to investigate |
103
- | RCA-002 | Cross-AZ checkout failures | network-infra (BGP route withdrawal) | Network issue disguised as application failure |
104
- | RCA-003 | DB authentication failures | config-service (stale credential rotation) | Multiple red herring deploys on other services |
105
- | RP-001 | Full OOM incident | analytics-service | 6-step remediation sequence, wrong actions penalized |
106
- | RP-002 | Full BGP incident | network-infra | 4-step runbook + config rollback, 8 services |
107
- | RP-003 | Full credential incident | config-service | 7-step sequence, credential rotation + service restarts |
108
 
109
- ### Why This Is Genuinely Difficult
110
 
111
- - **Medium**: Root cause service is NEVER in the alert's `affected_services`. Agent must query logs on victim services, follow breadcrumbs that name the culprit, then investigate that service. 8 known services with red herring deploys.
112
- - **Hard**: Same diagnostic challenge PLUS must execute 4–7 remediation actions in logical order. Wrong actions (e.g. restarting a healthy service) carry 0.15 penalties. Resolution summary must reference specific services and actions.
113
 
114
- ### Baseline Scores
115
 
116
- | Model | Easy | Medium | Hard | Overall |
117
  |---|---|---|---|---|
118
- | `llama-3.1-8b-instant` | 1.0 | 0.65 | 0.70 | 0.78 |
119
- | `llama-3.3-70b-versatile` | 1.0 | 0.99 | 0.80 | 0.93 |
 
120
 
121
- 70B consistently outperforms 8B on medium/hard tasks, proving the environment differentiates model quality.
122
 
123
- ---
124
 
125
- ## Action Space
126
 
127
- ### 🔍 Diagnostic Actions (gather evidence)
128
 
 
129
  ```json
130
- {"action_type": "query_logs", "parameters": {"service": "postgres-db"}}
131
- {"action_type": "check_metrics", "parameters": {"service": "auth-service"}}
132
- {"action_type": "check_dependencies", "parameters": {"service": "api-gateway"}}
133
- {"action_type": "check_recent_deploys", "parameters": {"service": "analytics-service"}}
134
- {"action_type": "check_service_status", "parameters": {"service": "payment-service"}}
135
  ```
136
 
137
- ### 🔧 Remediation Actions (fix the incident)
138
-
139
  ```json
140
- {"action_type": "restart_service", "parameters": {"service": "postgres-db"}}
141
- {"action_type": "rollback_deploy", "parameters": {"service": "config-service", "target_version": "previous"}}
142
- {"action_type": "disable_feature_flag", "parameters": {"flag": "full_history_export"}}
143
- {"action_type": "execute_runbook_step", "parameters": {"runbook_action": "restore_bgp_routes"}}
144
- {"action_type": "scale_service", "parameters": {"service": "image-service", "replicas": 10}}
145
- {"action_type": "clear_cache", "parameters": {"service": "redis-session"}}
146
  ```
147
 
148
- ### 📝 Submission Actions (end episode)
149
-
150
  ```json
151
- {"action_type": "submit_severity", "parameters": {"severity": "P1", "service": "api-gateway"}}
152
- {"action_type": "submit_root_cause", "parameters": {"service": "analytics-service", "failure_mode": "unbounded query OOM killing postgres-db"}}
153
- {"action_type": "submit_resolution", "parameters": {"summary": "3+ sentence description of what failed, what you did, and current status"}}
154
  ```
155
 
156
  ---
157
 
158
- ## Reward Function
159
-
160
- Dense reward shaping provides signal over the **full trajectory** (not just binary end-of-episode):
161
-
162
- | Signal | Reward | Description |
163
- |---|---|---|
164
- | Query new service | +0.03 to +0.04 | First diagnostic action on a service |
165
- | Query new action type | +0.01 to +0.02 | Different diagnostic on already-queried service |
166
- | Repeat same query | −0.03 to −0.04 | Same (action, service) pair again |
167
- | Unknown service | −0.05 to −0.06 | Service not in known_services |
168
- | Correct remediation | +0.06 | Action matches correct remediation sequence |
169
- | Wrong remediation | −0.12 to −0.15 | Action in wrong_actions list (e.g. restarting healthy service) |
170
- | Correct submission type | +0.02 | Using the right submit action for the task |
171
- | Wrong submission type | −0.08 to −0.12 | e.g. submit_severity during remediation_planning |
172
- | Past halfway (non-submit) | −0.015 to −0.04 | Per-step efficiency penalty |
173
- | Timeout | −0.15 to −0.20 | No submission before max_steps |
174
- | Exact repeat action | −0.04 to −0.05 | Identical action+params as a previous step |
175
- | **Grader score** | **0.0–1.0** | **Added at terminal step** |
176
-
177
- ### Grading (deterministic, reproducible, 0.0–1.0)
178
-
179
- | Task | Scoring Logic |
180
- |---|---|
181
- | `alert_classification` | 1.0 exact match · 0.5 adjacent (P1↔P2) · 0.25 two-off · 0.0 wrong |
182
- | `root_cause_analysis` | 0.6 base (correct service + failure mode) + up to 0.4 efficiency bonus |
183
- | `remediation_planning` | 0.6 base + 0.3 efficiency (correct steps matched) − 0.15 penalty (wrong actions) + 0.1 summary quality |
184
-
185
- ---
186
-
187
- ## 🖥️ Interactive UI Walkthrough
188
-
189
- The Gradio UI at `/` provides a visual interface for human evaluation. Here's how to use it:
190
-
191
- ### 🟢 Easy Task: Alert Classification
192
-
193
- 1. **Select Task**: Choose `🟢 Easy — Alert Classification` from the Task dropdown
194
- 2. **Select Scenario**: Choose `Scenario 2` (the tricky P3 trap)
195
- 3. **Click** `🔄 Reset Environment`
196
- 4. **Read** the observation panel — recommendation-service errors at 45%
197
- 5. **Investigate**: Set Action Type to `🔍 check_metrics`, Service to `recommendation-service`, click `▶️ Execute Action`
198
- 6. **Read evidence** — "User impact: NONE", "Revenue: unchanged", "Checkout: 100%"
199
- 7. **Submit**: Set Action Type to `📝 submit_severity`, expand `📋 Parameters`, set Severity to `P3 Medium`, click `▶️ Execute Action`
200
- 8. **Grade**: Click `📊 Grade` — should show **1.0** for exact P3 match
201
-
202
- ### 🟡 Medium Task: Root Cause Analysis
203
-
204
- 1. **Select Task**: `🟡 Medium — Root Cause Analysis`, **Scenario**: `Scenario 0`
205
- 2. **Click** `🔄 Reset Environment`
206
- 3. **Read** the observation — postgres-db crash loop, multiple services down
207
- 4. **Query victim**: Action Type `🔍 query_logs`, Service `postgres-db`, click `▶️ Execute Action`
208
- 5. **Read evidence** — logs say *"query from analytics-service consuming all memory"*
209
- 6. **Follow breadcrumb**: Action Type `🔍 query_logs`, Service `analytics-service`, click `▶️ Execute Action`
210
- 7. **Read evidence** — "full_history_export job", "847M row scan", "no LIMIT"
211
- 8. **Confirm**: Action Type `🔍 check_recent_deploys`, Service `analytics-service`, click `▶️ Execute Action`
212
- 9. **Read evidence** — "Deploy 6h ago: cross-table JOIN without LIMIT clause"
213
- 10. **Submit**: Action Type `📝 submit_root_cause`, Service `analytics-service`, Failure Mode: `unbounded query OOM killing postgres-db`, click `▶️ Execute Action`
214
- 11. **Grade**: Click `📊 Grade` — should show **0.85–1.0**
215
-
216
- ### 🔴 Hard Task: Remediation Planning
217
-
218
- 1. **Select Task**: `🔴 Hard — Remediation Planning`, **Scenario**: `Scenario 0`
219
- 2. **Click** `🔄 Reset Environment`
220
- 3. **Diagnose**: `🔍 query_logs` on `postgres-db` → see "analytics-service" breadcrumb
221
- 4. **Confirm**: `🔍 query_logs` on `analytics-service` → see "full_history_export, no LIMIT"
222
- 5. **Fix Step 1**: `🔧 disable_feature_flag`, Flag: `full_history_export` → "job DISABLED"
223
- 6. **Fix Step 2**: `🔧 restart_service` on `analytics-service` → "restarted — idle"
224
- 7. **Fix Step 3**: `🔧 restart_service` on `postgres-db` → "accepting connections (12/500)"
225
- 8. **Fix Step 4**: `🔧 restart_service` on `auth-service` → "reconnected OK"
226
- 9. **Fix Step 5**: `🔧 restart_service` on `order-service` → "writes resuming"
227
- 10. **Verify**: `🔧 execute_runbook_step`, Runbook Action: `verify_db_health` → "healthy"
228
- 11. **Submit**: `📝 submit_resolution`, Summary: *"The analytics-service deployed a full_history_export job with an unbounded query that OOM-killed postgres-db. We disabled the full_history_export flag, restarted analytics-service, then restarted postgres-db, auth-service, and order-service. All services recovered and postgres-db is healthy."*
229
- 12. **Grade**: Click `📊 Grade` — should show **0.85–1.0**
230
-
231
- ### UI Controls Reference
232
-
233
- | Button | Purpose |
234
- |---|---|
235
- | `🔄 Reset Environment` | Start a new episode |
236
- | `▶️ Execute Action` | Run the selected action |
237
- | `📋 Parameters` | Expand to fill severity / failure_mode / summary / flag / runbook fields |
238
- | `📊 Grade` | See final grader score (0.0–1.0) after episode ends |
239
- | `📋 State` | Refresh the state panel |
240
-
241
- ### Common Mistakes & Penalties
242
 
243
- | Mistake | Penalty | Why |
244
  |---|---|---|
245
- | Wrong submission type (e.g. `submit_severity` in hard task) | −0.12 | Each task has ONE correct submission action |
246
- | Restarting a healthy service (e.g. `restart redis-session`) | −0.15 | Wrong remediation action |
247
- | Querying a service not in `known_services` | −0.06 | Invalid target |
248
- | Repeating the exact same action | −0.04 | Infinite loop detection |
249
- | Not submitting before max steps | −0.20 | Timeout penalty |
250
- | Using remediation actions in easy task | −0.08 | Not available for alert classification |
251
 
252
  ---
253
 
254
- ## API Usage
255
-
256
- ### Quick Test
257
-
258
- ```bash
259
- # Reset with defaults (alert_classification, scenario 0)
260
- curl -X POST http://localhost:7860/reset \
261
- -H "Content-Type: application/json" -d '{}'
262
-
263
- # Reset with specific task
264
- curl -X POST http://localhost:7860/reset \
265
- -H "Content-Type: application/json" \
266
- -d '{"task_id": "root_cause_analysis", "scenario_index": 1}'
267
-
268
- # Take a step
269
- curl -X POST http://localhost:7860/step \
270
- -H "Content-Type: application/json" \
271
- -d '{"action_type": "query_logs", "parameters": {"service": "postgres-db"}}'
272
-
273
- # Check state
274
- curl http://localhost:7860/state
275
-
276
- # Grade current episode
277
- curl http://localhost:7860/grader
278
- ```
279
 
280
- ### Full Episode Example (Python)
281
 
282
- ```python
283
- import requests
284
 
285
- BASE = "http://localhost:7860"
286
-
287
- # Start episode
288
- obs = requests.post(f"{BASE}/reset", json={
289
- "task_id": "alert_classification", "scenario_index": 0
290
- }).json()
291
 
292
- print(f"Incident: {obs['incident_summary']}")
293
- print(f"Services: {obs['known_services']}")
294
 
295
- # Investigate
296
- result = requests.post(f"{BASE}/step", json={
297
- "action_type": "check_metrics",
298
- "parameters": {"service": obs["known_services"][0]}
299
- }).json()
300
 
301
- print(f"Reward: {result['reward']['score']:+.3f}")
302
- print(f"Done: {result['done']}")
303
 
304
- # Submit
305
- result = requests.post(f"{BASE}/step", json={
306
- "action_type": "submit_severity",
307
- "parameters": {"severity": "P1", "service": obs["known_services"][0]}
308
- }).json()
309
 
310
- # Grade
311
- grade = requests.get(f"{BASE}/grader").json()
312
- print(f"Score: {grade['total']}")
313
- ```
314
 
315
  ---
316
 
317
- ## Setup
318
 
319
  ### Local Development
320
-
321
  ```bash
322
  pip install -r requirements.txt
323
  uvicorn server.app:app --host 0.0.0.0 --port 7860
324
  ```
325
 
326
  ### Docker
327
-
328
  ```bash
329
  docker build -t cloud-incident-env .
330
  docker run -p 7860:7860 cloud-incident-env
331
  ```
332
 
333
- ### Run Baseline Agent
334
-
335
  ```bash
336
- export API_BASE_URL="https://api.groq.com/openai/v1"
337
- export MODEL_NAME="llama-3.1-8b-instant"
338
- export HF_TOKEN="gsk_your_groq_key"
339
  python inference.py
340
  ```
341
 
342
  ---
343
 
344
- ## Project Structure
345
 
346
  ```
347
- ├── Dockerfile # Docker build for HF Spaces
 
348
  ├── README.md # This file
349
  ├── requirements.txt # Python dependencies
350
- ├── openenv.yaml # OpenEnv manifest (tasks, endpoints)
351
- ├── pyproject.toml # Project metadata
352
  ├── tasks.py # 9 scenarios across 3 difficulty levels
353
  ├── graders.py # Deterministic graders (0.0–1.0)
354
- ├── inference.py # Baseline LLM agent with fallback logic
355
  └── server/
356
  ├── __init__.py
357
  ├── app.py # FastAPI + Gradio endpoints
358
- ├── environment.py # Core step/reset/state logic + reward shaping
359
  └── models.py # Typed Pydantic models (Action, Observation, Reward)
360
  ```
361
 
362
  ---
363
 
364
- ## Design Decisions
365
 
366
- ### Why Cloud Incident Response?
367
-
368
- Every cloud company employs SREs who respond to production incidents under time pressure with incomplete information. This is a universal, high-value skill that AI agents should learn. The environment simulates the exact decision loop: triage → investigate → diagnose → remediate → document.
369
-
370
- ### Why These Specific Incidents?
371
 
372
- - **OOM kills** (RCA-001, RP-001): The most common database failure mode — a runaway query consumes all memory and crashes the DB, taking down every dependent service.
373
- - **BGP partitions** (RCA-002, RP-002): Network-layer failures that look like application failures — services appear "down" but are actually healthy, just unreachable.
374
- - **Credential rotation bugs** (RCA-003, RP-003): Configuration management failures that cause cascading auth failures — the DB is fine but clients have wrong passwords.
375
 
376
- ### Why Dense Rewards?
 
 
377
 
378
- Sparse rewards (only at episode end) provide no learning signal for RL agents. Our reward function gives feedback on **every step**: positive for useful investigation, negative for wasted actions, with the terminal grader score layered on top. This enables both RL training and LLM agent evaluation.
379
- ```
 
 
18
 
19
  # ☁️ Cloud Incident Response — OpenEnv Environment
20
 
21
+ An OpenEnv environment for training and evaluating AI agents on **cloud SRE incident response** — the real-world on-call workflow that engineers at every cloud company perform daily.
22
 
23
+ Distinct from Kubernetes operations environments: this focuses on **cross-service cascading failures** in distributed microservice architectures — connection pool exhaustion, CDN cache storms, OOM kills, credential rotation failures, and BGP network partitions.
24
+
25
+ ## Authors
26
+
27
+ - **Einstein** — Environment Design & Grader Implementation
28
+ - **Sidra** — Scenario Design & Testing
29
 
30
  ---
31
 
32
+ ## 🎯 Why This Environment
33
 
34
+ Every cloud company employs SREs who respond to production incidents under time pressure with incomplete information. This environment simulates the exact decision loop:
35
 
36
+ | Phase | What the Agent Does |
37
+ |---|---|
38
+ | **Triage** | Read alert, assess blast radius, classify severity (P1–P4) |
39
+ | **Investigate** | Query logs, metrics, dependencies, recent deploys |
40
+ | **Diagnose** | Correlate signals across services to find root cause |
41
+ | **Remediate** | Execute correct runbook steps in the right sequence |
42
+ | **Document** | Submit resolution summary for post-incident review |
43
+
44
+ Agents trained here learn the same skills a human SRE develops: service dependency traversal, log correlation, cascading failure analysis, and targeted remediation.
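The phases above can be sketched as the sequence of `/step` payloads an agent emits. This is a minimal illustration, not the shipped agent: the service names are hypothetical, triage is folded into the final submission, and the action types are the ones listed under Action Space.

```python
def incident_playbook(root_cause: str, victim: str) -> list[dict]:
    """One plausible investigate -> diagnose -> remediate -> document
    sequence, expressed as /step payloads. Service names are illustrative."""
    return [
        # Investigate: query the victim service named in the alert
        {"action_type": "query_logs", "parameters": {"service": victim}},
        # Diagnose: follow the log breadcrumb to the suspected culprit
        {"action_type": "check_recent_deploys", "parameters": {"service": root_cause}},
        # Remediate: fix the culprit, then recover the victim
        {"action_type": "rollback_deploy", "parameters": {"service": root_cause}},
        {"action_type": "restart_service", "parameters": {"service": victim}},
        # Document: submitting ends the episode
        {"action_type": "submit_resolution", "parameters": {
            "summary": f"{root_cause} shipped a bad deploy that broke {victim}; "
                       f"rolled back {root_cause}, restarted {victim}, all healthy."}},
    ]
```

Real scenarios need more investigation steps than this, but every episode follows the same shape: evidence gathering, then targeted remediation, then exactly one submission action.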
45
 
46
  ---
47
 
48
+ ## 📊 Baseline Scores
49
 
50
+ Using `Llama 3.1 8B Instruct` · deterministic (`temperature=0.0`) · fully reproducible
51
 
52
+ | Task | Difficulty | S0 | S1 | S2 | Average |
53
+ |---|---|---|---|---|---|
54
+ | `alert_classification` | 🟢 Easy | 1.00 | 1.00 | 1.00 | **1.00** |
55
+ | `root_cause_analysis` | 🟡 Medium | 1.00 | 0.20 | 1.00 | **0.73** |
56
+ | `remediation_planning` | 🔴 Hard | 0.60 | 0.45 | 0.59 | **0.55** |
57
+ | **Overall** | | | | | **0.76** |
58
 
59
+ ### Score Interpretation
60
 
61
+ ```
62
+ Easy 1.00 ████████████████████ Clear metrics → straightforward classification
63
+ Medium 0.73 ██████████████▌ Root cause hidden — model fails on BGP scenario (S1=0.20)
64
+ Hard 0.55 ███████████ Multi-phase execution with wrong-action penalties
65
+ ```
66
 
67
+ - **Easy 1.00:** Alert metrics (error rate, revenue impact) directly indicate severity. An 8B model reliably classifies P1/P2/P3 with 2 diagnostic queries.
68
+ - **Medium → 0.73:** Root cause service is NOT in the alert. Model must investigate beyond the blast radius. Succeeds on OOM and credential scenarios but fails on BGP network partition (S1=0.20) where no victim log names the root cause.
69
+ - **Hard → 0.55:** Same diagnostic challenge as medium PLUS multi-step remediation sequence, wrong-action penalties (−0.10 each), and documentation quality scoring. Model wastes steps on repeated status checks and sometimes executes counterproductive remediations.
70
+
71
+ ---
72
 
73
+ ## 🏗️ Tasks
74
 
75
+ | Task ID | Difficulty | Max Steps | Objective | Submission Action |
76
  |---|---|---|---|---|
77
+ | `alert_classification` | 🟢 Easy | 3 | Classify alert severity (P1–P4) | `submit_severity` |
78
+ | `root_cause_analysis` | 🟡 Medium | 10 | Find root cause service + failure mode | `submit_root_cause` |
79
+ | `remediation_planning` | 🔴 Hard | 15 | Diagnose + remediate + document | `submit_resolution` |
80
 
81
+ ### Scenarios (3 per task = 9 total episodes)
82
 
83
+ | ID | Incident Type | Root Cause | Why It's Hard |
84
+ |---|---|---|---|
85
+ | AC-001 | DB connection pool exhaustion | — | Clear P1: 78% errors, $12k/min revenue loss |
86
+ | AC-002 | CDN cache invalidation storm | — | Ambiguous P2: degraded but checkout works |
87
+ | AC-003 | Recommendation service errors | — | Trap P3: 45% errors but zero revenue impact |
88
+ | RCA-001 | Postgres OOM kill | analytics-service | Must correlate "analytics export query" in DB logs |
89
+ | RCA-002 | BGP network partition | network-infra | No victim log names network-infra — hardest scenario |
90
+ | RCA-003 | Credential rotation bug | config-service | Must trace "secrets rotation" hint to config-service |
91
+ | RP-001 | Full OOM remediation | analytics-service | 6-step sequence: disable job → restart chain |
92
+ | RP-002 | Full BGP remediation | network-infra | 4-step sequence: restore routes → rollback → verify |
93
+ | RP-003 | Full credential fix | config-service | 7-step sequence: rollback → rotate → restart → verify |
94
 
95
+ ---
96
 
97
+ ## 🎮 Action Space
98
 
99
+ ### Diagnostic Actions (gather evidence)
100
  ```json
101
+ {"action_type": "query_logs", "parameters": {"service": "<name>"}}
102
+ {"action_type": "check_metrics", "parameters": {"service": "<name>"}}
103
+ {"action_type": "check_dependencies", "parameters": {"service": "<name>"}}
104
+ {"action_type": "check_recent_deploys", "parameters": {"service": "<name>"}}
105
+ {"action_type": "check_service_status", "parameters": {"service": "<name>"}}
106
  ```
107
 
108
+ ### Remediation Actions (fix the incident)
 
109
  ```json
110
+ {"action_type": "restart_service", "parameters": {"service": "<name>"}}
111
+ {"action_type": "rollback_deploy", "parameters": {"service": "<name>"}}
112
+ {"action_type": "scale_service", "parameters": {"service": "<name>", "replicas": 10}}
113
+ {"action_type": "disable_feature_flag", "parameters": {"flag": "<flag_name>"}}
114
+ {"action_type": "clear_cache", "parameters": {"service": "<name>"}}
115
+ {"action_type": "execute_runbook_step", "parameters": {"runbook_action": "<action>"}}
116
  ```
117
 
118
+ ### Submission Actions (end the episode)
 
119
  ```json
120
+ {"action_type": "submit_severity", "parameters": {"severity": "P1|P2|P3|P4", "service": "<name>"}}
121
+ {"action_type": "submit_root_cause", "parameters": {"service": "<name>", "failure_mode": "<description>"}}
122
+ {"action_type": "submit_resolution", "parameters": {"summary": "<3+ sentence summary>"}}
123
  ```
124
 
125
  ---
126
 
127
+ ## 👁️ Observation Space
128
 
129
+ | Field | Type | Description |
130
  |---|---|---|
131
+ | `episode_id` | string | Unique episode UUID |
132
+ | `task_id` | string | Active task identifier |
133
+ | `scenario_id` | string | Current scenario (e.g., `RCA-001`) |
134
+ | `step_count` / `max_steps` | int | Progress through episode |
135
+ | `incident_summary` | string | Plain-text incident description (no root cause hints) |
136
+ | `alert` | dict | Alert payload with severity, symptoms, affected services |
137
+ | `available_actions` | list | Valid action types for this task |
138
+ | `queried_data` | dict | All evidence gathered so far |
139
+ | `known_services` | list | Exact service names valid for actions |
140
+ | `cumulative_reward` | float | Running reward total |
141
+ | `done` | bool | Episode terminal flag |
142
+ | `feedback` | string | Per-step feedback explaining reward |
143
+ | `last_action_error` | string? | Error message if last action was invalid |
144
 
145
  ---
146
 
147
+ ## 💰 Reward Function
148
 
149
+ Dense reward shaping throughout the trajectory — not just terminal scoring.
150
 
151
+ ### Per-Step Rewards
 
152
 
153
+ | Event | Easy | Medium | Hard |
154
+ |---|---|---|---|
155
+ | Query new service (first time) | +0.04 | +0.04 | +0.03 |
156
+ | Query new action on known service | +0.02 | +0.02 | +0.01 |
157
+ | Repeat exact same query | −0.03 | −0.04 | −0.03 |
158
+ | Query unknown service | −0.06 | −0.06 | −0.05 |
159
+ | Correct remediation action | — | +0.06 | +0.06 |
160
+ | Wrong remediation action | −0.08 | −0.10 | −0.15 |
161
+ | Step past halfway (non-submit) | −0.04 | −0.02 | −0.02 |
162
+ | Timeout without submission | −0.15 | −0.15 | −0.20 |
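The Hard column of this table can be restated as a lookup; this is an illustrative transcription only — the authoritative shaping lives in `server/environment.py` and may combine signals differently, and the event names here are invented labels.

```python
# Hard-task values transcribed from the per-step reward table above.
HARD_STEP_REWARDS = {
    "query_new_service": 0.03,
    "query_new_action_known_service": 0.01,
    "repeat_exact_query": -0.03,
    "query_unknown_service": -0.05,
    "correct_remediation": 0.06,
    "wrong_remediation": -0.15,
    "past_halfway_non_submit": -0.02,
    "timeout_without_submission": -0.20,
}

def hard_step_reward(event: str) -> float:
    """Look up the shaped reward for one step event on the Hard task."""
    return HARD_STEP_REWARDS[event]
```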
163
 
164
+ ### Grader Scoring (terminal, deterministic)
 
165
 
166
+ | Task | Scoring Logic |
167
+ |---|---|
168
+ | `alert_classification` | 1.0 exact · 0.5 adjacent · 0.25 two-off · 0.0 wrong |
169
+ | `root_cause_analysis` | Up to 0.6 base (service + failure mode) + up to 0.4 efficiency bonus. Wrong service: 0.05–0.20 based on investigation effort |
170
+ | `remediation_planning` | Scaled base (0.10–0.50 by investigation depth) + 0.30 efficiency − up to 0.30 wrong-action penalty + 0.10 summary quality |
171
 
172
+ ---
 
173
 
174
+ ## 🔌 API Endpoints
 
175
 
176
+ | Method | Path | Description |
177
+ |---|---|---|
178
+ | `GET` | `/` | Gradio UI — interactive environment demo |
179
+ | `GET` | `/health` | `{"status":"ok","version":"0.1.0"}` |
180
+ | `POST` | `/reset` | Start new episode (accepts `task_id`, `scenario_index`) |
181
+ | `POST` | `/step` | Submit action → returns observation, reward, done, info |
182
+ | `GET` | `/state` | Full current episode state with action history |
183
+ | `GET` | `/tasks` | All tasks with action schemas |
184
+ | `GET` | `/grader` | Score current episode (0.0–1.0) with breakdown |
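A minimal Python client for these endpoints — a sketch, assuming a local server on port 7860 as in Setup. `/reset` is called with query parameters to match the curl example in this README, and `step_payload` is a hypothetical helper, not part of the environment package.

```python
import requests

BASE = "http://localhost:7860"

def step_payload(action_type: str, **parameters) -> dict:
    """Build the JSON body that POST /step expects."""
    return {"action_type": action_type, "parameters": parameters}

def reset(task_id: str, scenario_index: int = 0) -> dict:
    """Start a new episode and return the initial observation."""
    r = requests.post(f"{BASE}/reset",
                      params={"task_id": task_id, "scenario_index": scenario_index})
    r.raise_for_status()
    return r.json()

def step(action_type: str, **parameters) -> dict:
    """Submit one action; returns observation, reward, done, info."""
    r = requests.post(f"{BASE}/step", json=step_payload(action_type, **parameters))
    r.raise_for_status()
    return r.json()

def grade() -> dict:
    """Score the current episode (0.0–1.0) with breakdown."""
    return requests.get(f"{BASE}/grader").json()
```

With the server running, an episode looks like: `obs = reset("root_cause_analysis")`, then `step("query_logs", service=obs["known_services"][0])` until a submission action, then `grade()`.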
185
 
186
  ---
187
 
188
+ ## 🚀 Setup & Usage
189
 
190
  ### Local Development
 
191
  ```bash
192
  pip install -r requirements.txt
193
  uvicorn server.app:app --host 0.0.0.0 --port 7860
194
  ```
195
 
196
  ### Docker
 
197
  ```bash
198
  docker build -t cloud-incident-env .
199
  docker run -p 7860:7860 cloud-incident-env
200
  ```
201
 
202
+ ### Run Baseline Inference
 
203
  ```bash
204
+ export API_BASE_URL="https://router.huggingface.co/v1"
205
+ export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
206
+ export HF_TOKEN="your_token"
207
  python inference.py
208
  ```
209
 
210
+ ### Quick API Test
211
+ ```bash
212
+ # Reset
213
+ curl -X POST "http://localhost:7860/reset?task_id=alert_classification&scenario_index=0"
214
+
215
+ # Step
216
+ curl -X POST http://localhost:7860/step \
217
+ -H "Content-Type: application/json" \
218
+ -d '{"action_type":"query_logs","parameters":{"service":"api-gateway"}}'
219
+
220
+ # Grade
221
+ curl http://localhost:7860/grader
222
+ ```
223
+
224
  ---
225
 
226
+ ## 📁 Project Structure
227
 
228
  ```
229
+ .
230
+ ├── Dockerfile # Container build
231
  ├── README.md # This file
232
  ├── requirements.txt # Python dependencies
233
+ ├── openenv.yaml # OpenEnv metadata + task definitions
234
+ ├── inference.py # Baseline agent (OpenAI client + smart fallback)
235
  ├── tasks.py # 9 scenarios across 3 difficulty levels
236
  ├── graders.py # Deterministic graders (0.0–1.0)
 
237
  └── server/
238
  ├── __init__.py
239
  ├── app.py # FastAPI + Gradio endpoints
240
+ ├── environment.py # Core step()/reset()/state() logic
241
  └── models.py # Typed Pydantic models (Action, Observation, Reward)
242
  ```
243
 
244
  ---
245
 
246
+ ## Validation
247
 
248
+ ```bash
249
+ # OpenEnv spec validation
250
+ openenv validate # [OK] Ready for multi-mode deployment
 
 
251
 
252
+ # Docker build
253
+ docker build -t cloud-incident-env . # builds successfully
 
254
 
255
+ # Health check
256
+ curl http://localhost:7860/health # → {"status":"ok","version":"0.1.0"}
257
+ ```
258
 
259
+ ## Team
260
+ - **Einstein** — [@MrEinsteinE](https://github.com/MrEinsteinE)
261
+ - **Sidra** — [@sidraaiman](https://github.com/sidraaiman)
graders.py CHANGED
@@ -182,8 +182,20 @@ def _grade_root_cause_analysis(state: dict, scenario: dict) -> dict:
182
  elif svc_match:
183
  base, base_fb = 0.35, "Correct service only — failure mode unclear"
184
  else:
185
- base, base_fb = 0.10, (
186
- f"Wrong service: '{sub_svc}' (correct: '{correct_svc}') — partial credit"
187
  )
188
 
189
  efficiency = 0.0
@@ -279,7 +291,22 @@ def _grade_remediation_planning(state: dict, scenario: dict) -> dict:
279
  "feedback": "No resolution submitted or no investigation — score 0.0",
280
  }
281
 
282
- base = 0.6
283
 
284
  executed = set()
285
  for a in history:
@@ -325,7 +352,8 @@ def _grade_remediation_planning(state: dict, scenario: dict) -> dict:
325
  in wrong_map
326
  )
327
  )
328
- penalty = round(min(0.15, wrong_count * 0.05), 4)
 
329
 
330
  sl = summary.lower()
331
  hits = sum(1 for kw in keywords if kw in sl)
 
182
  elif svc_match:
183
  base, base_fb = 0.35, "Correct service only — failure mode unclear"
184
  else:
185
+ # Give MORE partial credit for investigation effort even with wrong answer
186
+ pre_submit_diag = [
187
+ a for a in history[:sub_step]
188
+ if a.get("action_type") in diag_types
189
+ ]
190
+ investigated = len({
191
+ a.get("parameters", {}).get("service", "").lower()
192
+ for a in pre_submit_diag
193
+ } - {""})
194
+ # 0.05 base + up to 0.15 for investigating 3+ services
195
+ wrong_base = min(0.20, 0.05 + investigated * 0.05)
196
+ base, base_fb = wrong_base, (
197
+ f"Wrong service: '{sub_svc}' (correct: '{correct_svc}') — "
198
+ f"investigated {investigated} services"
199
  )
200
 
201
  efficiency = 0.0
 
291
  "feedback": "No resolution submitted or no investigation — score 0.0",
292
  }
293
 
294
+ # Base scales with investigation depth — not a free 0.60
295
+ diag_count = sum(
296
+ 1 for a in history if a.get("action_type") in {
297
+ "query_logs", "check_metrics", "check_dependencies",
298
+ "check_recent_deploys", "check_service_status",
299
+ }
300
+ )
301
+ rem_count = sum(
302
+ 1 for a in history if a.get("action_type") in {
303
+ "restart_service", "rollback_deploy", "scale_service",
304
+ "disable_feature_flag", "clear_cache", "execute_runbook_step",
305
+ }
306
+ )
307
+ diag_credit = min(0.20, diag_count * 0.05)
308
+ rem_credit = min(0.20, rem_count * 0.05)
309
+ base = round(0.10 + diag_credit + rem_credit, 4)
310
 
311
  executed = set()
312
  for a in history:
 
352
  in wrong_map
353
  )
354
  )
355
+ penalty = round(min(0.30, wrong_count * 0.10), 4)
356
+
357
 
358
  sl = summary.lower()
359
  hits = sum(1 for kw in keywords if kw in sl)
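The grader changes above replace flat scores with effort-scaled credit. A standalone sketch of the new credit math (the helper names `wrong_answer_credit` and `remediation_base` are mine, not from graders.py; the constants mirror the diff):

```python
def wrong_answer_credit(services_investigated: int) -> float:
    """Credit for a wrong root-cause answer: 0.05 floor plus 0.05
    per distinct service investigated, capped at 0.20."""
    return min(0.20, 0.05 + services_investigated * 0.05)


def remediation_base(diag_count: int, rem_count: int) -> float:
    """Remediation base: 0.10 floor, plus up to 0.20 each for
    diagnostic and remediation actions (0.05 per action)."""
    diag_credit = min(0.20, diag_count * 0.05)
    rem_credit = min(0.20, rem_count * 0.05)
    return round(0.10 + diag_credit + rem_credit, 4)


print(wrong_answer_credit(0))   # 0.05 — submitted blind, no investigation
print(wrong_answer_credit(10))  # 0.2  — cap holds regardless of effort
print(remediation_base(0, 0))   # 0.1  — no longer a free 0.60
print(remediation_base(4, 2))   # 0.4
```

The design intent reads as reward shaping: a blind wrong submission now earns less than one preceded by investigation, and the old flat `base = 0.6` becomes a floor of 0.10 that must be earned back through actions.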
inference.py CHANGED
@@ -13,7 +13,13 @@ import sys
 import time
 
 import requests
+import time as _time
+_START = _time.time()
+_MAX_RUNTIME = 1080
 
+def _check_timeout():
+    if _time.time() - _START > _MAX_RUNTIME:
+        raise RuntimeError("Approaching 20min limit — stopping early")
 try:
     from dotenv import load_dotenv
     load_dotenv()
@@ -23,10 +29,10 @@ except ImportError:
 # ── Config ──────────────────────────────────────────────────────────────────
 API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.groq.com/openai/v1")
 MODEL_NAME = os.environ.get("MODEL_NAME", "llama-3.1-8b-instant")
-HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("OPENAI_API_KEY") or ""
+API_KEY = os.environ.get("HF_TOKEN") or os.environ.get("API_KEY") or ""
 ENV_BASE_URL = os.environ.get("ENV_BASE_URL", "http://localhost:7860")
 
-if not HF_TOKEN:
+if not API_KEY:
     print("[WARN] No API key set — LLM calls will fail.", file=sys.stderr)
 
 _session = requests.Session()
@@ -37,7 +43,7 @@ def _get_client():
     global _client
     if _client is None:
         from openai import OpenAI
-        _client = OpenAI(api_key=HF_TOKEN, base_url=API_BASE_URL)
+        _client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
     return _client
 
 
@@ -370,7 +376,7 @@ def _should_override(
     return False
 
 
-def _llm_call_with_retry(messages: list, max_retries: int = 2) -> str:
+def _llm_call_with_retry(messages: list, max_retries: int = 1) -> str:
    """Call LLM with retry on rate limit errors."""
    for attempt in range(max_retries + 1):
        try:
@@ -387,7 +393,7 @@ def _llm_call_with_retry(messages: list, max_retries: int = 2) -> str:
            if "rate_limit" in err_str or "429" in err_str:
                if attempt < max_retries:
                    # Parse wait time from error or use default
-                    wait = 10 * (attempt + 1)
+                    wait = 5 * (attempt + 1)
                    print(f"  [RATE LIMIT] waiting {wait}s (attempt {attempt + 1})",
                          file=sys.stderr)
                    time.sleep(wait)
@@ -399,6 +405,12 @@ def _llm_call_with_retry(messages: list, max_retries: int = 2) -> str:
 
 
 def _run_episode(task_id: str, scenario_index: int) -> float:
+    if _time.time() - _START > _MAX_RUNTIME:
+        print(f"  [TIMEOUT] Approaching 20min limit — skipping {task_id} s{scenario_index}",
+              file=sys.stderr)
+        return 0.0
+    _check_timeout()
+
     r = _session.post(
         f"{ENV_BASE_URL}/reset",
         params={"task_id": task_id, "scenario_index": scenario_index},
@@ -472,44 +484,191 @@ def _run_episode(task_id: str, scenario_index: int) -> float:
 
 def main():
     runs = [
-        ("alert_classification", 0),
-        ("alert_classification", 1),
-        ("alert_classification", 2),
-        ("root_cause_analysis", 0),
-        ("root_cause_analysis", 1),
-        ("root_cause_analysis", 2),
-        ("remediation_planning", 0),
-        ("remediation_planning", 1),
-        ("remediation_planning", 2),
+        ("alert_classification", 0),
+        ("alert_classification", 1),
+        ("alert_classification", 2),
+        ("root_cause_analysis", 0),
+        ("root_cause_analysis", 1),
+        ("root_cause_analysis", 2),
+        ("remediation_planning", 0),
+        ("remediation_planning", 1),
+        ("remediation_planning", 2),
     ]
 
-    results: dict[str, list[float]] = {}
+    _DIFFICULTY = {
+        "alert_classification": "🟢 Easy",
+        "root_cause_analysis": "🟡 Medium",
+        "remediation_planning": "🔴 Hard",
+    }
+
+    _MAX_STEPS = {
+        "alert_classification": 3,
+        "root_cause_analysis": 10,
+        "remediation_planning": 15,
+    }
+
+    results: dict[str, list[dict]] = {}
 
-    print(f"{'Task':<36} {'S':>2} {'Score':>7}")
-    print("-" * 50)
+    print()
+    print("=" * 100)
+    print("  ☁️  CLOUD INCIDENT RESPONSE — BASELINE INFERENCE")
+    print("=" * 100)
+    print(f"  Model:    {MODEL_NAME}")
+    print(f"  Endpoint: {API_BASE_URL}")
+    print("=" * 100)
+    print()
+
+    # Table header
+    print(f"{'Task':<24} {'Difficulty':<12} {'Scenario':>8} {'Steps':>10} {'Actions':>10} {'Reward':>10} {'Score':>10}")
+    print("─" * 100)
 
     for task_id, scenario_index in runs:
         try:
-            score = _run_episode(task_id, scenario_index)
+            score, steps_used, actions_taken, cumulative_reward = _run_episode_detailed(task_id, scenario_index)
         except Exception as e:
-            print(f"  [ERROR] {task_id} s{scenario_index}: {e}",
-                  file=sys.stderr)
-            score = 0.0
+            print(f"  [ERROR] {task_id} scenario {scenario_index}: {e}", file=sys.stderr)
+            score, steps_used, actions_taken, cumulative_reward = 0.0, 0, 0, 0.0
 
-        label = f"{task_id} [s{scenario_index}]"
-        print(f"{label:<36} {scenario_index:>2} {score:>7.4f}")
-        results.setdefault(task_id, []).append(score)
+        difficulty = _DIFFICULTY.get(task_id, "?")
+        max_steps = _MAX_STEPS.get(task_id, "?")
+        steps_display = f"{steps_used}/{max_steps}"
 
-    print("-" * 50)
-    summary = {t: round(sum(v) / len(v), 4) for t, v in results.items()}
-    summary["overall"] = round(sum(summary.values()) / len(summary), 4)
+        print(
+            f"{task_id:<24} {difficulty:<12} {scenario_index:>8} "
+            f"{steps_display:>10} {actions_taken:>10} {cumulative_reward:>+10.4f} {score:>10.4f}"
+        )
 
-    print("\nScore Summary:")
-    for k, v in summary.items():
-        print(f"  {k:<36}: {v:.4f}")
+        results.setdefault(task_id, []).append({
+            "scenario": scenario_index,
+            "score": score,
+            "steps": steps_used,
+            "actions": actions_taken,
+            "reward": cumulative_reward,
+        })
+
+    print("─" * 100)
+    print()
+
+    # Summary table
+    print("=" * 100)
+    print("  📊 SUMMARY BY TASK")
+    print("=" * 100)
+    print(f"{'Task':<24} {'Difficulty':<12} {'Avg Score':>10} {'Avg Steps':>10} {'Scenarios':>20}")
+    print("─" * 100)
+
+    summary = {}
+    for task_id in ["alert_classification", "root_cause_analysis", "remediation_planning"]:
+        if task_id not in results:
+            continue
+        data = results[task_id]
+        avg_score = sum(d["score"] for d in data) / len(data)
+        avg_steps = sum(d["steps"] for d in data) / len(data)
+        scenario_scores = " | ".join(f'{d["score"]:.2f}' for d in data)
+        difficulty = _DIFFICULTY.get(task_id, "?")
 
+        print(f"{task_id:<24} {difficulty:<12} {avg_score:>10.4f} {avg_steps:>10.1f} {scenario_scores:>20}")
+        summary[task_id] = round(avg_score, 4)
+
+    summary["overall"] = round(sum(summary.values()) / len(summary), 4)
+
+    print("─" * 100)
+    print(f"{'OVERALL':<24} {'':12} {summary['overall']:>10.4f}")
+    print("=" * 100)
+    print()
+
+    # Difficulty progression check
+    easy = summary.get("alert_classification", 0)
+    med = summary.get("root_cause_analysis", 0)
+    hard = summary.get("remediation_planning", 0)
+
+    if easy > med > hard:
+        print("  ✅ Difficulty Progression: Easy (%.2f) > Medium (%.2f) > Hard (%.2f)" % (easy, med, hard))
+    elif easy > med and easy > hard:
+        print("  ⚠️ Difficulty Progression: Easy highest, Medium ≈ Hard")
+    else:
+        print("  ❌ Difficulty Progression: Unexpected order")
+
+    print()
     print(json.dumps(summary))
 
 
+def _run_episode_detailed(task_id: str, scenario_index: int) -> tuple[float, int, int, float]:
+    """Run episode and return (score, steps_used, actions_taken, cumulative_reward)."""
+    r = _session.post(
+        f"{ENV_BASE_URL}/reset",
+        params={"task_id": task_id, "scenario_index": scenario_index},
+        timeout=30,
+    )
+    r.raise_for_status()
+    obs = r.json()
+
+    messages = [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": _first_obs_msg(obs)},
+    ]
+
+    prev_queried: dict = {}
+    max_steps = obs.get("max_steps", 10)
+    actions_taken = 0
+    cumulative_reward = 0.0
+
+    for step_i in range(max_steps):
+        current_step = step_i + 1
+
+        raw = _llm_call_with_retry(messages)
+        messages.append({"role": "assistant", "content": raw or "{}"})
+
+        action = None
+        try:
+            if raw.strip():
+                action = _parse(raw)
+        except Exception:
+            pass
+
+        if action is None:
+            action = _smart_fallback(task_id, obs, current_step, max_steps)
+            print(f"  [FALLBACK] step {current_step}: {action.get('action_type')}", file=sys.stderr)
+        elif _should_override(task_id, action, obs, current_step, max_steps):
+            old_at = action.get("action_type")
+            action = _smart_fallback(task_id, obs, current_step, max_steps)
+            print(f"  [OVERRIDE] step {current_step}: {old_at} -> {action.get('action_type')}", file=sys.stderr)
+
+        sr = _session.post(f"{ENV_BASE_URL}/step", json=action, timeout=30)
+        sr.raise_for_status()
+        result = sr.json()
+        new_obs = result["observation"]
+
+        actions_taken += 1
+        step_reward = result['reward']['value']
+        cumulative_reward = result['reward'].get('cumulative', cumulative_reward + step_reward)
+
+        # Step detail output
+        print(
+            f"    step {current_step:>2}: {action.get('action_type'):<28} "
+            f"reward={step_reward:+.3f} done={result['done']}"
+        )
+
+        if result.get("done"):
+            break
+
+        step_msg = _step_msg(new_obs, prev_queried)
+        messages.append({"role": "user", "content": step_msg})
+        prev_queried = {
+            k: dict(v)
+            for k, v in new_obs.get("queried_data", {}).items()
+            if isinstance(v, dict)
+        }
+        obs = new_obs
+
+        if len(messages) > 20:
+            messages = messages[:2] + messages[-16:]
+
+    g = _session.get(f"{ENV_BASE_URL}/grader", timeout=30)
+    g.raise_for_status()
+    score = g.json().get("total", 0.0)
+
+    return score, current_step, actions_taken, cumulative_reward
+
+
 if __name__ == "__main__":
     main()
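The retry change above halves the linear backoff (`wait = 5 * (attempt + 1)`) and drops to a single retry, trading resilience for staying inside the new 18-minute runtime budget. The pattern in isolation, as a rough sketch (the `call_with_retry` helper, `flaky_call`, and the no-op sleep are illustrative, not inference.py's actual API):

```python
import time


def call_with_retry(fn, max_retries: int = 1, base_wait: float = 5.0, sleep=time.sleep):
    """Retry fn on rate-limit errors with linear backoff: 5s, then 10s, ..."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as e:
            # Only rate-limit errors are retryable; anything else re-raises.
            if "429" in str(e) and attempt < max_retries:
                sleep(base_wait * (attempt + 1))
                continue
            raise


# Illustrative: fails once with a 429, then succeeds on the retry.
calls = {"n": 0}

def flaky_call():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("429 rate_limit exceeded")
    return "ok"

print(call_with_retry(flaky_call, sleep=lambda s: None))  # ok
```

With `max_retries=1`, a request that is rate-limited twice in a row still fails, which is presumably acceptable here because the episode loop treats a failed call like an empty LLM reply and falls back to a heuristic action.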
openenv.yaml CHANGED
@@ -8,7 +8,11 @@ description: >
   across distributed systems. An AI agent classifies alert severity, performs
   root cause analysis through log/metric/dependency queries, and executes
   remediation sequences to resolve production incidents end-to-end.
-author: Einstein_Sidra
+authors:
+  - name: "Einstein"
+    github: "MrEinsteinE"
+  - name: "Sidra"
+    github: "sidraaiman"
 license: MIT
 tags:
   - openenv
@@ -59,4 +63,7 @@ endpoints:
   state: "GET /state"
   tasks: "GET /tasks"
   grader: "GET /grader"
   baseline: "POST /baseline"
+
+repo: "https://github.com/MrEinsteinE/cloud-incident-response-openenv"
+space: "https://huggingface.co/spaces/Elliot89/cloud-incident-response"
requirements.txt CHANGED
@@ -5,5 +5,5 @@ requests>=2.31.0
 openai>=1.58.0
 httpx>=0.27.0,<0.29.0
 python-dotenv>=1.0.0
-gradio>=4.0.0,<6.0.0
+gradio>=4.44.0,<5.0.0
 openenv-core>=0.2.0
server/app.py CHANGED
@@ -254,7 +254,15 @@ def baseline():
     except Exception:
         return {"raw_output": result.stdout[-3000:]}
 
-
+@app.get("/status")
+def root_status():
+    """Root health check — returns JSON."""
+    return {
+        "status": "running",
+        "name": "cloud-incident-response",
+        "version": "0.1.0",
+        "tasks": list(ALL_TASKS.keys()),
+    }
 # ── Gradio UI ─────────────────────────────────────────────────────────────────
 
 import gradio as gr
server/environment.py CHANGED
@@ -250,7 +250,7 @@ class IncidentEnvironment:
         svc = (params.service or "").lower().strip()
         flag = (params.flag or "").lower().strip()
         runbook = (params.runbook_action or "").lower().strip()
-        target = (params.target or "").lower().strip()
+        target = (params.target or params.target_version or "").lower().strip()
 
         if not (svc or flag or runbook or target):
             r += rt["rem_no_target"]
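The one-line change above lets `target_version` stand in when `target` is missing, so a rollback action that only names a version is no longer penalized as targetless. Python's `or` chain returns the first truthy operand, so `None` and `""` both fall through. A minimal sketch of that behavior (the `Params` dataclass and `resolve_target` helper are stand-ins for the environment's real action-parameter model):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Params:
    target: Optional[str] = None
    target_version: Optional[str] = None


def resolve_target(params: Params) -> str:
    # None/"" target falls through to target_version; both empty -> ""
    return (params.target or params.target_version or "").lower().strip()


print(resolve_target(Params(target="API-Gateway")))            # api-gateway
print(resolve_target(Params(target="", target_version="V2")))  # v2
print(repr(resolve_target(Params())))                          # ''
```

One caveat of the `or` idiom: a whitespace-only `target` like `" "` is truthy, so it still wins over `target_version` and then strips to `""`, which is presumably the intended conservative behavior here.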
tasks.py CHANGED
@@ -373,12 +373,13 @@ SCENARIOS: dict = {
     # In RCA-001, replace the query_logs section:
     "query_logs": {
         "postgres-db": (
-            "2024-03-16T02:11:00Z LOG database system shut down\n"
-            "2024-03-16T02:10:58Z FATAL terminated by kernel OOM killer\n"
-            "2024-03-16T02:10:30Z LOG long-running query from "
-            "analytics-service consuming all available memory "
-            "running for 12 minutes, no LIMIT clause"
-        ),
+            "2024-03-16T02:11:00Z LOG database system shut down\n"
+            "2024-03-16T02:10:58Z FATAL terminated by kernel OOM killer\n"
+            "2024-03-16T02:10:30Z LOG long-running analytics export query "
+            "consuming 31.8GB/32GB sequential scan on events table "
+            "with cross-join, running 12 minutes, no LIMIT clause. "
+            "Investigate analytics-service scheduled jobs"
+        ),
         "analytics-service": (
             "2024-03-16T01:58:00Z INFO starting scheduled job: full_history_export\n"
             "2024-03-16T01:58:01Z DEBUG executing: SELECT * FROM events "
@@ -678,18 +679,20 @@ SCENARIOS: dict = {
     # In RCA-003, replace query_logs:
     "query_logs": {
         "user-service": (
-            "2024-03-18T08:14:00Z FATAL password authentication failed "
-            "for user 'app_user'\n"
-            "2024-03-18T08:14:01Z ERROR DB credentials rejected — "
-            "credentials were pushed by config-service at 08:12:00Z\n"
-            "2024-03-18T08:14:02Z WARN config-service credential rotation "
-            "may have sent wrong credentials"
-        ),
+            "2024-03-18T08:14:00Z FATAL password authentication failed "
+            "for user 'app_user'\n"
+            "2024-03-18T08:14:01Z ERROR DB credentials rejected — "
+            "credentials were last pushed by config-service secrets "
+            "rotation at 08:12:00Z\n"
+            "2024-03-18T08:14:02Z WARN credential hash mismatch — "
+            "check config-service rotation job for issues"
+        ),
         "notification-service": (
-            "2024-03-18T08:14:05Z FATAL password authentication failed\n"
-            "2024-03-18T08:14:06Z WARN credentials from config-service "
-            "push at 08:12:00Z appear to be stale/invalid"
-        ),
+            "2024-03-18T08:14:05Z FATAL password authentication failed "
+            "for user 'app_user'\n"
+            "2024-03-18T08:14:06Z WARN credentials from config-service "
+            "rotation at 08:12:00Z appear invalid"
+        ),
         "api-gateway": (
             "2024-03-18T08:14:10Z ERROR upstream user-service: 503\n"
             "2024-03-18T08:14:11Z ERROR upstream notification-service: 503"
@@ -842,10 +845,12 @@ SCENARIOS: dict = {
     },
     "tool_responses": {
         "query_logs": {
-            "postgres-db": (
-                "FATAL: terminated by kernel OOM killer — "
-                "query from client 10.0.5.47 running 12min consuming all memory"
-            ),
+            # RP-001 query_logs → postgres-db — REPLACE WITH:
+            "postgres-db": (
+                "FATAL: terminated by kernel OOM killer "
+                "query from client 10.0.5.47 running 12min consuming "
+                "31.8GB of 32GB available memory"
+            ),
             "analytics-service": (
                 "INFO: starting job full_history_export\n"
                 "WARN: query plan: 847M rows, cross-table JOIN, no LIMIT\n"
@@ -1102,14 +1107,16 @@ SCENARIOS: dict = {
     "tool_responses": {
         "query_logs": {
             "user-service": (
-                "FATAL: password authentication failed for user 'app_user'\n"
-                "ERROR: DB credentials rejected\n"
-                "WARN: credentials last refreshed at 08:12:00Z"
-            ),
-            "notification-service": (
-                "FATAL: password authentication failed\n"
-                "WARN: credentials from 08:12:00Z appear stale"
-            ),
+                "FATAL: password authentication failed for user 'app_user'\n"
+                "ERROR: DB credentials rejected\n"
+                "WARN: credentials last refreshed at 08:12:00Z"
+            ),
+
+            "notification-service": (
+                "FATAL: password authentication failed\n"
+                "WARN: credentials last refreshed at 08:12:00Z — "
+                "authentication rejected by postgres-db"
+            ),
             "api-gateway": (
                 "ERROR: upstream user-service 503\n"
                 "ERROR: upstream notification-service 503"
uv.lock CHANGED
@@ -638,7 +638,7 @@ wheels = [
 
 [[package]]
 name = "fastapi"
-version = "0.135.2"
+version = "0.135.3"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "annotated-doc" },
@@ -647,9 +647,9 @@ dependencies = [
     { name = "typing-extensions" },
     { name = "typing-inspection" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/c4/73/5903c4b13beae98618d64eb9870c3fac4f605523dd0312ca5c80dadbd5b9/fastapi-0.135.2.tar.gz", hash = "sha256:88a832095359755527b7f63bb4c6bc9edb8329a026189eed83d6c1afcf419d56", size = 395833, upload-time = "2026-03-23T14:12:41.697Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/f7/e6/7adb4c5fa231e82c35b8f5741a9f2d055f520c29af5546fd70d3e8e1cd2e/fastapi-0.135.3.tar.gz", hash = "sha256:bd6d7caf1a2bdd8d676843cdcd2287729572a1ef524fc4d65c17ae002a1be654", size = 396524, upload-time = "2026-04-01T16:23:58.188Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/8f/ea/18f6d0457f9efb2fc6fa594857f92810cadb03024975726db6546b3d6fcf/fastapi-0.135.2-py3-none-any.whl", hash = "sha256:0af0447d541867e8db2a6a25c23a8c4bd80e2394ac5529bd87501bbb9e240ca5", size = 117407, upload-time = "2026-03-23T14:12:43.284Z" },
+    { url = "https://files.pythonhosted.org/packages/84/a4/5caa2de7f917a04ada20018eccf60d6cc6145b0199d55ca3711b0fc08312/fastapi-0.135.3-py3-none-any.whl", hash = "sha256:9b0f590c813acd13d0ab43dd8494138eb58e484bfac405db1f3187cfc5810d98", size = 117734, upload-time = "2026-04-01T16:23:59.328Z" },
 ]
 
 [[package]]