---
title: Cloud Incident Response OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - sre
  - cloud
  - incident-response
  - devops
  - real-world
  - agentic
---

# ☁️ Cloud Incident Response: OpenEnv Environment

An OpenEnv environment for training and evaluating AI agents on **cloud SRE incident response**, the on-call workflow that engineers at cloud companies perform routinely.

Distinct from Kubernetes operations environments: this one focuses on **cross-service cascading failures** in distributed microservice architectures, such as connection pool exhaustion, CDN cache storms, OOM kills, credential rotation failures, and BGP network partitions.

## Authors

- **Einstein**: Environment Design & Grader Implementation (https://huggingface.co/Elliot89)
- **Sidra**: Scenario Design & Testing (https://huggingface.co/sidraaiman1809)

---

## 🎯 Why This Environment

Cloud companies employ SREs who respond to production incidents under time pressure with incomplete information. This environment simulates that decision loop:

| Phase | What the Agent Does |
|---|---|
| **Triage** | Read alert, assess blast radius, classify severity (P1–P4) |
| **Investigate** | Query logs, metrics, dependencies, recent deploys |
| **Diagnose** | Correlate signals across services to find root cause |
| **Remediate** | Execute correct runbook steps in the right sequence |
| **Document** | Submit resolution summary for post-incident review |

Agents trained here learn the same skills a human SRE develops: service dependency traversal, log correlation, cascading failure analysis, and targeted remediation.


---

## 📊 Baseline Scores

Using `Llama 3.1 8B Instruct` · deterministic (`temperature=0.0`) · fully reproducible

| Task | Difficulty | S0 | S1 | S2 | Average |
|---|---|---|---|---|---|
| `alert_classification` | 🟢 Easy | 1.00 | 1.00 | 1.00 | **1.00** |
| `root_cause_analysis` | 🟡 Medium | 1.00 | 0.20 | 1.00 | **0.73** |
| `remediation_planning` | 🔴 Hard | 0.60 | 0.45 | 0.59 | **0.55** |
| **Overall** | | | | | **0.76** |

### Score Interpretation

```
Easy    1.00 ████████████████████  Clear metrics → straightforward classification
Medium  0.73 ██████████████▌       Root cause hidden: model fails on BGP scenario (S1=0.20)
Hard    0.55 ███████████           Multi-phase execution with wrong-action penalties
```

- **Easy → 1.00:** Alert metrics (error rate, revenue impact) directly indicate severity. An 8B model reliably classifies P1/P2/P3 with 2 diagnostic queries.
- **Medium → 0.73:** The root cause service is NOT named in the alert, so the model must investigate beyond the blast radius. It succeeds on the OOM and credential scenarios but fails on the BGP network partition (S1=0.20), where no victim log names the root cause.
- **Hard → 0.55:** Same diagnostic challenge as medium, PLUS a multi-step remediation sequence, wrong-action penalties (−0.10 each), and documentation quality scoring. The model wastes steps on repeated status checks and sometimes executes counterproductive remediations.

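
The averages in the baseline table can be reproduced from the per-scenario scores. The sketch below assumes that each task's Average is the unweighted mean of S0 to S2 and that Overall is the unweighted mean of the three task averages; the source does not state the aggregation rule, but this one matches the reported figures. The variable names are illustrative only.

```python
from statistics import mean

# Per-scenario scores (S0, S1, S2) copied from the baseline table.
baseline = {
    "alert_classification": [1.00, 1.00, 1.00],
    "root_cause_analysis": [1.00, 0.20, 1.00],
    "remediation_planning": [0.60, 0.45, 0.59],
}

# Assumption: each task average is the plain mean of its three scenarios.
task_averages = {task: mean(scores) for task, scores in baseline.items()}

# Assumption: the overall score is the plain mean of the task averages.
overall = mean(list(task_averages.values()))

for task, average in task_averages.items():
    print(f"{task}: {average:.2f}")
print(f"overall: {overall:.2f}")
```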

---

## 🏗️ Tasks

| Task ID | Difficulty | Max Steps | Objective | Submission Action |
|---|---|---|---|---|
| `alert_classification` | 🟢 Easy | 3 | Classify alert severity (P1–P4) | `submit_severity` |
| `root_cause_analysis` | 🟡 Medium | 10 | Find root cause service + failure mode | `submit_root_cause` |
| `remediation_planning` | 🔴 Hard | 15 | Diagnose + remediate + document | `submit_resolution` |

### Scenarios (3 per task = 9 total episodes)

| ID | Incident Type | Root Cause | Why It's Hard |
|---|---|---|---|
| AC-001 | DB connection pool exhaustion | n/a | Clear P1: 78% errors, $12k/min revenue loss |
| AC-002 | CDN cache invalidation storm | n/a | Ambiguous P2: degraded but checkout works |
| AC-003 | Recommendation service errors | n/a | Trap P3: 45% errors but zero revenue impact |
| RCA-001 | Postgres OOM kill | analytics-service | Must correlate "analytics export query" in DB logs |
| RCA-002 | BGP network partition | network-infra | No victim log names network-infra; hardest scenario |
| RCA-003 | Credential rotation bug | config-service | Must trace "secrets rotation" hint to config-service |
| RP-001 | Full OOM remediation | analytics-service | 6-step sequence: disable job → restart chain |
| RP-002 | Full BGP remediation | network-infra | 4-step sequence: restore routes → rollback → verify |
| RP-003 | Full credential fix | config-service | 7-step sequence: rollback → rotate → restart → verify |

---

## 🎮 Action Space

### Diagnostic Actions (gather evidence)

```json
{"action_type": "query_logs", "parameters": {"service": ""}}
{"action_type": "check_metrics", "parameters": {"service": ""}}
{"action_type": "check_dependencies", "parameters": {"service": ""}}
{"action_type": "check_recent_deploys", "parameters": {"service": ""}}
{"action_type": "check_service_status", "parameters": {"service": ""}}
```

### Remediation Actions (fix the incident)

```json
{"action_type": "restart_service", "parameters": {"service": ""}}
{"action_type": "rollback_deploy", "parameters": {"service": ""}}
{"action_type": "scale_service", "parameters": {"service": "", "replicas": 10}}
{"action_type": "disable_feature_flag", "parameters": {"flag": ""}}
{"action_type": "clear_cache", "parameters": {"service": ""}}
{"action_type": "execute_runbook_step", "parameters": {"runbook_action": ""}}
```

### Submission Actions (end the episode)

```json
{"action_type": "submit_severity", "parameters": {"severity": "P1|P2|P3|P4", "service": ""}}
{"action_type": "submit_root_cause", "parameters": {"service": "", "failure_mode": ""}}
{"action_type": "submit_resolution", "parameters": {"summary": "<3+ sentence summary>"}}
```

---

## 👁️ Observation Space

| Field | Type | Description |
|---|---|---|
| `episode_id` | string | Unique episode UUID |
| `task_id` | string | Active task identifier |
| `scenario_id` | string | Current scenario (e.g., `RCA-001`) |
| `step_count` / `max_steps` | int | Progress through episode |
| `incident_summary` | string | Plain-text incident description (no root cause hints) |
| `alert` | dict | Alert payload with severity, symptoms, affected services |
| `available_actions` | list | Valid action types for this task |
| `queried_data` | dict | All evidence gathered so far |
| `known_services` | list | Exact service names valid for actions |
| `cumulative_reward` | float | Running reward total |
| `done` | bool | Episode terminal flag |
| `feedback` | string | Per-step feedback explaining reward |
| `last_action_error` | string? | Error message if last action was invalid |

---

## 💰 Reward Function

Dense reward shaping throughout the trajectory, not just terminal scoring.

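
To make the idea of dense shaping concrete, the sketch below adds up per-step rewards for a short hard-task trajectory, using the values from the Hard column of the Per-Step Rewards table in this section. It is an illustration only: the event labels are hypothetical names invented for this example, simple addition is an assumption about how the environment accumulates reward, and the halfway and timeout penalties are left out.

```python
# Hard-task per-step rewards copied from the Per-Step Rewards table.
# The dictionary keys are hypothetical labels, not identifiers from the codebase.
HARD_STEP_REWARDS = {
    "new_service": 0.03,               # first query against a service
    "new_action_known_service": 0.01,  # new query type on a known service
    "repeat_query": -0.03,             # exact repeat of an earlier query
    "unknown_service": -0.05,          # service not in known_services
    "correct_remediation": 0.06,
    "wrong_remediation": -0.15,
}


def cumulative_reward(events):
    """Sum the per-step rewards for a sequence of event labels (assumed additive)."""
    return sum(HARD_STEP_REWARDS[event] for event in events)


# A short illustrative trajectory: two useful queries, one wasted repeat, one correct fix.
trajectory = [
    "new_service",
    "new_action_known_service",
    "repeat_query",
    "correct_remediation",
]
total = cumulative_reward(trajectory)
print(f"cumulative reward: {total:+.2f}")
```

Under these assumptions, the repeated query cancels the gain from the first query, which is the kind of signal a terminal-only score would not provide.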

### Per-Step Rewards

| Event | Easy | Medium | Hard |
|---|---|---|---|
| Query new service (first time) | +0.04 | +0.04 | +0.03 |
| Query new action on known service | +0.02 | +0.02 | +0.01 |
| Repeat exact same query | −0.03 | −0.04 | −0.03 |
| Query unknown service | −0.06 | −0.06 | −0.05 |
| Correct remediation action | n/a | +0.06 | +0.06 |
| Wrong remediation action | −0.08 | −0.10 | −0.15 |
| Step past halfway (non-submit) | −0.04 | −0.02 | −0.02 |
| Timeout without submission | −0.15 | −0.15 | −0.20 |

### Grader Scoring (terminal, deterministic)

| Task | Scoring Logic |
|---|---|
| `alert_classification` | 1.0 exact · 0.5 adjacent · 0.25 two-off · 0.0 wrong |
| `root_cause_analysis` | Up to 0.6 base (service + failure mode) + up to 0.4 efficiency bonus. Wrong service: 0.05–0.20 based on investigation effort |
| `remediation_planning` | Scaled base (0.10–0.50 by investigation depth) + 0.30 efficiency − up to 0.30 wrong-action penalty + 0.10 summary quality |

---

## 🔌 API Endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/` | Gradio UI: interactive environment demo |
| `GET` | `/health` | `{"status":"ok","version":"0.1.0"}` |
| `POST` | `/reset` | Start new episode (accepts `task_id`, `scenario_index`) |
| `POST` | `/step` | Submit action → returns observation, reward, done, info |
| `GET` | `/state` | Full current episode state with action history |
| `GET` | `/tasks` | All tasks with action schemas |
| `GET` | `/grader` | Score current episode (0.0–1.0) with breakdown |

---

## 🚀 Setup & Usage

### Local Development

```bash
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

### Docker

```bash
docker build -t cloud-incident-env .
docker run -p 7860:7860 cloud-incident-env
```

### Run Baseline Inference

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="your_token"
python inference.py
```

### Quick API Test

```bash
# Reset
curl -X POST "http://localhost:7860/reset?task_id=alert_classification&scenario_index=0"

# Step
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type":"query_logs","parameters":{"service":"api-gateway"}}'

# Grade
curl http://localhost:7860/grader
```

---

## 📁 Project Structure

```
.
├── Dockerfile              # Container build
├── README.md               # This file
├── requirements.txt        # Python dependencies
├── openenv.yaml            # OpenEnv metadata + task definitions
├── inference.py            # Baseline agent (OpenAI client + smart fallback)
├── tasks.py                # 9 scenarios across 3 difficulty levels
├── graders.py              # Deterministic graders (0.0–1.0)
└── server/
    ├── __init__.py
    ├── app.py              # FastAPI + Gradio endpoints
    ├── environment.py      # Core step()/reset()/state() logic
    └── models.py           # Typed Pydantic models (Action, Observation, Reward)
```

---

## ✅ Validation

```bash
# OpenEnv spec validation
openenv validate
# → [OK] Ready for multi-mode deployment

# Docker build
docker build -t cloud-incident-env .
# → builds successfully

# Health check
curl http://localhost:7860/health
# → {"status":"ok","version":"0.1.0"}
```

## Team

- **Einstein**: [@MrEinsteinE](https://github.com/MrEinsteinE)
- **Sidra**: [@sidraaiman](https://github.com/sidraaiman)
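
For readers who prefer Python to the curl commands in Quick API Test, the following is a minimal sketch of the same reset and step calls using only the standard library. The helper names (`build_reset_request`, `build_step_request`, `send`) are hypothetical and not part of the repository; the sketch assumes the server is running locally on port 7860, that `/reset` takes its arguments as query parameters as in the curl example, and that responses are JSON.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:7860"  # matches the port used in the Docker and curl examples


def build_reset_request(task_id, scenario_index, base_url=BASE_URL):
    """Build a POST /reset request with task_id and scenario_index as query parameters."""
    query = urllib.parse.urlencode({"task_id": task_id, "scenario_index": scenario_index})
    return urllib.request.Request(f"{base_url}/reset?{query}", method="POST")


def build_step_request(action_type, parameters, base_url=BASE_URL):
    """Build a POST /step request carrying one action as a JSON body."""
    body = json.dumps({"action_type": action_type, "parameters": parameters}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the reply (assumes a JSON response body)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_reset_request("alert_classification", 0))` would start an episode and `send(build_step_request("query_logs", {"service": "api-gateway"}))` would take one diagnostic step. Separating request construction from sending keeps the payloads easy to inspect without a live server.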