name: cloud-incident-response version: "0.1.0" app_port: 7860 description: > OpenEnv environment simulating real-world cloud SRE on-call incident response. Distinct from Kubernetes ops — focuses on cross-service cascading failures, network partitions, OOM kills, credential rotation failures, and CDN storms across distributed systems. An AI agent classifies alert severity, performs root cause analysis through log/metric/dependency queries, and executes remediation sequences to resolve production incidents end-to-end. authors: - name: "Einstein" github: "MrEinsteinE" - name: "Sidra" github: "sidraaiman" license: MIT tags: - openenv - sre - cloud - incident-response - devops - real-world - agentic tasks: - id: alert_classification name: "Task 1: Alert Severity Classification" difficulty: easy max_steps: 3 score_range: [0.0, 1.0] description: > Classify incoming alert severity (P1-P4) by querying logs and metrics across affected cloud services. Target baseline: 0.75-1.0 with 8B model. - id: root_cause_analysis name: "Task 2: Root Cause Analysis" difficulty: medium max_steps: 10 score_range: [0.0, 1.0] description: > Trace a live incident through logs, metrics, dependencies, and recent deploys to identify the exact root cause service and failure mode. Root cause is NOT in the alert. Target baseline: 0.35-0.60 with 8B model. - id: remediation_planning name: "Task 3: Incident Remediation" difficulty: hard max_steps: 15 score_range: [0.0, 1.0] description: > Fully resolve a production incident end-to-end: diagnose the root cause, execute the correct multi-step remediation sequence, and submit a documented resolution summary. Wrong actions penalized. Target baseline: 0.20-0.45 with 8B model. endpoints: health: "GET /health" reset: "POST /reset" step: "POST /step" state: "GET /state" tasks: "GET /tasks" grader: "GET /grader" baseline: "POST /baseline" repo: "https://github.com/MrEinsteinE/cloud-incident-response-openenv" space: "https://huggingface.co/spaces/Elliot89/cloud-incident-response"