name: cloud-incident-response
version: "0.1.0"
app_port: 7860
description: >
  OpenEnv environment simulating real-world cloud SRE on-call incident response.
  Distinct from Kubernetes ops, it focuses on cross-service cascading failures,
  network partitions, OOM kills, and CDN storms across distributed systems.
  An AI agent classifies alert severity, performs root cause analysis through
  log/metric/dependency queries, and executes remediation sequences to resolve
  production incidents end-to-end.
author: Elliot89
license: MIT
tags:
  - openenv
  - sre
  - cloud
  - incident-response
  - devops
  - real-world
  - agentic
tasks:
  - id: alert_classification
    name: "Task 1: Alert Severity Classification"
    difficulty: easy
    max_steps: 3
    score_range: [0.0, 1.0]
    description: >
      Classify incoming alert severity (P1-P4) by querying
      logs and metrics across affected cloud services.
  - id: root_cause_analysis
    name: "Task 2: Root Cause Analysis"
    difficulty: medium
    max_steps: 10
    score_range: [0.0, 1.0]
    description: >
      Trace a live incident through logs, metrics, dependencies,
      and recent deploys to identify the exact root cause service
      and failure mode across a distributed system.
  - id: remediation_planning
    name: "Task 3: Incident Remediation"
    difficulty: hard
    max_steps: 15
    score_range: [0.0, 1.0]
    description: >
      Fully resolve a production incident end-to-end: diagnose
      the root cause, execute the correct remediation sequence,
      and submit a documented resolution summary.
endpoints:
  health: "GET /health"
  reset: "POST /reset"
  step: "POST /step"
  state: "GET /state"
  tasks: "GET /tasks"
  grader: "GET /grader"
  baseline: "POST /baseline"
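
The endpoint table above pairs each name with an HTTP method and a path. A minimal Python sketch of turning those entries into requests is shown here. It assumes the server is reachable at `http://localhost:7860` (the port comes from `app_port`; the host is a guess) and that POST endpoints accept a JSON body. The manifest does not specify any request or response schema, so `build_request` and its `payload` argument are hypothetical helpers rather than part of the environment's API.

```python
import json
from urllib.request import Request

# Endpoint table copied from the `endpoints` section of the manifest.
ENDPOINTS = {
    "health": "GET /health",
    "reset": "POST /reset",
    "step": "POST /step",
    "state": "GET /state",
    "tasks": "GET /tasks",
    "grader": "GET /grader",
    "baseline": "POST /baseline",
}

# Assumption: the app is served locally on the manifest's app_port.
BASE_URL = "http://localhost:7860"


def build_request(name, payload=None, base_url=BASE_URL):
    """Build (but do not send) a urllib Request for a named endpoint.

    `payload` is a hypothetical JSON body; the manifest does not
    document which fields /reset, /step, or /baseline expect.
    """
    method, path = ENDPOINTS[name].split(" ", 1)
    data = None
    headers = {}
    if method == "POST":
        # Assumption: POST endpoints take a JSON-encoded body.
        data = json.dumps(payload or {}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(base_url + path, data=data, headers=headers, method=method)
```

A request built this way can be sent with `urllib.request.urlopen(build_request("health"))` once the server is running. Any body passed to `reset` or `step` should follow whatever schema the environment actually defines.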