name: cloud-incident-response
version: "0.1.0"
app_port: 7860
description: >
  OpenEnv environment simulating real-world cloud SRE on-call incident response.
  Distinct from Kubernetes ops, it focuses on cross-service cascading failures,
  network partitions, OOM kills, and CDN storms across distributed systems.
  An AI agent classifies alert severity, performs root cause analysis through
  log/metric/dependency queries, and executes remediation sequences to resolve
  production incidents end-to-end.
author: Elliot89
license: MIT
tags:
  - openenv
  - sre
  - cloud
  - incident-response
  - devops
  - real-world
  - agentic
tasks:
  - id: alert_classification
    name: "Task 1: Alert Severity Classification"
    difficulty: easy
    max_steps: 3
    score_range: [0.0, 1.0]
    description: >
      Classify incoming alert severity (P1-P4) by querying
      logs and metrics across affected cloud services.
  - id: root_cause_analysis
    name: "Task 2: Root Cause Analysis"
    difficulty: medium
    max_steps: 10
    score_range: [0.0, 1.0]
    description: >
      Trace a live incident through logs, metrics, dependencies,
      and recent deploys to identify the exact root cause service
      and failure mode across a distributed system.
  - id: remediation_planning
    name: "Task 3: Incident Remediation"
    difficulty: hard
    max_steps: 15
    score_range: [0.0, 1.0]
    description: >
      Fully resolve a production incident end-to-end: diagnose
      the root cause, execute the correct remediation sequence,
      and submit a documented resolution summary.
endpoints:
  health: "GET /health"
  reset: "POST /reset"
  step: "POST /step"
  state: "GET /state"
  tasks: "GET /tasks"
  grader: "GET /grader"
  baseline: "POST /baseline"
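
# Illustrative sketch only, not part of the source manifest: one plausible call
# order for a single episode, assuming the endpoints above follow the usual
# reset/step/grade loop. The key name `example_episode`, the ordering, and the
# per-call notes are hypothetical; the manifest itself does not specify them.
#
# example_episode:
#   - "GET /tasks"    # assumed: list the task ids defined under `tasks`
#   - "POST /reset"   # assumed: start a new episode
#   - "POST /step"    # assumed: repeated, up to the task's max_steps
#   - "GET /state"    # assumed: optional inspection of the current state
#   - "GET /grader"   # assumed: read the score, within score_range [0.0, 1.0]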