An OpenEnv environment for training and evaluating AI agents on cloud SRE incident response โ the real-world on-call workflow that engineers at every cloud company perform daily.
Distinct from Kubernetes operations environments: this focuses on cross-service cascading failures in distributed microservice architectures โ connection pool exhaustion, CDN cache storms, OOM kills, credential rotation failures, and BGP network partitions.
Every cloud company employs SREs who respond to production incidents under time pressure with incomplete information. This environment simulates the exact decision loop:
Correlate signals across services to find root cause
Remediate
Execute correct runbook steps in the right sequence
Document
Submit resolution summary for post-incident review
Agents trained here learn the same skills a human SRE develops: service dependency traversal, log correlation, cascading failure analysis, and targeted remediation.
Easy 1.00 โโโโโโโโโโโโโโโโโโโโ Clear metrics โ straightforward classification
Medium 0.73 โโโโโโโโโโโโโโโ Root cause hidden โ model fails on BGP scenario (S1=0.20)
Hard 0.55 โโโโโโโโโโโ Multi-phase execution with wrong-action penalties
Easy โ 1.00: Alert metrics (error rate, revenue impact) directly indicate severity. An 8B model reliably classifies P1/P2/P3 with 2 diagnostic queries.
Medium โ 0.73: Root cause service is NOT in the alert. Model must investigate beyond the blast radius. Succeeds on OOM and credential scenarios but fails on BGP network partition (S1=0.20) where no victim log names the root cause.
Hard โ 0.55: Same diagnostic challenge as medium PLUS multi-step remediation sequence, wrong-action penalties (โ0.10 each), and documentation quality scoring. Model wastes steps on repeated status checks and sometimes executes counterproductive remediations.
๐๏ธ Tasks
Task ID
Difficulty
Max Steps
Objective
Submission Action
alert_classification
๐ข Easy
3
Classify alert severity (P1โP4)
submit_severity
root_cause_analysis
๐ก Medium
10
Find root cause service + failure mode
submit_root_cause
remediation_planning
๐ด Hard
15
Diagnose + remediate + document
submit_resolution
Scenarios (3 per task = 9 total episodes)
ID
Incident Type
Root Cause
Why It's Hard
AC-001
DB connection pool exhaustion
โ
Clear P1: 78% errors, $12k/min revenue loss
AC-002
CDN cache invalidation storm
โ
Ambiguous P2: degraded but checkout works
AC-003
Recommendation service errors
โ
Trap P3: 45% errors but zero revenue impact
RCA-001
Postgres OOM kill
analytics-service
Must correlate "analytics export query" in DB logs
RCA-002
BGP network partition
network-infra
No victim log names network-infra โ hardest scenario
RCA-003
Credential rotation bug
config-service
Must trace "secrets rotation" hint to config-service