title: Cloud Incident Response OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
tags:
- openenv
- sre
- cloud
- incident-response
- devops
- real-world
- agentic
Cloud Incident Response: OpenEnv Environment
An OpenEnv environment for training and evaluating AI agents on cloud SRE incident response, the real-world on-call workflow that engineers at cloud companies perform routinely.
Unlike Kubernetes operations environments, it focuses on cross-service cascading failures in distributed microservice architectures: connection pool exhaustion, CDN cache storms, OOM kills, and BGP network partitions.
Why This Environment
Cloud companies employ SREs who respond to production incidents under time pressure and with incomplete information. This environment simulates that decision loop:
- Triage: Read the alert, assess blast radius, classify severity (P1–P4)
- Investigate: Query logs, metrics, dependencies, recent deploys
- Diagnose: Correlate signals across services to find the root cause
- Remediate: Execute the correct runbook steps in the right sequence
- Document: Submit a resolution summary for post-incident review
Agents trained here learn the same skills a human SRE uses: service dependency traversal, log correlation, cascading failure analysis, and targeted remediation.
Tasks
| Task ID | Difficulty | Max Steps | What the Agent Does |
|---|---|---|---|
| alert_classification | Easy | 3 | Classify alert severity (P1–P4) from metrics and symptoms |
| root_cause_analysis | Medium | 10 | Trace logs/metrics/deps to find root cause service and failure mode |
| remediation_planning | Hard | 15 | Diagnose, remediate, and document full incident resolution |
Scenarios
| ID | Incident Type | Failure Pattern |
|---|---|---|
| AC-001 | DB connection pool exhaustion | postgres-db → auth-service → api-gateway cascade |
| AC-002 | CDN cache invalidation storm | Misconfigured purge job → 40× origin traffic |
| RCA-001 | Postgres OOM kill | Runaway analytics query → kernel OOM → all dependents down |
| RCA-002 | BGP network partition | Route withdrawal → AZ isolation → 61% checkout failures |
| RP-001 | Full OOM remediation | Disable job → restart DB → restore services → document |
| RP-002 | Full BGP remediation | Restore routes → rollback config → verify recovery → document |
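The cascade patterns above can be read as a walk over a service dependency graph. The following is a minimal sketch of that idea, not the environment's implementation: the `DEPENDENTS` mapping and the `blast_radius` helper are hypothetical, and the graph contains only the three services named in AC-001.

```python
from collections import deque

# Edges point from a service to the services that depend on it,
# following the AC-001 cascade (illustrative graph, not the real topology).
DEPENDENTS = {
    "postgres-db": ["auth-service"],
    "auth-service": ["api-gateway"],
    "api-gateway": [],
}


def blast_radius(root: str, dependents: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk from the failing service to everything downstream."""
    seen = [root]
    queue = deque([root])
    while queue:
        service = queue.popleft()
        for downstream in dependents.get(service, []):
            if downstream not in seen:
                seen.append(downstream)
                queue.append(downstream)
    return seen


print(blast_radius("postgres-db", DEPENDENTS))
# ['postgres-db', 'auth-service', 'api-gateway']
```

An agent working a root cause task effectively runs this walk in reverse, starting from the alerting service and querying upstream dependencies until it reaches the origin of the failure.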
Action Space
Diagnostic actions (gather evidence):
{"action_type": "query_logs", "parameters": {"service": "postgres-db"}}
{"action_type": "check_metrics", "parameters": {"service": "auth-service"}}
{"action_type": "check_dependencies", "parameters": {"service": "api-gateway"}}
{"action_type": "check_recent_deploys", "parameters": {"service": "analytics-service"}}
{"action_type": "check_service_status", "parameters": {"service": "payment-service"}}
Remediation actions (fix the incident):
{"action_type": "restart_service", "parameters": {"service": "postgres-db"}}
{"action_type": "rollback_deploy", "parameters": {"service": "network-infra"}}
{"action_type": "scale_service", "parameters": {"service": "image-service", "replicas": 10}}
{"action_type": "disable_feature_flag", "parameters": {"flag": "full_history_export"}}
{"action_type": "execute_runbook_step", "parameters": {"runbook_action": "restore_bgp_routes"}}
Submission actions (end the episode):
{"action_type": "submit_severity", "parameters": {"severity": "P1", "service": "postgres-db"}}
{"action_type": "submit_root_cause", "parameters": {"service": "analytics-service", "failure_mode": "unbounded query OOM"}}
{"action_type": "submit_resolution", "parameters": {"summary": "Disabled analytics job, restarted postgres-db..."}}
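The payloads above share one shape: an `action_type` string plus a `parameters` object. A small client-side builder along these lines may help catch typos before a step is spent. The `make_action` and `is_terminal` helpers are hypothetical and not part of the environment; only the action type names are taken from the lists above.

```python
import json

# Action types copied from the three groups listed above.
DIAGNOSTIC = {"query_logs", "check_metrics", "check_dependencies",
              "check_recent_deploys", "check_service_status"}
REMEDIATION = {"restart_service", "rollback_deploy", "scale_service",
               "disable_feature_flag", "execute_runbook_step"}
SUBMISSION = {"submit_severity", "submit_root_cause", "submit_resolution"}


def make_action(action_type: str, **parameters) -> dict:
    """Build an action payload (hypothetical helper, client side only)."""
    if action_type not in DIAGNOSTIC | REMEDIATION | SUBMISSION:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return {"action_type": action_type, "parameters": parameters}


def is_terminal(action: dict) -> bool:
    """Submission actions end the episode, per the grouping above."""
    return action["action_type"] in SUBMISSION


action = make_action("scale_service", service="image-service", replicas=10)
print(json.dumps(action))
```

Note that the environment also reports `available_actions` per task, so a stricter check could validate against that list instead of the full set.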
Observation Space
| Field | Type | Description |
|---|---|---|
| episode_id | string | Unique episode UUID |
| task_id | string | Active task |
| scenario_id | string | Scenario (e.g. AC-001) |
| step_count / max_steps | int | Current progress and budget |
| incident_summary | string | Plain-text incident description |
| alert | dict | Alert payload with severity, symptoms, affected services |
| available_actions | list[str] | Valid action types for this task |
| queried_data | dict | All tool responses gathered so far |
| cumulative_reward | float | Running reward total |
| done | bool | Episode terminal flag |
| feedback | string | Per-step feedback |
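On the client side, the observation fields can be mirrored in a typed container. This is a sketch under assumptions: it treats `step_count` and `max_steps` as two separate integer fields, the `Observation` class and `steps_remaining` property are hypothetical, and the sample payload values are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Observation:
    """Client-side mirror of the observation fields (hypothetical class)."""
    episode_id: str
    task_id: str
    scenario_id: str
    step_count: int
    max_steps: int
    incident_summary: str
    alert: dict[str, Any]
    available_actions: list[str]
    queried_data: dict[str, Any] = field(default_factory=dict)
    cumulative_reward: float = 0.0
    done: bool = False
    feedback: str = ""

    @property
    def steps_remaining(self) -> int:
        """Step budget left before the episode times out."""
        return max(self.max_steps - self.step_count, 0)


# Invented sample payload; real values come from the environment.
payload = {
    "episode_id": "00000000-0000-0000-0000-000000000000",
    "task_id": "alert_classification",
    "scenario_id": "AC-001",
    "step_count": 1,
    "max_steps": 3,
    "incident_summary": "example summary",
    "alert": {"severity": "unknown"},
    "available_actions": ["query_logs", "submit_severity"],
}
obs = Observation(**payload)
print(obs.steps_remaining)  # 2
```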
Reward Function
Dense signals throughout the trajectory:
| Event | Reward |
|---|---|
| Query known service (first time) | +0.05 |
| Query known service (repeat) | +0.01 |
| Query unknown service | -0.05 |
| Correct remediation action | +0.10 |
| Wrong remediation action | -0.10 |
| Step past halfway (non-submit) | -0.02 |
| Timeout without submission | -0.10 |
| Grader score (terminal step) | 0.0–1.0 |
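To make the shaping values concrete, here is a rough sketch of how they could combine per step. Several points are assumptions rather than documented behavior: the event names are hypothetical labels, steps are counted from 1, "past halfway" is read as `step > max_steps / 2`, and the late-step penalty is added on top of the event reward.

```python
# Reward values copied from the table above; the event keys are hypothetical.
EVENT_REWARDS = {
    "query_known_first": 0.05,
    "query_known_repeat": 0.01,
    "query_unknown": -0.05,
    "remediation_correct": 0.10,
    "remediation_wrong": -0.10,
}
LATE_STEP_PENALTY = -0.02
TIMEOUT_PENALTY = -0.10


def step_reward(event: str, step: int, max_steps: int, is_submit: bool = False) -> float:
    """Shaping reward for one step (assumes penalties stack additively)."""
    reward = EVENT_REWARDS.get(event, 0.0)
    if not is_submit and step > max_steps / 2:
        reward += LATE_STEP_PENALTY
    return round(reward, 2)


def shaping_total(events: list[str], max_steps: int, timed_out: bool = False) -> float:
    """Sum shaping rewards for non-submit events taken at steps 1..n."""
    total = sum(step_reward(e, s, max_steps) for s, e in enumerate(events, start=1))
    if timed_out:
        total += TIMEOUT_PENALTY
    return round(total, 2)


# root_cause_analysis has a 10-step budget, so step 7 is past halfway.
print(step_reward("query_known_first", 3, 10))   # 0.05
print(step_reward("query_known_repeat", 7, 10))  # -0.01
```

Under this reading, repeated queries late in an episode cost more than they earn, which pushes the agent to submit once it has enough evidence.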
Grader scoring (deterministic, via GET /grader):
| Task | Scoring |
|---|---|
| alert_classification | 1.0 exact · 0.5 adjacent · 0.25 two-off · 0.0 wrong |
| root_cause_analysis | 0.6 base + up to 0.4 efficiency bonus |
| remediation_planning | 0.6 base + 0.3 efficiency − 0.15 wrong penalty + 0.1 summary |
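The alert_classification rule lends itself to a short illustration. The sketch below is not the grader's code: it assumes severities are ordered P1 to P4, that "adjacent" and "two-off" refer to distance along that ordering, and that anything further away or unrecognized scores 0.0. The `grade_severity` function is a hypothetical name.

```python
SEVERITIES = ["P1", "P2", "P3", "P4"]
# Score by distance between predicted and expected severity levels.
SCORE_BY_DISTANCE = {0: 1.0, 1: 0.5, 2: 0.25}


def grade_severity(predicted: str, expected: str) -> float:
    """Score a severity guess by its distance from the expected level."""
    if predicted not in SEVERITIES or expected not in SEVERITIES:
        return 0.0
    distance = abs(SEVERITIES.index(predicted) - SEVERITIES.index(expected))
    return SCORE_BY_DISTANCE.get(distance, 0.0)


for guess in SEVERITIES:
    print(guess, grade_severity(guess, "P1"))
```

With an expected severity of P1, the guesses P1 through P4 score 1.0, 0.5, 0.25, and 0.0 respectively under these assumptions.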
API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | / | {"status": "running", ...} (Space health check) |
| GET | /health | {"status": "ok", "version": "0.1.0"} |
| POST | /reset?task_id=...&scenario_index=... | Start new episode |
| POST | /step | Submit action (JSON body) |
| GET | /state | Full current episode state |
| GET | /tasks | All tasks with action schemas |
| GET | /grader | Score current episode (0.0–1.0) |
| POST | /baseline | Run inference.py, return scores |
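A minimal client for the `/reset` and `/step` endpoints might look like the following sketch, which uses only the Python standard library. It assumes the server is reachable at `http://localhost:7860` and that `scenario_index` is an integer; the helper names are hypothetical, and the response schema is not assumed beyond being JSON.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: server running locally


def reset_request(task_id: str, scenario_index: int,
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /reset request with its query parameters."""
    query = urllib.parse.urlencode(
        {"task_id": task_id, "scenario_index": scenario_index})
    return urllib.request.Request(
        f"{base_url}/reset?{query}", data=b"", method="POST")


def step_request(action: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /step request with the action as a JSON body."""
    return urllib.request.Request(
        f"{base_url}/step",
        data=json.dumps(action).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server up, an episode would start with `send(reset_request("alert_classification", 0))`, followed by `send(step_request(...))` calls until the returned state reports `done`.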
Setup
# Local
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860
# Docker
docker build -t cloud-incident-env .
docker run -p 7860:7860 \
-e API_BASE_URL="https://api-inference.huggingface.co/v1" \
-e MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
-e HF_TOKEN="hf_your_token" \
cloud-incident-env
# Run inference script
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_your_token"
python inference.py
Baseline Scores
Using meta-llama/Llama-3.1-8B-Instruct via HF Inference API:
| Task | Score |
|---|---|
| alert_classification | ~0.75 |
| root_cause_analysis | ~0.35 |
| remediation_planning | ~0.20 |
| overall | ~0.43 |
Run python inference.py to reproduce.
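The overall figure is consistent with an unweighted mean of the three task scores. That reading is an assumption, since the aggregation method is not stated; the quick check below uses the approximate values from the table.

```python
# Approximate baseline scores copied from the table above.
scores = {
    "alert_classification": 0.75,
    "root_cause_analysis": 0.35,
    "remediation_planning": 0.20,
}

# Unweighted mean across tasks (assumed aggregation).
overall = sum(scores.values()) / len(scores)
print(f"{overall:.2f}")  # 0.43
```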