---
title: Cloud Incident Response OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - sre
  - cloud
  - incident-response
  - devops
  - real-world
  - agentic
---

# Cloud Incident Response — OpenEnv Environment

An OpenEnv environment for training and evaluating AI agents on cloud SRE incident response — the real-world on-call workflow that engineers at every cloud company perform daily.

Distinct from Kubernetes operations environments: this one focuses on cross-service cascading failures in distributed microservice architectures — connection pool exhaustion, CDN cache storms, OOM kills, and BGP network partitions.

## Why This Environment

Every cloud company employs SREs who respond to production incidents under time pressure with incomplete information. This environment simulates the exact decision loop:

1. **Triage** — Read the alert, assess blast radius, classify severity (P1–P4)
2. **Investigate** — Query logs, metrics, dependencies, and recent deploys
3. **Diagnose** — Correlate signals across services to find the root cause
4. **Remediate** — Execute the correct runbook steps in the right sequence
5. **Document** — Submit a resolution summary for post-incident review

Agents trained here learn the same skills a human SRE uses: service dependency traversal, log correlation, cascading failure analysis, and targeted remediation.
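The five-step loop above maps naturally onto a small episode driver. A minimal sketch — `reset_fn` and `step_fn` are illustrative stand-ins for calls to `POST /reset` and `POST /step`, and the per-step `reward` key is an assumption (the documented observation exposes `cumulative_reward` and `done`):

```python
def run_episode(reset_fn, step_fn, policy, max_steps=15):
    """Drive one incident episode: observe, act, stop when done or out of budget."""
    obs = reset_fn()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)          # e.g. {"action_type": "query_logs", "parameters": {...}}
        obs = step_fn(action)
        total_reward += obs.get("reward", 0.0)
        if obs.get("done"):
            break
    return total_reward, obs
```

The driver is policy-agnostic: the same loop serves a scripted baseline, an LLM agent, or an RL rollout worker.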

## Tasks

| Task ID | Difficulty | Max Steps | What the Agent Does |
|---|---|---|---|
| `alert_classification` | Easy | 3 | Classify alert severity (P1–P4) from metrics and symptoms |
| `root_cause_analysis` | Medium | 10 | Trace logs/metrics/deps to find the root cause service and failure mode |
| `remediation_planning` | Hard | 15 | Diagnose, remediate, and document a full incident resolution |

## Scenarios

| ID | Incident Type | Failure Pattern |
|---|---|---|
| AC-001 | DB connection pool exhaustion | postgres-db → auth-service → api-gateway cascade |
| AC-002 | CDN cache invalidation storm | Misconfigured purge job → 40× origin traffic |
| RCA-001 | Postgres OOM kill | Runaway analytics query → kernel OOM → all dependents down |
| RCA-002 | BGP network partition | Route withdrawal → AZ isolation → 61% checkout failures |
| RP-001 | Full OOM remediation | Disable job → restart DB → restore services → document |
| RP-002 | Full BGP remediation | Restore routes → rollback config → verify recovery → document |

## Action Space

Diagnostic actions (gather evidence):

```json
{"action_type": "query_logs",           "parameters": {"service": "postgres-db"}}
{"action_type": "check_metrics",        "parameters": {"service": "auth-service"}}
{"action_type": "check_dependencies",   "parameters": {"service": "api-gateway"}}
{"action_type": "check_recent_deploys", "parameters": {"service": "analytics-service"}}
{"action_type": "check_service_status", "parameters": {"service": "payment-service"}}
```

Remediation actions (fix the incident):

```json
{"action_type": "restart_service",      "parameters": {"service": "postgres-db"}}
{"action_type": "rollback_deploy",      "parameters": {"service": "network-infra"}}
{"action_type": "scale_service",        "parameters": {"service": "image-service", "replicas": 10}}
{"action_type": "disable_feature_flag", "parameters": {"flag": "full_history_export"}}
{"action_type": "execute_runbook_step", "parameters": {"runbook_action": "restore_bgp_routes"}}
```

Submission actions (end the episode):

```json
{"action_type": "submit_severity",   "parameters": {"severity": "P1", "service": "postgres-db"}}
{"action_type": "submit_root_cause", "parameters": {"service": "analytics-service", "failure_mode": "unbounded query OOM"}}
{"action_type": "submit_resolution", "parameters": {"summary": "Disabled analytics job, restarted postgres-db..."}}
```
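An episode is just an ordered sequence of such action dicts. As a hypothetical plan for an RP-001-style episode (the exact runbook steps and parameter values are defined by the scenario, so treat this ordering as illustrative):

```python
# Hypothetical remediation plan following the RP-001 pattern:
# disable job -> restart DB -> restore services -> document.
RP_001_PLAN = [
    {"action_type": "disable_feature_flag", "parameters": {"flag": "full_history_export"}},
    {"action_type": "restart_service",      "parameters": {"service": "postgres-db"}},
    {"action_type": "restart_service",      "parameters": {"service": "auth-service"}},
    {"action_type": "submit_resolution",
     "parameters": {"summary": "Disabled analytics export, restarted postgres-db and dependents."}},
]
```

Note the ordering matters: the reward function penalizes wrong remediation actions, and the episode only ends on a submission action.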

## Observation Space

| Field | Type | Description |
|---|---|---|
| `episode_id` | string | Unique episode UUID |
| `task_id` | string | Active task |
| `scenario_id` | string | Scenario (e.g. AC-001) |
| `step_count` / `max_steps` | int | Current progress and budget |
| `incident_summary` | string | Plain-text incident description |
| `alert` | dict | Alert payload with severity, symptoms, affected services |
| `available_actions` | list[str] | Valid action types for this task |
| `queried_data` | dict | All tool responses gathered so far |
| `cumulative_reward` | float | Running reward total |
| `done` | bool | Episode terminal flag |
| `feedback` | string | Per-step feedback |
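As a sketch of how these fields can drive a policy, here is a toy rule that queries logs for each affected service once, then submits. The `affected_services` key inside `alert` is an assumption based on the table's description of the alert payload:

```python
def naive_policy(obs):
    """Query logs for each affected service once, then submit a severity guess."""
    affected = obs["alert"].get("affected_services", [])
    # queried_data accumulates tool responses, so it doubles as a "visited" set.
    for svc in affected:
        if svc not in obs["queried_data"]:
            return {"action_type": "query_logs", "parameters": {"service": svc}}
    return {
        "action_type": "submit_severity",
        "parameters": {"severity": "P1", "service": affected[0] if affected else "unknown"},
    }
```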

## Reward Function

Dense signals throughout the trajectory:

| Event | Reward |
|---|---|
| Query known service (first time) | +0.05 |
| Query known service (repeat) | +0.01 |
| Query unknown service | -0.05 |
| Correct remediation action | +0.10 |
| Wrong remediation action | -0.10 |
| Step past halfway (non-submit) | -0.02 |
| Timeout without submission | -0.10 |
| Grader score (terminal step) | 0.0–1.0 |

Grader scoring (deterministic, via `GET /grader`):

| Task | Scoring |
|---|---|
| `alert_classification` | 1.0 exact · 0.5 adjacent · 0.25 two-off · 0.0 wrong |
| `root_cause_analysis` | 0.6 base + up to 0.4 efficiency bonus |
| `remediation_planning` | 0.6 base + 0.3 efficiency − 0.15 wrong-action penalty + 0.1 summary |
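The `alert_classification` rule is simple enough to write out. A sketch, assuming severities are ordered P1 through P4 and scored by distance between predicted and actual level (an illustration of the rule above, not the server's code):

```python
SEVERITIES = ["P1", "P2", "P3", "P4"]

def classification_score(predicted: str, actual: str) -> float:
    """1.0 for exact match, 0.5 one level off, 0.25 two off, 0.0 otherwise."""
    distance = abs(SEVERITIES.index(predicted) - SEVERITIES.index(actual))
    return {0: 1.0, 1: 0.5, 2: 0.25}.get(distance, 0.0)
```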

## API Endpoints

| Method | Path | Description |
|---|---|---|
| GET | `/` | `{"status": "running", ...}` — Space health check |
| GET | `/health` | `{"status": "ok", "version": "0.1.0"}` |
| POST | `/reset?task_id=...&scenario_index=...` | Start a new episode |
| POST | `/step` | Submit an action (JSON body) |
| GET | `/state` | Full current episode state |
| GET | `/tasks` | All tasks with action schemas |
| GET | `/grader` | Score the current episode (0.0–1.0) |
| POST | `/baseline` | Run `inference.py`, return scores |
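A thin client over these endpoints needs nothing beyond the standard library. A minimal sketch (the default base URL assumes a local run on port 7860; error handling is omitted):

```python
import json
import urllib.request

class IncidentEnvClient:
    """Tiny stdlib-only wrapper around the /reset and /step endpoints."""

    def __init__(self, base_url="http://localhost:7860"):
        self.base_url = base_url.rstrip("/")

    def reset(self, task_id, scenario_index=0):
        url = f"{self.base_url}/reset?task_id={task_id}&scenario_index={scenario_index}"
        return self._post(url)

    def step(self, action):
        return self._post(f"{self.base_url}/step", action)

    def _post(self, url, body=None):
        data = json.dumps(body).encode() if body is not None else b""
        req = urllib.request.Request(
            url, data=data, headers={"Content-Type": "application/json"}, method="POST"
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())
```

Usage: `IncidentEnvClient().reset("alert_classification")`, then feed each observation to your policy and pass its action dict to `.step(...)`.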

## Setup

```bash
# Local
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860

# Docker
docker build -t cloud-incident-env .
docker run -p 7860:7860 \
  -e API_BASE_URL="https://api-inference.huggingface.co/v1" \
  -e MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
  -e HF_TOKEN="hf_your_token" \
  cloud-incident-env

# Run inference script
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_your_token"
python inference.py
```

## Baseline Scores

Using `meta-llama/Llama-3.1-8B-Instruct` via the HF Inference API:

| Task | Score |
|---|---|
| alert_classification | ~0.75 |
| root_cause_analysis | ~0.35 |
| remediation_planning | ~0.20 |
| **overall** | **~0.43** |

Run `python inference.py` to reproduce.