title: Cloud Incident Response OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - sre
  - cloud
  - incident-response
  - devops
  - real-world
  - agentic

☁️ Cloud Incident Response: OpenEnv Environment

An OpenEnv environment for training and evaluating AI agents on cloud SRE incident response: the real-world on-call workflow that engineers at cloud companies perform daily.

Unlike Kubernetes operations environments, this one focuses on cross-service cascading failures in distributed microservice architectures: connection pool exhaustion, CDN cache storms, OOM kills, credential rotation failures, and BGP network partitions.

Authors


🎯 Why This Environment

Every cloud company employs SREs who respond to production incidents under time pressure with incomplete information. This environment simulates the exact decision loop:

| Phase | What the Agent Does |
| --- | --- |
| Triage | Read alert, assess blast radius, classify severity (P1–P4) |
| Investigate | Query logs, metrics, dependencies, recent deploys |
| Diagnose | Correlate signals across services to find root cause |
| Remediate | Execute correct runbook steps in the right sequence |
| Document | Submit resolution summary for post-incident review |

Agents trained here learn the same skills a human SRE develops: service dependency traversal, log correlation, cascading failure analysis, and targeted remediation.


📊 Baseline Scores

Using Llama 3.1 8B Instruct · deterministic (temperature=0.0) · fully reproducible

| Task | Difficulty | S0 | S1 | S2 | Average |
| --- | --- | --- | --- | --- | --- |
| alert_classification | 🟢 Easy | 1.00 | 1.00 | 1.00 | 1.00 |
| root_cause_analysis | 🟡 Medium | 1.00 | 0.20 | 1.00 | 0.73 |
| remediation_planning | 🔴 Hard | 0.60 | 0.45 | 0.59 | 0.55 |
| Overall | | | | | 0.76 |

Score Interpretation

Easy   1.00 ████████████████████  Clear metrics → straightforward classification
Medium 0.73 ██████████████▌       Root cause hidden; model fails on BGP scenario (S1=0.20)
Hard   0.55 ███████████           Multi-phase execution with wrong-action penalties
  • Easy → 1.00: Alert metrics (error rate, revenue impact) directly indicate severity. An 8B model reliably classifies P1/P2/P3 with 2 diagnostic queries.
  • Medium → 0.73: Root cause service is NOT in the alert. Model must investigate beyond the blast radius. Succeeds on OOM and credential scenarios but fails on BGP network partition (S1=0.20), where no victim log names the root cause.
  • Hard → 0.55: Same diagnostic challenge as medium PLUS multi-step remediation sequence, wrong-action penalties (−0.10 each), and documentation quality scoring. Model wastes steps on repeated status checks and sometimes executes counterproductive remediations.
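
The averages in the baseline table can be reproduced from the per-scenario scores. The snippet below is a minimal check, assuming each task average is the plain mean of S0–S2 and the overall score is the mean of the three task averages (the README does not state the aggregation explicitly, so treat this as one reading that happens to match the published figures).

```python
from statistics import mean

# Per-scenario scores (S0, S1, S2) copied from the baseline table.
scores = {
    "alert_classification": [1.00, 1.00, 1.00],
    "root_cause_analysis": [1.00, 0.20, 1.00],
    "remediation_planning": [0.60, 0.45, 0.59],
}

# Assumption: a task's average is the unweighted mean of its three scenarios.
task_averages = {task: mean(values) for task, values in scores.items()}

# Assumption: the overall score is the unweighted mean of the task averages.
overall = mean(list(task_averages.values()))

for task, average in task_averages.items():
    print(f"{task:<22} {average:.2f}")
print(f"{'overall':<22} {overall:.2f}")
```

The single S1=0.20 result on the BGP scenario accounts for the whole gap between the medium average and a perfect score.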

🏗️ Tasks

| Task ID | Difficulty | Max Steps | Objective | Submission Action |
| --- | --- | --- | --- | --- |
| alert_classification | 🟢 Easy | 3 | Classify alert severity (P1–P4) | submit_severity |
| root_cause_analysis | 🟡 Medium | 10 | Find root cause service + failure mode | submit_root_cause |
| remediation_planning | 🔴 Hard | 15 | Diagnose + remediate + document | submit_resolution |

Scenarios (3 per task = 9 total episodes)

| ID | Incident Type | Root Cause | Why It's Hard |
| --- | --- | --- | --- |
| AC-001 | DB connection pool exhaustion | n/a | Clear P1: 78% errors, $12k/min revenue loss |
| AC-002 | CDN cache invalidation storm | n/a | Ambiguous P2: degraded but checkout works |
| AC-003 | Recommendation service errors | n/a | Trap P3: 45% errors but zero revenue impact |
| RCA-001 | Postgres OOM kill | analytics-service | Must correlate "analytics export query" in DB logs |
| RCA-002 | BGP network partition | network-infra | No victim log names network-infra (hardest scenario) |
| RCA-003 | Credential rotation bug | config-service | Must trace "secrets rotation" hint to config-service |
| RP-001 | Full OOM remediation | analytics-service | 6-step sequence: disable job → restart chain |
| RP-002 | Full BGP remediation | network-infra | 4-step sequence: restore routes → rollback → verify |
| RP-003 | Full credential fix | config-service | 7-step sequence: rollback → rotate → restart → verify |

🎮 Action Space

Diagnostic Actions (gather evidence)

{"action_type": "query_logs",           "parameters": {"service": "<name>"}}
{"action_type": "check_metrics",        "parameters": {"service": "<name>"}}
{"action_type": "check_dependencies",   "parameters": {"service": "<name>"}}
{"action_type": "check_recent_deploys", "parameters": {"service": "<name>"}}
{"action_type": "check_service_status", "parameters": {"service": "<name>"}}

Remediation Actions (fix the incident)

{"action_type": "restart_service",      "parameters": {"service": "<name>"}}
{"action_type": "rollback_deploy",      "parameters": {"service": "<name>"}}
{"action_type": "scale_service",        "parameters": {"service": "<name>", "replicas": 10}}
{"action_type": "disable_feature_flag", "parameters": {"flag": "<flag_name>"}}
{"action_type": "clear_cache",          "parameters": {"service": "<name>"}}
{"action_type": "execute_runbook_step", "parameters": {"runbook_action": "<action>"}}

Submission Actions (end the episode)

{"action_type": "submit_severity",   "parameters": {"severity": "P1|P2|P3|P4", "service": "<name>"}}
{"action_type": "submit_root_cause", "parameters": {"service": "<name>", "failure_mode": "<description>"}}
{"action_type": "submit_resolution", "parameters": {"summary": "<3+ sentence summary>"}}
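
The payload shapes above lend themselves to a small client-side check before anything is sent to the environment. The sketch below is illustrative only: `make_action` and `ACTION_PARAMS` are hypothetical names, and the required-parameter lists are simply read off the examples above rather than taken from the project's Pydantic models.

```python
import json

# Required parameter names per action type, read off the action examples above.
ACTION_PARAMS = {
    # Diagnostic actions
    "query_logs": ("service",),
    "check_metrics": ("service",),
    "check_dependencies": ("service",),
    "check_recent_deploys": ("service",),
    "check_service_status": ("service",),
    # Remediation actions
    "restart_service": ("service",),
    "rollback_deploy": ("service",),
    "scale_service": ("service", "replicas"),
    "disable_feature_flag": ("flag",),
    "clear_cache": ("service",),
    "execute_runbook_step": ("runbook_action",),
    # Submission actions
    "submit_severity": ("severity", "service"),
    "submit_root_cause": ("service", "failure_mode"),
    "submit_resolution": ("summary",),
}


def make_action(action_type: str, **parameters) -> dict:
    """Build an action payload, rejecting unknown types and missing parameters."""
    if action_type not in ACTION_PARAMS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    missing = [name for name in ACTION_PARAMS[action_type] if name not in parameters]
    if missing:
        raise ValueError(f"{action_type} is missing parameters: {missing}")
    return {"action_type": action_type, "parameters": parameters}


print(json.dumps(make_action("query_logs", service="api-gateway")))
```

Validating locally keeps malformed actions from consuming a step, which matters most on alert_classification, where the budget is only 3 steps.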

👁️ Observation Space

| Field | Type | Description |
| --- | --- | --- |
| episode_id | string | Unique episode UUID |
| task_id | string | Active task identifier |
| scenario_id | string | Current scenario (e.g., RCA-001) |
| step_count / max_steps | int | Progress through episode |
| incident_summary | string | Plain-text incident description (no root cause hints) |
| alert | dict | Alert payload with severity, symptoms, affected services |
| available_actions | list | Valid action types for this task |
| queried_data | dict | All evidence gathered so far |
| known_services | list | Exact service names valid for actions |
| cumulative_reward | float | Running reward total |
| done | bool | Episode terminal flag |
| feedback | string | Per-step feedback explaining reward |
| last_action_error | string? | Error message if last action was invalid |
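
For agent code that consumes observations, a typed container makes the fields above easier to work with. This is a standard-library stand-in, not the project's actual Pydantic model in server/models.py; the default values and the `steps_remaining` helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Observation:
    """Client-side mirror of the observation fields listed in the table."""

    episode_id: str
    task_id: str
    scenario_id: str
    step_count: int
    max_steps: int
    incident_summary: str
    alert: dict[str, Any]
    available_actions: list[str]
    # Defaults below are assumptions; the server may always send these fields.
    queried_data: dict[str, Any] = field(default_factory=dict)
    known_services: list[str] = field(default_factory=list)
    cumulative_reward: float = 0.0
    done: bool = False
    feedback: str = ""
    last_action_error: Optional[str] = None

    @property
    def steps_remaining(self) -> int:
        """Hypothetical helper: steps left before the episode times out."""
        return max(self.max_steps - self.step_count, 0)


# Illustrative values only; real payloads come from /reset and /step.
example = Observation(
    episode_id="00000000-0000-0000-0000-000000000000",
    task_id="root_cause_analysis",
    scenario_id="RCA-001",
    step_count=1,
    max_steps=10,
    incident_summary="(incident text)",
    alert={},
    available_actions=["query_logs", "submit_root_cause"],
)
print(example.steps_remaining)
```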

💰 Reward Function

Dense reward shaping throughout the trajectory, not just terminal scoring.

Per-Step Rewards

| Event | Easy | Medium | Hard |
| --- | --- | --- | --- |
| Query new service (first time) | +0.04 | +0.04 | +0.03 |
| Query new action on known service | +0.02 | +0.02 | +0.01 |
| Repeat exact same query | −0.03 | −0.04 | −0.03 |
| Query unknown service | −0.06 | −0.06 | −0.05 |
| Correct remediation action | n/a | +0.06 | +0.06 |
| Wrong remediation action | −0.08 | −0.10 | −0.15 |
| Step past halfway (non-submit) | −0.04 | −0.02 | −0.02 |
| Timeout without submission | −0.15 | −0.15 | −0.20 |
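
To get a feel for how these values accumulate over a trajectory, the table can be encoded as a lookup. The sketch below assumes the per-step rewards simply add up; the event names are hypothetical labels, and how the real environment combines overlapping events (for example, a new query made past the halfway point) is not specified here.

```python
# Per-step shaping values copied from the table above: event -> difficulty -> reward.
STEP_REWARDS = {
    "new_service": {"easy": 0.04, "medium": 0.04, "hard": 0.03},
    "new_action": {"easy": 0.02, "medium": 0.02, "hard": 0.01},
    "repeat_query": {"easy": -0.03, "medium": -0.04, "hard": -0.03},
    "unknown_service": {"easy": -0.06, "medium": -0.06, "hard": -0.05},
    # No easy entry: the table lists no correct-remediation reward for that tier.
    "correct_remediation": {"medium": 0.06, "hard": 0.06},
    "wrong_remediation": {"easy": -0.08, "medium": -0.10, "hard": -0.15},
    "past_halfway": {"easy": -0.04, "medium": -0.02, "hard": -0.02},
    "timeout": {"easy": -0.15, "medium": -0.15, "hard": -0.20},
}


def shaped_reward(events: list[str], difficulty: str) -> float:
    """Sum table values for a sequence of events (assumes purely additive shaping)."""
    total = sum(STEP_REWARDS[event][difficulty] for event in events)
    return round(total, 2)  # round away floating-point noise


# A medium episode: two useful queries, one repeated query, one correct fix.
print(shaped_reward(["new_service", "new_action", "repeat_query", "correct_remediation"], "medium"))
```

Under this additive reading, a single wrong remediation on the hard tier (−0.15) cancels five first-time service queries (+0.03 each).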

Grader Scoring (terminal, deterministic)

| Task | Scoring Logic |
| --- | --- |
| alert_classification | 1.0 exact · 0.5 adjacent · 0.25 two-off · 0.0 wrong |
| root_cause_analysis | Up to 0.6 base (service + failure mode) + up to 0.4 efficiency bonus. Wrong service: 0.05–0.20 based on investigation effort |
| remediation_planning | Scaled base (0.10–0.50 by investigation depth) + 0.30 efficiency − up to 0.30 wrong-action penalty + 0.10 summary quality |
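
The alert_classification rule is simple enough to sketch directly. The version below assumes severities are ordered P1 to P4 and that "wrong" means three levels off or an unrecognised label; `grade_severity` is a hypothetical name, and graders.py remains the authoritative implementation.

```python
SEVERITIES = ["P1", "P2", "P3", "P4"]

# Score by how many severity levels the prediction is from the expected label.
SCORE_BY_DISTANCE = {0: 1.0, 1: 0.5, 2: 0.25}


def grade_severity(predicted: str, expected: str) -> float:
    """Return 1.0 exact, 0.5 adjacent, 0.25 two-off, 0.0 otherwise."""
    if predicted not in SEVERITIES or expected not in SEVERITIES:
        return 0.0  # assumption: unknown labels score as wrong
    distance = abs(SEVERITIES.index(predicted) - SEVERITIES.index(expected))
    return SCORE_BY_DISTANCE.get(distance, 0.0)


for guess in SEVERITIES:
    print(guess, grade_severity(guess, "P1"))
```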

🔌 API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | / | Gradio UI: interactive environment demo |
| GET | /health | Returns {"status":"ok","version":"0.1.0"} |
| POST | /reset | Start new episode (accepts task_id, scenario_index) |
| POST | /step | Submit action → returns observation, reward, done, info |
| GET | /state | Full current episode state with action history |
| GET | /tasks | All tasks with action schemas |
| GET | /grader | Score current episode (0.0–1.0) with breakdown |

🚀 Setup & Usage

Local Development

pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860

Docker

docker build -t cloud-incident-env .
docker run -p 7860:7860 cloud-incident-env

Run Baseline Inference

export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="your_token"
python inference.py

Quick API Test

# Reset
curl -X POST "http://localhost:7860/reset?task_id=alert_classification&scenario_index=0"

# Step
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type":"query_logs","parameters":{"service":"api-gateway"}}'

# Grade
curl http://localhost:7860/grader
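
The same reset, step, and grade sequence can be driven from Python using only the standard library. This is a sketch under assumptions: the helper names are hypothetical, the base URL matches the local port used above, and every endpoint is assumed to return a JSON body.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:7860"  # local uvicorn/Docker port used above


def reset_request(task_id: str, scenario_index: int) -> urllib.request.Request:
    """POST /reset with task_id and scenario_index as query parameters."""
    query = urllib.parse.urlencode({"task_id": task_id, "scenario_index": scenario_index})
    return urllib.request.Request(f"{BASE_URL}/reset?{query}", method="POST")


def step_request(action: dict) -> urllib.request.Request:
    """POST /step with the action as a JSON body."""
    return urllib.request.Request(
        f"{BASE_URL}/step",
        data=json.dumps(action).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def quick_test() -> None:
    """Mirror the curl sequence: reset, one diagnostic step, then grade."""
    print(send(reset_request("alert_classification", 0)))
    action = {"action_type": "query_logs", "parameters": {"service": "api-gateway"}}
    print(send(step_request(action)))
    print(send(urllib.request.Request(f"{BASE_URL}/grader")))
```

Call `quick_test()` once the server is up; the request builders can be inspected without any network access.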

📁 Project Structure

.
├── Dockerfile              # Container build
├── README.md               # This file
├── requirements.txt        # Python dependencies
├── openenv.yaml            # OpenEnv metadata + task definitions
├── inference.py            # Baseline agent (OpenAI client + smart fallback)
├── tasks.py                # 9 scenarios across 3 difficulty levels
├── graders.py              # Deterministic graders (0.0–1.0)
└── server/
    ├── __init__.py
    ├── app.py              # FastAPI + Gradio endpoints
    ├── environment.py      # Core step()/reset()/state() logic
    └── models.py           # Typed Pydantic models (Action, Observation, Reward)

✅ Validation

# OpenEnv spec validation
openenv validate    # → [OK] Ready for multi-mode deployment

# Docker build
docker build -t cloud-incident-env .    # → builds successfully

# Health check
curl http://localhost:7860/health       # → {"status":"ok","version":"0.1.0"}

Team