title: Cloud Incident Response OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - sre
  - cloud
  - incident-response
  - devops
  - real-world
  - agentic

☁️ Cloud Incident Response — OpenEnv Environment

An OpenEnv environment for training and evaluating AI agents on cloud SRE incident response: the on-call workflow that site reliability engineers at cloud companies carry out routinely.

Unlike Kubernetes operations environments, this one focuses on cross-service cascading failures in distributed microservice architectures: connection pool exhaustion, CDN cache storms, OOM kills, credential rotation failures, and BGP network partitions.

Authors

Einstein — Environment Design & Grader Implementation (https://huggingface.co/Elliot89)
Sidra — Scenario Design & Testing (https://huggingface.co/sidraaiman1809)

🎯 Why This Environment

SREs respond to production incidents under time pressure and with incomplete information. This environment simulates that decision loop in five phases:

Phase	What the Agent Does
Triage	Read alert, assess blast radius, classify severity (P1–P4)
Investigate	Query logs, metrics, dependencies, recent deploys
Diagnose	Correlate signals across services to find root cause
Remediate	Execute correct runbook steps in the right sequence
Document	Submit resolution summary for post-incident review

Agents trained here learn the same skills a human SRE develops: service dependency traversal, log correlation, cascading failure analysis, and targeted remediation.

📊 Baseline Scores

Using Llama 3.1 8B Instruct · deterministic (temperature=0.0) · fully reproducible

Task	Difficulty	S0	S1	S2	Average
`alert_classification`	🟢 Easy	1.00	1.00	1.00	1.00
`root_cause_analysis`	🟡 Medium	1.00	0.20	1.00	0.73
`remediation_planning`	🔴 Hard	0.60	0.45	0.59	0.55
Overall					0.76
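
The averages in the table can be re-derived from the per-scenario scores. The short check below assumes, since the README does not say so explicitly, that the overall figure is the unweighted mean of the three task averages.

```python
from statistics import mean

# Per-scenario baseline scores copied from the table above (S0, S1, S2).
scores = {
    "alert_classification": [1.00, 1.00, 1.00],
    "root_cause_analysis": [1.00, 0.20, 1.00],
    "remediation_planning": [0.60, 0.45, 0.59],
}

# Average each task over its three scenarios.
task_averages = {task: mean(values) for task, values in scores.items()}

# Assumption: "Overall" is the plain mean of the three task averages.
overall = mean(task_averages.values())

for task, average in task_averages.items():
    print(f"{task}: {average:.2f}")
print(f"overall: {overall:.2f}")
```

Rounded to two decimals, this reproduces the 1.00, 0.73, 0.55, and 0.76 figures shown in the table.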

Score Interpretation

Easy   1.00 ████████████████████  Clear metrics → straightforward classification
Medium 0.73 ██████████████▌       Root cause hidden — model fails on BGP scenario (S1=0.20)
Hard   0.55 ███████████           Multi-phase execution with wrong-action penalties

Easy → 1.00: Alert metrics (error rate, revenue impact) directly indicate severity. An 8B model reliably classifies P1/P2/P3 with 2 diagnostic queries.
Medium → 0.73: Root cause service is NOT in the alert. Model must investigate beyond the blast radius. Succeeds on OOM and credential scenarios but fails on BGP network partition (S1=0.20) where no victim log names the root cause.
Hard → 0.55: Same diagnostic challenge as medium PLUS multi-step remediation sequence, wrong-action penalties (−0.10 each), and documentation quality scoring. Model wastes steps on repeated status checks and sometimes executes counterproductive remediations.

🏗️ Tasks

Task ID	Difficulty	Max Steps	Objective	Submission Action
`alert_classification`	🟢 Easy	3	Classify alert severity (P1–P4)	`submit_severity`
`root_cause_analysis`	🟡 Medium	10	Find root cause service + failure mode	`submit_root_cause`
`remediation_planning`	🔴 Hard	15	Diagnose + remediate + document	`submit_resolution`

Scenarios (3 per task = 9 total episodes)

ID	Incident Type	Root Cause	Why It's Hard
AC-001	DB connection pool exhaustion	—	Clear P1: 78% errors, $12k/min revenue loss
AC-002	CDN cache invalidation storm	—	Ambiguous P2: degraded but checkout works
AC-003	Recommendation service errors	—	Trap P3: 45% errors but zero revenue impact
RCA-001	Postgres OOM kill	analytics-service	Must correlate "analytics export query" in DB logs
RCA-002	BGP network partition	network-infra	No victim log names network-infra — hardest scenario
RCA-003	Credential rotation bug	config-service	Must trace "secrets rotation" hint to config-service
RP-001	Full OOM remediation	analytics-service	6-step sequence: disable job → restart chain
RP-002	Full BGP remediation	network-infra	4-step sequence: restore routes → rollback → verify
RP-003	Full credential fix	config-service	7-step sequence: rollback → rotate → restart → verify

🎮 Action Space

Diagnostic Actions (gather evidence)

{"action_type": "query_logs",           "parameters": {"service": "<name>"}}
{"action_type": "check_metrics",        "parameters": {"service": "<name>"}}
{"action_type": "check_dependencies",   "parameters": {"service": "<name>"}}
{"action_type": "check_recent_deploys", "parameters": {"service": "<name>"}}
{"action_type": "check_service_status", "parameters": {"service": "<name>"}}

Remediation Actions (fix the incident)

{"action_type": "restart_service",      "parameters": {"service": "<name>"}}
{"action_type": "rollback_deploy",      "parameters": {"service": "<name>"}}
{"action_type": "scale_service",        "parameters": {"service": "<name>", "replicas": 10}}
{"action_type": "disable_feature_flag", "parameters": {"flag": "<flag_name>"}}
{"action_type": "clear_cache",          "parameters": {"service": "<name>"}}
{"action_type": "execute_runbook_step", "parameters": {"runbook_action": "<action>"}}

Submission Actions (end the episode)

{"action_type": "submit_severity",   "parameters": {"severity": "P1|P2|P3|P4", "service": "<name>"}}
{"action_type": "submit_root_cause", "parameters": {"service": "<name>", "failure_mode": "<description>"}}
{"action_type": "submit_resolution", "parameters": {"summary": "<3+ sentence summary>"}}
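
An agent may want to build these payloads programmatically. The helper below is a minimal sketch: the action type names come from the lists above, while `make_action` and `is_terminal` are hypothetical names that are not part of this repository.

```python
import json

# Action types copied from the three lists above.
DIAGNOSTIC = {"query_logs", "check_metrics", "check_dependencies",
              "check_recent_deploys", "check_service_status"}
REMEDIATION = {"restart_service", "rollback_deploy", "scale_service",
               "disable_feature_flag", "clear_cache", "execute_runbook_step"}
SUBMISSION = {"submit_severity", "submit_root_cause", "submit_resolution"}


def make_action(action_type: str, **parameters) -> dict:
    """Build an action payload; reject action types not listed in this README."""
    if action_type not in DIAGNOSTIC | REMEDIATION | SUBMISSION:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return {"action_type": action_type, "parameters": parameters}


def is_terminal(action: dict) -> bool:
    """Submission actions end the episode."""
    return action["action_type"] in SUBMISSION


# "api-gateway" is the service name used in the Quick API Test section.
action = make_action("scale_service", service="api-gateway", replicas=10)
print(json.dumps(action))
```

This sketch only checks the action type. Which parameters each action requires, and which service names are valid, is decided by the server (see `known_services` in the observation).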

👁️ Observation Space

Field	Type	Description
`episode_id`	string	Unique episode UUID
`task_id`	string	Active task identifier
`scenario_id`	string	Current scenario (e.g., `RCA-001`)
`step_count` / `max_steps`	int	Progress through episode
`incident_summary`	string	Plain-text incident description (no root cause hints)
`alert`	dict	Alert payload with severity, symptoms, affected services
`available_actions`	list	Valid action types for this task
`queried_data`	dict	All evidence gathered so far
`known_services`	list	Exact service names valid for actions
`cumulative_reward`	float	Running reward total
`done`	bool	Episode terminal flag
`feedback`	string	Per-step feedback explaining reward
`last_action_error`	string?	Error message if last action was invalid
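
On the client side, the fields above could be mirrored in a small typed container. This is a sketch using a standard-library dataclass, not the Pydantic model in `server/models.py`. It assumes `step_count` and `max_steps` arrive as two separate integer fields, and the sample values are placeholders.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class Observation:
    """Illustrative client-side container for the observation fields above."""
    episode_id: str
    task_id: str
    scenario_id: str
    step_count: int
    max_steps: int
    incident_summary: str
    alert: dict[str, Any]
    available_actions: list[str]
    queried_data: dict[str, Any] = field(default_factory=dict)
    known_services: list[str] = field(default_factory=list)
    cumulative_reward: float = 0.0
    done: bool = False
    feedback: str = ""
    last_action_error: Optional[str] = None

    @classmethod
    def from_dict(cls, payload: dict[str, Any]) -> "Observation":
        """Build an Observation, ignoring keys this sketch does not know."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})

    def steps_remaining(self) -> int:
        """Steps left before the episode times out."""
        return self.max_steps - self.step_count


# Placeholder payload; real values come from /reset and /step.
sample = {
    "episode_id": "00000000-0000-0000-0000-000000000000",
    "task_id": "root_cause_analysis",
    "scenario_id": "RCA-001",
    "step_count": 3,
    "max_steps": 10,
    "incident_summary": "placeholder summary",
    "alert": {},
    "available_actions": ["query_logs", "submit_root_cause"],
    "known_services": ["api-gateway"],
}
obs = Observation.from_dict(sample)
print(obs.steps_remaining())
```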

💰 Reward Function

Dense reward shaping throughout the trajectory — not just terminal scoring.

Per-Step Rewards

Event	Easy	Medium	Hard
Query new service (first time)	+0.04	+0.04	+0.03
Query new action on known service	+0.02	+0.02	+0.01
Repeat exact same query	−0.03	−0.04	−0.03
Query unknown service	−0.06	−0.06	−0.05
Correct remediation action	—	+0.06	+0.06
Wrong remediation action	−0.08	−0.10	−0.15
Step past halfway (non-submit)	−0.04	−0.02	−0.02
Timeout without submission	−0.15	−0.15	−0.20
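
The table can be read as a lookup from event and difficulty to a reward delta. The sketch below encodes it that way. The event keys are hypothetical labels, and it assumes deltas simply add up into `cumulative_reward`; the README does not say whether several events can apply to one step.

```python
# Per-step rewards copied from the table above.
# None marks the cell the table leaves empty.
STEP_REWARDS = {
    "new_service":              {"easy": 0.04,  "medium": 0.04,  "hard": 0.03},
    "new_action_known_service": {"easy": 0.02,  "medium": 0.02,  "hard": 0.01},
    "repeat_query":             {"easy": -0.03, "medium": -0.04, "hard": -0.03},
    "unknown_service":          {"easy": -0.06, "medium": -0.06, "hard": -0.05},
    "correct_remediation":      {"easy": None,  "medium": 0.06,  "hard": 0.06},
    "wrong_remediation":        {"easy": -0.08, "medium": -0.10, "hard": -0.15},
    "past_halfway_non_submit":  {"easy": -0.04, "medium": -0.02, "hard": -0.02},
    "timeout":                  {"easy": -0.15, "medium": -0.15, "hard": -0.20},
}


def step_reward(event: str, difficulty: str) -> float:
    """Look up the reward delta for one event at one difficulty."""
    value = STEP_REWARDS[event][difficulty]
    if value is None:
        raise ValueError(f"{event!r} has no reward defined for {difficulty!r}")
    return value


def trajectory_reward(events: list[str], difficulty: str) -> float:
    """Sum the deltas for a sequence of events (assumes simple addition)."""
    return round(sum(step_reward(e, difficulty) for e in events), 2)


events = ["new_service", "new_action_known_service",
          "repeat_query", "correct_remediation"]
print(trajectory_reward(events, "medium"))
```

For the medium task, this example sequence sums to 0.04 + 0.02 − 0.04 + 0.06 = 0.08.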

Grader Scoring (terminal, deterministic)

Task	Scoring Logic
`alert_classification`	1.0 exact · 0.5 adjacent · 0.25 two-off · 0.0 wrong
`root_cause_analysis`	Up to 0.6 base (service + failure mode) + up to 0.4 efficiency bonus. Wrong service: 0.05–0.20 based on investigation effort
`remediation_planning`	Scaled base (0.10–0.50 by investigation depth) + 0.30 efficiency − up to 0.30 wrong-action penalty + 0.10 summary quality
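
The `alert_classification` rule is simple enough to sketch. This is not the code in `graders.py`. It assumes "adjacent" means one severity level apart, "two-off" means two levels apart, and anything further (or an unrecognised label) scores 0.0.

```python
SEVERITIES = ["P1", "P2", "P3", "P4"]

# Score by distance between submitted and expected severity levels.
SCORE_BY_DISTANCE = {0: 1.0, 1: 0.5, 2: 0.25}


def grade_severity(submitted: str, expected: str) -> float:
    """Return 1.0 exact, 0.5 adjacent, 0.25 two-off, 0.0 otherwise."""
    if submitted not in SEVERITIES or expected not in SEVERITIES:
        return 0.0  # assumption: unrecognised labels count as wrong
    distance = abs(SEVERITIES.index(submitted) - SEVERITIES.index(expected))
    return SCORE_BY_DISTANCE.get(distance, 0.0)


for guess in SEVERITIES:
    print(guess, grade_severity(guess, "P1"))
```

With an expected severity of P1, the guesses P1 through P4 score 1.0, 0.5, 0.25, and 0.0 respectively.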

🔌 API Endpoints

Method	Path	Description
`GET`	`/`	Gradio UI — interactive environment demo
`GET`	`/health`	`{"status":"ok","version":"0.1.0"}`
`POST`	`/reset`	Start new episode (accepts `task_id`, `scenario_index`)
`POST`	`/step`	Submit action → returns observation, reward, done, info
`GET`	`/state`	Full current episode state with action history
`GET`	`/tasks`	All tasks with action schemas
`GET`	`/grader`	Score current episode (0.0–1.0) with breakdown

🚀 Setup & Usage

Local Development

pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860

Docker

docker build -t cloud-incident-env .
docker run -p 7860:7860 cloud-incident-env

Run Baseline Inference

export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="your_token"
python inference.py

Quick API Test

# Reset
curl -X POST "http://localhost:7860/reset?task_id=alert_classification&scenario_index=0"

# Step
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type":"query_logs","parameters":{"service":"api-gateway"}}'

# Grade
curl http://localhost:7860/grader
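
The same three calls can be made from Python with only the standard library. This is a sketch that mirrors the curl commands above. The helper names are hypothetical, and it assumes every endpoint returns a JSON body.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:7860"


def reset_request(task_id: str, scenario_index: int) -> Request:
    """POST /reset with task_id and scenario_index as query parameters."""
    query = urlencode({"task_id": task_id, "scenario_index": scenario_index})
    return Request(f"{BASE_URL}/reset?{query}", method="POST")


def step_request(action_type: str, **parameters) -> Request:
    """POST /step with the action as a JSON body."""
    body = json.dumps({"action_type": action_type, "parameters": parameters})
    return Request(f"{BASE_URL}/step", data=body.encode("utf-8"),
                   headers={"Content-Type": "application/json"}, method="POST")


def grader_request() -> Request:
    """GET /grader to score the current episode."""
    return Request(f"{BASE_URL}/grader", method="GET")


def send(request: Request) -> dict:
    """Send a request and decode the response (assumed to be JSON)."""
    with urlopen(request, timeout=10) as response:
        return json.load(response)


def run_quick_test() -> None:
    """Reset, take one step, then grade; requires the server to be running."""
    print(send(reset_request("alert_classification", 0)))
    print(send(step_request("query_logs", service="api-gateway")))
    print(send(grader_request()))
```

Call `run_quick_test()` once the server is listening on port 7860. The request builders are kept separate from `send` so they can be inspected without a live server.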

📁 Project Structure

.
├── Dockerfile              # Container build
├── README.md               # This file
├── requirements.txt        # Python dependencies
├── openenv.yaml            # OpenEnv metadata + task definitions
├── inference.py            # Baseline agent (OpenAI client + smart fallback)
├── tasks.py                # 9 scenarios across 3 difficulty levels
├── graders.py              # Deterministic graders (0.0–1.0)
└── server/
    ├── __init__.py
    ├── app.py              # FastAPI + Gradio endpoints
    ├── environment.py      # Core step()/reset()/state() logic
    └── models.py           # Typed Pydantic models (Action, Observation, Reward)

✅ Validation

# OpenEnv spec validation
openenv validate    # → [OK] Ready for multi-mode deployment

# Docker build
docker build -t cloud-incident-env .    # → builds successfully

# Health check
curl http://localhost:7860/health       # → {"status":"ok","version":"0.1.0"}

Team

Einstein — @MrEinsteinE
Sidra — @sidraaiman