Spaces:
Sleeping
Sleeping
| title: Cloud Incident Response OpenEnv | |
| emoji: ๐จ | |
| colorFrom: red | |
| colorTo: yellow | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: false | |
| tags: | |
| - openenv | |
| - sre | |
| - cloud | |
| - incident-response | |
| - devops | |
| - real-world | |
| - agentic | |
| # โ๏ธ Cloud Incident Response โ OpenEnv Environment | |
| An OpenEnv environment for training and evaluating AI agents on **cloud SRE incident response** โ the real-world on-call workflow that engineers at every cloud company perform daily. | |
| Distinct from Kubernetes operations environments: this focuses on **cross-service cascading failures** in distributed microservice architectures โ connection pool exhaustion, CDN cache storms, OOM kills, credential rotation failures, and BGP network partitions. | |
| ## Authors | |
| - **Einstein** โ Environment Design & Grader Implementation (https://huggingface.co/Elliot89) | |
| - **Sidra** โ Scenario Design & Testing (https://huggingface.co/sidraaiman1809) | |
| --- | |
| ## ๐ฏ Why This Environment | |
| Every cloud company employs SREs who respond to production incidents under time pressure with incomplete information. This environment simulates the exact decision loop: | |
| | Phase | What the Agent Does | | |
| |---|---| | |
| | **Triage** | Read alert, assess blast radius, classify severity (P1โP4) | | |
| | **Investigate** | Query logs, metrics, dependencies, recent deploys | | |
| | **Diagnose** | Correlate signals across services to find root cause | | |
| | **Remediate** | Execute correct runbook steps in the right sequence | | |
| | **Document** | Submit resolution summary for post-incident review | | |
| Agents trained here learn the same skills a human SRE develops: service dependency traversal, log correlation, cascading failure analysis, and targeted remediation. | |
| --- | |
| ## ๐ Baseline Scores | |
| Using `Llama 3.1 8B Instruct` ยท deterministic (`temperature=0.0`) ยท fully reproducible | |
| | Task | Difficulty | S0 | S1 | S2 | Average | | |
| |---|---|---|---|---|---| | |
| | `alert_classification` | ๐ข Easy | 1.00 | 1.00 | 1.00 | **1.00** | | |
| | `root_cause_analysis` | ๐ก Medium | 1.00 | 0.20 | 1.00 | **0.73** | | |
| | `remediation_planning` | ๐ด Hard | 0.60 | 0.45 | 0.59 | **0.55** | | |
| | **Overall** | | | | | **0.76** | | |
| ### Score Interpretation | |
| ``` | |
| Easy 1.00 โโโโโโโโโโโโโโโโโโโโ Clear metrics โ straightforward classification | |
| Medium 0.73 โโโโโโโโโโโโโโโ Root cause hidden โ model fails on BGP scenario (S1=0.20) | |
| Hard 0.55 โโโโโโโโโโโ Multi-phase execution with wrong-action penalties | |
| ``` | |
| - **Easy โ 1.00:** Alert metrics (error rate, revenue impact) directly indicate severity. An 8B model reliably classifies P1/P2/P3 with 2 diagnostic queries. | |
| - **Medium โ 0.73:** Root cause service is NOT in the alert. Model must investigate beyond the blast radius. Succeeds on OOM and credential scenarios but fails on BGP network partition (S1=0.20) where no victim log names the root cause. | |
| - **Hard โ 0.55:** Same diagnostic challenge as medium PLUS multi-step remediation sequence, wrong-action penalties (โ0.10 each), and documentation quality scoring. Model wastes steps on repeated status checks and sometimes executes counterproductive remediations. | |
| --- | |
| ## ๐๏ธ Tasks | |
| | Task ID | Difficulty | Max Steps | Objective | Submission Action | | |
| |---|---|---|---|---| | |
| | `alert_classification` | ๐ข Easy | 3 | Classify alert severity (P1โP4) | `submit_severity` | | |
| | `root_cause_analysis` | ๐ก Medium | 10 | Find root cause service + failure mode | `submit_root_cause` | | |
| | `remediation_planning` | ๐ด Hard | 15 | Diagnose + remediate + document | `submit_resolution` | | |
| ### Scenarios (3 per task = 9 total episodes) | |
| | ID | Incident Type | Root Cause | Why It's Hard | | |
| |---|---|---|---| | |
| | AC-001 | DB connection pool exhaustion | โ | Clear P1: 78% errors, $12k/min revenue loss | | |
| | AC-002 | CDN cache invalidation storm | โ | Ambiguous P2: degraded but checkout works | | |
| | AC-003 | Recommendation service errors | โ | Trap P3: 45% errors but zero revenue impact | | |
| | RCA-001 | Postgres OOM kill | analytics-service | Must correlate "analytics export query" in DB logs | | |
| | RCA-002 | BGP network partition | network-infra | No victim log names network-infra โ hardest scenario | | |
| | RCA-003 | Credential rotation bug | config-service | Must trace "secrets rotation" hint to config-service | | |
| | RP-001 | Full OOM remediation | analytics-service | 6-step sequence: disable job โ restart chain | | |
| | RP-002 | Full BGP remediation | network-infra | 4-step sequence: restore routes โ rollback โ verify | | |
| | RP-003 | Full credential fix | config-service | 7-step sequence: rollback โ rotate โ restart โ verify | | |
| --- | |
| ## ๐ฎ Action Space | |
| ### Diagnostic Actions (gather evidence) | |
| ```json | |
| {"action_type": "query_logs", "parameters": {"service": "<name>"}} | |
| {"action_type": "check_metrics", "parameters": {"service": "<name>"}} | |
| {"action_type": "check_dependencies", "parameters": {"service": "<name>"}} | |
| {"action_type": "check_recent_deploys", "parameters": {"service": "<name>"}} | |
| {"action_type": "check_service_status", "parameters": {"service": "<name>"}} | |
| ``` | |
| ### Remediation Actions (fix the incident) | |
| ```json | |
| {"action_type": "restart_service", "parameters": {"service": "<name>"}} | |
| {"action_type": "rollback_deploy", "parameters": {"service": "<name>"}} | |
| {"action_type": "scale_service", "parameters": {"service": "<name>", "replicas": 10}} | |
| {"action_type": "disable_feature_flag", "parameters": {"flag": "<flag_name>"}} | |
| {"action_type": "clear_cache", "parameters": {"service": "<name>"}} | |
| {"action_type": "execute_runbook_step", "parameters": {"runbook_action": "<action>"}} | |
| ``` | |
| ### Submission Actions (end the episode) | |
| ```json | |
| {"action_type": "submit_severity", "parameters": {"severity": "P1|P2|P3|P4", "service": "<name>"}} | |
| {"action_type": "submit_root_cause", "parameters": {"service": "<name>", "failure_mode": "<description>"}} | |
| {"action_type": "submit_resolution", "parameters": {"summary": "<3+ sentence summary>"}} | |
| ``` | |
| --- | |
| ## ๐๏ธ Observation Space | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `episode_id` | string | Unique episode UUID | | |
| | `task_id` | string | Active task identifier | | |
| | `scenario_id` | string | Current scenario (e.g., `RCA-001`) | | |
| | `step_count` / `max_steps` | int | Progress through episode | | |
| | `incident_summary` | string | Plain-text incident description (no root cause hints) | | |
| | `alert` | dict | Alert payload with severity, symptoms, affected services | | |
| | `available_actions` | list | Valid action types for this task | | |
| | `queried_data` | dict | All evidence gathered so far | | |
| | `known_services` | list | Exact service names valid for actions | | |
| | `cumulative_reward` | float | Running reward total | | |
| | `done` | bool | Episode terminal flag | | |
| | `feedback` | string | Per-step feedback explaining reward | | |
| | `last_action_error` | string? | Error message if last action was invalid | | |
| --- | |
| ## ๐ฐ Reward Function | |
| Dense reward shaping throughout the trajectory โ not just terminal scoring. | |
| ### Per-Step Rewards | |
| | Event | Easy | Medium | Hard | | |
| |---|---|---|---| | |
| | Query new service (first time) | +0.04 | +0.04 | +0.03 | | |
| | Query new action on known service | +0.02 | +0.02 | +0.01 | | |
| | Repeat exact same query | โ0.03 | โ0.04 | โ0.03 | | |
| | Query unknown service | โ0.06 | โ0.06 | โ0.05 | | |
| | Correct remediation action | โ | +0.06 | +0.06 | | |
| | Wrong remediation action | โ0.08 | โ0.10 | โ0.15 | | |
| | Step past halfway (non-submit) | โ0.04 | โ0.02 | โ0.02 | | |
| | Timeout without submission | โ0.15 | โ0.15 | โ0.20 | | |
| ### Grader Scoring (terminal, deterministic) | |
| | Task | Scoring Logic | | |
| |---|---| | |
| | `alert_classification` | 1.0 exact ยท 0.5 adjacent ยท 0.25 two-off ยท 0.0 wrong | | |
| | `root_cause_analysis` | Up to 0.6 base (service + failure mode) + up to 0.4 efficiency bonus. Wrong service: 0.05โ0.20 based on investigation effort | | |
| | `remediation_planning` | Scaled base (0.10โ0.50 by investigation depth) + 0.30 efficiency โ up to 0.30 wrong-action penalty + 0.10 summary quality | | |
| --- | |
| ## ๐ API Endpoints | |
| | Method | Path | Description | | |
| |---|---|---| | |
| | `GET` | `/` | Gradio UI โ interactive environment demo | | |
| | `GET` | `/health` | `{"status":"ok","version":"0.1.0"}` | | |
| | `POST` | `/reset` | Start new episode (accepts `task_id`, `scenario_index`) | | |
| | `POST` | `/step` | Submit action โ returns observation, reward, done, info | | |
| | `GET` | `/state` | Full current episode state with action history | | |
| | `GET` | `/tasks` | All tasks with action schemas | | |
| | `GET` | `/grader` | Score current episode (0.0โ1.0) with breakdown | | |
| --- | |
| ## ๐ Setup & Usage | |
| ### Local Development | |
| ```bash | |
| pip install -r requirements.txt | |
| uvicorn server.app:app --host 0.0.0.0 --port 7860 | |
| ``` | |
| ### Docker | |
| ```bash | |
| docker build -t cloud-incident-env . | |
| docker run -p 7860:7860 cloud-incident-env | |
| ``` | |
| ### Run Baseline Inference | |
| ```bash | |
| export API_BASE_URL="https://router.huggingface.co/v1" | |
| export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" | |
| export HF_TOKEN="your_token" | |
| python inference.py | |
| ``` | |
| ### Quick API Test | |
| ```bash | |
| # Reset | |
| curl -X POST "http://localhost:7860/reset?task_id=alert_classification&scenario_index=0" | |
| # Step | |
| curl -X POST http://localhost:7860/step \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"action_type":"query_logs","parameters":{"service":"api-gateway"}}' | |
| # Grade | |
| curl http://localhost:7860/grader | |
| ``` | |
| --- | |
| ## ๐ Project Structure | |
| ``` | |
| . | |
| โโโ Dockerfile # Container build | |
| โโโ README.md # This file | |
| โโโ requirements.txt # Python dependencies | |
| โโโ openenv.yaml # OpenEnv metadata + task definitions | |
| โโโ inference.py # Baseline agent (OpenAI client + smart fallback) | |
| โโโ tasks.py # 9 scenarios across 3 difficulty levels | |
| โโโ graders.py # Deterministic graders (0.0โ1.0) | |
| โโโ server/ | |
| โโโ __init__.py | |
| โโโ app.py # FastAPI + Gradio endpoints | |
| โโโ environment.py # Core step()/reset()/state() logic | |
| โโโ models.py # Typed Pydantic models (Action, Observation, Reward) | |
| ``` | |
| --- | |
| ## โ Validation | |
| ```bash | |
| # OpenEnv spec validation | |
| openenv validate # โ [OK] Ready for multi-mode deployment | |
| # Docker build | |
| docker build -t cloud-incident-env . # โ builds successfully | |
| # Health check | |
| curl http://localhost:7860/health # โ {"status":"ok","version":"0.1.0"} | |
| ``` | |
| ## Team | |
| - **Einstein** โ [@MrEinsteinE](https://github.com/MrEinsteinE) | |
| - **Sidra** โ [@sidraaiman](https://github.com/sidraaiman) |