---
title: Cloud Incident Response OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
tags:
- openenv
- sre
- cloud
- incident-response
- devops
- real-world
- agentic
---
# Cloud Incident Response: OpenEnv Environment
An OpenEnv environment for training and evaluating AI agents on **cloud SRE incident response**, the real-world on-call workflow that engineers at every cloud company perform daily.
Unlike Kubernetes operations environments, this one focuses on **cross-service cascading failures** in distributed microservice architectures: connection pool exhaustion, CDN cache storms, OOM kills, and BGP network partitions.
## Why This Environment
Every cloud company employs SREs who respond to production incidents under time pressure with incomplete information. This environment simulates the exact decision loop:
1. **Triage**: Read the alert, assess blast radius, classify severity (P1–P4)
2. **Investigate**: Query logs, metrics, dependencies, recent deploys
3. **Diagnose**: Correlate signals across services to find the root cause
4. **Remediate**: Execute the correct runbook steps in the right sequence
5. **Document**: Submit a resolution summary for post-incident review
Agents trained here learn the same skills a human SRE uses: service dependency traversal, log correlation, cascading failure analysis, and targeted remediation.
## Tasks
| Task ID | Difficulty | Max Steps | What the Agent Does |
|---|---|---|---|
| `alert_classification` | Easy | 3 | Classify alert severity (P1–P4) from metrics and symptoms |
| `root_cause_analysis` | Medium | 10 | Trace logs/metrics/deps to find root cause service and failure mode |
| `remediation_planning` | Hard | 15 | Diagnose, remediate, and document full incident resolution |
### Scenarios
| ID | Incident Type | Failure Pattern |
|---|---|---|
| AC-001 | DB connection pool exhaustion | postgres-db → auth-service → api-gateway cascade |
| AC-002 | CDN cache invalidation storm | Misconfigured purge job → 40× origin traffic |
| RCA-001 | Postgres OOM kill | Runaway analytics query → kernel OOM → all dependents down |
| RCA-002 | BGP network partition | Route withdrawal → AZ isolation → 61% checkout failures |
| RP-001 | Full OOM remediation | Disable job → restart DB → restore services → document |
| RP-002 | Full BGP remediation | Restore routes → rollback config → verify recovery → document |
## Action Space
**Diagnostic actions** (gather evidence):
```json
{"action_type": "query_logs", "parameters": {"service": "postgres-db"}}
{"action_type": "check_metrics", "parameters": {"service": "auth-service"}}
{"action_type": "check_dependencies", "parameters": {"service": "api-gateway"}}
{"action_type": "check_recent_deploys", "parameters": {"service": "analytics-service"}}
{"action_type": "check_service_status", "parameters": {"service": "payment-service"}}
```
**Remediation actions** (fix the incident):
```json
{"action_type": "restart_service", "parameters": {"service": "postgres-db"}}
{"action_type": "rollback_deploy", "parameters": {"service": "network-infra"}}
{"action_type": "scale_service", "parameters": {"service": "image-service", "replicas": 10}}
{"action_type": "disable_feature_flag", "parameters": {"flag": "full_history_export"}}
{"action_type": "execute_runbook_step", "parameters": {"runbook_action": "restore_bgp_routes"}}
```
**Submission actions** (end the episode):
```json
{"action_type": "submit_severity", "parameters": {"severity": "P1", "service": "postgres-db"}}
{"action_type": "submit_root_cause", "parameters": {"service": "analytics-service", "failure_mode": "unbounded query OOM"}}
{"action_type": "submit_resolution", "parameters": {"summary": "Disabled analytics job, restarted postgres-db..."}}
```
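All of the payloads above share one shape: an `action_type` string plus a `parameters` object. The following is a minimal Python sketch for building them. The helper names `make_action` and `ends_episode` and the three grouping sets are illustrative assumptions, not part of the environment's code; the action type strings themselves are copied from the examples above.

```python
import json

# Action types copied from the examples above. Grouping them into sets is
# an illustrative convenience, not something the environment exposes.
DIAGNOSTIC = {"query_logs", "check_metrics", "check_dependencies",
              "check_recent_deploys", "check_service_status"}
REMEDIATION = {"restart_service", "rollback_deploy", "scale_service",
               "disable_feature_flag", "execute_runbook_step"}
SUBMISSION = {"submit_severity", "submit_root_cause", "submit_resolution"}


def make_action(action_type: str, **parameters) -> dict:
    """Build an action in the {"action_type", "parameters"} shape (hypothetical helper)."""
    if action_type not in DIAGNOSTIC | REMEDIATION | SUBMISSION:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return {"action_type": action_type, "parameters": parameters}


def ends_episode(action: dict) -> bool:
    """Submission actions are the ones documented as ending the episode."""
    return action["action_type"] in SUBMISSION


action = make_action("scale_service", service="image-service", replicas=10)
print(json.dumps(action))
```

Which action types a given task accepts is reported per episode in the `available_actions` observation field, so a real agent would likely check that list rather than a hard-coded set.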
## Observation Space
| Field | Type | Description |
|---|---|---|
| `episode_id` | string | Unique episode UUID |
| `task_id` | string | Active task |
| `scenario_id` | string | Scenario (e.g. `AC-001`) |
| `step_count` / `max_steps` | int | Current progress and budget |
| `incident_summary` | string | Plain-text incident description |
| `alert` | dict | Alert payload with severity, symptoms, affected services |
| `available_actions` | list[str] | Valid action types for this task |
| `queried_data` | dict | All tool responses gathered so far |
| `cumulative_reward` | float | Running reward total |
| `done` | bool | Episode terminal flag |
| `feedback` | string | Per-step feedback |
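For client code, the fields in the table can be mirrored in a small typed container. This is a sketch under assumptions: the class name `Observation`, the `steps_remaining` helper, the default values, and every sample value below are hypothetical, and the server may nest or name its JSON differently.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Observation:
    """Client-side mirror of the observation table (hypothetical class)."""
    episode_id: str
    task_id: str
    scenario_id: str
    step_count: int
    max_steps: int
    incident_summary: str
    alert: dict[str, Any]
    available_actions: list[str]
    queried_data: dict[str, Any] = field(default_factory=dict)
    cumulative_reward: float = 0.0
    done: bool = False
    feedback: str = ""

    @property
    def steps_remaining(self) -> int:
        """Step budget left before the episode hits max_steps."""
        return max(self.max_steps - self.step_count, 0)


# Made-up sample payload for illustration only; the keys inside "alert"
# are guesses based on the table's description.
sample = {
    "episode_id": "00000000-0000-0000-0000-000000000000",
    "task_id": "root_cause_analysis",
    "scenario_id": "RCA-001",
    "step_count": 1,
    "max_steps": 10,
    "incident_summary": "placeholder summary",
    "alert": {"severity": "P1", "symptoms": [], "affected_services": []},
    "available_actions": ["query_logs", "submit_root_cause"],
}
obs = Observation(**sample)
print(obs.steps_remaining)
```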
## Reward Function
The environment emits dense reward signals throughout the trajectory:
| Event | Reward |
|---|---|
| Query known service (first time) | +0.05 |
| Query known service (repeat) | +0.01 |
| Query unknown service | -0.05 |
| Correct remediation action | +0.10 |
| Wrong remediation action | -0.10 |
| Step past halfway (non-submit) | -0.02 |
| Timeout without submission | -0.10 |
| Grader score (terminal step) | 0.0–1.0 |
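To make the table concrete, the sketch below sums the per-step rewards for a short trajectory. The event keys and the function `dense_return` are hypothetical names, and the sketch assumes the rewards simply add up, with the terminal grader score added last; the environment's actual bookkeeping may differ.

```python
# Reward values copied from the table above; the event keys are
# hypothetical labels chosen for this sketch.
STEP_REWARDS = {
    "query_known_first": 0.05,
    "query_known_repeat": 0.01,
    "query_unknown": -0.05,
    "remediation_correct": 0.10,
    "remediation_wrong": -0.10,
    "late_step": -0.02,  # step past halfway that is not a submission
    "timeout": -0.10,
}


def dense_return(events: list[str], grader_score: float = 0.0) -> float:
    """Sum step rewards plus the terminal grader score (assumed additive)."""
    if not 0.0 <= grader_score <= 1.0:
        raise ValueError("grader_score must lie in [0.0, 1.0]")
    return sum(STEP_REWARDS[event] for event in events) + grader_score


trajectory = ["query_known_first", "query_known_first",
              "remediation_correct", "remediation_wrong"]
print(round(dense_return(trajectory, grader_score=0.6), 2))  # 0.7
```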
**Grader scoring** (deterministic, via `GET /grader`):
| Task | Scoring |
|---|---|
| `alert_classification` | 1.0 exact · 0.5 adjacent · 0.25 two-off · 0.0 wrong |
| `root_cause_analysis` | 0.6 base + up to 0.4 efficiency bonus |
| `remediation_planning` | 0.6 base + 0.3 efficiency − 0.15 wrong penalty + 0.1 summary |
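The `alert_classification` row can be read as a score that decays with the distance between the predicted and expected severity. The sketch below is one interpretation, not the grader's actual code: it assumes "adjacent" means one level apart, "two-off" means two levels apart, and anything else (including an unrecognized label) scores 0.0. The function name `severity_score` is hypothetical.

```python
SEVERITIES = ["P1", "P2", "P3", "P4"]

# Score by number of severity levels between prediction and ground truth,
# following the alert_classification row of the grader table.
SCORE_BY_DISTANCE = {0: 1.0, 1: 0.5, 2: 0.25}


def severity_score(predicted: str, expected: str) -> float:
    """Score a severity prediction (assumed reading of the grader table)."""
    if predicted not in SEVERITIES:
        return 0.0  # assumption: unrecognized labels count as wrong
    distance = abs(SEVERITIES.index(predicted) - SEVERITIES.index(expected))
    return SCORE_BY_DISTANCE.get(distance, 0.0)


for guess in SEVERITIES:
    print(guess, severity_score(guess, "P1"))
```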
## API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | `/` | `{"status": "running", ...}` (Space health check) |
| GET | `/health` | `{"status": "ok", "version": "0.1.0"}` |
| POST | `/reset?task_id=...&scenario_index=...` | Start new episode |
| POST | `/step` | Submit action (JSON body) |
| GET | `/state` | Full current episode state |
| GET | `/tasks` | All tasks with action schemas |
| GET | `/grader` | Score current episode (0.0–1.0) |
| POST | `/baseline` | Run inference.py, return scores |
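A minimal standard-library client for these endpoints might look like the sketch below. It assumes the server is reachable at `http://localhost:7860` (the port used elsewhere in this README), that responses are JSON objects, and that `scenario_index=0` is valid. The helper names `reset_url`, `call`, and `classify_once` are hypothetical, and response shapes are not documented here, so the sketch returns raw dicts.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:7860"  # assumption: locally running server


def reset_url(task_id: str, scenario_index: int, base_url: str = BASE_URL) -> str:
    """Build the /reset URL with its two query parameters."""
    query = urllib.parse.urlencode(
        {"task_id": task_id, "scenario_index": scenario_index}
    )
    return f"{base_url}/reset?{query}"


def call(method: str, url: str, payload: Optional[dict] = None) -> dict:
    """Send a request, optionally with a JSON body, and decode the JSON reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def classify_once(severity: str, service: str) -> dict:
    """Reset, submit one severity classification, then fetch the grader result."""
    call("POST", reset_url("alert_classification", 0))
    call("POST", f"{BASE_URL}/step", {
        "action_type": "submit_severity",
        "parameters": {"severity": severity, "service": service},
    })
    return call("GET", f"{BASE_URL}/grader")
```

With the server running, `classify_once("P1", "postgres-db")` would play a one-step `alert_classification` episode; nothing in the sketch contacts the network until one of these functions is called.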
## Setup
```bash
# Local
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860

# Docker
docker build -t cloud-incident-env .
docker run -p 7860:7860 \
  -e API_BASE_URL="https://api-inference.huggingface.co/v1" \
  -e MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
  -e HF_TOKEN="hf_your_token" \
  cloud-incident-env

# Run inference script
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_your_token"
python inference.py
```
## Baseline Scores
Using `meta-llama/Llama-3.1-8B-Instruct` via the HF Inference API:
| Task | Score |
|---|---|
| `alert_classification` | ~0.75 |
| `root_cause_analysis` | ~0.35 |
| `remediation_planning` | ~0.20 |
| **overall** | **~0.43** |
*Run `python inference.py` to reproduce.*