---
title: Cloud Incident Response OpenEnv
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - sre
  - cloud
  - incident-response
  - devops
  - real-world
  - agentic
---

# ☁️ Cloud Incident Response — OpenEnv Environment

An OpenEnv environment for training and evaluating AI agents on **cloud SRE incident response**: the on-call workflow of triaging, diagnosing, and remediating production incidents.

Unlike Kubernetes operations environments, this one focuses on **cross-service cascading failures** in distributed microservice architectures: connection pool exhaustion, CDN cache storms, OOM kills, credential rotation failures, and BGP network partitions.

## Authors

- **Einstein** — Environment Design & Grader Implementation (https://huggingface.co/Elliot89)
- **Sidra** — Scenario Design & Testing (https://huggingface.co/sidraaiman1809)

---

## 🎯 Why This Environment

SREs respond to production incidents under time pressure and with incomplete information. This environment simulates that decision loop in five phases:

| Phase | What the Agent Does |
|---|---|
| **Triage** | Read alert, assess blast radius, classify severity (P1–P4) |
| **Investigate** | Query logs, metrics, dependencies, recent deploys |
| **Diagnose** | Correlate signals across services to find root cause |
| **Remediate** | Execute correct runbook steps in the right sequence |
| **Document** | Submit resolution summary for post-incident review |

Agents trained here practice the skills a human SRE relies on: service dependency traversal, log correlation, cascading failure analysis, and targeted remediation.

---

## 📊 Baseline Scores

Using `Llama 3.1 8B Instruct` · deterministic (`temperature=0.0`) · fully reproducible

| Task | Difficulty | S0 | S1 | S2 | Average |
|---|---|---|---|---|---|
| `alert_classification` | 🟢 Easy | 1.00 | 1.00 | 1.00 | **1.00** |
| `root_cause_analysis` | 🟡 Medium | 1.00 | 0.20 | 1.00 | **0.73** |
| `remediation_planning` | 🔴 Hard | 0.60 | 0.45 | 0.59 | **0.55** |
| **Overall** | | | | | **0.76** |

### Score Interpretation

```
Easy   1.00 ████████████████████  Clear metrics → straightforward classification
Medium 0.73 ██████████████▌       Root cause hidden — model fails on BGP scenario (S1=0.20)
Hard   0.55 ███████████           Multi-phase execution with wrong-action penalties
```

- **Easy → 1.00:** Alert metrics (error rate, revenue impact) directly indicate severity. The 8B baseline classified all three scenarios correctly using 2 diagnostic queries.
- **Medium → 0.73:** The root cause service does not appear in the alert, so the model must investigate beyond the blast radius. It succeeds on the OOM and credential scenarios but fails on the BGP network partition (S1=0.20), where no victim log names the root cause.
- **Hard → 0.55:** The same diagnostic challenge as medium, plus a multi-step remediation sequence, wrong-action penalties (−0.10 each), and documentation quality scoring. The model wastes steps on repeated status checks and sometimes executes counterproductive remediations.
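
The averages in the table follow from the per-scenario scores. A short check using only the numbers reported above; the dictionary layout and variable names are illustrative and not part of the repository, and treating the overall score as an unweighted mean of the task averages is an assumption:

```python
# Per-scenario baseline scores copied from the table above (S0, S1, S2).
scores = {
    "alert_classification": [1.00, 1.00, 1.00],
    "root_cause_analysis": [1.00, 0.20, 1.00],
    "remediation_planning": [0.60, 0.45, 0.59],
}

# Mean score per task, rounded to two decimals as in the table.
task_avg = {task: round(sum(s) / len(s), 2) for task, s in scores.items()}

# Assumption: the overall score is the unweighted mean of the task averages.
overall = round(sum(task_avg.values()) / len(task_avg), 2)

for task, avg in task_avg.items():
    print(f"{task}: {avg:.2f}")
print(f"overall: {overall:.2f}")
```

Because every task has three scenarios, averaging all nine scores directly gives the same overall value.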

---

## 🏗️ Tasks

| Task ID | Difficulty | Max Steps | Objective | Submission Action |
|---|---|---|---|---|
| `alert_classification` | 🟢 Easy | 3 | Classify alert severity (P1–P4) | `submit_severity` |
| `root_cause_analysis` | 🟡 Medium | 10 | Find root cause service + failure mode | `submit_root_cause` |
| `remediation_planning` | 🔴 Hard | 15 | Diagnose + remediate + document | `submit_resolution` |

### Scenarios (3 per task = 9 total episodes)

| ID | Incident Type | Root Cause | Why It's Hard |
|---|---|---|---|
| AC-001 | DB connection pool exhaustion | — | Clear P1: 78% errors, $12k/min revenue loss |
| AC-002 | CDN cache invalidation storm | — | Ambiguous P2: degraded but checkout works |
| AC-003 | Recommendation service errors | — | Trap P3: 45% errors but zero revenue impact |
| RCA-001 | Postgres OOM kill | analytics-service | Must correlate "analytics export query" in DB logs |
| RCA-002 | BGP network partition | network-infra | No victim log names network-infra — hardest scenario |
| RCA-003 | Credential rotation bug | config-service | Must trace "secrets rotation" hint to config-service |
| RP-001 | Full OOM remediation | analytics-service | 6-step sequence: disable job → restart chain |
| RP-002 | Full BGP remediation | network-infra | 4-step sequence: restore routes → rollback → verify |
| RP-003 | Full credential fix | config-service | 7-step sequence: rollback → rotate → restart → verify |

---

## 🎮 Action Space

### Diagnostic Actions (gather evidence)
```json
{"action_type": "query_logs",           "parameters": {"service": "<name>"}}
{"action_type": "check_metrics",        "parameters": {"service": "<name>"}}
{"action_type": "check_dependencies",   "parameters": {"service": "<name>"}}
{"action_type": "check_recent_deploys", "parameters": {"service": "<name>"}}
{"action_type": "check_service_status", "parameters": {"service": "<name>"}}
```

### Remediation Actions (fix the incident)
```json
{"action_type": "restart_service",      "parameters": {"service": "<name>"}}
{"action_type": "rollback_deploy",      "parameters": {"service": "<name>"}}
{"action_type": "scale_service",        "parameters": {"service": "<name>", "replicas": 10}}
{"action_type": "disable_feature_flag", "parameters": {"flag": "<flag_name>"}}
{"action_type": "clear_cache",          "parameters": {"service": "<name>"}}
{"action_type": "execute_runbook_step", "parameters": {"runbook_action": "<action>"}}
```

### Submission Actions (end the episode)
```json
{"action_type": "submit_severity",   "parameters": {"severity": "P1|P2|P3|P4", "service": "<name>"}}
{"action_type": "submit_root_cause", "parameters": {"service": "<name>", "failure_mode": "<description>"}}
{"action_type": "submit_resolution", "parameters": {"summary": "<3+ sentence summary>"}}
```
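
When driving the environment from Python, it can help to build these payloads through one function that catches typos before they reach the server. The sketch below is a hypothetical helper, not code from the repository: `make_action` and `REQUIRED_PARAMS` are invented names, and the assumption that every parameter shown in the examples above is required has not been checked against the server.

```python
import json

# Parameter names per action type, transcribed from the JSON examples above.
# Assumption: every parameter shown there is required.
REQUIRED_PARAMS = {
    "query_logs": {"service"},
    "check_metrics": {"service"},
    "check_dependencies": {"service"},
    "check_recent_deploys": {"service"},
    "check_service_status": {"service"},
    "restart_service": {"service"},
    "rollback_deploy": {"service"},
    "scale_service": {"service", "replicas"},
    "disable_feature_flag": {"flag"},
    "clear_cache": {"service"},
    "execute_runbook_step": {"runbook_action"},
    "submit_severity": {"severity", "service"},
    "submit_root_cause": {"service", "failure_mode"},
    "submit_resolution": {"summary"},
}


def make_action(action_type: str, **parameters) -> dict:
    """Build an action payload; raise ValueError on an unknown type or missing parameter."""
    if action_type not in REQUIRED_PARAMS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    missing = REQUIRED_PARAMS[action_type] - set(parameters)
    if missing:
        raise ValueError(f"{action_type} is missing parameters: {sorted(missing)}")
    return {"action_type": action_type, "parameters": parameters}


# Serialize a diagnostic action as the JSON body for POST /step.
print(json.dumps(make_action("query_logs", service="api-gateway")))
```

The server remains the source of truth: the `available_actions` and `known_services` fields of each observation say what is actually valid for the current task.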

---

## 👁️ Observation Space

| Field | Type | Description |
|---|---|---|
| `episode_id` | string | Unique episode UUID |
| `task_id` | string | Active task identifier |
| `scenario_id` | string | Current scenario (e.g., `RCA-001`) |
| `step_count` / `max_steps` | int | Progress through episode |
| `incident_summary` | string | Plain-text incident description (no root cause hints) |
| `alert` | dict | Alert payload with severity, symptoms, affected services |
| `available_actions` | list | Valid action types for this task |
| `queried_data` | dict | All evidence gathered so far |
| `known_services` | list | Exact service names valid for actions |
| `cumulative_reward` | float | Running reward total |
| `done` | bool | Episode terminal flag |
| `feedback` | string | Per-step feedback explaining reward |
| `last_action_error` | string? | Error message if last action was invalid |
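
On the client side, the fields above can be mirrored in a small typed container. This is a minimal sketch using the standard-library `dataclass` rather than the Pydantic models in `server/models.py`; the element types of the lists and dicts, the default values, and the `steps_remaining` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Observation:
    """Client-side mirror of the observation fields listed in the table above."""

    episode_id: str
    task_id: str
    scenario_id: str
    step_count: int
    max_steps: int
    incident_summary: str
    alert: dict[str, Any]
    available_actions: list[str]  # assumed to be a list of action type names
    known_services: list[str]  # assumed to be a list of service names
    queried_data: dict[str, Any] = field(default_factory=dict)
    cumulative_reward: float = 0.0
    done: bool = False
    feedback: str = ""
    last_action_error: Optional[str] = None

    @property
    def steps_remaining(self) -> int:
        """Hypothetical helper: steps left before the episode times out."""
        return max(self.max_steps - self.step_count, 0)
```

If the server payload contains exactly these keys, `Observation(**payload)` builds the object from a decoded JSON dict.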

---

## 💰 Reward Function

Rewards are shaped densely across the whole trajectory rather than assigned only at the terminal step.

### Per-Step Rewards

| Event | Easy | Medium | Hard |
|---|---|---|---|
| Query new service (first time) | +0.04 | +0.04 | +0.03 |
| Query new action on known service | +0.02 | +0.02 | +0.01 |
| Repeat exact same query | −0.03 | −0.04 | −0.03 |
| Query unknown service | −0.06 | −0.06 | −0.05 |
| Correct remediation action | — | +0.06 | +0.06 |
| Wrong remediation action | −0.08 | −0.10 | −0.15 |
| Step past halfway (non-submit) | −0.04 | −0.02 | −0.02 |
| Timeout without submission | −0.15 | −0.15 | −0.20 |
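
To see how these values add up over an episode, the table can be written as a lookup and summed over a list of events. The event labels and the `shaped_reward` function are hypothetical names chosen for this sketch, and how the environment combines events within a single step (for example, whether the past-halfway penalty stacks with a query reward) is not specified here, so the function simply sums whatever events it is given.

```python
# Per-step reward values transcribed from the table above.
# The event labels are illustrative names, not identifiers from the codebase.
STEP_REWARDS = {
    "new_service_query": {"easy": 0.04, "medium": 0.04, "hard": 0.03},
    "new_action_known_service": {"easy": 0.02, "medium": 0.02, "hard": 0.01},
    "repeat_query": {"easy": -0.03, "medium": -0.04, "hard": -0.03},
    "unknown_service": {"easy": -0.06, "medium": -0.06, "hard": -0.05},
    "correct_remediation": {"medium": 0.06, "hard": 0.06},  # no value listed for easy
    "wrong_remediation": {"easy": -0.08, "medium": -0.10, "hard": -0.15},
    "past_halfway": {"easy": -0.04, "medium": -0.02, "hard": -0.02},
    "timeout": {"easy": -0.15, "medium": -0.15, "hard": -0.20},
}


def shaped_reward(difficulty: str, events: list[str]) -> float:
    """Sum the table values for a sequence of events at one difficulty level."""
    return round(sum(STEP_REWARDS[event][difficulty] for event in events), 2)


# Two first-time queries, one correct and one wrong remediation on a hard task.
events = ["new_service_query", "new_service_query", "correct_remediation", "wrong_remediation"]
print(shaped_reward("hard", events))  # -0.03
```

In this example a single wrong remediation on a hard task outweighs the reward from two new queries and one correct fix.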

### Grader Scoring (terminal, deterministic)

| Task | Scoring Logic |
|---|---|
| `alert_classification` | 1.0 exact · 0.5 adjacent · 0.25 two-off · 0.0 wrong |
| `root_cause_analysis` | Up to 0.6 base (service + failure mode) + up to 0.4 efficiency bonus. Wrong service: 0.05–0.20 based on investigation effort |
| `remediation_planning` | Scaled base (0.10–0.50 by investigation depth) + 0.30 efficiency − up to 0.30 wrong-action penalty + 0.10 summary quality |
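
The `alert_classification` rule is simple enough to sketch directly: the score depends on how many severity levels separate the submitted value from the expected one. This is an illustration of the rule as stated in the table, not the code in `graders.py`; the function name is hypothetical, and treating an unrecognized severity string as 0.0 is an assumption.

```python
SEVERITIES = ["P1", "P2", "P3", "P4"]

# Score by distance between predicted and expected severity, from the table above.
SCORE_BY_DISTANCE = {0: 1.0, 1: 0.5, 2: 0.25}


def severity_score(predicted: str, expected: str) -> float:
    """Return 1.0 for exact, 0.5 for adjacent, 0.25 for two-off, and 0.0 otherwise."""
    if predicted not in SEVERITIES or expected not in SEVERITIES:
        return 0.0  # assumption: an invalid label scores the same as a wrong answer
    distance = abs(SEVERITIES.index(predicted) - SEVERITIES.index(expected))
    return SCORE_BY_DISTANCE.get(distance, 0.0)


print(severity_score("P2", "P1"))  # 0.5
```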

---

## 🔌 API Endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/` | Gradio UI — interactive environment demo |
| `GET` | `/health` | `{"status":"ok","version":"0.1.0"}` |
| `POST` | `/reset` | Start new episode (accepts `task_id`, `scenario_index`) |
| `POST` | `/step` | Submit action → returns observation, reward, done, info |
| `GET` | `/state` | Full current episode state with action history |
| `GET` | `/tasks` | All tasks with action schemas |
| `GET` | `/grader` | Score current episode (0.0–1.0) with breakdown |

---

## 🚀 Setup & Usage

### Local Development
```bash
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

### Docker
```bash
docker build -t cloud-incident-env .
docker run -p 7860:7860 cloud-incident-env
```

### Run Baseline Inference
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="your_token"
python inference.py
```

### Quick API Test
```bash
# Reset
curl -X POST "http://localhost:7860/reset?task_id=alert_classification&scenario_index=0"

# Step
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type":"query_logs","parameters":{"service":"api-gateway"}}'

# Grade
curl http://localhost:7860/grader
```
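
The same three calls can be made from Python with only the standard library. This is a minimal sketch that assumes the server is running locally on port 7860 as in the commands above; `reset_request`, `step_request`, and `send` are hypothetical helper names, and the shape of the JSON responses is not assumed beyond being valid JSON.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local server as started above


def reset_request(task_id: str, scenario_index: int) -> urllib.request.Request:
    """Build POST /reset with task_id and scenario_index as query parameters."""
    query = urllib.parse.urlencode({"task_id": task_id, "scenario_index": scenario_index})
    return urllib.request.Request(f"{BASE_URL}/reset?{query}", method="POST")


def step_request(action_type: str, parameters: dict) -> urllib.request.Request:
    """Build POST /step with the action encoded as a JSON body."""
    body = json.dumps({"action_type": action_type, "parameters": parameters}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request) -> dict:
    """Send a request (or fetch a URL string) and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage against a running server (not executed here):
#   send(reset_request("alert_classification", 0))
#   send(step_request("query_logs", {"service": "api-gateway"}))
#   send(f"{BASE_URL}/grader")
```

Keeping request construction separate from sending makes the payloads easy to inspect or unit-test without a live server.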

---

## 📁 Project Structure

```
.
├── Dockerfile              # Container build
├── README.md               # This file
├── requirements.txt        # Python dependencies
├── openenv.yaml            # OpenEnv metadata + task definitions
├── inference.py            # Baseline agent (OpenAI client + smart fallback)
├── tasks.py                # 9 scenarios across 3 difficulty levels
├── graders.py              # Deterministic graders (0.0–1.0)
└── server/
    ├── __init__.py
    ├── app.py              # FastAPI + Gradio endpoints
    ├── environment.py      # Core step()/reset()/state() logic
    └── models.py           # Typed Pydantic models (Action, Observation, Reward)
```

---

## ✅ Validation

```bash
# OpenEnv spec validation
openenv validate    # → [OK] Ready for multi-mode deployment

# Docker build
docker build -t cloud-incident-env .    # → builds successfully

# Health check
curl http://localhost:7860/health       # → {"status":"ok","version":"0.1.0"}
```

## Team
- **Einstein** — [@MrEinsteinE](https://github.com/MrEinsteinE)
- **Sidra** — [@sidraaiman](https://github.com/sidraaiman)