Commit c5c7c22 by Elliot89 · 1 Parent(s): a50dd28

Update README (from agent)

Files changed (1)
  1. README.md +300 -121

README.md CHANGED
@@ -1,3 +1,8 @@
  ---
  title: Cloud Incident Response OpenEnv
  emoji: 🚨
@@ -14,48 +19,118 @@ tags:
  - devops
  - real-world
  - agentic
- ---

- # Cloud Incident Response - OpenEnv Environment

- An OpenEnv environment for training and evaluating AI agents on **cloud SRE incident response** - the real-world on-call workflow that engineers at every cloud company perform daily.

- Distinct from Kubernetes operations environments: this focuses on **cross-service cascading failures** in distributed microservice architectures - connection pool exhaustion, CDN cache storms, OOM kills, and BGP network partitions.

- ## Why This Environment

- Every cloud company employs SREs who respond to production incidents under time pressure with incomplete information. This environment simulates the exact decision loop:

- 1. **Triage** - Read alert, assess blast radius, classify severity (P1–P4)
- 2. **Investigate** - Query logs, metrics, dependencies, recent deploys
- 3. **Diagnose** - Correlate signals across services to find the root cause
- 4. **Remediate** - Execute the correct runbook steps in the right sequence
- 5. **Document** - Submit a resolution summary for post-incident review

- Agents trained here learn the same skills a human SRE uses: service dependency traversal, log correlation, cascading failure analysis, and targeted remediation.
 
- ## Tasks

- | Task ID | Difficulty | Max Steps | What the Agent Does |
- |---|---|---|---|
- | `alert_classification` | Easy | 3 | Classify alert severity (P1–P4) from metrics and symptoms |
- | `root_cause_analysis` | Medium | 10 | Trace logs/metrics/deps to find root cause service and failure mode |
- | `remediation_planning` | Hard | 15 | Diagnose, remediate, and document full incident resolution |

- ### Scenarios

- | ID | Incident Type | Root Cause | Failure Pattern |
  |---|---|---|---|
- | AC-001 | DB connection pool exhaustion | postgres-db / auth-service deploy | api-gateway → auth-service → postgres-db cascade |
- | AC-002 | CDN cache invalidation storm | cdn-edge purge cronjob misconfigured | 40× origin traffic spike |
- | RCA-001 | Postgres OOM kill | analytics-service unbounded query | Kernel OOM → DB crash loop → all dependents down |
- | RCA-002 | BGP network partition | network-infra config change | Route withdrawal → AZ isolation → 61% checkout failures |
- | RP-001 | Full OOM remediation | analytics-service | Disable job → restart DB → restore services → document |
- | RP-002 | Full BGP remediation | network-infra | Restore routes → rollback config → verify recovery → document |

  ## Action Space

- **Diagnostic actions** (gather evidence):
  ```json
  {"action_type": "query_logs", "parameters": {"service": "postgres-db"}}
  {"action_type": "check_metrics", "parameters": {"service": "auth-service"}}
@@ -64,142 +139,246 @@ Agents trained here learn the same skills a human SRE uses: service dependency t
  {"action_type": "check_service_status", "parameters": {"service": "payment-service"}}
  ```

- **Remediation actions** (fix the incident):
  ```json
  {"action_type": "restart_service", "parameters": {"service": "postgres-db"}}
- {"action_type": "rollback_deploy", "parameters": {"service": "network-infra", "target_version": "previous"}}
- {"action_type": "scale_service", "parameters": {"service": "image-service", "replicas": 10}}
  {"action_type": "disable_feature_flag", "parameters": {"flag": "full_history_export"}}
  {"action_type": "execute_runbook_step", "parameters": {"runbook_action": "restore_bgp_routes"}}
  ```

- **Submission actions** (end the episode):
  ```json
- {"action_type": "submit_severity", "parameters": {"severity": "P1", "service": "postgres-db"}}
  {"action_type": "submit_root_cause", "parameters": {"service": "analytics-service", "failure_mode": "unbounded query OOM killing postgres-db"}}
- {"action_type": "submit_resolution", "parameters": {"summary": "Disabled analytics job, restarted postgres-db..."}}
  ```
 
- ## Observation Space
-
- | Field | Type | Description |
- |---|---|---|
- | `episode_id` | string | Unique episode UUID |
- | `task_id` | string | Active task |
- | `scenario_id` | string | Scenario (e.g. `AC-001`) |
- | `step_count` / `max_steps` | int | Current step and budget |
- | `incident_summary` | string | Plain-text incident description |
- | `alert` | dict | Alert payload with severity, symptoms, affected services |
- | `available_actions` | list[str] | Valid action types for this task |
- | `queried_data` | dict | All tool responses gathered so far |
- | `known_services` | list[str] | Exact service names to use in actions |
- | `cumulative_reward` | float | Running reward total |
- | `done` | bool | Episode terminal flag |
- | `feedback` | string | Per-step feedback string |

  ## Reward Function

- Dense reward shaping throughout the trajectory:

- | Event | Reward |
  |---|---|
- | Query known service (first time) | +0.05 |
- | Query known service (repeat) | +0.01 |
- | Query unknown service | −0.05 |
- | Correct remediation action | +0.10 |
- | Wrong remediation action | −0.10 |
- | Step past halfway (non-submit) | −0.02 |
- | Timeout without submission | −0.10 |
- | Grader score (terminal step) | 0.0–1.0 |

- **Grader scoring** (deterministic, via `GET /grader`):

- | Task | Scoring Logic |
  |---|---|
- | `alert_classification` | 1.0 exact · 0.5 adjacent · 0.25 two-off · 0.0 wrong/none |
- | `root_cause_analysis` | 0.6 base (svc+mode) + up to 0.4 efficiency bonus |
- | `remediation_planning` | 0.6 base + 0.3 efficiency − 0.15 wrong penalty + 0.1 summary |
 
- ## API Endpoints

- | Method | Path | Description |
  |---|---|---|
- | GET | `/` | `{"status":"running",...}` - HF Space health |
- | GET | `/health` | `{"status":"ok","version":"0.1.0"}` |
- | POST | `/reset?task_id=...&scenario_index=...` | Start new episode |
- | POST | `/step` | Submit action (JSON body) |
- | GET | `/state` | Full current episode state |
- | GET | `/tasks` | All tasks with action schemas |
- | GET | `/grader` | Score current episode (0.0–1.0) |
- | POST | `/baseline` | Run inference.py, return scores |
-
- ## Setup & Usage
-
- ### Local development
  ```bash
  pip install -r requirements.txt
  uvicorn server.app:app --host 0.0.0.0 --port 7860
  ```

  ### Docker
  ```bash
  docker build -t cloud-incident-env .
- docker run -p 7860:7860 \
-   -e API_BASE_URL="https://api-inference.huggingface.co/v1" \
-   -e MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
-   -e HF_TOKEN="hf_your_token" \
-   cloud-incident-env
  ```

- ### Run inference script
  ```bash
- export API_BASE_URL="https://api-inference.huggingface.co/v1"
- export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
- export HF_TOKEN="hf_your_token"
  python inference.py
  ```
 
- ### Quick API test
- ```bash
- # Start new episode
- curl -X POST "http://localhost:7860/reset?task_id=alert_classification&scenario_index=0"

- # Submit an action
- curl -X POST http://localhost:7860/step \
-   -H "Content-Type: application/json" \
-   -d '{"action_type":"query_logs","parameters":{"service":"api-gateway"}}'

- # Check score
- curl http://localhost:7860/grader
  ```

- ## Baseline Scores

- Using `meta-llama/Llama-3.1-8B-Instruct` via HF Inference API:

- | Task | Scenario 0 | Scenario 1 | Average |
- |---|---|---|---|
- | `alert_classification` | ~1.00 | ~0.50 | ~0.75 |
- | `root_cause_analysis` | ~0.45 | ~0.35 | ~0.40 |
- | `remediation_planning` | ~0.25 | ~0.20 | ~0.23 |
- | **overall** | | | **~0.46** |

- *Run `python inference.py` to reproduce.*

- ## Project Structure

- ```
- .
- ├── Dockerfile
- ├── README.md
- ├── requirements.txt
- ├── openenv.yaml
- ├── tasks.py            # Scenario definitions (6 scenarios across 3 tasks)
- ├── graders.py          # Deterministic graders for all tasks
- ├── inference.py        # Baseline agent + smart fallback logic
- └── server/
-     ├── __init__.py
-     ├── app.py          # FastAPI endpoints
-     ├── environment.py  # Core OpenEnv step/reset/state logic
-     └── models.py       # Typed Pydantic models (Action, Observation, Reward)
  ```
 
+
+
+ ## `README.md`
+
+ ```markdown
  ---
  title: Cloud Incident Response OpenEnv
  emoji: 🚨

  - devops
  - real-world
  - agentic
+ ---
+
+ # ☁️ Cloud Incident Response - OpenEnv Environment

+ An OpenEnv environment for training and evaluating AI agents on **cloud SRE incident response** - the real-world on-call workflow that engineers perform daily at every cloud company.

+ Distinct from Kubernetes operations environments: this focuses on **cross-service cascading failures** in distributed microservice architectures - OOM kills from runaway analytics queries, BGP network partitions isolating availability zones, and credential rotation bugs pushing stale secrets to production services.

+ ---

+ ## OpenEnv Interface

+ This environment implements the **full OpenEnv specification** with typed Pydantic models:

+ | Method | Endpoint | Input | Returns |
+ |---|---|---|---|
+ | `POST` | `/reset` | `{"task_id": "...", "scenario_index": 0}` or `{}` | `Observation` |
+ | `POST` | `/step` | `Action` JSON body | `{observation, reward, done, info}` |
+ | `GET` | `/state` | - | `EpisodeState` |
+ | `GET` | `/health` | - | `{"status": "ok"}` |
+ | `GET` | `/tasks` | - | Task list + action schemas |
+ | `GET` | `/grader` | - | Score 0.0–1.0 with breakdown |
+ | `POST` | `/baseline` | - | Run inference.py, return scores |
+
+ ### Typed Models
+
+ ```python
+ # Action - submitted by the agent
+ Action {
+   action_type: str,          # e.g. "query_logs", "restart_service", "submit_severity"
+   parameters: {
+     service?: str,           # Target service name
+     severity?: str,          # P1|P2|P3|P4 (for submit_severity)
+     failure_mode?: str,      # Root cause description (for submit_root_cause)
+     summary?: str,           # Resolution summary (for submit_resolution)
+     flag?: str,              # Feature flag name (for disable_feature_flag)
+     runbook_action?: str,    # Runbook step (for execute_runbook_step)
+     target_version?: str,    # Deploy version (for rollback_deploy)
+   }
+ }
+
+ # Observation - returned to the agent
+ Observation {
+   episode_id: str,           # Unique episode UUID
+   task_id: str,              # Active task
+   scenario_id: str,          # Current scenario (e.g. "AC-001")
+   step_count: int,           # Steps taken so far
+   max_steps: int,            # Budget (3, 10, or 15)
+   incident_summary: str,     # Plain-text incident description
+   alert: dict,               # Alert payload: title, symptoms, error_rate, revenue_impact
+   available_actions: [str],  # Valid action types for this task
+   queried_data: dict,        # All evidence gathered so far
+   known_services: [str],     # Valid service names for actions
+   cumulative_reward: float,  # Running reward total
+   done: bool,                # Episode complete flag
+   feedback: str,             # Per-step reward explanation
+   last_action_error: str?,   # Error from last action (null if OK)
+ }
+
+ # Reward - returned after each step
+ Reward {
+   score: float,              # Step reward value
+   value: float,              # Alias for score (backward compatibility)
+   reason: str,               # Human-readable explanation
+   cumulative: float,         # Running total
+ }
+ ```
 
+ ---

+ ## Tasks (3 Difficulty Levels, 9 Scenarios)

+ | Task ID | Difficulty | Max Steps | Scenarios | What the Agent Does |
+ |---|---|---|---|---|
+ | `alert_classification` | 🟢 Easy | 3 | 3 | Classify alert severity P1–P4 from metrics and symptoms |
+ | `root_cause_analysis` | 🟡 Medium | 10 | 3 | Trace failure chain across 8 services to find root cause |
+ | `remediation_planning` | 🔴 Hard | 15 | 3 | Diagnose + execute multi-step remediation + document resolution |

+ ### Scenario Details

+ | ID | Incident | Root Cause | Challenge |
  |---|---|---|---|
+ | AC-001 | DB connection pool exhaustion | - | Clear P1: 78% errors, $12k/min |
+ | AC-002 | CDN cache invalidation storm | - | Ambiguous P2: degraded but checkout works |
+ | AC-003 | Recommendation engine errors | - | Trap P3: 45% errors but zero revenue impact |
+ | RCA-001 | Postgres OOM crash loop | analytics-service (unbounded query) | Root cause NOT in alert, 8 services to investigate |
+ | RCA-002 | Cross-AZ checkout failures | network-infra (BGP route withdrawal) | Network issue disguised as application failure |
+ | RCA-003 | DB authentication failures | config-service (stale credential rotation) | Multiple red herring deploys on other services |
+ | RP-001 | Full OOM incident | analytics-service | 6-step remediation sequence, wrong actions penalized |
+ | RP-002 | Full BGP incident | network-infra | 4-step runbook + config rollback, 8 services |
+ | RP-003 | Full credential incident | config-service | 7-step sequence, credential rotation + service restarts |
+
+ ### Why This Is Genuinely Difficult
+
+ - **Medium**: Root cause service is NEVER in the alert's `affected_services`. Agent must query logs on victim services, follow breadcrumbs that name the culprit, then investigate that service. 8 known services with red herring deploys.
+ - **Hard**: Same diagnostic challenge PLUS must execute 4–7 remediation actions in logical order. Wrong actions (e.g. restarting a healthy service) carry −0.15 penalties. Resolution summary must reference specific services and actions.
+
+ ### Baseline Scores
+
+ | Model | Easy | Medium | Hard | Overall |
+ |---|---|---|---|---|
+ | `llama-3.1-8b-instant` | 1.0 | 0.65 | 0.70 | 0.78 |
+ | `llama-3.3-70b-versatile` | 1.0 | 0.99 | 0.80 | 0.93 |
+
+ 70B outperforms 8B on the medium and hard tasks in these runs, which suggests the environment differentiates model quality.
+
+ ---
 
  ## Action Space

+ ### 🔍 Diagnostic Actions (gather evidence)
+
  ```json
  {"action_type": "query_logs", "parameters": {"service": "postgres-db"}}
  {"action_type": "check_metrics", "parameters": {"service": "auth-service"}}

  {"action_type": "check_service_status", "parameters": {"service": "payment-service"}}
  ```

+ ### 🔧 Remediation Actions (fix the incident)
+
  ```json
  {"action_type": "restart_service", "parameters": {"service": "postgres-db"}}
+ {"action_type": "rollback_deploy", "parameters": {"service": "config-service", "target_version": "previous"}}
  {"action_type": "disable_feature_flag", "parameters": {"flag": "full_history_export"}}
  {"action_type": "execute_runbook_step", "parameters": {"runbook_action": "restore_bgp_routes"}}
+ {"action_type": "scale_service", "parameters": {"service": "image-service", "replicas": 10}}
+ {"action_type": "clear_cache", "parameters": {"service": "redis-session"}}
  ```

+ ### 📝 Submission Actions (end episode)
+
  ```json
+ {"action_type": "submit_severity", "parameters": {"severity": "P1", "service": "api-gateway"}}
  {"action_type": "submit_root_cause", "parameters": {"service": "analytics-service", "failure_mode": "unbounded query OOM killing postgres-db"}}
+ {"action_type": "submit_resolution", "parameters": {"summary": "3+ sentence description of what failed, what you did, and current status"}}
  ```

+ ---
 
  ## Reward Function

+ Dense reward shaping provides signal over the **full trajectory** (not just binary end-of-episode):

+ | Signal | Reward | Description |
+ |---|---|---|
+ | Query new service | +0.03 to +0.04 | First diagnostic action on a service |
+ | Query new action type | +0.01 to +0.02 | Different diagnostic on already-queried service |
+ | Repeat same query | −0.03 to −0.04 | Same (action, service) pair again |
+ | Unknown service | −0.05 to −0.06 | Service not in known_services |
+ | Correct remediation | +0.06 | Action matches correct remediation sequence |
+ | Wrong remediation | −0.12 to −0.15 | Action in wrong_actions list (e.g. restarting healthy service) |
+ | Correct submission type | +0.02 | Using the right submit action for the task |
+ | Wrong submission type | −0.08 to −0.12 | e.g. submit_severity during remediation_planning |
+ | Past halfway (non-submit) | −0.015 to −0.04 | Per-step efficiency penalty |
+ | Timeout | −0.15 to −0.20 | No submission before max_steps |
+ | Exact repeat action | −0.04 to −0.05 | Identical action+params as a previous step |
+ | **Grader score** | **0.0–1.0** | **Added at terminal step** |
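The diagnostic rows of the table above can be read as a small decision procedure. The sketch below is only an illustration of that reading, not the logic in `server/environment.py`: the function name `diagnostic_reward`, the `(action_type, service)` history list, the service name `billing-db`, and the choice of the smaller-magnitude end of each documented range are all assumptions. The separate "exact repeat action" penalty is not modeled.

```python
from typing import List, Set, Tuple

# Illustrative constants: the smaller-magnitude end of each range in the table.
NEW_SERVICE = 0.03
NEW_ACTION_TYPE = 0.01
REPEAT_QUERY = -0.03
UNKNOWN_SERVICE = -0.05


def diagnostic_reward(
    action_type: str,
    service: str,
    history: List[Tuple[str, str]],
    known_services: Set[str],
) -> float:
    """Return a shaped reward for one diagnostic query (hypothetical helper)."""
    if service not in known_services:
        return UNKNOWN_SERVICE  # target is not in known_services
    if (action_type, service) in history:
        return REPEAT_QUERY  # same (action, service) pair again
    if any(prev_service == service for _, prev_service in history):
        return NEW_ACTION_TYPE  # new diagnostic on an already-queried service
    return NEW_SERVICE  # first diagnostic action on this service


known = {"postgres-db", "analytics-service"}
history: List[Tuple[str, str]] = []
rewards = []
for action in [
    ("query_logs", "postgres-db"),
    ("check_metrics", "postgres-db"),
    ("query_logs", "postgres-db"),
    ("query_logs", "billing-db"),  # hypothetical name, not in known_services
]:
    rewards.append(diagnostic_reward(action[0], action[1], history, known))
    history.append(action)
print(rewards)
```

Checking for an unknown service before consulting the history keeps the invalid-target penalty independent of what the agent has already queried, which is one plausible ordering among several.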
+
+ ### Grading (deterministic, reproducible, 0.0–1.0)
+
+ | Task | Scoring Logic |
  |---|---|
+ | `alert_classification` | 1.0 exact match · 0.5 adjacent (P1↔P2) · 0.25 two-off · 0.0 wrong |
+ | `root_cause_analysis` | 0.6 base (correct service + failure mode) + up to 0.4 efficiency bonus |
+ | `remediation_planning` | 0.6 base + 0.3 efficiency (correct steps matched) − 0.15 penalty (wrong actions) + 0.1 summary quality |
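The `alert_classification` row amounts to scoring by distance between severity levels. A minimal sketch of that rule follows; it is not taken from `graders.py`. The name `grade_severity` is hypothetical, and treating "0.0 wrong" as "three levels off or not a valid severity" is an assumption.

```python
SEVERITIES = ["P1", "P2", "P3", "P4"]

# Score by how many levels the prediction is from the expected severity.
DISTANCE_SCORES = {0: 1.0, 1: 0.5, 2: 0.25}


def grade_severity(predicted: str, expected: str) -> float:
    """Score a severity submission by its distance from the expected level."""
    if predicted not in SEVERITIES or expected not in SEVERITIES:
        return 0.0  # assumption: invalid labels count as wrong
    distance = abs(SEVERITIES.index(predicted) - SEVERITIES.index(expected))
    return DISTANCE_SCORES.get(distance, 0.0)


for guess in SEVERITIES:
    print(guess, grade_severity(guess, "P3"))
```

For the AC-003 "trap" scenario, where the expected answer is P3, this sketch would give P2 or P4 half credit and P1 a quarter.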
 
 
 
 
 
+ ---

+ ## 🖥️ Interactive UI Walkthrough
+
+ The Gradio UI at `/` provides a visual interface for human evaluation. Here's how to use it:
+
+ ### 🟢 Easy Task: Alert Classification
+
+ 1. **Select Task**: Choose the 🟢 Easy (Alert Classification) entry from the Task dropdown
+ 2. **Select Scenario**: Choose `Scenario 2` (the tricky P3 trap)
+ 3. **Click** `🔄 Reset Environment`
+ 4. **Read** the observation panel: recommendation-service errors at 45%
+ 5. **Investigate**: Set Action Type to `🔍 check_metrics`, Service to `recommendation-service`, click `▶️ Execute Action`
+ 6. **Read evidence**: "User impact: NONE", "Revenue: unchanged", "Checkout: 100%"
+ 7. **Submit**: Set Action Type to `📝 submit_severity`, expand `📋 Parameters`, set Severity to `P3 Medium`, click `▶️ Execute Action`
+ 8. **Grade**: Click `📊 Grade`; it should show **1.0** for exact P3 match
+
+ ### 🟡 Medium Task: Root Cause Analysis
+
+ 1. **Select Task**: the 🟡 Medium (Root Cause Analysis) entry, **Scenario**: `Scenario 0`
+ 2. **Click** `🔄 Reset Environment`
+ 3. **Read** the observation: postgres-db crash loop, multiple services down
+ 4. **Query victim**: Action Type `🔍 query_logs`, Service `postgres-db`, click `▶️ Execute Action`
+ 5. **Read evidence**: logs say *"query from analytics-service consuming all memory"*
+ 6. **Follow breadcrumb**: Action Type `🔍 query_logs`, Service `analytics-service`, click `▶️ Execute Action`
+ 7. **Read evidence**: "full_history_export job", "847M row scan", "no LIMIT"
+ 8. **Confirm**: Action Type `🔍 check_recent_deploys`, Service `analytics-service`, click `▶️ Execute Action`
+ 9. **Read evidence**: "Deploy 6h ago: cross-table JOIN without LIMIT clause"
+ 10. **Submit**: Action Type `📝 submit_root_cause`, Service `analytics-service`, Failure Mode: `unbounded query OOM killing postgres-db`, click `▶️ Execute Action`
+ 11. **Grade**: Click `📊 Grade`; it should show **0.85–1.0**
+
+ ### 🔴 Hard Task: Remediation Planning
+
+ 1. **Select Task**: the 🔴 Hard (Remediation Planning) entry, **Scenario**: `Scenario 0`
+ 2. **Click** `🔄 Reset Environment`
+ 3. **Diagnose**: `🔍 query_logs` on `postgres-db` → see "analytics-service" breadcrumb
+ 4. **Confirm**: `🔍 query_logs` on `analytics-service` → see "full_history_export, no LIMIT"
+ 5. **Fix Step 1**: `🔧 disable_feature_flag`, Flag: `full_history_export` → "job DISABLED"
+ 6. **Fix Step 2**: `🔧 restart_service` on `analytics-service` → "restarted - idle"
+ 7. **Fix Step 3**: `🔧 restart_service` on `postgres-db` → "accepting connections (12/500)"
+ 8. **Fix Step 4**: `🔧 restart_service` on `auth-service` → "reconnected OK"
+ 9. **Fix Step 5**: `🔧 restart_service` on `order-service` → "writes resuming"
+ 10. **Verify**: `🔧 execute_runbook_step`, Runbook Action: `verify_db_health` → "healthy"
+ 11. **Submit**: `📝 submit_resolution`, Summary: *"The analytics-service deployed a full_history_export job with an unbounded query that OOM-killed postgres-db. We disabled the full_history_export flag, restarted analytics-service, then restarted postgres-db, auth-service, and order-service. All services recovered and postgres-db is healthy."*
+ 12. **Grade**: Click `📊 Grade`; it should show **0.85–1.0**
+
+ ### UI Controls Reference
+
+ | Button | Purpose |
  |---|---|
+ | `🔄 Reset Environment` | Start a new episode |
+ | `▶️ Execute Action` | Run the selected action |
+ | `📋 Parameters` | Expand to fill severity / failure_mode / summary / flag / runbook fields |
+ | `📊 Grade` | See final grader score (0.0–1.0) after episode ends |
+ | `📋 State` | Refresh the state panel |

+ ### Common Mistakes & Penalties

+ | Mistake | Penalty | Why |
  |---|---|---|
+ | Wrong submission type (e.g. `submit_severity` in hard task) | −0.12 | Each task has ONE correct submission action |
+ | Restarting a healthy service (e.g. `restart_service` on `redis-session`) | −0.15 | Wrong remediation action |
+ | Querying a service not in `known_services` | −0.06 | Invalid target |
+ | Repeating the exact same action | −0.04 | Infinite loop detection |
+ | Not submitting before max steps | −0.20 | Timeout penalty |
+ | Using remediation actions in easy task | −0.08 | Not available for alert classification |
+
+ ---
+
+ ## API Usage
+
+ ### Quick Test
+
+ ```bash
+ # Reset with defaults (alert_classification, scenario 0)
+ curl -X POST http://localhost:7860/reset \
+   -H "Content-Type: application/json" -d '{}'
+
+ # Reset with specific task
+ curl -X POST http://localhost:7860/reset \
+   -H "Content-Type: application/json" \
+   -d '{"task_id": "root_cause_analysis", "scenario_index": 1}'
+
+ # Take a step
+ curl -X POST http://localhost:7860/step \
+   -H "Content-Type: application/json" \
+   -d '{"action_type": "query_logs", "parameters": {"service": "postgres-db"}}'
+
+ # Check state
+ curl http://localhost:7860/state
+
+ # Grade current episode
+ curl http://localhost:7860/grader
+ ```
+
+ ### Full Episode Example (Python)
+
+ ```python
+ import requests
+
+ BASE = "http://localhost:7860"
+
+ # Start episode
+ obs = requests.post(f"{BASE}/reset", json={
+     "task_id": "alert_classification", "scenario_index": 0
+ }).json()
+
+ print(f"Incident: {obs['incident_summary']}")
+ print(f"Services: {obs['known_services']}")
+
+ # Investigate
+ result = requests.post(f"{BASE}/step", json={
+     "action_type": "check_metrics",
+     "parameters": {"service": obs["known_services"][0]}
+ }).json()
+
+ print(f"Reward: {result['reward']['score']:+.3f}")
+ print(f"Done: {result['done']}")
+
+ # Submit
+ result = requests.post(f"{BASE}/step", json={
+     "action_type": "submit_severity",
+     "parameters": {"severity": "P1", "service": obs["known_services"][0]}
+ }).json()
+
+ # Grade
+ grade = requests.get(f"{BASE}/grader").json()
+ print(f"Score: {grade['total']}")
+ ```
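The episode above can be generalized into a loop that keeps stepping until the environment reports `done`. The sketch below assumes only the `/step` response shape documented earlier (`reward.score` and `done`); `run_episode` and `fake_step` are hypothetical names, and the rewards returned by `fake_step` are made up so the loop can be exercised without a running server.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

StepFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_episode(step: StepFn, actions: Iterable[Dict[str, Any]]) -> Tuple[float, int]:
    """Send actions until the environment reports done; return (total reward, steps)."""
    total = 0.0
    steps = 0
    for action in actions:
        result = step(action)
        total += result["reward"]["score"]
        steps += 1
        if result["done"]:
            break  # a submission (or timeout) ended the episode
    return total, steps


def fake_step(action: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for POST /step with made-up rewards, for offline testing only."""
    is_submit = action["action_type"].startswith("submit_")
    return {"reward": {"score": 1.0 if is_submit else 0.03}, "done": is_submit}


plan = [
    {"action_type": "check_metrics", "parameters": {"service": "api-gateway"}},
    {"action_type": "submit_severity", "parameters": {"severity": "P1", "service": "api-gateway"}},
    {"action_type": "query_logs", "parameters": {"service": "api-gateway"}},  # never sent
]
total, steps = run_episode(fake_step, plan)
print(f"steps={steps} total={total:.2f}")
```

Against a live server, the `step` argument could instead wrap `requests.post(f"{BASE}/step", json=action).json()`, leaving the loop unchanged.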
+
+ ---
+
+ ## Setup
+
+ ### Local Development
+
  ```bash
  pip install -r requirements.txt
  uvicorn server.app:app --host 0.0.0.0 --port 7860
  ```

  ### Docker
+
  ```bash
  docker build -t cloud-incident-env .
+ docker run -p 7860:7860 cloud-incident-env
  ```

+ ### Run Baseline Agent
+
  ```bash
+ export API_BASE_URL="https://api.groq.com/openai/v1"
+ export MODEL_NAME="llama-3.1-8b-instant"
+ export HF_TOKEN="gsk_your_groq_key"
  python inference.py
  ```
 
+ ---

+ ## Project Structure

+ ```
+ ├── Dockerfile          # Docker build for HF Spaces
+ ├── README.md           # This file
+ ├── requirements.txt    # Python dependencies
+ ├── openenv.yaml        # OpenEnv manifest (tasks, endpoints)
+ ├── pyproject.toml      # Project metadata
+ ├── tasks.py            # 9 scenarios across 3 difficulty levels
+ ├── graders.py          # Deterministic graders (0.0–1.0)
+ ├── inference.py        # Baseline LLM agent with fallback logic
+ └── server/
+     ├── __init__.py
+     ├── app.py          # FastAPI + Gradio endpoints
+     ├── environment.py  # Core step/reset/state logic + reward shaping
+     └── models.py       # Typed Pydantic models (Action, Observation, Reward)
  ```
 
+ ---

+ ## Design Decisions

+ ### Why Cloud Incident Response?

+ Every cloud company employs SREs who respond to production incidents under time pressure with incomplete information. This is a universal, high-value skill that AI agents should learn. The environment simulates the exact decision loop: triage → investigate → diagnose → remediate → document.

+ ### Why These Specific Incidents?

+ - **OOM kills** (RCA-001, RP-001): A common database failure mode in which a runaway query consumes all memory and crashes the DB, taking down every dependent service.
+ - **BGP partitions** (RCA-002, RP-002): Network-layer failures that look like application failures: services appear "down" but are actually healthy, just unreachable.
+ - **Credential rotation bugs** (RCA-003, RP-003): Configuration management failures that cause cascading auth failures: the DB is fine but clients have wrong passwords.
+
+ ### Why Dense Rewards?
+
+ Sparse rewards (only at episode end) give RL agents little intermediate learning signal. Our reward function gives feedback on **every step**: positive for useful investigation, negative for wasted actions, with the terminal grader score layered on top. This enables both RL training and LLM agent evaluation.
  ```