Elliot89 committed on
Commit f97cc91 · 1 Parent(s): 77eb356

initial commit

Files changed (1)
  1. README.md +5 -200
README.md CHANGED
@@ -1,205 +1,10 @@
1
  ---
2
- title: Cloud Incident Response OpenEnv
3
- emoji: 🚨
4
- colorFrom: red
5
- colorTo: yellow
6
  sdk: docker
7
- app_port: 7860
8
  pinned: false
9
- tags:
10
- - openenv
11
- - sre
12
- - cloud
13
- - incident-response
14
- - devops
15
- - real-world
16
- - agentic
17
  ---
18
 
19
- # Cloud Incident Response: OpenEnv Environment
20
-
21
- An OpenEnv environment for training and evaluating AI agents on **cloud SRE incident response**, the real-world on-call workflow that engineers at every cloud company perform daily.
22
-
23
- Unlike Kubernetes operations environments, this one focuses on **cross-service cascading failures** in distributed microservice architectures (connection pool exhaustion, CDN cache storms, OOM kills, and BGP network partitions).
24
-
25
- ## Why This Environment
26
-
27
- Every cloud company employs SREs who respond to production incidents under time pressure with incomplete information. This environment simulates the exact decision loop:
28
-
29
- 1. **Triage**: Read alert, assess blast radius, classify severity (P1–P4)
30
- 2. **Investigate**: Query logs, metrics, dependencies, recent deploys
31
- 3. **Diagnose**: Correlate signals across services to find the root cause
32
- 4. **Remediate**: Execute the correct runbook steps in the right sequence
33
- 5. **Document**: Submit a resolution summary for post-incident review
34
-
35
- Agents trained here learn the same skills a human SRE uses: service dependency traversal, log correlation, cascading failure analysis, and targeted remediation.
36
-
37
- ## Tasks
38
-
39
- | Task ID | Difficulty | Max Steps | What the Agent Does |
40
- |---|---|---|---|
41
- | `alert_classification` | Easy | 3 | Classify alert severity (P1–P4) from metrics and symptoms |
42
- | `root_cause_analysis` | Medium | 10 | Trace logs/metrics/deps to find root cause service and failure mode |
43
- | `remediation_planning` | Hard | 15 | Diagnose, remediate, and document full incident resolution |
44
-
45
- ### Scenarios
46
-
47
- | ID | Incident Type | Root Cause | Failure Pattern |
48
- |---|---|---|---|
49
- | AC-001 | DB connection pool exhaustion | postgres-db / auth-service deploy | api-gateway → auth-service → postgres-db cascade |
50
- | AC-002 | CDN cache invalidation storm | cdn-edge purge cronjob misconfigured | 40× origin traffic spike |
51
- | RCA-001 | Postgres OOM kill | analytics-service unbounded query | Kernel OOM → DB crash loop → all dependents down |
52
- | RCA-002 | BGP network partition | network-infra config change | Route withdrawal → AZ isolation → 61% checkout failures |
53
- | RP-001 | Full OOM remediation | analytics-service | Disable job → restart DB → restore services → document |
54
- | RP-002 | Full BGP remediation | network-infra | Restore routes → rollback config → verify recovery → document |
55
-
56
- ## Action Space
57
-
58
- **Diagnostic actions** (gather evidence):
59
- ```json
60
- {"action_type": "query_logs", "parameters": {"service": "postgres-db"}}
61
- {"action_type": "check_metrics", "parameters": {"service": "auth-service"}}
62
- {"action_type": "check_dependencies", "parameters": {"service": "api-gateway"}}
63
- {"action_type": "check_recent_deploys", "parameters": {"service": "analytics-service"}}
64
- {"action_type": "check_service_status", "parameters": {"service": "payment-service"}}
65
- ```
66
-
67
- **Remediation actions** (fix the incident):
68
- ```json
69
- {"action_type": "restart_service", "parameters": {"service": "postgres-db"}}
70
- {"action_type": "rollback_deploy", "parameters": {"service": "network-infra", "target_version": "previous"}}
71
- {"action_type": "scale_service", "parameters": {"service": "image-service", "replicas": 10}}
72
- {"action_type": "disable_feature_flag", "parameters": {"flag": "full_history_export"}}
73
- {"action_type": "execute_runbook_step", "parameters": {"runbook_action": "restore_bgp_routes"}}
74
- ```
75
-
76
- **Submission actions** (end the episode):
77
- ```json
78
- {"action_type": "submit_severity", "parameters": {"severity": "P1", "service": "postgres-db"}}
79
- {"action_type": "submit_root_cause", "parameters": {"service": "analytics-service", "failure_mode": "unbounded query OOM killing postgres-db"}}
80
- {"action_type": "submit_resolution", "parameters": {"summary": "Disabled analytics job, restarted postgres-db..."}}
81
- ```
82
-
83
- ## Observation Space
84
-
85
- | Field | Type | Description |
86
- |---|---|---|
87
- | `episode_id` | string | Unique episode UUID |
88
- | `task_id` | string | Active task |
89
- | `scenario_id` | string | Scenario (e.g. `AC-001`) |
90
- | `step_count` / `max_steps` | int | Current step and budget |
91
- | `incident_summary` | string | Plain-text incident description |
92
- | `alert` | dict | Alert payload with severity, symptoms, affected services |
93
- | `available_actions` | list[str] | Valid action types for this task |
94
- | `queried_data` | dict | All tool responses gathered so far |
95
- | `known_services` | list[str] | Exact service names to use in actions |
96
- | `cumulative_reward` | float | Running reward total |
97
- | `done` | bool | Episode terminal flag |
98
- | `feedback` | string | Per-step feedback string |
99
-
100
- ## Reward Function
101
-
102
- Dense reward shaping throughout the trajectory:
103
-
104
- | Event | Reward |
105
- |---|---|
106
- | Query known service (first time) | +0.05 |
107
- | Query known service (repeat) | +0.01 |
108
- | Query unknown service | −0.05 |
109
- | Correct remediation action | +0.10 |
110
- | Wrong remediation action | −0.10 |
111
- | Step past halfway (non-submit) | −0.02 |
112
- | Timeout without submission | −0.10 |
113
- | Grader score (terminal step) | 0.0–1.0 |
114
-
115
- **Grader scoring** (deterministic, via `GET /grader`):
116
-
117
- | Task | Scoring Logic |
118
- |---|---|
119
- | `alert_classification` | 1.0 exact · 0.5 adjacent · 0.25 two-off · 0.0 wrong/none |
120
- | `root_cause_analysis` | 0.6 base (svc+mode) + up to 0.4 efficiency bonus |
121
- | `remediation_planning` | 0.6 base + 0.3 efficiency − 0.15 wrong penalty + 0.1 summary |
122
-
123
- ## API Endpoints
124
-
125
- | Method | Path | Description |
126
- |---|---|---|
127
- | GET | `/` | `{"status":"running",...}` (HF Space health) |
128
- | GET | `/health` | `{"status":"ok","version":"0.1.0"}` |
129
- | POST | `/reset?task_id=...&scenario_index=...` | Start new episode |
130
- | POST | `/step` | Submit action (JSON body) |
131
- | GET | `/state` | Full current episode state |
132
- | GET | `/tasks` | All tasks with action schemas |
133
- | GET | `/grader` | Score current episode (0.0–1.0) |
134
- | POST | `/baseline` | Run inference.py, return scores |
135
-
136
- ## Setup & Usage
137
-
138
- ### Local development
139
- ```bash
140
- pip install -r requirements.txt
141
- uvicorn server.app:app --host 0.0.0.0 --port 7860
142
- ```
143
-
144
- ### Docker
145
- ```bash
146
- docker build -t cloud-incident-env .
147
- docker run -p 7860:7860 \
148
- -e API_BASE_URL="https://api-inference.huggingface.co/v1" \
149
- -e MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
150
- -e HF_TOKEN="hf_your_token" \
151
- cloud-incident-env
152
- ```
153
-
154
- ### Run inference script
155
- ```bash
156
- export API_BASE_URL="https://api-inference.huggingface.co/v1"
157
- export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
158
- export HF_TOKEN="hf_your_token"
159
- python inference.py
160
- ```
161
-
162
- ### Quick API test
163
- ```bash
164
- # Start new episode
165
- curl -X POST "http://localhost:7860/reset?task_id=alert_classification&scenario_index=0"
166
-
167
- # Submit an action
168
- curl -X POST http://localhost:7860/step \
169
- -H "Content-Type: application/json" \
170
- -d '{"action_type":"query_logs","parameters":{"service":"api-gateway"}}'
171
-
172
- # Check score
173
- curl http://localhost:7860/grader
174
- ```
175
-
176
- ## Baseline Scores
177
-
178
- Using `meta-llama/Llama-3.1-8B-Instruct` via HF Inference API:
179
-
180
- | Task | Scenario 0 | Scenario 1 | Average |
181
- |---|---|---|---|
182
- | `alert_classification` | ~1.00 | ~0.50 | ~0.75 |
183
- | `root_cause_analysis` | ~0.45 | ~0.35 | ~0.40 |
184
- | `remediation_planning` | ~0.25 | ~0.20 | ~0.23 |
185
- | **overall** | | | **~0.46** |
186
-
187
- *Run `python inference.py` to reproduce.*
188
-
189
- ## Project Structure
190
-
191
- ```
192
- .
193
- ├── Dockerfile
194
- ├── README.md
195
- ├── requirements.txt
196
- ├── openenv.yaml
197
- ├── tasks.py # Scenario definitions (6 scenarios across 3 tasks)
198
- ├── graders.py # Deterministic graders for all tasks
199
- ├── inference.py # Baseline agent + smart fallback logic
200
- └── server/
201
-     ├── __init__.py
202
-     ├── app.py # FastAPI endpoints
203
-     ├── environment.py # Core OpenEnv step/reset/state logic
204
-     └── models.py # Typed Pydantic models (Action, Observation, Reward)
205
- ```
 
1
  ---
2
+ title: Cloud Incident Response Openenv
3
+ emoji: 🌍
4
+ colorFrom: indigo
5
+ colorTo: indigo
6
  sdk: docker
 
7
  pinned: false
 
 
 
 
 
 
 
 
8
  ---
9
 
10
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
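
The README removed in this commit describes the `alert_classification` grader as scoring 1.0 for an exact severity match, 0.5 for an adjacent level, 0.25 for two levels off, and 0.0 otherwise. A minimal sketch of that rule follows. The function and constant names are hypothetical, and the actual logic in `graders.py` is not shown in this diff and may differ.

```python
# Hypothetical sketch of the alert_classification scoring rule from the
# removed README; the real implementation in graders.py may differ.
SEVERITIES = ["P1", "P2", "P3", "P4"]

# Score keyed by how many severity levels separate the two answers.
SCORE_BY_DISTANCE = {0: 1.0, 1: 0.5, 2: 0.25}


def grade_alert_classification(submitted, expected):
    """Return 1.0 exact, 0.5 adjacent, 0.25 two-off, 0.0 wrong or missing."""
    if submitted not in SEVERITIES or expected not in SEVERITIES:
        # A missing or malformed submission earns no credit.
        return 0.0
    distance = abs(SEVERITIES.index(submitted) - SEVERITIES.index(expected))
    # A distance of three (P1 vs P4) falls through to 0.0.
    return SCORE_BY_DISTANCE.get(distance, 0.0)
```

Keeping the scores in a lookup table keyed by distance makes the rule deterministic and easy to compare against the table in the removed README.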