Commit b2476be by aamrinder (verified) · 1 Parent(s): aed7de4

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +127 -5
README.md CHANGED
@@ -1,10 +1,132 @@
1
  ---
2
- title: Sre Incident Env
3
- emoji: 😻
4
  colorFrom: red
5
- colorTo: green
6
  sdk: docker
7
- pinned: false
8
  ---
9
 
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
1
  ---
2
+ title: SRE Incident Response Environment
3
+ emoji: 🔧
4
  colorFrom: red
5
+ colorTo: yellow
6
  sdk: docker
7
+ app_port: 8000
8
+ tags:
9
+ - openenv
10
  ---
11
 
12
+ # SRE Incident Response Environment
13
+
14
+ An OpenEnv environment where AI agents diagnose and resolve production infrastructure incidents in a simulated microservices cluster.
15
+
16
+ ## Motivation
17
+
18
+ Site Reliability Engineering (SRE) incident response is a high-stakes, real-world task that on-call engineers perform daily. Agents must investigate alerts, trace dependencies, identify root causes, and apply fixes under time pressure, all while avoiding destructive actions on healthy services.
19
+
20
+ ## Action Space
21
+
22
+ The agent sends structured commands:
23
+
24
+ ```python
25
+ SREAction(command="check_logs", target="api-gateway", parameters={"lines": 20})
26
+ ```
27
+
28
+ | Command | Target | Parameters | Description |
29
+ |---------|--------|------------|-------------|
30
+ | `check_logs` | service | `{lines: int}` | View recent log entries |
31
+ | `get_metrics` | service | | CPU, memory, latency, error rate |
32
+ | `list_alerts` | (none) | | All active alerts |
33
+ | `check_dependencies` | service | | Dependency graph |
34
+ | `check_network` | service | | Network connections |
35
+ | `check_processes` | service | | Running processes with PIDs |
36
+ | `restart_service` | service | | Restart a service |
37
+ | `scale_service` | service | `{replicas: int}` | Scale up/down |
38
+ | `rollback_service` | service | | Roll back to the previous deploy |
39
+ | `kill_process` | service | `{pid: str}` | Kill a specific process |
40
+ | `update_config` | service | `{key, value}` | Update config |
41
+ | `rotate_credentials` | service | | Rotate service credentials |
42
+ | `clear_disk` | service | `{path: str}` | Clear disk space |
43
+ | `submit_diagnosis` | (none) | `{root_cause, affected_services}` | Submit root cause |
44
+
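The command table above can be turned into a small client-side check before an action is sent. This is a minimal sketch, not the environment's own validation: `ActionSketch`, `ALLOWED_PARAMS`, `NO_TARGET`, and `validate` are hypothetical names, and treating every parameter as optional is an assumption (the Usage example calls `check_logs` without `lines`).

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Parameter keys each command accepts, transcribed from the table above.
ALLOWED_PARAMS = {
    "check_logs": {"lines"},
    "get_metrics": set(),
    "list_alerts": set(),
    "check_dependencies": set(),
    "check_network": set(),
    "check_processes": set(),
    "restart_service": set(),
    "scale_service": {"replicas"},
    "rollback_service": set(),
    "kill_process": {"pid"},
    "update_config": {"key", "value"},
    "rotate_credentials": set(),
    "clear_disk": {"path"},
    "submit_diagnosis": {"root_cause", "affected_services"},
}
# Commands the table lists without a target service.
NO_TARGET = {"list_alerts", "submit_diagnosis"}


@dataclass
class ActionSketch:
    """Hypothetical stand-in for SREAction; the real class may differ."""
    command: str
    target: str | None = None
    parameters: dict = field(default_factory=dict)


def validate(action: ActionSketch) -> list[str]:
    """Return a list of problems; an empty list means the action looks well-formed."""
    if action.command not in ALLOWED_PARAMS:
        return [f"unknown command: {action.command}"]
    problems = []
    if action.command not in NO_TARGET and not action.target:
        problems.append(f"{action.command} needs a target service")
    extra = set(action.parameters) - ALLOWED_PARAMS[action.command]
    if extra:
        problems.append(f"unexpected parameters: {sorted(extra)}")
    return problems
```

Checking locally like this could save an agent a step on a malformed command, although how the server itself handles invalid actions is not specified here.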
45
+ ## Observation Space
46
+
47
+ ```python
48
+ SREObservation(
49
+     output: str,  # Command result text
50
+     alerts: list[dict],  # Active alerts
51
+     system_health: float,  # 0-100 cluster health
52
+     services_status: dict,  # {service: "healthy"|"degraded"|"down"}
53
+     step_count: int,
54
+     max_steps: int,
55
+     available_commands: list[str],
56
+     done: bool,
57
+     reward: float | None,
58
+ )
59
+ ```
60
+
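The field listing above is a schema rather than runnable Python. As a rough illustration, the same fields can be written as a dataclass with two helpers an agent might want. `ObservationSketch`, `unhealthy_services`, and `steps_remaining` are hypothetical names; the real `SREObservation` may be defined differently (for example as a Pydantic model).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ObservationSketch:
    """Hypothetical mirror of the SREObservation fields listed above."""
    output: str                      # Command result text
    alerts: list[dict]               # Active alerts
    system_health: float             # 0-100 cluster health
    services_status: dict[str, str]  # service -> "healthy" | "degraded" | "down"
    step_count: int
    max_steps: int
    available_commands: list[str]
    done: bool
    reward: float | None = None


def unhealthy_services(obs: ObservationSketch) -> list[str]:
    """Names of services whose status is anything other than "healthy", sorted."""
    return sorted(name for name, status in obs.services_status.items() if status != "healthy")


def steps_remaining(obs: ObservationSketch) -> int:
    """Steps left before the episode hits max_steps (never negative)."""
    return max(obs.max_steps - obs.step_count, 0)
```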
61
+ ## Tasks
62
+
63
+ ### Easy: Memory Leak in API Gateway
64
+ A single service (`api-gateway`) has a memory leak that causes OOM kills. Log signals are clear and there are no red herrings. **Optimal: ~5 steps. Max: 15 steps.**
65
+
66
+ ### Medium: Cascading Database Failure
67
+ The `postgres-primary` connection pool is exhausted, causing cascading failures across 3 dependent services. Includes red herring alerts on `cache-service`. **Optimal: ~10 steps. Max: 20 steps.**
68
+
69
+ ### Hard: Crypto-Mining Attack + Disk Full
70
+ A compromised `worker-service` is running a crypto miner (`xmrig`) while the disk on `log-aggregator` is full. The agent must kill the malicious process, roll back the deployment, rotate credentials, and clear the disk. **Optimal: ~15 steps. Max: 25 steps.**
71
+
72
+ ## Reward Design
73
+
74
+ The grader runs at every step. Each step's reward is the **increase** in grader score since the last step:
75
+
76
+ - Step makes progress (e.g. restarts the right service) → reward > 0
77
+ - Step makes no progress (e.g. checks an irrelevant service) → reward = 0
78
+ - Sum of all step rewards = final grader score (0.0-1.0)
79
+
80
+ Each task's grader evaluates weighted binary criteria (e.g., "Was the root cause service restarted?" = 0 or 1, weight 0.4). The final score is the weighted average.
81
+
82
+ Progressive system degradation creates time pressure: services get worse each step, making criteria harder to satisfy if the agent is slow.
83
+
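The reward scheme described above can be sketched in a few lines: a grader that takes the weighted average of binary criteria, and per-step rewards computed as the change in that score. This is an illustration under assumptions, not the environment's grader. The function names are hypothetical, and whether a drop in score yields a negative reward or is clipped to zero is not stated in this section; the sketch uses the raw difference, which keeps the sum of rewards equal to the final score.

```python
from __future__ import annotations


def grader_score(criteria: dict[str, tuple[bool, float]]) -> float:
    """Weighted average of binary criteria: sum(weight * met) / sum(weight)."""
    total_weight = sum(weight for _, weight in criteria.values())
    met_weight = sum(weight for met, weight in criteria.values() if met)
    return met_weight / total_weight


def step_rewards(scores: list[float]) -> list[float]:
    """Per-step reward = grader score after the step minus the score before it."""
    rewards, previous = [], 0.0
    for score in scores:
        rewards.append(score - previous)
        previous = score
    return rewards


# Hypothetical criteria; only the 0.4 weight for restarting the root cause
# service comes from the text above.
criteria = {
    "root_cause_restarted": (True, 0.4),
    "diagnosis_correct": (False, 0.3),
    "health_restored": (False, 0.3),
}
score_now = grader_score(criteria)

# Grader score after each of four steps; steps 1 and 3 make no progress.
rewards = step_rewards([0.0, 0.4, 0.4, 0.7])
```

Because the rewards telescope, their sum always equals the last grader score, which matches the third bullet above.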
84
+ ## Setup
85
+
86
+ ```bash
87
+ # Install
88
+ pip install openenv-core
89
+
90
+ # Run locally
91
+ git clone <this-repo>
92
+ cd sre-incident-env
93
+ pip install -r requirements.txt
94
+ uvicorn server.app:app --host 0.0.0.0 --port 8000
95
+
96
+ # Or via Docker
97
+ docker build -t sre-incident-env .
98
+ docker run -p 8000:8000 sre-incident-env
99
+ ```
100
+
101
+ ## Usage
102
+
103
+ ```python
104
+ from sre_incident_env import SREIncidentEnv, SREAction
105
+
106
+ async with SREIncidentEnv(base_url="http://localhost:8000") as env:  # needs an async context (coroutine or notebook)
107
+     result = await env.reset(task_id="easy")
108
+     print(result.observation.output)  # Initial alert description
109
+
110
+     result = await env.step(SREAction(command="check_logs", target="api-gateway"))
111
+     print(result.observation.output)  # Log entries
112
+     print(result.reward)  # Per-step reward
113
+ ```
114
+
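The usage snippet above takes a single step. A full episode would keep stepping until `observation.done` is set. The sketch below shows one way to write that loop. `StubEnv` is a hypothetical stand-in for `SREIncidentEnv` so the example runs without a server, the policy is a placeholder, and actions are plain dicts rather than `SREAction` objects.

```python
import asyncio
from types import SimpleNamespace


class StubEnv:
    """Stand-in for SREIncidentEnv so the loop runs without a server (hypothetical)."""

    def __init__(self, max_steps=3):
        self.max_steps, self.step_count = max_steps, 0

    def _result(self, output, reward):
        obs = SimpleNamespace(output=output, done=self.step_count >= self.max_steps)
        return SimpleNamespace(observation=obs, reward=reward)

    async def reset(self, task_id):
        self.step_count = 0
        return self._result(f"alert for task {task_id}", None)

    async def step(self, action):
        self.step_count += 1
        return self._result(f"ran {action['command']}", 0.1)


async def run_episode(env, policy, task_id="easy"):
    """Step until the observation reports done; return the summed per-step rewards."""
    result = await env.reset(task_id=task_id)
    total = 0.0
    while not result.observation.done:
        result = await env.step(policy(result.observation))
        total += result.reward or 0.0
    return total


# Placeholder policy: always list alerts, regardless of the observation.
total = asyncio.run(run_episode(StubEnv(), lambda obs: {"command": "list_alerts"}))
```

Against the real environment, the same `run_episode` shape would presumably sit inside the `async with SREIncidentEnv(...)` block, with the policy returning `SREAction` instances.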
115
+ ## Baseline Scores
116
+
117
+ | Task | Score | Steps |
118
+ |------|-------|-------|
119
+ | Easy | ~0.60 | 5-8 |
120
+ | Medium | ~0.35 | 10-15 |
121
+ | Hard | ~0.25 | 15-20 |
122
+
123
+ *Scores from Qwen2.5-72B-Instruct via HF Router.*
124
+
125
+ ## Environment Variables
126
+
127
+ | Variable | Default | Description |
128
+ |----------|---------|-------------|
129
+ | `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM API endpoint |
130
+ | `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model identifier |
131
+ | `HF_TOKEN` | (none) | Hugging Face API token |
132
+ | `IMAGE_NAME` | (none) | Docker image name |
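A script that consumes these variables might read them as follows. This is a minimal sketch using only the names and defaults from the table above; `load_settings` and `DEFAULTS` are hypothetical, and how the baseline script actually reads its configuration is not shown in this README.

```python
import os

# Defaults copied from the table above; HF_TOKEN and IMAGE_NAME have none.
DEFAULTS = {
    "API_BASE_URL": "https://router.huggingface.co/v1",
    "MODEL_NAME": "Qwen/Qwen2.5-72B-Instruct",
}


def load_settings(environ=None):
    """Collect the four variables, falling back to the documented defaults."""
    environ = os.environ if environ is None else environ
    settings = {name: environ.get(name, default) for name, default in DEFAULTS.items()}
    settings["HF_TOKEN"] = environ.get("HF_TOKEN")      # None when unset
    settings["IMAGE_NAME"] = environ.get("IMAGE_NAME")  # None when unset
    return settings
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests.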