---
title: SRE Incident Response Environment
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
- openenv
---
# SRE Incident Response Environment
An OpenEnv environment where AI agents diagnose and resolve production infrastructure incidents in a simulated microservices cluster.
## Motivation
Site Reliability Engineering (SRE) incident response is a high-stakes, real-world task performed daily by engineers worldwide. Agents must investigate alerts, trace dependencies, identify root causes, and apply fixes under time pressure, all while avoiding destructive actions on healthy services.
## Action Space
The agent sends structured commands:
```python
SREAction(command="check_logs", target="api-gateway", parameters={"lines": 20})
```
| Command | Target | Parameters | Description |
|---------|--------|------------|-------------|
| `check_logs` | service | `{lines: int}` | View recent log entries |
| `get_metrics` | service | | CPU, memory, latency, error rate |
| `list_alerts` | - | | All active alerts |
| `check_dependencies` | service | | Dependency graph |
| `check_network` | service | | Network connections |
| `check_processes` | service | | Running processes with PIDs |
| `restart_service` | service | | Restart a service |
| `scale_service` | service | `{replicas: int}` | Scale up/down |
| `rollback_service` | service | | Rollback to previous deploy |
| `kill_process` | service | `{pid: str}` | Kill a specific process |
| `update_config` | service | `{key, value}` | Update config |
| `rotate_credentials` | service | | Rotate service credentials |
| `clear_disk` | service | `{path: str}` | Clear disk space |
| `submit_diagnosis` | - | `{root_cause, affected_services}` | Submit root cause |
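As a rough illustration, the parameter requirements in the table above can be captured in a small lookup. This is a hypothetical helper for pre-flight validation, not part of the package:

```python
# Required parameter keys per command, taken from the table above.
# Commands absent from this map (e.g. restart_service) take no parameters.
REQUIRED_PARAMS: dict[str, set[str]] = {
    "check_logs": {"lines"},
    "scale_service": {"replicas"},
    "kill_process": {"pid"},
    "update_config": {"key", "value"},
    "clear_disk": {"path"},
    "submit_diagnosis": {"root_cause", "affected_services"},
}

def missing_params(command: str, parameters: dict) -> set[str]:
    """Return the parameter keys the command needs but the caller omitted."""
    return REQUIRED_PARAMS.get(command, set()) - parameters.keys()
```

For example, `missing_params("check_logs", {})` flags the absent `lines` key before the action is ever sent to the server.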
## Observation Space
```python
from dataclasses import dataclass

@dataclass
class SREObservation:
    output: str                    # Command result text
    alerts: list[dict]             # Active alerts
    system_health: float           # 0-100 cluster health
    services_status: dict          # {service: "healthy" | "degraded" | "down"}
    step_count: int
    max_steps: int
    available_commands: list[str]
    done: bool
    reward: float | None           # Per-step reward
```
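A policy can triage directly from `services_status`, for instance by investigating the worst-off services first. A minimal sketch (a hypothetical helper, not part of the package):

```python
def unhealthy_services(services_status: dict[str, str]) -> list[str]:
    """Return non-healthy services, worst status first ("down" before "degraded")."""
    order = {"down": 0, "degraded": 1}
    return sorted(
        (svc for svc, status in services_status.items() if status != "healthy"),
        key=lambda svc: order[services_status[svc]],
    )
```

Given `{"api-gateway": "degraded", "postgres-primary": "down", "cache-service": "healthy"}`, this yields `["postgres-primary", "api-gateway"]`, so the agent would inspect the downed database before the merely degraded gateway.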
## Tasks
### Easy: Memory Leak in API Gateway
Single service (`api-gateway`) with memory leak causing OOM kills. Clear log signals, no red herrings. **Optimal: ~5 steps. Max: 15 steps.**
### Medium: Cascading Database Failure
`postgres-primary` connection pool exhausted, causing cascading failures across 3 dependent services. Includes red herring alerts on `cache-service`. **Optimal: ~10 steps. Max: 20 steps.**
### Hard: Crypto-Mining Attack + Disk Full
Compromised `worker-service` running a crypto miner (xmrig), with a concurrent disk-full condition on `log-aggregator`. The agent must kill the malicious process, roll back the deployment, rotate credentials, AND clear disk space. **Optimal: ~15 steps. Max: 25 steps.**
## Reward Design
The grader runs at every step. Each step's reward is the **increase** in grader score since the last step:
- A step that makes progress (e.g. restarts the right service) earns reward > 0
- A step that makes no progress (e.g. checks an irrelevant service) earns reward = 0
- The step rewards sum to the final grader score (0.0-1.0)
Each task's grader evaluates weighted binary criteria (e.g., "Was the root cause service restarted?" = 0 or 1, weight 0.4). The final score is the weighted average.
Progressive system degradation creates time pressure: services get worse each step, making criteria harder to satisfy if the agent is slow.
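The scheme above can be sketched in a few lines. The criteria names and weights here are illustrative, not the actual grader's:

```python
def grader_score(criteria: dict[str, bool], weights: dict[str, float]) -> float:
    """Weighted average of binary criteria, in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[name] for name, passed in criteria.items() if passed) / total

def step_reward(prev_score: float, criteria: dict[str, bool],
                weights: dict[str, float]) -> tuple[float, float]:
    """Per-step reward is the increase in grader score since the last step.

    Returns (reward, new_score); the caller carries new_score forward.
    """
    score = grader_score(criteria, weights)
    return score - prev_score, score
```

With hypothetical criteria `{"restarted_root_cause": True, "diagnosed": False}` and weights `{"restarted_root_cause": 0.4, "diagnosed": 0.6}`, the score is 0.4; satisfying the remaining criterion on a later step would add the other 0.6, so the summed rewards reach the final score of 1.0.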
## Setup
```bash
# Install
pip install openenv-core
# Run locally
git clone <this-repo>
cd sre-incident-env
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 8000
# Or via Docker
docker build -t sre-incident-env .
docker run -p 8000:8000 sre-incident-env
```
## Usage
```python
import asyncio

from sre_incident_env import SREIncidentEnv, SREAction

async def main():
    async with SREIncidentEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(task_id="easy")
        print(result.observation.output)  # Initial alert description

        result = await env.step(SREAction(command="check_logs", target="api-gateway"))
        print(result.observation.output)  # Log entries
        print(result.reward)              # Per-step reward

asyncio.run(main())
```
## Baseline Scores
| Task | Score | Steps |
|------|-------|-------|
| Easy | ~0.60 | 5-8 |
| Medium | ~0.35 | 10-15 |
| Hard | ~0.25 | 15-20 |
*Scores from Qwen2.5-72B-Instruct via HF Router.*
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM API endpoint |
| `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model identifier |
| `HF_TOKEN` | - | Hugging Face access token |
| `IMAGE_NAME` | - | Docker image name |