Spaces:
Sleeping
Sleeping
| title: SRE Incident Response Environment | |
| emoji: π§ | |
| colorFrom: red | |
| colorTo: yellow | |
| sdk: docker | |
| app_port: 8000 | |
| tags: | |
| - openenv | |
| # SRE Incident Response Environment | |
| An OpenEnv environment where AI agents diagnose and resolve production infrastructure incidents in a simulated microservices cluster. | |
| ## Motivation | |
| Site Reliability Engineering (SRE) incident response is a high-stakes, real-world task performed daily by millions of engineers. Agents must investigate alerts, trace dependencies, identify root causes, and apply fixes under time pressure β all while avoiding destructive actions on healthy services. | |
| ## Action Space | |
| The agent sends structured commands: | |
| ```python | |
| SREAction(command="check_logs", target="api-gateway", parameters={"lines": 20}) | |
| ``` | |
| | Command | Target | Parameters | Description | | |
| |---------|--------|------------|-------------| | |
| | `check_logs` | service | `{lines: int}` | View recent log entries | | |
| | `get_metrics` | service | | CPU, memory, latency, error rate | | |
| | `list_alerts` | β | | All active alerts | | |
| | `check_dependencies` | service | | Dependency graph | | |
| | `check_network` | service | | Network connections | | |
| | `check_processes` | service | | Running processes with PIDs | | |
| | `restart_service` | service | | Restart a service | | |
| | `scale_service` | service | `{replicas: int}` | Scale up/down | | |
| | `rollback_service` | service | | Rollback to previous deploy | | |
| | `kill_process` | service | `{pid: str}` | Kill a specific process | | |
| | `update_config` | service | `{key, value}` | Update config | | |
| | `rotate_credentials` | service | | Rotate service credentials | | |
| | `clear_disk` | service | `{path: str}` | Clear disk space | | |
| | `submit_diagnosis` | β | `{root_cause, affected_services}` | Submit root cause | | |
| ## Observation Space | |
| ```python | |
| SREObservation( | |
| output: str, # Command result text | |
| alerts: list[dict], # Active alerts | |
| system_health: float, # 0-100 cluster health | |
| services_status: dict, # {service: "healthy"|"degraded"|"down"} | |
| step_count: int, | |
| max_steps: int, | |
| available_commands: list[str], | |
| done: bool, | |
| reward: float | None, | |
| ) | |
| ``` | |
| ## Tasks | |
| ### Easy β Memory Leak in API Gateway | |
| Single service (`api-gateway`) with memory leak causing OOM kills. Clear log signals, no red herrings. **Optimal: ~5 steps. Max: 15 steps.** | |
| ### Medium β Cascading Database Failure | |
| `postgres-primary` connection pool exhausted, causing cascading failures across 3 dependent services. Includes red herring alerts on `cache-service`. **Optimal: ~10 steps. Max: 20 steps.** | |
| ### Hard β Crypto-Mining Attack + Disk Full | |
| Compromised `worker-service` running crypto miner (xmrig). Concurrent disk full on `log-aggregator`. Agent must kill malicious process, rollback deployment, rotate credentials, AND clear disk. **Optimal: ~15 steps. Max: 25 steps.** | |
| ## Reward Design | |
| The grader runs at every step. Each step's reward is the **increase** in grader score since the last step: | |
| - Step makes progress (e.g. restarts the right service) β reward > 0 | |
| - Step makes no progress (e.g. checks an irrelevant service) β reward = 0 | |
| - Sum of all step rewards = final grader score (0.0-1.0) | |
| Each task's grader evaluates weighted binary criteria (e.g., "Was the root cause service restarted?" = 0 or 1, weight 0.4). The final score is the weighted average. | |
| Progressive system degradation creates time pressure β services get worse each step, making criteria harder to satisfy if the agent is slow. | |
| ## Setup | |
| ```bash | |
| # Install | |
| pip install openenv-core | |
| # Run locally | |
| git clone <this-repo> | |
| cd sre-incident-env | |
| pip install -r requirements.txt | |
| uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| # Or via Docker | |
| docker build -t sre-incident-env . | |
| docker run -p 8000:8000 sre-incident-env | |
| ``` | |
| ## Usage | |
| ```python | |
| from sre_incident_env import SREIncidentEnv, SREAction | |
| async with SREIncidentEnv(base_url="http://localhost:8000") as env: | |
| result = await env.reset(task_id="easy") | |
| print(result.observation.output) # Initial alert description | |
| result = await env.step(SREAction(command="check_logs", target="api-gateway")) | |
| print(result.observation.output) # Log entries | |
| print(result.reward) # Per-step reward | |
| ``` | |
| ## Baseline Scores | |
| | Task | Score | Steps | | |
| |------|-------|-------| | |
| | Easy | ~0.60 | 5-8 | | |
| | Medium | ~0.35 | 10-15 | | |
| | Hard | ~0.25 | 15-20 | | |
| *Scores from Qwen2.5-72B-Instruct via HF Router.* | |
| ## Environment Variables | |
| | Variable | Default | Description | | |
| |----------|---------|-------------| | |
| | `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM API endpoint | | |
| | `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model identifier | | |
| | `HF_TOKEN` | β | HuggingFace API key | | |
| | `IMAGE_NAME` | β | Docker image name | | |