---
title: SRE Incident Response Environment
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
---

# SRE Incident Response Environment

An OpenEnv environment where AI agents diagnose and resolve production infrastructure incidents in a simulated microservices cluster.

## Motivation

Site Reliability Engineering (SRE) incident response is a high-stakes, real-world task that engineers perform daily. Agents must investigate alerts, trace dependencies, identify root causes, and apply fixes under time pressure, all while avoiding destructive actions on healthy services.

## Action Space

The agent sends structured commands:

```python
SREAction(command="check_logs", target="api-gateway", parameters={"lines": 20})
```
| Command | Target | Parameters | Description |
|---|---|---|---|
| `check_logs` | service | `{lines: int}` | View recent log entries |
| `get_metrics` | service | — | CPU, memory, latency, error rate |
| `list_alerts` | — | — | All active alerts |
| `check_dependencies` | service | — | Dependency graph |
| `check_network` | service | — | Network connections |
| `check_processes` | service | — | Running processes with PIDs |
| `restart_service` | service | — | Restart a service |
| `scale_service` | service | `{replicas: int}` | Scale up/down |
| `rollback_service` | service | — | Roll back to the previous deploy |
| `kill_process` | service | `{pid: str}` | Kill a specific process |
| `update_config` | service | `{key, value}` | Update config |
| `rotate_credentials` | service | — | Rotate service credentials |
| `clear_disk` | service | `{path: str}` | Clear disk space |
| `submit_diagnosis` | — | `{root_cause, affected_services}` | Submit root cause |

## Observation Space

```python
SREObservation(
    output: str,              # Command result text
    alerts: list[dict],       # Active alerts
    system_health: float,     # 0-100 cluster health
    services_status: dict,    # {service: "healthy"|"degraded"|"down"}
    step_count: int,
    max_steps: int,
    available_commands: list[str],
    done: bool,
    reward: float | None,
)
```

## Tasks

### Easy — Memory Leak in API Gateway

A single service (api-gateway) has a memory leak causing OOM kills. Clear log signals, no red herrings. Optimal: ~5 steps. Max: 15 steps.

### Medium — Cascading Database Failure

The postgres-primary connection pool is exhausted, causing cascading failures across 3 dependent services. Includes red-herring alerts on cache-service. Optimal: ~10 steps. Max: 20 steps.

### Hard — Crypto-Mining Attack + Disk Full

A compromised worker-service is running a crypto miner (xmrig), with a concurrent disk-full condition on log-aggregator. The agent must kill the malicious process, roll back the deployment, rotate credentials, AND clear disk space. Optimal: ~15 steps. Max: 25 steps.

## Reward Design

The grader runs at every step. Each step's reward is the increase in the grader score since the previous step:

- Step makes progress (e.g. restarts the right service) → reward > 0
- Step makes no progress (e.g. checks an irrelevant service) → reward = 0
- The sum of all step rewards equals the final grader score (0.0-1.0)

Each task's grader evaluates weighted binary criteria (e.g., "Was the root-cause service restarted?" = 0 or 1, weight 0.4). The final score is the weighted average.

Progressive system degradation creates time pressure: services get worse each step, making criteria harder to satisfy if the agent is slow.
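The delta-reward scheme above can be sketched as follows. The criterion names and weights are illustrative, not the actual graders:

```python
def grader_score(state: dict[str, bool], weights: dict[str, float]) -> float:
    """Weighted average of binary criteria, in [0.0, 1.0]."""
    total = sum(weights.values())
    return sum(weights[c] for c, met in state.items() if met) / total

# Illustrative criteria for the easy task (weights are made up).
weights = {
    "root_cause_identified": 0.4,
    "service_restarted": 0.4,
    "no_destructive_actions": 0.2,
}

# Criterion states after each step of a hypothetical two-step episode.
episode = [
    {"root_cause_identified": True, "service_restarted": False, "no_destructive_actions": True},
    {"root_cause_identified": True, "service_restarted": True, "no_destructive_actions": True},
]

prev = 0.0
step_rewards = []
for state in episode:
    score = grader_score(state, weights)
    step_rewards.append(score - prev)  # reward = increase in grader score
    prev = score

# The step rewards telescope: their sum equals the final grader score.
```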

## Setup

```shell
# Install
pip install openenv-core

# Run locally
git clone <this-repo>
cd sre-incident-env
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Or via Docker
docker build -t sre-incident-env .
docker run -p 8000:8000 sre-incident-env
```

## Usage

```python
from sre_incident_env import SREIncidentEnv, SREAction

async with SREIncidentEnv(base_url="http://localhost:8000") as env:
    result = await env.reset(task_id="easy")
    print(result.observation.output)  # Initial alert description

    result = await env.step(SREAction(command="check_logs", target="api-gateway"))
    print(result.observation.output)  # Log entries
    print(result.reward)              # Per-step reward
```
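A full episode loop builds on the same pattern: step until `done`, summing per-step rewards. This is a sketch against a toy in-memory stand-in so it runs anywhere; `FakeEnv`, its fixed 0.5 reward, and the two-step episode length are invented for illustration, while the `result.reward` / `result.observation.done` fields match the observation schema above.

```python
import asyncio
from types import SimpleNamespace

class FakeEnv:
    """Toy stand-in for SREIncidentEnv, for illustration only."""
    def __init__(self):
        self.steps = 0

    async def reset(self, task_id: str):
        self.steps = 0
        return SimpleNamespace(observation=SimpleNamespace(done=False), reward=None)

    async def step(self, action):
        self.steps += 1
        # Pretend the episode ends after two steps, each worth 0.5.
        return SimpleNamespace(
            observation=SimpleNamespace(done=self.steps >= 2),
            reward=0.5,
        )

async def run_episode(env, actions):
    """Play actions until the episode ends; return the summed per-step reward."""
    await env.reset(task_id="easy")
    total = 0.0
    for action in actions:
        result = await env.step(action)
        if result.reward is not None:
            total += result.reward
        if result.observation.done:
            break
    return total

total = asyncio.run(run_episode(FakeEnv(), ["check_logs", "restart_service", "unused"]))
```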

## Baseline Scores

| Task | Score | Steps |
|---|---|---|
| Easy | ~0.60 | 5-8 |
| Medium | ~0.35 | 10-15 |
| Hard | ~0.25 | 15-20 |

Scores from Qwen2.5-72B-Instruct via the HF Router.

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM API endpoint |
| `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model identifier |
| `HF_TOKEN` | — | HuggingFace API key |
| `IMAGE_NAME` | — | Docker image name |
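For example, to run the container with the defaults from the table (the token value is a placeholder):

```shell
export HF_TOKEN=hf_xxx                                # placeholder; use your own token
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export API_BASE_URL=https://router.huggingface.co/v1
docker run -p 8000:8000 -e HF_TOKEN -e MODEL_NAME -e API_BASE_URL sre-incident-env
```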