---
title: SRE Agent
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - sre
  - agent
  - smolagents
  - aiops
  - monitoring
  - incident-response
  - root-cause-analysis
  - time-series
  - anomaly-detection
short_description: AI SRE Agent for incident analysis and RCA
---

# 🔧 SRE Agent — AI-Powered Site Reliability Engineering

An intelligent, multi-agent SRE system built with [smolagents](https://huggingface.co/docs/smolagents). It investigates incidents, analyzes time-series metrics, parses logs, performs root cause analysis (RCA), and generates incident reports, executive summaries, and weekly SRE reports.

## 🏗️ Architecture

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                          SRE Manager (CodeAgent)                             │
│    Orchestrates the full incident investigation & reporting lifecycle        │
│    planning_interval=3 for periodic re-assessment                            │
├──────────┬──────────────┬──────────────┬──────────────┬──────────────────────┤
│ Metrics  │  Log Agent   │  RCA Agent   │  Infra Agent │  Reporting Agent     │
│  Agent   │ (ToolCalling)│ (ToolCalling)│ (ToolCalling)│  (ToolCalling)       │
│(ToolCall)│              │              │              │                      │
├──────────┼──────────────┼──────────────┼──────────────┼──────────────────────┤
│ • Anomaly│ • Log Parser │ • Correlator │ • Resource   │ • Incident Report    │
│  Detector│ • Log Anomaly│ • Dependency │   Util.      │ • Executive Summary  │
│ • Fore-  │   Detector   │   Analyzer   │ • Health     │ • Weekly Report      │
│   caster │ • Pattern    │ • Change     │   Checker    │ • Report Formatter   │
│ • Correl.│   Extractor  │   Correlation│ • Alert      │   (MD/JSON/HTML/     │
│ • Stats  │              │              │   Summary    │    Slack/Summary)    │
│          │              │              │ • SLO Checker│                      │
│          │              │              │ • Runbook    │                      │
└──────────┴──────────────┴──────────────┴──────────────┴──────────────────────┘
```

**6 Agents** (1 Manager + 5 Workers) with **19 specialized tools**.
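
The layout above could be wired up in smolagents roughly as follows. This is a minimal sketch, not the Space's actual `app.py`: the agent names, the `tool_registry` argument, and the `max_steps` default are assumptions made for illustration, while the tool names and `planning_interval=3` come from this README. `smolagents` is imported lazily so the roster can be inspected without the library installed.

```python
# Worker roster taken from the architecture diagram.
# The agent names (keys) are hypothetical; the tool names come from the README.
WORKER_TOOLS = {
    "metrics_agent": [
        "timeseries_anomaly_detector", "timeseries_forecaster",
        "timeseries_correlator", "metric_stats",
    ],
    "log_agent": ["log_parser", "log_anomaly_detector", "log_pattern_extractor"],
    "rca_agent": ["rca_correlator", "service_dependency_analyzer", "change_correlator"],
    "infra_agent": [
        "resource_utilization", "service_health_checker",
        "alert_summary", "slo_checker", "runbook_search",
    ],
    "reporting_agent": [
        "incident_report_generator", "executive_summary",
        "sre_weekly_report", "report_formatter",
    ],
}


def build_sre_manager(model, tool_registry, max_steps=12):
    """Build a manager CodeAgent that delegates to five ToolCallingAgents.

    `tool_registry` is an assumed dict mapping each tool name to a
    smolagents Tool instance; `max_steps=12` is an illustrative value.
    """
    from smolagents import CodeAgent, ToolCallingAgent  # lazy import

    workers = [
        ToolCallingAgent(
            tools=[tool_registry[name] for name in tool_names],
            model=model,
            name=agent_name,
            description=f"Specialist with tools: {', '.join(tool_names)}",
            max_steps=max_steps,
        )
        for agent_name, tool_names in WORKER_TOOLS.items()
    ]
    # The manager owns no tools itself; it plans and calls the workers.
    return CodeAgent(
        tools=[],
        model=model,
        managed_agents=workers,
        planning_interval=3,
        max_steps=max_steps,
    )
```

Here `model` can be any smolagents model object, for example one created from the `MODEL_ID` environment variable.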

### Design Principles (Literature-Backed)

| Principle | Source | Implementation |
|-----------|--------|----------------|
| Multi-agent collaboration | [OpsAgent](https://arxiv.org/abs/2510.24145) | Manager CodeAgent + 5 specialist ToolCallingAgents |
| Recursive trace traversal | [AMER-RCL](https://arxiv.org/abs/2601.02732) | ServiceDependencyAnalyzer with BFS topology walk |
| Fast/Slow detection cascade | [CloudAnoBench](https://arxiv.org/abs/2508.01844) | Z-score (fast) → Isolation Forest (slow) cascade |
| Statistical pre-filter → LLM | [RCACopilot](https://arxiv.org/abs/2507.03224) | Tools do statistical analysis, LLM interprets results |
| Anti-stalling planning | [Reasoning Failure Taxonomy](https://arxiv.org/abs/2601.22208) | planning_interval=3, max_steps limits, cross-modal checks |
| Separation of concerns | Single Responsibility Principle (SRP) | Dedicated reporting agent separated from RCA and remediation |
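
The fast/slow cascade in the table can be illustrated with a small standard-library sketch. Note the substitution: the slow stage here is a median/MAD modified z-score standing in for Isolation Forest, so the example runs without scikit-learn. The function names, the 3.5 cutoff for the slow stage, and the sample data are assumptions, not the Space's code.

```python
import statistics


def fast_zscore(values, threshold=3.0):
    """Fast stage: indices whose global z-score magnitude exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def slow_confirm(values, index, threshold=3.5):
    """Slow stage stand-in: modified z-score from the median and MAD.

    The README's slow stage is Isolation Forest; this is a stdlib substitute.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return values[index] != med
    return 0.6745 * abs(values[index] - med) / mad > threshold


def cascade(values):
    """Run the slow check only on points the fast check flagged."""
    results = []
    for i in fast_zscore(values):
        votes = 1 + int(slow_confirm(values, i))
        # Consensus is the fraction of the two detectors that agree.
        results.append({"index": i, "value": values[i], "consensus": votes / 2})
    return results


# Illustrative CPU samples with a single spike at index 10.
cpu = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50, 95,
       50, 51, 49, 50, 52, 48, 50, 51, 49, 50]
anomalies = cascade(cpu)
```

Running the expensive detector only on candidates keeps the common case cheap, which is the point of the cascade.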

## 🧰 19 Specialized SRE Tools

### 📊 Time-Series Analysis (4 tools)
| Tool | Description |
|------|-------------|
| `timeseries_anomaly_detector` | Multi-method anomaly detection: Z-score + Isolation Forest with consensus scoring |
| `timeseries_forecaster` | Holt's Exponential Smoothing with confidence intervals and threshold breach alerting |
| `timeseries_correlator` | Cross-correlation with lag analysis — finds leading/lagging indicators in metric cascades |
| `metric_stats` | Comprehensive statistics: percentiles, trend analysis, coefficient of variation |
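
As a rough illustration of what `timeseries_forecaster` is described as doing, the sketch below implements Holt's linear (double) exponential smoothing and a threshold-breach check. The function names and the `alpha`/`beta` defaults are assumptions, and confidence intervals are omitted for brevity.

```python
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=5):
    """Forecast `horizon` steps ahead with Holt's linear exponential smoothing."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Initialise the level with the first point and the trend with the first difference.
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


def first_breach(forecast, threshold):
    """Return the 1-based forecast step that first reaches the threshold, or None."""
    for step, value in enumerate(forecast, start=1):
        if value >= threshold:
            return step
    return None


# Illustrative memory-usage samples (percent) with a steady upward trend.
memory_pct = [10, 12, 14, 16, 18]
forecast = holt_forecast(memory_pct, horizon=3)
breach_step = first_breach(forecast, threshold=23)
```

On a perfectly linear series the smoothed trend equals the true slope, so the forecast simply continues the line.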

### 📝 Log Analysis (3 tools)
| Tool | Description |
|------|-------------|
| `log_parser` | Structured log parsing with severity filtering, error burst detection, and pattern matching |
| `log_anomaly_detector` | Template-based anomaly detection — finds new error patterns and frequency shifts |
| `log_pattern_extractor` | Extracts error codes, exception types, service names, and key phrases for RCA |
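
Template-based log anomaly detection can be sketched as follows: mask the variable parts of each message to obtain a template, then compare template counts between a baseline window and the current window. The masking rules, the `ratio` default, and the sample log lines are assumptions for illustration, and the comparison assumes both windows cover similar time spans.

```python
import re
from collections import Counter

# Order matters: mask IPs and hex values before bare integers.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(message):
    """Replace variable tokens so messages with the same shape share one template."""
    for pattern, token in _MASKS:
        message = pattern.sub(token, message)
    return message


def template_anomalies(baseline, current, ratio=3.0):
    """Report templates that are new, or at least `ratio` times more frequent."""
    base = Counter(to_template(m) for m in baseline)
    cur = Counter(to_template(m) for m in current)
    new = sorted(t for t in cur if t not in base)
    shifted = sorted(t for t in cur if t in base and cur[t] >= ratio * base[t])
    return {"new_templates": new, "frequency_shifts": shifted}


# Illustrative log lines, not real output from the Space.
baseline_logs = [
    "Request 101 served in 12 ms",
    "Request 102 served in 15 ms",
    "Retry 1 for 10.0.0.5",
]
current_logs = [
    "Request 103 served in 11 ms",
    "Retry 2 for 10.0.0.7",
    "Retry 3 for 10.0.0.7",
    "Retry 4 for 10.0.0.9",
    "Timeout after 30 s connecting to 10.0.0.9",
]
report = template_anomalies(baseline_logs, current_logs)
```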

### 🔍 Root Cause Analysis (3 tools)
| Tool | Description |
|------|-------------|
| `rca_correlator` | Multi-signal correlation engine — temporal alignment, hypothesis generation, confidence scoring |
| `service_dependency_analyzer` | Topology analysis with blast radius calculation, SPOFs, and investigation ordering |
| `change_correlator` | Correlates deployments, config changes, and scaling events with incident timing |
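
A blast radius calculation via a BFS topology walk might look like the sketch below. The topology format (service mapped to the services it depends on) and the function name are assumptions; `payment-service` and `database-primary` appear in this README, while the other service names are made up for the example.

```python
from collections import deque


def blast_radius(depends_on, failed):
    """Return every service that transitively depends on `failed`.

    The result maps each affected service to its hop distance from the
    failed service, found by BFS over the reversed dependency edges.
    """
    # Invert the graph: for each dependency, list the services that call it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    distance = {failed: 0}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for caller in dependents.get(node, []):
            if caller not in distance:
                distance[caller] = distance[node] + 1
                queue.append(caller)
    del distance[failed]  # the failed service is not part of its own blast radius
    return distance


# Hypothetical topology: service -> services it depends on.
topology = {
    "api-gateway": ["payment-service", "user-service"],
    "payment-service": ["database-primary"],
    "user-service": ["database-primary"],
    "database-primary": [],
}
impacted = blast_radius(topology, "database-primary")
```

Sorting the result by hop distance gives one possible investigation order, starting with the services closest to the failure.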

### ⚙️ Infrastructure & Alerting (5 tools)
| Tool | Description |
|------|-------------|
| `alert_summary` | Active alerts grouped by severity/service with correlation analysis |
| `slo_checker` | SLO compliance checking with error budget calculation and burn rate analysis |
| `runbook_search` | Keyword-matched runbook search with step-by-step remediation procedures |
| `resource_utilization` | Pod-level CPU/memory/disk/network metrics with aggregate health scoring |
| `service_health_checker` | Comprehensive health check: endpoint, dependencies, metrics, deployment, certificates |
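
The error budget and burn rate arithmetic behind a tool like `slo_checker` can be sketched as below. The function name, the returned field names, and the request counts are assumptions; the formulas are the standard ones (error budget = 1 − SLO target, burn rate = observed error rate ÷ error budget).

```python
def slo_status(total_requests, failed_requests, slo_target=0.999):
    """Compute SLO compliance, burn rate, and remaining error budget for one window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")

    error_budget = 1.0 - slo_target               # allowed failure fraction
    error_rate = failed_requests / total_requests
    allowed_failures = error_budget * total_requests
    return {
        "availability": 1.0 - error_rate,
        "compliant": error_rate <= error_budget,
        # A burn rate of 1.0 spends the budget exactly over the SLO window.
        "burn_rate": error_rate / error_budget,
        "budget_remaining": 1.0 - failed_requests / allowed_failures,
    }


# Illustrative numbers: 500 failures out of 1,000,000 requests against a 99.9% SLO.
status = slo_status(total_requests=1_000_000, failed_requests=500)
```

With these inputs roughly half the error budget is spent, so the burn rate is about 0.5.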

### 📄 Reporting (4 tools) — NEW
| Tool | Description |
|------|-------------|
| `incident_report_generator` | Structured incident post-mortem with timeline, RCA, impact assessment, and follow-up items |
| `executive_summary` | Concise business-impact summaries for VP/C-level/board audiences |
| `sre_weekly_report` | Comprehensive weekly operational reports: SLO trends, incidents, MTTR, alert noise, deployments, capacity forecasting, and AI-generated recommendations |
| `report_formatter` | Multi-format output: Markdown, JSON, HTML, Slack mrkdwn, or TL;DR summary |
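
One way a multi-format formatter could be structured is shown below. The report schema (`title` plus a list of `findings`), the format keys, and the sample report are assumptions for illustration; the Space's real `report_formatter` may take different inputs.

```python
import html
import json


def format_report(report, fmt="markdown"):
    """Render a report dict with 'title' and 'findings' in the requested format."""
    title, findings = report["title"], report["findings"]
    if fmt == "json":
        return json.dumps(report, indent=2)
    if fmt == "html":
        # Escape user-supplied text so characters like '>' render safely.
        items = "".join(f"<li>{html.escape(item)}</li>" for item in findings)
        return f"<h1>{html.escape(title)}</h1><ul>{items}</ul>"
    if fmt == "summary":
        return f"{title}: {findings[0]}" if findings else title
    if fmt == "markdown":
        lines = [f"# {title}", ""] + [f"- {item}" for item in findings]
    elif fmt == "slack":
        # Slack mrkdwn uses single asterisks for bold.
        lines = [f"*{title}*"] + [f"• {item}" for item in findings]
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return "\n".join(lines)


# Illustrative report content, not real incident data.
sample_report = {
    "title": "payment-service outage",
    "findings": ["p99 latency > 2s", "Rollback restored service"],
}
as_markdown = format_report(sample_report, "markdown")
as_slack = format_report(sample_report, "slack")
```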

## 💬 Example Prompts

Try these to see the agent in action:

1. **Full incident investigation:**
   > "Investigate the payment-service outage. Check what's alerting, analyze metrics and logs, find the root cause, and generate an incident report."

2. **Time-series analysis:**
   > "Detect anomalies in CPU utilization and p99 latency for the payment-service. Also check if these metrics are correlated."

3. **Root cause analysis:**
   > "Run a root cause analysis for the payment-service. Check recent changes, analyze service dependencies, and identify the most likely root cause."

4. **Capacity planning:**
   > "Forecast memory usage for the database-primary. Will we hit capacity limits in the next 30 periods?"

5. **Quick health check:**
   > "Check the health of the payment-service and its SLO compliance."

6. **Log analysis:**
   > "Parse the logs for the payment-service, focusing on ERROR and CRITICAL messages. Are there any error bursts?"

7. **Executive summary:** (NEW)
   > "Generate an executive summary of the current incident for the VP of Engineering."

8. **Weekly SRE report:** (NEW)
   > "Generate the SRE weekly report for this week, including SLO compliance, incidents, alert noise analysis, and capacity forecasting."

9. **Multi-format report:** (NEW)
   > "Generate an incident report for the payment-service outage and format it as HTML for the internal wiki."

## 🔬 Technical Details

- **Agent Framework:** [smolagents](https://huggingface.co/docs/smolagents) v1.24+
- **LLM Backbone:** Qwen/Qwen2.5-Coder-32B-Instruct (configurable via `MODEL_ID` env var)
- **Anomaly Detection:** Z-score (|z| > 3) + Isolation Forest (contamination=0.05) with consensus scoring
- **Forecasting:** Holt's Exponential Smoothing with configurable confidence intervals
- **Correlation:** Pearson cross-correlation with lag analysis (max_lag configurable)
- **Report Formats:** Markdown, JSON, HTML, Slack mrkdwn, TL;DR summary
- **Simulated Environment:** All tools support an `'auto'` mode that generates realistic simulated microservice data
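
The lagged Pearson cross-correlation listed above can be sketched with the standard library alone. The function names, the sign convention for the lag, and the sample series are assumptions; only the technique and the `max_lag` parameter come from this README.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / sqrt(var_x * var_y)


def best_lag(leader, follower, max_lag=5):
    """Find the lag with the strongest correlation.

    A positive lag means `follower` trails `leader` by that many samples.
    """
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = leader[: len(leader) - lag], follower[lag:]
        else:
            a, b = leader[-lag:], follower[: len(follower) + lag]
        if len(a) < 3:
            continue  # too few overlapping points to be meaningful
        r = pearson(a, b)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Illustrative data: latency repeats the CPU pattern two samples later.
cpu = [1, 2, 8, 3, 1, 2, 9, 2, 1, 1, 7, 2]
latency = [1, 1] + cpu[:-2]
lag, r = best_lag(cpu, latency, max_lag=5)
```

A clearly positive best lag suggests the first metric is a leading indicator of the second, which helps order the steps of a metric cascade.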

## 📚 References

- [TrioXpert: Automated Incident Management](https://arxiv.org/abs/2506.10043) — Multi-modal preprocessing pipeline
- [AMER-RCL: Agentic Memory Enhanced Recursive Reasoning](https://arxiv.org/abs/2601.02732) — Recursive trace-based RCL
- [CloudAnoBench: Context-Aware Anomaly Detection](https://arxiv.org/abs/2508.01844) — Fast/slow detection + symbolic verifier
- [OpsAgent: Self-Evolving Multi-Agent](https://arxiv.org/abs/2510.24145) — Training-free data processor + multi-agent RCA
- [LLM Reasoning Failures for RCA](https://arxiv.org/abs/2601.22208) — 16-category failure taxonomy
- [RCACopilot: Root Cause Analysis](https://arxiv.org/abs/2507.03224) — Statistical pre-filter + LLM reasoning
- [Time-Series Anomaly Detection Survey](https://arxiv.org/abs/2412.20512) — Comprehensive method taxonomy