---
title: SRE Agent
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - sre
  - agent
  - smolagents
  - aiops
  - monitoring
  - incident-response
  - root-cause-analysis
  - time-series
  - anomaly-detection
short_description: AI SRE Agent for incident analysis and RCA
---

# 🔧 SRE Agent: AI-Powered Site Reliability Engineering

A multi-agent SRE system built with [smolagents](https://huggingface.co/docs/smolagents). It can investigate incidents, analyze time-series metrics, parse logs, perform root cause analysis, and generate incident reports, executive summaries, and weekly SRE reports.

## 🏗️ Architecture

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                          SRE Manager (CodeAgent)                             │
│    Orchestrates the full incident investigation & reporting lifecycle        │
│    planning_interval=3 for periodic re-assessment                            │
├──────────┬──────────────┬──────────────┬──────────────┬──────────────────────┤
│ Metrics  │  Log Agent   │  RCA Agent   │  Infra Agent │  Reporting Agent     │
│  Agent   │ (ToolCalling)│ (ToolCalling)│ (ToolCalling)│  (ToolCalling)       │
│(ToolCall)│              │              │              │                      │
├──────────┼──────────────┼──────────────┼──────────────┼──────────────────────┤
│ • Anomaly│ • Log Parser │ • Correlator │ • Resource   │ • Incident Report    │
│  Detector│ • Log Anomaly│ • Dependency │   Util.      │ • Executive Summary  │
│ • Fore-  │   Detector   │   Analyzer   │ • Health     │ • Weekly Report      │
│   caster │ • Pattern    │ • Change     │   Checker    │ • Report Formatter   │
│ • Correl.│   Extractor  │   Correlation│ • Alert      │   (MD/JSON/HTML/     │
│ • Stats  │              │              │   Summary    │    Slack/Summary)    │
│          │              │              │ • SLO Checker│                      │
│          │              │              │ • Runbook    │                      │
└──────────┴──────────────┴──────────────┴──────────────┴──────────────────────┘
```

**6 Agents** (1 Manager + 5 Workers) with **19 specialized tools**.

### Design Principles (Literature-Backed)

| Principle | Source | Implementation |
|-----------|--------|----------------|
| Multi-agent collaboration | [OpsAgent](https://arxiv.org/abs/2510.24145) | Manager CodeAgent + 5 specialist ToolCallingAgents |
| Recursive trace traversal | [AMER-RCL](https://arxiv.org/abs/2601.02732) | ServiceDependencyAnalyzer with BFS topology walk |
| Fast/Slow detection cascade | [CloudAnoBench](https://arxiv.org/abs/2508.01844) | Z-score (fast) → Isolation Forest (slow) cascade |
| Statistical pre-filter → LLM | [RCACopilot](https://arxiv.org/abs/2507.03224) | Tools do statistical analysis, LLM interprets results |
| Anti-stalling planning | [Reasoning Failure Taxonomy](https://arxiv.org/abs/2601.22208) | planning_interval=3, max_steps limits, cross-modal checks |
| Separation of concerns | Single Responsibility Principle (SRP) | Dedicated reporting agent separated from RCA and remediation |
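
The fast/slow cascade in the table can be pictured with a small sketch. This is not the Space's implementation: the function names are hypothetical, and the slow stage uses a median-absolute-deviation check as a dependency-free stand-in for Isolation Forest. The point is only the control flow, where the expensive check runs on the candidates the cheap filter lets through.

```python
from statistics import mean, median, pstdev
from typing import Callable, List, Sequence


def zscore_candidates(values: Sequence[float], threshold: float = 3.0) -> List[int]:
    """Fast stage: indices more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a flat series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def mad_confirms(values: Sequence[float], index: int, threshold: float = 3.5) -> bool:
    """Slow-stage stand-in (NOT Isolation Forest): modified z-score based on the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return values[index] != med
    return 0.6745 * abs(values[index] - med) / mad > threshold


def cascade(
    values: Sequence[float],
    slow_check: Callable[[Sequence[float], int], bool] = mad_confirms,
    fast_threshold: float = 3.0,
) -> List[int]:
    """Run the cheap filter first, then the slow check only on its candidates."""
    return [i for i in zscore_candidates(values, fast_threshold) if slow_check(values, i)]


# Hypothetical CPU utilization samples with one spike at the end.
cpu = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50,
       50, 51, 49, 50, 52, 48, 50, 51, 49, 95]
anomalies = cascade(cpu)
```

Passing a different `slow_check` callable is how a heavier detector could be swapped in without touching the fast stage.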

## 🧰 19 Specialized SRE Tools

### πŸ“Š Time-Series Analysis (4 tools)
| Tool | Description |
|------|-------------|
| `timeseries_anomaly_detector` | Multi-method anomaly detection: Z-score + Isolation Forest with consensus scoring |
| `timeseries_forecaster` | Holt's Exponential Smoothing with confidence intervals and threshold breach alerting |
| `timeseries_correlator` | Cross-correlation with lag analysis to find leading/lagging indicators in metric cascades |
| `metric_stats` | Comprehensive statistics: percentiles, trend analysis, coefficient of variation |
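
As an illustration of the forecasting idea behind `timeseries_forecaster`, here is a minimal sketch of Holt's linear (double) exponential smoothing with a threshold-breach check. The function names, default smoothing factors, and sample data are assumptions for the example, and confidence intervals are left out for brevity.

```python
from typing import List, Optional, Sequence


def holt_forecast(values: Sequence[float], horizon: int,
                  alpha: float = 0.5, beta: float = 0.3) -> List[float]:
    """Forecast `horizon` future points from a smoothed level and trend."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    level = float(values[0])
    trend = float(values[1] - values[0])
    for y in values[1:]:
        previous_level = level
        # Blend the new observation with the previous one-step-ahead prediction.
        level = alpha * y + (1 - alpha) * (level + trend)
        # Blend the latest change in level with the previous trend estimate.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + step * trend for step in range(1, horizon + 1)]


def first_breach(forecast: Sequence[float], threshold: float) -> Optional[int]:
    """Return the 1-based forecast step that first reaches the threshold, if any."""
    for step, value in enumerate(forecast, start=1):
        if value >= threshold:
            return step
    return None


# Hypothetical memory usage (%) growing by two points per period.
memory = [60.0, 62.0, 64.0, 66.0, 68.0]
forecast = holt_forecast(memory, horizon=5)
breach_step = first_breach(forecast, threshold=75.0)
```

On a perfectly linear series like this one the method reproduces the line, which makes it a convenient sanity check before trying noisier data.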

### 📝 Log Analysis (3 tools)
| Tool | Description |
|------|-------------|
| `log_parser` | Structured log parsing with severity filtering, error burst detection, and pattern matching |
| `log_anomaly_detector` | Template-based anomaly detection that finds new error patterns and frequency shifts |
| `log_pattern_extractor` | Extracts error codes, exception types, service names, and key phrases for RCA |
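
Template-based log anomaly detection can be sketched as follows, assuming a very simple masking scheme. The masks, the `ratio` cutoff, and the sample log lines are hypothetical; the actual tool may template logs differently.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Replace variable parts of a message so that similar lines share one template.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8,}\b"), "<HEX>"),  # long hex ids, hashes
    (re.compile(r"\d+"), "<NUM>"),               # any remaining digits
]


def to_template(message: str) -> str:
    for pattern, token in _MASKS:
        message = pattern.sub(token, message)
    return message


def template_shifts(baseline: Iterable[str], current: Iterable[str],
                    ratio: float = 3.0) -> Dict[str, List[str]]:
    """Report templates unseen in the baseline and templates whose count surged."""
    base = Counter(map(to_template, baseline))
    cur = Counter(map(to_template, current))
    new = sorted(t for t in cur if t not in base)
    surged = sorted(t for t in cur if t in base and cur[t] >= ratio * base[t])
    return {"new": new, "surged": surged}


baseline_logs = [
    "timeout after 30 ms calling db-1",
    "timeout after 45 ms calling db-2",
    "user 17 logged in",
]
current_logs = [
    "user 99 logged in",
    "user 3 logged in",
    "user 4 logged in",
    "connection refused to 10.0.0.5:5432",
    "timeout after 12 ms calling db-9",
]
shifts = template_shifts(baseline_logs, current_logs)
```

Here the "connection refused" line shows up as a new template, while the login template is flagged because its count tripled relative to the baseline window.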

### 🔍 Root Cause Analysis (3 tools)
| Tool | Description |
|------|-------------|
| `rca_correlator` | Multi-signal correlation engine: temporal alignment, hypothesis generation, confidence scoring |
| `service_dependency_analyzer` | Topology analysis with blast radius calculation, SPOFs, and investigation ordering |
| `change_correlator` | Correlates deployments, config changes, and scaling events with incident timing |
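
A blast-radius calculation over a service topology can be sketched with a breadth-first walk. The topology below is invented for illustration (only `payment-service` and `database-primary` appear elsewhere in this README), and `blast_radius` is a hypothetical helper, not the tool's API.

```python
from collections import deque
from typing import Dict, List


def blast_radius(dependents: Dict[str, List[str]], failed: str) -> Dict[str, int]:
    """BFS over 'who depends on me' edges; maps each affected service to its hop count."""
    distance = {failed: 0}
    queue = deque([failed])
    while queue:
        service = queue.popleft()
        for caller in dependents.get(service, []):
            if caller not in distance:  # first visit is the shortest path in BFS
                distance[caller] = distance[service] + 1
                queue.append(caller)
    del distance[failed]  # report only the services impacted by the failure
    return distance


# Hypothetical topology: key = service, value = services that call it.
dependents = {
    "database-primary": ["payment-service", "order-service"],
    "payment-service": ["checkout-api"],
    "order-service": ["checkout-api"],
    "checkout-api": ["web-frontend"],
}
impact = blast_radius(dependents, "database-primary")
```

Sorting the result by hop count gives one plausible investigation order, starting with the services closest to the failing component.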

### ⚙️ Infrastructure & Alerting (5 tools)
| Tool | Description |
|------|-------------|
| `alert_summary` | Active alerts grouped by severity/service with correlation analysis |
| `slo_checker` | SLO compliance checking with error budget calculation and burn rate analysis |
| `runbook_search` | Keyword-matched runbook search with step-by-step remediation procedures |
| `resource_utilization` | Pod-level CPU/memory/disk/network metrics with aggregate health scoring |
| `service_health_checker` | Comprehensive health check: endpoint, dependencies, metrics, deployment, certificates |
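
The error budget and burn rate arithmetic behind an SLO check is short enough to show directly. The helper names and the request counts are assumptions for the example; the formulas are the standard ones, where the budget is the fraction of requests allowed to fail and the burn rate is the observed error rate divided by that fraction.

```python
from typing import Dict


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> Dict[str, float]:
    """Compare failures so far with the failures the SLO allows for this volume."""
    allowed = (1 - slo_target) * total_requests
    consumed = failed_requests / allowed if allowed else float("inf")
    return {
        "allowed_failures": allowed,
        "budget_consumed": consumed,       # 1.0 means the budget is exhausted
        "budget_remaining": 1 - consumed,
    }


def burn_rate(slo_target: float, window_error_rate: float) -> float:
    """A burn rate of 1.0 spends the budget exactly over the SLO period."""
    return window_error_rate / (1 - slo_target)


# Hypothetical numbers: 99.9% availability SLO, 1M requests, 250 failures.
budget = error_budget(0.999, 1_000_000, 250)
rate = burn_rate(0.999, 0.005)  # 0.5% of requests failing in the current window
```

With these inputs about a quarter of the budget is spent, and a 0.5% error rate burns the budget roughly five times faster than the SLO allows.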

### 📄 Reporting (4 tools, NEW)
| Tool | Description |
|------|-------------|
| `incident_report_generator` | Structured incident post-mortem with timeline, RCA, impact assessment, and follow-up items |
| `executive_summary` | Concise business-impact summaries for VP/C-level/board audiences |
| `sre_weekly_report` | Comprehensive weekly operational reports: SLO trends, incidents, MTTR, alert noise, deployments, capacity forecasting, and AI-generated recommendations |
| `report_formatter` | Multi-format output: Markdown, JSON, HTML, Slack mrkdwn, or TL;DR summary |
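
To make the multi-format idea concrete, the sketch below renders one report dictionary as Markdown, JSON, or Slack mrkdwn. The report shape (`title` plus `sections`) and the `format_report` function are assumptions for illustration, not the tool's real schema; HTML and the TL;DR summary are omitted.

```python
import json
from typing import Any, Dict


def format_report(report: Dict[str, Any], fmt: str = "markdown") -> str:
    """Render a {'title': ..., 'sections': {heading: body}} report in one format."""
    if fmt == "json":
        return json.dumps(report, indent=2, sort_keys=True)
    if fmt == "markdown":
        lines = [f"# {report['title']}", ""]
        for heading, body in report["sections"].items():
            lines += [f"## {heading}", body, ""]
    elif fmt == "slack":
        # Slack mrkdwn has no headings; bold text is wrapped in single asterisks.
        lines = [f"*{report['title']}*", ""]
        for heading, body in report["sections"].items():
            lines += [f"*{heading}*", body, ""]
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical report content.
report = {
    "title": "Payment outage",
    "sections": {
        "Impact": "Checkout errors for 12 minutes.",
        "Root cause": "Connection pool exhaustion.",
    },
}
markdown_text = format_report(report, "markdown")
slack_text = format_report(report, "slack")
json_text = format_report(report, "json")
```

Keeping the report as plain data until the last step means each output format only needs its own small rendering branch.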

## 💬 Example Prompts

Try these to see the agent in action:

1. **Full incident investigation:**
   > "Investigate the payment-service outage. Check what's alerting, analyze metrics and logs, find the root cause, and generate an incident report."

2. **Time-series analysis:**
   > "Detect anomalies in CPU utilization and p99 latency for the payment-service. Also check if these metrics are correlated."

3. **Root cause analysis:**
   > "Run a root cause analysis for the payment-service. Check recent changes, analyze service dependencies, and identify the most likely root cause."

4. **Capacity planning:**
   > "Forecast memory usage for the database-primary. Will we hit capacity limits in the next 30 periods?"

5. **Quick health check:**
   > "Check the health of the payment-service and its SLO compliance."

6. **Log analysis:**
   > "Parse the logs for the payment-service, focusing on ERROR and CRITICAL messages. Are there any error bursts?"

7. **Executive summary:** (NEW)
   > "Generate an executive summary of the current incident for the VP of Engineering."

8. **Weekly SRE report:** (NEW)
   > "Generate the SRE weekly report for this week, including SLO compliance, incidents, alert noise analysis, and capacity forecasting."

9. **Multi-format report:** (NEW)
   > "Generate an incident report for the payment-service outage and format it as HTML for the internal wiki."

## 🔬 Technical Details

- **Agent Framework:** [smolagents](https://huggingface.co/docs/smolagents) v1.24+
- **LLM Backbone:** Qwen/Qwen2.5-Coder-32B-Instruct (configurable via `MODEL_ID` env var)
- **Anomaly Detection:** Z-score (more than 3σ from the mean) + Isolation Forest (contamination=0.05) with consensus scoring
- **Forecasting:** Holt's Exponential Smoothing with configurable confidence intervals
- **Correlation:** Pearson cross-correlation with lag analysis (max_lag configurable)
- **Report Formats:** Markdown, JSON, HTML, Slack mrkdwn, TL;DR summary
- **Simulated Environment:** All tools support `'auto'` mode with realistic microservice data generation
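
The lagged Pearson correlation listed above can be sketched in plain Python. `best_lag` and its sign convention (a positive lag means the first series leads the second) are assumptions made for this example, and the two series are synthetic.

```python
from math import sqrt
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def best_lag(x: Sequence[float], y: Sequence[float], max_lag: int = 5) -> Tuple[int, float]:
    """Return (lag, r) with the largest |r|; positive lag means x leads y."""
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        if len(a) < 3:
            continue  # too few overlapping points to be meaningful
        r = pearson(a, b)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Synthetic example: the latency spike follows the CPU spike two samples later.
cpu = [0, 0, 1, 5, 9, 5, 1, 0, 0, 0, 0, 0]
latency = [0, 0, 0, 0, 1, 5, 9, 5, 1, 0, 0, 0]
lag, r = best_lag(cpu, latency, max_lag=4)
```

A clear positive lag such as this one suggests which metric moved first, which is the kind of leading-indicator evidence an RCA step can then examine.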

## 📚 References

- [TrioXpert: Automated Incident Management](https://arxiv.org/abs/2506.10043): multi-modal preprocessing pipeline
- [AMER-RCL: Agentic Memory Enhanced Recursive Reasoning](https://arxiv.org/abs/2601.02732): recursive trace-based RCL
- [CloudAnoBench: Context-Aware Anomaly Detection](https://arxiv.org/abs/2508.01844): fast/slow detection + symbolic verifier
- [OpsAgent: Self-Evolving Multi-Agent](https://arxiv.org/abs/2510.24145): training-free data processor + multi-agent RCA
- [LLM Reasoning Failures for RCA](https://arxiv.org/abs/2601.22208): 16-category failure taxonomy
- [RCACopilot: Root Cause Analysis](https://arxiv.org/abs/2507.03224): statistical pre-filter + LLM reasoning
- [Time-Series Anomaly Detection Survey](https://arxiv.org/abs/2412.20512): comprehensive method taxonomy