---
title: SRE Agent
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
- sre
- agent
- smolagents
- aiops
- monitoring
- incident-response
- root-cause-analysis
- time-series
- anomaly-detection
short_description: AI SRE Agent for incident analysis and RCA
---
# 🔧 SRE Agent: AI-Powered Site Reliability Engineering
An intelligent, multi-agent SRE system built with [smolagents](https://huggingface.co/docs/smolagents). It can investigate incidents, analyze time-series metrics, parse logs, perform root cause analysis, and generate incident reports, executive summaries, and weekly SRE reports.
## 🏗️ Architecture
```
┌──────────────────────────────────────────────────────────────────────────────┐
│                           SRE Manager (CodeAgent)                            │
│      Orchestrates the full incident investigation & reporting lifecycle      │
│                planning_interval=3 for periodic re-assessment                │
├──────────┬──────────────┬──────────────┬──────────────┬──────────────────────┤
│ Metrics  │  Log Agent   │  RCA Agent   │ Infra Agent  │  Reporting Agent     │
│  Agent   │ (ToolCalling)│ (ToolCalling)│ (ToolCalling)│  (ToolCalling)       │
│(ToolCall)│              │              │              │                      │
├──────────┼──────────────┼──────────────┼──────────────┼──────────────────────┤
│ • Anomaly│ • Log Parser │ • Correlator │ • Resource   │ • Incident Report    │
│  Detector│ • Log Anomaly│ • Dependency │   Util.      │ • Executive Summary  │
│ • Fore-  │   Detector   │   Analyzer   │ • Health     │ • Weekly Report      │
│   caster │ • Pattern    │ • Change     │   Checker    │ • Report Formatter   │
│ • Correl.│   Extractor  │   Correlation│ • Alert      │   (MD/JSON/HTML/     │
│ • Stats  │              │              │   Summary    │    Slack/Summary)    │
│          │              │              │ • SLO Checker│                      │
│          │              │              │ • Runbook    │                      │
└──────────┴──────────────┴──────────────┴──────────────┴──────────────────────┘
```
**6 Agents** (1 Manager + 5 Workers) with **19 specialized tools**.
### Design Principles (Literature-Backed)
| Principle | Source | Implementation |
|-----------|--------|----------------|
| Multi-agent collaboration | [OpsAgent](https://arxiv.org/abs/2510.24145) | Manager CodeAgent + 5 specialist ToolCallingAgents |
| Recursive trace traversal | [AMER-RCL](https://arxiv.org/abs/2601.02732) | ServiceDependencyAnalyzer with BFS topology walk |
| Fast/Slow detection cascade | [CloudAnoBench](https://arxiv.org/abs/2508.01844) | Z-score (fast) → Isolation Forest (slow) cascade |
| Statistical pre-filter → LLM | [RCACopilot](https://arxiv.org/abs/2507.03224) | Tools do statistical analysis, LLM interprets results |
| Anti-stalling planning | [Reasoning Failure Taxonomy](https://arxiv.org/abs/2601.22208) | planning_interval=3, max_steps limits, cross-modal checks |
| Separation of concerns | SRP | Dedicated reporting agent separated from RCA and remediation |
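
The fast stage of the detection cascade in the table above can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual code: the function name `zscore_anomalies` and the sample data are hypothetical, and the slow Isolation Forest stage is left out. In a cascade, only the indices flagged here would be handed to the slower detector.

```python
from statistics import mean, pstdev

def zscore_anomalies(values, threshold=3.0):
    """Return indices whose absolute z-score exceeds `threshold` (hypothetical helper)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # constant series: no point deviates from the mean
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Made-up CPU utilization samples (percent) with a single spike at index 12.
cpu = [41, 43, 42, 40, 44, 42, 41, 43, 42, 40,
       44, 42, 95, 41, 43, 42, 40, 44, 42, 41]
suspects = zscore_anomalies(cpu)
print(suspects)  # candidate indices for the slow stage
```

Note that with very short series a single outlier cannot reach a z-score of 3, so the fast stage needs a reasonably long window to be useful.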
## 🧰 19 Specialized SRE Tools
### Time-Series Analysis (4 tools)
| Tool | Description |
|------|-------------|
| `timeseries_anomaly_detector` | Multi-method anomaly detection: Z-score + Isolation Forest with consensus scoring |
| `timeseries_forecaster` | Holt's Exponential Smoothing with confidence intervals and threshold breach alerting |
| `timeseries_correlator` | Cross-correlation with lag analysis; finds leading/lagging indicators in metric cascades |
| `metric_stats` | Comprehensive statistics: percentiles, trend analysis, coefficient of variation |
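
As an illustration of the lag analysis described for `timeseries_correlator`, the sketch below slides one series against another and keeps the lag with the strongest Pearson correlation. The helper names `pearson` and `best_lag` and the sample data are assumptions made for this example, not the tool's real interface.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / (var_x * var_y) ** 0.5

def best_lag(leader, follower, max_lag=5):
    """Return (lag, r) with the largest |r|.

    A positive lag means `follower` trails `leader` by that many samples.
    """
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = leader[: len(leader) - lag], follower[lag:]
        else:
            x, y = leader[-lag:], follower[: len(follower) + lag]
        if len(x) < 3:
            continue  # too few overlapping points to be meaningful
        r = pearson(x, y)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best

# Made-up data: latency reacts to CPU two samples later.
cpu = [10, 12, 11, 30, 55, 80, 60, 35, 20, 12, 11, 10]
latency = [100, 105] + [5 * c + 50 for c in cpu[:-2]]
lag, r = best_lag(cpu, latency)
print(lag, round(r, 3))
```

A positive best lag marks the first series as a leading indicator of the second, which is the ordering an RCA step would use to decide which metric to inspect first.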
### Log Analysis (3 tools)
| Tool | Description |
|------|-------------|
| `log_parser` | Structured log parsing with severity filtering, error burst detection, and pattern matching |
| `log_anomaly_detector` | Template-based anomaly detection; finds new error patterns and frequency shifts |
| `log_pattern_extractor` | Extracts error codes, exception types, service names, and key phrases for RCA |
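
Template-based detection, as described for `log_anomaly_detector`, usually starts by masking the variable parts of each message so that similar lines collapse into one template. The sketch below shows that idea under simple assumptions: the masking rules, the helper names `to_template` and `new_templates`, and the log lines are all hypothetical.

```python
import re
from collections import Counter

# Order matters: mask IP addresses before bare numbers.
_MASKS = [
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def to_template(message):
    """Replace variable tokens so messages of the same shape compare equal."""
    for pattern, token in _MASKS:
        message = pattern.sub(token, message)
    return message

def new_templates(baseline, current):
    """Templates seen in `current` but never in `baseline`, with their counts."""
    known = {to_template(m) for m in baseline}
    counts = Counter(to_template(m) for m in current)
    return {t: c for t, c in counts.items() if t not in known}

baseline = ["GET /pay took 120 ms", "GET /pay took 98 ms", "connected to 10.0.0.5"]
current = [
    "GET /pay took 340 ms",
    "timeout after 3000 ms calling ledger",
    "timeout after 3000 ms calling ledger",
    "connected to 10.0.0.7",
]
unseen = new_templates(baseline, current)
print(unseen)
```

Frequency shifts could be detected the same way by comparing the per-template counts of the two windows instead of only their membership.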
### Root Cause Analysis (3 tools)
| Tool | Description |
|------|-------------|
| `rca_correlator` | Multi-signal correlation engine: temporal alignment, hypothesis generation, confidence scoring |
| `service_dependency_analyzer` | Topology analysis with blast radius calculation, SPOFs, and investigation ordering |
| `change_correlator` | Correlates deployments, config changes, and scaling events with incident timing |
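
The blast radius calculation mentioned for `service_dependency_analyzer` can be pictured as a breadth-first walk over reverse dependency edges. The sketch below assumes a plain dictionary topology; the function name `blast_radius` and every service other than `payment-service` and `database-primary` are invented for the example.

```python
from collections import deque

def blast_radius(dependents, failed):
    """BFS from `failed` over "is depended on by" edges.

    Returns {service: hops}, where hops is the distance from the failed service.
    """
    hops = {failed: 0}
    queue = deque([failed])
    while queue:
        service = queue.popleft()
        for dependent in dependents.get(service, []):
            if dependent not in hops:  # visit each service once
                hops[dependent] = hops[service] + 1
                queue.append(dependent)
    del hops[failed]  # the failed service is the origin, not part of the radius
    return hops

# Hypothetical topology: key is a service, value lists the services that call it.
dependents = {
    "database-primary": ["payment-service", "order-service"],
    "payment-service": ["checkout-api"],
    "order-service": ["checkout-api"],
    "checkout-api": ["web-frontend"],
}
impacted = blast_radius(dependents, "database-primary")
print(impacted)
```

Sorting the result by hop count gives one possible investigation order, with the services closest to the failure first.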
### ⚙️ Infrastructure & Alerting (5 tools)
| Tool | Description |
|------|-------------|
| `alert_summary` | Active alerts grouped by severity/service with correlation analysis |
| `slo_checker` | SLO compliance checking with error budget calculation and burn rate analysis |
| `runbook_search` | Keyword-matched runbook search with step-by-step remediation procedures |
| `resource_utilization` | Pod-level CPU/memory/disk/network metrics with aggregate health scoring |
| `service_health_checker` | Comprehensive health check: endpoint, dependencies, metrics, deployment, certificates |
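
The error budget and burn rate figures reported by `slo_checker` follow from simple arithmetic. The sketch below uses the common request-based definitions; the function names and the numbers are illustrative assumptions, not values produced by the tool.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize how much of the error budget a window has consumed."""
    allowed = (1.0 - slo_target) * total_requests  # failures the SLO tolerates
    consumed = failed_requests / allowed if allowed else float("inf")
    return {
        "allowed_failures": allowed,
        "budget_consumed": consumed,        # fraction of the budget used
        "budget_remaining": 1.0 - consumed,
    }

def burn_rate(slo_target, window_error_rate):
    """Observed error rate divided by the error rate the SLO permits.

    A burn rate of 1 spends the budget exactly over the SLO period.
    """
    return window_error_rate / (1.0 - slo_target)

# Made-up numbers: a 99.9% availability SLO over one million requests.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
rate = burn_rate(slo_target=0.999, window_error_rate=0.01)
print(budget, rate)
```

Here 250 failures use roughly a quarter of the budget, while a 1% error rate in the current window burns it about ten times faster than the SLO allows.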
### Reporting (4 tools) (NEW)
| Tool | Description |
|------|-------------|
| `incident_report_generator` | Structured incident post-mortem with timeline, RCA, impact assessment, and follow-up items |
| `executive_summary` | Concise business-impact summaries for VP/C-level/board audiences |
| `sre_weekly_report` | Comprehensive weekly operational reports: SLO trends, incidents, MTTR, alert noise, deployments, capacity forecasting, and AI-generated recommendations |
| `report_formatter` | Multi-format output: Markdown, JSON, HTML, Slack mrkdwn, or TL;DR summary |
## 💬 Example Prompts
Try these to see the agent in action:
1. **Full incident investigation:**
> "Investigate the payment-service outage. Check what's alerting, analyze metrics and logs, find the root cause, and generate an incident report."
2. **Time-series analysis:**
> "Detect anomalies in CPU utilization and p99 latency for the payment-service. Also check if these metrics are correlated."
3. **Root cause analysis:**
> "Run a root cause analysis for the payment-service. Check recent changes, analyze service dependencies, and identify the most likely root cause."
4. **Capacity planning:**
> "Forecast memory usage for the database-primary. Will we hit capacity limits in the next 30 periods?"
5. **Quick health check:**
> "Check the health of the payment-service and its SLO compliance."
6. **Log analysis:**
> "Parse the logs for the payment-service, focusing on ERROR and CRITICAL messages. Are there any error bursts?"
7. **Executive summary:** (NEW)
> "Generate an executive summary of the current incident for the VP of Engineering."
8. **Weekly SRE report:** (NEW)
> "Generate the SRE weekly report for this week, including SLO compliance, incidents, alert noise analysis, and capacity forecasting."
9. **Multi-format report:** (NEW)
> "Generate an incident report for the payment-service outage and format it as HTML for the internal wiki."
## 🔬 Technical Details
- **Agent Framework:** [smolagents](https://huggingface.co/docs/smolagents) v1.24+
- **LLM Backbone:** Qwen/Qwen2.5-Coder-32B-Instruct (configurable via `MODEL_ID` env var)
- **Anomaly Detection:** Z-score (points more than 3σ from the mean) + Isolation Forest (contamination=0.05) with consensus scoring
- **Forecasting:** Holt's Exponential Smoothing with configurable confidence intervals
- **Correlation:** Pearson cross-correlation with lag analysis (max_lag configurable)
- **Report Formats:** Markdown, JSON, HTML, Slack mrkdwn, TL;DR summary
- **Simulated Environment:** All tools support `'auto'` mode with realistic microservice data generation
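
To make the forecasting entry above concrete, here is a minimal sketch of Holt's linear exponential smoothing with a threshold breach check. It omits confidence intervals, and the smoothing constants, function names, and memory figures are assumptions chosen for the example rather than the tool's defaults.

```python
def holt_forecast(values, horizon, alpha=0.5, beta=0.3):
    """Holt's linear (double) exponential smoothing; returns `horizon` forecasts."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    level, trend = values[0], values[1] - values[0]
    for v in values[1:]:
        prev_level = level
        # Blend the new observation with the previous one-step-ahead estimate.
        level = alpha * v + (1 - alpha) * (level + trend)
        # Blend the latest level change with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]

def first_breach(forecast, threshold):
    """1-based step at which the forecast first reaches `threshold`, else None."""
    for step, value in enumerate(forecast, start=1):
        if value >= threshold:
            return step
    return None

# Made-up memory usage (percent) growing by two points per period.
memory = list(range(50, 70, 2))
forecast = holt_forecast(memory, horizon=30)
print(first_breach(forecast, threshold=79))
```

A `None` result means the forecast stays below the threshold for the whole horizon, which is the answer a capacity planning prompt would report as "no breach expected".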
## References
- [TrioXpert: Automated Incident Management](https://arxiv.org/abs/2506.10043): Multi-modal preprocessing pipeline
- [AMER-RCL: Agentic Memory Enhanced Recursive Reasoning](https://arxiv.org/abs/2601.02732): Recursive trace-based RCL
- [CloudAnoBench: Context-Aware Anomaly Detection](https://arxiv.org/abs/2508.01844): Fast/slow detection + symbolic verifier
- [OpsAgent: Self-Evolving Multi-Agent](https://arxiv.org/abs/2510.24145): Training-free data processor + multi-agent RCA
- [LLM Reasoning Failures for RCA](https://arxiv.org/abs/2601.22208): 16-category failure taxonomy
- [RCACopilot: Root Cause Analysis](https://arxiv.org/abs/2507.03224): Statistical pre-filter + LLM reasoning
- [Time-Series Anomaly Detection Survey](https://arxiv.org/abs/2412.20512): Comprehensive method taxonomy