---
title: SRE Agent
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
- sre
- agent
- smolagents
- aiops
- monitoring
- incident-response
- root-cause-analysis
- time-series
- anomaly-detection
short_description: AI SRE Agent for incident analysis and RCA
---
SRE Agent: AI-Powered Site Reliability Engineering
An intelligent, multi-agent SRE system built with smolagents. It can investigate incidents, analyze time-series metrics, perform root cause analysis, parse logs, and generate incident reports, executive summaries, and weekly SRE reports.
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│                           SRE Manager (CodeAgent)                            │
│      Orchestrates the full incident investigation & reporting lifecycle      │
│                planning_interval=3 for periodic re-assessment                │
├──────────┬──────────────┬──────────────┬──────────────┬──────────────────────┤
│ Metrics  │ Log Agent    │ RCA Agent    │ Infra Agent  │ Reporting Agent      │
│ Agent    │ (ToolCalling)│ (ToolCalling)│ (ToolCalling)│ (ToolCalling)        │
│(ToolCall)│              │              │              │                      │
├──────────┼──────────────┼──────────────┼──────────────┼──────────────────────┤
│ • Anomaly│ • Log Parser │ • Correlator │ • Resource   │ • Incident Report    │
│  Detector│ • Log Anomaly│ • Dependency │   Util.      │ • Executive Summary  │
│ • Fore-  │   Detector   │   Analyzer   │ • Health     │ • Weekly Report      │
│   caster │ • Pattern    │ • Change     │   Checker    │ • Report Formatter   │
│ • Correl.│   Extractor  │   Correlation│ • Alert      │   (MD/JSON/HTML/     │
│ • Stats  │              │              │   Summary    │    Slack/Summary)    │
│          │              │              │ • SLO Checker│                      │
│          │              │              │ • Runbook    │                      │
└──────────┴──────────────┴──────────────┴──────────────┴──────────────────────┘
6 Agents (1 Manager + 5 Workers) with 19 specialized tools.
Design Principles (Literature-Backed)
| Principle | Source | Implementation |
|---|---|---|
| Multi-agent collaboration | OpsAgent | Manager CodeAgent + 5 specialist ToolCallingAgents |
| Recursive trace traversal | AMER-RCL | ServiceDependencyAnalyzer with BFS topology walk |
| Fast/Slow detection cascade | CloudAnoBench | Z-score (fast) → Isolation Forest (slow) cascade |
| Statistical pre-filter → LLM | RCACopilot | Tools do statistical analysis, LLM interprets results |
| Anti-stalling planning | Reasoning Failure Taxonomy | planning_interval=3, max_steps limits, cross-modal checks |
| Separation of concerns | SRP | Dedicated reporting agent separated from RCA and remediation |
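The fast/slow cascade row can be illustrated with a short sketch. It is not the Space's implementation: the function names are hypothetical, and the slow stage uses a median/MAD modified z-score as a stand-in for Isolation Forest so that the example needs only the standard library. What it shows is the control flow, where the expensive check runs only on points the cheap check has already flagged.

```python
import statistics


def fast_zscore_flags(values, threshold=3.0):
    """Fast stage: indices whose z-score magnitude exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # constant series, nothing can stand out
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def slow_robust_flags(values, candidates, threshold=3.5):
    """Slow stage (stand-in for Isolation Forest): modified z-score on candidates."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # Degenerate spread: cannot refine, so keep the fast-stage verdict.
        return list(candidates)
    return [
        i for i in candidates
        if 0.6745 * abs(values[i] - median) / mad > threshold
    ]


def cascade(values):
    """Run the slow stage only when the fast stage flags something."""
    candidates = fast_zscore_flags(values)
    if not candidates:
        return []
    return slow_robust_flags(values, candidates)
```

A point is reported only when both stages agree, which is one simple way to read "consensus" between a cheap and an expensive detector.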
19 Specialized SRE Tools
Time-Series Analysis (4 tools)
| Tool | Description |
|---|---|
| `timeseries_anomaly_detector` | Multi-method anomaly detection: Z-score + Isolation Forest with consensus scoring |
| `timeseries_forecaster` | Holt's Exponential Smoothing with confidence intervals and threshold breach alerting |
| `timeseries_correlator` | Cross-correlation with lag analysis to find leading/lagging indicators in metric cascades |
| `metric_stats` | Comprehensive statistics: percentiles, trend analysis, coefficient of variation |
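Lag analysis of the kind described for timeseries_correlator can be sketched as a Pearson correlation evaluated at several shifts. This is an illustrative standard-library version, not the tool's code; `best_lag` and its sign convention (a positive lag means the first series leads the second) are assumptions of this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def best_lag(x, y, max_lag=5):
    """Return (lag, r) with the largest |r|; lag > 0 pairs x[t] with y[t + lag]."""
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        if len(a) < 3:
            continue  # too little overlap to say anything
        r = pearson(a, b)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best
```

If a CPU series leads a latency series by two samples, `best_lag(cpu, latency)` returns a lag of 2, which marks CPU as the leading indicator.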
Log Analysis (3 tools)
| Tool | Description |
|---|---|
| `log_parser` | Structured log parsing with severity filtering, error burst detection, and pattern matching |
| `log_anomaly_detector` | Template-based anomaly detection that finds new error patterns and frequency shifts |
| `log_pattern_extractor` | Extracts error codes, exception types, service names, and key phrases for RCA |
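Template-based log anomaly detection can be sketched by masking the variable parts of each line and comparing template counts between a baseline window and the current window. The masking rule, the `ratio` parameter, and the function names below are assumptions made for illustration; the Space's log_anomaly_detector may work differently.

```python
import re
from collections import Counter


def to_template(line):
    """Mask digit runs so lines with the same shape share one template."""
    return re.sub(r"\d+(?:\.\d+)?", "<NUM>", line).strip()


def template_anomalies(baseline, current, ratio=3.0):
    """Return (new_templates, shifted_templates) for the current window.

    new:     templates never seen in the baseline
    shifted: known templates whose count grew by at least `ratio` times
    """
    base = Counter(to_template(line) for line in baseline)
    cur = Counter(to_template(line) for line in current)
    new = sorted(t for t in cur if t not in base)
    shifted = sorted(t for t in cur if t in base and cur[t] >= ratio * base[t])
    return new, shifted
```

New templates point at error patterns that did not exist before the incident, while shifted templates point at known messages that suddenly became frequent.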
Root Cause Analysis (3 tools)
| Tool | Description |
|---|---|
| `rca_correlator` | Multi-signal correlation engine: temporal alignment, hypothesis generation, confidence scoring |
| `service_dependency_analyzer` | Topology analysis with blast radius calculation, SPOFs, and investigation ordering |
| `change_correlator` | Correlates deployments, config changes, and scaling events with incident timing |
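A blast radius calculation over a service topology can be sketched as a breadth-first walk along reverse dependency edges. The graph shape (a dict mapping each service to the services that depend on it), the function name, and every service name other than payment-service and database-primary are hypothetical.

```python
from collections import deque


def blast_radius(dependents, failed):
    """BFS from the failed service; returns {affected_service: hops_from_failure}."""
    dist = {failed: 0}
    queue = deque([failed])
    while queue:
        svc = queue.popleft()
        for upstream in dependents.get(svc, []):
            if upstream not in dist:  # first visit is the shortest path in BFS
                dist[upstream] = dist[svc] + 1
                queue.append(upstream)
    dist.pop(failed)  # the failed service itself is not part of the radius
    return dist


# Hypothetical topology: key -> services that call it.
TOPOLOGY = {
    "database-primary": ["payment-service", "order-service"],
    "payment-service": ["checkout-api"],
    "order-service": ["checkout-api"],
    "checkout-api": ["web-frontend"],
}
```

Sorting the result by hop count gives one plausible investigation order, starting with the services closest to the failure.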
Infrastructure & Alerting (5 tools)
| Tool | Description |
|---|---|
| `alert_summary` | Active alerts grouped by severity/service with correlation analysis |
| `slo_checker` | SLO compliance checking with error budget calculation and burn rate analysis |
| `runbook_search` | Keyword-matched runbook search with step-by-step remediation procedures |
| `resource_utilization` | Pod-level CPU/memory/disk/network metrics with aggregate health scoring |
| `service_health_checker` | Comprehensive health check: endpoint, dependencies, metrics, deployment, certificates |
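The error budget and burn rate arithmetic behind an SLO check can be sketched as follows. The function name, argument names, and returned keys are hypothetical; the formulas are the common definitions, where the budget is the allowed failure fraction (1 minus the target) and the burn rate is the observed error rate divided by that budget.

```python
def slo_status(total, failed, target=0.999):
    """Summarize SLO compliance for one window of request counts."""
    if total <= 0 or not 0 < target < 1:
        raise ValueError("need total > 0 and 0 < target < 1")
    budget_fraction = 1.0 - target              # allowed failure fraction
    allowed_failures = total * budget_fraction  # budget expressed in requests
    error_rate = failed / total
    return {
        "error_rate": error_rate,
        # 1.0 means the budget is being spent exactly as fast as it accrues.
        "burn_rate": error_rate / budget_fraction,
        # Fraction of the window's budget still unspent (negative if overspent).
        "budget_remaining": 1.0 - failed / allowed_failures,
        "compliant": failed <= allowed_failures,
    }
```

For example, 500 failures out of 1,000,000 requests against a 99.9% target spends about half the budget, so the burn rate is roughly 0.5 and the window is compliant.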
Reporting (4 tools, NEW)
| Tool | Description |
|---|---|
| `incident_report_generator` | Structured incident post-mortem with timeline, RCA, impact assessment, and follow-up items |
| `executive_summary` | Concise business-impact summaries for VP/C-level/board audiences |
| `sre_weekly_report` | Comprehensive weekly operational reports: SLO trends, incidents, MTTR, alert noise, deployments, capacity forecasting, and AI-generated recommendations |
| `report_formatter` | Multi-format output: Markdown, JSON, HTML, Slack mrkdwn, or TL;DR summary |
Example Prompts
Try these to see the agent in action:
Full incident investigation:
"Investigate the payment-service outage. Check what's alerting, analyze metrics and logs, find the root cause, and generate an incident report."
Time-series analysis:
"Detect anomalies in CPU utilization and p99 latency for the payment-service. Also check if these metrics are correlated."
Root cause analysis:
"Run a root cause analysis for the payment-service. Check recent changes, analyze service dependencies, and identify the most likely root cause."
Capacity planning:
"Forecast memory usage for the database-primary. Will we hit capacity limits in the next 30 periods?"
Quick health check:
"Check the health of the payment-service and its SLO compliance."
Log analysis:
"Parse the logs for the payment-service, focusing on ERROR and CRITICAL messages. Are there any error bursts?"
Executive summary: (NEW)
"Generate an executive summary of the current incident for the VP of Engineering."
Weekly SRE report: (NEW)
"Generate the SRE weekly report for this week, including SLO compliance, incidents, alert noise analysis, and capacity forecasting."
Multi-format report: (NEW)
"Generate an incident report for the payment-service outage and format it as HTML for the internal wiki."
Technical Details
- Agent Framework: smolagents v1.24+
- LLM Backbone: Qwen/Qwen2.5-Coder-32B-Instruct (configurable via the `MODEL_ID` env var)
- Anomaly Detection: Z-score (deviation > 3σ) + Isolation Forest (contamination=0.05) with consensus scoring
- Forecasting: Holt's Exponential Smoothing with configurable confidence intervals
- Correlation: Pearson cross-correlation with lag analysis (max_lag configurable)
- Report Formats: Markdown, JSON, HTML, Slack mrkdwn, TL;DR summary
- Simulated Environment: All tools support an `'auto'` mode with realistic microservice data generation
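Holt's method with threshold breach alerting, as listed under Forecasting, can be sketched in a few lines. This is a minimal illustration under assumed smoothing parameters (`alpha`, `beta`) and hypothetical function names; it omits the confidence intervals mentioned above and is not the Space's forecaster.

```python
def holt_forecast(series, horizon, alpha=0.5, beta=0.3):
    """Holt's linear (double) exponential smoothing; returns `horizon` forecasts."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Initialize level with the first point and trend with the first difference.
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # h-step-ahead forecast extends the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]


def first_breach(forecast, threshold):
    """1-based step at which the forecast first reaches the threshold, else None."""
    for step, value in enumerate(forecast, start=1):
        if value >= threshold:
            return step
    return None
```

Calling `first_breach(holt_forecast(memory_usage, horizon=30), threshold=limit)` answers the capacity question of whether a limit is reached within the next 30 periods, where `memory_usage` and `limit` are placeholders for your own data.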
References
- TrioXpert: Automated Incident Management (multi-modal preprocessing pipeline)
- AMER-RCL: Agentic Memory Enhanced Recursive Reasoning (recursive trace-based RCL)
- CloudAnoBench: Context-Aware Anomaly Detection (fast/slow detection + symbolic verifier)
- OpsAgent: Self-Evolving Multi-Agent (training-free data processor + multi-agent RCA)
- LLM Reasoning Failures for RCA (16-category failure taxonomy)
- RCACopilot: Root Cause Analysis (statistical pre-filter + LLM reasoning)
- Time-Series Anomaly Detection Survey (comprehensive method taxonomy)