title: SRE Agent
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - sre
  - agent
  - smolagents
  - aiops
  - monitoring
  - incident-response
  - root-cause-analysis
  - time-series
  - anomaly-detection
short_description: AI SRE Agent for incident analysis and RCA

🔧 SRE Agent: AI-Powered Site Reliability Engineering

An intelligent, multi-agent SRE system built with smolagents. It can investigate incidents, analyze time-series metrics, perform root cause analysis, parse logs, and generate incident reports, executive summaries, and weekly SRE reports.

🏗️ Architecture

┌──────────────────────────────────────────────────────────────────────────────┐
│                          SRE Manager (CodeAgent)                             │
│    Orchestrates the full incident investigation & reporting lifecycle        │
│    planning_interval=3 for periodic re-assessment                            │
├──────────┬──────────────┬──────────────┬──────────────┬──────────────────────┤
│ Metrics  │  Log Agent   │  RCA Agent   │  Infra Agent │  Reporting Agent     │
│  Agent   │ (ToolCalling)│ (ToolCalling)│ (ToolCalling)│  (ToolCalling)       │
│(ToolCall)│              │              │              │                      │
├──────────┼──────────────┼──────────────┼──────────────┼──────────────────────┤
│ • Anomaly│ • Log Parser │ • Correlator │ • Resource   │ • Incident Report    │
│  Detector│ • Log Anomaly│ • Dependency │   Util.      │ • Executive Summary  │
│ • Fore-  │   Detector   │   Analyzer   │ • Health     │ • Weekly Report      │
│   caster │ • Pattern    │ • Change     │   Checker    │ • Report Formatter   │
│ • Correl.│   Extractor  │   Correlation│ • Alert      │   (MD/JSON/HTML/     │
│ • Stats  │              │              │   Summary    │    Slack/Summary)    │
│          │              │              │ • SLO Checker│                      │
│          │              │              │ • Runbook    │                      │
└──────────┴──────────────┴──────────────┴──────────────┴──────────────────────┘

6 Agents (1 Manager + 5 Workers) with 19 specialized tools.
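
As a rough illustration of this layout, a manager and its workers could be wired together in smolagents along the following lines. This is a sketch rather than the Space's actual `app.py`: the function name `build_sre_manager`, the `worker_tools` mapping, the worker descriptions, and both `max_steps` values are assumptions. Only the agent classes, `planning_interval=3`, and the `MODEL_ID` environment variable come from this README.

```python
import os


def build_sre_manager(worker_tools: dict):
    """Assemble a manager CodeAgent that delegates to ToolCallingAgent workers.

    worker_tools maps a worker name (e.g. "metrics_agent") to its list of
    smolagents tools. The mapping itself is a hypothetical input.
    """
    # Imported lazily so this module still loads where smolagents is absent.
    from smolagents import CodeAgent, InferenceClientModel, ToolCallingAgent

    model = InferenceClientModel(
        model_id=os.environ.get("MODEL_ID", "Qwen/Qwen2.5-Coder-32B-Instruct")
    )

    # One specialist worker per entry; max_steps=8 is an assumed limit.
    workers = [
        ToolCallingAgent(
            tools=tools,
            model=model,
            name=name,
            description=f"Specialist worker: {name}",
            max_steps=8,
        )
        for name, tools in worker_tools.items()
    ]

    # The manager owns no tools of its own and re-plans every 3 steps.
    return CodeAgent(
        tools=[],
        model=model,
        managed_agents=workers,
        planning_interval=3,
        max_steps=15,  # assumed limit
    )
```

Calling `build_sre_manager({"metrics_agent": [...], "log_agent": [...]})` and then `.run("...")` on the result would start an investigation, provided smolagents and model access are available.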

Design Principles (Literature-Backed)

| Principle | Source | Implementation |
|---|---|---|
| Multi-agent collaboration | OpsAgent | Manager CodeAgent + 5 specialist ToolCallingAgents |
| Recursive trace traversal | AMER-RCL | ServiceDependencyAnalyzer with BFS topology walk |
| Fast/Slow detection cascade | CloudAnoBench | Z-score (fast) → Isolation Forest (slow) cascade |
| Statistical pre-filter → LLM | RCACopilot | Tools do statistical analysis, LLM interprets results |
| Anti-stalling planning | Reasoning Failure Taxonomy | planning_interval=3, max_steps limits, cross-modal checks |
| Separation of concerns | SRP | Dedicated reporting agent separated from RCA and remediation |
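
The fast/slow cascade can be pictured with a small dependency-free sketch. The fast stage is a plain Z-score filter with the threshold of 3 quoted in the technical details. The slow stage here is a stand-in based on the median absolute deviation, not the Isolation Forest the agent uses, so that the example runs with the standard library alone. All function names and the sample data are hypothetical.

```python
from statistics import mean, median, pstdev


def zscore_candidates(values, threshold=3.0):
    """Fast stage: indexes whose absolute Z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # a flat series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def mad_confirms(values, index, cutoff=3.5):
    """Stand-in slow stage: modified Z-score from the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return values[index] != med
    return 0.6745 * abs(values[index] - med) / mad > cutoff


def cascade(values, slow_stage=mad_confirms):
    """Run the cheap filter first, then confirm only its candidates."""
    return [i for i in zscore_candidates(values) if slow_stage(values, i)]


# Hypothetical CPU samples with one obvious spike at the end.
cpu = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
       11, 9, 10, 10, 12, 9, 10, 11, 10, 50]
anomalies = cascade(cpu)
```

The point of the ordering is cost: the expensive detector only ever sees the handful of points the cheap one could not rule out.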

🧰 19 Specialized SRE Tools

📊 Time-Series Analysis (4 tools)

| Tool | Description |
|---|---|
| timeseries_anomaly_detector | Multi-method anomaly detection: Z-score + Isolation Forest with consensus scoring |
| timeseries_forecaster | Holt's Exponential Smoothing with confidence intervals and threshold breach alerting |
| timeseries_correlator | Cross-correlation with lag analysis; finds leading/lagging indicators in metric cascades |
| metric_stats | Comprehensive statistics: percentiles, trend analysis, coefficient of variation |
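
To make the forecasting row concrete, here is a minimal sketch of Holt's linear (double) exponential smoothing with a threshold-breach check. It is not the tool's code: the function names, the smoothing defaults, and the sample series are assumptions, and confidence intervals are left out for brevity.

```python
def holt_forecast(series, horizon, alpha=0.5, beta=0.3):
    """Forecast `horizon` future points with Holt's linear trend method."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Initialise the level with the first value and the trend with the first slope.
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


def first_breach(forecast, threshold):
    """Return the 1-based forecast step that first reaches the threshold, or None."""
    for step, value in enumerate(forecast, start=1):
        if value >= threshold:
            return step
    return None


# Hypothetical memory usage (%) growing by 2 points per period.
memory = [50.0 + 2.0 * t for t in range(10)]
forecast = holt_forecast(memory, horizon=30)
breach_step = first_breach(forecast, threshold=79.0)
```

On a perfectly linear series the method tracks the line exactly, so the breach step depends only on the slope and the threshold. Real metrics would need tuned `alpha` and `beta` values.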

📝 Log Analysis (3 tools)

| Tool | Description |
|---|---|
| log_parser | Structured log parsing with severity filtering, error burst detection, and pattern matching |
| log_anomaly_detector | Template-based anomaly detection: finds new error patterns and frequency shifts |
| log_pattern_extractor | Extracts error codes, exception types, service names, and key phrases for RCA |
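
Template-based detection can be sketched as follows: variable tokens are masked so that messages collapse into templates, and templates that appear only in the current window are reported. The masking regex, the function names, and the log lines are illustrative assumptions. Frequency-shift detection is not covered here.

```python
import re
from collections import Counter

# Mask hex literals and (possibly dotted) numbers such as IDs, durations, and IPs.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def to_template(message):
    """Collapse a log message into a template by masking variable tokens."""
    return _VARIABLE.sub("<*>", message)


def new_templates(baseline, current):
    """Count templates seen in `current` that never occur in `baseline`."""
    known = {to_template(m) for m in baseline}
    counts = Counter(to_template(m) for m in current)
    return {t: n for t, n in counts.items() if t not in known}


baseline_logs = [
    "Request 101 served in 12 ms",
    "Request 102 served in 15 ms",
]
current_logs = [
    "Request 230 served in 11 ms",
    "Timeout after 3000 ms calling db-7",
    "Timeout after 3000 ms calling db-2",
]
unseen = new_templates(baseline_logs, current_logs)
```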

🔍 Root Cause Analysis (3 tools)

| Tool | Description |
|---|---|
| rca_correlator | Multi-signal correlation engine: temporal alignment, hypothesis generation, confidence scoring |
| service_dependency_analyzer | Topology analysis with blast radius calculation, SPOFs, and investigation ordering |
| change_correlator | Correlates deployments, config changes, and scaling events with incident timing |
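
A blast-radius calculation over a service topology can be expressed as a breadth-first walk along reversed dependency edges. The sketch below assumes a simple `service -> dependencies` mapping. The function name and the topology are hypothetical, apart from the `payment-service` and `database-primary` names used in the example prompts.

```python
from collections import deque


def blast_radius(depends_on, failed):
    """Return services that transitively depend on `failed`, nearest first."""
    # Invert the edges: for each service, who depends on it?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = {failed}
    queue = deque([failed])
    order = []
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                order.append(service)
                queue.append(service)
    return order


topology = {
    "frontend": ["api-gateway"],
    "api-gateway": ["payment-service", "user-service"],
    "payment-service": ["database-primary"],
    "user-service": ["database-primary"],
    "database-primary": [],
}
impacted = blast_radius(topology, "database-primary")
```

Because BFS visits direct dependents before indirect ones, the returned order doubles as a plausible investigation order, from closest to the failure outward.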

⚙️ Infrastructure & Alerting (5 tools)

| Tool | Description |
|---|---|
| alert_summary | Active alerts grouped by severity/service with correlation analysis |
| slo_checker | SLO compliance checking with error budget calculation and burn rate analysis |
| runbook_search | Keyword-matched runbook search with step-by-step remediation procedures |
| resource_utilization | Pod-level CPU/memory/disk/network metrics with aggregate health scoring |
| service_health_checker | Comprehensive health check: endpoint, dependencies, metrics, deployment, certificates |
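
The arithmetic behind error budgets and burn rates is short enough to show directly. In this sketch the error budget is `1 - target`, and the burn rate is the observed error rate divided by that budget. The `SloStatus` type, the `check_slo` function, and the request counts are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloStatus:
    availability: float      # fraction of successful requests in the window
    budget_remaining: float  # share of the error budget left (negative if overspent)
    burn_rate: float         # 1.0 means errors arrive exactly at the budgeted rate
    compliant: bool


def check_slo(total_requests, failed_requests, target=0.999):
    """Evaluate an availability SLO over one measurement window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    error_rate = failed_requests / total_requests
    error_budget = 1.0 - target
    burn_rate = error_rate / error_budget
    return SloStatus(
        availability=1.0 - error_rate,
        budget_remaining=1.0 - burn_rate,
        burn_rate=burn_rate,
        compliant=error_rate <= error_budget,
    )


healthy = check_slo(1_000_000, 500)     # half of the budget used
breached = check_slo(1_000_000, 2_000)  # twice the budgeted error rate
```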

📄 Reporting (4 tools) (NEW)

| Tool | Description |
|---|---|
| incident_report_generator | Structured incident post-mortem with timeline, RCA, impact assessment, and follow-up items |
| executive_summary | Concise business-impact summaries for VP/C-level/board audiences |
| sre_weekly_report | Comprehensive weekly operational reports: SLO trends, incidents, MTTR, alert noise, deployments, capacity forecasting, and AI-generated recommendations |
| report_formatter | Multi-format output: Markdown, JSON, HTML, Slack mrkdwn, or TL;DR summary |

💬 Example Prompts

Try these to see the agent in action:

  1. Full incident investigation:

    "Investigate the payment-service outage. Check what's alerting, analyze metrics and logs, find the root cause, and generate an incident report."

  2. Time-series analysis:

    "Detect anomalies in CPU utilization and p99 latency for the payment-service. Also check if these metrics are correlated."

  3. Root cause analysis:

    "Run a root cause analysis for the payment-service. Check recent changes, analyze service dependencies, and identify the most likely root cause."

  4. Capacity planning:

    "Forecast memory usage for the database-primary. Will we hit capacity limits in the next 30 periods?"

  5. Quick health check:

    "Check the health of the payment-service and its SLO compliance."

  6. Log analysis:

    "Parse the logs for the payment-service, focusing on ERROR and CRITICAL messages. Are there any error bursts?"

  7. Executive summary: (NEW)

    "Generate an executive summary of the current incident for the VP of Engineering."

  8. Weekly SRE report: (NEW)

    "Generate the SRE weekly report for this week, including SLO compliance, incidents, alert noise analysis, and capacity forecasting."

  9. Multi-format report: (NEW)

    "Generate an incident report for the payment-service outage and format it as HTML for the internal wiki."

🔬 Technical Details

  • Agent Framework: smolagents v1.24+
  • LLM Backbone: Qwen/Qwen2.5-Coder-32B-Instruct (configurable via MODEL_ID env var)
  • Anomaly Detection: Z-score (σ > 3) + Isolation Forest (contamination=0.05) with consensus scoring
  • Forecasting: Holt's Exponential Smoothing with configurable confidence intervals
  • Correlation: Pearson cross-correlation with lag analysis (max_lag configurable)
  • Report Formats: Markdown, JSON, HTML, Slack mrkdwn, TL;DR summary
  • Simulated Environment: All tools support 'auto' mode with realistic microservice data generation
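
The lag analysis mentioned in the correlation bullet can be sketched as a Pearson correlation evaluated at each shift between `-max_lag` and `max_lag`, keeping the shift with the strongest absolute correlation. The function names and sample series below are hypothetical. A positive lag means the second series follows the first.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / sqrt(vx * vy)


def lagged_correlation(x, y, max_lag):
    """Return (lag, r) with the largest |r| for lags in [-max_lag, max_lag]."""
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    if max_lag >= len(x) - 1:
        raise ValueError("max_lag leaves too few overlapping samples")
    best_lag, best_r = 0, pearson(x, y)
    for lag in range(-max_lag, max_lag + 1):
        if lag > 0:
            r = pearson(x[:-lag], y[lag:])   # y trails x by `lag` samples
        elif lag < 0:
            r = pearson(x[-lag:], y[:lag])   # y leads x by `-lag` samples
        else:
            continue
        if abs(r) > abs(best_r):
            best_lag, best_r = lag, r
    return best_lag, best_r


# Hypothetical cascade: latency repeats the CPU pattern two samples later.
cpu = [1.0, 2.0, 1.0, 5.0, 9.0, 5.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0]
latency = [1.0, 1.0] + cpu[:-2]
lag, r = lagged_correlation(cpu, latency, max_lag=4)
```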

📚 References