Spaces: ZeroTsai0308 / sre-agent

docs: update README for 6-agent architecture with reporting_agent

be00444 verified 10 days ago

metadata

title: SRE Agent
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - sre
  - agent
  - smolagents
  - aiops
  - monitoring
  - incident-response
  - root-cause-analysis
  - time-series
  - anomaly-detection
short_description: AI SRE Agent for incident analysis and RCA

🔧 SRE Agent — AI-Powered Site Reliability Engineering

An intelligent, multi-agent SRE system built with smolagents. It investigates incidents, analyzes time-series metrics, performs root cause analysis, parses logs, and generates incident reports, executive summaries, and weekly SRE reports.

🏗️ Architecture

┌──────────────────────────────────────────────────────────────────────────────┐
│                          SRE Manager (CodeAgent)                             │
│    Orchestrates the full incident investigation & reporting lifecycle        │
│    planning_interval=3 for periodic re-assessment                            │
├──────────┬──────────────┬──────────────┬──────────────┬──────────────────────┤
│ Metrics  │  Log Agent   │  RCA Agent   │  Infra Agent │  Reporting Agent     │
│  Agent   │ (ToolCalling)│ (ToolCalling)│ (ToolCalling)│  (ToolCalling)       │
│(ToolCall)│              │              │              │                      │
├──────────┼──────────────┼──────────────┼──────────────┼──────────────────────┤
│ • Anomaly│ • Log Parser │ • Correlator │ • Resource   │ • Incident Report    │
│  Detector│ • Log Anomaly│ • Dependency │   Util.      │ • Executive Summary  │
│ • Fore-  │   Detector   │   Analyzer   │ • Health     │ • Weekly Report      │
│   caster │ • Pattern    │ • Change     │   Checker    │ • Report Formatter   │
│ • Correl.│   Extractor  │   Correlation│ • Alert      │   (MD/JSON/HTML/     │
│ • Stats  │              │              │   Summary    │    Slack/Summary)    │
│          │              │              │ • SLO Checker│                      │
│          │              │              │ • Runbook    │                      │
└──────────┴──────────────┴──────────────┴──────────────┴──────────────────────┘

6 Agents (1 Manager + 5 Workers) with 19 specialized tools.
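
The layout above could be wired together in smolagents roughly as follows. This is a minimal sketch, not the code in `app.py`: the worker names, `build_manager`, `tools_by_name`, and the `max_steps` value are assumptions for illustration, while the tool names, `planning_interval=3`, and the `MODEL_ID` default come from this README.

```python
import os

# Default model from the Technical Details section; the MODEL_ID env var overrides it.
MODEL_ID = os.environ.get("MODEL_ID", "Qwen/Qwen2.5-Coder-32B-Instruct")

# Worker name -> tool names as listed in this README (worker names are assumed).
WORKERS = {
    "metrics_agent": ["timeseries_anomaly_detector", "timeseries_forecaster",
                      "timeseries_correlator", "metric_stats"],
    "log_agent": ["log_parser", "log_anomaly_detector", "log_pattern_extractor"],
    "rca_agent": ["rca_correlator", "service_dependency_analyzer",
                  "change_correlator"],
    "infra_agent": ["alert_summary", "slo_checker", "runbook_search",
                    "resource_utilization", "service_health_checker"],
    "reporting_agent": ["incident_report_generator", "executive_summary",
                        "sre_weekly_report", "report_formatter"],
}


def build_manager(model, tools_by_name, max_steps=12):
    """Build the manager agent.

    `tools_by_name` (hypothetical) maps each tool name to a smolagents Tool
    instance; `max_steps=12` is an assumed limit, not a value from the README.
    """
    # Imported lazily so the layout above can be inspected without smolagents.
    from smolagents import CodeAgent, ToolCallingAgent

    workers = [
        ToolCallingAgent(
            tools=[tools_by_name[t] for t in tool_names],
            model=model,
            name=name,
            description=f"Specialist worker with tools: {', '.join(tool_names)}",
        )
        for name, tool_names in WORKERS.items()
    ]
    # The manager owns no tools itself; it delegates to the five workers.
    return CodeAgent(tools=[], model=model, managed_agents=workers,
                     planning_interval=3, max_steps=max_steps)
```

Keeping the agent-to-tool mapping in one dictionary makes the "5 workers, 19 tools" count easy to check when a tool is added or moved.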

Design Principles (Literature-Backed)

Principle	Source	Implementation
Multi-agent collaboration	OpsAgent	Manager CodeAgent + 5 specialist ToolCallingAgents
Recursive trace traversal	AMER-RCL	ServiceDependencyAnalyzer with BFS topology walk
Fast/Slow detection cascade	CloudAnoBench	Z-score (fast) → Isolation Forest (slow) cascade
Statistical pre-filter → LLM	RCACopilot	Tools do statistical analysis, LLM interprets results
Anti-stalling planning	Reasoning Failure Taxonomy	planning_interval=3, max_steps limits, cross-modal checks
Separation of concerns	Single Responsibility Principle (SRP)	Dedicated reporting agent separated from RCA and remediation
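
The fast/slow cascade in the table can be illustrated with a small standard-library sketch. Two things here are assumptions rather than the project's implementation: the slow stage is a median/MAD modified z-score standing in for Isolation Forest (to avoid a scikit-learn dependency), and the rule that only borderline points (2 ≤ |z| ≤ 3) escalate to the slow stage is one plausible reading of "cascade". All function names and thresholds other than 3 are hypothetical.

```python
from statistics import mean, median, pstdev


def zscores(values):
    """Fast stage: plain z-score of every point."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


def robust_scores(values):
    """Slow-stage stand-in: modified z-score from the median and MAD.

    NOTE: the README's slow stage is Isolation Forest; this is a substitute.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [0.0 if mad == 0 else 0.6745 * (v - med) / mad for v in values]


def cascade_detect(values, fast_threshold=3.0, borderline=2.0, slow_threshold=3.5):
    """Return indices flagged by the fast stage or, if borderline, the slow stage."""
    fast = zscores(values)
    slow = None  # computed only when some point needs escalation
    flagged = []
    for i, z in enumerate(fast):
        if abs(z) > fast_threshold:
            flagged.append(i)  # obvious outlier: no slow stage needed
        elif abs(z) >= borderline:
            if slow is None:
                slow = robust_scores(values)
            if abs(slow[i]) > slow_threshold:
                flagged.append(i)
    return flagged
```

The escalation step matters when several outliers inflate the standard deviation and mask each other: in `[10, 11, 9, 10, 11, 9, 10, 11, 9, 10, 30, 30]` neither `30` exceeds |z| > 3, but both are caught by the slower, more robust score.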

🧰 19 Specialized SRE Tools

📊 Time-Series Analysis (4 tools)

Tool	Description
`timeseries_anomaly_detector`	Multi-method anomaly detection: Z-score + Isolation Forest with consensus scoring
`timeseries_forecaster`	Holt's Exponential Smoothing with confidence intervals and threshold breach alerting
`timeseries_correlator`	Cross-correlation with lag analysis — finds leading/lagging indicators in metric cascades
`metric_stats`	Comprehensive statistics: percentiles, trend analysis, coefficient of variation
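
As a rough illustration of what `timeseries_forecaster` describes, the following sketch implements Holt's linear (double) exponential smoothing with a threshold breach check. It is an assumption-laden sketch, not the tool's code: the function names, the `alpha`/`beta` defaults, and the simple initialization are hypothetical, and confidence intervals are left out.

```python
def holt_forecast(series, horizon, alpha=0.5, beta=0.3):
    """Holt's linear exponential smoothing; returns `horizon` point forecasts."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Simple initialization: first value as level, first difference as trend.
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


def first_breach(forecast, threshold):
    """1-based forecast step at which `threshold` is first reached, or None."""
    for step, value in enumerate(forecast, start=1):
        if value >= threshold:
            return step
    return None
```

For a perfectly linear series such as `[10, 12, 14, 16, 18]` the forecast simply continues the line, which makes the function easy to sanity-check before applying it to noisy capacity metrics.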

📝 Log Analysis (3 tools)

Tool	Description
`log_parser`	Structured log parsing with severity filtering, error burst detection, and pattern matching
`log_anomaly_detector`	Template-based anomaly detection — finds new error patterns and frequency shifts
`log_pattern_extractor`	Extracts error codes, exception types, service names, and key phrases for RCA
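
Template-based log anomaly detection, as described for `log_anomaly_detector`, can be sketched by masking variable fields and comparing templates against a baseline. The masking rules, function names, and log lines below are assumptions for illustration; the actual tool may tokenize logs differently.

```python
import re
from collections import Counter

# Order matters: mask the most specific variable fields first.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
                re.I), "<UUID>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<NUM>"),
]


def to_template(message):
    """Replace variable fields so messages of the same kind collapse together."""
    for pattern, token in _MASKS:
        message = pattern.sub(token, message)
    return message


def new_templates(baseline_logs, incident_logs):
    """Templates seen during the incident but never in the baseline, with counts."""
    known = {to_template(m) for m in baseline_logs}
    counts = Counter(to_template(m) for m in incident_logs)
    return {t: n for t, n in counts.items() if t not in known}


# Hypothetical log lines for illustration only.
baseline = ["GET /orders/123 took 45 ms", "GET /orders/77 took 12 ms"]
incident = [
    "GET /orders/9 took 3000 ms",
    "Connection to 10.0.0.5 refused after 3 retries",
    "Connection to 10.0.0.9 refused after 5 retries",
]
print(new_templates(baseline, incident))
```

Frequency shifts for known templates could be detected the same way by comparing the two `Counter` objects instead of only looking for unseen keys.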

🔍 Root Cause Analysis (3 tools)

Tool	Description
`rca_correlator`	Multi-signal correlation engine — temporal alignment, hypothesis generation, confidence scoring
`service_dependency_analyzer`	Topology analysis with blast radius calculation, SPOFs, and investigation ordering
`change_correlator`	Correlates deployments, config changes, and scaling events with incident timing
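
The blast radius calculation mentioned for `service_dependency_analyzer` can be approximated with a breadth-first walk over the dependency graph. This is a sketch under assumptions: `blast_radius`, the input shape (service mapped to its direct dependencies), and the example topology are hypothetical, not taken from the tool.

```python
from collections import deque


def blast_radius(depends_on, failed):
    """Map each service transitively affected by `failed` to its hop distance."""
    # Invert "service -> dependencies" into "service -> direct dependents".
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    distance = {failed: 0}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for svc in sorted(dependents.get(current, ())):
            if svc not in distance:  # first visit is the shortest path in BFS
                distance[svc] = distance[current] + 1
                queue.append(svc)
    del distance[failed]  # the failed service is the origin, not a victim
    return distance


# Hypothetical topology for illustration only.
topology = {
    "api-gateway": ["payment-service", "auth-service"],
    "payment-service": ["database-primary"],
    "auth-service": ["database-primary"],
    "database-primary": [],
}
print(blast_radius(topology, "database-primary"))
```

Sorting services by hop distance gives a natural investigation order: direct dependents first, then services further downstream.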

⚙️ Infrastructure & Alerting (5 tools)

Tool	Description
`alert_summary`	Active alerts grouped by severity/service with correlation analysis
`slo_checker`	SLO compliance checking with error budget calculation and burn rate analysis
`runbook_search`	Keyword-matched runbook search with step-by-step remediation procedures
`resource_utilization`	Pod-level CPU/memory/disk/network metrics with aggregate health scoring
`service_health_checker`	Comprehensive health check: endpoint, dependencies, metrics, deployment, certificates
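
The error budget and burn rate arithmetic behind a tool like `slo_checker` is short enough to show directly. The sketch below assumes a request-based SLO and a 30-day SLO period; the function names and parameters are hypothetical and the real tool may compute these values differently.

```python
def burn_rate(slo_target, total_requests, failed_requests):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly as fast as allowed.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction such as 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_error_rate = 1 - slo_target
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / allowed_error_rate


def hours_to_exhaustion(rate, slo_period_hours=30 * 24):
    """Hours until a full error budget is spent if `rate` stays constant."""
    return float("inf") if rate <= 0 else slo_period_hours / rate
```

For example, a 99.9% SLO with 500 failures in 100,000 requests gives a burn rate of about 5, which would exhaust a fresh 30-day budget in roughly 144 hours if sustained.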

📄 Reporting (4 tools) — NEW

Tool	Description
`incident_report_generator`	Structured incident post-mortem with timeline, RCA, impact assessment, and follow-up items
`executive_summary`	Concise business-impact summaries for VP/C-level/board audiences
`sre_weekly_report`	Comprehensive weekly operational reports: SLO trends, incidents, MTTR, alert noise, deployments, capacity forecasting, and AI-generated recommendations
`report_formatter`	Multi-format output: Markdown, JSON, HTML, Slack mrkdwn, or TL;DR summary
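
Multi-format output of the kind `report_formatter` offers can be sketched as one function that renders the same report dictionary in several ways. The report shape (`title` plus `sections`), the `format_report` name, and the sample content are assumptions; only three of the listed formats are shown, and HTML and TL;DR are omitted.

```python
import json


def format_report(report, fmt="markdown"):
    """Render a report dict with 'title' and 'sections' (heading -> text)."""
    title, sections = report["title"], report["sections"]
    if fmt == "json":
        return json.dumps(report, indent=2)
    if fmt == "markdown":
        parts = [f"# {title}"]
        parts += [f"## {heading}\n{body}" for heading, body in sections.items()]
        return "\n\n".join(parts)
    if fmt == "slack":
        # Slack mrkdwn marks bold with single asterisks and has no headings.
        parts = [f"*{title}*"]
        parts += [f"*{heading}*\n{body}" for heading, body in sections.items()]
        return "\n\n".join(parts)
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical report content for illustration only.
report = {
    "title": "Payment outage",
    "sections": {"Impact": "Checkout errors", "Root cause": "Pool exhaustion"},
}
print(format_report(report, "slack"))
```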

💬 Example Prompts

Try these to see the agent in action:

Full incident investigation:
"Investigate the payment-service outage. Check what's alerting, analyze metrics and logs, find the root cause, and generate an incident report."

Time-series analysis:
"Detect anomalies in CPU utilization and p99 latency for the payment-service. Also check if these metrics are correlated."

Root cause analysis:
"Run a root cause analysis for the payment-service. Check recent changes, analyze service dependencies, and identify the most likely root cause."

Capacity planning:
"Forecast memory usage for the database-primary. Will we hit capacity limits in the next 30 periods?"

Quick health check:
"Check the health of the payment-service and its SLO compliance."

Log analysis:
"Parse the logs for the payment-service, focusing on ERROR and CRITICAL messages. Are there any error bursts?"

Executive summary (NEW):
"Generate an executive summary of the current incident for the VP of Engineering."

Weekly SRE report (NEW):
"Generate the SRE weekly report for this week, including SLO compliance, incidents, alert noise analysis, and capacity forecasting."

Multi-format report (NEW):
"Generate an incident report for the payment-service outage and format it as HTML for the internal wiki."

🔬 Technical Details

Agent Framework: smolagents v1.24+
LLM Backbone: Qwen/Qwen2.5-Coder-32B-Instruct (configurable via MODEL_ID env var)
Anomaly Detection: Z-score (|z| > 3, i.e. more than 3σ from the mean) + Isolation Forest (contamination=0.05) with consensus scoring
Forecasting: Holt's Exponential Smoothing with configurable confidence intervals
Correlation: Pearson cross-correlation with lag analysis (max_lag configurable)
Report Formats: Markdown, JSON, HTML, Slack mrkdwn, TL;DR summary
Simulated Environment: All tools support 'auto' mode with realistic microservice data generation
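
The lag analysis listed under Correlation can be illustrated with a small pure-Python sketch: compute the Pearson coefficient at each shift in `[-max_lag, max_lag]` and keep the strongest one. The function names, the sign convention (positive lag means the first series leads), the equal-length requirement, and the sample data are assumptions, not the tool's actual interface.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def best_lag(x, y, max_lag=5):
    """Return (lag, r) with the largest |r|; lag > 0 means x leads y."""
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[: len(x) - lag], y[lag:]  # pairs x[t] with y[t + lag]
        else:
            a, b = x[-lag:], y[: len(y) + lag]  # pairs x[t + |lag|] with y[t]
        if len(a) < 3:
            continue  # too few overlapping samples to be meaningful
        r = pearson(a, b)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Hypothetical metrics: the latency spike trails the CPU spike by two samples.
cpu = [0, 0, 1, 5, 1, 0, 0, 0, 0, 0]
latency = [0, 0, 0, 0, 1, 5, 1, 0, 0, 0]
lag, r = best_lag(cpu, latency, max_lag=3)
print(lag, round(r, 3))
```

A positive best lag suggests the first metric is a leading indicator of the second, which is the kind of ordering the RCA step can use when reconstructing a metric cascade.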

📚 References

TrioXpert: Automated Incident Management — Multi-modal preprocessing pipeline
AMER-RCL: Agentic Memory Enhanced Recursive Reasoning — Recursive trace-based RCL
CloudAnoBench: Context-Aware Anomaly Detection — Fast/slow detection + symbolic verifier
OpsAgent: Self-Evolving Multi-Agent — Training-free data processor + multi-agent RCA
LLM Reasoning Failures for RCA — 16-category failure taxonomy
RCACopilot: Root Cause Analysis — Statistical pre-filter + LLM reasoning
Time-Series Anomaly Detection Survey — Comprehensive method taxonomy