sre-agent / README.md

ZeroTsai0308

docs: update README for 6-agent architecture with reporting_agent

be00444 verified 11 days ago


	---
	title: SRE Agent
	emoji: 🔧
	colorFrom: red
	colorTo: yellow
	sdk: gradio
	sdk_version: 5.33.0
	app_file: app.py
	pinned: false
	license: apache-2.0
	tags:
	- sre
	- agent
	- smolagents
	- aiops
	- monitoring
	- incident-response
	- root-cause-analysis
	- time-series
	- anomaly-detection
	short_description: AI SRE Agent for incident analysis and RCA
	---

	# 🔧 SRE Agent — AI-Powered Site Reliability Engineering

	An intelligent multi-agent SRE system built with [smolagents](https://huggingface.co/docs/smolagents) that investigates incidents, analyzes time-series metrics, parses logs, performs root cause analysis, and generates incident reports, executive summaries, and weekly SRE reports.

	## 🏗️ Architecture

	```
	┌──────────────────────────────────────────────────────────────────────────────┐
	│                           SRE Manager (CodeAgent)                            │
	│      Orchestrates the full incident investigation & reporting lifecycle      │
	│                planning_interval=3 for periodic re-assessment                │
	├──────────┬──────────────┬──────────────┬──────────────┬──────────────────────┤
	│ Metrics  │ Log Agent    │ RCA Agent    │ Infra Agent  │ Reporting Agent      │
	│ Agent    │ (ToolCalling)│ (ToolCalling)│ (ToolCalling)│ (ToolCalling)        │
	│(ToolCall)│              │              │              │                      │
	├──────────┼──────────────┼──────────────┼──────────────┼──────────────────────┤
	│ • Anomaly│ • Log Parser │ • Correlator │ • Resource   │ • Incident Report    │
	│  Detector│ • Log Anomaly│ • Dependency │   Util.      │ • Executive Summary  │
	│ • Fore-  │   Detector   │   Analyzer   │ • Health     │ • Weekly Report      │
	│   caster │ • Pattern    │ • Change     │   Checker    │ • Report Formatter   │
	│ • Correl.│   Extractor  │   Correlation│ • Alert      │   (MD/JSON/HTML/     │
	│ • Stats  │              │              │   Summary    │   Slack/Summary)     │
	│          │              │              │ • SLO Checker│                      │
	│          │              │              │ • Runbook    │                      │
	└──────────┴──────────────┴──────────────┴──────────────┴──────────────────────┘
	```

	6 Agents (1 Manager + 5 Workers) with 19 specialized tools.

	### Design Principles (Literature-Backed)

	| Principle | Source | Implementation |
	|-----------|--------|----------------|
	| Multi-agent collaboration | [OpsAgent](https://arxiv.org/abs/2510.24145) | Manager CodeAgent + 5 specialist ToolCallingAgents |
	| Recursive trace traversal | [AMER-RCL](https://arxiv.org/abs/2601.02732) | ServiceDependencyAnalyzer with BFS topology walk |
	| Fast/Slow detection cascade | [CloudAnoBench](https://arxiv.org/abs/2508.01844) | Z-score (fast) → Isolation Forest (slow) cascade |
	| Statistical pre-filter → LLM | [RCACopilot](https://arxiv.org/abs/2507.03224) | Tools do statistical analysis, LLM interprets results |
	| Anti-stalling planning | [Reasoning Failure Taxonomy](https://arxiv.org/abs/2601.22208) | `planning_interval=3`, `max_steps` limits, cross-modal checks |
	| Separation of concerns | Single Responsibility Principle (SRP) | Dedicated reporting agent, kept separate from RCA and remediation |
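
The fast/slow cascade in the table can be sketched with the standard library alone. This is an illustration, not the project's code: the slow stage here is a median-absolute-deviation (MAD) check standing in for Isolation Forest so the example has no dependencies, and the function names, thresholds, and sample data are all assumptions.

```python
import statistics


def zscore_flags(values, threshold=3.0):
    """Fast stage: flag points more than `threshold` std devs from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [False] * len(values)
    return [abs(v - mean) / stdev > threshold for v in values]


def mad_flags(values, threshold=3.5):
    """Slow stage stand-in: modified z-score built on the median absolute deviation."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - median) / mad > threshold for v in values]


def cascade_detect(values, fast_threshold=3.0, slow_threshold=3.5):
    """Run the cheap check first; only run the second detector if it fires."""
    fast = zscore_flags(values, fast_threshold)
    if not any(fast):
        return []  # nothing suspicious, skip the slow stage entirely
    slow = mad_flags(values, slow_threshold)
    anomalies = []
    for index, (f, s) in enumerate(zip(fast, slow)):
        if f or s:
            # Consensus score: fraction of detectors that agree on this point.
            anomalies.append({"index": index, "value": values[index],
                              "consensus": (f + s) / 2})
    return anomalies


# Hypothetical p99 latency samples (ms) with a single spike.
latency_ms = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
              10, 9, 11, 10, 95, 10, 11, 9, 10, 10]
anomalies = cascade_detect(latency_ms)
```

Skipping the second detector when the first finds nothing keeps the common case cheap, which is the point of the cascade.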

	## 🧰 19 Specialized SRE Tools

	### 📊 Time-Series Analysis (4 tools)
	| Tool | Description |
	|------|-------------|
	| `timeseries_anomaly_detector` | Multi-method anomaly detection: Z-score + Isolation Forest with consensus scoring |
	| `timeseries_forecaster` | Holt's Exponential Smoothing with confidence intervals and threshold breach alerting |
	| `timeseries_correlator` | Cross-correlation with lag analysis; finds leading/lagging indicators in metric cascades |
	| `metric_stats` | Comprehensive statistics: percentiles, trend analysis, coefficient of variation |
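
To make the forecasting row concrete, here is a minimal sketch of Holt's linear exponential smoothing with a threshold-breach check. It is not the `timeseries_forecaster` implementation: the function names, the `alpha`/`beta`/`z` defaults, and the sample data are assumptions, and the interval is a rough heuristic (one-step residual spread scaled by the square root of the horizon), not an exact prediction interval.

```python
import math


def holt_forecast(series, horizon, alpha=0.5, beta=0.3, z=1.96):
    """Holt's linear (double) exponential smoothing with a heuristic band."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    squared_errors = []
    for value in series[1:]:
        predicted = level + trend  # one-step-ahead forecast
        squared_errors.append((value - predicted) ** 2)
        previous_level = level
        level = alpha * value + (1 - alpha) * predicted
        trend = beta * (level - previous_level) + (1 - beta) * trend
    sigma = math.sqrt(sum(squared_errors) / len(squared_errors))
    points = []
    for step in range(1, horizon + 1):
        mean = level + step * trend
        margin = z * sigma * math.sqrt(step)  # heuristic, widens with horizon
        points.append({"step": step, "mean": mean,
                       "lower": mean - margin, "upper": mean + margin})
    return points


def first_breach(points, threshold):
    """Return the first forecast step whose mean exceeds `threshold`, else None."""
    for point in points:
        if point["mean"] > threshold:
            return point["step"]
    return None


# Hypothetical memory usage (%) growing by two points per period.
memory_pct = [50, 52, 54, 56, 58, 60]
forecast = holt_forecast(memory_pct, horizon=5)
breach_step = first_breach(forecast, threshold=65)
```

A caller could alert on `lower` or `upper` instead of `mean`, depending on how conservative the breach alert should be.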

	### 📝 Log Analysis (3 tools)
	| Tool | Description |
	|------|-------------|
	| `log_parser` | Structured log parsing with severity filtering, error burst detection, and pattern matching |
	| `log_anomaly_detector` | Template-based anomaly detection; finds new error patterns and frequency shifts |
	| `log_pattern_extractor` | Extracts error codes, exception types, service names, and key phrases for RCA |
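
Template-based log anomaly detection can be illustrated as follows. The sketch masks variable tokens to form templates, then compares a baseline window with an incident window. The masking rules, the `min_ratio` default, the function names, and the log lines are assumptions for illustration; `log_anomaly_detector` may work differently.

```python
import re
from collections import Counter

# Mask long hex identifiers first, then any remaining numbers.
HEX_ID = re.compile(r"\b[0-9a-f]{8,}\b")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def to_template(message):
    """Replace variable tokens so similar messages share one template."""
    message = HEX_ID.sub("<HEX>", message)
    return NUMBER.sub("<NUM>", message)


def compare_windows(baseline, incident, min_ratio=2.0):
    """Find templates that are new, or whose share grew by at least `min_ratio`."""
    base = Counter(to_template(m) for m in baseline)
    current = Counter(to_template(m) for m in incident)
    base_total = max(len(baseline), 1)
    current_total = max(len(incident), 1)
    new_templates = sorted(t for t in current if t not in base)
    frequency_shifts = sorted(
        t for t in current
        if t in base
        and current[t] / current_total >= min_ratio * base[t] / base_total
    )
    return {"new_templates": new_templates, "frequency_shifts": frequency_shifts}


# Hypothetical log lines.
baseline_logs = [
    "GET /orders 200 in 12ms",
    "GET /orders 200 in 15ms",
    "GET /orders 200 in 11ms",
    "retry 1 for request 9f8e7d6c5b4a",
]
incident_logs = [
    "GET /orders 200 in 14ms",
    "retry 2 for request 1a2b3c4d5e6f",
    "retry 3 for request 0a1b2c3d4e5f",
    "retry 4 for request aabbccddeeff",
    "ConnectionResetError: upstream db-7 closed",
]
result = compare_windows(baseline_logs, incident_logs)
```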

	### 🔍 Root Cause Analysis (3 tools)
	| Tool | Description |
	|------|-------------|
	| `rca_correlator` | Multi-signal correlation engine: temporal alignment, hypothesis generation, confidence scoring |
	| `service_dependency_analyzer` | Topology analysis with blast radius calculation, single points of failure (SPOFs), and investigation ordering |
	| `change_correlator` | Correlates deployments, config changes, and scaling events with incident timing |
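
A blast radius calculation over a service topology can be sketched as a breadth-first walk. The topology and function names below are hypothetical and only illustrate the idea; they are not taken from `service_dependency_analyzer`.

```python
from collections import deque


def reverse_edges(depends_on):
    """Turn 'A depends on B' edges into 'B is depended on by A' edges."""
    dependents = {service: [] for service in depends_on}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, []).append(service)
    return dependents


def bfs_order(graph, start):
    """Breadth-first walk from `start`, returning reachable nodes in visit order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical topology: each service maps to the services it calls.
depends_on = {
    "api-gateway": ["payment-service", "order-service"],
    "order-service": ["payment-service", "database-primary"],
    "payment-service": ["database-primary", "fraud-service"],
    "fraud-service": [],
    "database-primary": [],
}

# Blast radius: everything that transitively depends on the failing service.
blast_radius = bfs_order(reverse_edges(depends_on), "database-primary")

# Investigation order: walk the failing service's own dependencies, nearest first.
investigation_order = bfs_order(depends_on, "payment-service")
```

Walking the reversed graph answers "who is affected", while walking the original graph answers "what should be checked next".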

	### ⚙️ Infrastructure & Alerting (5 tools)
	| Tool | Description |
	|------|-------------|
	| `alert_summary` | Active alerts grouped by severity/service with correlation analysis |
	| `slo_checker` | SLO compliance checking with error budget calculation and burn rate analysis |
	| `runbook_search` | Keyword-matched runbook search with step-by-step remediation procedures |
	| `resource_utilization` | Pod-level CPU/memory/disk/network metrics with aggregate health scoring |
	| `service_health_checker` | Comprehensive health check: endpoint, dependencies, metrics, deployment, certificates |
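
The error budget and burn rate arithmetic behind an SLO check is short enough to show in full. This sketch assumes a request-based availability SLO; the function name, the 99.9% default target, and the request counts are assumptions, not the `slo_checker` interface.

```python
def slo_status(total_requests, failed_requests, slo_target=0.999):
    """Compute availability, error budget remaining, and burn rate for a window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_error_rate = 1.0 - slo_target  # the error budget as a rate
    observed_error_rate = failed_requests / total_requests
    allowed_failures = allowed_error_rate * total_requests
    return {
        "availability": 1.0 - observed_error_rate,
        "compliant": observed_error_rate <= allowed_error_rate,
        # Fraction of the window's budget still unspent (negative if overspent).
        "budget_remaining": 1.0 - failed_requests / allowed_failures,
        # 1.0 means the budget is being spent exactly as fast as the SLO allows.
        "burn_rate": observed_error_rate / allowed_error_rate,
    }


# Hypothetical traffic for one window.
healthy = slo_status(total_requests=1_000_000, failed_requests=500)
degraded = slo_status(total_requests=1_000_000, failed_requests=2_000)
```

With these inputs the healthy window has spent about half of its budget, while the degraded window burns budget at about twice the sustainable rate.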

	### 📄 Reporting (4 tools) — NEW
	| Tool | Description |
	|------|-------------|
	| `incident_report_generator` | Structured incident post-mortem with timeline, RCA, impact assessment, and follow-up items |
	| `executive_summary` | Concise business-impact summaries for VP/C-level/board audiences |
	| `sre_weekly_report` | Comprehensive weekly operational reports: SLO trends, incidents, MTTR, alert noise, deployments, capacity forecasting, and AI-generated recommendations |
	| `report_formatter` | Multi-format output: Markdown, JSON, HTML, Slack mrkdwn, or TL;DR summary |
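
Multi-format output of one report structure might look like the sketch below. The report fields (`title`, `summary`, `findings`), the format keys, and the function name are assumptions chosen for illustration; the actual `report_formatter` schema is not documented here.

```python
import html
import json


def format_report(report, fmt="markdown"):
    """Render one report dict as Markdown, JSON, HTML, Slack mrkdwn, or a TL;DR."""
    title = report["title"]
    summary = report["summary"]
    findings = report.get("findings", [])
    if fmt == "json":
        return json.dumps(report, indent=2)
    if fmt == "markdown":
        lines = [f"# {title}", "", summary, ""]
        lines += [f"- {item}" for item in findings]
        return "\n".join(lines)
    if fmt == "html":
        items = "".join(f"<li>{html.escape(item)}</li>" for item in findings)
        return (f"<h1>{html.escape(title)}</h1>"
                f"<p>{html.escape(summary)}</p><ul>{items}</ul>")
    if fmt == "slack":
        # Slack mrkdwn marks bold text with single asterisks.
        lines = [f"*{title}*", summary]
        lines += [f"• {item}" for item in findings]
        return "\n".join(lines)
    if fmt == "summary":
        return f"TL;DR: {title}. {summary}"
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical report content.
report = {
    "title": "payment-service outage",
    "summary": "Checkout errors rose after a deployment.",
    "findings": ["DB connection pool exhausted", "Rollback restored service"],
}
markdown_report = format_report(report, "markdown")
slack_report = format_report(report, "slack")
```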

	## 💬 Example Prompts

	Try these to see the agent in action:

	1. Full incident investigation:
	> "Investigate the payment-service outage. Check what's alerting, analyze metrics and logs, find the root cause, and generate an incident report."

	2. Time-series analysis:
	> "Detect anomalies in CPU utilization and p99 latency for the payment-service. Also check if these metrics are correlated."

	3. Root cause analysis:
	> "Run a root cause analysis for the payment-service. Check recent changes, analyze service dependencies, and identify the most likely root cause."

	4. Capacity planning:
	> "Forecast memory usage for the database-primary. Will we hit capacity limits in the next 30 periods?"

	5. Quick health check:
	> "Check the health of the payment-service and its SLO compliance."

	6. Log analysis:
	> "Parse the logs for the payment-service, focusing on ERROR and CRITICAL messages. Are there any error bursts?"

	7. Executive summary: (NEW)
	> "Generate an executive summary of the current incident for the VP of Engineering."

	8. Weekly SRE report: (NEW)
	> "Generate the SRE weekly report for this week, including SLO compliance, incidents, alert noise analysis, and capacity forecasting."

	9. Multi-format report: (NEW)
	> "Generate an incident report for the payment-service outage and format it as HTML for the internal wiki."

	## 🔬 Technical Details

	- Agent Framework: [smolagents](https://huggingface.co/docs/smolagents) v1.24+
	- LLM Backbone: Qwen/Qwen2.5-Coder-32B-Instruct (configurable via `MODEL_ID` env var)
	- Anomaly Detection: Z-score (|z| > 3, i.e. more than 3σ from the mean) + Isolation Forest (`contamination=0.05`) with consensus scoring
	- Forecasting: Holt's Exponential Smoothing with configurable confidence intervals
	- Correlation: Pearson cross-correlation with lag analysis (`max_lag` configurable)
	- Report Formats: Markdown, JSON, HTML, Slack mrkdwn, TL;DR summary
	- Simulated Environment: All tools support `'auto'` mode with realistic microservice data generation
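
As an illustration of the correlation item above, the following sketch computes Pearson correlation at each lag in `[-max_lag, max_lag]` and reports the strongest one. The function names, the sign convention for the lag, and the sample series are assumptions rather than the tool's actual interface.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def best_lag(x, y, max_lag=5):
    """Return (lag, r) with the largest |r|; positive lag means x leads y."""
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        if len(a) < 3:
            continue  # too few overlapping samples to be meaningful
        r = pearson(a, b)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Hypothetical metrics: latency repeats the CPU pattern two samples later.
cpu = [1, 3, 2, 9, 4, 2, 1, 6, 2, 3, 8, 1, 2, 5]
latency = [2, 2] + cpu[:-2]
lag, r = best_lag(cpu, latency, max_lag=4)
```

A positive best lag suggests the first metric is a leading indicator of the second, which is useful when ordering hypotheses during RCA.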

	## 📚 References

	- [TrioXpert: Automated Incident Management](https://arxiv.org/abs/2506.10043) — Multi-modal preprocessing pipeline
	- [AMER-RCL: Agentic Memory Enhanced Recursive Reasoning](https://arxiv.org/abs/2601.02732) — Recursive trace-based RCL
	- [CloudAnoBench: Context-Aware Anomaly Detection](https://arxiv.org/abs/2508.01844) — Fast/slow detection + symbolic verifier
	- [OpsAgent: Self-Evolving Multi-Agent](https://arxiv.org/abs/2510.24145) — Training-free data processor + multi-agent RCA
	- [LLM Reasoning Failures for RCA](https://arxiv.org/abs/2601.22208) — 16-category failure taxonomy
	- [RCACopilot: Root Cause Analysis](https://arxiv.org/abs/2507.03224) — Statistical pre-filter + LLM reasoning
	- [Time-Series Anomaly Detection Survey](https://arxiv.org/abs/2412.20512) — Comprehensive method taxonomy