---
title: SRE Agent
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
- sre
- agent
- smolagents
- aiops
- monitoring
- incident-response
- root-cause-analysis
- time-series
- anomaly-detection
short_description: AI SRE Agent for incident analysis and RCA
---

# SRE Agent: AI-Powered Site Reliability Engineering

An intelligent, multi-agent SRE system built with [smolagents](https://huggingface.co/docs/smolagents). It investigates incidents, analyzes time-series metrics, parses logs, performs root cause analysis, and generates incident reports, executive summaries, and weekly SRE reports.

## Architecture

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                           SRE Manager (CodeAgent)                            │
│      Orchestrates the full incident investigation & reporting lifecycle      │
│                planning_interval=3 for periodic re-assessment                │
├──────────┬──────────────┬──────────────┬──────────────┬──────────────────────┤
│ Metrics  │ Log Agent    │ RCA Agent    │ Infra Agent  │ Reporting Agent      │
│ Agent    │ (ToolCalling)│ (ToolCalling)│ (ToolCalling)│ (ToolCalling)        │
│(ToolCall)│              │              │              │                      │
├──────────┼──────────────┼──────────────┼──────────────┼──────────────────────┤
│ • Anomaly│ • Log Parser │ • Correlator │ • Resource   │ • Incident Report    │
│  Detector│ • Log Anomaly│ • Dependency │   Util.      │ • Executive Summary  │
│ • Fore-  │   Detector   │   Analyzer   │ • Health     │ • Weekly Report      │
│   caster │ • Pattern    │ • Change     │   Checker    │ • Report Formatter   │
│ • Correl.│   Extractor  │   Correlation│ • Alert      │   (MD/JSON/HTML/     │
│ • Stats  │              │              │   Summary    │    Slack/Summary)    │
│          │              │              │ • SLO Checker│                      │
│          │              │              │ • Runbook    │                      │
└──────────┴──────────────┴──────────────┴──────────────┴──────────────────────┘
```

**6 Agents** (1 Manager + 5 Workers) with **19 specialized tools**.

### Design Principles (Literature-Backed)

| Principle | Source | Implementation |
|-----------|--------|----------------|
| Multi-agent collaboration | [OpsAgent](https://arxiv.org/abs/2510.24145) | Manager CodeAgent + 5 specialist ToolCallingAgents |
| Recursive trace traversal | [AMER-RCL](https://arxiv.org/abs/2601.02732) | ServiceDependencyAnalyzer with BFS topology walk |
| Fast/Slow detection cascade | [CloudAnoBench](https://arxiv.org/abs/2508.01844) | Z-score (fast) → Isolation Forest (slow) cascade |
| Statistical pre-filter → LLM | [RCACopilot](https://arxiv.org/abs/2507.03224) | Tools do the statistical analysis; the LLM interprets the results |
| Anti-stalling planning | [Reasoning Failure Taxonomy](https://arxiv.org/abs/2601.22208) | planning_interval=3, max_steps limits, cross-modal checks |
| Separation of concerns | Single Responsibility Principle (SRP) | Dedicated reporting agent, separate from RCA and remediation |

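The fast/slow cascade in the table can be sketched as follows. This is a minimal illustration under stated assumptions, not the Space's actual code: the fast stage is a plain z-score filter, and the slow stage substitutes a median/MAD modified z-score for the Isolation Forest so the sketch needs only the standard library. The function names and thresholds are hypothetical.

```python
import statistics


def zscore_candidates(values, threshold=3.0):
    """Fast stage: indices whose z-score magnitude exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # a flat series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def confirm_with_mad(values, candidates, threshold=3.5):
    """Slow stage (stand-in for Isolation Forest): keep only candidates that
    also look anomalous under the robust median/MAD modified z-score."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return list(candidates)  # no robust scale available; keep the fast result
    return [i for i in candidates
            if 0.6745 * abs(values[i] - median) / mad > threshold]


# Hypothetical CPU samples with one spike at the end.
cpu = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
       11, 9, 10, 12, 10, 9, 11, 10, 10, 50]
suspects = zscore_candidates(cpu)
confirmed = confirm_with_mad(cpu, suspects)
```

The point of the cascade is cost: the cheap filter runs over every sample, while the more expensive check runs only on the few candidates it flags.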
## 19 Specialized SRE Tools

### Time-Series Analysis (4 tools)

| Tool | Description |
|------|-------------|
| `timeseries_anomaly_detector` | Multi-method anomaly detection: Z-score + Isolation Forest with consensus scoring |
| `timeseries_forecaster` | Holt's Exponential Smoothing with confidence intervals and threshold breach alerting |
| `timeseries_correlator` | Cross-correlation with lag analysis; finds leading/lagging indicators in metric cascades |
| `metric_stats` | Comprehensive statistics: percentiles, trend analysis, coefficient of variation |

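To illustrate what "cross-correlation with lag analysis" means, the sketch below scans a range of lags and reports the one with the strongest Pearson correlation. It is an assumption-laden illustration: `pearson` and `best_lag` are hypothetical names, both series are assumed to have equal length, and the real `timeseries_correlator` may work differently.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(vx * vy)


def best_lag(x, y, max_lag=5):
    """Return (lag, r) with the largest |r|.

    lag > 0 means x leads y by `lag` samples; lag < 0 means y leads x.
    """
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]   # pair x[t] with y[t + lag]
        else:
            a, b = x[-lag:], y[:len(y) + lag]  # pair x[t - lag] with y[t]
        if len(a) < 3:
            continue  # too few overlapping samples to be meaningful
        r = pearson(a, b)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best
```

A positive best lag between, say, CPU utilization and p99 latency would suggest that the CPU metric is the leading indicator in the cascade.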
### Log Analysis (3 tools)

| Tool | Description |
|------|-------------|
| `log_parser` | Structured log parsing with severity filtering, error burst detection, and pattern matching |
| `log_anomaly_detector` | Template-based anomaly detection; finds new error patterns and frequency shifts |
| `log_pattern_extractor` | Extracts error codes, exception types, service names, and key phrases for RCA |

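Template-based log anomaly detection can be sketched roughly as below: variable parts of each message are masked so that similar lines collapse into one template, and template counts are then compared between a baseline window and the current window. The masking rules, the `ratio` threshold, and the function names are assumptions for illustration, and the comparison of raw counts assumes both windows cover the same duration.

```python
import re
from collections import Counter

# Masks applied in order: hex literals first, then plain numbers.
_MASKS = [
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\d+(?:\.\d+)?"), "<NUM>"),
]


def to_template(message):
    """Replace variable tokens so similar messages share one template."""
    for pattern, token in _MASKS:
        message = pattern.sub(token, message)
    return message


def template_anomalies(baseline, current, ratio=3.0):
    """Return (new_templates, shifted_templates) for `current` vs `baseline`.

    A template is "new" if it never appears in the baseline, and "shifted"
    if its count grew by at least `ratio` times.
    """
    base = Counter(to_template(m) for m in baseline)
    cur = Counter(to_template(m) for m in current)
    new = sorted(t for t in cur if t not in base)
    shifted = sorted(t for t in cur if t in base and cur[t] >= ratio * base[t])
    return new, shifted
```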
### Root Cause Analysis (3 tools)

| Tool | Description |
|------|-------------|
| `rca_correlator` | Multi-signal correlation engine: temporal alignment, hypothesis generation, confidence scoring |
| `service_dependency_analyzer` | Topology analysis with blast radius calculation, single points of failure (SPOFs), and investigation ordering |
| `change_correlator` | Correlates deployments, config changes, and scaling events with incident timing |

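A blast radius calculation over a service topology can be expressed as a breadth-first walk over reversed dependency edges. The sketch below is a hypothetical illustration: the `blast_radius` name, the dictionary input format, and every service name other than `payment-service` and `database-primary` are assumptions.

```python
from collections import deque


def blast_radius(dependencies, failed):
    """Return {service: hops} for every service that transitively depends
    on `failed`. `dependencies` maps a service to the services it calls."""
    # Invert the edges: for each service, who calls it?
    dependents = {}
    for service, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    hops = {failed: 0}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, []):
            if caller not in hops:  # first visit is the shortest path in BFS
                hops[caller] = hops[current] + 1
                queue.append(caller)
    del hops[failed]  # the failed service is not part of its own blast radius
    return hops


topology = {
    "frontend": ["api-gateway"],
    "api-gateway": ["payment-service", "auth-service"],
    "payment-service": ["database-primary"],
    "auth-service": ["database-primary"],
}
impacted = blast_radius(topology, "database-primary")
```

Sorting the result by hop count gives one plausible investigation order: services closest to the failure first.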
### Infrastructure & Alerting (5 tools)

| Tool | Description |
|------|-------------|
| `alert_summary` | Active alerts grouped by severity/service with correlation analysis |
| `slo_checker` | SLO compliance checking with error budget calculation and burn rate analysis |
| `runbook_search` | Keyword-matched runbook search with step-by-step remediation procedures |
| `resource_utilization` | Pod-level CPU/memory/disk/network metrics with aggregate health scoring |
| `service_health_checker` | Comprehensive health check: endpoint, dependencies, metrics, deployment, certificates |

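For readers unfamiliar with error budgets and burn rates, the arithmetic behind an SLO check looks roughly like this. The function names and return shape are hypothetical and are not taken from `slo_checker`; only the standard definitions are used.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize error budget usage for an availability SLO.

    slo_target is a fraction such as 0.999 (99.9% of requests must succeed).
    """
    allowed = (1.0 - slo_target) * total_requests  # failures the SLO tolerates
    consumed = failed_requests / allowed if allowed else float("inf")
    return {
        "allowed_failures": allowed,
        "budget_consumed": consumed,         # 1.0 means the budget is gone
        "budget_remaining": 1.0 - consumed,  # negative means the SLO is violated
    }


def burn_rate(slo_target, window_error_rate):
    """Observed error rate divided by the error rate the SLO allows.

    A sustained burn rate of 1.0 uses up the budget exactly at the end of
    the SLO period; higher values exhaust it proportionally sooner.
    """
    return window_error_rate / (1.0 - slo_target)
```

For example, with a 99.9% target, 250 failures out of 1,000,000 requests consume about a quarter of the budget, and a 1.44% error rate over a short window corresponds to a burn rate of about 14.4.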
### Reporting (4 tools, NEW)

| Tool | Description |
|------|-------------|
| `incident_report_generator` | Structured incident post-mortem with timeline, RCA, impact assessment, and follow-up items |
| `executive_summary` | Concise business-impact summaries for VP/C-level/board audiences |
| `sre_weekly_report` | Comprehensive weekly operational reports: SLO trends, incidents, MTTR, alert noise, deployments, capacity forecasting, and AI-generated recommendations |
| `report_formatter` | Multi-format output: Markdown, JSON, HTML, Slack mrkdwn, or TL;DR summary |

## Example Prompts

Try these to see the agent in action:

1. **Full incident investigation:**
   > "Investigate the payment-service outage. Check what's alerting, analyze metrics and logs, find the root cause, and generate an incident report."
2. **Time-series analysis:**
   > "Detect anomalies in CPU utilization and p99 latency for the payment-service. Also check if these metrics are correlated."
3. **Root cause analysis:**
   > "Run a root cause analysis for the payment-service. Check recent changes, analyze service dependencies, and identify the most likely root cause."
4. **Capacity planning:**
   > "Forecast memory usage for the database-primary. Will we hit capacity limits in the next 30 periods?"
5. **Quick health check:**
   > "Check the health of the payment-service and its SLO compliance."
6. **Log analysis:**
   > "Parse the logs for the payment-service, focusing on ERROR and CRITICAL messages. Are there any error bursts?"
7. **Executive summary:** (NEW)
   > "Generate an executive summary of the current incident for the VP of Engineering."
8. **Weekly SRE report:** (NEW)
   > "Generate the SRE weekly report for this week, including SLO compliance, incidents, alert noise analysis, and capacity forecasting."
9. **Multi-format report:** (NEW)
   > "Generate an incident report for the payment-service outage and format it as HTML for the internal wiki."

## Technical Details

- **Agent Framework:** [smolagents](https://huggingface.co/docs/smolagents) v1.24+
- **LLM Backbone:** Qwen/Qwen2.5-Coder-32B-Instruct (configurable via `MODEL_ID` env var)
- **Anomaly Detection:** Z-score (|z| > 3, i.e. more than 3σ from the mean) + Isolation Forest (contamination=0.05) with consensus scoring
- **Forecasting:** Holt's Exponential Smoothing with configurable confidence intervals
- **Correlation:** Pearson cross-correlation with lag analysis (max_lag configurable)
- **Report Formats:** Markdown, JSON, HTML, Slack mrkdwn, TL;DR summary
- **Simulated Environment:** All tools support `'auto'` mode with realistic microservice data generation

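As a rough illustration of the forecasting item above, Holt's linear (double) exponential smoothing keeps a smoothed level and a smoothed trend and extrapolates both. The sketch below is a minimal version under stated assumptions: the smoothing factors `alpha` and `beta`, the initialization, and the function names are hypothetical, and the confidence intervals mentioned above are omitted.

```python
def holt_forecast(values, horizon, alpha=0.5, beta=0.3):
    """Forecast `horizon` future points with Holt's linear trend method."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    # Initialize level with the first value and trend with the first difference.
    level, trend = values[0], values[1] - values[0]
    for value in values[1:]:
        prev_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # The h-step-ahead forecast extends the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]


def first_breach(forecast, threshold):
    """Return the 1-based step at which the forecast first reaches the
    threshold, or None if it never does within the horizon."""
    for step, value in enumerate(forecast, start=1):
        if value >= threshold:
            return step
    return None
```

Calling `first_breach(holt_forecast(memory_samples, 30), limit)` with hypothetical `memory_samples` and `limit` values would answer a question like "will we hit capacity in the next 30 periods?".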
## References

- [TrioXpert: Automated Incident Management](https://arxiv.org/abs/2506.10043): Multi-modal preprocessing pipeline
- [AMER-RCL: Agentic Memory Enhanced Recursive Reasoning](https://arxiv.org/abs/2601.02732): Recursive trace-based RCL
- [CloudAnoBench: Context-Aware Anomaly Detection](https://arxiv.org/abs/2508.01844): Fast/slow detection + symbolic verifier
- [OpsAgent: Self-Evolving Multi-Agent](https://arxiv.org/abs/2510.24145): Training-free data processor + multi-agent RCA
- [LLM Reasoning Failures for RCA](https://arxiv.org/abs/2601.22208): 16-category failure taxonomy
- [RCACopilot: Root Cause Analysis](https://arxiv.org/abs/2507.03224): Statistical pre-filter + LLM reasoning
- [Time-Series Anomaly Detection Survey](https://arxiv.org/abs/2412.20512): Comprehensive method taxonomy