title: Agentic Reliability Framework MVP
emoji: 🧠
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: true
python_version: '3.10'
license: mit

🧠 Agentic Reliability Framework MVP

Adaptive anomaly detection + AI-driven self-healing + persistent FAISS memory.

This project explores agentic reliability systems — blending observability, vector-based persistence, and AI inference to create self-healing cloud operations.

Built with:

  • Gradio 5.49.1 for live visualization & dashboard UI
  • 🧩 FastAPI for REST endpoints (/add-event) with API key support
  • 🧠 Sentence Transformers (all-MiniLM-L6-v2) for embedding-based anomaly memory
  • 🔍 FAISS for similarity search across past incidents
  • 🔒 FileLock for safe concurrent saves in multi-user environments
  • 🤖 Hugging Face Router Inference API for adaptive reliability insights
  • ☁️ Python 3.10 runtime

🚀 Features

| Capability | Description |
|---|---|
| Adaptive Anomaly Detection | Detects anomalies dynamically based on latency and error-rate thresholds |
| AI Root Cause Analysis | Uses the Hugging Face Inference API for contextual one-line incident summaries |
| Self-Healing Actions | Simulates healing actions (scale-up, restart, etc.) |
| Persistent Memory (FAISS) | Learns from prior incidents, clusters patterns, and retrieves similar cases |
| Secure REST API | /add-event endpoint secured by X-API-Key header |
| Interactive Gradio UI | Visualize, test, and analyze events live in your browser |
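
The detection and self-healing rows above can be sketched in a few lines of Python. This is a minimal illustration, not the code in app.py: the threshold values, the "Normal" status label, the `Event` class, and the rule for choosing an action are all assumptions made for the example. The thresholds are fixed here for simplicity, whereas the feature list describes them as adaptive.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the README does not state the real values.
LATENCY_THRESHOLD_MS = 150.0
ERROR_RATE_THRESHOLD = 0.05


@dataclass
class Event:
    component: str
    latency: float      # milliseconds
    error_rate: float   # fraction of failed requests, 0.0 to 1.0


def classify(event: Event) -> str:
    """Return "Anomaly" if either metric exceeds its threshold, else "Normal"."""
    if event.latency > LATENCY_THRESHOLD_MS or event.error_rate > ERROR_RATE_THRESHOLD:
        return "Anomaly"
    return "Normal"


def simulate_healing(event: Event) -> str:
    """Pick a simulated action; nothing is actually restarted or scaled."""
    if classify(event) != "Anomaly":
        return "-"
    # Assumed policy: error spikes suggest a restart, latency alone suggests scaling.
    if event.error_rate > ERROR_RATE_THRESHOLD:
        return "Restarted container"
    return "Scaled up"
```

With these assumed thresholds, an event with latency 224 ms and error rate 0.062 is classified as an anomaly and mapped to a simulated restart.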

🧠 Example Output

Event Processed (Anomaly)

Component: api-service
Latency: 224 ms
Error Rate: 0.062
Status: Anomaly
Analysis: Error 404: Not Found
Healing Action: Restarted container (Found 3 similar incidents)


🧩 Architecture Overview

┌──────────────────────┐
│ Gradio Frontend UI   │
└─────────┬────────────┘
          │ (submit telemetry)
          ▼
┌──────────────────────┐
│ FastAPI /add-event   │
│ + API Key validation │
└─────────┬────────────┘
          │ (call)
          ▼
┌─────────────────────────────┐
│ Hugging Face Inference API  │
│ → Reliability insight text  │
└─────────┬───────────────────┘
          │
          ▼
┌───────────────────────────────┐
│ FAISS + Sentence Transformers │
│ → Embedding + similarity map  │
└───────────────────────────────┘
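
The last stage of the diagram, retrieving similar past incidents, comes down to a nearest-neighbour lookup over embedding vectors. The sketch below is a plain-Python stand-in, not FAISS itself: it performs an exact search by squared L2 distance, which is the metric that FAISS's `IndexFlatL2` uses. The `IncidentMemory` class and the two-dimensional toy vectors are hypothetical; in the real pipeline the vectors would come from the all-MiniLM-L6-v2 model.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class IncidentMemory:
    """Toy in-memory store: exact nearest-neighbour search over incident vectors."""

    def __init__(self):
        self._vectors = []
        self._labels = []

    def add(self, vector, label):
        """Store one incident embedding together with a human-readable label."""
        self._vectors.append(list(vector))
        self._labels.append(label)

    def search(self, query, k=3):
        """Return up to k (distance, label) pairs, closest first."""
        scored = (
            (squared_l2(query, vector), label)
            for vector, label in zip(self._vectors, self._labels)
        )
        return heapq.nsmallest(k, scored)


# Usage with made-up vectors standing in for sentence embeddings.
memory = IncidentMemory()
memory.add([1.0, 0.0], "api-service latency spike")
memory.add([0.9, 0.1], "api-service timeout")
memory.add([0.0, 1.0], "db disk full")
similar = memory.search([1.0, 0.0], k=2)
```

A brute-force scan like this is only practical for small histories; a vector index earns its place once the number of stored incidents grows.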


🧾 API Usage

Endpoint:
POST /add-event

Headers:
X-API-Key: <your_api_key>

Body:

{
  "component": "api-service",
  "latency": 200,
  "error_rate": 0.04
}

Response:

{
  "status": "ok",
  "event": {
    "timestamp": "2025-11-08 23:29:03",
    "component": "api-service",
    "status": "Anomaly",
    "analysis": "Error 404: Not Found",
    "healing_action": "Restarted container Found 3 similar incidents ..."
  }
}
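
As a rough client-side illustration of the request and response above, the sketch below builds the POST request using only the Python standard library. The helper names `build_add_event_request` and `send_event` are hypothetical, and the base URL is an assumption you must supply yourself, since this README does not state the host or port the FastAPI endpoint listens on.

```python
import json
import urllib.request


def build_add_event_request(base_url, api_key, component, latency, error_rate):
    """Build (but do not send) a POST /add-event request with the X-API-Key header."""
    payload = {"component": component, "latency": latency, "error_rate": error_rate}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/add-event",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def send_event(request, timeout=10.0):
    """Send the request and decode the JSON response body into a dict."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send_event(build_add_event_request(base_url, api_key, "api-service", 200, 0.04))` against a running instance should return a dictionary shaped like the response shown above; a missing or invalid key would presumably be rejected with an HTTP error, which `urlopen` raises as `urllib.error.HTTPError`.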

Run locally:

git clone https://github.com/petterjuan/agentic-reliability-framework.git
cd agentic-reliability-framework
pip install -r requirements.txt
python app.py

Then open http://localhost:7860

🌍 Live Space & Collaboration

👉 Launch Live Demo on Hugging Face

👉 Contribute or Fork on GitHub

🧭 Author

Juan D. Petter
AI Engineer & Cloud Architect
Building Agentic Systems for Scalable Automation | ex-NetApp
🔗 LinkedIn • GitHub

🪪 License

MIT License © 2025 Juan D. Petter