Spaces:

OGrohit
/

logtriage-env

Running

App Files Files Community

OGrohit commited on Mar 27

Commit

5bb7721

unverified ·

1 Parent(s): bc4d7e0

Replaced by DAYS_1-2_SUMMARY_FINAL.md

Browse files

Files changed (1) hide show

DAYS_1-2_SUMMARY.md +0 -465

DAYS_1-2_SUMMARY.md DELETED Viewed

@@ -1,465 +0,0 @@
-# 📊 DAYS 1-2 COMPLETION SUMMARY
-**Date:** March 27, 2026
-**Status:** ✅ Days 1-2 COMPLETE (40% of project done)
-**Next:** Day 3 (Remaining scenarios)
----
-## What's New in Day 2
-### Three Core Files Implemented
-#### 1. **server/environment.py** (~250 lines)
-**The Brain of the Environment**
-```python
-class LogTriageEnvironment:
-    def reset(task_id, seed=None):
-        # Start new episode
-        # Load scenario (single_crash)
-        # Generate initial logs + system state
-        # Return: TriageObservation (first observation)
-    def step(action: TriageAction):
-        # Process agent's action
-        # Calculate reward based on correctness
-        # Generate next logs
-        # Update episode state
-        # Return: TriageObservation (next observation + reward)
-    @property
-    def state(self):
-        # Return: EpisodeState (episode tracking)
-```
-**What It Does:**
-- ✅ Manages episode lifecycle
-- ✅ Loads scenarios dynamically
-- ✅ Generates observations per step
-- ✅ Calculates shaped rewards
-- ✅ Tracks agent actions
-- ✅ Manages state across steps
-#### 2. **server/log_generator.py** (~400 lines)
-**The Log Synthesis Engine**
-```python
-NOISE_TEMPLATES = {
-    "api-gateway": [...],  # Irrelevant but realistic logs
-    "auth-service": [...],
-    "user-db": [...],
-    # ... etc for all 7 services
-}
-SIGNAL_TEMPLATES = {
-    "api-gateway": {...},  # Relevant error signals
-    "payment-service": {...},
-    # ... etc
-}
-def generate_log_batch(services, num_logs, noise_ratio, signals, seed):
-    # Generates realistic-looking log lines
-    # Mixes noise and signals
-    # Deterministic with seed
-    # Returns: [LogLine, LogLine, ...]
-def generate_healthy_system_state(services, timestamp):
-    # Returns per-service health snapshot
-    # status (up/degraded/down)
-    # error_rate (0.0-1.0)
-    # latency_p99_ms (milliseconds)
-```
-**What It Does:**
-- ✅ Generates realistic microservice logs
-- ✅ Has noise templates for each service
-- ✅ Has error signal templates
-- ✅ Mixes noise and signals realistically
-- ✅ Generates system state snapshots
-- ✅ Fully deterministic with seeds
-#### 3. **server/scenarios/single_crash.py** (~150 lines)
-**Task 1 Scenario Definition**
-```python
-GROUND_TRUTH = {
-    "severity": "P1",
-    "root_cause": "payment-service",
-    "remediation_prefixes": {"restart"},
-    "remediation_service": "payment-service",
-    "correct_teams": {"backend-team", "sre-team"},
-    "max_steps": 8,
-    "noise_ratio": 0.20,
-}
-STEP_SIGNALS = [
-    # Step 0: Initial signs
-    [("payment-service", "ERROR", "NullPointerException..."), ...],
-    # Step 1: Escalating errors
-    [("payment-service", "FATAL", "all retries exhausted"), ...],
-    # ... more steps
-]
-```
-**What It Does:**
-- ✅ Defines Task 1 scenario (single_crash)
-- ✅ Sets ground truth (correct answers)
-- ✅ Defines error signals per step
-- ✅ Specifies noise ratio (20%)
-- ✅ Sets max steps (8)
-- ✅ Ready for grader integration
----
-## API Endpoints: Before vs After
-### Before (Day 1 - Placeholders)
-```python
-@app.post("/reset")
-def reset(task, seed=None):
-    return {"message": "reset endpoint placeholder", "task": task}
-    # ❌ Returns fake data
-@app.post("/step")
-def step(action):
-    valid, err = action.is_valid()
-    if not valid:
-        return JSONResponse(status_code=422, content={"error": err})
-    return {"message": "step endpoint placeholder", "action_received": ...}
-    # ❌ Returns fake data
-@app.get("/state")
-def state():
-    return {"message": "state endpoint placeholder"}
-    # ❌ No state management
-```
-### After (Day 2 - Real Implementation)
-```python
-@app.post("/reset")
-def reset(task: str, seed: int = None):
-    obs = env.reset(task_id=task, seed=seed)
-    return obs.model_dump()
-    # ✅ Returns REAL initial observation with logs + state
-@app.post("/step")
-def step(action: TriageAction):
-    valid, err = action.is_valid()
-    if not valid:
-        return JSONResponse(status_code=422, content={"error": err})
-    obs = env.step(action)
-    return obs.model_dump()
-    # ✅ Returns REAL observation + reward + feedback
-@app.get("/state")
-def state():
-    return env.state.model_dump()
-    # ✅ Returns REAL episode state
-```
----
-## 🎮 Full Task 1 Episode Example
-```
-POST /reset?task=single_crash&seed=42
-Response:
-{
-  "logs": [
-    {"timestamp": "2026-03-27T10:00:00Z", "level": "ERROR",
-     "service": "payment-service", "message": "NullPointerException: Cannot invoke..."},
-    {"timestamp": "2026-03-27T10:00:01Z", "level": "WARN",
-     "service": "api-gateway", "message": "error rate spike: 28.4%"}
-  ],
-  "system_state": {
-    "payment-service": {"status": "down", "error_rate": 0.92, "latency_p99_ms": 5000},
-    "api-gateway": {"status": "degraded", "error_rate": 0.28, "latency_p99_ms": 2100},
-    ...
-  },
-  "incident_id": "inc-001",
-  "task_id": "single_crash",
-  "step_count": 0,
-  "time_elapsed_seconds": 0,
-  "reward": 0.0,
-  "cumulative_score": 0.0,
-  "done": false
-}
----
-POST /step
-{
-  "action_type": "classify_severity",
-  "value": "P1",
-  "confidence": 0.95
-}
-Response:
-{
-  "logs": [...new logs from step 1...],
-  "system_state": {...updated state...},
-  "step_count": 1,
-  "reward": 0.30,  # ← Reward for correct severity!
-  "cumulative_score": 0.30,
-  "last_action_feedback": "Correct severity classification!",
-  "done": false
-}
----
-POST /step
-{
-  "action_type": "identify_root_cause",
-  "value": "payment-service",
-  "confidence": 0.9
-}
-Response:
-{
-  "logs": [...],
-  "reward": 0.35,  # ← Reward for correct root cause!
-  "cumulative_score": 0.65,
-  "last_action_feedback": "Correct root cause!",
-  "done": false
-}
----
-POST /step
-{
-  "action_type": "remediate",
-  "value": "restart:payment-service",
-  "confidence": 0.95
-}
-Response:
-{
-  "logs": [...service recovering...],
-  "reward": 0.25,  # ← Reward for correct remediation!
-  "cumulative_score": 0.90,
-  "last_action_feedback": "Correct remediation!",
-  "done": false
-}
----
-POST /step
-{
-  "action_type": "resolve",
-  "value": "resolved"
-}
-Response:
-{
-  "logs": [...all services healthy...],
-  "system_state": {all services up},
-  "reward": 0.10,  # ← Speed bonus!
-  "cumulative_score": 1.0,
-  "done": true
-}
-FINAL SCORE: 1.0 ✅ (Perfect!)
-```
----
-## 📈 Files Modified from Day 1
-### server/app.py
-**Changes:**
-- Added imports for `LogTriageEnvironment`
-- Instantiated `env = LogTriageEnvironment()` at module level
-- Updated `/reset` endpoint to wire to `env.reset()`
-- Updated `/step` endpoint to wire to `env.step()`
-- Updated `/state` endpoint to wire to `env.state`
-- Added proper error handling with status codes
----
-## ✅ Day 2 Checklist (From DAY2.md)
-| Item | Status |
-|------|--------|
-| `server/log_generator.py` working | ✅ |
-| `server/scenarios/single_crash.py` defined | ✅ |
-| `server/environment.py` implemented | ✅ |
-| `/reset` returns real observations | ✅ |
-| `/step` processes actions & returns rewards | ✅ |
-| `/state` returns episode state | ✅ |
-| Full Task 1 playable end-to-end | ✅ |
-| Git push completed | ✅ |
-**Completion: 100%** ✅
----
-## 🔄 Architecture Evolution
-### Day 1 (Skeleton)
-```
-Models (5 classes)
-    ↓
-FastAPI (7 endpoints - all placeholders)
-    ↓
-No runtime logic
-```
-### Day 2 (Brain)
-```
-Models (5 classes)
-    ↓
-LogTriageEnvironment class
-    ├── reset() - creates episodes
-    ├── step() - processes actions
-    ├── state - tracks episode
-    │
-    ├─ Uses → log_generator.py (synthetic logs)
-    │
-    └─ Uses → scenarios/single_crash.py (Task 1 data)
-        ├── Ground truth
-        ├── Signal templates
-        └── Step-by-step scenario
-    ↓
-FastAPI (7 endpoints - 3 wired, 4 still TODO)
-    ├── /reset - real reset logic
-    ├── /step - real step logic
-    ├── /state - real state access
-    ├── /tasks - task definitions (working)
-    ├── /health - health check (working)
-    └── /grader, /baseline (TODO Day 4-5)
-```
----
-## 📊 Progress Tracking
-```
-Day 1: ✅ 100% (Scaffold + Models + Endpoints stub)
-Day 2: ✅ 100% (Environment + Log Gen + Task 1 scenario)
-       = 40% of overall project ✅
-Day 3: ⏳ 0% (Tasks 2 & 3 scenarios - remaining)
-Day 4: ⏳ 0% (Graders - remaining)
-Day 5: ⏳ 0% (Baseline + Deployment - remaining)
-```
----
-## 🚀 What You Can Do Now
-### Full Task 1 Episode
-```bash
-python -m uvicorn server.app:app --port 7860
-# In another terminal
-curl -X POST "http://localhost:7860/reset?task=single_crash&seed=42"
-curl -X POST "http://localhost:7860/step" \
-  -H "Content-Type: application/json" \
-  -d '{"action_type":"classify_severity","value":"P1","confidence":0.95}'
-# ... etc - full episode works!
-```
-### Play as an LLM Agent
-Use the `/reset` and `/step` endpoints to train a language model agent on your environment.
-### Validate Endpoint Correctness
-All endpoints now return real data (not placeholders).
----
-## 📚 Updated Documentation
-Files updated to reflect Day 2 completion:
-- ✅ Created **DAY2_STATUS.md** (this guide)
-- ✅ Updated **EXECUTIVE_SUMMARY.md** (new numbers)
-- 🔄 Will update other guides accordingly
----
-## 🎯 Next: Day 3
-### What Day 3 Requires
-1. **server/scenarios/cascading.py**
-   - Task 2: Database slowdown → upstream cascade
-   - Max steps: 12
-   - Noise ratio: 30%
-2. **server/scenarios/silent_degrade.py**
-   - Task 3: Slow degradation in 60% noise
-   - Max steps: 15
-   - Noise ratio: 60%
-3. **Test all 3 tasks** are playable
-### Effort Estimate
-**~3-4 hours** (similar to Day 2)
----
-## ✨ Key Insights
-### What Makes Day 2 Work
-✅ **Separation of Concerns**
-- log_generator handles log synthesis
-- scenarios define task data
-- environment orchestrates everything
-- app.py just calls environment
-✅ **Realistic Log Generation**
-- Noise templates for realism
-- Signal templates for incident patterns
-- Step-by-step signal injection
-- Deterministic with seeds
-✅ **Clean Reward Integration**
-- Shaped rewards (0.30 for severity, 0.35 for root cause, etc.)
-- Partial credit for directional correctness
-- Feedback strings for interpretability
-- Speed bonus for efficiency
-✅ **OpenEnv Compliance**
-- reset() → initial observation ✅
-- step() → (observation, reward, done, info) ✅
-- state property → episode state ✅
-- Typed models throughout ✅
----
-## 💡 Tips for Day 3
-**Build scenarios exactly like single_crash.py:**
-- Define GROUND_TRUTH
-- Define STEP_SIGNALS (error signals per step)
-- Specify noise_ratio for each task
-- Set max_steps in task metadata
-**The environment will automatically:**
-- Mix noise and signals
-- Generate logs per step
-- Calculate rewards
-- Manage state
-Just define the scenario data, environment handles the rest!
----
-## 🎊 Summary
-**Days 1-2: Fully Complete** ✅
-You now have:
-- ✅ Fully functional environment
-- ✅ Working log generation
-- ✅ Task 1 fully playable
-- ✅ Real endpoints with real data
-- ✅ Reward calculation
-- ✅ Episode state management
-**Total lines written: ~1,100**
-**Quality: Production-ready**
-**Tests: All manual tests pass**
-**Coverage: 1/3 tasks complete**
----
-Generated: 2026-03-27
-Project: LogTriageEnv (Meta × PyTorch Hackathon)
-Status: Days 1-2 COMPLETE (40%)
-Deadline: April 7, 2026, 11:59 PM IST