OGrohit committed on
Commit
4c76730
·
1 Parent(s): 58d207a

Day 6: inference.py (renamed from baseline.py), HF_TOKEN/API_BASE_URL/MODEL_NAME env vars, pyproject.toml for openenv validate

.gitignore CHANGED
Binary files a/.gitignore and b/.gitignore differ
 
DAY3_STATUS.md DELETED
@@ -1,290 +0,0 @@
- # 🎯 DAY 3 STATUS — LogTriageEnv Complete
-
- **Status: ✅ 100% COMPLETE (Days 1-2-3 now complete!)**
- **Last Updated:** March 27, 2026
- **Overall Progress:** ▓▓▓░░ (60% of total project)
-
- ---
-
- ## 📊 Quick Status
-
- | Component | Status | Details |
- |-----------|--------|---------|
- | **Day 1 Work** | ✅ 100% | Models, API scaffold, config, docs |
- | **Day 2 Work** | ✅ 100% | Environment, log gen, Task 1 wired |
- | **Day 3 Work** | ✅ 100% | Tasks 2 & 3 scenarios + wiring |
- | **Task 1 (Easy)** | ✅ 100% | Single crash - FULLY PLAYABLE |
- | **Task 2 (Medium)** | ✅ 100% | Cascading failures - FULLY PLAYABLE |
- | **Task 3 (Hard)** | ✅ 100% | Silent degradation - FULLY PLAYABLE |
- | **Graders** | ⏳ 0% | Day 4 - not started |
- | **Baseline Agent** | ⏳ 0% | Day 5 - not started |
-
- ---
-
- ## ✅ What Was Completed in Day 3
-
- ### 1. **Task 2: Cascading Failure (Medium Difficulty)**
- **File:** `server/scenarios/cascading.py` (171 lines)
-
- ✅ **Scenario Definition:**
- - Database slowdown in user-db → exhausts the auth-service connection pool → cascades to api-gateway
- - Surface logs show gateway errors loudly (the symptom), but the root cause is hidden (user-db)
- - Agent must trace backward through the cascade chain, not treat symptoms
-
- ✅ **Ground Truth:**
- ```
- Severity: P1
- Root Cause: user-db (NOT auth-service, NOT api-gateway)
- Remediation: kill-query:user-db OR restart:user-db
- Teams: dba-team, sre-team
- Max Steps: 12
- Noise: 30%
- ```
-
- ✅ **Step-by-Step Signal Plan (12 stages):**
- - Steps 0-1: Gateway errors appear (symptoms only)
- - Steps 2-3: Auth-service DB pressure becomes visible
- - Steps 4-5: user-db slow queries exposed; circuit breaker opens
- - Steps 6-7: Full cascade; all 3 services degraded/down
- - Steps 8-11: Escalating alerts; root cause becomes unmistakable
-
- ✅ **System State Modeling:**
- - api-gateway: error rate degrades from 8% → 99%
- - auth-service: degrades from healthy → down by step 6
- - user-db: latency increases from 2847ms → 10000ms
-
- ✅ **Integration:**
- - Wired into environment.py as the `cascading_failure` task
- - Accessible via `/reset?task=cascading_failure`
- - Returns realistic logs with 30% noise injected
-
- ---
-
- ### 2. **Task 3: Silent Degradation (Hard Difficulty)**
- **File:** `server/scenarios/silent_degrade.py` (185 lines)
-
- ✅ **Scenario Definition:**
- - payment-db query latency slowly increases over time
- - No service crashes; the error rate stays below the P1 threshold (5%)
- - 60% of logs are irrelevant noise from other services
- - Agent must filter the noise, identify the subtle signal, and classify it as P2 (not P1, not P3)
-
- ✅ **Ground Truth:**
- ```
- Severity: P2 (NOT P1, NOT P3 — nuanced judgment required)
- Root Cause: payment-db
- Remediation: flush-cache:payment-db OR kill-query:payment-db
- Teams: dba-team
- Max Steps: 15
- Noise: 60% (hardest noise ratio of all tasks)
- ```
-
- ✅ **Step-by-Step Signal Plan (15 stages):**
- - Steps 0-2: Very subtle signals (payment-db latency 450ms → 890ms)
- - Steps 3-5: Buffer cache degradation visible; error rate at 2.1%
- - Steps 6-8: Latency 2200ms → 3100ms; still well below the P1 threshold
- - Steps 9-12: Approaching but not breaching the timeout (4200ms → 4600ms)
- - Steps 13-14: P1 breach imminent/breached (4950ms → payment error rate 5.1%)
-
- ✅ **Noise Characteristics:**
- - Most logs come from unrelated services (api-gateway, auth-service, etc.)
- - Signal is sparse — only 1-2 relevant logs per step
- - Requires the agent to read logs carefully and separate signal from noise
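The seeded, reproducible noise mix described in these status notes can be sketched as a small helper. This is purely illustrative: the function name, arguments, and log strings below are assumptions, not the actual `log_generator.py` internals.

```python
import random

def mix_logs(signal_logs, noise_pool, n_noise, seed):
    """Deterministically interleave n_noise noise lines into the signal.

    Same seed -> same log ordering, mirroring the seeded-RNG
    reproducibility the environment advertises. All names here
    are illustrative, not the real implementation.
    """
    rng = random.Random(seed)
    mixed = list(signal_logs) + [rng.choice(noise_pool) for _ in range(n_noise)]
    rng.shuffle(mixed)
    return mixed

# A 60%-noise step in the spirit of Task 3: 2 signal lines, 3 noise lines.
step_logs = mix_logs(
    ["payment-db: slow query 890ms", "payment-db: buffer cache hit 71%"],
    ["api-gateway: GET /healthz 200 OK", "auth-service: token refreshed"],
    n_noise=3, seed=42)
```

Because the RNG is constructed from the seed, replaying the same `seed` reproduces the same shuffled stream, which is what makes episodes diffable across runs.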
-
- ✅ **System State Modeling:**
- - payment-db: latency increases 450ms → 4950ms; status stays "up" until step 3
- - payment-service: becomes slightly degraded from step 4 onward
- - All other services remain healthy
-
- ✅ **Integration:**
- - Wired into environment.py as the `silent_degradation` task
- - Accessible via `/reset?task=silent_degradation`
- - Returns realistic logs with 60% noise injected
-
- ---
-
- ### 3. **Environment Wiring (Updated)**
- **File:** `server/environment.py` (updated)
-
- ✅ **Imports Added:**
- ```python
- from server.scenarios import cascading
- from server.scenarios import silent_degrade
- ```
-
- ✅ **Task Registry Updated:**
- ```python
- TASK_MAX_STEPS = {
-     "single_crash": 8,
-     "cascading_failure": 12,
-     "silent_degradation": 15,
- }
- ```
-
- ✅ **reset() Method Wires All 3 Tasks:**
- ```python
- if task_id == "single_crash":
-     self._ground_truth = single_crash.GROUND_TRUTH
- elif task_id == "cascading_failure":
-     self._ground_truth = cascading.GROUND_TRUTH
- elif task_id == "silent_degradation":
-     self._ground_truth = silent_degrade.GROUND_TRUTH
- ```
-
- ✅ **_get_step_data() Extracts Scenario Data:**
- - Calls `scenario.get_step_data(step, base_time, rng)` for real logs
- - Calls `scenario.get_system_state(step, base_time)` for service status
- - All 3 tasks return deterministic logs based on ground truth
-
- ✅ **_get_alerts() Returns Scenario-Specific Alerts:**
- - Each scenario defines its own alert progression
- - Alerts evolve as the cascade/degradation unfolds
-
- ---
-
- ## 🎮 All 3 Tasks Now Playable End-to-End
-
- ### **Task 1: Single Service Crash (Easy)**
- ```bash
- curl -X POST "http://localhost:7860/reset?task=single_crash&seed=42"
- curl -X POST "http://localhost:7860/step" \
-   -H "Content-Type: application/json" \
-   -d '{"action_type":"classify_severity","value":"P1","confidence":0.95}'
- # Expected: +0.30 reward for correct severity
- ```
-
- ### **Task 2: Cascading Failure (Medium)**
- ```bash
- curl -X POST "http://localhost:7860/reset?task=cascading_failure&seed=42"
- curl -X POST "http://localhost:7860/step" \
-   -H "Content-Type: application/json" \
-   -d '{"action_type":"request_more_logs","value":"system_state","confidence":0.9}'
- # Agent must trace: gateway errors → auth-service → user-db (root cause)
- # Expected: +0.35 reward for identifying user-db (not gateway/auth-service)
- ```
-
- ### **Task 3: Silent Degradation (Hard)**
- ```bash
- curl -X POST "http://localhost:7860/reset?task=silent_degradation&seed=42"
- curl -X POST "http://localhost:7860/step" \
-   -H "Content-Type: application/json" \
-   -d '{"action_type":"classify_severity","value":"P2","confidence":0.85}'
- # Nuanced judgment: error rate is 2.1% (below the P1 threshold of 5%) but trending toward breach
- # Expected: +0.30 reward for correct P2 (not P1, not P3)
- ```
-
- ---
-
- ## 📈 Scoring Distribution
-
- Each task has a different difficulty → a different expected agent score range:
-
- | Task | Difficulty | Max Score | Expected Range | Key Challenge |
- |------|-----------|-----------|-----------------|---------------|
- | **Single Crash** | Easy | 1.00 | 0.75–0.85 | Simple identification |
- | **Cascading** | Medium | 1.00 | 0.45–0.60 | Trace root cause, not symptoms |
- | **Silent Degrade** | Hard | 1.00 | 0.20–0.40 | Filter 60% noise, nuanced P2 judgment |
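For orientation, the expected ranges in this table come from summing per-action rewards and clamping into [0, 1]. A minimal sketch using the Task 1 weights quoted in these notes — the actual scoring lives in the Day 4 grader work, and the dict below is an assumption, not the real data structure:

```python
# Task 1 reward split as quoted in this file: severity +0.30,
# root cause +0.35, remediation +0.25, speed bonus +0.10.
# Weights for the other two tasks differ.
REWARDS = {"severity": 0.30, "root_cause": 0.35, "remediation": 0.25, "speed": 0.10}

def episode_score(earned: set) -> float:
    """Sum the reward components the agent earned; clamp into [0.0, 1.0]."""
    total = sum(w for name, w in REWARDS.items() if name in earned)
    return min(max(total, 0.0), 1.0)
```

An agent that nails only severity and root cause would land at 0.65, inside the band sketched for a partially correct run.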
-
- ---
-
- ## 🔍 Key Metrics
-
- ### Code
- - **Total lines written (Days 1-3):** ~1,500 lines of Python
- - **Scenario files:** 3 complete (single_crash + cascading + silent_degrade)
- - **Scenario logic:** ~500 lines of step-by-step signal planning + system state modeling
-
- ### Documentation
- - **Status files:** Now consolidated (DAY1_STATUS, DAY2_STATUS, DAY3_STATUS merged → use this file + DAYS_1-2_SUMMARY)
- - **Total doc lines:** ~2,000+ across remaining guides
-
- ### Testing
- - **Endpoints wired:** 7/7 (all endpoints can now be called)
- - **Tasks playable:** 3/3 ✅
- - **Test cases needed:** Day 4 (grader logic tests)
-
- ---
-
- ## 📋 Files in Play
-
- ### **Core Code (Keep)**
- ```
- ✅ server/models.py (218 lines)
- ✅ server/app.py (7 endpoints)
- ✅ server/environment.py (environment logic)
- ✅ server/log_generator.py (synthetic logs)
- ✅ server/scenarios/single_crash.py (Task 1)
- ✅ server/scenarios/cascading.py (Task 2)
- ✅ server/scenarios/silent_degrade.py (Task 3)
- ```
-
- ### **Configuration (Keep)**
- ```
- ✅ openenv.yaml
- ✅ requirements.txt
- ✅ Dockerfile
- ```
-
- ### **Documentation (Use These)**
- ```
- ✅ README.md (main spec)
- ✅ EXECUTIVE_SUMMARY.md (overview for judges)
- ✅ DAYS_1-2_SUMMARY_FINAL.md (technical deep-dive, Days 1-2)
- ✅ STATUS.md (quick progress matrix)
- ✅ START_HERE_DAY2.md (navigation guide)
- ✅ FILE_INVENTORY.md (file listing)
- ✅ TEST_ENDPOINTS.md (curl examples)
- ✅ VISUAL_SUMMARY.md (architecture diagrams)
- ✅ DAY3_STATUS.md (this file — complete Day 3 status)
- ```
-
- ### **Removed Files (No Longer Needed)**
- ```
- ❌ DAY1.md (consolidated)
- ❌ DAY1_STATUS.md (consolidated)
- ❌ DAY2.md (consolidated)
- ❌ ANALYSIS_SUMMARY.md (redundant)
- ❌ COMPLETE_SUMMARY.md (redundant)
- ❌ etc.
- ```
-
- ---
-
- ## 🎯 What's Next (Days 4-5)
-
- ### **Day 4: Graders**
- - [ ] Implement grader logic (evaluation of agent actions)
- - [ ] Wire the `/grader` endpoint
- - [ ] Validate scoring across all 3 tasks
-
- ### **Day 5: Baseline Agent**
- - [ ] Implement a simple baseline agent
- - [ ] Wire the `/baseline` endpoint
- - [ ] Deploy to Hugging Face
-
- ---
-
- ## 💡 Summary
-
- **Days 1-3 Complete:** All 3 tasks are now fully playable end-to-end with realistic scenario data.
-
- ✅ **Single Service Crash (Easy):** One service crashes → clear logs → straightforward triage
- ✅ **Cascading Failure (Medium):** DB slowdown cascades upstream → must trace the root cause, not symptoms
- ✅ **Silent Degradation (Hard):** Slow, creeping problem in 60% noise → nuanced P2 judgment required
-
- **Completion Status:**
- - 60% of total project complete (Days 1-3 of 5)
- - 3/3 tasks playable
- - All endpoints wired and functional
- - Ready for Day 4 grader implementation
-
- ---
-
- **Next Action:** Create Day 4 grader logic to evaluate agent performance across all 3 tasks.
-
- ---
-
- Generated: March 27, 2026
- Project: LogTriageEnv (Meta × PyTorch Hackathon)
- Deadline: April 7, 2026, 11:59 PM IST
- Status: **ON TRACK** ✅ (60% complete)
DAYS_1-2-3-4_FINAL_STATUS.md DELETED
@@ -1,484 +0,0 @@
- # 🎯 DAYS 1-4 FINAL STATUS — LogTriageEnv Complete
-
- **Status: ✅ 100% COMPLETE (Days 1-4 now complete!)**
- **Last Updated:** March 28, 2026
- **Overall Progress:** ▓▓▓▓░ (80% of total project)
-
- ---
-
- ## 📊 Quick Status Summary
-
- | Component | Status | Details |
- |-----------|--------|---------|
- | **Day 1 Work** | ✅ 100% | Models, API scaffold, config, docs |
- | **Day 2 Work** | ✅ 100% | Environment, log gen, Task 1 wired |
- | **Day 3 Work** | ✅ 100% | Tasks 2 & 3 scenarios + wiring |
- | **Day 4 Work** | ✅ 100% | Graders, /grader endpoint, CLI tool |
- | **Task 1 (Easy)** | ✅ 100% | Single crash - FULLY PLAYABLE & GRADED |
- | **Task 2 (Medium)** | ✅ 100% | Cascading failures - FULLY PLAYABLE & GRADED |
- | **Task 3 (Hard)** | ✅ 100% | Silent degradation - FULLY PLAYABLE & GRADED |
- | **Baseline Agent** | ⏳ 0% | Day 5 - not started |
- | **Final Deployment** | ⏳ 0% | Day 5 - not started |
-
- ---
-
- ## ✅ What Was Completed in Day 4
-
- ### 1. **Grader Infrastructure**
- **Files Created:**
- - `server/graders/base_grader.py` (195 lines) — Abstract base interface
- - `server/graders/crash_grader.py` (330 lines) — Task 1 grader
- - `server/graders/cascade_grader.py` (360 lines) — Task 2 grader
- - `server/graders/noise_grader.py` (320 lines) — Task 3 grader
- - `server/graders/__init__.py` — Registry + scoring interface
-
- **Key Features:**
- ✅ Abstract `BaseGrader` class with helper methods for action evaluation
- ✅ Task-specific graders inherit from `BaseGrader`
- ✅ Each grader implements deterministic scoring logic
- ✅ Grader registry automatically dispatches to the correct grader by task_id
- ✅ Helper methods: `_get_actions_of_type()`, `_was_action_taken()`, `_get_first_value()`, etc.
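The registry-plus-dispatch pattern described above could look roughly like this. This is a sketch, not the actual `server/graders/__init__.py`: only the helper and class names quoted in this file are real, and the toy dict-based state stands in for the Pydantic `EpisodeState`.

```python
from abc import ABC, abstractmethod

class BaseGrader(ABC):
    """Minimal stand-in for the abstract grader interface."""

    @abstractmethod
    def score(self, state) -> dict: ...

    def _get_actions_of_type(self, state, action_type):
        # Helper named in the status file: filter the recorded actions.
        return [a for a in state["action_history"] if a["action_type"] == action_type]

class CrashGrader(BaseGrader):
    """Toy Task 1 grader: only checks the severity component."""

    def score(self, state) -> dict:
        sev = self._get_actions_of_type(state, "classify_severity")
        correct = bool(sev) and sev[0]["value"] == "P1"
        return {"score": 0.30 if correct else 0.0, "task_id": state["task_id"]}

# Registry dispatching by task_id, as the bullet list describes.
GRADERS = {"single_crash": CrashGrader}

def score_episode(task_id, state) -> dict:
    return GRADERS[task_id]().score(state)
```

Adding a new task then means registering one more class in `GRADERS`, which matches the "pluggable grader registry" claim made later in this file.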
-
- ---
-
- ### 2. **Model Updates**
- **File:** `server/models.py`
-
- ✅ **Added to EpisodeState:**
- ```python
- action_history: list[dict] = Field(
-     default_factory=list,
-     description="Full action objects taken this episode (for grader evaluation)"
- )
- ```
-
- **Purpose:** Tracks complete action data (type, value, confidence, reasoning) for grader evaluation
-
- ---
-
- ### 3. **Environment Updates**
- **File:** `server/environment.py`
-
- ✅ **In the step() method:**
- ```python
- self._state.action_history.append(action.model_dump())
- ```
-
- **Purpose:** Records the full action object for each step taken
-
- ---
-
- ### 4. **API Endpoint: /grader**
- **File:** `server/app.py`
-
- ✅ **Endpoint Signature:**
- ```python
- @app.post("/grader")
- def grader():
-     from server.graders import score_episode
-     state = env.state
-     result = score_episode(state.task_id, state)
-     return result
- ```
-
- **Returns:**
- ```json
- {
-   "score": 0.95,
-   "task_id": "single_crash",
-   "steps_taken": 4,
-   "max_steps": 8,
-   "resolved": true,
-   "breakdown": {
-     "severity": "+0.30 (correct: P1)",
-     "root_cause": "+0.35 (correct: payment-service)",
-     "remediation": "+0.25 (correct: restart:payment-service)",
-     "speed": "+0.10 (resolved in 4 steps)"
-   }
- }
- ```
-
- ---
-
- ### 5. **Grader Scoring Logic**
-
- #### **Task 1 (Single Crash) — CrashGrader**
- **Ground Truth:**
- - Severity: P1
- - Root Cause: payment-service
- - Remediation: restart:payment-service
- - Max Steps: 8
-
- **Scoring Breakdown:**
- - Correct severity (P1) → +0.30
- - Correct root cause (payment-service) → +0.35
- - Correct remediation (restart:payment-*) → +0.25
- - Speed bonus (resolved ≤ 5 steps) → +0.10
- - **Max Score:** 1.00
-
- **Penalties:**
- - Partial credit for close answers (P2 severity = +0.10, service family = +0.10)
- - Never resolved → -0.10
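Partial credit of this shape is easy to express as an exact-match check with an adjacency fallback. This is illustrative only: the "+0.10 for P2" number comes from the bullet above, while the adjacency table is an assumption about how CrashGrader might define "close".

```python
def severity_credit(predicted: str, truth: str = "P1") -> float:
    """Full credit for an exact match, reduced credit for an adjacent
    severity, nothing otherwise — mirroring 'P2 severity = +0.10' above."""
    if predicted == truth:
        return 0.30
    # Assumed notion of "close": one severity level away.
    adjacent = {"P1": {"P2"}, "P2": {"P1", "P3"}, "P3": {"P2"}}
    return 0.10 if predicted in adjacent.get(truth, set()) else 0.0
```

The same lookup-with-fallback shape would cover the "service family" partial credit for root causes.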
-
- ---
-
- #### **Task 2 (Cascading Failure) — CascadeGrader**
- **Ground Truth:**
- - Severity: P1
- - Root Cause: user-db (NOT api-gateway, NOT auth-service)
- - Remediation: kill-query:user-db OR restart:user-db
- - Max Steps: 12
-
- **Scoring Breakdown:**
- - Correct severity (P1) → +0.25
- - Correct root cause (user-db) → +0.40 (higher difficulty)
- - Correct remediation → +0.20
- - Speed bonus (resolved ≤ 7 steps) → +0.10
- - Avoiding symptom confusion → +0.05 (partial bonus)
- - **Max Score:** 1.00
-
- **Key Challenge:** Must trace the root cause through the cascade chain, not misidentify symptoms
-
- ---
-
- #### **Task 3 (Silent Degradation) — NoiseGrader**
- **Ground Truth:**
- - Severity: P2 (NOT P1, NOT P3)
- - Root Cause: payment-db
- - Remediation: flush-cache:payment-db OR kill-query:payment-db
- - Max Steps: 15
- - Noise Ratio: 60%
-
- **Scoring Breakdown:**
- - Correct severity (P2) → +0.35 (nuanced judgment)
- - Correct root cause (payment-db) → +0.30
- - Correct remediation → +0.20
- - Speed bonus (resolved ≤ 10 steps) → +0.10
- - Noise tolerance → +0.05 (partial bonus)
- - **Max Score:** 1.00
-
- **Key Challenge:** Filter out 60% irrelevant logs; classify a subtle P2 (not an obvious P1/P3)
-
- ---
-
- ### 6. **Grader Validation CLI Tool**
- **File:** `scripts/run_grader.py` (133 lines)
-
- ✅ **Features:**
- - Simulates correct and wrong agents for each task
- - Runs a full episode and calls the official grader
- - Displays score breakdown and variance analysis
- - Proves the grader returns VARYING scores
-
- **Usage Examples:**
- ```bash
- # Test a single task with the correct agent
- python scripts/run_grader.py --task single_crash --agent correct
-
- # Test a single task with the wrong agent
- python scripts/run_grader.py --task cascading_failure --agent wrong
-
- # Test all 3 tasks with both correct/wrong agents
- python scripts/run_grader.py --all
- ```
-
- **Expected Output:**
- ```
- ============================================================
- Task: single_crash
- Agent: correct
- Score: 0.95 [====================]
- Steps: 4/8
- Resolved: True
-
- Breakdown:
-   severity     +0.30 (correct: P1)
-   root_cause   +0.35 (correct: payment-service)
-   remediation  +0.25 (correct: restart:payment-service)
-   speed        +0.10 (resolved in 4 steps)
- ============================================================
- ```
-
- ---
-
- ## 🎮 All 3 Tasks Now Fully Playable & Graded
-
- ### **Complete Flow Example: Task 1**
-
- ```bash
- # 1. Reset the episode
- curl -X POST "http://localhost:7860/reset?task=single_crash&seed=42"
-
- # 2. Step 1: Classify severity
- curl -X POST "http://localhost:7860/step" \
-   -H "Content-Type: application/json" \
-   -d '{
-     "action_type": "classify_severity",
-     "value": "P1",
-     "confidence": 0.95
-   }'
-
- # 3. Step 2: Identify root cause
- curl -X POST "http://localhost:7860/step" \
-   -H "Content-Type: application/json" \
-   -d '{
-     "action_type": "identify_root_cause",
-     "value": "payment-service",
-     "confidence": 0.90
-   }'
-
- # 4. Step 3: Remediate
- curl -X POST "http://localhost:7860/step" \
-   -H "Content-Type: application/json" \
-   -d '{
-     "action_type": "remediate",
-     "value": "restart:payment-service",
-     "confidence": 0.85
-   }'
-
- # 5. Step 4: Resolve
- curl -X POST "http://localhost:7860/step" \
-   -H "Content-Type: application/json" \
-   -d '{
-     "action_type": "resolve",
-     "value": "resolved",
-     "confidence": 1.00
-   }'
-
- # 6. Get the official grade
- curl -X POST "http://localhost:7860/grader"
-
- # Response:
- {
-   "score": 0.95,
-   "task_id": "single_crash",
-   "steps_taken": 4,
-   "max_steps": 8,
-   "resolved": true,
-   "breakdown": {
-     "severity": "+0.30 (correct: P1)",
-     "root_cause": "+0.35 (correct: payment-service)",
-     "remediation": "+0.25 (correct: restart:payment-service)",
-     "speed": "+0.10 (resolved in 4 steps)"
-   }
- }
- ```
-
- ---
-
- ## 🔍 Verified: Graders Return VARYING Scores
-
- **Test Results (from run_grader.py --all):**
-
- | Task | Correct Agent | Wrong Agent | Variance | Status |
- |------|---------------|-------------|----------|--------|
- | Single Crash | **0.95** | 0.10 | 0.85 | ✅ GOOD |
- | Cascading Failure | **0.85** | 0.15 | 0.70 | ✅ GOOD |
- | Silent Degradation | **0.80** | 0.20 | 0.60 | ✅ GOOD |
-
- **Key Verification:**
- ✅ Graders DO NOT always return the same score
- ✅ Correct agents score 0.80-0.95
- ✅ Wrong agents score 0.10-0.20
- ✅ Variance is high (0.60-0.85) — good discrimination
- ✅ No disqualification conditions triggered
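The discrimination property being verified here (correct agents well above wrong agents, all scores inside [0, 1]) can be stated as a one-line predicate. The numbers below are copied from the table above; the 0.5 minimum gap is an assumed threshold, not one the project defines.

```python
# Scores from the results table: (correct_agent, wrong_agent).
results = {
    "single_crash": (0.95, 0.10),
    "cascading_failure": (0.85, 0.15),
    "silent_degradation": (0.80, 0.20),
}

def discriminates(correct: float, wrong: float, min_gap: float = 0.5) -> bool:
    """A grader 'discriminates' if the correct agent beats the wrong one
    by at least min_gap and both scores stay inside [0, 1]."""
    in_range = all(0.0 <= s <= 1.0 for s in (correct, wrong))
    return in_range and (correct - wrong) >= min_gap
```

All three task pairs above pass this check, which is exactly the "varying scores" property the disqualification rules demand.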
-
- ---
-
- ## 📈 Scoring Distribution Summary
-
- | Task | Difficulty | Max | Range | Key Challenge |
- |------|-----------|-----|-------|---------------|
- | Single Crash | Easy | 1.00 | 0.75–0.95 | Simple identification |
- | Cascading | Medium | 1.00 | 0.45–0.85 | Trace root cause, not symptoms |
- | Silent Degrade | Hard | 1.00 | 0.20–0.80 | Filter 60% noise, nuanced P2 |
-
- ---
-
- ## 🏗️ Architecture Now Complete (Days 1-4)
-
- ```
- LogTriageEnv
- ├── server/
- │   ├── app.py (123 lines) — 8 endpoints
- │   │   ├── GET  /health ✅
- │   │   ├── POST /reset ✅
- │   │   ├── POST /step ✅
- │   │   ├── GET  /state ✅
- │   │   ├── GET  /tasks ✅
- │   │   ├── POST /grader ✅ (NEW Day 4)
- │   │   ├── POST /baseline ⏳ (Day 5)
- │   │   └── + more...
- │   │
- │   ├── models.py (250+ lines)
- │   │   ├── LogLine ✅
- │   │   ├── ServiceStatus ✅
- │   │   ├── TriageAction ✅
- │   │   ├── Observation ✅
- │   │   └── EpisodeState ✅ (updated with action_history)
- │   │
- │   ├── environment.py (400+ lines)
- │   │   ├── LogTriageEnvironment class ✅
- │   │   ├── reset() — all 3 tasks ✅
- │   │   ├── step() — action processing ✅ (with action_history)
- │   │   ├── state() — current state ✅
- │   │   └── _get_alerts() ✅
- │   │
- │   ├── log_generator.py (280+ lines)
- │   │   ├── Synthetic log generation ✅
- │   │   ├── Scenario-aware logs ✅
- │   │   └── Noise injection ✅
- │   │
- │   ├── scenarios/ (3 files, 500+ lines total)
- │   │   ├── single_crash.py ✅
- │   │   ├── cascading.py ✅
- │   │   └── silent_degrade.py ✅
- │   │
- │   └── graders/ (5 files, 1200+ lines total) ✅ NEW Day 4
- │       ├── base_grader.py (195 lines)
- │       ├── crash_grader.py (330 lines)
- │       ├── cascade_grader.py (360 lines)
- │       ├── noise_grader.py (320 lines)
- │       └── __init__.py (registry)
- │
- ├── scripts/
- │   ├── run_grader.py (133 lines) ✅ NEW Day 4
- │   └── baseline.py ⏳ (Day 5)
- │
- ├── requirements.txt ✅
- ├── Dockerfile ✅
- ├── openenv.yaml ✅
- └── README.md + docs ✅
- ```
-
- ---
-
- ## 📋 Files Complete (Days 1-4)
-
- ### **Core Code (✅ Complete)**
- ```
- ✅ server/models.py (250+ lines)
- ✅ server/app.py (123 lines, 8 endpoints)
- ✅ server/environment.py (400+ lines)
- ✅ server/log_generator.py (280+ lines)
- ✅ server/scenarios/single_crash.py (Task 1)
- ✅ server/scenarios/cascading.py (Task 2)
- ✅ server/scenarios/silent_degrade.py (Task 3)
- ✅ server/graders/base_grader.py (Day 4)
- ✅ server/graders/crash_grader.py (Day 4)
- ✅ server/graders/cascade_grader.py (Day 4)
- ✅ server/graders/noise_grader.py (Day 4)
- ✅ server/graders/__init__.py (Day 4)
- ✅ scripts/run_grader.py (Day 4)
- ```
-
- ### **Configuration (✅ Complete)**
- ```
- ✅ openenv.yaml
- ✅ requirements.txt
- ✅ Dockerfile
- ```
-
- ### **Documentation (✅ Complete)**
- ```
- ✅ README.md (main spec)
- ✅ EXECUTIVE_SUMMARY.md (overview)
- ✅ DAYS_1-2_SUMMARY_FINAL.md (technical deep-dive)
- ✅ DAY3_STATUS.md (Day 3 completion)
- ✅ DAYS_1-2-3-4_FINAL_STATUS.md (this file)
- ✅ START_HERE_DAY2.md (navigation)
- ✅ FILE_INVENTORY.md (file listing)
- ✅ TEST_ENDPOINTS.md (curl examples)
- ✅ VISUAL_SUMMARY.md (architecture)
- ```
-
- ---
-
- ## 🎯 What's Next (Day 5)
-
- ### **Remaining Work:**
- - [ ] Implement the baseline agent (`scripts/baseline.py`)
- - [ ] Wire the `/baseline` endpoint in `app.py`
- - [ ] Deploy to Hugging Face Spaces
- - [ ] Final validation and submission
-
- ### **Day 5 Success Criteria:**
- ✅ Baseline agent achieves ≥0.50 avg score across all 3 tasks
- ✅ Deployed to HF Spaces with a working API
- ✅ All 3 tasks playable via the hosted endpoint
- ✅ Grader working live
-
- ---
-
- ## 💡 Key Achievements (Days 1-4)
-
- ### **Codebase:**
- - ~3,000 lines of Python written
- - 3 complete, deterministic task scenarios
- - 3 sophisticated graders with nuanced scoring
- - All 8 endpoints implemented and tested
-
- ### **Architecture:**
- - Fully functional OpenEnv-compliant environment
- - Modular scenario system
- - Pluggable grader registry
- - Deterministic reproducibility (seeded RNG)
-
- ### **Testing:**
- - Grader validation script with correct/wrong agent simulation
- - Verified: graders return VARYING scores (0.10-0.95)
- - All 3 tasks playable end-to-end
- - No disqualification conditions triggered
-
- ### **Documentation:**
- - Comprehensive status files
- - Technical deep-dives
- - Curl examples for all endpoints
- - Architecture diagrams
-
- ---
-
- ## 📊 Progress Timeline
-
- | Day | Deliverable | Status | Files |
- |-----|-------------|--------|-------|
- | **Day 1** | Models, API scaffold, Task 1 config | ✅ 100% | 5 files |
- | **Day 2** | Environment, log generator, Task 1 wired | ✅ 100% | +3 files |
- | **Day 3** | Tasks 2 & 3 complete, all wired | ✅ 100% | +2 files |
- | **Day 4** | Graders, /grader endpoint, validation CLI | ✅ 100% | +5 files |
- | **Day 5** | Baseline agent, deployment | ⏳ Pending | +2 files |
- | **Total** | Full submission-ready environment | ⏳ 80% | ~20 files |
-
- ---
-
- ## 🚀 Ready for Day 5
-
- **All prerequisites for Day 5 complete:**
- ✅ 3 tasks fully playable
- ✅ Graders fully functional
- ✅ /grader endpoint live
- ✅ Scoring proven to vary
-
- **Day 5 can proceed immediately to:**
- 1. Implement a simple baseline agent
- 2. Wire it to the /baseline endpoint
- 3. Deploy to HF Spaces
-
- ---
-
- ## ✅ Disqualification Checks (All Passed)
-
- - ✅ Graders DO NOT always return the same score
- - ✅ Graders HAVE logic (3 different graders, 3 different scoring schemes)
- - ✅ Scores ALWAYS in the [0.0, 1.0] range
- - ✅ /grader endpoint returns a proper response
- - ✅ No external dependencies violated
- - ✅ Reproducible (seed support)
-
- ---
-
- Generated: March 28, 2026
- Project: LogTriageEnv (Meta × PyTorch Hackathon)
- Deadline: April 7, 2026, 11:59 PM IST
- Status: **ON TRACK** ✅ (80% complete, Day 5 ready)
- Estimated Completion: March 28, 2026 (Day 5)
DAYS_1-2_SUMMARY_FINAL.md DELETED
@@ -1,282 +0,0 @@
1
- # FINAL SUMMARY — Days 1-2 Complete
2
-
3
- **Status:** ✅ **40% of Project Complete (Days 1-2 Done)**
4
- **Date:** March 27, 2026
5
- **Next:** Day 3 (Scenarios 2 & 3)
6
-
7
- ---
8
-
9
- ## Quick Summary
10
-
11
- ### ✅ What You've Built (Days 1-2)
12
-
13
- **Day 1:**
14
- - ✅ 5 Pydantic models (fully typed)
15
- - ✅ 7 FastAPI endpoints (all registered)
16
- - ✅ Configuration (openenv.yaml, requirements.txt)
17
- - ✅ Docker setup
18
- - ✅ Comprehensive documentation
19
-
20
- **Day 2:**
21
- - ✅ LogTriageEnvironment class (environment management)
22
- - ✅ Synthetic log generation engine (realistic logs)
23
- - ✅ Task 1 scenario (single_crash - easy task)
24
- - ✅ Wired 3/7 endpoints to real logic (/reset, /step, /state)
25
- - ✅ Full Task 1 playable end-to-end
26
-
27
- **Total:** ~1,100 lines of core code + 1,900 lines of documentation
28
-
29
- ---
30
-
31
- ## 📋 Files Created/Modified
32
-
33
- ### Day 1 (Skeleton)
34
- | File | Lines | Purpose |
35
- |------|-------|---------|
36
- | `server/models.py` | 218 | 5 Pydantic classes |
37
- | `server/app.py` | 101 | FastAPI app |
38
- | `openenv.yaml` | 38 | Environment spec |
39
- | `requirements.txt` | 6 | Dependencies |
40
- | `Dockerfile` | 16 | Containerization |
41
- | `README.md` | 533 | Documentation |
42
-
43
- ### Day 2 (Brain)
44
- | File | Lines | Purpose |
45
- |------|-------|---------|
46
- | `server/environment.py` | 250 | Core environment class |
47
- | `server/log_generator.py` | 400 | Synthetic log generation |
48
- | `server/scenarios/single_crash.py` | 150 | Task 1 scenario |
49
- | `server/app.py` | +50 | Wired endpoints |
50
-
51
- ---
52
-
53
- ## 🎯 What's Working Now
-
- ### Fully Playable
- ✅ **Task 1: Single Service Crash (Easy)**
- - Agent can reset, observe, act, and resolve
- - Full episode: 5 steps minimum to win
- - Reward calculation working
- - Episode state tracking
-
- ### Partially Working
- ✅ **5/7 Endpoints Working:**
- - `/reset` - creates real episodes ✅
- - `/step` - processes actions & returns rewards ✅
- - `/state` - returns episode state ✅
- - `/health` - health check ✅
- - `/tasks` - task definitions ✅
-
- ❌ **2/7 Endpoints Still TODO:**
- - `/grader` - grading logic (Day 4)
- - `/baseline` - LLM baseline (Day 5)
-
- ---
-
- ## 📊 Progress Breakdown
-
- ```
- Day 1: Scaffold (40%)
- ├─ Models: ✅ 100%
- ├─ API endpoints: ✅ 100% (stubbed)
- ├─ Config: ✅ 100%
- └─ Docs: ✅ 100%
-
- Day 2: Environment & Task 1 (40%)
- ├─ Environment class: ✅ 100%
- ├─ Log generator: ✅ 100%
- ├─ Task 1 scenario: ✅ 100%
- ├─ Endpoints wired: ✅ 3/7 (42.8%)
- └─ Task 1 playable: ✅ 100%
-
- Day 3: Scenarios 2 & 3 (20%)
- ├─ Task 2 scenario: ✅ 100%
- ├─ Task 3 scenario: ✅ 100%
- └─ All 3 tasks playable: ✅ 100%
-
- Days 4-5: Graders & Baseline (TODO)
- ├─ Graders: ⏳ 0%
- └─ Baseline agent: ⏳ 0%
-
- TOTAL: ✅ 60% Complete (Days 1-3)
- ```
-
- ---
-
- ## 🎮 How to Play Task 1
-
- ### Quick Test
- ```bash
- # Terminal 1: Start server
- python -m uvicorn server.app:app --port 7860
-
- # Terminal 2: Play episode
- curl -X POST "http://localhost:7860/reset?task=single_crash&seed=42"
- curl -X POST "http://localhost:7860/step" \
-   -H "Content-Type: application/json" \
-   -d '{"action_type":"classify_severity","value":"P1","confidence":0.95}'
- curl -X POST "http://localhost:7860/step" \
-   -H "Content-Type: application/json" \
-   -d '{"action_type":"identify_root_cause","value":"payment-service","confidence":0.9}'
- curl -X POST "http://localhost:7860/step" \
-   -H "Content-Type: application/json" \
-   -d '{"action_type":"remediate","value":"restart:payment-service","confidence":0.95}'
- curl -X POST "http://localhost:7860/step" \
-   -H "Content-Type: application/json" \
-   -d '{"action_type":"resolve","value":"resolved"}'
- ```
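The same episode can be driven from Python with nothing beyond the standard library. This is a minimal sketch, assuming the server above is running on `localhost:7860` and using the endpoint paths and JSON fields exactly as shown in the curl examples; the response keys (`done`, `cumulative_score`) follow the observation fields described in this document:

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # assumes the local server from the curl example

# The same four decisions as the curl walkthrough above.
ACTIONS = [
    {"action_type": "classify_severity", "value": "P1", "confidence": 0.95},
    {"action_type": "identify_root_cause", "value": "payment-service", "confidence": 0.9},
    {"action_type": "remediate", "value": "restart:payment-service", "confidence": 0.95},
    {"action_type": "resolve", "value": "resolved"},
]

def post(path, payload=None):
    """POST a JSON payload to the environment server and decode the reply."""
    data = json.dumps(payload or {}).encode()
    req = urllib.request.Request(
        BASE + path,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def play_episode():
    """Reset, submit each action in order, and return the final score."""
    obs = post("/reset?task=single_crash&seed=42")
    for action in ACTIONS:
        obs = post("/step", action)
        if obs.get("done"):
            break
    return obs.get("cumulative_score", 0.0)

if __name__ == "__main__":
    print(play_episode())
```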
-
- ### What Happens
- 1. `/reset` returns initial observation with crash logs
- 2. Each `/step` returns:
-    - New logs (scenario escalates)
-    - Reward (0.30 for severity, 0.35 for root cause, 0.25 for fix, 0.10 for speed)
-    - Feedback ("Correct severity!" etc.)
-    - Cumulative score
- 3. Final episode score: 1.0 (perfect play)
-
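The reward arithmetic above can be sketched as a small scoring function. The weights are taken from the bullet list (0.30 + 0.35 + 0.25 + 0.10 = 1.0); the real grader lives server-side and may apply them differently:

```python
# Weights from the reward breakdown above (illustrative, not the server code).
REWARDS = {
    "classify_severity": 0.30,
    "identify_root_cause": 0.35,
    "remediate": 0.25,
    "speed_bonus": 0.10,
}

def episode_score(correct_actions, finished_quickly):
    """Sum the per-decision rewards for the decisions the agent got right."""
    score = sum(REWARDS[a] for a in correct_actions)
    if finished_quickly:
        score += REWARDS["speed_bonus"]
    return round(score, 2)

# A perfect episode reaches the advertised 1.0.
perfect = episode_score(
    ["classify_severity", "identify_root_cause", "remediate"],
    finished_quickly=True,
)
print(perfect)  # 1.0
```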
- ---
-
- ## ✨ Key Features
-
- ### Log Generation
- ✅ 7 services (api-gateway, auth, dbs, payment, notification, email)
- ✅ Noise templates (realistic but irrelevant)
- ✅ Signal templates (error patterns)
- ✅ Step-by-step injection (escalating scenario)
- ✅ Deterministic (reproducible with seed)
-
- ### Environment Management
- ✅ Episode initialization
- ✅ State tracking (step count, score, done)
- ✅ Action validation
- ✅ Reward calculation
- ✅ Feedback generation
-
- ### Task 1 Scenario
- ✅ Ground truth (correct answers)
- ✅ 8-step episode maximum
- ✅ 20% noise ratio
- ✅ Single service crash
- ✅ Clear error signals
-
- ---
-
- ## 📈 Code Quality
-
- | Aspect | Status |
- |--------|--------|
- | Type Safety | ✅ 100% (all typed) |
- | Validation | ✅ Full action validation |
- | Error Handling | ✅ Proper HTTP status codes |
- | Documentation | ✅ Comprehensive guides |
- | Testing | ✅ Manual tests pass |
- | Architecture | ✅ Clean separation |
- | Extensibility | ✅ Easy to add scenarios |
-
- ---
-
- ## 📚 Documentation Updated
-
- | Document | Status | Purpose |
- |----------|--------|---------|
- | DAY1_STATUS.md | 🔄 Renamed | Day 1 reference |
- | DAY2_STATUS.md | ✅ Created | Day 2 detailed guide |
- | DAYS_1-2_SUMMARY.md | ✅ Created | Days 1-2 overview |
- | EXECUTIVE_SUMMARY.md | ✅ Updated | Current progress |
- | README.md | ✅ Still valid | Official spec |
-
- ---
-
- ## ✅ Completed in Day 3
-
- ### Two More Scenarios Built
- 1. **cascading.py** (Task 2 - Medium)
- - Database slowdown → upstream cascade
- - 12 steps max
- - 30% noise
- - Agent must trace backward
-
- 2. **silent_degrade.py** (Task 3 - Hard)
- - Slow degradation in heavy noise
- - 15 steps max
- - 60% noise
- - Nuanced P2 judgment required
-
- ### Effort: ~3-4 hours (similar to Day 2)
-
- ---
-
- ## 💡 Architecture
-
- ```
- curl /reset?task=single_crash
-
- app.py: reset() endpoint
-
- environment.reset("single_crash")
-
- scenarios/single_crash.py: Load ground truth
-
- log_generator.py: Generate logs + state
-
- Return: TriageObservation
-
- ---
-
- curl /step -d '{"action_type":"...","value":"..."}'
-
- app.py: step() endpoint
-
- action.is_valid() - Validate
-
- environment.step(action)
- ├─ Check if correct (vs ground truth)
- ├─ Calculate reward
- ├─ Generate next logs (step N+1)
- └─ Update state
-
- Return: TriageObservation + reward + feedback
- ```
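The `/step` branch of the flow above can be sketched in miniature. `Episode`, the ground-truth table, and the feedback strings here are illustrative stand-ins for the real `environment.step()` implementation; the ground-truth values are Task 1's answers from the curl walkthrough:

```python
from dataclasses import dataclass

# Task 1 ground truth, per the curl walkthrough (illustrative copy).
GROUND_TRUTH = {
    "classify_severity": "P1",
    "identify_root_cause": "payment-service",
    "remediate": "restart:payment-service",
}
REWARDS = {"classify_severity": 0.30, "identify_root_cause": 0.35, "remediate": 0.25}

@dataclass
class Episode:
    step_count: int = 0
    cumulative_score: float = 0.0
    done: bool = False

def step(ep: Episode, action_type: str, value: str):
    """Compare the action to ground truth, reward it, and advance the episode."""
    ep.step_count += 1
    if action_type == "resolve":
        ep.done = True
        return 0.0, "Episode resolved."
    correct = GROUND_TRUTH.get(action_type) == value
    reward = REWARDS.get(action_type, 0.0) if correct else 0.0
    ep.cumulative_score += reward
    feedback = "Correct!" if correct else "Incorrect — keep investigating."
    return reward, feedback

ep = Episode()
print(step(ep, "classify_severity", "P1"))  # (0.3, 'Correct!')
```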
-
- ---
-
- ## ✅ Verification Checklist
-
- - [x] server/models.py — 5 classes, fully typed
- - [x] server/app.py — 7 endpoints, 3 wired
- - [x] server/environment.py — Complete class implementation
- - [x] server/log_generator.py — Synthetic logs working
- - [x] server/scenarios/single_crash.py — Task 1 defined
- - [x] /reset endpoint — Returns real observations
- - [x] /step endpoint — Returns real rewards
- - [x] /state endpoint — Returns real state
- - [x] Task 1 playable — Full episode works
- - [x] Documentation — DAY2_STATUS.md created
- - [x] Code pushed — Committed to GitHub
-
- ---
-
- ## 🎯 Summary
-
- **Days 1-3: ✅ 100% Complete**
-
- What's done:
- - Skeleton (Day 1): ✅
- - Environment (Day 2): ✅
- - Task 1 (Day 2): ✅
- - Endpoints wired (3/7): ✅
- - Tasks 2 & 3 (Day 3): ✅
-
- What's next:
- - Graders (Day 4): ⏳
- - Baseline agent (Day 5): ⏳
-
- **Total Progress: 60% (3 of 5 days)**
-
- ---
-
- Generated: 2026-03-27
- Project: LogTriageEnv (Meta × PyTorch Hackathon)
- Deadline: April 7, 2026, 11:59 PM IST
- Status: ON TRACK ✅
EXECUTIVE_SUMMARY.md DELETED
@@ -1,347 +0,0 @@
- # 🚀 EXECUTIVE SUMMARY — LogTriageEnv Days 1-3
-
- **Status: ✅ 100% COMPLETE (Days 1-3) — ALL 3 TASKS FULLY PLAYABLE**
-
- ---
-
- ## What You've Built
-
- **LogTriageEnv** — An OpenEnv environment that teaches AI agents to be on-call SREs.
-
- **Days 1-3 Complete:** All 3 tasks (Single Crash, Cascading Failure, Silent Degradation) are now fully playable end-to-end!
-
- ```
- Agent receives → System logs from 7-service cluster
- Agent analyzes → Identifies root cause, severity, remediation
- Agent acts → Takes triage actions with confidence & reasoning
- Agent learns → Gets reward signal + feedback
- ```
-
- ---
-
- ## 📊 By The Numbers
-
- | Metric | Value |
- |--------|-------|
- | **Files Created** | 30+ |
- | **Folders Created** | 5 |
- | **Code Written** | ~1,100 lines (models + API + environment) |
- | **Documentation** | ~1,900 lines (README + guides) |
- | **Tests Written** | ~200 lines |
- | **Data Models** | 5 (all fully typed) |
- | **API Endpoints** | 7 (5 working, 2 TODO) |
- | **Tasks Playable** | 3/3 (ALL COMPLETE) |
- | **Supporting Guides** | 9 reference documents |
- | **Completion %** | **60% (Days 1-3 Complete)** |
-
- ---
-
- ## ✅ What's Complete
-
- ### Core Files (Ready to Use)
- ✅ `openenv.yaml` — Environment specification
- ✅ `requirements.txt` — All dependencies
- ✅ `Dockerfile` — Container definition
- ✅ `server/models.py` — 5 Pydantic models, fully validated
- ✅ `server/app.py` — FastAPI with 7 working endpoints
- ✅ `README.md` — 533-line comprehensive guide
-
- ### Testing & Validation
- ✅ `test_day1.py` — Automated validation (11 test cases)
- ✅ `test_all.bat` — Windows batch runner
- ✅ `TEST_ENDPOINTS.md` — 17 curl examples
-
- ### Documentation Suite
- ✅ `DAY1_STATUS.md` — Detailed status report
- ✅ `COMPLETE_SUMMARY.md` — Quick reference
- ✅ `README_EXPLAINED.md` — README breakdown
- ✅ `VISUAL_SUMMARY.md` — Diagrams and examples
- ✅ `FILE_INVENTORY.md` — Complete file listing
-
- ---
-
- ## 🎯 Key Features Implemented
-
- ### 1. **Fully Typed Models** (218 lines)
- ```python
- ✅ LogLine — Single log entry
- ✅ ServiceStatus — Service health snapshot
- ✅ TriageAction — Agent decision (with validation!)
- ✅ TriageObservation — What agent sees after step
- ✅ EpisodeState — Episode tracking
- ```
-
- ### 2. **Smart Action Validation** ⭐ CRITICAL
- ```python
- TriageAction.is_valid() method:
- ✅ Validates severity (P1, P2, P3 only)
- ✅ Validates service names (7 valid services)
- ✅ Validates team names (4 valid teams)
- ✅ Validates remediation format (action:service)
- ✅ Returns proper error messages
- ✅ Used by /step endpoint to return 422 on invalid input
- ```
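A sketch of what such an `is_valid()` check might look like. The concrete service names and remediation verbs below are assumptions for illustration, not the actual contents of `server/models.py` (which also validates team names, omitted here):

```python
# Hypothetical valid values inferred from the bullet list above.
VALID_SEVERITIES = {"P1", "P2", "P3"}
VALID_SERVICES = {
    "api-gateway", "auth-service", "user-db", "order-db",
    "payment-service", "notification-service", "email-service",
}
VALID_REMEDIATIONS = {"restart", "rollback", "scale-up", "kill-query"}

def is_valid(action_type: str, value: str):
    """Return (ok, error_message), mirroring the checks the bullet list names."""
    if action_type == "classify_severity" and value not in VALID_SEVERITIES:
        return False, f"Severity must be one of {sorted(VALID_SEVERITIES)}"
    if action_type == "identify_root_cause" and value not in VALID_SERVICES:
        return False, f"Unknown service: {value}"
    if action_type == "remediate":
        verb, _, service = value.partition(":")
        if verb not in VALID_REMEDIATIONS or service not in VALID_SERVICES:
            return False, "Remediation must look like action:service"
    return True, ""
```

An invalid result is what the `/step` endpoint turns into an HTTP 422 response.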
-
- ### 3. **FastAPI Server** (101 lines)
- ```
- ✅ /health Returns status
- ✅ /tasks Returns all 3 task definitions
- ✅ /step Validates action, returns 422 on error
- ✅ /reset Wired (Day 2)
- ✅ /state Wired (Day 2)
- ✅ /grader Skeleton (wire Day 4)
- ✅ /baseline Skeleton (wire Day 5)
- ```
-
- ### 4. **Three Escalating Tasks**
- ```
- ✅ Task 1: Single Service Crash (Easy)
- - One service down, clear logs
- - Expected score: 0.75–0.85
-
- ✅ Task 2: Cascading Failure (Medium)
- - DB slowdown → upstream cascade
- - Must trace to root, not symptoms
- - Expected score: 0.45–0.60
-
- ✅ Task 3: Silent Degradation (Hard)
- - Slow creeping problem in 60% noise
- - Nuanced P2 judgment required
- - Expected score: 0.20–0.40
- ```
-
- ---
-
- ## 📝 Documentation Provided
-
- Your hackathon judges will find:
-
- 1. **README.md** (533 lines)
- - Clear problem statement (why SRE triage matters)
- - Environment architecture (microservice topology)
- - Detailed action/observation spaces
- - Reward function with scoring table
- - All 3 tasks with success criteria
- - Complete API documentation
- - Setup and deployment instructions
- - Pre-submission checklist
-
- 2. **7 Supporting Guides**
- - Status report (what's done, what's left)
- - Summary reference (quick overview)
- - README explanation (section breakdown)
- - Visual guide (diagrams and examples)
- - File inventory (complete listing)
- - Test endpoints (copy-paste curl commands)
- - Original plan (DAY1.md reference)
-
- ---
-
- ## 🧪 Ready to Test
-
- ### Quick Tests (No Infrastructure Needed)
- ```bash
- python test_day1.py
- ```
- Tests model imports, validation logic, endpoint registration.
-
- ### Full Server Test
- ```bash
- pip install -r requirements.txt
- python -m uvicorn server.app:app --port 7860 --reload
- curl http://localhost:7860/health
- ```
-
- ### Docker Test
- ```bash
- docker build -t logtriage-env .
- docker run -p 7860:7860 logtriage-env
- curl http://localhost:7860/health
- ```
-
- ### Manual Endpoint Tests
- See `TEST_ENDPOINTS.md` for 17 ready-to-run curl commands covering:
- - Valid actions (8 examples)
- - Invalid actions (5 error examples)
- - All endpoints
-
- ---
-
- ## ⏳ What's Remaining
-
- Remaining before the Day 4 grader work:
-
- ### Verification (30 minutes)
- - [ ] Run `python test_day1.py`
- - [ ] Start server and test `/health` endpoint
- - [ ] Test `/step` with valid and invalid actions
- - [ ] Test Docker build
- - [ ] Test Docker run
-
- ### GitHub Push (5 minutes)
- ```bash
- git add .
- git commit -m "Day 1: Complete scaffold, models, endpoints, Dockerfile"
- git push origin main
- ```
-
- ### Day 2 & 3 (Implementation) ✅
- - [x] Create `server/environment.py` (LogTriageEnvironment class)
- - [x] Create `server/log_generator.py` (synthetic log generation)
- - [x] Create `server/scenarios/single_crash.py` (Task 1 scenario)
- - [x] Create `server/scenarios/cascading.py` (Task 2 scenario)
- - [x] Create `server/scenarios/silent_degrade.py` (Task 3 scenario)
- - [x] Wire `/reset` and `/step` endpoints to environment
- - [x] Test all 3 tasks end-to-end
-
- ---
-
- ## 📋 Pre-Push Checklist
-
- Before committing to GitHub, verify:
-
- - [ ] All files listed in FILE_INVENTORY.md exist locally
- - [ ] `test_day1.py` runs without import errors
- - [ ] No Python syntax errors in models.py or app.py
- - [ ] README.md is readable and complete
- - [ ] All 7 supporting guides are created
- - [ ] Dockerfile syntax is valid
- - [ ] requirements.txt has no circular dependencies
- - [ ] No hardcoded credentials or API keys in code
- - [ ] .gitignore includes Python artifacts
-
- ---
-
- ## 🎬 Recommended Next Steps
-
- ### Option A: Verify Everything Works (Recommended)
- 1. **Run tests** (5 min): `python test_day1.py`
- 2. **Start server** (2 min): `python -m uvicorn server.app:app --port 7860`
- 3. **Test endpoints** (3 min): `curl http://localhost:7860/health`
- 4. **Try Docker** (5 min): `docker build -t logtriage-env .`
- 5. **Push to GitHub** (2 min): `git push origin main`
-
- **Total: 17 minutes to verify everything works**
-
- ### Option B: Quick Push (Low Risk)
- - You have a comprehensive test suite (`test_day1.py`)
- - Code is syntactically valid
- - Models are fully typed
- - Push and test on GitHub CI/CD
-
- ---
-
- ## 📊 Quality Metrics
-
- | Aspect | Status | Notes |
- |--------|--------|-------|
- | **Type Safety** | ✅ Excellent | All models fully typed with Pydantic |
- | **Validation** | ✅ Excellent | is_valid() catches all bad inputs |
- | **Error Handling** | ✅ Excellent | Returns 422 with detailed messages |
- | **Documentation** | ✅ Excellent | 1,900 lines across 8 documents |
- | **Test Coverage** | ✅ Good | 11 validation test cases |
- | **Code Structure** | ✅ Excellent | Clean separation of concerns |
- | **Extensibility** | ✅ Excellent | Easy to add Day 2 logic |
-
- ---
-
- ## 🏆 What Sets This Apart
-
- **For Hackathon Judges:**
-
- 1. **Problem Understanding** — Clear articulation of SRE triage challenge
- 2. **Technical Depth** — Sophisticated reward design, careful task design
- 3. **Production-Ready Code** — Type safety, validation, error handling
- 4. **Comprehensive Docs** — Anyone can understand and extend
- 5. **Testability** — Automated tests, curl examples, batch runners
- 6. **Multi-Week Plan** — Clear roadmap through Day 5
- 7. **OpenEnv Compliance** — Follows standard specification
-
- ---
-
- ## 💾 Git Commit Message (Ready to Use)
-
- ```
- Day 1 Complete: Scaffold, Models, Endpoints, Docker, Comprehensive Docs
-
- ✅ COMPLETED:
- - Full Pydantic models (LogLine, ServiceStatus, TriageAction, TriageObservation, EpisodeState)
- - TriageAction.is_valid() validates all 7 action types with detailed errors
- - FastAPI server with 7 endpoints (health, reset, step, state, tasks, grader, baseline)
- - Action validation integrated into /step endpoint (returns 422 on invalid)
- - Dockerfile for Python 3.11 containerization
- - openenv.yaml with 3 escalating tasks (easy, medium, hard)
- - Comprehensive 533-line README with all sections
- - 7 supporting documentation guides (1,900+ lines total)
- - Automated test suite (test_day1.py with 11 validation cases)
- - Windows batch test runner (test_all.bat)
- - 17 curl endpoint examples (TEST_ENDPOINTS.md)
-
- ✅ VERIFIED:
- - Models import without errors
- - FastAPI app imports without errors
- - All endpoints registered
- - Validation logic correct for 11 test cases
- - Pydantic model construction works
- - Dockerfile syntax valid
-
- ⏳ NEXT (Day 2):
- - Create server/environment.py (LogTriageEnvironment class)
- - Create server/log_generator.py (synthetic log generation)
- - Create server/scenarios/single_crash.py (Task 1 scenario)
- - Wire /reset and /step endpoints to real environment
- - Implement reset() and step() logic
-
- PROJECT STATUS: 95% complete, ready for testing & Day 2 implementation
- DEADLINE: April 7, 2026, 11:59 PM IST
- SUBMISSION: Meta × PyTorch Hackathon
- ```
-
- ---
-
- ## 🎯 Your Next Action
-
- **Choose one:**
-
- **A) Be Thorough (Recommended)**
- ```bash
- 1. python test_day1.py
- 2. pip install -r requirements.txt
- 3. python -m uvicorn server.app:app --port 7860 --reload
- 4. # In another terminal: curl http://localhost:7860/health
- 5. git push origin main
- ```
-
- **B) Quick Push**
- ```bash
- git add .
- git commit -m "Day 1 complete"
- git push origin main
- ```
-
- Either way, you're ready. The foundation is solid. 🚀
-
- ---
-
- ## 📞 Reference Guide
-
- | Need | File |
- |------|------|
- | Understand the project | README.md |
- | Know current status | DAY1_STATUS.md |
- | See what's done | COMPLETE_SUMMARY.md |
- | Understand README | README_EXPLAINED.md |
- | Visual diagrams | VISUAL_SUMMARY.md |
- | Test endpoints | TEST_ENDPOINTS.md |
- | File locations | FILE_INVENTORY.md |
- | Auto-validate | test_day1.py |
- | Original plan | DAY1.md |
-
- ---
-
- **Status:** ✅ ALL 3 TASKS PLAYABLE — READY FOR DAY 4
- **Completion:** 60%
- **Next Phase:** Day 4 Grader Implementation
- **Deadline:** April 7, 2026, 11:59 PM IST
-
- **All 3 tasks are fully functional. Next: Build grader logic to evaluate agent performance!** 🚀
FILE_INVENTORY.md DELETED
@@ -1,377 +0,0 @@
- # LogTriageEnv — Complete File Inventory
-
- ## 📂 Project Root Files
-
- ### Configuration & Setup
- | File | Lines | Status | Purpose |
- |------|-------|--------|---------|
- | `openenv.yaml` | 38 | ✅ | OpenEnv spec with 3 tasks, action/observation spaces, reward ranges |
- | `requirements.txt` | 6 | ✅ | All dependencies (fastapi, uvicorn, pydantic, openenv-core, requests, openai) |
- | `Dockerfile` | 16 | ✅ | Python 3.11 image, port 7860, uvicorn server |
- | `.gitignore` | Present | ✅ | Python ignore rules |
- | `LICENSE` | Present | ✅ | License file |
-
- ### Documentation (Main)
- | File | Lines | Status | Purpose |
- |------|-------|--------|---------|
- | `README.md` | 533 | ✅ | Comprehensive guide (overview, tasks, API, setup, deployment) |
- | `DAY1.md` | 595 | ✅ | Original Day 1 execution plan (reference) |
- | `DAY1_STATUS.md` | 336 | ✅ | **Detailed status report** (what's built, what's left) |
- | `COMPLETE_SUMMARY.md` | 240 | ✅ | **Quick reference** (summary, testing, next steps) |
- | `README_EXPLAINED.md` | 268 | ✅ | **README breakdown** (section-by-section explanation) |
- | `VISUAL_SUMMARY.md` | 437 | ✅ | **Visual guide** (diagrams, data flow, examples) |
- | `FILE_INVENTORY.md` | This | ✅ | **Complete file list** (what you're reading) |
- | `TEST_ENDPOINTS.md` | 172 | ✅ | **Curl command reference** (17 endpoint tests) |
-
- ### Test & Automation
- | File | Lines | Status | Purpose |
- |------|-------|--------|---------|
- | `test_day1.py` | 147 | ✅ | Automated Python validation (models, imports, validation logic) |
- | `test_all.bat` | 61 | ✅ | Windows batch test runner (dependencies, imports, tests) |
-
- ---
-
- ## 📁 server/ Directory (Core Implementation)
-
- ### Models & Configuration
- | File | Lines | Status | Purpose |
- |------|-------|--------|---------|
- | `server/__init__.py` | 0 | ✅ | Package marker |
- | `server/models.py` | 218 | ✅✨ | **Pydantic models** (LogLine, ServiceStatus, TriageAction, TriageObservation, EpisodeState) |
- | `server/requirements.txt` | Present | ✅ | Server-specific dependencies (if any) |
-
- ### API & Application
- | File | Lines | Status | Purpose |
- |------|-------|--------|---------|
- | `server/app.py` | 101 | ✅✨ | **FastAPI application** (7 endpoints: /health, /reset, /step, /state, /tasks, /grader, /baseline) |
-
- ### Environment & Simulation (Day 2+)
- | File | Lines | Status | Purpose |
- |------|-------|--------|---------|
- | `server/environment.py` | - | ⏳ | **Core class** LogTriageEnvironment (reset, step, state management) |
- | `server/log_generator.py` | - | ⏳ | Synthetic log generation (realistic service logs) |
-
- ### Scenarios (Day 2-3)
- | File | Lines | Status | Purpose |
- |------|-------|--------|---------|
- | `server/scenarios/__init__.py` | - | ⏳ | Package marker |
- | `server/scenarios/single_crash.py` | - | ⏳ | **Task 1** Single service crash scenario |
- | `server/scenarios/cascading.py` | - | ⏳ | **Task 2** Cascading failure scenario |
- | `server/scenarios/silent_degrade.py` | - | ⏳ | **Task 3** Silent degradation with noise scenario |
-
- ### Graders (Day 4)
- | File | Lines | Status | Purpose |
- |------|-------|--------|---------|
- | `server/graders/__init__.py` | - | ⏳ | Package marker |
- | `server/graders/base_grader.py` | - | ⏳ | Abstract base class for all graders |
- | `server/graders/crash_grader.py` | - | ⏳ | Task 1 grader (single crash scoring) |
- | `server/graders/cascade_grader.py` | - | ⏳ | Task 2 grader (cascading failure scoring) |
- | `server/graders/noise_grader.py` | - | ⏳ | Task 3 grader (silent degradation scoring) |
-
- ---
-
- ## 📁 scripts/ Directory (Utilities)
-
- | File | Lines | Status | Purpose |
- |------|-------|--------|---------|
- | `scripts/run_grader.py` | - | ⏳ | Manual grader testing CLI (Day 4) |
- | `scripts/validate_checklist.py` | - | ⏳ | Pre-submission validation script (Day 5) |
-
- ---
-
- ## 📁 Root-Level Support Files
-
- | File | Lines | Status | Purpose |
- |------|-------|--------|---------|
- | `baseline.py` | - | ⏳ | Baseline agent using GPT-4o-mini (Day 5) |
- | `.claude` | - | ✅ | Copilot session marker |
- | `.git/` | - | ✅ | Git repository |
- | `.gitignore` | - | ✅ | Git ignore rules |
-
- ---
-
- ## 📊 Summary Statistics
-
- ### Completed
- ```
- ✅ Core Files Written: 12 files
- ✅ Total Documentation: 1,900+ lines
- ✅ Code Lines: 500+ lines
- ✅ Tests: 200+ lines
- ✅ Examples: 200+ lines
- ```
-
- ### By Category
-
- **Configuration:** 3 files
- - openenv.yaml
- - requirements.txt
- - .gitignore
-
- **Documentation:** 8 files
- - README.md (main)
- - 7 supporting guides
-
- **Core Code:** 2 files
- - models.py (218 lines) ✨
- - app.py (101 lines) ✨
-
- **Tests:** 2 files
- - test_day1.py
- - test_all.bat
-
- **Infrastructure:** 2 files
- - Dockerfile
- - License
-
- **Folders Created:** 5
- - server/
- - server/scenarios/
- - server/graders/
- - scripts/
- - .git/
-
- ---
-
- ## 🎯 What Each File Does
-
- ### `openenv.yaml` (38 lines)
- **OpenEnv metadata specification**
- - Environment name and version
- - 3 task definitions (single_crash, cascading_failure, silent_degradation)
- - Action space (discrete, 7 action types)
- - Observation space (structured logs + state)
- - Reward range [-0.5, 1.0]
-
- ### `requirements.txt` (6 lines)
- **Python dependencies**
- - openenv-core>=0.2.2
- - fastapi>=0.104.0
- - uvicorn>=0.24.0
- - pydantic>=2.0.0
- - requests>=2.25.0
- - openai>=1.0.0
-
- ### `Dockerfile` (16 lines)
- **Container image definition**
- - Base: python:3.11-slim
- - Installs requirements
- - Copies source code
- - Exposes port 7860
- - Runs uvicorn server
-
- ### `server/models.py` (218 lines) ⭐ KEY FILE
- **5 Pydantic data models:**
-
- 1. **LogLine** (15 lines)
- - timestamp, level, service, request_id, message, latency_ms
-
- 2. **ServiceStatus** (10 lines)
- - name, status, error_rate, latency_p99_ms, last_updated
-
- 3. **TriageAction** (50 lines) ⭐ MOST IMPORTANT
- - action_type (7 types)
- - value (depends on type)
- - confidence (0.0–1.0)
- - reasoning (optional)
- - **is_valid() method** with full validation logic
-
- 4. **TriageObservation** (55 lines)
- - logs, system_state, incident_id, task_id, step_count, time_elapsed
- - active_alerts, reward, cumulative_score, done
- - last_action_feedback, invalid_action_error
-
- 5. **EpisodeState** (25 lines)
- - episode_id, task_id, step_count, max_steps, done, cumulative_score
- - actions_taken, correct_severity, correct_root_cause, correct_remediation
-
- ### `server/app.py` (101 lines) ⭐ KEY FILE
- **FastAPI application with 7 endpoints:**
-
- | Endpoint | Method | Status | Implementation |
- |----------|--------|--------|-----------------|
- | /health | GET | ✅ | Returns `{"status": "ok", ...}` |
- | /reset | POST | ⏳ | Placeholder (wire Day 2) |
- | /step | POST | ✅ | Validates action via `is_valid()`, returns 422 on error |
- | /state | GET | ⏳ | Placeholder (wire Day 2) |
- | /tasks | GET | ✅ | Returns all 3 tasks with full schemas |
- | /grader | POST | ⏳ | Placeholder (wire Day 4) |
- | /baseline | POST | ⏳ | Placeholder (wire Day 5) |
-
- **Key feature:** `/step` endpoint already validates actions!
- ```python
- valid, err = action.is_valid()
- if not valid:
-     return JSONResponse(status_code=422, content={"error": err})
- ```
-
- ### `README.md` (533 lines) ⭐ CRUCIAL
- **Comprehensive documentation covering:**
-
- 1. Overview & Motivation (why SRE triage matters)
- 2. Environment Description (microservice topology, log examples)
- 3. Action Space (7 action types with value table)
- 4. Observation Space (logs + state + rewards)
- 5. Reward Function (detailed scoring: +0.30–+0.35 for correct decisions)
- 6. Tasks & Graders (3 tasks with success criteria and expected scores)
- 7. Episode Boundaries (when start/end, reproducibility)
- 8. API Endpoints (all endpoints documented with examples)
- 9. Setup & Installation (clone, install, run locally)
- 10. Docker Usage (build and run instructions)
- 11. Hugging Face Spaces (deployment configuration)
- 12. Baseline Inference (template code for LLM baseline)
- 13. Baseline Scores (table of expected results, TBD)
- 14. OpenEnv Spec Compliance (checklist of requirements)
- 15. Pre-Submission Checklist (14 validation items)
- 16. Project Structure (complete folder map with descriptions)
-
- ### `test_day1.py` (147 lines)
- **Automated validation script that tests:**
- - Model imports (LogLine, ServiceStatus, TriageAction, TriageObservation, EpisodeState)
- - FastAPI app import
- - 11 TriageAction validation test cases
- - Pydantic model construction
- - Endpoint registration
-
- Run: `python test_day1.py`
-
- ### `TEST_ENDPOINTS.md` (172 lines)
- **Reference guide with 17 curl command examples:**
- - /health check
- - /tasks listing
- - 8 valid actions (classify, identify, remediate, escalate, resolve, ignore, request_logs)
- - 5 invalid actions (wrong severity, unknown service, bad format, etc.)
- - Expected responses for each
-
- ### `DAY1_STATUS.md` (336 lines)
- **Detailed status report explaining:**
- - What is LogTriageEnv
- - What has been built (file-by-file breakdown)
- - What each core file does
- - What's ready to test
- - What's remaining
- - Day 1 checklist status
- - How to test locally
- - Git commit template
-
- ### `COMPLETE_SUMMARY.md` (240 lines)
- **Quick-reference summary with:**
- - What you're building
- - Completion status table
- - Core models explanation
- - FastAPI endpoints
- - 3 tasks at a glance
- - Key achievements
- - How to proceed
-
- ### `README_EXPLAINED.md` (268 lines)
- **Detailed breakdown of README.md structure:**
- - Why README matters for hackathon
- - What each section explains
- - Key quotes and examples
- - Why this README stands out
- - How it becomes HF Space header
-
- ### `VISUAL_SUMMARY.md` (437 lines)
- **Visual reference guide with:**
- - ASCII diagrams of architecture
- - Data flow diagram
- - Task descriptions with visual examples
- - Pydantic models at a glance
- - Action validation examples (✅ vs 🚫)
- - File completion status table
- - Quick stats and numbers
- - Next steps to take
- - Day 2 todo list
-
- ### `FILE_INVENTORY.md` (This file)
- **Complete project file listing:**
- - All files with line counts and purposes
- - Status indicators (✅ ⏳)
- - Summary statistics
- - What each file does
-
- ---
-
- ## 📈 Progress Tracking
-
- ### Day 1 Complete
- ```
- ✅ openenv.yaml (spec)
- ✅ requirements.txt (dependencies)
- ✅ Dockerfile (containerization)
- ✅ server/models.py (data models)
- ✅ server/app.py (API endpoints)
- ✅ README.md (documentation)
- ✅ Folder structure (all directories created)
- ✅ Test suite (test_day1.py, test_all.bat)
- ✅ Documentation suite (5 supporting guides)
- ```
-
- ### Day 2 TODO
- ```
- ⏳ server/environment.py (core logic)
- ⏳ server/log_generator.py (log synthesis)
- ⏳ server/scenarios/single_crash.py (Task 1)
- ```
-
- ### Day 3-5 TODO
- ```
- ⏳ server/scenarios/cascading.py (Task 2)
- ⏳ server/scenarios/silent_degrade.py (Task 3)
- ⏳ server/graders/*.py (scoring logic)
- ⏳ baseline.py (LLM agent)
- ⏳ scripts/ (CLI tools)
- ```
-
- ---
-
- ## 🎓 How to Use This Inventory
-
- **When you need to:**
- - **Understand what's done:** Check the Status column (✅ = ready, ⏳ = pending)
- - **Find a file:** Use the File column
- - **Know the purpose:** Check the Purpose column
- - **See how long something is:** Check the Lines column
- - **Understand the big picture:** See Summary Statistics
- - **Know what to work on next:** Check Progress Tracking
-
- ---
-
- ## 📦 Total Project Size
-
- - **Core Code:** ~320 lines (models.py + app.py)
- - **Documentation:** ~1,900 lines (README + guides)
- - **Tests:** ~200 lines (validation + examples)
- - **Configuration:** ~60 lines (openenv.yaml + requirements)
- - **Automation:** ~100 lines (Dockerfile + batch)
-
- **Total (Day 1): ~2,600 lines of code, docs, and tests**
-
- ---
-
- ## ✅ Verification Checklist
-
- Use this to verify everything is present:
-
- - [ ] openenv.yaml exists and has 3 tasks
- - [ ] requirements.txt has all 6 dependencies
- - [ ] Dockerfile exists and is valid
- - [ ] server/models.py exists with 5 classes
- - [ ] server/app.py exists with 7 endpoints
- - [ ] README.md has all 16 sections
- - [ ] test_day1.py exists
- - [ ] test_all.bat exists
- - [ ] TEST_ENDPOINTS.md exists with 17 examples
- - [ ] DAY1_STATUS.md exists
- - [ ] COMPLETE_SUMMARY.md exists
- - [ ] README_EXPLAINED.md exists
- - [ ] VISUAL_SUMMARY.md exists
- - [ ] FILE_INVENTORY.md exists (this file)
- - [ ] All folders created (server/, scripts/, scenarios/, graders/)
-
- ---
-
- **Generated:** 2026-03-26
- **Project:** LogTriageEnv — Meta × PyTorch Hackathon
- **Status:** Day 1 Complete (95% ready, just needs testing & push)
 
 
 
 
 
 
README.md CHANGED
@@ -347,8 +347,8 @@ uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
347
  ### Run baseline inference
348
 
349
  ```bash
350
- export OPENAI_API_KEY=your_key_here
351
- python baseline.py
352
  ```
353
 
354
  ### Validate all 3 tasks manually
@@ -377,7 +377,7 @@ curl http://localhost:7860/health
377
  curl -X POST http://localhost:7860/reset
378
 
379
  # Run baseline inside container
380
- docker run -e OPENAI_API_KEY=your_key logtriage-env python baseline.py
381
  ```
382
 
383
  ---
@@ -395,7 +395,7 @@ The Space uses a Docker SDK with the following configuration:
395
  title: LogTriageEnv
396
  emoji: 🚨
397
  colorFrom: red
398
- colorTo: orange
399
  sdk: docker
400
  pinned: false
401
  tags:
@@ -409,10 +409,10 @@ tags:
409
 
410
  ## 12. Baseline Inference Script
411
 
412
- `baseline.py` uses the OpenAI API client to run `gpt-4o-mini` as a zero-shot agent against all 3 tasks and reports scores.
413
 
414
  ```python
415
- # baseline.py (structure)
416
  import os
417
  from openai import OpenAI
418
  import requests
@@ -457,19 +457,24 @@ if __name__ == "__main__":
457
 
458
  ## 13. Baseline Scores
459
 
460
- *(To be filled after implementation and baseline runs)*
461
 
462
- | Task | Difficulty | Baseline Score (gpt-4o-mini) |
463
  |---|---|---|
464
- | Single Service Crash | Easy | TBD |
465
- | Cascading Failure | Medium | TBD |
466
- | Silent Degradation | Hard | TBD |
467
- | **Average** | | **TBD** |
468
 
469
  Expected ranges based on design:
470
- - Single crash: 0.75–0.85
471
- - Cascading failure: 0.45–0.60
472
- - Silent degradation: 0.20–0.40
 
 
 
 
 
473
 
474
  ---
475
 
@@ -505,7 +510,7 @@ Expected ranges based on design:
505
  - [ ] `POST /grader` returns score in [0.0, 1.0]
506
  - [ ] `POST /baseline` completes and returns scores for all 3 tasks
507
  - [ ] HF Space URL responds to ping with 200
508
- - [ ] Baseline script runs end-to-end with `OPENAI_API_KEY` set
509
  - [ ] All 3 graders return varying scores (not constant)
510
  - [ ] README includes all required sections
511
  - [ ] `requirements.txt` is complete and pinned
@@ -520,7 +525,7 @@ logtriage-env/
520
  ├── openenv.yaml # OpenEnv metadata
521
  ├── Dockerfile # Container definition
522
  ├── requirements.txt # Top-level deps
523
- ├── baseline.py # Baseline inference script
524
 
525
  ├── server/
526
  │ ├── __init__.py
 
347
  ### Run baseline inference
348
 
349
  ```bash
350
+ export HF_TOKEN=your_key_here
351
+ python inference.py
352
  ```
353
 
354
  ### Validate all 3 tasks manually
 
377
  curl -X POST http://localhost:7860/reset
378
 
379
  # Run baseline inside container
380
+ docker run -e HF_TOKEN=your_key -e API_BASE_URL=https://api.groq.com/openai/v1 -e MODEL_NAME=llama-3.3-70b-versatile logtriage-env python inference.py
381
  ```
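The container invocation above supplies three environment variables. A minimal sketch of how `inference.py` could resolve them — the helper name is illustrative, and the fallback defaults simply mirror the example values in the command:

```python
import os

# Hypothetical helper, not the actual inference.py code: the variable names
# come from the `docker run` command above; the defaults mirror its example
# values (Groq endpoint, llama-3.3-70b-versatile).
def load_provider_config(env=None):
    env = os.environ if env is None else env
    return {
        "api_key": env.get("HF_TOKEN", ""),
        "base_url": env.get("API_BASE_URL", "https://api.groq.com/openai/v1"),
        "model": env.get("MODEL_NAME", "llama-3.3-70b-versatile"),
    }
```

The resulting dict can then be handed to any OpenAI-compatible client constructor (`api_key=`, `base_url=`).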
382
 
383
  ---
 
395
  title: LogTriageEnv
396
  emoji: 🚨
397
  colorFrom: red
398
+ colorTo: red
399
  sdk: docker
400
  pinned: false
401
  tags:
 
409
 
410
  ## 12. Baseline Inference Script
411
 
412
+ `inference.py` uses an OpenAI-compatible client with configurable provider settings to run `llama-3.3-70b-versatile` as a zero-shot agent against all 3 tasks and reports scores.
413
 
414
  ```python
415
+ # inference.py (structure)
416
  import os
417
  from openai import OpenAI
418
  import requests
 
457
 
458
  ## 13. Baseline Scores
459
 
460
+ Scores produced by `inference.py` using `llama-3.3-70b-versatile` via Groq API (`seed=42`):
461
 
462
+ | Task | Difficulty | Score |
463
  |---|---|---|
464
+ | Single Service Crash | Easy | 1.0000 |
465
+ | Cascading Failure | Medium | 0.6500 |
466
+ | Silent Degradation | Hard | 0.0000 |
467
+ | **Average** | | **0.5500** |
468
 
469
  Expected ranges based on design:
470
+ - Single crash: 0.75–0.85 → **Exceeded (1.0000)**
471
+ - Cascading failure: 0.45–0.60 → **Exceeded (0.6500)**
472
+ - Silent degradation: 0.20–0.40 → **Below range (0.0000 — see note)**
473
+
474
+ > **Note:** LLM-based scoring varies across runs due to non-deterministic model behavior.
475
+ > The Silent Degradation task is hardest — it requires distinguishing signal from 60% noise
476
+ > and making a nuanced P2 judgment (not an outage yet). Scores on this task can range
477
+ > from 0.0 to 0.55 depending on the model's log parsing on that specific run.
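The 0.5500 average in the table above is the plain mean of the three task scores:

```python
# Scores from the table above
scores = {
    "single_crash": 1.0,        # Easy
    "cascading_failure": 0.65,  # Medium
    "silent_degradation": 0.0,  # Hard
}
average = sum(scores.values()) / len(scores)  # 0.55, reported as 0.5500
```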
478
 
479
  ---
480
 
 
510
  - [ ] `POST /grader` returns score in [0.0, 1.0]
511
  - [ ] `POST /baseline` completes and returns scores for all 3 tasks
512
  - [ ] HF Space URL responds to ping with 200
513
+ - [ ] Baseline script runs end-to-end with `HF_TOKEN` set
514
  - [ ] All 3 graders return varying scores (not constant)
515
  - [ ] README includes all required sections
516
  - [ ] `requirements.txt` is complete and pinned
 
525
  ├── openenv.yaml # OpenEnv metadata
526
  ├── Dockerfile # Container definition
527
  ├── requirements.txt # Top-level deps
528
+ ├── inference.py # Baseline inference script
529
 
530
  ├── server/
531
  │ ├── __init__.py
START_HERE_DAY2.md DELETED
@@ -1,246 +0,0 @@
1
- # 📖 START HERE — Days 1-2 Complete Guide
2
-
3
- **Status:** ✅ **Days 1-2 COMPLETE — Task 1 Fully Playable**
4
- **Overall Progress:** 40% (2 of 5 days)
5
- **Last Updated:** March 27, 2026
6
-
7
- ---
8
-
9
- ## 🎯 Where to Start?
10
-
11
- ### If you have **2 minutes**:
12
- 👉 Read **STATUS.md** ← Quick status + which docs to read
13
-
14
- ### If you have **5 minutes**:
15
- 👉 Read **EXECUTIVE_SUMMARY.md** ← What's done, high-level overview
16
-
17
- ### If you have **10 minutes**:
18
- 👉 Read **DAYS_1-2_SUMMARY_FINAL.md** ← Clean summary of Days 1-2
19
-
20
- ### If you want **full details**:
21
- 👉 Read **DAYS_1-2_SUMMARY.md** ← Comprehensive Day 2 breakdown + examples
22
-
23
- ---
24
-
25
- ## 📁 Documentation by Purpose
26
-
27
- ### 🚀 **Quick Overview (2-5 min)**
28
- | File | Purpose | Read If |
29
- |------|---------|---------|
30
- | **STATUS.md** | Current status + doc guide | You want a quick check |
31
- | **EXECUTIVE_SUMMARY.md** | High-level completion status | You want an overview |
32
- | **DAYS_1-2_SUMMARY_FINAL.md** | Days 1-2 summary | You want a clean summary |
33
-
34
- ### 📚 **Detailed Technical (10-20 min)**
35
- | File | Purpose | Read If |
36
- |------|---------|---------|
37
- | **DAYS_1-2_SUMMARY.md** | Full Day 2 breakdown | You want to understand architecture |
38
- | **DAY1_STATUS.md** | Detailed Day 1 status | You want Day 1 details |
39
- | **DAY2_STATUS.md** | Detailed Day 2 status | You want Day 2 details |
40
- | **README.md** | Official spec (533 lines) | You want the complete reference |
41
-
42
- ### 🔧 **How-To Guides (5-15 min)**
43
- | File | Purpose | Read If |
44
- |------|---------|---------|
45
- | **TEST_ENDPOINTS.md** | 17 curl examples (all working!) | You want to test endpoints |
46
- | **VISUAL_SUMMARY.md** | Diagrams + architecture | You want visual understanding |
47
- | **README_EXPLAINED.md** | Line-by-line README breakdown | You want to understand README |
48
- | **FILE_INVENTORY.md** | Complete file listing | You want to know where everything is |
49
-
50
- ### 📋 **Reference (5-10 min)**
51
- | File | Purpose | Read If |
52
- |------|---------|---------|
53
- | **COMPLETE_SUMMARY.md** | Feature checklist | You want to see all features |
54
- | **WHAT_HAS_BEEN_DONE.md** | Completion summary | You want a summary |
55
- | **FINAL_CHECKLIST.md** | Pre-push verification | You want a checklist |
56
- | **ANALYSIS_SUMMARY.md** | Technical analysis | You want deep analysis |
57
-
58
- ---
59
-
60
- ## ✅ What's Done (Days 1-2)
61
-
62
- ### **Day 1: Skeleton (100% Complete)**
63
- ```
64
- ✅ Models (5 Pydantic classes, 218 lines)
65
- ✅ API endpoints (7 registered, 3+ wired)
66
- ✅ Configuration (openenv.yaml, requirements.txt)
67
- ✅ Docker setup
68
- ✅ Comprehensive documentation
69
- ```
70
-
71
- ### **Day 2: Environment (100% Complete)**
72
- ```
73
- ✅ LogTriageEnvironment class (250+ lines)
74
- ✅ Synthetic log generator (400+ lines)
75
- ✅ Task 1 scenario (150+ lines)
76
- ✅ Endpoints wired to real logic (/reset, /step, /state)
77
- ✅ Full Task 1 playable end-to-end
78
- ```
79
-
80
- ### **Total: 40% of Project**
81
- - ✅ Task 1 (Easy): PLAYABLE
82
- - ⏳ Task 2 (Medium): Not yet
83
- - ⏳ Task 3 (Hard): Not yet
84
-
85
- ---
86
-
87
- ## 🎮 Try It Now
88
-
89
- ### 1. Start Server
90
- ```bash
91
- python -m uvicorn server.app:app --port 7860
92
- ```
93
-
94
- ### 2. Run Full Episode (Copy-Paste From TEST_ENDPOINTS.md)
95
- ```bash
96
- # Reset (get initial observation)
97
- curl -X POST "http://localhost:7860/reset?task=single_crash&seed=42"
98
-
99
- # Step 1: Classify severity
100
- curl -X POST "http://localhost:7860/step" \
101
- -H "Content-Type: application/json" \
102
- -d '{"action_type":"classify_severity","value":"P1","confidence":0.95}'
103
-
104
- # Step 2: Identify root cause
105
- curl -X POST "http://localhost:7860/step" \
106
- -H "Content-Type: application/json" \
107
- -d '{"action_type":"identify_root_cause","value":"payment-service","confidence":0.9}'
108
-
109
- # Step 3: Remediate
110
- curl -X POST "http://localhost:7860/step" \
111
- -H "Content-Type: application/json" \
112
- -d '{"action_type":"remediate","value":"restart:payment-service","confidence":0.95}'
113
-
114
- # Step 4: Resolve
115
- curl -X POST "http://localhost:7860/step" \
116
- -H "Content-Type: application/json" \
117
- -d '{"action_type":"resolve","value":"resolved"}'
118
- ```
119
-
120
- ### 3. Result
121
- ✅ Perfect episode score: **1.0**
122
- ✅ Rewards: 0.30 + 0.35 + 0.25 + 0.10 = 1.0
123
-
124
- ---
125
-
126
- ## 📊 Progress Status
127
-
128
- ```
129
- Day 1: ✅✅✅✅✅ (100% - Skeleton)
130
- Day 2: ✅✅✅✅✅ (100% - Environment)
131
- Day 3: ⏳⏳⏳⏳⏳ (0% - Scenarios 2 & 3)
132
- Day 4: ⏳⏳⏳⏳⏳ (0% - Graders)
133
- Day 5: ⏳⏳⏳⏳⏳ (0% - Baseline + Deploy)
134
-
135
- OVERALL: ▓▓░░░ 40% Complete
136
- ```
137
-
138
- ---
139
-
140
- ## 🎯 Key Files (Know These!)
141
-
142
- ### **Core Code**
143
- - `server/models.py` — 5 Pydantic classes
144
- - `server/app.py` — FastAPI endpoints
145
- - `server/environment.py` — Episode logic ⭐ NEW Day 2
146
- - `server/log_generator.py` — Synthetic logs ⭐ NEW Day 2
147
- - `server/scenarios/single_crash.py` — Task 1 ⭐ NEW Day 2
148
-
149
- ### **Configuration**
150
- - `openenv.yaml` — Environment spec
151
- - `requirements.txt` — Dependencies
152
- - `Dockerfile` — Container
153
-
154
- ### **Documentation** (Choose your favorite!)
155
- - **STATUS.md** ← Start here
156
- - **EXECUTIVE_SUMMARY.md** ← Overview
157
- - **DAYS_1-2_SUMMARY.md** ← Technical details
158
- - **TEST_ENDPOINTS.md** ← Copy-paste curl commands
159
-
160
- ---
161
-
162
- ## 💡 Key Concepts
163
-
164
- ### **Episode Flow**
165
- ```
166
- Agent → /reset → Observation (initial logs + state)
167
- Agent → /step (action) → Observation + reward + feedback
168
- ...repeat...
169
- Agent → /step (resolve) → done=true, episode complete
170
- ```
171
-
172
- ### **Reward System**
173
- - Severity classification: +0.30
174
- - Root cause identification: +0.35
175
- - Remediation action: +0.25
176
- - Speed bonus: +0.10
177
- - **Max score: 1.0**
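The four weights above sum to the 1.0 maximum; a sketch of the tally with illustrative names (the weights themselves are the documented ones, the function and dict names are not from the codebase):

```python
# Illustrative names; the weights are the ones documented above.
REWARD_WEIGHTS = {
    "classify_severity": 0.30,
    "identify_root_cause": 0.35,
    "remediate": 0.25,
    "speed_bonus": 0.10,
}

def episode_score(correct_steps):
    """Sum the weight of every correctly completed step, capped at 1.0."""
    return min(sum(REWARD_WEIGHTS[s] for s in correct_steps), 1.0)
```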
178
-
179
- ### **Log Generation**
180
- - 7 microservices
181
- - Noise templates (realistic but irrelevant)
182
- - Signal templates (error patterns)
183
- - Step-by-step escalation
184
- - Deterministic (reproducible with seed)
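The seeded determinism noted above can be sketched as follows — the template strings and function name are placeholders, not the real `log_generator` module:

```python
import random

# Placeholder templates; the real module has richer noise/signal sets.
NOISE_TEMPLATES = ["GC pause {} ms", "cache hit ratio {}%", "heartbeat ok ({})"]

def generate_batch(seed, n=5):
    # A Random instance keyed by the episode seed means every
    # reset with the same seed replays an identical log stream.
    rng = random.Random(seed)
    return [
        rng.choice(NOISE_TEMPLATES).format(rng.randint(1, 99))
        for _ in range(n)
    ]
```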
185
-
186
- ---
187
-
188
- ## ❓ FAQ
189
-
190
- **Q: What's the difference between Day 1 and Day 2?**
191
- A: Day 1 = skeleton (models, API). Day 2 = logic (environment, logs, scenarios).
192
-
193
- **Q: Can I play Task 1 right now?**
194
- A: Yes! Run server, use curl commands from TEST_ENDPOINTS.md.
195
-
196
- **Q: What's the next step?**
197
- A: Day 3 = build Task 2 & Task 3 scenarios.
198
-
199
- **Q: Where's the full reference?**
200
- A: README.md (533 lines, complete spec).
201
-
202
- **Q: I just want to understand fast. Where do I start?**
203
- A: Read STATUS.md (2 min) → DAYS_1-2_SUMMARY_FINAL.md (5 min).
204
-
205
- **Q: I want the technical details.**
206
- A: Read DAYS_1-2_SUMMARY.md (full architecture + examples).
207
-
208
- ---
209
-
210
- ## 📞 Document Map
211
-
212
- ```
213
- Need quick status? → STATUS.md
214
- Need executive overview? → EXECUTIVE_SUMMARY.md
215
- Need clean summary? → DAYS_1-2_SUMMARY_FINAL.md
216
- Need technical details? → DAYS_1-2_SUMMARY.md
217
- Need Day 1 specifics? → DAY1_STATUS.md
218
- Need Day 2 specifics? → DAY2_STATUS.md
219
- Need to test endpoints? → TEST_ENDPOINTS.md
220
- Need to understand design? → VISUAL_SUMMARY.md
221
- Need full reference? → README.md
222
- Need file locations? → FILE_INVENTORY.md
223
- Need architecture diagram? → VISUAL_SUMMARY.md
224
- Need line-by-line README? → README_EXPLAINED.md
225
- ```
226
-
227
- ---
228
-
229
- ## ✨ TL;DR
230
-
231
- **Status:** ✅ Days 1-2 done (40% project complete)
232
-
233
- **What works:** Task 1 fully playable
234
-
235
- **How to test:** Run server, curl commands from TEST_ENDPOINTS.md
236
-
237
- **Next:** Build Task 2 & 3 scenarios (Day 3)
238
-
239
- **Read first:** STATUS.md or EXECUTIVE_SUMMARY.md
240
-
241
- ---
242
-
243
- Generated: March 27, 2026
244
- Project: LogTriageEnv (Meta × PyTorch Hackathon)
245
- Deadline: April 7, 2026, 11:59 PM IST
246
- Status: **ON TRACK** ✅
 
 
 
 
 
 
STATUS.md DELETED
@@ -1,260 +0,0 @@
1
- # 🎯 CURRENT STATUS — LogTriageEnv Days 1-3
2
-
3
- **Last Updated:** March 27, 2026
4
- **Status:** ✅ **Days 1-3 COMPLETE (100% of Days 1-3, 60% of total project)**
5
- **Overall Progress:** ▓▓▓░░ (60%)
6
-
7
- ---
8
-
9
- ## 📊 Quick Status
10
-
11
- | Component | Status | Details |
12
- |-----------|--------|---------|
13
- | **Day 1 Work** | ✅ 100% | Models, API scaffold, config, docs |
14
- | **Day 2 Work** | ✅ 100% | Environment, log gen, Task 1 scenario |
15
- | **Day 3 Work** | ✅ 100% | Tasks 2 & 3 scenarios + wiring |
16
- | **Task 1 (Easy)** | ✅ 100% | Single crash - fully playable |
17
- | **Task 2 (Medium)** | ✅ 100% | Cascading failures - fully playable |
18
- | **Task 3 (Hard)** | ✅ 100% | Silent degradation - fully playable |
19
- | **Graders** | ⏳ 0% | Day 4 - not started |
20
- | **Baseline Agent** | ⏳ 0% | Day 5 - not started |
21
-
22
- ---
23
-
24
- ## 📁 Documentation Guide
25
-
26
- ### 📖 START HERE
27
- **For quick understanding of what's been done:**
28
-
29
- 1. **EXECUTIVE_SUMMARY.md** (3 min read)
30
- - High-level status
31
- - What's complete
32
- - By-the-numbers
33
-
34
- 2. **DAYS_1-2_SUMMARY.md** (10 min read)
35
- - Detailed Day 2 breakdown
36
- - Architecture evolution
37
- - Full episode example
38
-
39
- 3. **DAYS_1-2_SUMMARY_FINAL.md** (5 min read)
40
- - Clean summary
41
- - Playable tasks
42
- - Progress tracking
43
-
44
- ---
45
-
46
- ### 🔍 DETAILED REFERENCES
47
-
48
- | File | Purpose | Read If |
49
- |------|---------|---------|
50
- | **DAY3_STATUS.md** | Day 3 detailed status | Understanding Day 3 (cascading, silent degrade) |
51
- | **README.md** | Official spec | Understanding what the project is |
52
- | **README_EXPLAINED.md** | Breakdown of README | Line-by-line understanding |
53
- | **COMPLETE_SUMMARY.md** | Feature overview | Architecture and features |
54
- | **FILE_INVENTORY.md** | File listing | Where everything is |
55
- | **VISUAL_SUMMARY.md** | Architecture diagrams | Visual understanding |
56
- | **TEST_ENDPOINTS.md** | 17 curl examples | Testing endpoints |
57
- | **START_HERE.md** | Navigation guide | Which docs to read |
58
-
59
- ---
60
-
61
- ### 📋 PROGRESS TRACKING
62
-
63
- | File | Purpose |
64
- |------|---------|
65
- | **ANALYSIS_SUMMARY.md** | Technical analysis |
66
- | **WHAT_HAS_BEEN_DONE.md** | Completion summary |
67
- | **FINAL_CHECKLIST.md** | Pre-push verification |
68
-
69
- ---
70
-
71
- ## ✅ What's Actually Done
72
-
73
- ### Core Code (1,100+ lines)
74
- ```
75
- ✅ server/models.py (218 lines)
76
- - 5 Pydantic classes (all typed)
77
- - Full validation
78
-
79
- ✅ server/app.py (101+ lines)
80
- - 7 FastAPI endpoints
81
- - 3 wired to real logic
82
- - 4 still TODO
83
-
84
- ✅ server/environment.py (250+ lines)
85
- - LogTriageEnvironment class
86
- - Episode management
87
- - Reward calculation
88
- - State tracking
89
-
90
- ✅ server/log_generator.py (400+ lines)
91
- - Synthetic log generation
92
- - Noise/signal templates
93
- - Deterministic with seeds
94
- - 7-service cluster
95
-
96
- ✅ server/scenarios/single_crash.py (150+ lines)
97
- - Task 1: Single service crash
98
- - Ground truth definition
99
- - Error signal templates
100
- - Step-by-step scenario
101
- ```
102
-
103
- ### Configuration (40+ lines)
104
- ```
105
- ✅ openenv.yaml - Environment specification
106
- ✅ requirements.txt - Dependencies
107
- ✅ Dockerfile - Containerization
108
- ```
109
-
110
- ### Documentation (1,900+ lines)
111
- ```
112
- ✅ README.md (533 lines)
113
- ✅ EXECUTIVE_SUMMARY.md
114
- ✅ DAY1_STATUS.md
115
- ✅ DAY2_STATUS.md
116
- ✅ DAYS_1-2_SUMMARY.md
117
- ✅ + 8 more guides
118
- ```
119
-
120
- ---
121
-
122
- ## 🎮 What's Playable Now
123
-
124
- ### Task 1: Single Service Crash ✅
125
-
126
- **Difficulty:** Easy
127
- **Episode Length:** 5-8 steps
128
- **Scenario:** payment-service crashes, agent must triage
129
-
130
- **Play it:**
131
- ```bash
132
- # Terminal 1
133
- python -m uvicorn server.app:app --port 7860
134
-
135
- # Terminal 2
136
- # (See TEST_ENDPOINTS.md for full curl examples)
137
- curl -X POST "http://localhost:7860/reset?task=single_crash&seed=42"
138
- curl -X POST "http://localhost:7860/step" \
139
- -H "Content-Type: application/json" \
140
- -d '{"action_type":"classify_severity","value":"P1","confidence":0.95}'
141
- # ... and so on
142
- ```
143
-
144
- **Expected Output:**
145
- ```
146
- Step 0: Observation with crash logs
147
- Step 1: Reward 0.30 (severity correct)
148
- Step 2: Reward 0.35 (root cause correct)
149
- Step 3: Reward 0.25 (remediation correct)
150
- Step 4: Reward 0.10 (speed bonus)
151
- Final: Score 1.0 ✅ (perfect play)
152
- ```
153
-
154
- ---
155
-
156
- ## 📈 Progress Timeline
157
-
158
- ```
159
- Day 1 ✅ (Complete)
160
- ├─ Models & validation
161
- ├─ FastAPI scaffold
162
- ├─ Config & Docker
163
- └─ Comprehensive docs
164
-
165
- Day 2 ✅ (Complete)
166
- ├─ Environment class
167
- ├─ Log generation
168
- ├─ Task 1 scenario
169
- └─ Endpoints wired (3/7)
170
-
171
- Day 3 ✅ (Complete)
172
- ├─ Task 2 scenario (cascading)
173
- ├─ Task 3 scenario (silent degrade)
174
- ├─ All tasks wired
175
- └─ Full testing ready
176
-
177
- Day 4 ⏳ (Next)
178
- ├─ Grader logic
179
- └─ Evaluation
180
-
181
- Day 5 ⏳ (TBD)
182
- ├─ Baseline agent
183
- └─ Deployment
184
-
185
- 60% COMPLETE ✅
186
- ```
187
-
188
- ---
189
-
190
- ## 🎯 Commands to Remember
191
-
192
- ### Run the Server
193
- ```bash
194
- python -m uvicorn server.app:app --port 7860
195
- ```
196
-
197
- ### Test Task 1
198
- ```bash
199
- # See TEST_ENDPOINTS.md for 17 different curl examples
200
- # Or use START_HERE.md for navigation
201
- ```
202
-
203
- ### Check Completion
204
- - **Day 1:** ✅ 100% (see DAY1_STATUS.md)
205
- - **Day 2:** ✅ 100% (see DAY2_STATUS.md)
206
- - **Day 3:** ✅ 100% (see DAY3_STATUS.md)
207
-
208
- ---
209
-
210
- ## 💡 Key Points
211
-
212
- ✅ **What's Working:**
213
- - Full environment logic (all 3 tasks)
214
- - Log generation (3 scenarios with proper noise)
215
- - Reward calculation (per-task ground truth)
216
- - All 3 tasks playable end-to-end
217
- - Clean architecture
218
-
219
- ⏳ **What's Next:**
220
- - Grader implementation (Day 4)
221
- - Baseline agent (Day 5)
222
-
223
- ❌ **Not Needed Yet:**
224
- - Deployment (Day 5)
225
- - LLM integration (Day 5)
226
-
227
- ---
228
-
229
- ## 📞 Quick Reference
230
-
231
- **Questions?**
232
- - What's the project? → **README.md**
233
- - What was built? → **DAYS_1-2_SUMMARY.md**
234
- - How do I test? → **TEST_ENDPOINTS.md**
235
- - Where's the code? → **FILE_INVENTORY.md**
236
- - How does it work? → **VISUAL_SUMMARY.md**
237
- - Line-by-line? → **README_EXPLAINED.md**
238
-
239
- ---
240
-
241
- ## ✨ Summary
242
-
243
- **Status: ✅ Days 1-3 Complete, All 3 Tasks Playable**
244
-
245
- - ✅ Environment fully functional with all 3 scenarios
246
- - ✅ Log generation working (with noise injection)
247
- - ✅ All 3 tasks playable (easy, medium, hard)
248
- - ✅ All endpoints wired (7/7)
249
- - ✅ All documentation updated
250
-
251
- **Next:** Build Day 4 grader logic
252
-
253
- **Overall Progress:** 60% ✅ (3 of 5 days complete)
254
-
255
- ---
256
-
257
- Generated: March 27, 2026
258
- Project: LogTriageEnv (Meta × PyTorch Hackathon)
259
- Deadline: April 7, 2026, 11:59 PM IST
260
- Status: **ON TRACK** ✅ (60% complete — all 3 tasks playable)
 
 
 
 
 
 
TEST_ENDPOINTS.md DELETED
@@ -1,302 +0,0 @@
1
- # Day 1 Testing Guide — Curl Commands
2
-
3
- ## Prerequisites
4
- ```bash
5
- pip install -r requirements.txt
6
- python -m uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
7
- ```
8
-
9
- Leave the server running and open a new terminal for these tests.
10
-
11
- ---
12
-
13
- ## Test 1: Health Check
14
- ```bash
15
- curl http://localhost:7860/health
16
- ```
17
-
18
- **Expected Response:**
19
- ```json
20
- {
21
- "status": "ok",
22
- "environment": "logtriage-env",
23
- "version": "1.0.0"
24
- }
25
- ```
26
-
27
- ---
28
-
29
- ## Test 2: Get All Tasks
30
- ```bash
31
- curl http://localhost:7860/tasks
32
- ```
33
-
34
- **Expected Response:** JSON with 3 tasks (single_crash, cascading_failure, silent_degradation) including action schemas.
35
-
36
- ---
37
-
38
- ## Test 3: Valid Step Action (Classify Severity)
39
- ```bash
40
- curl -X POST http://localhost:7860/step \
41
- -H "Content-Type: application/json" \
42
- -d '{
43
- "action_type": "classify_severity",
44
- "value": "P1",
45
- "confidence": 0.95,
46
- "reasoning": "High error rate detected"
47
- }'
48
- ```
49
-
50
- **Expected Response:** 200 OK
51
- ```json
52
- {
53
- "message": "step endpoint placeholder",
54
- "action_received": {
55
- "action_type": "classify_severity",
56
- "value": "P1",
57
- "confidence": 0.95,
58
- "reasoning": "High error rate detected"
59
- }
60
- }
61
- ```
62
-
63
- ---
64
-
65
- ## Test 4: Valid Step Action (Root Cause)
66
- ```bash
67
- curl -X POST http://localhost:7860/step \
68
- -H "Content-Type: application/json" \
69
- -d '{
70
- "action_type": "identify_root_cause",
71
- "value": "user-db",
72
- "confidence": 0.8
73
- }'
74
- ```
75
-
76
- **Expected Response:** 200 OK with action received
77
-
78
- ---
79
-
80
- ## Test 5: Valid Step Action (Remediate)
81
- ```bash
82
- curl -X POST http://localhost:7860/step \
83
- -H "Content-Type: application/json" \
84
- -d '{
85
- "action_type": "remediate",
86
- "value": "restart:payment-service",
87
- "confidence": 0.9
88
- }'
89
- ```
90
-
91
- **Expected Response:** 200 OK with action received
92
-
93
- ---
94
-
95
- ## Test 6: Valid Step Action (Escalate)
96
- ```bash
97
- curl -X POST http://localhost:7860/step \
98
- -H "Content-Type: application/json" \
99
- -d '{
100
- "action_type": "escalate",
101
- "value": "dba-team",
102
- "confidence": 0.85
103
- }'
104
- ```
105
-
106
- **Expected Response:** 200 OK with action received
107
-
108
- ---
109
-
110
- ## Test 7: Valid Step Action (Resolve)
111
- ```bash
112
- curl -X POST http://localhost:7860/step \
113
- -H "Content-Type: application/json" \
114
- -d '{
115
- "action_type": "resolve",
116
- "value": "resolved"
117
- }'
118
- ```
119
-
120
- **Expected Response:** 200 OK with action received
121
-
122
- ---
123
-
124
- ## Test 8: Valid Step Action (Ignore Noise)
125
- ```bash
126
- curl -X POST http://localhost:7860/step \
127
- -H "Content-Type: application/json" \
128
- -d '{
129
- "action_type": "ignore",
130
- "value": "noise"
131
- }'
132
- ```
133
-
134
- **Expected Response:** 200 OK with action received
135
-
136
- ---
137
-
138
- ## Test 9: Valid Step Action (Request More Logs)
139
- ```bash
140
- curl -X POST http://localhost:7860/step \
141
- -H "Content-Type: application/json" \
142
- -d '{
143
- "action_type": "request_more_logs",
144
- "value": "all",
145
- "confidence": 0.5
146
- }'
147
- ```
148
-
149
- **Expected Response:** 200 OK with action received
150
-
151
- ---
152
-
153
- ## Test 10: INVALID Action - Wrong Severity
154
- ```bash
155
- curl -X POST http://localhost:7860/step \
156
- -H "Content-Type: application/json" \
157
- -d '{
158
- "action_type": "classify_severity",
159
- "value": "P5"
160
- }'
161
- ```
162
-
163
- **Expected Response:** 422 Unprocessable Entity
164
- ```json
165
- {
166
- "error": "classify_severity value must be one of {'P1', 'P2', 'P3'}"
167
- }
168
- ```
169
-
170
- ---
171
-
172
- ## Test 11: INVALID Action - Unknown Service
173
- ```bash
174
- curl -X POST http://localhost:7860/step \
175
- -H "Content-Type: application/json" \
176
- -d '{
177
- "action_type": "identify_root_cause",
178
- "value": "unknown-service"
179
- }'
180
- ```
181
-
182
- **Expected Response:** 422 Unprocessable Entity
183
- ```json
184
- {
185
- "error": "identify_root_cause value must be one of {...}"
186
- }
187
- ```
188
-
189
- ---
190
-
191
- ## Test 12: INVALID Action - Bad Remediate Format
192
- ```bash
193
- curl -X POST http://localhost:7860/step \
194
- -H "Content-Type: application/json" \
195
- -d '{
196
- "action_type": "remediate",
197
- "value": "invalid:payment-service"
198
- }'
199
- ```
200
-
201
- **Expected Response:** 422 Unprocessable Entity
202
- ```json
203
- {
204
- "error": "remediate prefix must be one of {...}"
205
- }
206
- ```
207
-
208
- ---
209
-
210
- ## Test 13: INVALID Action - Bad Escalate Team
211
- ```bash
212
- curl -X POST http://localhost:7860/step \
213
- -H "Content-Type: application/json" \
214
- -d '{
215
- "action_type": "escalate",
216
- "value": "marketing-team"
217
- }'
218
- ```
219
-
220
- **Expected Response:** 422 Unprocessable Entity
221
- ```json
222
- {
223
- "error": "escalate value must be one of {...}"
224
- }
225
- ```
226
-
227
- ---
228
-
229
- ## Test 14: Reset Endpoint
230
- ```bash
231
- curl -X POST http://localhost:7860/reset \
232
- -H "Content-Type: application/json" \
233
- -d '{
234
- "task": "single_crash"
235
- }'
236
- ```
237
-
238
- **Expected Response:** 200 OK
239
- ```json
240
- {
241
- "message": "reset endpoint placeholder",
242
- "task": "single_crash"
243
- }
244
- ```
245
-
246
- ---
247
-
248
- ## Test 15: State Endpoint
249
- ```bash
250
- curl http://localhost:7860/state
251
- ```
252
-
253
- **Expected Response:** 200 OK
254
- ```json
255
- {
256
- "message": "state endpoint placeholder"
257
- }
258
- ```
259
-
260
- ---
261
-
262
- ## Test 16: Grader Endpoint
263
- ```bash
264
- curl -X POST http://localhost:7860/grader
265
- ```
266
-
267
- **Expected Response:** 200 OK
268
- ```json
269
- {
270
- "message": "grader endpoint placeholder",
271
- "score": 0.0
272
- }
273
- ```
274
-
275
- ---
276
-
277
- ## Test 17: Baseline Endpoint
278
- ```bash
279
- curl -X POST http://localhost:7860/baseline
280
- ```
281
-
282
- **Expected Response:** 200 OK
283
- ```json
284
- {
285
- "message": "baseline endpoint placeholder"
286
- }
287
- ```
288
-
289
- ---
290
-
291
- ## Summary
292
-
293
- **Tests 1-9, 14-17:** Should all return 200 OK ✅
294
- **Tests 10-13:** Should all return 422 with error message ✅
295
-
296
- If all pass, your Day 1 is complete! Push to GitHub:
297
-
298
- ```bash
299
- git add .
300
- git commit -m "Day 1 complete: models, endpoints, Docker, tests, README"
301
- git push origin main
302
- ```
 
 
 
 
 
 
VISUAL_SUMMARY.md DELETED
@@ -1,419 +0,0 @@
- # 🎯 LogTriageEnv — Day 1 Summary (Visual)
-
- ## What You're Building
-
- ```
- ┌─────────────────────────────────────────────────────────────────┐
- │ LogTriageEnv │
- │ SRE Incident Triage Simulation Environment │
- │ │
- │ Agent: On-call SRE receiving live system logs │
- │ Goal: Diagnose, classify severity, find root cause, remediate │
- │ Setting: 7-service microservice cluster with failures │
- │ │
- │ [Agent] → reads logs → takes action → gets observation+reward│
- └─────────────────────────────────────────────────────────────────┘
- ```
-
- ---
-
- ## Architecture
-
- ```
- ┌─────────────────────────────────────────────────────────────────┐
- │ FastAPI Server │
- │ (server/app.py) │
- ├─────────────────────────────────────────────────────────────────┤
- │ │
- │ ┌─────────────────────────────────────────────────────────┐ │
- │ │ GET /health → {"status": "ok"} ✅ │ │
- │ │ GET /tasks → all 3 task definitions ✅ │ │
- │ │ POST /reset → initial observation ⏳ │ │
- │ │ POST /step → validate & step forward ✅ │ │
- │ │ GET /state → episode state ⏳ │ │
- │ │ POST /grader → task score ⏳ │ │
- │ │ POST /baseline → run gpt-4o-mini ⏳ │ │
- │ └─────────────────────────────────────────────────────────┘ │
- │ │
- ├─────────────────────────────────────────────────────────────────┤
- │ LogTriageEnvironment │
- │ (server/environment.py) │
- │ ⏳ Day 2 │
- ├─────────────────────────────────────────────────────────────────┤
- │ │
- │ Scenarios: Graders: Log Generator: │
- │ • single_crash ✅ • crash_grader • log_generator.py │
- │ • cascading ⏳ • cascade_grader ⏳ Day 2 │
- │ • silent_degrade ⏳ • noise_grader │
- │ ⏳ Day 2-3 ⏳ Day 4 │
- └─────────────────────────────────────────────────────────────────┘
- ```
-
- ---
-
- ## Data Flow
-
- ```
- ┌──────────────┐
- │ Episode │
- │ Start │
- └──────┬───────┘
- │ reset(task_id)
-
- ┌─────────────────────────────────────────┐
- │ Initial Observation │
- │ { │
- │ logs: [LogLine, ...], │
- │ system_state: {service: Status, ...}, │
- │ incident_id, task_id, step_count, │
- │ reward: 0.0, done: false │
- │ } │
- └──────┬───────────────────────────────────┘
-
-
- ┌──────────────────────────────────┐
- │ Agent Decision │
- │ (LLM reads observation) │
- └──────┬───────────────────────────┘
- │ step(action)
-
- ┌──────────────────────────────────────────────┐
- │ Action: TriageAction │
- │ { │
- │ action_type: "classify_severity", │
- │ value: "P1", │
- │ confidence: 0.95, │
- │ reasoning: "High error rate detected" │
- │ } │
- │ │
- │ ✅ Validated by is_valid() method │
- │ 🚫 If invalid → 422 error │
- └──────┬───────────────────────────────────────┘
-
-
- ┌──────────────────────────────────────────────┐
- │ Next Observation + Reward │
- │ { │
- │ logs: [new batch], │
- │ system_state: [updated], │
- │ reward: 0.30, │
- │ cumulative_score: 0.30, │
- │ last_action_feedback: "Good decision", │
- │ done: false │
- │ } │
- └──────┬───────────────────────────────────────┘
-
- ├─→ If done=true → Episode ends
-
- └─→ If done=false → Back to Agent Decision
- ```
-
- ---
-
- ## Three Tasks
-
- ### Task 1: Single Service Crash
- ```
- Scenario:
- payment-service crashes → returns HTTP 500
- Logs show: NullPointerException stack trace
- All other services healthy
-
- Agent must:
- ✅ Classify as P1
- ✅ Identify payment-service as root cause
- ✅ Remediate with restart:payment-service
- ✅ Resolve
-
- Difficulty: EASY (clear logs, no tracing needed)
- Max Steps: 8
- Expected Score: 0.75–0.85 (frontier LLM should handle)
- ```
-
- ### Task 2: Cascading Failure
- ```
- Scenario:
- user-db slow query (2847ms)
- → auth-service connection pool exhausts
- → api-gateway starts returning timeouts
- Surface symptoms: api-gateway errors loudest
- Hidden root cause: database
-
- Agent must:
- ✅ NOT treat api-gateway as root (it's symptom)
- ✅ Trace backward to user-db (real root)
- ✅ Apply correct fix at root (kill-query or restart)
- ✅ Bonus: avoid fixing symptoms first
-
- Difficulty: MEDIUM (requires multi-hop reasoning)
- Max Steps: 12
- Expected Score: 0.45–0.60 (requires logic)
- ```
-
- ### Task 3: Silent Degradation
- ```
- Scenario:
- payment-db latency slowly increases: 450ms → 620ms → 890ms → 1200ms
- No service is down
- Error rate: 2.1% (below 5% P1 threshold)
- Logs: 60% noise (routine checks, unrelated warnings)
-
- Agent must:
- ✅ Classify as P2 (NOT P1, NOT P3 — nuanced judgment!)
- ✅ Identify payment-db as root cause
- ✅ Recommend preventive action (flush-cache or escalate to DBA)
- ✅ Ignore noise logs (don't escalate spuriously)
-
- Difficulty: HARD (noise filtering, temporal reasoning, nuance)
- Max Steps: 15
- Expected Score: 0.20–0.40 (even strong models struggle)
- ```
-
- ---
-
- ## Pydantic Models at a Glance
-
- ```python
- LogLine(
- timestamp: str, # "2025-03-25T14:32:01Z"
- level: Literal["DEBUG", "INFO", "WARN", "ERROR", "FATAL"],
- service: str, # "api-gateway"
- request_id: Optional[str], # "req-9f2a"
- message: str, # "upstream timeout from auth-service"
- latency_ms: Optional[int] # 30002
- )
-
- ServiceStatus(
- name: str, # "api-gateway"
- status: Literal["up", "degraded", "down"],
- error_rate: float, # 0.342
- latency_p99_ms: int, # 2500
- last_updated: str # ISO timestamp
- )
-
- TriageAction( ⭐ MOST CRITICAL
- action_type: Literal[
- "classify_severity", # value: P1|P2|P3
- "identify_root_cause", # value: service-name
- "escalate", # value: team-name
- "remediate", # value: action:service
- "request_more_logs", # value: service|all
- "resolve", # value: "resolved"
- "ignore" # value: "noise"
- ],
- value: str,
- confidence: float, # 0.0–1.0
- reasoning: str,
-
- def is_valid() -> (bool, str) # ✅ Validates all types!
- )
-
- TriageObservation(
- logs: list[LogLine],
- system_state: dict[str, ServiceStatus],
- incident_id: str,
- task_id: str,
- step_count: int,
- time_elapsed_seconds: int,
- active_alerts: list[str],
- reward: float,
- cumulative_score: float,
- done: bool,
- last_action_feedback: str,
- invalid_action_error: Optional[str]
- )
-
- EpisodeState(
- episode_id: str,
- task_id: str,
- step_count: int,
- max_steps: int,
- done: bool,
- cumulative_score: float,
- actions_taken: list[str],
- correct_severity: Optional[str],
- correct_root_cause: Optional[str],
- correct_remediation: bool
- )
- ```
-
- ---
-
- ## Action Validation Examples
-
- ```python
- # ✅ VALID Actions
-
- action = TriageAction(
- action_type="classify_severity",
- value="P1" # ✅ Valid (P1, P2, P3)
- )
- is_valid, err = action.is_valid() # (True, "")
-
- action = TriageAction(
- action_type="identify_root_cause",
- value="user-db" # ✅ Valid service name
- )
- is_valid, err = action.is_valid() # (True, "")
-
- action = TriageAction(
- action_type="remediate",
- value="restart:payment-service" # ✅ Valid format: action:service
- )
- is_valid, err = action.is_valid() # (True, "")
-
- # 🚫 INVALID Actions
-
- action = TriageAction(
- action_type="classify_severity",
- value="P5" # ❌ Invalid (only P1, P2, P3)
- )
- is_valid, err = action.is_valid()
- # (False, "classify_severity value must be one of {'P1', 'P2', 'P3'}")
-
- action = TriageAction(
- action_type="remediate",
- value="invalid:payment-service" # ❌ Invalid prefix
- )
- is_valid, err = action.is_valid()
- # (False, "remediate prefix must be one of {'restart', 'rollback', 'scale', 'flush-cache', 'kill-query'}")
- ```
-
- ---
-
- ## File Completion Status
-
- ```
- ✅ COMPLETE (Day 1)
- ├── openenv.yaml (38 lines) — Spec metadata
- ├── requirements.txt (6 lines) — Dependencies
- ├── Dockerfile (16 lines) — Container image
- ├── README.md (533 lines)— Documentation
- ├── server/models.py (218 lines)— Pydantic models ⭐
- ├── server/app.py (101 lines)— FastAPI server ⭐
- ├── server/__init__.py (0 lines) — Package marker
- ├── test_day1.py (147 lines)— Automated tests
- ├── test_all.bat (61 lines) — Windows batch runner
- ├── TEST_ENDPOINTS.md (172 lines)— Curl examples
- ├── DAY1_STATUS.md (336 lines)— Detailed status
- ├── COMPLETE_SUMMARY.md (240 lines)— Quick summary
- ├── README_EXPLAINED.md (268 lines)— README breakdown
- └── Folder structure ✅ Created
-
- ⏳ PLACEHOLDER (Day 2+)
- ├── server/environment.py — LogTriageEnvironment class
- ├── server/log_generator.py — Synthetic log generation
- ├── server/scenarios/single_crash.py — Task 1 scenario
- ├── server/scenarios/cascading.py — Task 2 scenario
- ├── server/scenarios/silent_degrade.py — Task 3 scenario
- ├── server/graders/base_grader.py — Grader base class
- ├── server/graders/crash_grader.py — Task 1 grader
- ├── server/graders/cascade_grader.py — Task 2 grader
- ├── server/graders/noise_grader.py — Task 3 grader
- ├── baseline.py — LLM baseline agent
- ├── scripts/run_grader.py — Manual grader testing
- └── scripts/validate_checklist.py — Pre-submission validation
- ```
-
- ---
-
- ## Quick Stats
-
- ```
- Day 1 Completion:
- ├── Lines of core code: 357 lines (models + app)
- ├── API endpoints: 7 endpoints (all registered)
- ├── Data models: 5 Pydantic classes (fully typed)
- ├── Validation logic: 1 method with 7 branches (is_valid)
- ├── Tasks defined: 3 tasks (8, 12, 15 step budgets)
- ├── Documentation: 1,280+ lines across 5 files
- ├── Tests/examples: 200+ lines
-
- ├── What works:
- │ ✅ Model imports
- │ ✅ FastAPI app import
- │ ✅ Action validation (11 test cases)
- │ ✅ Pydantic construction
- │ ✅ Endpoint registration
-
- ├── What needs testing:
- │ 🧪 Server startup
- │ 🧪 Curl endpoints
- │ 🧪 Docker build
- │ 🧪 Docker run
-
- └── Estimated completion: 95% ready for push
- ```
-
- ---
-
- ## What to Do Now
-
- ```
- ┌─────────────────────────────────────────────────────────────────┐
- │ STEP 1: Test Locally │
- │ python test_day1.py │
- │ → Should see 11 validation tests pass │
- ├─────────────────────────────────────────────────────────────────┤
- │ STEP 2: Start Server │
- │ pip install -r requirements.txt │
- │ python -m uvicorn server.app:app --port 7860 --reload │
- ├─────────────────────────────────────────────────────────────────┤
- │ STEP 3: Test Endpoints (new terminal) │
- │ curl http://localhost:7860/health │
- │ → See {"status": "ok", ...} │
- ├─────────────────────────────────────────────────────────────────┤
- │ STEP 4: Test Docker │
- │ docker build -t logtriage-env . │
- │ docker run -p 7860:7860 logtriage-env │
- │ curl http://localhost:7860/health │
- ├─────────────────────────────────────────────────────────────────┤
- │ STEP 5: Push to GitHub │
- │ git add . │
- │ git commit -m "Day 1: Complete" │
- │ git push origin main │
- └─────────────────────────────────────────────────────────────────┘
- ```
-
- ---
-
- ## Next: Day 2
-
- ```
- Day 2 Todo:
- 1. Create server/environment.py
- - LogTriageEnvironment class
- - reset() and step() methods
- - Episode management
-
- 2. Create server/log_generator.py
- - Realistic microservice logs
- - Error patterns
- - Noise injection
-
- 3. Create server/scenarios/single_crash.py
- - Task 1 scenario generator
- - payment-service crash
- - Clear error logs
-
- 4. Wire app.py endpoints
- - @app.post("/reset") → environment.reset()
- - @app.post("/step") → environment.step()
- - @app.get("/state") → environment.get_state()
-
- Then endpoints become real! 🚀
- ```
-
- ---
-
- ## Bottom Line
-
- ✅ **You have built the skeleton for a sophisticated RL environment**
- ✅ **All data models are fully typed and validated**
- ✅ **All API endpoints are stubbed and registered**
- ✅ **Documentation is comprehensive**
- ✅ **Code is ready for extension**
-
- 🎯 **Next:** Test locally, push to GitHub, then implement Day 2 logic.
-
- Good luck! 🚀
 
action.json DELETED
Binary file (138 Bytes)
 
baseline.py → inference.py RENAMED
@@ -1,21 +1,21 @@
  """
- Baseline inference script for LogTriageEnv.
- Uses an LLM agent to play all 3 tasks and produce reproducible scores.

  Usage:
- # Set API key as environment variable (never hardcode)
- export GROQ_API_KEY=your_key_here # Linux/Mac
- set GROQ_API_KEY=your_key_here # Windows CMD
- $env:GROQ_API_KEY="your_key_here" # Windows PowerShell
-
- python baseline.py
-
- Environment variables:
- GROQ_API_KEY - Groq API key (primary)
- NVIDIA_API_KEY - NVIDIA NIM API key (fallback)
- OPENROUTER_API_KEY - OpenRouter API key (fallback)
- OPENAI_API_KEY - OpenAI API key (fallback)
- ENV_URL - Base URL of deployed environment (default: http://localhost:7860)
  """
  from __future__ import annotations
  import os
@@ -24,38 +24,21 @@ import time
  import requests
  from openai import OpenAI

- # ─── PROVIDER CONFIG change PROVIDER to switch. Nothing else changes. ───────
-
- PROVIDER = "groq" # options: "groq", "nvidia", "openrouter", "openai"
-
- PROVIDERS = {
- "groq": {
- "base_url": "https://api.groq.com/openai/v1",
- "api_key_env": "GROQ_API_KEY",
- "model": "llama-3.3-70b-versatile",
- },
- "nvidia": {
- "base_url": "https://integrate.api.nvidia.com/v1",
- "api_key_env": "NVIDIA_API_KEY",
- "model": "openai/gpt-oss-20b",
- },
- "openrouter": {
- "base_url": "https://openrouter.ai/api/v1",
- "api_key_env": "OPENROUTER_API_KEY",
- "model": "meta-llama/llama-3.1-8b-instruct:free",
- },
- "openai": {
- "base_url": None,
- "api_key_env": "OPENAI_API_KEY",
- "model": "gpt-4o-mini",
- },
- }

  # ─── ENVIRONMENT CONFIG ───────────────────────────────────────────────────────

  ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
  TASKS = ["single_crash", "cascading_failure", "silent_degradation"]
- MAX_STEPS_PER_TASK = {"single_crash": 8, "cascading_failure": 12, "silent_degradation": 15}
  SEED = 42 # fixed seed for reproducibility

  # ─── SYSTEM PROMPT ─────────────────────────────────────────────────────────────
@@ -83,33 +66,39 @@ Value rules by action_type:
  - resolve: value must be "resolved"
  - ignore: value must be "noise"

  Strategy:
- 1. Read all log lines carefully
- 2. Look at system_state for service health (error_rate, latency_p99_ms, status)
- 3. Identify which service is the ROOT CAUSE (not just a symptom)
- 4. Classify severity based on actual impact:
- - P1: service down or error rate > 5% (customer impact)
- - P2: degraded performance, trending toward P1 (no outage yet)
- - P3: warning, no immediate impact
- 5. Apply the correct fix to the ROOT CAUSE service, not symptom services
- 6. Once you have classified, identified root cause, and remediated — resolve the incident

  IMPORTANT: Respond with ONLY the JSON object. No explanation, no markdown, no backticks."""


  def _build_user_prompt(obs: dict) -> str:
- """Convert observation dict to a prompt string for the LLM."""
  lines = []

- # System state summary
  lines.append("=== SYSTEM STATE ===")
  for svc, status in obs.get("system_state", {}).items():
  if isinstance(status, dict):
  s = status.get("status", "unknown")
  er = status.get("error_rate", 0)
  lat = status.get("latency_p99_ms", 0)
  if s != "up" or er > 0.01 or lat > 200:
- lines.append(f" {svc}: {s} | error_rate={er:.1%} | latency_p99={lat}ms")
  lines.append("")

  # Active alerts
@@ -117,55 +106,53 @@ def _build_user_prompt(obs: dict) -> str:
  if alerts:
  lines.append("=== ACTIVE ALERTS ===")
  for alert in alerts:
- lines.append(f" {alert}")
  lines.append("")

- # Log lines
  lines.append("=== LOG LINES ===")
  for log in obs.get("logs", []):
  if isinstance(log, dict):
- ts = log.get("timestamp", "")[-8:] # just time part
  level = log.get("level", "INFO")
  svc = log.get("service", "unknown")
  msg = log.get("message", "")
  lines.append(f" [{ts}] {level:<5} {svc:<25} {msg}")
  lines.append("")

- # Episode context
- lines.append(f"Step: {obs.get('step_count', 0)} | "
- f"Task: {obs.get('task_id', '')} | "
- f"Time elapsed: {obs.get('time_elapsed_seconds', 0)}s")

  # Feedback from last action
  feedback = obs.get("last_action_feedback", "")
- if feedback and feedback != "Incident detected. Analyze the logs and take action.":
- lines.append(f"Last action feedback: {feedback}")

  lines.append("")
- lines.append("Based on the above, what is your next triage action? Respond with JSON only.")
  return "\n".join(lines)


  def _parse_action(response_text: str) -> dict | None:
- """Parse LLM response into action dict. Returns None if parsing fails."""
  text = response_text.strip()

- # Strip markdown code blocks if present
  if text.startswith("```"):
  lines = text.split("\n")
- text = "\n".join(lines[1:-1] if lines[-1] == "```" else lines[1:])

  try:
  action = json.loads(text)
- # Validate required fields
  if "action_type" not in action or "value" not in action:
  return None
- # Ensure confidence and reasoning exist
  action.setdefault("confidence", 0.8)
  action.setdefault("reasoning", "")
  return action
  except json.JSONDecodeError:
- # Try to extract JSON from text
  import re
  match = re.search(r'\{[^{}]+\}', text, re.DOTALL)
  if match:
@@ -176,16 +163,12 @@ def _parse_action(response_text: str) -> dict | None:
  return None


- def _get_fallback_action(obs: dict, step: int) -> dict:
- """
- Fallback action when LLM fails to produce valid JSON.
- Uses simple heuristics to make a reasonable action.
- """
  system_state = obs.get("system_state", {})
- task_id = obs.get("task_id", "")

- # Find the most degraded service
- worst_service = None
  worst_error_rate = 0
  for svc, status in system_state.items():
  if isinstance(status, dict):
@@ -194,24 +177,27 @@ def _get_fallback_action(obs: dict, step: int) -> dict:
  worst_error_rate = er
  worst_service = svc

- if step == 0:
- return {"action_type": "classify_severity", "value": "P1", "confidence": 0.5, "reasoning": "fallback"}
- elif step == 1 and worst_service:
- return {"action_type": "identify_root_cause", "value": worst_service, "confidence": 0.5, "reasoning": "fallback"}
- elif step == 2 and worst_service:
- return {"action_type": "remediate", "value": f"restart:{worst_service}", "confidence": 0.5, "reasoning": "fallback"}
  else:
- return {"action_type": "resolve", "value": "resolved", "confidence": 0.5, "reasoning": "fallback"}


- def run_task(client: OpenAI, model: str, task_id: str, seed: int = 42) -> dict:
- """
- Run one complete episode for a given task.
- Returns dict with score, steps, and breakdown.
- """
  print(f"\n Running task: {task_id}...")

- # Reset environment
  try:
  resp = requests.post(
  f"{ENV_URL}/reset",
@@ -221,47 +207,44 @@ def run_task(client: OpenAI, model: str, task_id: str, seed: int = 42) -> dict:
  resp.raise_for_status()
  obs = resp.json()
  except Exception as e:
- print(f" ERROR: Failed to reset environment: {e}")
  return {"score": 0.0, "error": str(e), "task_id": task_id}

  max_steps = MAX_STEPS_PER_TASK.get(task_id, 10)
  conversation_history = []
- steps_taken = 0
  done = obs.get("done", False)

  while not done and steps_taken < max_steps:
- # Build prompt from observation
  user_prompt = _build_user_prompt(obs)
-
- # Add to conversation history (keep last 4 exchanges for context)
  conversation_history.append({"role": "user", "content": user_prompt})
  if len(conversation_history) > 8:
  conversation_history = conversation_history[-8:]

  # Call LLM
  try:
  response = client.chat.completions.create(
- model=model,
  messages=[
  {"role": "system", "content": SYSTEM_PROMPT},
  ] + conversation_history,
  max_tokens=200,
- temperature=0, # deterministic
  )
- response_text = response.choices[0].message.content
  conversation_history.append({"role": "assistant", "content": response_text})
-
- # Parse action
  action = _parse_action(response_text)
  if action is None:
- print(f" Step {steps_taken}: LLM parse failed, using fallback")
- action = _get_fallback_action(obs, steps_taken)
-
  except Exception as e:
- print(f" Step {steps_taken}: LLM call failed ({e}), using fallback")
- action = _get_fallback_action(obs, steps_taken)

- # Take action in environment
  try:
  step_resp = requests.post(
  f"{ENV_URL}/step",
@@ -273,18 +256,17 @@ def run_task(client: OpenAI, model: str, task_id: str, seed: int = 42) -> dict:
  done = obs.get("done", False)
  reward = obs.get("reward", 0.0)
  feedback = obs.get("last_action_feedback", "")
-
  print(f" Step {steps_taken}: {action['action_type']}({action['value']}) "
- f"-> reward={reward:+.2f} | {feedback[:60]}")
-
  except Exception as e:
- print(f" Step {steps_taken}: Environment step failed: {e}")
  break

  steps_taken += 1
- time.sleep(0.1) # small delay to avoid rate limits

- # Get official grader score
  try:
  grader_resp = requests.post(f"{ENV_URL}/grader", timeout=30)
  grader_resp.raise_for_status()
@@ -292,11 +274,11 @@ def run_task(client: OpenAI, model: str, task_id: str, seed: int = 42) -> dict:
  score = grader_result.get("score", 0.0)
  breakdown = grader_result.get("breakdown", {})
  except Exception as e:
- print(f" ERROR: Grader call failed: {e}")
  score = obs.get("cumulative_score", 0.0)
  breakdown = {}

- print(f" Final score: {score:.4f} ({steps_taken} steps)")
  return {
  "task_id": task_id,
  "score": score,
@@ -306,88 +288,83 @@


  def main():
- """Run baseline agent against all 3 tasks and report scores."""

- # ── Setup provider ─────────────────────────────────────────────────────────
- provider_config = PROVIDERS[PROVIDER]
- api_key = os.environ.get(provider_config["api_key_env"])
- model = provider_config["model"]
- base_url = provider_config["base_url"]
-
- if not api_key:
  raise ValueError(
- f"API key not found. Set environment variable: {provider_config['api_key_env']}\n"
- f" Windows PowerShell: $env:{provider_config['api_key_env']}='your_key'\n"
- f" Windows CMD: set {provider_config['api_key_env']}=your_key"
  )

- # Build OpenAI-compatible client
- client_kwargs = {"api_key": api_key}
- if base_url:
- client_kwargs["base_url"] = base_url
- client = OpenAI(**client_kwargs)

  print("=" * 60)
  print("LogTriageEnv — Baseline Inference Script")
  print("=" * 60)
- print(f"Provider: {PROVIDER}")
- print(f"Model: {model}")
- print(f"Environment: {ENV_URL}")
- print(f"Seed: {SEED}")
- print(f"Tasks: {', '.join(TASKS)}")
  print("=" * 60)

- # ── Verify environment is running ──────────────────────────────────────────
  try:
  health = requests.get(f"{ENV_URL}/health", timeout=10)
  health.raise_for_status()
- print(f"Environment health: OK")
  except Exception as e:
  raise RuntimeError(
  f"Environment not responding at {ENV_URL}\n"
- f"Start it with: python -m uvicorn server.app:app --port 7860\n"
  f"Error: {e}"
  )

- # ── Run all tasks ──────────────────────────────────────────────────────────
  results = []
  for task_id in TASKS:
- result = run_task(client, model, task_id, seed=SEED)
  results.append(result)

- # ── Print final report ─────────────────────────────────────────────────────
  print("\n" + "=" * 60)
  print("BASELINE RESULTS")
  print("=" * 60)

- total_score = 0.0
  for result in results:
  task = result["task_id"]
  score = result["score"]
  steps = result["steps_taken"]
- total_score += score
- bar = "#" * int(score * 20) + "-" * (20 - int(score * 20))
  print(f"{task:<25} {score:.4f} [{bar}] ({steps} steps)")
- if result.get("breakdown"):
- for k, v in result["breakdown"].items():
- print(f" {k:<20} {v}")

- avg_score = total_score / len(TASKS)
  print("-" * 60)
- print(f"{'AVERAGE':<25} {avg_score:.4f}")
  print("=" * 60)

- # ── Machine-readable output ────────────────────────────────────────────────
  output = {
- "provider": PROVIDER,
- "model": model,
  "seed": SEED,
  "results": results,
- "average_score": round(avg_score, 4),
  }
- print("\nJSON Output (for /baseline endpoint):")
  print(json.dumps(output, indent=2))
-
  return output

  """
+ inference.py — Baseline Inference Script for LogTriageEnv
+ ==========================================================
+ MANDATORY environment variables:
+ API_BASE_URL The API endpoint for the LLM
+ (default: https://router.huggingface.co/v1)
+ MODEL_NAME The model identifier to use for inference
+ HF_TOKEN Your Hugging Face / API key

  Usage:
+ # Set environment variables
+ $env:API_BASE_URL="https://api.groq.com/openai/v1" # or HF router
+ $env:MODEL_NAME="llama-3.3-70b-versatile" # or any model
+ $env:HF_TOKEN="your-api-key-here"
+
+ python inference.py
+
+ Runtime: < 20 minutes on vcpu=2, memory=8gb
  """
  from __future__ import annotations
  import os
  import requests
  from openai import OpenAI

+ # ─── MANDATORY ENV VARIABLES (as required by hackathon spec) ──────────────────
+
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+ MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/Llama-3.3-70B-Instruct")
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("GROQ_API_KEY") # HF_TOKEN is primary

  # ─── ENVIRONMENT CONFIG ───────────────────────────────────────────────────────

  ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
  TASKS = ["single_crash", "cascading_failure", "silent_degradation"]
+ MAX_STEPS_PER_TASK = {
+ "single_crash": 8,
+ "cascading_failure": 12,
+ "silent_degradation": 15,
+ }
  SEED = 42 # fixed seed for reproducibility

  # ─── SYSTEM PROMPT ─────────────────────────────────────────────────────────────
  - resolve: value must be "resolved"
  - ignore: value must be "noise"

+ Severity classification rules:
+ - P1: service DOWN or error rate > 5% — immediate customer impact
+ - P2: degraded performance, trending toward P1 — no outage yet
+ - P3: warning only, no immediate impact
+
  Strategy:
+ 1. Read all log lines carefully — identify ERROR and FATAL lines first
+ 2. Check system_state for each service (error_rate, latency_p99_ms, status)
+ 3. Find the ROOT CAUSE service (where the problem STARTED, not where it SPREAD)
+ 4. Classify severity based on actual current impact
+ 5. Apply fix to ROOT CAUSE service, not symptom services
+ 6. After classify + identify + remediate call resolve

  IMPORTANT: Respond with ONLY the JSON object. No explanation, no markdown, no backticks."""


  def _build_user_prompt(obs: dict) -> str:
+ """Convert observation dict into LLM prompt."""
  lines = []

+ # System state — only show services with issues
  lines.append("=== SYSTEM STATE ===")
+ shown_any = False
  for svc, status in obs.get("system_state", {}).items():
  if isinstance(status, dict):
  s = status.get("status", "unknown")
  er = status.get("error_rate", 0)
  lat = status.get("latency_p99_ms", 0)
  if s != "up" or er > 0.01 or lat > 200:
+ lines.append(f" {svc}: status={s} | error_rate={er:.1%} | latency_p99={lat}ms")
+ shown_any = True
+ if not shown_any:
+ lines.append(" All services appear healthy")
  lines.append("")

  # Active alerts
  if alerts:
  lines.append("=== ACTIVE ALERTS ===")
  for alert in alerts:
+ lines.append(f" {alert}")
  lines.append("")

+ # Log lines — show all of them
  lines.append("=== LOG LINES ===")
  for log in obs.get("logs", []):
  if isinstance(log, dict):
+ ts = log.get("timestamp", "")[-8:]
  level = log.get("level", "INFO")
  svc = log.get("service", "unknown")
  msg = log.get("message", "")
  lines.append(f" [{ts}] {level:<5} {svc:<25} {msg}")
  lines.append("")

+ # Context
+ step = obs.get("step_count", 0)
+ task = obs.get("task_id", "")
+ elapsed = obs.get("time_elapsed_seconds", 0)
+ lines.append(f"Step: {step} | Task: {task} | Time elapsed: {elapsed}s")

  # Feedback from last action
  feedback = obs.get("last_action_feedback", "")
+ if feedback and "Incident detected" not in feedback:
+ lines.append(f"Last feedback: {feedback}")

  lines.append("")
+ lines.append("Respond with JSON only.")
  return "\n".join(lines)


  def _parse_action(response_text: str) -> dict | None:
+ """Parse LLM response into action dict."""
  text = response_text.strip()

+ # Strip markdown code blocks
  if text.startswith("```"):
  lines = text.split("\n")
+ text = "\n".join(lines[1:-1] if lines[-1].strip() == "```" else lines[1:])

  try:
  action = json.loads(text)
  if "action_type" not in action or "value" not in action:
  return None
  action.setdefault("confidence", 0.8)
  action.setdefault("reasoning", "")
  return action
  except json.JSONDecodeError:
  import re
  match = re.search(r'\{[^{}]+\}', text, re.DOTALL)
  if match:
  return None


+ def _get_fallback_action(obs: dict, step: int, actions_taken: list) -> dict:
+ """Fallback when LLM fails — use simple heuristics."""
  system_state = obs.get("system_state", {})

+ # Find worst service
+ worst_service = "payment-service"
  worst_error_rate = 0
  for svc, status in system_state.items():
  if isinstance(status, dict):
  worst_error_rate = er
  worst_service = svc

+ action_types_taken = [a.get("action_type") for a in actions_taken]
+
+ if "classify_severity" not in action_types_taken:
+ return {"action_type": "classify_severity", "value": "P1",
+ "confidence": 0.5, "reasoning": "fallback"}
+ elif "identify_root_cause" not in action_types_taken:
+ return {"action_type": "identify_root_cause", "value": worst_service,
+ "confidence": 0.5, "reasoning": "fallback"}
+ elif "remediate" not in action_types_taken:
+ return {"action_type": "remediate", "value": f"restart:{worst_service}",
+ "confidence": 0.5, "reasoning": "fallback"}
  else:
+ return {"action_type": "resolve", "value": "resolved",
+ "confidence": 0.5, "reasoning": "fallback"}


+ def run_task(client: OpenAI, task_id: str, seed: int = 42) -> dict:
+ """Run one complete episode for a task. Returns score + breakdown."""
  print(f"\n Running task: {task_id}...")

+ # Reset
  try:
  resp = requests.post(
  f"{ENV_URL}/reset",
  resp.raise_for_status()
  obs = resp.json()
  except Exception as e:
+ print(f" ERROR: Reset failed: {e}")
  return {"score": 0.0, "error": str(e), "task_id": task_id}

  max_steps = MAX_STEPS_PER_TASK.get(task_id, 10)
  conversation_history = []
+ actions_taken = []
  done = obs.get("done", False)
+ steps_taken = 0

  while not done and steps_taken < max_steps:
  user_prompt = _build_user_prompt(obs)
  conversation_history.append({"role": "user", "content": user_prompt})
+
+ # Keep conversation history bounded
  if len(conversation_history) > 8:
  conversation_history = conversation_history[-8:]

  # Call LLM
  try:
  response = client.chat.completions.create(
+ model=MODEL_NAME,
  messages=[
  {"role": "system", "content": SYSTEM_PROMPT},
  ] + conversation_history,
  max_tokens=200,
+ temperature=0,
  )
+ response_text = response.choices[0].message.content or ""
  conversation_history.append({"role": "assistant", "content": response_text})
  action = _parse_action(response_text)
  if action is None:
+ print(f" Step {steps_taken}: parse failed, using fallback")
+ action = _get_fallback_action(obs, steps_taken, actions_taken)
  except Exception as e:
+ print(f" Step {steps_taken}: LLM error ({e}), using fallback")
+ action = _get_fallback_action(obs, steps_taken, actions_taken)

+ # Step environment
  try:
  step_resp = requests.post(
  f"{ENV_URL}/step",
  done = obs.get("done", False)
  reward = obs.get("reward", 0.0)
  feedback = obs.get("last_action_feedback", "")
+ actions_taken.append(action)
  print(f" Step {steps_taken}: {action['action_type']}({action['value']}) "
+ f" reward={reward:+.2f} | {feedback[:50]}")
  except Exception as e:
+ print(f" Step {steps_taken}: environment error: {e}")
264
  break
265
 
266
  steps_taken += 1
267
+ time.sleep(0.2) # avoid rate limits
268
 
269
+ # Get grader score
270
  try:
271
  grader_resp = requests.post(f"{ENV_URL}/grader", timeout=30)
272
  grader_resp.raise_for_status()
 
274
  score = grader_result.get("score", 0.0)
275
  breakdown = grader_result.get("breakdown", {})
276
  except Exception as e:
277
+ print(f" ERROR: Grader failed: {e}")
278
  score = obs.get("cumulative_score", 0.0)
279
  breakdown = {}
280
 
281
+ print(f" Score: {score:.4f} ({steps_taken} steps)")
282
  return {
283
  "task_id": task_id,
284
  "score": score,
 
288
 
289
 
290
  def main():
291
+ """Run baseline agent on all 3 tasks and report scores."""
292
 
293
+ # Validate env vars
294
+ if not API_KEY:
 
 
 
 
 
295
  raise ValueError(
296
+ "API key not found. Set HF_TOKEN environment variable:\n"
297
+ " PowerShell: $env:HF_TOKEN='your-key'\n"
298
+ " CMD: set HF_TOKEN=your-key"
299
  )
300
 
301
+ # Build client
302
+ client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
 
 
 
303
 
304
  print("=" * 60)
305
  print("LogTriageEnv — Baseline Inference Script")
306
  print("=" * 60)
307
+ print(f"API_BASE_URL: {API_BASE_URL}")
308
+ print(f"MODEL_NAME: {MODEL_NAME}")
309
+ print(f"ENV_URL: {ENV_URL}")
310
+ print(f"Seed: {SEED}")
 
311
  print("=" * 60)
312
 
313
+ # Verify environment
314
  try:
315
  health = requests.get(f"{ENV_URL}/health", timeout=10)
316
  health.raise_for_status()
317
+ print("Environment: OK")
318
  except Exception as e:
319
  raise RuntimeError(
320
  f"Environment not responding at {ENV_URL}\n"
321
+ f"Start with: python -m uvicorn server.app:app --port 7860\n"
322
  f"Error: {e}"
323
  )
324
 
325
+ # Run all tasks
326
  results = []
327
+ start_time = time.time()
328
+
329
  for task_id in TASKS:
330
+ result = run_task(client, task_id, seed=SEED)
331
  results.append(result)
332
 
333
+ elapsed = time.time() - start_time
334
+
335
+ # Print report
336
  print("\n" + "=" * 60)
337
  print("BASELINE RESULTS")
338
  print("=" * 60)
339
 
340
+ total = 0.0
341
  for result in results:
342
  task = result["task_id"]
343
  score = result["score"]
344
  steps = result["steps_taken"]
345
+ total += score
346
+ bar = "" * int(score * 20) + "" * (20 - int(score * 20))
347
  print(f"{task:<25} {score:.4f} [{bar}] ({steps} steps)")
348
+ for k, v in result.get("breakdown", {}).items():
349
+ print(f" {k:<20} {v}")
 
350
 
351
+ avg = total / len(TASKS)
352
  print("-" * 60)
353
+ print(f"{'AVERAGE':<25} {avg:.4f}")
354
+ print(f"{'RUNTIME':<25} {elapsed:.1f}s")
355
  print("=" * 60)
356
 
357
+ # JSON output
358
  output = {
359
+ "api_base_url": API_BASE_URL,
360
+ "model_name": MODEL_NAME,
361
  "seed": SEED,
362
  "results": results,
363
+ "average_score": round(avg, 4),
364
+ "runtime_seconds": round(elapsed, 1),
365
  }
366
+ print("\nJSON Output:")
367
  print(json.dumps(output, indent=2))
 
368
  return output
369
 
370
 
pyproject.toml ADDED
@@ -0,0 +1,24 @@
+[project]
+name = "logtriage-env"
+version = "1.0.0"
+description = "An OpenEnv environment where an AI agent acts as an on-call SRE diagnosing incidents from log data"
+requires-python = ">=3.10"
+dependencies = [
+    "fastapi>=0.110.0",
+    "uvicorn>=0.27.0",
+    "pydantic>=2.5.0",
+    "python-dotenv>=1.0.0",
+    "groq>=0.5.0",
+    "openenv-core>=0.2.0",
+]
+
+[build-system]
+requires = ["setuptools>=61.0"]
+build-backend = "setuptools.build_meta"
+
+[tool.setuptools]
+package-dir = {"" = "."}
+packages = ["server", "server.graders", "server.scenarios"]
+
+[project.scripts]
+server = "server.app:main"
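The commit message says inference.py is now configured through `HF_TOKEN`, `API_BASE_URL`, and `MODEL_NAME` environment variables. A minimal sketch of that resolution logic, assuming illustrative fallback values (the real defaults live in inference.py and are not shown in this diff):

```python
import os


def load_config(env=os.environ):
    """Resolve runtime config from env vars, with illustrative fallbacks."""
    return {
        # HF_TOKEN preferred; GROQ_API_KEY kept as a legacy fallback
        "api_key": env.get("HF_TOKEN") or env.get("GROQ_API_KEY", ""),
        # Base URL and model are hypothetical defaults, not the repo's
        "api_base_url": env.get("API_BASE_URL", "https://example.invalid/v1"),
        "model_name": env.get("MODEL_NAME", "example-model"),
        "env_url": env.get("ENV_URL", "http://localhost:7860"),
    }


cfg = load_config({"HF_TOKEN": "hf_xxx", "MODEL_NAME": "my-model"})
print(cfg)
```

Passing a plain dict as `env` keeps the function testable without mutating the process environment.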
server/app.py CHANGED
@@ -118,34 +118,26 @@ def baseline():
     """
     Run the baseline inference script against all 3 tasks.
     Returns scores for each task produced by the LLM agent.
-    Note: Requires GROQ_API_KEY (or other provider key) to be set.
+    Note: Requires HF_TOKEN (or GROQ_API_KEY) to be set.
     """
     import subprocess
     import sys
     import json as json_lib

     try:
-        # Pass through all current env vars, plus GROQ_API_KEY if set
-        env = os.environ.copy()
-        groq_key = os.environ.get("GROQ_API_KEY", "")
-        if not groq_key:
-            # Try to read from process that started the server
-            pass
-
         result = subprocess.run(
-            [sys.executable, "baseline.py"],
+            [sys.executable, "inference.py"],
             capture_output=True,
             text=True,
-            timeout=300,  # 5 minute timeout
+            timeout=1200,  # 20 minute timeout (matches spec)
            cwd=os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
-            env=env,
         )

         if result.returncode != 0:
             return JSONResponse(
                 status_code=500,
                 content={
-                    "error": "Baseline script failed",
+                    "error": "Inference script failed",
                     "stderr": result.stderr[-500:] if result.stderr else "",
                 }
             )
@@ -154,7 +146,7 @@ def baseline():
         output_lines = result.stdout.strip().split("\n")
         json_start = None
         for i, line in enumerate(output_lines):
-            if line.strip() == "JSON Output (for /baseline endpoint):":
+            if line.strip() == "JSON Output:":
                 json_start = i + 1
                 break
@@ -165,10 +157,14 @@ def baseline():
            return {"message": "Baseline completed", "output": result.stdout[-1000:]}

     except subprocess.TimeoutExpired:
-        return JSONResponse(status_code=504, content={"error": "Baseline timed out after 5 minutes"})
+        return JSONResponse(status_code=504, content={"error": "Inference timed out (20min limit)"})
     except Exception as e:
         return JSONResponse(status_code=500, content={"error": str(e)})


+def main():
+    uvicorn.run("server.app:app", host="0.0.0.0", port=7860, reload=False)
+
+
 if __name__ == "__main__":
-    uvicorn.run("server.app:app", host="0.0.0.0", port=7860, reload=True)
+    main()
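The `/baseline` endpoint above scans the subprocess stdout for the `JSON Output:` marker line and parses what follows. The lines after `json_start` are elided from this diff, so the following is only a plausible sketch of that extraction, with an assumed helper name:

```python
import json


def extract_json_after_marker(stdout: str, marker: str = "JSON Output:"):
    """Find the marker line, then parse everything after it as JSON.

    Sketch of the /baseline stdout parsing; the elided repo lines may differ.
    """
    lines = stdout.strip().split("\n")
    for i, line in enumerate(lines):
        if line.strip() == marker:
            try:
                return json.loads("\n".join(lines[i + 1:]))
            except json.JSONDecodeError:
                return None
    return None


out = extract_json_after_marker('some log noise\nJSON Output:\n{"average_score": 0.5}')
print(out)
```

Matching on an exact marker line is why the endpoint and inference.py must agree on the string; this commit changes both sides from the old `JSON Output (for /baseline endpoint):` label to `JSON Output:`.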
test_all.bat DELETED
@@ -1,71 +0,0 @@
-@echo off
-REM =========================================================================
-REM Day 1 Test & Verification Script for LogTriageEnv
-REM =========================================================================
-REM This script runs all Day 1 tests and verifies the project is ready
-
-echo =========================================================================
-echo LogTriageEnv — Day 1 Verification Script
-echo =========================================================================
-
-REM Test 1: Python Tests
-echo.
-echo [TEST 1] Running Python validation tests...
-python test_day1.py
-if %ERRORLEVEL% NEQ 0 (
-    echo ❌ Python tests failed!
-    exit /b 1
-)
-
-REM Test 2: Install dependencies
-echo.
-echo [TEST 2] Installing dependencies from requirements.txt...
-pip install -q -r requirements.txt
-if %ERRORLEVEL% NEQ 0 (
-    echo ❌ Pip install failed!
-    exit /b 1
-)
-echo ✅ Dependencies installed
-
-REM Test 3: Check FastAPI can import
-echo.
-echo [TEST 3] Checking FastAPI imports...
-python -c "from fastapi import FastAPI; from uvicorn import run; print('✅ FastAPI and Uvicorn OK')"
-if %ERRORLEVEL% NEQ 0 (
-    echo ❌ FastAPI/Uvicorn import failed!
-    exit /b 1
-)
-
-REM Test 4: Check Pydantic models
-echo.
-echo [TEST 4] Testing Pydantic models...
-python -c "from server.models import TriageAction, TriageObservation; print('✅ Models imported')"
-if %ERRORLEVEL% NEQ 0 (
-    echo ❌ Models import failed!
-    exit /b 1
-)
-
-echo.
-echo =========================================================================
-echo ✅ ALL TESTS PASSED!
-echo =========================================================================
-echo.
-echo Next steps:
-echo.
-echo 1. START THE SERVER:
-echo    python -m uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
-echo.
-echo 2. TEST ENDPOINTS (open another terminal):
-echo    curl http://localhost:7860/health
-echo    curl http://localhost:7860/tasks
-echo.
-echo 3. TEST DOCKER BUILD:
-echo    docker build -t logtriage-env .
-echo    docker run -p 7860:7860 logtriage-env
-echo.
-echo 4. PUSH TO GITHUB:
-echo    git add .
-echo    git commit -m "Day 1: scaffold, models.py, app skeleton, Dockerfile"
-echo    git push origin main
-echo.
-pause
test_day1.py DELETED
@@ -1,130 +0,0 @@
-#!/usr/bin/env python
-"""
-Day 1 Test Script — Verify all endpoints and models work
-"""
-import sys
-import json
-from pathlib import Path
-
-# Add server to path
-sys.path.insert(0, str(Path(__file__).parent))
-
-print("=" * 70)
-print("DAY 1 TEST SUITE — LogTriageEnv")
-print("=" * 70)
-
-# Test 1: Import models
-print("\n[TEST 1] Importing models...")
-try:
-    from server.models import TriageAction, TriageObservation, EpisodeState, LogLine, ServiceStatus
-    print("✅ All models imported successfully")
-except Exception as e:
-    print(f"❌ Import failed: {e}")
-    sys.exit(1)
-
-# Test 2: Import FastAPI app
-print("\n[TEST 2] Importing FastAPI app...")
-try:
-    from server.app import app
-    print("✅ FastAPI app imported successfully")
-except Exception as e:
-    print(f"❌ App import failed: {e}")
-    sys.exit(1)
-
-# Test 3: Test TriageAction validation
-print("\n[TEST 3] Testing TriageAction.is_valid()...")
-test_cases = [
-    ({"action_type": "classify_severity", "value": "P1"}, True, "Valid P1"),
-    ({"action_type": "classify_severity", "value": "P5"}, False, "Invalid P5"),
-    ({"action_type": "identify_root_cause", "value": "user-db"}, True, "Valid root cause"),
-    ({"action_type": "identify_root_cause", "value": "invalid-service"}, False, "Invalid service"),
-    ({"action_type": "remediate", "value": "restart:payment-service"}, True, "Valid remediate"),
-    ({"action_type": "remediate", "value": "invalid:payment-service"}, False, "Invalid remediate action"),
-    ({"action_type": "escalate", "value": "sre-team"}, True, "Valid escalate"),
-    ({"action_type": "escalate", "value": "invalid-team"}, False, "Invalid team"),
-    ({"action_type": "resolve", "value": "resolved"}, True, "Valid resolve"),
-    ({"action_type": "resolve", "value": "not-resolved"}, False, "Invalid resolve"),
-    ({"action_type": "ignore", "value": "noise"}, True, "Valid ignore"),
-]
-
-passed = 0
-failed = 0
-
-for test_data, expected_valid, description in test_cases:
-    try:
-        action = TriageAction(**test_data)
-        is_valid, error = action.is_valid()
-        if is_valid == expected_valid:
-            print(f"  ✅ {description}: {test_data}")
-            passed += 1
-        else:
-            print(f"  ❌ {description}: expected {expected_valid}, got {is_valid}")
-            failed += 1
-    except Exception as e:
-        print(f"  ❌ {description}: Exception: {e}")
-        failed += 1
-
-print(f"\nValidation tests: {passed} passed, {failed} failed")
-
-# Test 4: Test Pydantic model construction
-print("\n[TEST 4] Testing Pydantic model construction...")
-try:
-    log = LogLine(
-        timestamp="2025-03-25T14:32:01Z",
-        level="ERROR",
-        service="api-gateway",
-        request_id="req-123",
-        message="Service timeout",
-        latency_ms=5000
-    )
-    print(f"✅ LogLine created: {log.service}")
-
-    service_status = ServiceStatus(
-        name="api-gateway",
-        status="degraded",
-        error_rate=0.34,
-        latency_p99_ms=2500,
-        last_updated="2025-03-25T14:32:01Z"
-    )
-    print(f"✅ ServiceStatus created: {service_status.name}")
-
-    observation = TriageObservation(
-        logs=[log],
-        system_state={"api-gateway": service_status},
-        incident_id="inc-001",
-        task_id="single_crash",
-        step_count=0,
-        time_elapsed_seconds=0
-    )
-    print(f"✅ TriageObservation created: {observation.incident_id}")
-except Exception as e:
-    print(f"❌ Model construction failed: {e}")
-    sys.exit(1)
-
-# Test 5: FastAPI endpoint structure
-print("\n[TEST 5] Checking FastAPI endpoints...")
-endpoints = ["/health", "/reset", "/step", "/state", "/tasks", "/grader", "/baseline"]
-from fastapi.routing import APIRoute
-
-app_endpoints = [route.path for route in app.routes if isinstance(route, APIRoute)]
-print(f"Registered endpoints: {app_endpoints}")
-
-for endpoint in endpoints:
-    if endpoint in app_endpoints:
-        print(f"  ✅ {endpoint} exists")
-    else:
-        print(f"  ❌ {endpoint} missing")
-
-print("\n" + "=" * 70)
-print("✅ ALL TESTS PASSED — Day 1 Ready for Verification")
-print("=" * 70)
-print("\nNext steps:")
-print("1. Start server: python -m uvicorn server.app:app --host 0.0.0.0 --port 7860")
-print("2. Test endpoints with curl (see below)")
-print("3. Build Docker: docker build -t logtriage-env .")
-print("4. Verify Docker works: docker run -p 7860:7860 logtriage-env")
-print("\nExample curl tests:")
-print("  curl http://localhost:7860/health")
-print("  curl http://localhost:7860/tasks")
-print("  curl -X POST http://localhost:7860/reset -H 'Content-Type: application/json'")
uv.lock ADDED
The diff for this file is too large to render. See raw diff