Commit a73cf2e by OGrohit · 1 Parent(s): 9d56115

Updated README

  1. README.md +17 -18
README.md CHANGED
@@ -202,9 +202,9 @@ The reward function provides **dense, shaped signal** across the full trajectory
  - `remediate("restart:payment-service")` taken → 0.25
  - Resolved within 8 steps → +0.10 speed bonus
 
- **Grader score:** sum of above, normalized to [0.0, 1.0]. Deterministic: same scenario seed produces identical grader output.
 
- **Expected baseline score:** 0.75–0.85 (frontier LLM should solve this reliably)
 
  ---
 
@@ -221,9 +221,9 @@ The reward function provides **dense, shaped signal** across the full trajectory
  - Did NOT first remediate a symptom service → +0.10 ordering bonus
  - Resolved within 12 steps → +0.10 speed bonus
 
- **Grader score:** [0.0, 1.0]. Penalizes agents that treat symptoms rather than root cause.
 
- **Expected baseline score:** 0.45–0.60 (requires multi-hop reasoning)
 
  ---
 
@@ -241,9 +241,9 @@ The reward function provides **dense, shaped signal** across the full trajectory
  - Resolved/escalated within 15 steps → +0.10 speed bonus
  - Correctly ignored noise actions (no spurious `escalate` calls) → +0.10
 
- **Grader score:** [0.0, 1.0]. This task is designed to challenge frontier models: it requires temporal reasoning across steps, noise filtering, and nuanced severity judgment.
 
- **Expected baseline score:** 0.20–0.40 (even strong models struggle here)
 
  ---
 
@@ -449,12 +449,12 @@ The script prints a per-task score bar and returns a JSON block with full breakd
  "model_name": "llama-3.3-70b-versatile",
  "seed": 42,
  "results": [
- { "task_id": "single_crash", "score": 1.0, "steps_taken": 5, "breakdown": {} },
- { "task_id": "cascading_failure", "score": 0.65, "steps_taken": 9, "breakdown": {} },
- { "task_id": "silent_degradation", "score": 1.0, "steps_taken": 12, "breakdown": {} }
  ],
- "average_score": 0.8833,
- "runtime_seconds": 45.2
  }
  ```
 
@@ -466,14 +466,13 @@ Scores produced by `inference.py` using `llama-3.3-70b-versatile` via Groq API (
 
  | Task | Difficulty | Score |
  |---|---|---|
- | Single Service Crash | Easy | 1.0000 |
  | Cascading Failure | Medium | 0.6500 |
- | Silent Degradation | Hard | 1.0000 |
- | **Average** | | **0.8833** |
 
- > **Note:** Silent Degradation (Hard) requires distinguishing signal from 60% noise
- > and making a nuanced P2 judgment. The model successfully filtered noise and identified
- > `payment-db` as root cause with `flush-cache:payment-db` remediation.
 
  ---
 
@@ -506,7 +505,7 @@ Scores produced by `inference.py` using `llama-3.3-70b-versatile` via Groq API (
  - [ ] `POST /reset` returns valid observation
  - [ ] `POST /step` with valid action returns observation + reward
  - [ ] `GET /tasks` returns all 3 tasks with action schema
- - [ ] `POST /grader` returns score in [0.0, 1.0]
  - [ ] `POST /baseline` completes and returns scores for all 3 tasks
  - [ ] HF Space URL responds to ping with 200
  - [ ] Baseline script runs end-to-end with `HF_TOKEN` set
 
  - `remediate("restart:payment-service")` taken → 0.25
  - Resolved within 8 steps → +0.10 speed bonus
 
+ **Grader score:** sum of above, normalized to (0.0, 1.0). Deterministic: same scenario seed produces identical grader output.
 
+ **Expected baseline score:** 0.70–0.85 (frontier LLM should solve this reliably)
 
  ---
 
 
  - Did NOT first remediate a symptom service → +0.10 ordering bonus
  - Resolved within 12 steps → +0.10 speed bonus
 
+ **Grader score:** (0.0, 1.0). Penalizes agents that treat symptoms rather than root cause.
 
+ **Expected baseline score:** 0.50–0.65 (requires multi-hop reasoning)
 
  ---
 
 
  - Resolved/escalated within 15 steps → +0.10 speed bonus
  - Correctly ignored noise actions (no spurious `escalate` calls) → +0.10
 
+ **Grader score:** (0.0, 1.0). This task is designed to challenge frontier models: it requires temporal reasoning across steps, noise filtering, and nuanced severity judgment.
 
+ **Expected baseline score:** 0.45–0.60 (even strong models struggle here)
 
  ---
 
 
  "model_name": "llama-3.3-70b-versatile",
  "seed": 42,
  "results": [
+ { "task_id": "single_crash", "score": 0.9999, "steps_taken": 4, "breakdown": {} },
+ { "task_id": "cascading_failure", "score": 0.65, "steps_taken": 7, "breakdown": {} },
+ { "task_id": "silent_degradation", "score": 0.55, "steps_taken": 5, "breakdown": {} }
  ],
+ "average_score": 0.7333,
+ "runtime_seconds": 97.4
  }
  ```
 
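As a quick consistency check on the updated JSON block, the reported `average_score` can be recomputed from the three per-task scores. The sketch below is only illustrative: the helper name `mean_score` and the assumption that the average is rounded to four decimal places are not confirmed by the source.

```python
import json

# Scores copied from the updated example output in this diff.
report = json.loads("""
{
  "results": [
    {"task_id": "single_crash", "score": 0.9999},
    {"task_id": "cascading_failure", "score": 0.65},
    {"task_id": "silent_degradation", "score": 0.55}
  ],
  "average_score": 0.7333
}
""")


def mean_score(results, ndigits=4):
    """Arithmetic mean of per-task scores, rounded (rounding is an assumption)."""
    return round(sum(r["score"] for r in results) / len(results), ndigits)


recomputed = mean_score(report["results"])
print(recomputed, recomputed == report["average_score"])
```

The same helper applied to the previous scores (1.0, 0.65, 1.0) gives 0.8833, which matches the removed value.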
 
 
  | Task | Difficulty | Score |
  |---|---|---|
+ | Single Service Crash | Easy | 0.9999 |
  | Cascading Failure | Medium | 0.6500 |
+ | Silent Degradation | Hard | 0.5500 |
+ | **Average** | | **0.7333** |
 
+ > **Note:** Scores are clamped to the open interval (0, 1), i.e. strictly between 0 and 1.
+ > A score of exactly 1.0 or 0.0 would fail Phase 2 validation.
 
  ---
 
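The clamping described in the note can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's grader: the function name `clamp_score` and the margin `EPSILON = 1e-4` are hypothetical, with the margin chosen only because it would turn a raw 1.0 into the 0.9999 shown in the table.

```python
EPSILON = 1e-4  # assumed margin; the actual value used by the grader is not shown


def clamp_score(raw: float, eps: float = EPSILON) -> float:
    """Clamp a raw score into the open interval (0, 1).

    Values at or below 0 become eps, values at or above 1 become 1 - eps,
    and anything already inside [eps, 1 - eps] is returned unchanged.
    """
    return min(1.0 - eps, max(eps, raw))


for raw in (0.0, 0.65, 1.0):
    print(raw, "->", clamp_score(raw))
```

A symmetric margin keeps the mapping simple and leaves mid-range scores such as 0.65 untouched, so only perfect or zero runs are affected.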
 
  - [ ] `POST /reset` returns valid observation
  - [ ] `POST /step` with valid action returns observation + reward
  - [ ] `GET /tasks` returns all 3 tasks with action schema
+ - [ ] `POST /grader` returns score in (0.0, 1.0), strictly between 0 and 1
  - [ ] `POST /baseline` completes and returns scores for all 3 tasks
  - [ ] HF Space URL responds to ping with 200
  - [ ] Baseline script runs end-to-end with `HF_TOKEN` set
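
The updated `/grader` checklist item could be verified offline against a saved results list. The sketch below assumes the result shape shown in the example JSON (`task_id` and `score` keys); the helper names `in_open_unit_interval` and `failing_tasks` are hypothetical and no HTTP call is made.

```python
def in_open_unit_interval(score: float) -> bool:
    """True only when the score is strictly between 0 and 1."""
    return 0.0 < score < 1.0


def failing_tasks(results):
    """Return the task_ids whose score falls outside the open interval (0, 1)."""
    return [r["task_id"] for r in results if not in_open_unit_interval(r["score"])]


# Scores before and after this commit, copied from the example output.
old_results = [
    {"task_id": "single_crash", "score": 1.0},
    {"task_id": "cascading_failure", "score": 0.65},
    {"task_id": "silent_degradation", "score": 1.0},
]
new_results = [
    {"task_id": "single_crash", "score": 0.9999},
    {"task_id": "cascading_failure", "score": 0.65},
    {"task_id": "silent_degradation", "score": 0.55},
]

print("old:", failing_tasks(old_results))
print("new:", failing_tasks(new_results))
```

Under this check the previous example output would be flagged for the two tasks that scored exactly 1.0, while the updated output passes.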