Updated README
README.md
CHANGED
|
@@ -202,9 +202,9 @@ The reward function provides **dense, shaped signal** across the full trajectory
|
|
| 202 |
- `remediate("restart:payment-service")` taken → 0.25
|
| 203 |
- Resolved within 8 steps → +0.10 speed bonus
|
| 204 |
|
| 205 |
-
**Grader score:** sum of above, normalized to
|
| 206 |
|
| 207 |
-
**Expected baseline score:** 0.
|
| 208 |
|
| 209 |
---
|
| 210 |
|
|
@@ -221,9 +221,9 @@ The reward function provides **dense, shaped signal** across the full trajectory
|
|
| 221 |
- Did NOT first remediate a symptom service → +0.10 ordering bonus
|
| 222 |
- Resolved within 12 steps → +0.10 speed bonus
|
| 223 |
|
| 224 |
-
**Grader score:**
|
| 225 |
|
| 226 |
-
**Expected baseline score:** 0.
|
| 227 |
|
| 228 |
---
|
| 229 |
|
|
@@ -241,9 +241,9 @@ The reward function provides **dense, shaped signal** across the full trajectory
|
|
| 241 |
- Resolved/escalated within 15 steps → +0.10 speed bonus
|
| 242 |
- Correctly ignored noise actions (no spurious `escalate` calls) → +0.10
|
| 243 |
|
| 244 |
-
**Grader score:**
|
| 245 |
|
| 246 |
-
**Expected baseline score:** 0.
|
| 247 |
|
| 248 |
---
|
| 249 |
|
|
@@ -449,12 +449,12 @@ The script prints a per-task score bar and returns a JSON block with full breakd
|
|
| 449 |
"model_name": "llama-3.3-70b-versatile",
|
| 450 |
"seed": 42,
|
| 451 |
"results": [
|
| 452 |
-
{ "task_id": "single_crash", "score":
|
| 453 |
-
{ "task_id": "cascading_failure", "score": 0.65, "steps_taken":
|
| 454 |
-
{ "task_id": "silent_degradation", "score":
|
| 455 |
],
|
| 456 |
-
"average_score": 0.
|
| 457 |
-
"runtime_seconds":
|
| 458 |
}
|
| 459 |
```
|
| 460 |
|
|
@@ -466,14 +466,13 @@ Scores produced by `inference.py` using `llama-3.3-70b-versatile` via Groq API (
|
|
| 466 |
|
| 467 |
| Task | Difficulty | Score |
|
| 468 |
|---|---|---|
|
| 469 |
-
| Single Service Crash | Easy |
|
| 470 |
| Cascading Failure | Medium | 0.6500 |
|
| 471 |
-
| Silent Degradation | Hard |
|
| 472 |
-
| **Average** | | **0.
|
| 473 |
|
| 474 |
-
> **Note:**
|
| 475 |
-
>
|
| 476 |
-
> `payment-db` as root cause with `flush-cache:payment-db` remediation.
|
| 477 |
|
| 478 |
---
|
| 479 |
|
|
@@ -506,7 +505,7 @@ Scores produced by `inference.py` using `llama-3.3-70b-versatile` via Groq API (
|
|
| 506 |
- [ ] `POST /reset` returns valid observation
|
| 507 |
- [ ] `POST /step` with valid action returns observation + reward
|
| 508 |
- [ ] `GET /tasks` returns all 3 tasks with action schema
|
| 509 |
-
- [ ] `POST /grader` returns score in
|
| 510 |
- [ ] `POST /baseline` completes and returns scores for all 3 tasks
|
| 511 |
- [ ] HF Space URL responds to ping with 200
|
| 512 |
- [ ] Baseline script runs end-to-end with `HF_TOKEN` set
|
|
|
|
| 202 |
- `remediate("restart:payment-service")` taken → 0.25
|
| 203 |
- Resolved within 8 steps → +0.10 speed bonus
|
| 204 |
|
| 205 |
+
**Grader score:** sum of above, normalized to (0.0, 1.0). Deterministic: the same scenario seed produces identical grader output.
|
| 206 |
|
| 207 |
+
**Expected baseline score:** 0.70–0.85 (frontier LLM should solve this reliably)
|
| 208 |
|
| 209 |
---
|
| 210 |
|
|
|
|
| 221 |
- Did NOT first remediate a symptom service → +0.10 ordering bonus
|
| 222 |
- Resolved within 12 steps → +0.10 speed bonus
|
| 223 |
|
| 224 |
+
**Grader score:** (0.0, 1.0). Penalizes agents that treat symptoms rather than root cause.
|
| 225 |
|
| 226 |
+
**Expected baseline score:** 0.50–0.65 (requires multi-hop reasoning)
|
| 227 |
|
| 228 |
---
|
| 229 |
|
|
|
|
| 241 |
- Resolved/escalated within 15 steps → +0.10 speed bonus
|
| 242 |
- Correctly ignored noise actions (no spurious `escalate` calls) → +0.10
|
| 243 |
|
| 244 |
+
**Grader score:** (0.0, 1.0). This task is designed to challenge frontier models: it requires temporal reasoning across steps, noise filtering, and nuanced severity judgment.
|
| 245 |
|
| 246 |
+
**Expected baseline score:** 0.45–0.60 (even strong models struggle here)
|
| 247 |
|
| 248 |
---
|
| 249 |
|
|
|
|
| 449 |
"model_name": "llama-3.3-70b-versatile",
|
| 450 |
"seed": 42,
|
| 451 |
"results": [
|
| 452 |
+
{ "task_id": "single_crash", "score": 0.9999, "steps_taken": 4, "breakdown": {} },
|
| 453 |
+
{ "task_id": "cascading_failure", "score": 0.65, "steps_taken": 7, "breakdown": {} },
|
| 454 |
+
{ "task_id": "silent_degradation", "score": 0.55, "steps_taken": 5, "breakdown": {} }
|
| 455 |
],
|
| 456 |
+
"average_score": 0.7333,
|
| 457 |
+
"runtime_seconds": 97.4
|
| 458 |
}
|
| 459 |
```
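
As a sanity check on the JSON above, `average_score` should equal the mean of the three per-task scores. The sketch below recomputes it from the values shown; the helper name `average_score` and the rounding to four decimals are assumptions for illustration, not taken from `inference.py`.

```python
from statistics import fmean

def average_score(results):
    """Mean of the per-task scores, rounded to 4 decimals (assumed precision)."""
    scores = [r["score"] for r in results]
    return round(fmean(scores), 4)

# Per-task scores copied from the JSON block above.
results = [
    {"task_id": "single_crash", "score": 0.9999},
    {"task_id": "cascading_failure", "score": 0.65},
    {"task_id": "silent_degradation", "score": 0.55},
]

print(average_score(results))
```

With these inputs the mean is 2.1999 / 3, which rounds to the reported 0.7333.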
|
| 460 |
|
|
|
|
| 466 |
|
| 467 |
| Task | Difficulty | Score |
|
| 468 |
|---|---|---|
|
| 469 |
+
| Single Service Crash | Easy | 0.9999 |
|
| 470 |
| Cascading Failure | Medium | 0.6500 |
|
| 471 |
+
| Silent Degradation | Hard | 0.5500 |
|
| 472 |
+
| **Average** | | **0.7333** |
|
| 473 |
|
| 474 |
+
> **Note:** Scores are clamped to the open interval (0, 1), i.e. strictly between 0 and 1.
|
| 475 |
+
> A score of exactly 1.0 or 0.0 would fail Phase 2 validation.
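
One way to keep a score strictly inside (0, 1) is to clamp it with a small margin on both ends. The following is a minimal sketch under assumptions: the function name `clamp_open` and the margin `EPSILON = 1e-4` are hypothetical (the margin is only suggested by the reported maximum of 0.9999), and the actual grader may clamp differently.

```python
EPSILON = 1e-4  # assumed margin, not confirmed by the source

def clamp_open(score, eps=EPSILON):
    """Clamp a raw score into [eps, 1 - eps], which lies strictly inside (0, 1)."""
    return min(max(score, eps), 1.0 - eps)

for raw in (0.0, 0.65, 1.0, 1.3):
    clamped = clamp_open(raw)
    # Every clamped value must satisfy the strict bounds.
    assert 0.0 < clamped < 1.0
    print(raw, "->", clamped)
```

Scores already inside the margin pass through unchanged, so only boundary and out-of-range values are affected.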
|
|
|
|
| 476 |
|
| 477 |
---
|
| 478 |
|
|
|
|
| 505 |
- [ ] `POST /reset` returns valid observation
|
| 506 |
- [ ] `POST /step` with valid action returns observation + reward
|
| 507 |
- [ ] `GET /tasks` returns all 3 tasks with action schema
|
| 508 |
+
- [ ] `POST /grader` returns score in (0.0, 1.0), strictly between 0 and 1
|
| 509 |
- [ ] `POST /baseline` completes and returns scores for all 3 tasks
|
| 510 |
- [ ] HF Space URL responds to ping with 200
|
| 511 |
- [ ] Baseline script runs end-to-end with `HF_TOKEN` set
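
Two of the checklist items (scores for all 3 tasks, each strictly between 0 and 1) can be verified offline against a saved baseline report. This is a sketch only: it assumes the report has the same shape as the JSON block shown earlier in this diff (a `results` list of objects with `task_id` and `score`), and `check_baseline` is a hypothetical helper, not part of the repository.

```python
# Task IDs as they appear in the baseline JSON output.
EXPECTED_TASKS = {"single_crash", "cascading_failure", "silent_degradation"}

def check_baseline(report):
    """Return a list of problems found in a baseline report (empty means OK)."""
    problems = []
    results = report.get("results", [])
    seen = {r.get("task_id") for r in results}
    missing = EXPECTED_TASKS - seen
    if missing:
        problems.append(f"missing tasks: {sorted(missing)}")
    for r in results:
        score = r.get("score")
        # Reject non-numeric scores and anything on or outside the bounds.
        if not isinstance(score, (int, float)) or not (0.0 < score < 1.0):
            problems.append(
                f"{r.get('task_id')}: score {score!r} not strictly between 0 and 1"
            )
    return problems

report = {
    "results": [
        {"task_id": "single_crash", "score": 0.9999},
        {"task_id": "cascading_failure", "score": 0.65},
        {"task_id": "silent_degradation", "score": 0.55},
    ]
}
print(check_baseline(report) or "baseline report OK")
```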
|