Spaces: OGrohit/logtriage-env

OGrohit committed on Mar 30

Commit 1972eae (1 parent: 4c76730)

Update baseline scores, fix inference.py, remove junk file

Files changed (2)

README.md CHANGED

@@ -463,18 +463,12 @@ Scores produced by `inference.py` using `llama-3.3-70b-versatile` via Groq API (
 |---|---|---|
 | Single Service Crash | Easy | 1.0000 |
 | Cascading Failure | Medium | 0.6500 |
-| Silent Degradation | Hard | 0.0000 |
-| **Average** | | **0.5500** |
-Expected ranges based on design:
-- Single crash: 0.75–0.85 → **Exceeded (1.0000)**
-- Cascading failure: 0.45–0.60 → **Exceeded (0.6500)**
-- Silent degradation: 0.20–0.40 → **Below range (0.0000 — see note)**
-> **Note:** LLM-based scoring varies across runs due to non-deterministic model behavior.
-> The Silent Degradation task is hardest — it requires distinguishing signal from 60% noise
-> and making a nuanced P2 judgment (not an outage yet). Scores on this task can range
-> from 0.0 to 0.55 depending on the model's log parsing on that specific run.
 ---

 |---|---|---|
 | Single Service Crash | Easy | 1.0000 |
 | Cascading Failure | Medium | 0.6500 |
+| Silent Degradation | Hard | 1.0000 |
+| **Average** | | **0.8833** |
+> **Note:** Silent Degradation (Hard) requires distinguishing signal from 60% noise
+> and making a nuanced P2 judgment. The model successfully filtered noise and identified
+> `payment-db` as root cause with `flush-cache:payment-db` remediation.
 ---
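
As a quick sanity check on the updated table, the reported average can be recomputed from the three per-task scores. This is a minimal sketch: the `scores` dictionary and the `mean_score` helper are hypothetical names used for illustration and are not part of `inference.py`. It assumes the average is an unweighted mean, which matches both the old (0.5500) and new (0.8833) figures.

```python
# Per-task scores copied from the updated README table in this commit.
scores = {
    "Single Service Crash": 1.0000,
    "Cascading Failure": 0.6500,
    "Silent Degradation": 1.0000,
}


def mean_score(task_scores: dict) -> float:
    """Unweighted mean of the per-task scores (assumed averaging rule)."""
    return sum(task_scores.values()) / len(task_scores)


average = mean_score(scores)
print(f"{average:.4f}")  # prints 0.8833
```

Note that the removed README note described Silent Degradation scores ranging from 0.0 to 0.55 across runs, so the new average reflects one run rather than a guaranteed result.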

inference.py CHANGED

@@ -258,7 +258,7 @@ def run_task(client: OpenAI, task_id: str, seed: int = 42) -> dict:
             feedback = obs.get("last_action_feedback", "")
             actions_taken.append(action)
             print(f"    Step {steps_taken}: {action['action_type']}({action['value']}) "
-                  f"→ reward={reward:+.2f} | {feedback[:50]}")
         except Exception as e:
             print(f"    Step {steps_taken}: environment error: {e}")
             break
@@ -343,7 +343,7 @@ def main():
         score = result["score"]
         steps = result["steps_taken"]
         total += score
-        bar = "█" * int(score * 20) + "░" * (20 - int(score * 20))
         print(f"{task:<25} {score:.4f}  [{bar}]  ({steps} steps)")
         for k, v in result.get("breakdown", {}).items():
             print(f"  {k:<20} {v}")

             feedback = obs.get("last_action_feedback", "")
             actions_taken.append(action)
             print(f"    Step {steps_taken}: {action['action_type']}({action['value']}) "
+                  f"-> reward={reward:+.2f} | {feedback[:50]}")
         except Exception as e:
             print(f"    Step {steps_taken}: environment error: {e}")
             break
         score = result["score"]
         steps = result["steps_taken"]
         total += score
+        bar = "#" * int(score * 20) + "-" * (20 - int(score * 20))
         print(f"{task:<25} {score:.4f}  [{bar}]  ({steps} steps)")
         for k, v in result.get("breakdown", {}).items():
             print(f"  {k:<20} {v}")
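
The `bar` expression changed in `main()` can be read as a small helper. The sketch below is illustrative only: `render_bar` is a hypothetical name, and the clamping to [0, 1] is an addition that the original expression does not perform. The switch from `█`/`░` to `#`/`-` keeps the output pure ASCII; presumably this avoids encoding errors on consoles that cannot encode the block characters, though the commit message only says "fix inference.py".

```python
def render_bar(score: float, width: int = 20) -> str:
    """Render a fixed-width ASCII progress bar for a score in [0, 1].

    Clamping is an assumption added here so out-of-range scores
    cannot produce a bar of the wrong length.
    """
    clamped = min(max(score, 0.0), 1.0)
    filled = int(clamped * width)  # truncates, as in the original expression
    return "#" * filled + "-" * (width - filled)


# Example rows using the scores reported in the README table.
for task, score in [
    ("Single Service Crash", 1.0),
    ("Cascading Failure", 0.65),
    ("Silent Degradation", 1.0),
]:
    print(f"{task:<25} {score:.4f}  [{render_bar(score)}]")
```

Because every character in the bar is ASCII, the line encodes under any common console code page.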