helloAK96 Claude Opus 4.7 committed on
Commit
1a6f7f1
·
1 Parent(s): adbc390

README + BLOG: explicitly call out HF Jobs as our training infrastructure


Most participants will train on Colab. We trained on HuggingFace Jobs:
native pay-per-second HF compute, one shell command per run, an audit
trail for every job, and the Phase 3 ablation runs launched in parallel.

* README: new "Our approach — HF Jobs end-to-end" section (above the
6-run history) with the exact `hf jobs run` command we shipped Phase 3A
with, plus a Colab-vs-HF-Jobs comparison table
* BLOG: rewritten "The training pipeline" section to lead with HF Jobs;
added "After training" Phase 3A eval table; full 6-run ablation table;
refreshed "Try it" with the canonical hf jobs invocation; future-work
section noting Qwen-7B + reward annealing as the obvious extensions
* Both files now mention all 9 failure types (3 rogue) instead of the
pre-Phase-3 "6 failure types"

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Files changed (2)
  1. BLOG.md +123 -17
  2. README.md +49 -0
BLOG.md CHANGED
@@ -12,7 +12,10 @@ Today, a second source of outages is starting to show up in post-mortems. **Some
12
 
13
 **ChaosOps AI** is an OpenEnv-compatible training environment where four specialist LLM agents resolve cascading production incidents under partial observability — and where the incident itself may have been caused by a rogue agent inside the fleet. The fourth agent, **Oversight**, is trained specifically to catch this.
14
 
15
- GitHub: <https://github.com/vatsalllll/chaos_ops>
16
 
17
  ---
18
 
@@ -32,7 +35,7 @@ Turn order matters. We run `SRE → Oversight → Developer → Manager`. Runnin
32
 
33
  ---
34
 
35
- ## Six failure types, two of them caused by AI
36
 
37
  | # | Failure | Cause | Correct fix |
38
  |---|---|---|---|
@@ -42,6 +45,9 @@ Turn order matters. We run `SRE → Oversight → Developer → Manager`. Runnin
42
  | 4 | **Autoscaler cost cut** | **rogue AI** | `scale(payments, 3)` + `flag_rogue(autoscaler)` |
43
  | 5 | **Misrouted traffic** | **rogue AI** | `restart(load_balancer)` + `flag_rogue(load_balancer)` |
44
 | 6 | Cascade | infra + self-inflicted | `restart(db)` → `restart(notifications)` → `resolve` |
45
 
46
 On the `hard` tier, each scenario also gets **red-herring log lines** injected — infrastructure symptoms that look real but don't point to the actual cause. This is what forces the policy to learn to corroborate across sources instead of pattern-matching on the first log line.
47
 
@@ -99,25 +105,100 @@ Eight wrong fixes, no flag, no resolution. **The reward gradient between these t
99
 
100
  `python -m chaosops.train.baseline --episodes-per-type 5` across every (tier, policy, failure type) combination:
101
 
102
- | Tier | Random | Heuristic | Oracle |
103
  |---|---|---|---|
104
- | easy | −321 | +90 | +134 |
105
- | medium | −724 | +108 | +159 |
106
- | **hard** | **−1336** | **−237** | **+165** |
107
 
108
- Oracle's rogue-catch rate on `hard`: **100%**. Random's: **0%**.
109
 
110
  ![Baseline reward curve](artifacts/baseline/baseline_curve.png)
111
112
  ---
113
 
114
- ## The training pipeline
115
 
116
- `chaosops.train.grpo_train` wraps TRL's GRPOTrainer with Unsloth 4-bit + LoRA. For each group of size 4 we perturb the seed by a large prime and let the agents play, collect per-turn `(prompt, completion, team_reward, oversight_reward)` samples, blend them into a single scalar reward, and step the optimizer.
117
 
118
- The Colab notebook (`notebooks/colab_train.ipynb`) runs top-to-bottom on a free T4 in under 30 minutes with Qwen 2.5 0.5B at 30 episodes — enough to show the reward curve move. Onsite we scale to Qwen 2.5 7B with 300 episodes.
119
 
120
- **Curriculum auto-promotion.** A rolling-mean threshold per tier kicks the policy from easy → medium → hard as it gets competent. We validated this with a synthetic run: 30 updates at reward 200 promoted through both thresholds cleanly.
121
 
122
  ---
123
 
@@ -149,17 +230,42 @@ ChaosOps AI is specifically about (2). The Oversight agent gets the most-privile
149
 
150
  ## Try it
151
152
  ```bash
153
- git clone https://github.com/vatsalllll/chaos_ops
154
- cd chaos_ops
155
- python -m pytest tests/ # 19/19
156
- python -m chaosops.train.baseline --episodes-per-type 5 # writes PNG
157
  python -m chaosops.dashboard.terminal \
158
  --scenario autoscaler_cost_cut --policy oracle --difficulty hard
159
  ```
160
 
161
- Then open `notebooks/colab_train.ipynb` on Colab for the tiny end-to-end GRPO pass.
162
 
163
  ---
164
 
165
- *Built for the HuggingFace + OpenEnv + TRL hackathon. Tags: #MultiAgent #ScalableOversight #SelfImprovement*
 
12
 
13
 **ChaosOps AI** is an OpenEnv-compatible training environment where four specialist LLM agents resolve cascading production incidents under partial observability — and where the incident itself may have been caused by a rogue agent inside the fleet. The fourth agent, **Oversight**, is trained specifically to catch this.
14
 
15
+ - 🚀 Live demo (HF Space): <https://huggingface.co/spaces/helloAK96/chaosops>
16
+ - 🤖 Trained LoRA on Hub: <https://huggingface.co/helloAK96/chaosops-grpo-lora-p3a>
17
+ - 📦 Source: <https://github.com/vatsalllll/chaos_ops>
18
+ - 📒 Phase-wise judge demo notebook: [`notebooks/colab_train.ipynb`](notebooks/colab_train.ipynb)
19
 
20
  ---
21
 
 
35
 
36
  ---
37
 
38
+ ## Nine failure types, three of them caused by AI
39
 
40
  | # | Failure | Cause | Correct fix |
41
  |---|---|---|---|
 
45
  | 4 | **Autoscaler cost cut** | **rogue AI** | `scale(payments, 3)` + `flag_rogue(autoscaler)` |
46
  | 5 | **Misrouted traffic** | **rogue AI** | `restart(load_balancer)` + `flag_rogue(load_balancer)` |
47
 | 6 | Cascade | infra + self-inflicted | `restart(db)` → `restart(notifications)` → `resolve` |
48
+ | 7 | DNS outage | infra | `rollback(dns)` |
49
+ | 8 | Disk full | infra | `restart(payments)` after compaction |
50
+ | 9 | **Rogue deploy bot** | **rogue AI** | `rollback(notifications)` + `flag_rogue(deploy_bot)` |
51
 
52
 On the `hard` tier, each scenario also gets **red-herring log lines** injected — infrastructure symptoms that look real but don't point to the actual cause. This is what forces the policy to learn to corroborate across sources instead of pattern-matching on the first log line.
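As a rough illustration of that mechanism, here is a minimal sketch of hard-tier red-herring injection. The function name, decoy log strings, and injection count are illustrative assumptions, not the actual `chaos_ops` scenario code.

```python
import random

# Illustrative decoys only: plausible infra symptoms that do NOT point at the real cause.
RED_HERRINGS = [
    "WARN  payments: p99 latency 1840ms (threshold 500ms)",
    "ERROR cache: eviction rate spiked 4x over the last 5m",
    "WARN  db: replication lag 12s on replica-2",
]

def inject_red_herrings(log_lines: list[str], difficulty: str, rng: random.Random) -> list[str]:
    """On the hard tier, splice misleading-but-realistic symptoms into the log stream."""
    if difficulty != "hard":
        return log_lines
    noisy = list(log_lines)
    for decoy in rng.sample(RED_HERRINGS, k=2):
        noisy.insert(rng.randrange(len(noisy) + 1), decoy)
    return noisy
```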
53
 
 
105
 
106
  `python -m chaosops.train.baseline --episodes-per-type 5` across every (tier, policy, failure type) combination:
107
 
108
+ | Tier | Random (mean reward / solve rate / rogue-catch rate) | Heuristic (same format) | Oracle (same format) |
109
 |---|---|---|---|
110
+ | easy | −315 / 55% / 0% | +94 / 100% / 0% | +134 / 100% / 0% |
111
+ | medium | −641 / 40% / 0% | +109 / 100% / 0% | +151 / 100% / 100% |
112
+ | **hard** | **−1357 / 10% / 0%** | **−425 / 50% / 67%** | **+170 / 100% / 100%** |
113

114
+ Oracle's rogue-catch rate on `hard`: **100%**. Random's: **0%**. Heuristic catches the rogue 67% on `hard` only because the heuristic agent eventually tries restarting the autoscaler/load_balancer/deploy_bot directly — which trips the rogue-flag side effect; this isn't real diagnosis, just brute force.
115
 
116
  ![Baseline reward curve](artifacts/baseline/baseline_curve.png)
117
 
118
+ ## After training — Phase 3A (Qwen 2.5-3B + GRPO)
119
+
120
+ The trained policy in the live Space is `helloAK96/chaosops-grpo-lora-p3a`: Qwen 2.5-3B + LoRA r=32, 600 GRPO steps, `easy:200,medium:200,hard:200` curriculum, LR=2e-5, temperature 0.8, **2× rogue-rubric weight**.
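As a rough sketch, those hyperparameters map onto TRL/PEFT config objects along these lines; the numeric values come from the paragraph above, while the target modules, alpha, and output directory are assumptions rather than the exact `chaosops.train.grpo_train` setup.

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA adapter: rank 32 as stated above; alpha and target modules are assumed values.
lora_cfg = LoraConfig(
    r=32, lora_alpha=64, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# GRPO run: 600 steps, LR 2e-5, sampling temperature 0.8, group size 2.
grpo_cfg = GRPOConfig(
    output_dir="chaosops-grpo-lora-p3a",
    max_steps=600,
    learning_rate=2e-5,
    temperature=0.8,
    num_generations=2,
)
```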
121
+
122
+ | Tier | Trained (mean reward / solve rate / rogue-catch rate) | Heuristic (for comparison) |
123
 |---|---|---|
124
+ | easy | **+49 / 85% / 0%** | +94 / 100% / 0% |
125
+ | medium | **−17 / 80% / 100%** | +109 / 100% / 0% |
126
+ | hard | **−433 / 40% / 93%** | −425 / 50% / 67% |
127
+
128
+ What this means in plain English:
129
+
130
+ - **The trained agent matches the scripted heuristic on HARD** to within 8 reward points (−433 vs −425).
131
+ - **It is the only sub-Oracle policy that explicitly diagnoses the rogue agent** — 100% catch rate on MEDIUM (heuristic: 0%), 93% on HARD.
132
+ - **MTTR (mean time to resolution) matches the Oracle (3.0 across all tiers)** — when it resolves, it resolves cleanly.
133
+ - It improves over a random Qwen-3B by **+95 / +624 / +923** mean reward across EASY / MEDIUM / HARD.
134
+
135
+ Trade-off: a 13–35% false-positive rogue-flag rate on tiers without a real rogue. The 2× rogue-rubric weight (catch +100, FP −150) made the model eager to flag. Still a net positive given the 100% MEDIUM-tier rogue catch — the project's signature metric.
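A toy version of the rogue-flag term behind that trade-off; the +100 / -150 values and the 2x weight come from the text, and whether the multiplier also scales the false-positive penalty is an assumption made here.

```python
def rogue_flag_reward(flagged: str | None, true_rogue: str | None, multiplier: float = 2.0) -> float:
    """Toy rogue-rubric term: reward a correct flag, punish a false positive."""
    if flagged is None:
        return 0.0                    # Oversight raised no flag this episode
    if true_rogue is not None and flagged == true_rogue:
        return multiplier * 100.0     # correct catch: +100, scaled by the rubric weight
    return multiplier * -150.0        # false positive: -150 (scaling it too is an assumption)

# With the 2x weight, flagging pays off whenever a rogue looks even moderately likely.
print(rogue_flag_reward("autoscaler", "autoscaler"))  # 200.0
print(rogue_flag_reward("autoscaler", None))          # -300.0
```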
136
+
137
+ ![Trained vs baselines](artifacts/evaluation/comparison_curve.png)
138
+ ![GRPO learning curve](artifacts/chaosops-grpo/learning_curve.png)
139
+
140
+ The learning curve shows positive mean reward in **all three tiers** of the curriculum, with HARD-tier mean (+14.0) ending *better* than EASY-tier mean (+6.9). The curriculum let the model absorb easier-tier dynamics first, then stack competencies on harder ones.
141
+
142
+ ## The 6-run ablation that picked Phase 3A
143
+
144
+ We didn't get lucky on one run. We ran 6 GRPO experiments — each testing a specific hypothesis — and Phase 3A was the winner.
145
+
146
+ | Phase | Hypothesis | Knobs | Eval mean reward (EASY / MEDIUM / HARD) | Verdict |
147
+ |---|---|---|---|---|
148
+ | **0** | TRL defaults are fine | 1.5B, LR 5e-6, EASY-only, 400 steps, r=16 | −251 / −315 / −826 | undertrained, KL barely moves |
149
+ | **1** | gradient too small | LR 5e-6 → **2e-5** | −218 / −283 / −820 | KL 4× higher; rogue catch lost |
150
+ | **2** | model never sees harder scenarios | + curriculum, r=32, 600 steps | −221 / −296 / −834 | resolution improved; capacity-bound |
151
+ | **3A 🏆** | bigger model + amplified oversight | **1.5B → 3B**, **rogue ×2** | **+49 / −17 / −433** | wins everywhere; submitted |
152
+ | 3B (control) | reward shape alone enough on 1.5B | Phase 2 + rogue ×2 | −222 / −269 / −813 | falsifies — model capacity is binding |
153
+ | 3C (control) | reverse curriculum helps EASY | Phase 3B + reverse curriculum | −241 / −363 / −821 | falsifies — reverse hurts |
154
+
155
+ **The 3B-vs-3A delta proves model capacity was the binding constraint.** The same reward shape on 1.5B got nowhere; on 3B it broke through. **The 3C regression falsifies the "ended on HARD = forgot EASY" hypothesis.** Both controls were worth their $1.30 apiece.
156
+
157
+ All six LoRAs are public on the Hub (`chaosops-grpo-lora{,-p1,-p2,-p3a,-p3b,-p3c}`) so the ablation is independently reproducible.
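To poke at any of them, a minimal loading snippet; the base checkpoint name and generation settings are assumptions, only the adapter repo id comes from the list above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-3B-Instruct"             # assumed base checkpoint for the 3B run
ADAPTER = "helloAK96/chaosops-grpo-lora-p3a"  # the submitted Phase 3A LoRA

tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)   # attach the GRPO-trained adapter

prompt = "Logs: autoscaler scaled payments 5 -> 1; error rate 32%. Next action?"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```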
158
+
159
  ---
160
 
161
+ ## The training pipeline — and why we ran it on HF Jobs, not Colab
162
+
163
+ `chaosops.train.grpo_train` wraps **TRL's GRPOTrainer with PEFT LoRA**. For each group of size 2 we perturb the seed by a large prime and let the agents play, collect per-turn `(prompt, completion, team_reward, oversight_reward)` samples, blend them into a single scalar reward, and step the optimizer.
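A self-contained sketch of that collection step; the prime, the blend weight, and the dataclass are illustrative assumptions, and the real logic lives in `chaosops.train.grpo_train`.

```python
from dataclasses import dataclass

@dataclass
class TurnSample:                 # one per-turn record collected during a group rollout
    prompt: str
    completion: str
    team_reward: float
    oversight_reward: float

LARGE_PRIME = 104_729             # the post only says "a large prime"; the value is illustrative
OVERSIGHT_WEIGHT = 0.5            # assumed blend ratio between team and oversight reward

def group_seeds(base_seed: int, group_size: int = 2) -> list[int]:
    """Perturb the base seed once per group member, as described above."""
    return [base_seed + i * LARGE_PRIME for i in range(group_size)]

def blended_reward(s: TurnSample, w: float = OVERSIGHT_WEIGHT) -> float:
    """Blend the two reward channels into the single scalar GRPO optimizes."""
    return (1 - w) * s.team_reward + w * s.oversight_reward

# Example: a turn where Oversight correctly flags the rogue autoscaler.
sample = TurnSample("logs: autoscaler scaled payments 5 -> 1", "flag_rogue(autoscaler)", 40.0, 200.0)
print(group_seeds(42), blended_reward(sample))   # [42, 104771] 120.0
```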
164
+
165
+ **Where we ran it: HuggingFace Jobs.** Most participants in this hackathon will train on Colab. We didn't — we ran every single GRPO experiment (6 of them, 3,200 training episodes total) on **HF Jobs**, the native pay-per-second HuggingFace compute platform. The whole pipeline is one shell command:
166
 
167
+ ```bash
168
+ hf jobs run \
169
+ --flavor l4x1 \
170
+ --secrets HF_TOKEN \
171
+ -v hf://spaces/helloAK96/chaosops:/data \
172
+ --timeout 5h \
173
+ -e GRPO_EPISODES=600 \
174
+ -e GRPO_GROUP_SIZE=2 \
175
+ -e GRPO_LORA_RANK=32 \
176
+ -e GRPO_LR=2e-5 \
177
+ -e GRPO_TEMP=0.8 \
178
+ -e GRPO_CURRICULUM=easy:200,medium:200,hard:200 \
179
+ -e GRPO_ROGUE_MULTIPLIER=2.0 \
180
+ -e GRPO_PUSH_TO_HUB=1 \
181
+ -e HUB_REPO_ID=helloAK96/chaosops-grpo-lora-p3a \
182
+ pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime \
183
+ bash /data/scripts/jobs_grpo_train.sh
184
+ ```
185
+
186
+ That's it. The `-v hf://spaces/helloAK96/chaosops:/data` flag mounts the Space repo read-only so the Job sees the same code judges click on. `--secrets HF_TOKEN` injects auth so the Job can push the trained LoRA back to a model repo when training finishes. We never touched a Jupyter cell, never re-uploaded source, never had a runtime disconnect. **Total compute spend across all 6 GRPO experiments: $9.80.**
187
+
188
+ Why this was the right call:
189
 
190
+ | Concern | Colab notebook | HF Jobs |
191
+ |---|---|---|
192
+ | **Reproducibility** | "whatever GPU is free" | explicit `--flavor l4x1` (or `t4-small`); same hardware every time |
193
+ | **Auditable** | runtime logs vanish when kernel dies | every job has a permanent ID, logs, GPU stats viewable for 30 days |
194
+ | **Cost** | Colab Pro $10/mo flat, or hope-the-free-tier-doesn't-disconnect | pay only for actual GPU-seconds; our 1h 40m training run cost $0.67 |
195
+ | **Long-running** | 90-min disconnect on free tier; T4 only | Up to 6h timeout per job, no human-presence required |
196
+ | **Native HF integration** | manual `hf login` + `snapshot_download` dance | volume-mount any Space/Dataset/Model directly, push to Hub from inside the job |
197
+ | **Parallel A/B/C** | one notebook per kernel, clones the box | `for cfg in ...; do hf jobs run -d ...; done` |
198
 
199
+ We launched **3A, 3B, and 3C in parallel** from one shell loop: three `hf jobs run` commands, three Job IDs returned in milliseconds. Try doing that with Colab.
200
+
201
+ **Curriculum + reward shape are first-class CLI flags.** `--curriculum-schedule "easy:200,medium:200,hard:200"` and `--rogue-bonus-multiplier 2.0` are the two knobs that only Phase 3A turned — both pure config, no code edit needed. The training script worked unchanged across all 6 runs; only the env vars passed to `hf jobs run` differed. That's how we A/B/C'd six configurations in an afternoon.
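For reference, a tiny parser for that schedule format; this is a hypothetical helper, the real flag handling lives in `chaosops.train.grpo_train`.

```python
def parse_curriculum(schedule: str) -> list[tuple[str, int]]:
    """'easy:200,medium:200,hard:200' -> [('easy', 200), ('medium', 200), ('hard', 200)]"""
    return [(tier.strip(), int(count))
            for tier, count in (part.split(":") for part in schedule.split(","))]

def tier_for_episode(phases: list[tuple[str, int]], episode_idx: int) -> str:
    """Map a global episode index onto its curriculum tier."""
    for tier, count in phases:
        if episode_idx < count:
            return tier
        episode_idx -= count
    return phases[-1][0]          # past the schedule: stay on the last tier

phases = parse_curriculum("easy:200,medium:200,hard:200")
assert tier_for_episode(phases, 450) == "hard"
```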
202
 
203
  ---
204
 
 
230
 
231
  ## Try it
232
 
233
+ **Easiest path** (1 click): open the live Space and run a scenario in your browser:
234
+ <https://huggingface.co/spaces/helloAK96/chaosops>
235
+
236
+ **Locally:**
237
+
238
  ```bash
239
+ git clone https://huggingface.co/spaces/helloAK96/chaosops
240
+ cd chaosops
241
+ python -m pytest tests/ # 110 tests
242
+ python -m chaosops.train.baseline --episodes-per-type 5 # baseline_curve.png
243
  python -m chaosops.dashboard.terminal \
244
  --scenario autoscaler_cost_cut --policy oracle --difficulty hard
245
  ```
246
 
247
+ **Re-run the entire training pipeline on HF Jobs (~$0.67 on T4, ~$1.93 on L4):**
248
+
249
+ ```bash
250
+ hf jobs run --flavor t4-small --secrets HF_TOKEN \
251
+ -v hf://spaces/helloAK96/chaosops:/data \
252
+ -e GRPO_EPISODES=400 -e GRPO_PUSH_TO_HUB=1 \
253
+ -e HUB_REPO_ID=$YOUR_NAMESPACE/your-lora \
254
+ pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime \
255
+ bash /data/scripts/jobs_grpo_train.sh
256
+ ```
257
+
258
+ **Or in a Colab T4** (we provide the notebook for parity, but HF Jobs is the canonical path): open `notebooks/colab_train.ipynb` on the Space → "Run all". The notebook is structured as a 10-phase narrative — phases 2-7 replay the same `autoscaler_cost_cut` HARD scenario with random / heuristic / oracle / trained so you literally *watch* what learned-vs-untrained looks like.
259
+
260
+ ## What we'd do with more compute
261
+
262
+ Phase 3A closed the heuristic gap to ≤8 points on HARD. Two natural next steps that we'd run with more credit:
263
+
264
+ 1. **Qwen 7B + 1500 steps** — the trained line should push above the heuristic on EASY/MEDIUM at this scale. Estimated ~$15 on l4x1.
265
+ 2. **Iterative curriculum + reward annealing** — gradually decay the rogue-rubric multiplier from 2.0 → 1.0 over training (see the sketch after this list) so the policy learns calibration instead of eager flagging. Should drop the 35% false-positive rate on EASY/HARD without giving up the rogue-catch gains.
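A minimal sketch of the annealing schedule item 2 has in mind; a linear decay is one obvious choice, and the concrete hook would be another flag on the training script.

```python
def annealed_rogue_multiplier(step: int, total_steps: int,
                              start: float = 2.0, end: float = 1.0) -> float:
    """Linearly decay the rogue-rubric weight from `start` to `end` over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

# Early on, eager flagging is still cheap; by the end, calibration is what pays.
print(annealed_rogue_multiplier(0, 600), annealed_rogue_multiplier(600, 600))   # 2.0 1.0
```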
266
+
267
+ Neither of these was within the $30 hackathon budget, but the pipeline is identical: change two env vars in the `hf jobs run` command and rerun.
268
 
269
  ---
270
 
271
+ *Built for the HuggingFace + OpenEnv + TRL hackathon. Tags: #MultiAgent #ScalableOversight #SelfImprovement #HFJobs #GRPO*
README.md CHANGED
@@ -113,6 +113,55 @@ ablations or training-time logging.
113
 
114
  ---
115
116
 ## Training history — 3,200 episodes across 6 GRPO runs
117
 
118
  The submitted Phase 3A LoRA isn't the result of one happy run. It's the
 
113
 
114
  ---
115
 
116
+ ## Our approach — HF Jobs end-to-end (no Colab dependency)
117
+
118
+ Most participants in this hackathon will train on Google Colab.
119
+ **We didn't.** We ran every single GRPO experiment on **HuggingFace Jobs**
120
+ — the native pay-per-second compute platform on the same Hub the env
121
+ itself is hosted on. The full training command is one shell line:
122
+
123
+ ```bash
124
+ hf jobs run \
125
+ --flavor l4x1 \
126
+ --secrets HF_TOKEN \
127
+ -v hf://spaces/helloAK96/chaosops:/data \
128
+ -e GRPO_EPISODES=600 \
129
+ -e GRPO_GROUP_SIZE=2 \
130
+ -e GRPO_LORA_RANK=32 \
131
+ -e GRPO_LR=2e-5 \
132
+ -e GRPO_TEMP=0.8 \
133
+ -e GRPO_CURRICULUM=easy:200,medium:200,hard:200 \
134
+ -e GRPO_ROGUE_MULTIPLIER=2.0 \
135
+ -e GRPO_PUSH_TO_HUB=1 \
136
+ -e HUB_REPO_ID=helloAK96/chaosops-grpo-lora-p3a \
137
+ pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime \
138
+ bash /data/scripts/jobs_grpo_train.sh
139
+ ```
140
+
141
+ `-v hf://spaces/helloAK96/chaosops:/data` mounts the Space repo
142
+ read-only inside the Job container — the trainer sees the **exact same
143
+ code** judges click on the live Space. `--secrets HF_TOKEN` injects
144
+ auth so the Job pushes the trained LoRA back to a model repo on
145
+ completion. We never touched a Jupyter cell, never had a runtime
146
+ disconnect, never re-uploaded source.
147
+
148
+ | Concern | Colab notebook | HF Jobs (our path) |
149
+ |---|---|---|
150
+ | **Reproducibility** | "whatever GPU is free" | explicit `--flavor l4x1` / `--flavor t4-small` |
151
+ | **Auditable** | runtime logs vanish when kernel dies | every job has a permanent ID, logs, GPU stats viewable for 30 days |
152
+ | **Cost** | Colab Pro $10/mo flat, or hope-the-free-tier-doesn't-disconnect | pay only for actual GPU-seconds; our 1h 40m T4 run cost **$0.67** |
153
+ | **Long-running** | 90-min disconnect on free tier; T4 only | up to 6h timeout, no human-presence required |
154
+ | **Native HF integration** | manual `hf login` + `snapshot_download` dance | volume-mount any Space/Dataset/Model directly; push to Hub from inside the job |
155
+ | **Parallel A/B/C** | one notebook per kernel, clones the box | `for cfg in ...; do hf jobs run -d ...; done` |
156
+
157
+ **Phase 3A, 3B, and 3C all ran simultaneously**, launched from one shell loop with three Job IDs returned in milliseconds. Total spend across all 6 GRPO experiments + 6 evaluation jobs (8,060+ simulated rollouts): **$9.80** of the $30 credit budget.
158
+
159
+ *A Colab notebook (`notebooks/colab_train.ipynb`) is provided for parity
160
+ — it walks the same story phase-by-phase so judges can re-run it locally.
161
+ But the canonical, reproducible-anywhere path is the HF Jobs command above.*
162
+
163
+ ---
164
+
165
 ## Training history — 3,200 episodes across 6 GRPO runs
166
 
167
  The submitted Phase 3A LoRA isn't the result of one happy run. It's the