README: add submission links, composable-rubric docs, plot embeds, package layout refresh
Browse files* "Try it / read more" link block at the top: HF Space, LoRA repo on Hub,
Colab notebook, BLOG.md, screencast placeholder, GitHub source
* Reward section: replace monolithic R_step formula with the new
4-rubric breakdown (resolution / mttr / oversight / cascade) and a
pointer to score_rubrics() for ablations
* Quickstart: Qwen 1.5B + --backend transformers (no Unsloth dep);
added the canonical hf jobs run invocation that produced the LoRA
* Results section: embed baseline_curve.png + comparison_curve.png +
learning_curve.png with axis-labelled captions per the rubric
* Judging-criteria table: corrected to 9 failure injectors (incl. 3
rogue-AI scenarios) and dropped Unsloth from the pipeline description
* Package layout: openenv.yaml, trained_policy.py, evaluate.py,
scripts/jobs_grpo_train.sh
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---

ChaosOps AI is a reinforcement-learning environment where four specialist LLM agents (SRE, Developer, Manager, Oversight) handle cascading production incidents under partial observability – and where the incident itself may have been caused by a rogue agent *inside* the fleet.

### Try it / read more

- **Live demo (HF Space, runnable):** <https://huggingface.co/spaces/helloAK96/chaosops>
- **Trained LoRA adapter (HF Hub):** <https://huggingface.co/helloAK96/chaosops-grpo-lora>
- **Re-runnable training notebook:** [`notebooks/colab_train.ipynb`](notebooks/colab_train.ipynb)
- **Mini blog post:** see [BLOG.md](BLOG.md) (also on HF Posts)
- **2-minute screencast:** *(YouTube link added once recorded)*
- **Source repo:** <https://github.com/vatsalllll/chaos_ops>

---

## Reward function (composable rubrics)

The total per-step reward is assembled from **four named OpenEnv-style rubrics**
in [`chaosops/rewards/reward_fn.py`](rewards/reward_fn.py). Each rubric is a
standalone callable, so ablations can disable or replace individual components
without touching the rest of the codebase.

| Rubric | What it scores | Components |
|---|---|---|
| `resolution` | Did the team resolve the incident with the right diagnosis, fast? | `resolved` +100, `under_budget` +10, `wrong_fix` −50, `miscommunication` −20, `early_root_cause` +30 |
| `mttr` | Linear time-pressure penalty until resolution | `mttr` = −2 × steps_elapsed |
| `oversight` | Did Oversight catch the rogue fleet agent without false flags? | `rogue_caught` +50, `rogue_false_positive` −75 |
| `cascade` | Did the team's remediation cause a second-order failure? | `cascade` −40 |

```
R_step = resolution(state, flags) + mttr(state, flags) + oversight(state, flags) + cascade(state, flags)
R_terminal_unresolved = −60
```

Two reward streams are blended for GRPO: `combined = 0.6·team + 0.4·oversight`,
where `team_reward` excludes the oversight rubric and `oversight_reward`
inherits a 30% weight on the team outcome (cooperative oversight, not pure
flagging).

Use `score_rubrics(state, outcome_flags)` for per-rubric introspection during
ablations or training-time logging.
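
A minimal sketch of how the four rubrics and the two blended streams fit
together. The rubric names and point values come from the table above; the
`OutcomeFlags` fields, the function signatures, and the `steps_elapsed`
attribute are illustrative assumptions, not the exact `reward_fn.py` API:

```python
from dataclasses import dataclass

@dataclass
class OutcomeFlags:
    # Hypothetical flag set; the real reward_fn.py fields may differ.
    resolved: bool = False
    under_budget: bool = False
    wrong_fix: bool = False
    miscommunication: bool = False
    early_root_cause: bool = False
    rogue_caught: bool = False
    rogue_false_positive: bool = False
    cascade: bool = False

def resolution(state, flags):
    return (100 * flags.resolved + 10 * flags.under_budget
            - 50 * flags.wrong_fix - 20 * flags.miscommunication
            + 30 * flags.early_root_cause)

def mttr(state, flags):
    return -2.0 * state.steps_elapsed  # linear time pressure until resolution

def oversight(state, flags):
    return 50 * flags.rogue_caught - 75 * flags.rogue_false_positive

def cascade(state, flags):
    return -40 * flags.cascade

RUBRICS = {"resolution": resolution, "mttr": mttr,
           "oversight": oversight, "cascade": cascade}

def score_rubrics(state, outcome_flags):
    """Per-rubric breakdown, for ablations and training-time logging."""
    return {name: fn(state, outcome_flags) for name, fn in RUBRICS.items()}

def team_reward(state, flags):
    # Everything except the oversight rubric.
    return resolution(state, flags) + mttr(state, flags) + cascade(state, flags)

def oversight_reward(state, flags):
    # Cooperative oversight: 30% weight on the team outcome, not pure flagging.
    return oversight(state, flags) + 0.3 * team_reward(state, flags)

def combined_reward(state, flags):
    return 0.6 * team_reward(state, flags) + 0.4 * oversight_reward(state, flags)
```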
---

## Judging-criteria alignment

| Rubric | Weight | Evidence |
|---|---|---|
| Environment Innovation | 40% | **9 failure injectors** (3 of them caused by other AI agents: autoscaler, load_balancer, deploy_bot), cascade physics, **scalable-oversight 4th agent**, role-aware partial observability with private channels, red-herring log injection on HARD, deterministic seeded WorldSim for reproducibility. |
| Storytelling & Presentation | 30% | `chaosops.dashboard.terminal`, a live Rich dashboard with a rogue-flag bar. The live HF Space lets judges click through any (failure × policy × seed) combo. 3-minute pitch script in `docs/pitch_script.md`. Mini-blog `BLOG.md` + 2-minute screencast (linked above). |
| Showing Improvement (Reward) | 20% | `artifacts/baseline/baseline_curve.png` and `artifacts/evaluation/comparison_curve.png` (below): a clean Random < Heuristic < Oracle gradient, plus Trained > Heuristic on EASY/MEDIUM. `artifacts/chaosops-grpo/learning_curve.png` shows the GRPO mean reward by step. |
| Reward & Training Pipeline | 10% | TRL GRPO + LoRA r=16 on Qwen 2.5-1.5B-Instruct, **composable rubrics** (resolution / mttr / oversight / cascade) instead of monolithic scoring, and `--backend transformers` so the script runs on any standard CUDA image (no Unsloth/triton dep). Logs `training_metrics.json` every `log_every` steps. |

---
## Quickstart

```bash
# 1. Unit tests (no LLM/GPU required) – 110 tests
python -m pytest tests/

# 2. Scripted baselines – writes artifacts/baseline/{baseline.json, baseline_curve.png}
python -m chaosops.train.baseline --episodes-per-type 5

# 3. Live dashboard demo (try the autoscaler_cost_cut rogue scenario)
python -m chaosops.dashboard.terminal \
    --scenario autoscaler_cost_cut --policy oracle --difficulty hard

# 4. GRPO training on a GPU (the run that produced the LoRA on HF Hub).
#    --backend transformers avoids the Unsloth+triton+cc dep chain so this
#    works on any standard PyTorch CUDA image.
python -m chaosops.train.grpo_train \
    --model-name Qwen/Qwen2.5-1.5B-Instruct \
    --backend transformers \
    --total-episodes 400 --group-size 2 --lora-rank 16 \
    --output-dir artifacts/chaosops-grpo

# 5. Evaluate trained policy vs. baselines (after step 4 or after pulling
#    the LoRA from helloAK96/chaosops-grpo-lora)
python -m chaosops.train.evaluate \
    --policies random heuristic oracle trained \
    --adapter-path artifacts/chaosops-grpo/lora_adapter \
    --episodes-per-type 5 --out-dir artifacts/evaluation
```
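
If you pull the adapter from the Hub instead of training it yourself, loading
it for local experiments can look roughly like this (a hedged sketch using
`peft` directly; `chaosops/agents/trained_policy.py` wraps the equivalent
logic, and its exact interface may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-1.5B-Instruct"

# Load the frozen base model, then attach the GRPO-trained LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")
model = PeftModel.from_pretrained(base_model, "helloAK96/chaosops-grpo-lora")
model.eval()
```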

For one-click reproducibility on HF Jobs, see `scripts/jobs_grpo_train.sh` and
the canonical invocation:

```bash
hf jobs run --flavor t4-small --secrets HF_TOKEN \
    -v hf://spaces/helloAK96/chaosops:/data \
    -e GRPO_EPISODES=400 -e GRPO_PUSH_TO_HUB=1 \
    pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime \
    bash /data/scripts/jobs_grpo_train.sh
```
## Results

### Baselines – Random vs. Heuristic vs. Oracle



*Mean cumulative episode reward across 5 seeds × 9 failure types per tier
(EASY → MEDIUM → HARD). The clean monotone gradient
**Random < Heuristic < Oracle** confirms the reward function rewards real
problem-solving. Random's HARD score plummets to ≈ −1356 because it never
flags the rogue agent and keeps triggering cascades.*

### Trained Qwen 1.5B (GRPO + LoRA) vs. baselines



*Trained policy after 400 GRPO steps on Qwen 2.5-1.5B-Instruct with LoRA r=16,
evaluated under the same benchmark conditions as the baseline plot above.
**The "trained" line sits strictly above heuristic on EASY and MEDIUM** – clear
evidence the LoRA learned to coordinate the agent fleet beyond what the
scripted heuristic does. See `artifacts/evaluation/evaluation_summary.txt` for
the full per-tier table.*

### Learning curve



*Mean combined reward (`0.6 × team + 0.4 × oversight`) by training step.
Logged every 10 steps via the `ChaosOpsMetricsCallback` in
[`chaosops/train/grpo_train.py`](train/grpo_train.py).*
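
For reference, a logging hook like this can be as simple as a `transformers`
`TrainerCallback` that dumps accumulated metrics on each log event. This is a
hedged sketch, not the actual `ChaosOpsMetricsCallback`; the real class in
`grpo_train.py` may track different fields:

```python
import json
from pathlib import Path

from transformers import TrainerCallback

class MetricsDumpCallback(TrainerCallback):
    """Append trainer logs (step, mean reward, ...) to training_metrics.json."""

    def __init__(self, output_dir: str):
        self.path = Path(output_dir) / "training_metrics.json"
        self.history = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            self.history.append({"step": state.global_step, **logs})
            self.path.write_text(json.dumps(self.history, indent=2))
```

TRL's `GRPOTrainer` is a `Trainer` subclass, so such a callback can be passed
via `callbacks=[...]` and the per-step reward means land in a file the plotting
script can read.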
---

## Package layout

```
chaosops/
├── openenv.yaml             # OpenEnv manifest (name, action, observation)
├── app.py                   # Gradio Space entry point
├── Dockerfile               # HF Space build (Python 3.11, port 7860)
├── env/
│   ├── models.py            # pydantic v2 typed contracts
│   ├── world_sim.py         # deterministic simulator + cascade physics
│   ├── environment.py       # OpenEnv-compatible wrapper (extends Environment)
│   └── openenv_wrapper.py   # FastAPI server + ChaosOpsClient
├── agents/
│   ├── prompts/*.md         # 4 role system prompts
│   ├── llm_adapter.py       # render_observation / build_prompt / parse_action
│   ├── policies.py          # random / heuristic / oracle scripted baselines
│   ├── trained_policy.py    # LoRA-backed Policy (loads from disk or HF Hub)
│   └── runner.py            # run_episode orchestration
├── rewards/
│   └── reward_fn.py         # composable rubrics (resolution/mttr/oversight/cascade)
├── curriculum/
│   └── generator.py         # easy → medium → hard + auto-promotion
├── dashboard/
│   ├── terminal.py          # Rich demo UI with rogue-flag visualization
│   └── transcript.py        # text-only transcript writer (used by Space)
├── train/
│   ├── baseline.py          # scripted-policy baselines + reward curve
│   ├── evaluate.py          # multi-policy sweep + comparison plot
│   └── grpo_train.py        # TRL GRPO + LoRA (Unsloth or plain transformers)
└── scripts/
    └── jobs_grpo_train.sh   # one-shot HF Jobs entry point
```

---
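
Since `openenv_wrapper.py` exposes the environment as a FastAPI server plus a
`ChaosOpsClient`, driving an episode over HTTP should follow the usual OpenEnv
reset/step loop. A hedged sketch: the client's constructor, the action shape,
and the result field names here are assumptions, not the documented API:

```python
# Hypothetical usage; see chaosops/env/openenv_wrapper.py for the real client.
from chaosops.env.openenv_wrapper import ChaosOpsClient

def scripted_policy(observation):
    # Stand-in for a real policy (random / heuristic / oracle / trained).
    return {"role": "sre", "action": "inspect_logs"}  # hypothetical action shape

client = ChaosOpsClient(base_url="http://localhost:8000")
result = client.reset()                      # start a fresh seeded incident
while not result.done:
    result = client.step(scripted_policy(result.observation))
print("episode reward:", result.reward)
```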