README: add submission links, composable-rubric docs, plot embeds, package layout refresh
Browse files* "Try it / read more" link block at the top: HF Space, LoRA repo on Hub,
Colab notebook, BLOG.md, screencast placeholder, GitHub source
* Reward section: replace monolithic R_step formula with the new
4-rubric breakdown (resolution / mttr / oversight / cascade) and a
pointer to score_rubrics() for ablations
* Quickstart: Qwen 1.5B + --backend transformers (no Unsloth dep);
added the canonical hf jobs run invocation that produced the LoRA
* Results section: embed baseline_curve.png + comparison_curve.png +
learning_curve.png with axis-labelled captions per the rubric
* Judging-criteria table: corrected to 9 failure injectors (incl. 3
rogue-AI scenarios) and dropped Unsloth from the pipeline description
* Package layout: openenv.yaml, trained_policy.py, evaluate.py,
scripts/jobs_grpo_train.sh
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---

ChaosOps AI is a reinforcement-learning environment where four specialist LLM agents (SRE, Developer, Manager, Oversight) handle cascading production incidents under partial observability – and where the incident itself may have been caused by a rogue agent *inside* the fleet.

### Try it / read more

- **Live demo (HF Space, runnable):** <https://huggingface.co/spaces/helloAK96/chaosops>
- **Trained LoRA adapter (HF Hub):** <https://huggingface.co/helloAK96/chaosops-grpo-lora>
- **Re-runnable training notebook:** [`notebooks/colab_train.ipynb`](notebooks/colab_train.ipynb)
- **Mini blog post:** see [BLOG.md](BLOG.md) (also on HF Posts)
- **2-minute screencast:** *(YouTube link added once recorded)*
- **Source repo:** <https://github.com/vatsalllll/chaos_ops>

---

## Reward function (composable rubrics)

The total per-step reward is assembled from **four named OpenEnv-style rubrics**
in [`chaosops/rewards/reward_fn.py`](rewards/reward_fn.py). Each rubric is a
standalone callable, so ablations can disable or replace individual components
without touching the rest of the codebase.

| Rubric | What it scores | Components |
|---|---|---|
| `resolution` | Did the team resolve the incident with the right diagnosis, fast? | `resolved` +100, `under_budget` +10, `wrong_fix` −50, `miscommunication` −20, `early_root_cause` +30 |
| `mttr` | Linear time-pressure penalty until resolution | `mttr` = −2 × steps_elapsed |
| `oversight` | Did Oversight catch the rogue fleet agent without false flags? | `rogue_caught` +50, `rogue_false_positive` −75 |
| `cascade` | Did the team's remediation cause a second-order failure? | `cascade` −40 |

```
R_step = resolution(state, flags) + mttr(state, flags) + oversight(state, flags) + cascade(state, flags)
R_terminal_unresolved = −60
```

Two reward streams are blended for GRPO: `combined = 0.6·team + 0.4·oversight`,
where `team_reward` excludes the oversight rubric and `oversight_reward`
inherits a 30% weight on the team outcome (cooperative oversight, not pure
flagging).

Use `score_rubrics(state, outcome_flags)` for per-rubric introspection during
ablations or training-time logging.
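
A minimal sketch of how the four rubrics and the two blended streams fit
together. The rubric names and point values come from the table above; the
`OutcomeFlags` fields, the function signatures, and the `steps_elapsed`
attribute are illustrative assumptions, not the exact `reward_fn.py` API:

```python
from dataclasses import dataclass

@dataclass
class OutcomeFlags:
    # Hypothetical flag set; the real reward_fn.py fields may differ.
    resolved: bool = False
    under_budget: bool = False
    wrong_fix: bool = False
    miscommunication: bool = False
    early_root_cause: bool = False
    rogue_caught: bool = False
    rogue_false_positive: bool = False
    cascade: bool = False

def resolution(state, flags):
    return (100 * flags.resolved + 10 * flags.under_budget
            - 50 * flags.wrong_fix - 20 * flags.miscommunication
            + 30 * flags.early_root_cause)

def mttr(state, flags):
    return -2.0 * state.steps_elapsed  # linear time pressure until resolution

def oversight(state, flags):
    return 50 * flags.rogue_caught - 75 * flags.rogue_false_positive

def cascade(state, flags):
    return -40 * flags.cascade

RUBRICS = {"resolution": resolution, "mttr": mttr,
           "oversight": oversight, "cascade": cascade}

def score_rubrics(state, outcome_flags):
    """Per-rubric breakdown, for ablations and training-time logging."""
    return {name: fn(state, outcome_flags) for name, fn in RUBRICS.items()}

def team_reward(state, flags):
    # Everything except the oversight rubric.
    return resolution(state, flags) + mttr(state, flags) + cascade(state, flags)

def oversight_reward(state, flags):
    # Cooperative oversight: 30% weight on the team outcome, not pure flagging.
    return oversight(state, flags) + 0.3 * team_reward(state, flags)

def combined_reward(state, flags):
    return 0.6 * team_reward(state, flags) + 0.4 * oversight_reward(state, flags)
```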
---

## Judging-criteria alignment

| Rubric | Weight | Evidence |
|---|---|---|
| Environment Innovation | 40% | **9 failure injectors** (3 of them caused by other AI agents: autoscaler, load_balancer, deploy_bot), cascade physics, **scalable-oversight 4th agent**, role-aware partial observability with private channels, red-herring log injection on HARD, deterministic seeded WorldSim for reproducibility. |
| Storytelling & Presentation | 30% | `chaosops.dashboard.terminal`, a live Rich dashboard with a rogue-flag bar. The live HF Space lets judges click through any (failure × policy × seed) combo. 3-minute pitch script in `docs/pitch_script.md`. Mini-blog `BLOG.md` + 2-minute screencast (linked above). |
| Showing Improvement (Reward) | 20% | `artifacts/baseline/baseline_curve.png` and `artifacts/evaluation/comparison_curve.png` (below): a clean Random < Heuristic < Oracle gradient, plus Trained > Heuristic on EASY/MEDIUM. `artifacts/chaosops-grpo/learning_curve.png` shows the GRPO mean reward by step. |
| Reward & Training Pipeline | 10% | TRL GRPO + LoRA r=16 on Qwen 2.5-1.5B-Instruct, **composable rubrics** (resolution / mttr / oversight / cascade) instead of monolithic scoring, and `--backend transformers` so the script runs on any standard CUDA image (no Unsloth/triton dep). Logs `training_metrics.json` every `log_every` steps. |

---
## Quickstart

```bash
# 1. Unit tests (no LLM/GPU required) – 110 tests
python -m pytest tests/

# 2. Scripted baselines – writes artifacts/baseline/{baseline.json, baseline_curve.png}
python -m chaosops.train.baseline --episodes-per-type 5

# 3. Live dashboard demo (try the autoscaler_cost_cut rogue scenario)
python -m chaosops.dashboard.terminal \
    --scenario autoscaler_cost_cut --policy oracle --difficulty hard

# 4. GRPO training on a GPU (the run that produced the LoRA on HF Hub).
#    --backend transformers avoids the Unsloth+triton+cc dep chain so this
#    works on any standard PyTorch CUDA image.
python -m chaosops.train.grpo_train \
    --model-name Qwen/Qwen2.5-1.5B-Instruct \
    --backend transformers \
    --total-episodes 400 --group-size 2 --lora-rank 16 \
    --output-dir artifacts/chaosops-grpo

# 5. Evaluate trained policy vs. baselines (after step 4 or after pulling
#    the LoRA from helloAK96/chaosops-grpo-lora)
python -m chaosops.train.evaluate \
    --policies random heuristic oracle trained \
    --adapter-path artifacts/chaosops-grpo/lora_adapter \
    --episodes-per-type 5 --out-dir artifacts/evaluation
```
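
If you pull the adapter from the Hub instead of training it yourself, loading
it for local experiments can look roughly like this (a hedged sketch using
`peft` directly; `chaosops/agents/trained_policy.py` wraps the equivalent
logic, and its exact interface may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-1.5B-Instruct"

# Load the frozen base model, then attach the GRPO-trained LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")
model = PeftModel.from_pretrained(base_model, "helloAK96/chaosops-grpo-lora")
model.eval()
```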

For one-click reproducibility on HF Jobs, see `scripts/jobs_grpo_train.sh` and
the canonical invocation:

```bash
hf jobs run --flavor t4-small --secrets HF_TOKEN \
    -v hf://spaces/helloAK96/chaosops:/data \
    -e GRPO_EPISODES=400 -e GRPO_PUSH_TO_HUB=1 \
    pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime \
    bash /data/scripts/jobs_grpo_train.sh
```
## Results

### Baselines – Random vs. Heuristic vs. Oracle



*Mean cumulative episode reward across 5 seeds × 9 failure types per tier
(EASY → MEDIUM → HARD). The clean monotone gradient
**Random < Heuristic < Oracle** confirms the reward function rewards real
problem-solving. Random's HARD score plummets to ≈ −1356 because it never
flags the rogue agent and keeps triggering cascades.*

### Trained Qwen 1.5B (GRPO + LoRA) vs. baselines



*Trained policy after 400 GRPO steps on Qwen 2.5-1.5B-Instruct with LoRA r=16,
evaluated under the same benchmark conditions as the baseline plot above.
**The "trained" line sits strictly above heuristic on EASY and MEDIUM** – clear
evidence the LoRA learned to coordinate the agent fleet beyond what the
scripted heuristic does. See `artifacts/evaluation/evaluation_summary.txt` for
the full per-tier table.*

### Learning curve



*Mean combined reward (`0.6 × team + 0.4 × oversight`) by training step.
Logged every 10 steps via the `ChaosOpsMetricsCallback` in
[`chaosops/train/grpo_train.py`](train/grpo_train.py).*
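
For reference, a logging hook like this can be as simple as a `transformers`
`TrainerCallback` that dumps accumulated metrics on each log event. This is a
hedged sketch, not the actual `ChaosOpsMetricsCallback`; the real class in
`grpo_train.py` may track different fields:

```python
import json
from pathlib import Path

from transformers import TrainerCallback

class MetricsDumpCallback(TrainerCallback):
    """Append trainer logs (step, mean reward, ...) to training_metrics.json."""

    def __init__(self, output_dir: str):
        self.path = Path(output_dir) / "training_metrics.json"
        self.history = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            self.history.append({"step": state.global_step, **logs})
            self.path.write_text(json.dumps(self.history, indent=2))
```

TRL's `GRPOTrainer` is a `Trainer` subclass, so such a callback can be passed
via `callbacks=[...]` and the per-step reward means land in a file the plotting
script can read.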
---

## Package layout

```
chaosops/
├── openenv.yaml             # OpenEnv manifest (name, action, observation)
├── app.py                   # Gradio Space entry point
├── Dockerfile               # HF Space build (Python 3.11, port 7860)
├── env/
│   ├── models.py            # pydantic v2 typed contracts
│   ├── world_sim.py         # deterministic simulator + cascade physics
│   ├── environment.py       # OpenEnv-compatible wrapper (extends Environment)
│   └── openenv_wrapper.py   # FastAPI server + ChaosOpsClient
├── agents/
│   ├── prompts/*.md         # 4 role system prompts
│   ├── llm_adapter.py       # render_observation / build_prompt / parse_action
│   ├── policies.py          # random / heuristic / oracle scripted baselines
│   ├── trained_policy.py    # LoRA-backed Policy (loads from disk or HF Hub)
│   └── runner.py            # run_episode orchestration
├── rewards/
│   └── reward_fn.py         # composable rubrics (resolution/mttr/oversight/cascade)
├── curriculum/
│   └── generator.py         # easy → medium → hard + auto-promotion
├── dashboard/
│   ├── terminal.py          # Rich demo UI with rogue-flag visualization
│   └── transcript.py        # text-only transcript writer (used by Space)
├── train/
│   ├── baseline.py          # scripted-policy baselines + reward curve
│   ├── evaluate.py          # multi-policy sweep + comparison plot
│   └── grpo_train.py        # TRL GRPO + LoRA (Unsloth or plain transformers)
└── scripts/
    └── jobs_grpo_train.sh   # one-shot HF Jobs entry point
```

---
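
Since `openenv_wrapper.py` exposes the environment as a FastAPI server plus a
`ChaosOpsClient`, driving an episode over HTTP should follow the usual OpenEnv
reset/step loop. A hedged sketch: the client's constructor, the action shape,
and the result field names here are assumptions, not the documented API:

```python
# Hypothetical usage; see chaosops/env/openenv_wrapper.py for the real client.
from chaosops.env.openenv_wrapper import ChaosOpsClient

def scripted_policy(observation):
    # Stand-in for a real policy (random / heuristic / oracle / trained).
    return {"role": "sre", "action": "inspect_logs"}  # hypothetical action shape

client = ChaosOpsClient(base_url="http://localhost:8000")
result = client.reset()                      # start a fresh seeded incident
while not result.done:
    result = client.step(scripted_policy(result.observation))
print("episode reward:", result.reward)
```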