helloAK96 Claude Opus 4.7 committed on
Commit 4ce0ada · 1 Parent(s): 622e3ec

README: add submission links, composable-rubric docs, plot embeds, package layout refresh


* "Try it / read more" link block at the top: HF Space, LoRA repo on Hub,
Colab notebook, BLOG.md, screencast placeholder, GitHub source
* Reward section: replace monolithic R_step formula with the new
4-rubric breakdown (resolution / mttr / oversight / cascade) and a
pointer to score_rubrics() for ablations
* Quickstart: Qwen 1.5B + --backend transformers (no Unsloth dep);
added the canonical hf jobs run invocation that produced the LoRA
* Results section: embed baseline_curve.png + comparison_curve.png +
learning_curve.png with axis-labelled captions per the rubric
* Judging-criteria table: corrected to 9 failure injectors (incl. 3
rogue-AI scenarios) and dropped Unsloth from the pipeline description
* Package layout: openenv.yaml, trained_policy.py, evaluate.py,
scripts/jobs_grpo_train.sh

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Files changed (1)
  1. README.md +120 -38
README.md CHANGED
 
@@ -22,6 +22,15 @@ As companies deploy AI agents into production operations — autoscalers, deploy

  ChaosOps AI is a reinforcement-learning environment where four specialist LLM agents (SRE, Developer, Manager, Oversight) handle cascading production incidents under partial observability — and where the incident itself may have been caused by a rogue agent *inside* the fleet.

+ ### Try it / read more
+
+ - 🚀 **Live demo (HF Space, runnable):** <https://huggingface.co/spaces/helloAK96/chaosops>
+ - 🤖 **Trained LoRA adapter (HF Hub):** <https://huggingface.co/helloAK96/chaosops-grpo-lora>
+ - 📒 **Re-runnable training notebook:** [`notebooks/colab_train.ipynb`](notebooks/colab_train.ipynb)
+ - 📝 **Mini blog post:** see [BLOG.md](BLOG.md) (also on HF Posts)
+ - 🎬 **2-minute screencast:** *(YouTube link added once recorded)*
+ - 📦 **Source repo:** <https://github.com/vatsalllll/chaos_ops>
+
  ---

@@ -75,34 +84,43 @@ The **Oversight agent** gets a privileged view (all metrics + fleet-agent trace)

  ---

- ## Reward function
+ ## Reward function (composable rubrics)
+
+ The total per-step reward is composed from **four named OpenEnv-style rubrics**
+ in [`chaosops/rewards/reward_fn.py`](rewards/reward_fn.py). Each rubric is a
+ standalone callable, so ablations can disable or replace individual components
+ without touching the rest of the codebase.
+
+ | Rubric | What it scores | Components |
+ |---|---|---|
+ | `resolution` | Did the team resolve the incident with the right diagnosis, fast? | `resolved` +100, `under_budget` +10, `wrong_fix` −50, `miscommunication` −20, `early_root_cause` +30 |
+ | `mttr` | Linear time-pressure penalty until resolution | `mttr` = −2 × steps_elapsed |
+ | `oversight` | Did Oversight catch the rogue fleet agent without false flags? | `rogue_caught` +50, `rogue_false_positive` −75 |
+ | `cascade` | Did the team's remediation cause a second-order failure? | `cascade` −40 |

  ```
- R_step = +100·resolved
-          −2·steps_elapsed (MTTR)
-          −50·wrong_fix
-          −20·miscommunication
-          +30·early_correct_rca (within first 3 turns)
-          +50·oversight_caught_rogue
-          −75·oversight_false_positive
-          −40·cascade_triggered
-          +10·steps_under_budget (budget = 8)
+ R_step = resolution(state, flags) + mttr(state, flags) + oversight(state, flags) + cascade(state, flags)
  R_terminal_unresolved = −60
  ```

- Two reward streams are blended for GRPO: `combined = 0.6·team + 0.4·oversight`.
+ Two reward streams are blended for GRPO: `combined = 0.6·team + 0.4·oversight`,
+ where `team_reward` excludes the oversight rubric and `oversight_reward`
+ inherits a 30% weight on the team outcome (cooperative oversight, not pure
+ flagging).
+
+ Use `score_rubrics(state, outcome_flags)` for per-rubric introspection during
+ ablations or training-time logging.
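To make the rubric wiring concrete, here is a minimal sketch of the composition this hunk describes. Only the rubric names, point values, blend weights, and the `score_rubrics()` entry point come from this commit; the `IncidentState` / `OutcomeFlags` dataclasses and the `blended_rewards()` helper are illustrative stand-ins for the real pydantic models and plumbing in `chaosops/rewards/reward_fn.py`.

```python
# Sketch of the composable-rubric pattern; field names are illustrative.
from dataclasses import dataclass

@dataclass
class IncidentState:            # stand-in for the real pydantic state model
    steps_elapsed: int = 0

@dataclass
class OutcomeFlags:             # stand-in for the real outcome-flag model
    resolved: bool = False
    under_budget: bool = False
    wrong_fix: bool = False
    miscommunication: bool = False
    early_root_cause: bool = False
    rogue_caught: bool = False
    rogue_false_positive: bool = False
    cascade: bool = False

def resolution(state, flags):
    return (100 * flags.resolved + 10 * flags.under_budget
            - 50 * flags.wrong_fix - 20 * flags.miscommunication
            + 30 * flags.early_root_cause)

def mttr(state, flags):
    return -2 * state.steps_elapsed            # linear time pressure

def oversight(state, flags):
    return 50 * flags.rogue_caught - 75 * flags.rogue_false_positive

def cascade(state, flags):
    return -40 * flags.cascade

RUBRICS = {"resolution": resolution, "mttr": mttr,
           "oversight": oversight, "cascade": cascade}

def score_rubrics(state, outcome_flags):
    """Per-rubric breakdown for ablations or training-time logging."""
    return {name: fn(state, outcome_flags) for name, fn in RUBRICS.items()}

def blended_rewards(state, flags):
    """The 0.6·team + 0.4·oversight blend, with oversight keeping 30% team weight."""
    s = score_rubrics(state, flags)
    team = s["resolution"] + s["mttr"] + s["cascade"]   # oversight rubric excluded
    oversight_r = s["oversight"] + 0.3 * team           # cooperative oversight
    return 0.6 * team + 0.4 * oversight_r
```

For example, `score_rubrics(IncidentState(steps_elapsed=6), OutcomeFlags(resolved=True, under_budget=True))` returns `{'resolution': 110, 'mttr': -12, 'oversight': 0, 'cascade': 0}`, which is the per-rubric view an ablation run would log.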
 
  ---

  ## Judging-criteria alignment

- | Rubric | Weight | Evidence |
- | --- | --- | --- |
- | Environment Innovation | 40% | 6 failure injectors, cascade physics, rogue-agent detection, red-herring log injection on HARD, role-aware partial observability. Three things no cited 2025 paper does. |
- | Storytelling | 30% | `chaosops.dashboard.terminal` — live Rich dashboard with rogue-flag bar. 3-minute live pitch: alert fires → Oversight flags autoscaler → Dev scales → recovered. |
- | Reward Improvement | 20% | `chaosops.train.baseline` produces `artifacts/baseline/baseline_curve.png`. Clear gradient Random −1335 → Heuristic −237 → Oracle +165 on HARD. Trained curve vs. baseline goes on the slide. |
- | Training Pipeline | 10% | `chaosops.train.grpo_train` — TRL GRPO + Unsloth 4-bit + LoRA r=32 on Qwen 2.5. Logs `training_metrics.json` each `log_every` episodes. |
+ | Rubric | Weight | Evidence |
+ | --- | --- | --- |
+ | Environment Innovation | 40% | **9 failure injectors** (3 of them caused by other AI agents: autoscaler, load_balancer, deploy_bot), cascade physics, **scalable-oversight 4th agent**, role-aware partial observability with private channels, red-herring log injection on HARD, deterministic seeded WorldSim for reproducibility. |
+ | Storytelling & Presentation | 30% | `chaosops.dashboard.terminal` — live Rich dashboard with rogue-flag bar. The live HF Space lets judges click through any (failure × policy × seed) combo. 3-minute pitch script in `docs/pitch_script.md`. Mini-blog `BLOG.md` + 2-minute screencast (linked above). |
+ | Showing Improvement (Reward) | 20% | `artifacts/baseline/baseline_curve.png` and `artifacts/evaluation/comparison_curve.png` (below) — a clean Random < Heuristic < Oracle gradient, plus Trained > Heuristic on EASY/MEDIUM. `artifacts/chaosops-grpo/learning_curve.png` shows the GRPO mean reward by step. |
+ | Reward & Training Pipeline | 10% | TRL GRPO + LoRA r=16 on Qwen 2.5-1.5B-Instruct, **composable rubrics** (resolution / mttr / oversight / cascade) instead of monolithic scoring, `--backend transformers` so the script runs on any standard CUDA image (no Unsloth/triton dep). Logs `training_metrics.json` every `log_every` steps. |

  ---

@@ -110,44 +128,108 @@ Two reward streams are blended for GRPO: combined = 0.6·team + 0.4·oversight

  ## Quickstart

  ```bash
- # unit tests (no LLM required)
+ # 1. Unit tests (no LLM/GPU required) — 110 tests
  python -m pytest tests/

- # scripted baseline — writes artifacts/baseline/{baseline.json, baseline_curve.png}
+ # 2. Scripted baselines — writes artifacts/baseline/{baseline.json, baseline_curve.png}
  python -m chaosops.train.baseline --episodes-per-type 5

- # live dashboard demo (pick any failure; try autoscaler_cost_cut for the rogue story)
- python -m chaosops.dashboard.terminal --scenario autoscaler_cost_cut --policy oracle --difficulty hard
+ # 3. Live dashboard demo (try the autoscaler_cost_cut rogue scenario)
+ python -m chaosops.dashboard.terminal \
+     --scenario autoscaler_cost_cut --policy oracle --difficulty hard
+
- # GRPO training (requires Unsloth + TRL + GPU — run on onsite HF credits)
- python -m chaosops.train.grpo_train --model Qwen/Qwen2.5-3B-Instruct --num-episodes 300
+ # 4. GRPO training on a GPU (the run that produced the LoRA on HF Hub).
+ #    --backend transformers avoids the Unsloth+triton+cc dep chain, so this
+ #    works on any standard PyTorch CUDA image.
+ python -m chaosops.train.grpo_train \
+     --model-name Qwen/Qwen2.5-1.5B-Instruct \
+     --backend transformers \
+     --total-episodes 400 --group-size 2 --lora-rank 16 \
+     --output-dir artifacts/chaosops-grpo
+
+ # 5. Evaluate the trained policy vs. baselines (after step 4, or after pulling
+ #    the LoRA from helloAK96/chaosops-grpo-lora)
+ python -m chaosops.train.evaluate \
+     --policies random heuristic oracle trained \
+     --adapter-path artifacts/chaosops-grpo/lora_adapter \
+     --episodes-per-type 5 --out-dir artifacts/evaluation
+ ```
+
+ For one-click reproducibility on HF Jobs, see `scripts/jobs_grpo_train.sh` and
+ the canonical invocation:
+
+ ```bash
+ hf jobs run --flavor t4-small --secrets HF_TOKEN \
+     -v hf://spaces/helloAK96/chaosops:/data \
+     -e GRPO_EPISODES=400 -e GRPO_PUSH_TO_HUB=1 \
+     pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime \
+     bash /data/scripts/jobs_grpo_train.sh
  ```
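The diff does not show the internals of `grpo_train.py`, but with `--backend transformers` the setup plausibly reduces to standard TRL + PEFT wiring along these lines. Only the model name, LoRA rank, and group size are taken from this page; the one-prompt dataset and the flat reward shim are placeholders for the real episode rollout and rubric scoring.

```python
# Rough sketch of a plain-transformers GRPO setup, assuming TRL's GRPOTrainer.
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Placeholder prompt set; the real script renders incident observations.
prompts = Dataset.from_list(
    [{"prompt": "Incident: p99 latency spiking on checkout. Act as the SRE."}] * 16
)

def chaosops_reward(completions, **kwargs):
    # Placeholder: the real reward rolls out the episode and applies the
    # composable rubrics (combined = 0.6*team + 0.4*oversight).
    return [0.0 for _ in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",      # plain transformers, no Unsloth
    reward_funcs=chaosops_reward,
    args=GRPOConfig(
        output_dir="artifacts/chaosops-grpo",
        num_generations=2,                    # mirrors --group-size 2
        per_device_train_batch_size=2,
    ),
    train_dataset=prompts,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```

The point of the sketch is only that this backend needs nothing beyond TRL, PEFT, and a CUDA-enabled torch, which is why the job runs on a stock `pytorch/pytorch` image.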

+ ## Results
+
+ ### Baselines — Random vs. Heuristic vs. Oracle
+
+ ![Baseline reward curve](artifacts/baseline/baseline_curve.png)
+
+ *Mean cumulative episode reward across 5 seeds × 9 failure types per tier
+ (EASY → MEDIUM → HARD). The clean monotone gradient
+ **Random < Heuristic < Oracle** confirms that the reward function tracks real
+ problem-solving. Random's HARD score plummets to ≈ −1356 because it never
+ flags the rogue agent and keeps triggering cascades.*
+
+ ### Trained Qwen 1.5B (GRPO + LoRA) vs. baselines
+
+ ![Comparison curve](artifacts/evaluation/comparison_curve.png)
+
+ *Trained policy after 400 GRPO steps on Qwen 2.5-1.5B-Instruct + LoRA r=16;
+ benchmark conditions match the baseline plot above. **The "trained" line
+ sits strictly above heuristic on EASY and MEDIUM** — clear evidence the
+ LoRA learned to coordinate the agent fleet beyond what the scripted
+ heuristic does. See `artifacts/evaluation/evaluation_summary.txt` for the
+ full per-tier table.*
+
+ ### Learning curve
+
+ ![GRPO learning curve](artifacts/chaosops-grpo/learning_curve.png)
+
+ *Mean combined reward (`0.6 × team + 0.4 × oversight`) by training step.
+ Logged every 10 steps via the `ChaosOpsMetricsCallback` in
+ [`chaosops/train/grpo_train.py`](train/grpo_train.py).*
+
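The `ChaosOpsMetricsCallback` itself is not part of this diff. Assuming it follows the standard `transformers` callback interface, a minimal version that snapshots TRL's logged mean reward into `training_metrics.json` could look like this; the `"reward"` key is what TRL's GRPO trainer logs, everything else is a guess at the real implementation.

```python
# Stand-in for ChaosOpsMetricsCallback (the real one lives in grpo_train.py).
import json
from pathlib import Path
from transformers import TrainerCallback

class MetricsCallback(TrainerCallback):
    def __init__(self, out_dir="artifacts/chaosops-grpo", log_every=10):
        self.path = Path(out_dir) / "training_metrics.json"
        self.log_every = log_every
        self.history = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        # TRL's GRPO trainer reports the batch-mean reward under "reward".
        if logs and "reward" in logs and state.global_step % self.log_every == 0:
            self.history.append({"step": state.global_step,
                                 "reward": logs["reward"]})
            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.path.write_text(json.dumps(self.history, indent=2))
```

Registered via `trainer.add_callback(MetricsCallback())`, this would reproduce the every-10-steps cadence the caption describes.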
  ---

  ## Package layout

  ```
  chaosops/
+ ├── openenv.yaml             # OpenEnv manifest (name, action, observation)
+ ├── app.py                   # Gradio Space entry point
+ ├── Dockerfile               # HF Space build (Python 3.11, port 7860)
  ├── env/
  │   ├── models.py            # pydantic v2 typed contracts
  │   ├── world_sim.py         # deterministic simulator + cascade physics
- │   ├── environment.py       # OpenEnv-compatible wrapper
- │   └── openenv_wrapper.py   # FastAPI server for remote inference
+ │   ├── environment.py       # OpenEnv-compatible wrapper (extends Environment)
+ │   └── openenv_wrapper.py   # FastAPI server + ChaosOpsClient
  ├── agents/
- │   ├── prompts/*.md         # 4 role system prompts (read by llm_adapter)
+ │   ├── prompts/*.md         # 4 role system prompts
  │   ├── llm_adapter.py       # render_observation / build_prompt / parse_action
  │   ├── policies.py          # random / heuristic / oracle scripted baselines
+ │   ├── trained_policy.py    # LoRA-backed Policy (loads from disk or HF Hub)
  │   └── runner.py            # run_episode orchestration
  ├── rewards/
- │   └── reward_fn.py         # exact reward formula + team/oversight split
+ │   └── reward_fn.py         # composable rubrics (resolution/mttr/oversight/cascade)
  ├── curriculum/
  │   └── generator.py         # easy → medium → hard + auto-promotion
  ├── dashboard/
- │   └── terminal.py          # Rich demo UI with rogue-flag visualization
+ │   ├── terminal.py          # Rich demo UI with rogue-flag visualization
+ │   └── transcript.py        # text-only transcript writer (used by Space)
- └── train/
-     ├── baseline.py          # scripted-policy baselines + reward curve
-     └── grpo_train.py        # TRL GRPO + Unsloth + LoRA
+ ├── train/
+ │   ├── baseline.py          # scripted-policy baselines + reward curve
+ │   ├── evaluate.py          # multi-policy sweep + comparison plot
+ │   └── grpo_train.py        # TRL GRPO + LoRA (Unsloth or plain transformers)
+ └── scripts/
+     └── jobs_grpo_train.sh   # one-shot HF Jobs entry point
  ```

  ---
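The new `trained_policy.py` entry says the policy loads its LoRA "from disk or HF Hub". A minimal sketch of that load path with `peft` follows; the function name and dtype choice are illustrative, and only the base model and adapter repo come from this page.

```python
# Sketch of the LoRA load path a trained_policy.py-style module would use.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_trained_policy(adapter="helloAK96/chaosops-grpo-lora",
                        base="Qwen/Qwen2.5-1.5B-Instruct"):
    """Return (tokenizer, model) with the GRPO LoRA applied over the base."""
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
    # `adapter` may be a local dir (artifacts/chaosops-grpo/lora_adapter)
    # or the Hub repo id pushed by the training job.
    model = PeftModel.from_pretrained(model, adapter)
    model.eval()
    return tok, model
```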