# ChaosOps AI — 2-minute submission video script

**Target length:** 110 seconds (under the 2-min rubric cap; gives 10 s buffer for intro/outro card).
**Format:** screencap-driven explainer, voiceover narration. No talking head needed. Optional captions for accessibility/sound-off viewing.
**Why this script:** every beat maps to a judging criterion (40% Innovation, 30% Storytelling, 20% Reward Improvement, 10% Pipeline). Visuals are *concrete*, not stock — every screen the viewer sees is something they can click on themselves at the end.

---

## Recording checklist (before you press record)

| Window / Tab | Pre-loaded URL / state |
|---|---|
| **A** — live Space, scenario picker | <https://huggingface.co/spaces/helloAK96/chaosops>, dropdowns set to `autoscaler_cost_cut` / `hard` / `random`, seed 42, **not yet clicked** |
| **B** — same Space, second tab | Same dropdowns but policy = `trained`, seed 42, **not yet clicked** |
| **C** — README on Space | <https://huggingface.co/spaces/helloAK96/chaosops/blob/main/README.md>, scrolled to the comparison-curve image |
| **D** — LoRA model card | <https://huggingface.co/helloAK96/chaosops-grpo-lora-p3a> |
| **E** — Local Rich terminal | `python -m chaosops.dashboard.terminal --scenario autoscaler_cost_cut --policy oracle --difficulty hard --frame-delay 0.4` ready to launch |

**OBS scene order:** A → E → C → D → B (we end on the live trained-policy run as the closer).

---

## The script — shot by shot

### 🎬 0:00 – 0:08 — HOOK (8 s)

**Visual:** Black title card, fade up. Bold white text appears word by word.
**On-screen text:**
> *3 AM. Payments are red. Your AI assistant is about to suggest the wrong fix.*

**Voiceover:**
> "It's 3 AM. Payments are down. Your AI assistant is about to suggest the wrong fix β€” because the AI that *caused* this incident is in the same fleet."

**Why this works:** stakes in the first sentence. Production-flavored. Doesn't say "incident-response gym" — says *what it feels like*.

---

### 🎬 0:08 – 0:25 — THE PROBLEM (17 s)

**Visual:** Live Rich terminal dashboard (window E), already running on `autoscaler_cost_cut` HARD with the **oracle** policy at frame-delay 0.4 s. Viewer sees the alert panel light up red, then the SRE → Oversight → Dev sequence happen on screen.

**On-screen captions (overlaid, lower-third):**
> 4 LLM agents · 9 failure types · 3 of them caused by other AIs in the fleet
> SRE · Developer · Manager · **Oversight**

**Voiceover:**
> "ChaosOps AI is a reinforcement-learning environment with four LLM agents β€” SRE, Developer, Manager, Oversight β€” handling production incidents together. Nine failure types. Three of them are caused not by infrastructure, but by *other AI agents* β€” autoscalers, deployers, load-balancers. The Oversight agent has to catch them."

**Cut on:** the moment Oversight flags `autoscaler` and the Dev runs `scale(payments, 4)`.

---

### 🎬 0:25 – 0:50 — THE TRAINING (25 s)

**Visual:** Cut to **window C** (README on Space). Scroll slowly through `comparison_curve.png`. After ~3 s, scroll past it to `learning_curve.png`. The plots speak louder than narration here.

**On-screen captions:**
> Qwen 2.5-3B + LoRA + GRPO
> 600 steps · 3-tier curriculum · trained on HF Jobs ($1.93)
> 3,200 training episodes · 6 experiments · $9.80 total

**Voiceover:**
> "We trained Qwen 2.5-3B with GRPO and LoRA β€” six different experiments, three thousand two hundred training episodes, all on Hugging Face Jobs because we needed to launch three configurations in parallel and Colab can't do that. Total compute: nine dollars eighty. The harder the curriculum tier, the *higher* the model's mean reward β€” because each tier of training stacked competencies the model couldn't have learned cold."

**Why this works:** drops three concrete numbers (3B, 600, $9.80) instead of abstractions. Names a competitive choice (HF Jobs vs Colab) without bashing.

---

### 🎬 0:50 – 1:25 — THE PROOF (35 s — most important section)

**Visual sequence — split-screen if your editor supports it, otherwise A/B cuts:**

**0:50 – 1:00 (LEFT side / first cut):** Window A — live Space — click "Run episode" with policy = `random`, seed 42. Watch the random agent thrash. Episode ends with `resolved: false`, `cum_reward: -880`.

**On-screen:** big lower-third caption fades in:
> **RANDOM** · resolved=False · 20 wrong fixes · cum_reward: −880

**1:00 – 1:15 (RIGHT side / second cut):** Window B — live Space — click "Run episode" with policy = `trained`, seed 42. Watch Oversight flag `autoscaler` on turn 1, Dev scale on turn 2, episode resolved by turn 3.

**On-screen:**
> **TRAINED** · resolved=True · 3 turns · oversight_flags=['autoscaler'] · cum_reward: +184

**1:15 – 1:25:** Cut back to window C, comparison-curve visible, cursor highlighting the "trained" line on HARD tier.

**Voiceover (one continuous take across all three cuts):**
> "Same scenario. Same seed. Random thrashes for twenty turns and never resolves. The trained model β€” in three turns β€” flags the autoscaler, scales the payments service instead of restarting it, resolves cleanly. Across 540 evaluation episodes: one hundred percent rogue-catch on the medium tier, ninety-three percent on hard. Mean time to resolution matches the Oracle. The trained line beats random by 924 reward points on the hardest tier."

**Why this works:** *show, don't tell* — the policy comparison is happening on screen, not in a slide. Viewer sees the agent succeed, doesn't have to take it on faith.
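Why the same-seed comparison is fair: the seed pins the scenario draw in both tabs, so the policy is the only variable. A toy illustration of that idea (the `draw_scenario` helper and failure list here are hypothetical stand-ins, not the real ChaosOps internals):

```python
import random

def draw_scenario(seed: int) -> str:
    # Hypothetical stand-in: seeding fixes the RNG state, so every run
    # with the same seed draws the same incident.
    rng = random.Random(seed)
    failures = ["autoscaler", "deployer", "load_balancer", "disk_full", "net_split"]
    return rng.choice(failures)

# Both browser tabs draw the identical incident; only the policy differs.
assert draw_scenario(42) == draw_scenario(42)
```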

---

### 🎬 1:25 – 1:45 — WHY IT MATTERS (20 s)

**Visual:** Cut to window D — LoRA model card on the Hub. Slowly scroll past the metrics table, the per-tier results, the training-recipe table. End on the bottom of the page where all 6 LoRAs are listed (including the failed runs).

**On-screen captions:**
> All 6 LoRAs public · including the runs that failed
> Reproducible from `hf jobs run` in one shell line

**Voiceover:**
> "Tomorrow's SRE isn't replacing humans β€” it's watching the *other* agents. We published every LoRA from this experiment, including the runs that failed, because the only honest way to claim a 3B model learned something is to let other people verify. The whole pipeline is one Hugging Face Jobs command."

**Why this works:** vulnerability — "the runs that failed" — earns trust. Re-states the thesis ("watching the other agents") so a viewer who skipped the middle still gets the punchline.

---

### 🎬 1:45 – 1:55 — THE CALL TO ACTION (10 s)

**Visual:** Black card, white text, **single URL** in big monospace.

**On-screen text (large, centered):**
```
huggingface.co/spaces/helloAK96/chaosops
```
Below in smaller text:
> Click. Pick a scenario. Watch the trained agent catch the rogue.

**Voiceover:**
> "Try it. The trained agent is one click away."

**Hold on the URL for 5 full seconds — give viewers time to type/screenshot.**

---

### 🎬 1:55 – 2:00 — END CARD (5 s)

**Visual:** End card with logos.
**On-screen text:**
> Built for the Hugging Face + OpenEnv + TRL hackathon
> #MultiAgent #ScalableOversight #GRPO #HFJobs

No voiceover. Just the card.

---

## Voiceover-only script (paste-able)

```
It's 3 AM. Payments are down. Your AI assistant is about to suggest the
wrong fix — because the AI that caused this incident is in the same fleet.

ChaosOps AI is a reinforcement-learning environment with four LLM agents —
SRE, Developer, Manager, Oversight — handling production incidents
together. Nine failure types. Three of them are caused not by
infrastructure, but by other AI agents — autoscalers, deployers,
load-balancers. The Oversight agent has to catch them.

We trained Qwen 2.5-3B with GRPO and LoRA — six different experiments,
three thousand two hundred training episodes, all on Hugging Face Jobs
because we needed to launch three configurations in parallel and Colab
can't do that. Total compute: nine dollars eighty. The harder the
curriculum tier, the higher the model's mean reward — because each tier
of training stacked competencies the model couldn't have learned cold.

Same scenario. Same seed. Random thrashes for twenty turns and never
resolves. The trained model — in three turns — flags the autoscaler,
scales the payments service instead of restarting it, resolves cleanly.
Across 540 evaluation episodes: one hundred percent rogue-catch on the
medium tier, ninety-three percent on hard. Mean time to resolution
matches the Oracle. The trained line beats random by 924 reward points
on the hardest tier.

Tomorrow's SRE isn't replacing humans — it's watching the other agents.
We published every LoRA from this experiment, including the runs that
failed, because the only honest way to claim a 3B model learned
something is to let other people verify. The whole pipeline is one
Hugging Face Jobs command.

Try it. The trained agent is one click away.
```

**Word count: ~285 words.** At a conversational 165 wpm that lands at ~104 seconds — leaves 6 s of breathing room across the cuts and lets the URL sit on screen at the end.
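If the script changes, the pacing estimate is two lines of arithmetic to re-check (plain arithmetic, no project code):

```python
# Voiceover duration: word count divided by speaking rate, in seconds.
words = 285   # approximate word count of the VO script above
wpm = 165     # conversational speaking rate
seconds = words / wpm * 60
print(f"{seconds:.1f} s")  # 103.6 s, i.e. roughly 104 seconds
```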

---

## Mapping back to the rubric

| Criterion | Weight | Where the video earns it |
|---|---|---|
| Environment Innovation | 40% | 0:08–0:25 (4 LLM agents + 3 rogue-AI failure types — visualised live in the Rich dashboard) |
| Storytelling & Presentation | 30% | The whole arc: 3 AM hook → problem → training → split-screen proof → reproducibility → CTA. Same-seed Random vs Trained side-by-side is the most legible "X learned Y" anyone can show. |
| Showing Improvement in Rewards | 20% | 0:50–1:25 — split-screen Random (−880, never resolves) vs Trained (+184, 3 turns) on the same scenario; comparison curve scroll afterwards. |
| Reward & Training Pipeline | 10% | 0:25–0:50 plus 1:25–1:45 — composable rubrics, HF Jobs, 6 LoRAs published including failures. |

---

## Recording tips

1. **Voiceover first, screen capture second.** Record the audio in one or two takes (not chunked) so the cadence is natural. Then shoot the screen recordings to fit the audio length, not the other way around.
2. **OBS scene transitions:** use 200–300 ms cross-fades, never hard cuts during voiceover. Hard cuts are fine *during pauses*.
3. **Don't speed up the trained-policy run.** The dramatic part is that it resolves in 3 turns at normal speed. Speeding it up makes it look fake.
4. **Use the `--frame-delay 0.4` flag on the Rich dashboard.** Faster than the default and lets you fit a full episode into the time you have.
5. **Subtitle every spoken sentence.** Recommended tool: <https://www.descript.com/>. Hackathon judges often skim videos with sound off.
6. **Export at 1080p 30fps, MP4, < 50 MB.** Hosts: YouTube unlisted (recommended) or Loom. **Do not** upload the MP4 to the HF Space — the rubric explicitly disallows large video files in the env submission.
7. **Add the YouTube/Loom URL to the README** as soon as the video is up — that's the link the validation looks for.
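Tip 6's size cap turns into a bitrate budget with one line of arithmetic (assuming the ~110 s runtime targeted at the top of this script, and decimal megabytes):

```python
# Total bitrate (video + audio) that keeps a 110 s MP4 under 50 MB.
size_mb = 50                         # upload cap from tip 6
duration_s = 110                     # target runtime
max_mbps = size_mb * 8 / duration_s  # megabytes -> megabits, per second
print(f"{max_mbps:.2f} Mbit/s")      # 3.64 Mbit/s ceiling
```

For 1080p30 screen capture, a video bitrate around 3 Mbit/s plus AAC audio typically sits comfortably under that ceiling.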

---

## Optional 60-second social cut (Twitter / LinkedIn / Bluesky)

```
Trained a 3B language model to catch other AI agents breaking production.

Six GRPO experiments. 3,200 training episodes. $9.80 of compute.
All on Hugging Face Jobs — because Colab can't run three GPUs in parallel.

Same scenario, two policies:
* Random Qwen-3B  → 20 turns, no fix, cum_reward = -880
* Trained Qwen-3B → 3 turns, flags the autoscaler, +184

100% rogue-catch on MEDIUM, 93% on HARD, MTTR matches Oracle.

Live Space: huggingface.co/spaces/helloAK96/chaosops
Trained LoRA: huggingface.co/helloAK96/chaosops-grpo-lora-p3a

#GRPO #HFJobs #OpenEnv #ScalableOversight
```

Pair with `comparison_curve.png` as the embed image.