docs: add 2-minute submission video script
Production-grade screencap-driven explainer with shot-by-shot directions,
on-screen text, voiceover script, and recording checklist. Maps every
beat to a judging criterion:
* 0:00-0:08 hook (3 AM stakes)
* 0:08-0:25 environment (4 LLMs, 9 failures, 3 rogue-AI) [Innovation 40%]
* 0:25-0:50 training (3B + GRPO + curriculum + HF Jobs) [Pipeline 10%]
* 0:50-1:25 split-screen Random vs Trained on same seed [Reward Improv 20%]
* 1:25-1:45 reproducibility + 6 LoRAs published [Storytelling 30%]
* 1:45-2:00 CTA + end card
Includes voiceover-only script for the recording session, OBS scene order,
recording tips, and an optional 60-second social cut for Twitter/LinkedIn.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
docs/video_script.md
# ChaosOps AI – 2-minute submission video script

**Target length:** 110 seconds (under the 2-min rubric cap; gives 10 s buffer for intro/outro card).
**Format:** screencap-driven explainer, voiceover narration. No talking head needed. Optional captions for accessibility/sound-off viewing.
**Why this script:** every beat maps to a judging criterion (40% Innovation, 30% Storytelling, 20% Reward Improvement, 10% Pipeline). Visuals are *concrete*, not stock – every screen the viewer sees is something they can click on themselves at the end.

---

## Recording checklist (before you press record)

| Window / Tab | Pre-loaded URL / state |
|---|---|
| **A** – live Space, scenario picker | <https://huggingface.co/spaces/helloAK96/chaosops>, dropdowns set to `autoscaler_cost_cut` / `hard` / `random`, seed 42, **not yet clicked** |
| **B** – same Space, second tab | Same dropdowns but policy = `trained`, seed 42, **not yet clicked** |
| **C** – README on Space | <https://huggingface.co/spaces/helloAK96/chaosops/blob/main/README.md>, scrolled to the comparison-curve image |
| **D** – LoRA model card | <https://huggingface.co/helloAK96/chaosops-grpo-lora-p3a> |
| **E** – local Rich terminal | `python -m chaosops.dashboard.terminal --scenario autoscaler_cost_cut --policy oracle --difficulty hard --frame-delay 0.4` ready to launch |

**OBS scene order:** A → E → C → D → B (we end on the live trained-policy run as the closer).

---
## The script – shot by shot

### 🎬 0:00–0:08 – HOOK (8 s)

**Visual:** Black title card, fade up. Bold white text appears word by word.
**On-screen text:**
> *3 AM. Payments are red. Your AI assistant is about to suggest the wrong fix.*

**Voiceover:**
> "It's 3 AM. Payments are down. Your AI assistant is about to suggest the wrong fix – because the AI that *caused* this incident is in the same fleet."

**Why this works:** stakes in the first sentence. Production-flavored. Doesn't say "incident-response gym" – says *what it feels like*.

---
### 🎬 0:08–0:25 – THE PROBLEM (17 s)

**Visual:** Live Rich terminal dashboard (window E), already running `autoscaler_cost_cut` on HARD with the **oracle** policy at a 0.4 s frame delay. The viewer sees the alert panel light up red, then the SRE → Oversight → Dev sequence play out on screen.

**On-screen captions (overlaid, lower-third):**
> 4 LLM agents · 9 failure types · 3 of them caused by other AIs in the fleet
> SRE · Developer · Manager · **Oversight**

**Voiceover:**
> "ChaosOps AI is a reinforcement-learning environment with four LLM agents – SRE, Developer, Manager, Oversight – handling production incidents together. Nine failure types. Three of them are caused not by infrastructure, but by *other AI agents* – autoscalers, deployers, load-balancers. The Oversight agent has to catch them."

**Cut on:** the moment Oversight flags `autoscaler` and the Dev runs `scale(payments, 4)`.

---
### 🎬 0:25–0:50 – THE TRAINING (25 s)

**Visual:** Cut to **window C** (README on Space). Scroll slowly through the comparison_curve.png. After ~3 s, scroll past it to the learning_curve.png. The plots speak louder than narration here.

**On-screen captions:**
> Qwen 2.5-3B + LoRA + GRPO
> 600 steps · 3-tier curriculum · trained on HF Jobs ($1.93)
> 3,200 training episodes · 6 experiments · $9.80 total

**Voiceover:**
> "We trained Qwen 2.5-3B with GRPO and LoRA – six different experiments, three thousand two hundred training episodes, all on Hugging Face Jobs because we needed to launch three configurations in parallel and Colab can't do that. Total compute: nine dollars eighty. The harder the curriculum tier, the *higher* the model's mean reward – because each tier of training stacked competencies the model couldn't have learned cold."

**Why this works:** drops three concrete numbers (3B, 600, $9.80) instead of abstractions. Names a competitive choice (HF Jobs vs Colab) without bashing.
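
*Side note for anyone reproducing the training beat (not part of the video):* the real entry point lives in the repo, but a minimal TRL sketch of the GRPO + LoRA combination looks roughly like this. The model ID is an assumed Hub name for the 3B base above; the reward function, dataset, and hyperparameters are illustrative placeholders, not the project's actual training code.

```python
# Minimal GRPO + LoRA sketch (TRL + PEFT). Illustrative only: the real
# reward comes from the ChaosOps environment's incident-response rubric.
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward: TRL calls this with the sampled completions and
# expects one float per completion.
def reward_fn(completions, **kwargs):
    return [1.0 if "autoscaler" in c.lower() else 0.0 for c in completions]

dataset = Dataset.from_dict(
    {"prompt": ["Incident: payments latency is spiking. First action?"]}
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",  # assumed ID for "Qwen 2.5-3B"
    reward_funcs=reward_fn,
    args=GRPOConfig(output_dir="grpo-lora", max_steps=600),  # 600 steps per the captions
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32),  # train a LoRA adapter, not the full model
)
trainer.train()
```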

---
### 🎬 0:50–1:25 – THE PROOF (35 s – most important section)

**Visual sequence – split-screen if your editor supports it, otherwise A/B cuts:**

**0:50–1:00 (LEFT side / first cut):** Window A → live Space → click "Run episode" with policy = `random`, seed 42. Watch the random agent thrash. Episode ends with `resolved: false`, `cum_reward: -880`.

**On-screen:** big lower-third caption fades in:
> **RANDOM** · resolved=False · 20 wrong fixes · cum_reward: −880

**1:00–1:15 (RIGHT side / second cut):** Window B → live Space → click "Run episode" with policy = `trained`, seed 42. Watch Oversight flag `autoscaler` on turn 1, the Dev scale on turn 2, and the episode resolve by turn 3.

**On-screen:**
> **TRAINED** · resolved=True · 3 turns · oversight_flags=['autoscaler'] · cum_reward: +184

**1:15–1:25:** Cut back to window C, comparison curve visible, cursor highlighting the "trained" line on the HARD tier.

**Voiceover (one continuous take across all three cuts):**
> "Same scenario. Same seed. Random thrashes for twenty turns and never resolves. The trained model – in three turns – flags the autoscaler, scales the payments service instead of restarting it, resolves cleanly. Across 540 evaluation episodes: one hundred percent rogue-catch on the medium tier, ninety-three percent on hard. Mean time to resolution matches the Oracle. The trained line beats random by 924 reward points on the hardest tier."

**Why this works:** *show, don't tell* – the policy comparison happens on screen, not in a slide. The viewer sees the agent succeed and doesn't have to take it on faith.
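
*Side note:* if you want to sanity-check the A/B beat offline before recording, the protocol is just "fixed seed, two policies". The sketch below is hypothetical – `run_episode` and its result keys mirror what the Space displays, not the actual chaosops API.

```python
# Hypothetical same-seed A/B check. `run_episode` and its result keys are
# illustrative names; only the protocol (identical seed, two policies,
# compare outcomes) is the point.
def compare(run_episode, scenario="autoscaler_cost_cut", seed=42):
    out = {}
    for policy in ("random", "trained"):
        ep = run_episode(scenario=scenario, difficulty="hard",
                         policy=policy, seed=seed)  # same seed for both runs
        out[policy] = (ep["resolved"], ep["turns"], ep["cum_reward"])
    return out

# Expected shape, matching the on-screen captions:
# {"random":  (False, 20, -880),
#  "trained": (True,   3,  184)}
```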

---
### 🎬 1:25–1:45 – WHY IT MATTERS (20 s)

**Visual:** Cut to window D – the LoRA model card on the Hub. Slowly scroll past the metrics table, the per-tier results, and the training-recipe table. End on the bottom of the page where all 6 LoRAs are listed (including the failed runs).

**On-screen captions:**
> All 6 LoRAs public · including the runs that failed
> Reproducible from `hf jobs run` in one shell line

**Voiceover:**
> "Tomorrow's SRE isn't replacing humans – it's watching the *other* agents. We published every LoRA from this experiment, including the runs that failed, because the only honest way to claim a 3B model learned something is to let other people verify. The whole pipeline is one Hugging Face Jobs command."

**Why this works:** vulnerability – "the runs that failed" – earns trust. Restates the thesis ("watching the other agents") so a viewer who skipped the middle still gets the punchline.
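
*Side note:* the "one shell line" referenced above has roughly this shape. The image, flavor, and training-module path are placeholders (the real ones live in the repo), and the flags should be double-checked against your `huggingface_hub` version:

```bash
# Illustrative shape only – image, flavor, and module path are placeholders.
hf jobs run --flavor a10g-small \
    --secrets HF_TOKEN \
    pytorch/pytorch:2.6.0-cuda12.4-cudnn9-runtime \
    python -m chaosops.train --tier hard --max-steps 600
```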

---
### 🎬 1:45–1:55 – THE CALL TO ACTION (10 s)

**Visual:** Black card, white text, **single URL** in big monospace.

**On-screen text (large, centered):**

```
huggingface.co/spaces/helloAK96/chaosops
```

Below, in smaller text:
> Click. Pick a scenario. Watch the trained agent catch the rogue.

**Voiceover:**
> "Try it. The trained agent is one click away."

**Hold on the URL for 5 full seconds – give viewers time to type/screenshot.**

---
### 🎬 1:55–2:00 – END CARD (5 s)

**Visual:** End card with logos.
**On-screen text:**
> Built for the HuggingFace + OpenEnv + TRL hackathon
> #MultiAgent #ScalableOversight #GRPO #HFJobs

No voiceover. Just the card.

---
## Voiceover-only script (paste-able)

```
It's 3 AM. Payments are down. Your AI assistant is about to suggest the
wrong fix – because the AI that caused this incident is in the same fleet.

ChaosOps AI is a reinforcement-learning environment with four LLM agents –
SRE, Developer, Manager, Oversight – handling production incidents
together. Nine failure types. Three of them are caused not by
infrastructure, but by other AI agents – autoscalers, deployers,
load-balancers. The Oversight agent has to catch them.

We trained Qwen 2.5-3B with GRPO and LoRA – six different experiments,
three thousand two hundred training episodes, all on Hugging Face Jobs
because we needed to launch three configurations in parallel and Colab
can't do that. Total compute: nine dollars eighty. The harder the
curriculum tier, the higher the model's mean reward – because each tier
of training stacked competencies the model couldn't have learned cold.

Same scenario. Same seed. Random thrashes for twenty turns and never
resolves. The trained model – in three turns – flags the autoscaler,
scales the payments service instead of restarting it, resolves cleanly.
Across 540 evaluation episodes: one hundred percent rogue-catch on the
medium tier, ninety-three percent on hard. Mean time to resolution
matches the Oracle. The trained line beats random by 924 reward points
on the hardest tier.

Tomorrow's SRE isn't replacing humans – it's watching the other agents.
We published every LoRA from this experiment, including the runs that
failed, because the only honest way to claim a 3B model learned
something is to let other people verify. The whole pipeline is one
Hugging Face Jobs command.

Try it. The trained agent is one click away.
```

**Word count: ~285 words.** At a conversational 165 wpm that lands at ~104 seconds – leaves 6 s of breathing room across the cuts and lets the URL sit on screen at the end.

---
## Mapping back to the rubric

| Criterion | Weight | Where the video earns it |
|---|---|---|
| Environment Innovation | 40% | 0:08–0:25 (4 LLM agents + 3 rogue-AI failure types – visualised live in the Rich dashboard) |
| Storytelling & Presentation | 30% | The whole arc: 3 AM hook → problem → training → split-screen proof → reproducibility → CTA. Same-seed Random vs Trained side-by-side is the most legible "X learned Y" anyone can show. |
| Showing Improvement in Rewards | 20% | 0:50–1:25 – split-screen Random (−880, never resolves) vs Trained (+184, 3 turns) on the same scenario; comparison-curve scroll afterwards. |
| Reward & Training Pipeline | 10% | 0:25–0:50 plus 1:25–1:45 – composable rubrics, HF Jobs, 6 LoRAs published including failures. |

---
## Recording tips

1. **Voiceover first, screen capture second.** Record the audio in one or two takes (not chunked) so the cadence is natural. Then shoot the screen recordings to fit the audio length, not the other way around.
2. **OBS scene transitions:** use 200–300 ms cross-fades, never hard cuts during voiceover. Hard cuts are fine *during pauses*.
3. **Don't speed up the trained-policy run.** The dramatic part is that it resolves in 3 turns at normal speed. Speeding it up makes it look fake.
4. **Use the `--frame-delay 0.4` flag on the Rich dashboard.** Faster than the default, and it lets you fit a full episode into the time you have.
5. **Subtitle every spoken sentence.** Recommended tool: <https://www.descript.com/>. Hackathon judges often skim videos with sound off.
6. **Export at 1080p 30 fps, MP4, < 50 MB.** Hosts: YouTube unlisted (recommended) or Loom. **Do not** upload the MP4 to the HF Space – the rubric explicitly disallows large video files in the env submission.
7. **Add the YouTube/Loom URL to the README** as soon as the video is up – that's the link the submission validation looks for.

---
## Optional 60-second social cut (Twitter / LinkedIn / Bluesky)

```
Trained a 3B language model to catch other AI agents breaking production.

Six GRPO experiments. 3,200 training episodes. $9.80 of compute.
All on Hugging Face Jobs – because Colab can't run three GPUs in parallel.

Same scenario, two policies:
* Random Qwen-3B → 20 turns, no fix, cum_reward = -880
* Trained Qwen-3B → 3 turns, flags the autoscaler, +184

100% rogue-catch on MEDIUM, 93% on HARD, MTTR matches Oracle.

Live Space: huggingface.co/spaces/helloAK96/chaosops
Trained LoRA: huggingface.co/helloAK96/chaosops-grpo-lora-p3a

#GRPO #HFJobs #OpenEnv #ScalableOversight
```

Pair it with comparison_curve.png as the embed image.