# ChaosOps AI: 2-minute submission video script
|
|
| **Target length:** 110 seconds (under the 2-min rubric cap; gives 10 s buffer for intro/outro card). |
| **Format:** screencap-driven explainer, voiceover narration. No talking head needed. Optional captions for accessibility/sound-off viewing. |
**Why this script:** every beat maps to a judging criterion (40% Innovation, 30% Storytelling, 20% Reward Improvement, 10% Pipeline). Visuals are *concrete*, not stock: every screen the viewer sees is something they can click on themselves at the end.
|
|
| --- |
|
|
| ## Recording checklist (before you press record) |
|
|
| | Window / Tab | Pre-loaded URL / state | |
| |---|---| |
| **A**: live Space, scenario picker | <https://huggingface.co/spaces/helloAK96/chaosops>, dropdowns set to `autoscaler_cost_cut` / `hard` / `random`, seed 42, **not yet clicked** |
| **B**: same Space, second tab | Same dropdowns but policy = `trained`, seed 42, **not yet clicked** |
| **C**: README on Space | <https://huggingface.co/spaces/helloAK96/chaosops/blob/main/README.md>, scrolled to the comparison-curve image |
| **D**: LoRA model card | <https://huggingface.co/helloAK96/chaosops-grpo-lora-p3a> |
| **E**: local Rich terminal | `python -m chaosops.dashboard.terminal --scenario autoscaler_cost_cut --policy oracle --difficulty hard --frame-delay 0.4` ready to launch |
|
|
**OBS scene order:** A → E → C → D → B (we end on the live trained-policy run as the closer).
|
|
| --- |
|
|
## The script, shot by shot
|
|
### 🎬 0:00–0:08: HOOK (8 s)
|
|
| **Visual:** Black title card, fade up. Bold white text appears word by word. |
| **On-screen text:** |
| > *3 AM. Payments are red. Your AI assistant is about to suggest the wrong fix.* |
|
|
| **Voiceover:** |
| > "It's 3 AM. Payments are down. Your AI assistant is about to suggest the wrong fix β because the AI that *caused* this incident is in the same fleet." |
|
|
**Why this works:** stakes in the first sentence. Production-flavored. Doesn't say "incident-response gym"; says *what it feels like*.
|
|
| --- |
|
|
### 🎬 0:08–0:25: THE PROBLEM (17 s)
|
|
**Visual:** Live Rich terminal dashboard (window E), already running on `autoscaler_cost_cut` HARD with the **oracle** policy at frame-delay 0.4 s. Viewer sees the alert panel light up red, then the SRE → Oversight → Dev sequence happen on screen.
|
|
| **On-screen captions (overlaid, lower-third):** |
> 4 LLM agents · 9 failure types · 3 of them caused by other AIs in the fleet
> SRE · Developer · Manager · **Oversight**
|
|
| **Voiceover:** |
| > "ChaosOps AI is a reinforcement-learning environment with four LLM agents β SRE, Developer, Manager, Oversight β handling production incidents together. Nine failure types. Three of them are caused not by infrastructure, but by *other AI agents* β autoscalers, deployers, load-balancers. The Oversight agent has to catch them." |
|
|
| **Cut on:** the moment Oversight flags `autoscaler` and the Dev runs `scale(payments, 4)`. |
|
|
| --- |
|
|
### 🎬 0:25–0:50: THE TRAINING (25 s)
|
|
**Visual:** Cut to **window C** (README on Space). Scroll slowly through `comparison_curve.png`. After ~3 s, scroll past it to `learning_curve.png`. The plots speak louder than narration here.
|
|
| **On-screen captions:** |
> Qwen 2.5-3B + LoRA + GRPO
> 600 steps · 3-tier curriculum · trained on HF Jobs ($1.93)
> 3,200 training episodes · 6 experiments · $9.80 total
|
|
| **Voiceover:** |
| > "We trained Qwen 2.5-3B with GRPO and LoRA β six different experiments, three thousand two hundred training episodes, all on Hugging Face Jobs because we needed to launch three configurations in parallel and Colab can't do that. Total compute: nine dollars eighty. The harder the curriculum tier, the *higher* the model's mean reward β because each tier of training stacked competencies the model couldn't have learned cold." |
|
|
**Why this works:** drops three concrete numbers (3B, 3,200 episodes, $9.80) instead of abstractions. Names a competitive choice (HF Jobs vs Colab) without bashing.
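
For judges who want to poke at the "three configurations in parallel" claim, here is a minimal sketch of what such a fan-out can look like on HF Jobs. Only `hf jobs run` and `--flavor` are real CLI surface; the Docker image, `train_grpo.py`, and its flags are illustrative assumptions, not the project's actual launcher.

```bash
# Hedged sketch: one HF Jobs launch per curriculum tier. Each job runs on
# remote hardware; backgrounding the launches (&) lets all three fire at once.
# The image and the training script/flags below are placeholders.
for TIER in easy medium hard; do
  hf jobs run --flavor a10g-small python:3.12 \
    python train_grpo.py --tier "$TIER" --seed 42 &
done
wait  # block until all three launch commands have returned
```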
|
|
| --- |
|
|
### 🎬 0:50–1:25: THE PROOF (35 s, the most important section)
|
|
**Visual sequence (split-screen if your editor supports it, otherwise A/B cuts):**
|
|
| **0:50 β 1:00 (LEFT side / first cut):** Window A β live Space β click "Run episode" with policy = `random`, seed 42. Watch the random agent thrash. Episode ends with `resolved: false`, `cum_reward: -880`. |
|
|
| **On-screen:** big lower-third caption fades in: |
> **RANDOM** · resolved=False · 20 wrong fixes · cum_reward: -880
| |
| **1:00 β 1:15 (RIGHT side / second cut):** Window B β live Space β click "Run episode" with policy = `trained`, seed 42. Watch Oversight flag `autoscaler` on turn 1, Dev scale on turn 2, episode resolved by turn 3. |
| |
| **On-screen:** |
> **TRAINED** · resolved=True · 3 turns · oversight_flags=['autoscaler'] · cum_reward: +184
| |
| **1:15 β 1:25:** Cut back to window C, comparison-curve visible, cursor highlighting the "trained" line on HARD tier. |
| |
| **Voiceover (one continuous take across all three cuts):** |
| > "Same scenario. Same seed. Random thrashes for twenty turns and never resolves. The trained model β in three turns β flags the autoscaler, scales the payments service instead of restarting it, resolves cleanly. Across 540 evaluation episodes: one hundred percent rogue-catch on the medium tier, ninety-three percent on hard. Mean time to resolution matches the Oracle. The trained line beats random by 924 reward points on the hardest tier." |
| |
**Why this works:** *show, don't tell*: the policy comparison is happening on screen, not in a slide. The viewer sees the agent succeed and doesn't have to take it on faith.
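
If the live Space is slow on the day, the same seed-matched A/B can be captured from the local Rich dashboard (window E) instead. A minimal sketch: the module path and the `--scenario`/`--policy`/`--difficulty`/`--frame-delay` flags come from the recording checklist above, while `--seed` is an assumption (the Space exposes a seed field, so the CLI presumably accepts one too).

```bash
# The on-screen A/B, run locally: same scenario, same seed, two policies.
# --seed is assumed; every other flag appears in the recording checklist.
python -m chaosops.dashboard.terminal --scenario autoscaler_cost_cut \
  --difficulty hard --policy random --seed 42 --frame-delay 0.4
python -m chaosops.dashboard.terminal --scenario autoscaler_cost_cut \
  --difficulty hard --policy trained --seed 42 --frame-delay 0.4
```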
| |
| --- |
| |
### 🎬 1:25–1:45: WHY IT MATTERS (20 s)
| |
**Visual:** Cut to window D (the LoRA model card on the Hub). Slowly scroll past the metrics table, the per-tier results, the training-recipe table. End on the bottom of the page where all 6 LoRAs are listed (including the failed runs).
| |
| **On-screen captions:** |
> All 6 LoRAs public · including the runs that failed
| > Reproducible from `hf jobs run` in one shell line |
| |
| **Voiceover:** |
| > "Tomorrow's SRE isn't replacing humans β it's watching the *other* agents. We published every LoRA from this experiment, including the runs that failed, because the only honest way to claim a 3B model learned something is to let other people verify. The whole pipeline is one Hugging Face Jobs command." |
| |
**Why this works:** vulnerability ("the runs that failed") earns trust. Restates the thesis ("watching the other agents") so a viewer who skipped the middle still gets the punchline.
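
For anyone who takes the "verify it yourself" invitation, pulling the headline LoRA is one Hub CLI command. A minimal sketch, assuming only that the repo id on the model card (window D) is current; the local directory name is arbitrary.

```bash
# Pull the published adapter for local inspection (repo id from window D).
hf download helloAK96/chaosops-grpo-lora-p3a --local-dir ./chaosops-lora-p3a
```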
| |
| --- |
| |
### 🎬 1:45–1:55: THE CALL TO ACTION (10 s)
| |
| **Visual:** Black card, white text, **single URL** in big monospace. |
| |
| **On-screen text (large, centered):** |
| ``` |
| huggingface.co/spaces/helloAK96/chaosops |
| ``` |
| Below in smaller text: |
| > Click. Pick a scenario. Watch the trained agent catch the rogue. |
| |
| **Voiceover:** |
| > "Try it. The trained agent is one click away." |
| |
**Hold on the URL for 5 full seconds; give viewers time to type/screenshot.**
| |
| --- |
| |
### 🎬 1:55–2:00: END CARD (5 s)
| |
| **Visual:** End card with logos. |
| **On-screen text:** |
> Built for the Hugging Face + OpenEnv + TRL hackathon
| > #MultiAgent #ScalableOversight #GRPO #HFJobs |
| |
| No voiceover. Just the card. |
| |
| --- |
| |
| ## Voiceover-only script (paste-able) |
| |
| ``` |
It's 3 AM. Payments are down. Your AI assistant is about to suggest the
wrong fix, because the AI that caused this incident is in the same fleet.

ChaosOps AI is a reinforcement-learning environment with four LLM agents
(SRE, Developer, Manager, Oversight) handling production incidents
together. Nine failure types. Three of them are caused not by
infrastructure, but by other AI agents: autoscalers, deployers,
load-balancers. The Oversight agent has to catch them.

We trained Qwen 2.5-3B with GRPO and LoRA: six different experiments,
three thousand two hundred training episodes, all on Hugging Face Jobs,
because we needed to launch three configurations in parallel and Colab
can't do that. Total compute: nine dollars eighty. The harder the
curriculum tier, the higher the model's mean reward, because each tier
of training stacked competencies the model couldn't have learned cold.

Same scenario. Same seed. Random thrashes for twenty turns and never
resolves. The trained model, in three turns, flags the autoscaler,
scales the payments service instead of restarting it, resolves cleanly.
Across 540 evaluation episodes: one hundred percent rogue-catch on the
medium tier, ninety-three percent on hard. Mean time to resolution
matches the Oracle. The trained line beats random by 924 reward points
on the hardest tier.

Tomorrow's SRE isn't replacing humans; it's watching the other agents.
We published every LoRA from this experiment, including the runs that
failed, because the only honest way to claim a 3B model learned
something is to let other people verify. The whole pipeline is one
Hugging Face Jobs command.

Try it. The trained agent is one click away.
| ``` |
| |
**Word count: ~285 words.** At a conversational 165 wpm that lands at ~104 seconds, which leaves 6 s of breathing room across the cuts and lets the URL sit on screen at the end.
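
A one-liner to sanity-check that pacing math (word count and wpm taken from the line above; `bc` is just doing the division):

```bash
# 285 words / 165 wpm = 1.727 min, i.e. roughly 104 seconds of narration.
echo "285 / 165 * 60" | bc -l
```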
| |
| --- |
| |
| ## Mapping back to the rubric |
| |
| | Criterion | Weight | Where the video earns it | |
| |---|---|---| |
| Environment Innovation | 40% | 0:08–0:25 (4 LLM agents + 3 rogue-AI failure types, visualised live in the Rich dashboard) |
| Storytelling & Presentation | 30% | The whole arc: 3 AM hook → problem → training → split-screen proof → reproducibility → CTA. Same-seed Random vs Trained side-by-side is the most legible "X learned Y" anyone can show. |
| Showing Improvement in Rewards | 20% | 0:50–1:25: split-screen Random (-880, never resolves) vs Trained (+184, 3 turns) on the same scenario; comparison-curve scroll afterwards. |
| Reward & Training Pipeline | 10% | 0:25–0:50 plus 1:25–1:45: composable rubrics, HF Jobs, 6 LoRAs published including failures. |
| |
| --- |
| |
| ## Recording tips |
| |
| 1. **Voiceover first, screen capture second.** Record the audio in one or two takes (not chunked) so the cadence is natural. Then shoot the screen recordings to fit the audio length, not the other way around. |
2. **OBS scene transitions:** use 200–300 ms cross-fades, never hard cuts during voiceover. Hard cuts are fine *during pauses*.
| 3. **Don't speed up the trained-policy run.** The dramatic part is that it resolves in 3 turns at normal speed. Speeding it up makes it look fake. |
4. **Use the `--frame-delay 0.4` flag on the Rich dashboard.** It's faster than the default and lets you fit a full episode into the time you have.
| 5. **Subtitle every spoken sentence.** Recommended tool: <https://www.descript.com/>. Hackathon judges often skim videos with sound off. |
6. **Export at 1080p 30 fps, MP4, < 50 MB** (see the ffmpeg sketch after this list). Hosts: YouTube unlisted (recommended) or Loom. **Do not** upload the MP4 to the HF Space; the rubric explicitly disallows large video files in the env submission.
7. **Add the YouTube/Loom URL to the README** as soon as the video is up; that's the link the validation step looks for.
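
A starting-point ffmpeg export for tip 6. The input filename is a placeholder, and the CRF/preset values are typical defaults to tune, not project requirements; a 2-minute 1080p screencap at these settings normally lands well under 50 MB.

```bash
# 1080p / 30 fps / H.264 MP4 export; crf 23 + slow preset is a sane default,
# and yuv420p keeps the file playable in browsers and video players.
ffmpeg -i raw_capture.mkv -vf scale=1920:1080 -r 30 \
  -c:v libx264 -crf 23 -preset slow -pix_fmt yuv420p \
  -c:a aac -b:a 128k submission.mp4
```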
| |
| --- |
| |
| ## Optional 60-second social cut (Twitter / LinkedIn / Bluesky) |
| |
| ``` |
| Trained a 3B language model to catch other AI agents breaking production. |
| |
| Six GRPO experiments. 3,200 training episodes. $9.80 of compute. |
All on Hugging Face Jobs, because Colab can't run three GPUs in parallel.
| |
| Same scenario, two policies: |
* Random Qwen-3B: 20 turns, no fix, cum_reward = -880
* Trained Qwen-3B: 3 turns, flags the autoscaler, +184
|
|
| 100% rogue-catch on MEDIUM, 93% on HARD, MTTR matches Oracle. |
|
|
| Live Space: huggingface.co/spaces/helloAK96/chaosops |
| Trained LoRA: huggingface.co/helloAK96/chaosops-grpo-lora-p3a |
|
|
| #GRPO #HFJobs #OpenEnv #ScalableOversight |
| ``` |
| |
Pair with `comparison_curve.png` as the embed image.
| |