Notebook: add Phase 8b — training history (3,200 episodes, 6-run ablation table)
- notebooks/colab_train.ipynb +39 -0
notebooks/colab_train.ipynb
CHANGED
|
@@ -364,6 +364,45 @@
|
|
| 364 |
"Image('/content/artifacts/evaluation/comparison_curve.png')\n"
|
| 365 |
]
|
| 366 |
},
|
| 367 |
+
{
|
| 368 |
+
"cell_type": "markdown",
|
| 369 |
+
"metadata": {},
|
| 370 |
+
"source": [
|
| 371 |
+
"## Phase 8b \u2014 How we got here: training history (3,200 episodes, 6 GRPO runs)\n",
|
| 372 |
+
"\n",
|
| 373 |
+
"The Phase 3A LoRA you just evaluated isn't from one happy run \u2014 it's\n",
|
| 374 |
+
"the winner of a 6-run ablation sweep on HF Jobs. Each phase tested a\n",
|
| 375 |
+
"specific hypothesis about what was bottlenecking the previous run.\n",
|
| 376 |
+
"Total simulated: **3,200 GRPO training episodes + 3,240 evaluation\n",
|
| 377 |
+
"episodes**.\n",
|
| 378 |
+
"\n",
|
| 379 |
+
"| Phase | Hypothesis | Knobs vs prev | Eval mean (E/M/H) | Verdict |\n",
|
| 380 |
+
"|---|---|---|---|---|\n",
|
| 381 |
+
"| **0** | TRL defaults are fine on 1.5B | Qwen-1.5B, LR 5e-6, EASY-only, 400 steps, r=16 | \u2212251 / \u2212315 / \u2212826 | flat reward, KL 0.14 \u2014 undertrained |\n",
|
| 382 |
+
"| **1** | gradient too small | LR 5e-6 \u2192 **2e-5** (4\u00d7) | \u2212218 / \u2212283 / \u2212820 | KL 4\u00d7 higher; rogue catch lost (20% \u2192 0%) |\n",
|
| 383 |
+
"| **2** | model never sees harder scenarios | + **easy:200,medium:200,hard:200 curriculum**, r=32, 600 steps | \u2212221 / \u2212296 / \u2212834 | resolution 10/40/0%; HARD-tier *training* reward (\u22124.4) > EASY (\u22126.1); **1.5B capacity-bound** |\n",
|
| 384 |
+
"| **3A \ud83c\udfc6** | bigger model + amplified oversight signal | **Qwen-1.5B \u2192 Qwen-3B**, **rogue-rubric \u00d72** (catch +100, FP \u2212150), temperature 0.8 | **+49 / \u221217 / \u2212433** | 85/80/40% solve, **100% rogue+ on MED, 93% on HARD**, MTTR matches Oracle. \ud83c\udfc6 SUBMITTED |\n",
|
| 385 |
+
"| 3B (control) | reward shape alone enough on 1.5B | Phase 2 + rogue-rubric \u00d72 | \u2212222 / \u2212269 / \u2212813 | falsifies \u2014 1.5B doesn't transfer the rogue signal |\n",
|
| 386 |
+
"| 3C (control) | reverse curriculum (HARD \u2192 MED \u2192 EASY) helps EASY | Phase 3B + reverse curric | \u2212241 / \u2212363 / \u2212821 | falsifies \u2014 reverse curriculum strictly hurts |\n",
|
| 387 |
+
"\n",
|
| 388 |
+
"The **3B-vs-3A delta indicates model capacity was the binding constraint**\n",
|
| 389 |
+
"\u2014 the same reward shape on 1.5B got nowhere; on 3B it broke through.\n",
|
| 390 |
+
"All six LoRAs are public on the Hub (`chaosops-grpo-lora{,-p1,-p2,-p3a,-p3b,-p3c}`)\n",
|
| 391 |
+
"so the ablation is independently reproducible.\n",
|
| 392 |
+
"\n",
|
| 393 |
+
"Phase 3A's per-tier *training* reward stream (60 log points, every 10 steps):\n",
|
| 394 |
+
"\n",
|
| 395 |
+
"| Tier (steps) | Mean | Min | Max | Last |\n",
|
| 396 |
+
"|---|---|---|---|---|\n",
|
| 397 |
+
"| EASY (1\u2013200) | **+6.90** | \u22121.01 | +17.14 | +4.95 |\n",
|
| 398 |
+
"| MEDIUM (201\u2013400) | **+12.68** | +2.96 | +30.75 | +13.49 |\n",
|
| 399 |
+
"| HARD (401\u2013600) | **+14.00** | +4.94 | +30.33 | +16.28 |\n",
|
| 400 |
+
"\n",
|
| 401 |
+
"All three tiers ended with positive mean reward, and the harder the\n",
|
| 402 |
+
"tier the higher the mean \u2014 the curriculum effect *compounded*. Final\n",
|
| 403 |
+
"KL to base 0.595, sustained.\n"
|
| 404 |
+
]
|
| 405 |
+
},
|
| 406 |
{
|
| 407 |
"cell_type": "markdown",
|
| 408 |
"metadata": {},
|