helloAK96 committed on
Commit a9790c1 (verified) · 1 parent: 5228bdf

Notebook: add Phase 8b — training history (3,200 episodes, 6-run ablation table)

Files changed (1)
  1. notebooks/colab_train.ipynb +39 -0
notebooks/colab_train.ipynb CHANGED
@@ -364,6 +364,45 @@
  "Image('/content/artifacts/evaluation/comparison_curve.png')\n"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Phase 8b \u2014 How we got here: training history (3,200 episodes, 6 GRPO runs)\n",
+ "\n",
+ "The Phase 3A LoRA you just evaluated isn't from one happy run \u2014 it's\n",
+ "the winner of a 6-run ablation sweep on HF Jobs. Each phase tested a\n",
+ "specific hypothesis about what was bottlenecking the previous run.\n",
+ "Total simulated: **3,200 GRPO training episodes + 3,240 evaluation\n",
+ "episodes**.\n",
+ "\n",
+ "| Phase | Hypothesis | Knobs vs prev | Eval mean (E/M/H) | Verdict |\n",
+ "|---|---|---|---|---|\n",
+ "| **0** | TRL defaults are fine on 1.5B | Qwen-1.5B, LR 5e-6, EASY-only, 400 steps, r=16 | \u2212251 / \u2212315 / \u2212826 | flat reward, KL 0.14 \u2014 undertrained |\n",
+ "| **1** | gradient too small | LR 5e-6 \u2192 **2e-5** (4\u00d7) | \u2212218 / \u2212283 / \u2212820 | KL 4\u00d7 higher; rogue catch lost (20% \u2192 0%) |\n",
+ "| **2** | model never sees harder scenarios | + **easy:200,medium:200,hard:200 curriculum**, r=32, 600 steps | \u2212221 / \u2212296 / \u2212834 | resolution 10/40/0%; HARD-tier *training* reward (\u22124.4) > EASY (\u22126.1); **1.5B capacity-bound** |\n",
+ "| **3A \ud83c\udfc6** | bigger model + amplified oversight signal | **Qwen-1.5B \u2192 Qwen-3B**, **rogue-rubric \u00d72** (catch +100, FP \u2212150), temperature 0.8 | **+49 / \u221217 / \u2212433** | 85/80/40% solve, **100% rogue+ on MED, 93% on HARD**, MTTR matches Oracle. \ud83c\udfc6 SUBMITTED |\n",
+ "| 3B (control) | reward shape alone enough on 1.5B | Phase 2 + rogue-rubric \u00d72 | \u2212222 / \u2212269 / \u2212813 | falsifies \u2014 1.5B doesn't transfer the rogue signal |\n",
+ "| 3C (control) | reverse curriculum (HARD \u2192 MED \u2192 EASY) helps EASY | Phase 3B + reverse curric | \u2212241 / \u2212363 / \u2212821 | falsifies \u2014 reverse curriculum strictly hurts |\n",
+ "\n",
+ "The **3B-vs-3A delta proves model capacity was the binding constraint**\n",
+ "\u2014 the same reward shape on 1.5B got nowhere; on 3B it broke through.\n",
+ "All six LoRAs are public on the Hub (`chaosops-grpo-lora{,-p1,-p2,-p3a,-p3b,-p3c}`)\n",
+ "so the ablation is independently reproducible.\n",
+ "\n",
+ "Phase 3A's per-tier *training* reward stream (60 log points, every 10 steps):\n",
+ "\n",
+ "| Tier (steps) | Mean | Min | Max | Last |\n",
+ "|---|---|---|---|---|\n",
+ "| EASY (1\u2013200) | **+6.90** | \u22121.01 | +17.14 | +4.95 |\n",
+ "| MEDIUM (201\u2013400) | **+12.68** | +2.96 | +30.75 | +13.49 |\n",
+ "| HARD (401\u2013600) | **+14.00** | +4.94 | +30.33 | +16.28 |\n",
+ "\n",
+ "All three tiers ended with positive mean reward, and the harder the\n",
+ "tier the higher the mean \u2014 the curriculum effect *compounded*. Final\n",
+ "KL to base 0.595, sustained.\n"
+ ]
+ },
  {
  "cell_type": "markdown",
  "metadata": {},