Spaces: helloAK96/chaosops


helloAK96 committed 28 days ago

Commit a9790c1 (verified) · 1 Parent(s): 5228bdf

Notebook: add Phase 8b — training history (3,200 episodes, 6-run ablation table)

Files changed (1)

notebooks/colab_train.ipynb +39 -0

@@ -364,6 +364,45 @@
     "Image('/content/artifacts/evaluation/comparison_curve.png')\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Phase 8b \u2014 How we got here: training history (3,200 episodes, 6 GRPO runs)\n",
+    "\n",
+    "The Phase 3A LoRA you just evaluated isn't from one happy run \u2014 it's\n",
+    "the winner of a 6-run ablation sweep on HF Jobs. Each phase tested a\n",
+    "specific hypothesis about what was bottlenecking the previous run.\n",
+    "Total simulated: **3,200 GRPO training episodes + 3,240 evaluation\n",
+    "episodes**.\n",
+    "\n",
+    "| Phase | Hypothesis | Knobs vs prev | Eval mean (E/M/H) | Verdict |\n",
+    "|---|---|---|---|---|\n",
+    "| **0** | TRL defaults are fine on 1.5B | Qwen-1.5B, LR 5e-6, EASY-only, 400 steps, r=16 | \u2212251 / \u2212315 / \u2212826 | flat reward, KL 0.14 \u2014 undertrained |\n",
+    "| **1** | gradient too small | LR 5e-6 \u2192 **2e-5** (4\u00d7) | \u2212218 / \u2212283 / \u2212820 | KL 4\u00d7 higher; rogue catch lost (20% \u2192 0%) |\n",
+    "| **2** | model never sees harder scenarios | + **easy:200,medium:200,hard:200 curriculum**, r=32, 600 steps | \u2212221 / \u2212296 / \u2212834 | resolution 10/40/0%; HARD-tier *training* reward (\u22124.4) > EASY (\u22126.1); **1.5B capacity-bound** |\n",
+    "| **3A \ud83c\udfc6** | bigger model + amplified oversight signal | **Qwen-1.5B \u2192 Qwen-3B**, **rogue-rubric \u00d72** (catch +100, FP \u2212150), temperature 0.8 | **+49 / \u221217 / \u2212433** | 85/80/40% solve, **100% rogue+ on MED, 93% on HARD**, MTTR matches Oracle. \ud83c\udfc6 SUBMITTED |\n",
+    "| 3B (control) | reward shape alone is enough on 1.5B | Phase 2 + rogue-rubric \u00d72 | \u2212222 / \u2212269 / \u2212813 | falsified: 1.5B doesn't transfer the rogue signal |\n",
+    "| 3C (control) | reverse curriculum (HARD \u2192 MED \u2192 EASY) helps EASY | Phase 3B + reverse curriculum | \u2212241 / \u2212363 / \u2212821 | falsified: reverse curriculum strictly hurts |\n",
+    "\n",
+    "The **Phase 3B vs Phase 3A delta indicates model capacity was the binding constraint**:\n",
+    "the same reward shape got nowhere on Qwen-1.5B; on Qwen-3B it broke through.\n",
+    "All six LoRAs are public on the Hub (`chaosops-grpo-lora{,-p1,-p2,-p3a,-p3b,-p3c}`)\n",
+    "so the ablation is independently reproducible.\n",
+    "\n",
+    "Phase 3A's per-tier *training* reward stream (60 log points, every 10 steps):\n",
+    "\n",
+    "| Tier (steps) | Mean | Min | Max | Last |\n",
+    "|---|---|---|---|---|\n",
+    "| EASY (1\u2013200) | **+6.90** | \u22121.01 | +17.14 | +4.95 |\n",
+    "| MEDIUM (201\u2013400) | **+12.68** | +2.96 | +30.75 | +13.49 |\n",
+    "| HARD (401\u2013600) | **+14.00** | +4.94 | +30.33 | +16.28 |\n",
+    "\n",
+    "All three tiers ended with positive mean reward, and the harder the\n",
+    "tier the higher the mean \u2014 the curriculum effect *compounded*. Final\n",
+    "KL to base 0.595, sustained.\n"
+   ]
+  },
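Reviewer note: the figures in the added cell can be cross-checked with a few lines of Python. The sketch below is not part of the commit. It copies the eval means and per-tier training means from the two tables and recomputes the Phase 3A vs Phase 3B gap, the best run, and the step total. The names `EVAL_MEANS`, `TRAIN_MEANS_3A`, `STEPS`, `tier_deltas`, and `best_run` are hypothetical. The step counts for Phases 1, 3B, and 3C are assumptions (inherited from the run each one builds on), and reading the 3,200-episode total as the sum of steps assumes one episode per step, which the notebook does not state.

```python
# Eval means per run as (EASY, MEDIUM, HARD), copied from the ablation table.
EVAL_MEANS = {
    "0": (-251, -315, -826),
    "1": (-218, -283, -820),
    "2": (-221, -296, -834),
    "3A": (49, -17, -433),
    "3B": (-222, -269, -813),
    "3C": (-241, -363, -821),
}

# Phase 3A per-tier training reward means, copied from the second table.
TRAIN_MEANS_3A = {"EASY": 6.90, "MEDIUM": 12.68, "HARD": 14.00}

# Steps per run. Phases 0, 2 and 3A are given in the notebook text;
# Phases 1, 3B and 3C are ASSUMED to inherit the count of their parent run.
STEPS = {"0": 400, "1": 400, "2": 600, "3A": 600, "3B": 600, "3C": 600}


def tier_deltas(run, baseline):
    """Per-tier eval-mean difference (run minus baseline)."""
    return tuple(a - b for a, b in zip(EVAL_MEANS[run], EVAL_MEANS[baseline]))


def best_run():
    """Run with the highest eval mean summed over the three tiers."""
    return max(EVAL_MEANS, key=lambda name: sum(EVAL_MEANS[name]))


if __name__ == "__main__":
    print("3A minus 3B per tier:", tier_deltas("3A", "3B"))
    print("best run:", best_run())
    print("total steps:", sum(STEPS.values()))
    means = list(TRAIN_MEANS_3A.values())
    print("training means rise with tier:", means == sorted(means))
```

Under those assumptions the step total matches the 3,200 training episodes quoted in the cell, and the Phase 3A vs Phase 3B gap is positive on every tier.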
   {
    "cell_type": "markdown",
    "metadata": {},