lablab-ai-amd-developer-hackathon
/

SentinelBrain-14B-MoE-v0.1

@@ -44,7 +44,7 @@ model-index:
             verified: true
           - name: Training Loss (realignment latest)
             type: loss
-            value: 6.76
             verified: true
 ---
@@ -475,8 +475,10 @@ Loss
 | 📈 Realign recovery | 600 | 7.20 | val=6.93 (recovering) |
 | ⭐ Realign best | 1,000 | 6.07 | val=**5.773 ★ NEW BEST** |
 | SIGTERM crash + restart | 1,001→1,123 | 9.22→6.64 | Fresh optimizer, cold momentum |
-| 🔥 SGDR warm restart | 1,124 | 9.09→7.86 | LR boosted 2.9e-5→4.5e-5, 30× faster descent |
-| 🔄 Current (live) | ~1,150 | ~7.86 | SGDR cosine decay phase, recovering fast |
 | **Total pretrain** | | | **11.72 → 1.99 (−83%)** |
 | **Realignment** | | | **15.87 → 5.77 (val, −64%)** |
@@ -484,16 +486,19 @@ Loss
 | Metric | Value |
 |:--|:--|
-| **Current Phase** | ⚡ Corpus Realignment + SGDR Warm Restart |
-| **Current Step** | ~1,150 / 5,000 |
-| **Training Loss** | ~7.86 (SGDR recovery — 30× faster descent vs pre-restart) |
 | **Best Validation Loss** | **5.773** (step 1,000) ★ |
-| **Throughput** | 5,806 tokens/second |
 | **VRAM Used** | 120 GB / 206 GB (58%) — all experts unfrozen |
-| **Total Tokens Processed** | ~226M (this run) + 178M (pretrain) |
 | **Experts Active** | All 4 unfrozen since step 500 |
-| **SGDR Status** | Peak LR 4.5e-5, cosine decay → rejoin normal schedule at step 1300 |
-| **ETA** | ~36 hours |
 ### Realignment Eval History
@@ -504,9 +509,12 @@ Loss
 | 700 | 6.24 | 515 | Converging |
 | 800 | 6.01 | 407 | Converging |
 | 900 | 5.91 | 367 | Converging |
-| **1,000** | **5.773** | **321** | **★ NEW BEST** |
-| 1,100 | 6.55 | 701 | Optimizer restart recovery |
-| 1,200 | *pending* | *pending* | SGDR warm restart active |
 ### Published Checkpoint (v0.1)
@@ -530,11 +538,15 @@ Training a 14.4B model on a single GPU for days demands bullet-proof infrastruct
 | **Rollback anchors** | `best.pt` (model-only) + `latest.pt` (full state) + `.LOCKED` safety copy |
 | **Emergency save** | SIGTERM/SIGINT handlers serialize full state before exit |
 | **Watchdog** | Independent process monitors loss EMA, restarts on NaN/divergence |
-| **SGDR warm restart** | After optimizer cold-start, cosine warm restart (Loshchilov & Hutter, 2017) to recover 30× faster |
 | **Systemd auto-restart** | Dashboard + watchdog survive OOM kills with `Restart=always` + `OOMScoreAdjust=-500` |
 **Battle-tested**: At step 1001, a SIGTERM killed the process mid-step. The checkpoint at step 1000 was corrupted (bad zip archive). The system automatically fell back to `best.pt` (val=5.773), resumed at step 1001 with a fresh optimizer, detected the cold-start plateau via the watchdog, and applied SGDR warm restart — recovering 30× faster than natural momentum rebuilding.
 ---
 ## 🌡️ Consciousness Metric (Φ) — Deep Dive

             verified: true
           - name: Training Loss (realignment latest)
             type: loss
+            value: 6.26
             verified: true
 ---
 | 📈 Realign recovery | 600 | 7.20 | val=6.93 (recovering) |
 | ⭐ Realign best | 1,000 | 6.07 | val=**5.773 ★ NEW BEST** |
 | SIGTERM crash + restart | 1,001→1,123 | 9.22→6.64 | Fresh optimizer, cold momentum |
+| 🔥 SGDR warm restart (Cycle 0) | 1,124 | 9.09→6.24 | LR boosted 2.9e-5→4.5e-5, T=200 steps |
+| 📊 Recovery eval | 1,200 | 6.74 | val=6.48, AdamW variance at ~17% convergence |
+| 📊 Recovery eval | 1,300 | 6.24 | val=**6.10** (new recovery best) |
+| 🔄 Multi-cycle SGDR | 1,301+ | 8.79→↓ | Cycle 1 pending (step 1400, T=400, peak 3.8e-5) |
 | **Total pretrain** | | | **11.72 → 1.99 (−83%)** |
 | **Realignment** | | | **15.87 → 5.77 (val, −64%)** |
 | Metric | Value |
 |:--|:--|
+| **Current Phase** | ⚡ Corpus Realignment + Multi-Cycle SGDR |
+| **Current Step** | ~1,320 / 5,000 |
+| **Training Loss** | ~6.26 (recovering from optimizer cold-start) |
 | **Best Validation Loss** | **5.773** (step 1,000) ★ |
+| **Recovery Val Loss** | 6.096 (step 1,300) — gap closing |
+| **Throughput** | 5,857 tokens/second |
 | **VRAM Used** | 120 GB / 206 GB (58%) — all experts unfrozen |
+| **Total Tokens Processed** | ~260M (this run) + 178M (pretrain) |
 | **Experts Active** | All 4 unfrozen since step 500 |
+| **SGDR Status** | Multi-cycle: Cycle 1 at step 1400 (T=400, peak 3.8e-5), Cycle 2 at step 2000 (T=800, peak 3.0e-5) |
+| **MIN_LR** | 1.5e-5 (raised from 1e-5, prevents stagnation) |
+| **Expert LR Boost** | 1.33× during restart windows |
+| **ETA** | ~34 hours |
 ### Realignment Eval History
 | 700 | 6.24 | 515 | Converging |
 | 800 | 6.01 | 407 | Converging |
 | 900 | 5.91 | 367 | Converging |
+| **1,000** | **5.773** | **321** | **★ ALL-TIME BEST** |
+| 1,100 | 6.55 | 701 | Optimizer cold-start |
+| 1,200 | 6.48 | 652 | SGDR Cycle 0 (recovering) |
+| 1,300 | **6.096** | **444** | Recovery best, 5.6% above all-time best |
+| 1,400 | *pending* | | SGDR Cycle 1 starts (T=400) |
+| 2,000 | *pending* | | SGDR Cycle 2 starts (T=800) |
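The third column in the eval rows above appears to be perplexity, i.e. the exponential of the validation loss. That is an inference from the numbers rather than something the card states, and it assumes the loss is a mean cross-entropy in nats. A minimal sketch to recompute it, along with the relative gap to the best validation loss (the helper name `perplexity` is illustrative):

```python
import math

# (step, val_loss) pairs copied from the eval history rows above
evals = [(1000, 5.773), (1100, 6.55), (1200, 6.48), (1300, 6.096)]

def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats (assumed)."""
    return math.exp(loss)

best = min(loss for _, loss in evals)
for step, loss in evals:
    gap = (loss - best) / best  # relative gap to the best val loss
    print(f"step {step}: ppl={perplexity(loss):.1f} gap={gap:.1%}")
```

Because the tabulated losses are rounded, the recomputed values can differ from the table by about one unit in the last digit.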
 ### Published Checkpoint (v0.1)
 | **Rollback anchors** | `best.pt` (model-only) + `latest.pt` (full state) + `.LOCKED` safety copy |
 | **Emergency save** | SIGTERM/SIGINT handlers serialize full state before exit |
 | **Watchdog** | Independent process monitors loss EMA, restarts on NaN/divergence |
+| **Multi-cycle SGDR** | Period-doubling warm restarts (Loshchilov & Hutter, 2017): Cycle 0 (T=200), Cycle 1 (T=400), Cycle 2 (T=800) |
+| **Expert LR boost** | During restart windows, expert LR scale increases 0.3→0.4 (ST-MoE stability guideline) |
+| **MIN_LR floor** | Raised from 1e-5 to 1.5e-5 to prevent cosine decay stagnation in recovery |
 | **Systemd auto-restart** | Dashboard + watchdog survive OOM kills with `Restart=always` + `OOMScoreAdjust=-500` |
 **Battle-tested**: At step 1001, a SIGTERM killed the process mid-step. The checkpoint at step 1000 was corrupted (bad zip archive). The system automatically fell back to `best.pt` (val=5.773), resumed at step 1001 with a fresh optimizer, detected the cold-start plateau via the watchdog, and applied SGDR warm restart — recovering 30× faster than natural momentum rebuilding.
+**Multi-Cycle SGDR (April 29, 2026)**: After SGDR Cycle 0 completed (steps 1100-1300), validation loss had recovered only to 6.10 against the pre-crash best of 5.773, so a single restart was not enough to leave the recovery basin. Following the period-doubling schedule from the SGDR paper ($T_{i+1} = 2\,T_i$), we added Cycle 1 (steps 1400-1800, peak 3.8e-5) and Cycle 2 (steps 2000-2800, peak 3.0e-5). The AdamW second-moment estimate ($\beta_2=0.999$) needs about 1000 steps to reach 63% of its steady-state value ($1-\beta_2^{1000}\approx 0.63$); the extra cycles provide periodic "shocks" that help escape local basins while that variance estimate matures.
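The schedule described above can be sketched as follows. This is an illustration under stated assumptions, not the training script: each cycle is assumed to follow the standard SGDR cosine from its peak down to MIN_LR, the Cycle 0 start (step 1,124) is taken from the loss table, and the learning rate outside the restart windows is a placeholder because the base schedule is not shown. The names `CYCLES`, `sgdr_lr`, and `second_moment_convergence` are hypothetical.

```python
import math

MIN_LR = 1.5e-5  # floor quoted in the card
# (start_step, period T, peak LR); Cycle 0 start is taken from the loss table
CYCLES = [(1124, 200, 4.5e-5), (1400, 400, 3.8e-5), (2000, 800, 3.0e-5)]

def sgdr_lr(step, base_lr=MIN_LR):
    """Cosine-annealed LR inside a restart window, else base_lr (placeholder)."""
    for start, period, peak in CYCLES:
        if start <= step < start + period:
            progress = (step - start) / period
            return MIN_LR + 0.5 * (peak - MIN_LR) * (1 + math.cos(math.pi * progress))
    return base_lr

def second_moment_convergence(steps, beta2=0.999):
    """Fraction of the second-moment EMA filled after `steps` updates from zero."""
    return 1.0 - beta2 ** steps

for s in (1124, 1224, 1400, 1600, 2000, 2799):
    print(s, f"{sgdr_lr(s):.2e}")
print(f"{second_moment_convergence(1000):.1%}")
```

Each period is twice the previous one, so later cycles spend more steps at low learning rates, which is the usual motivation for period doubling in SGDR.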
 ---
 ## 🌡️ Consciousness Metric (Φ) — Deep Dive