Mircea Rusu committed
Commit: cd33a35 · Parent(s): b7e5f5f

docs: Rebirth Edition (v2) - document v1 crisis at step 1390, nine-critic analysis, v2 fresh start with full optimizer saves, current LIVE val=7.5178 NEW BEST at step 100, step-500 watchdog, link to v5 whitepaper Part VII

README.md CHANGED
@@ -38,13 +38,17 @@ model-index:
      type: loss
      value: 2.5152
      verified: true
-     - name: Validation Loss (realignment best)
-       type: loss
-       value:
-       verified: true
-     - name:
-       type: loss
-       value:
-       verified: true
  ---
@@ -69,6 +73,38 @@ Trained from zero on **AMD Instinct MI300X** (192 GB HBM3) · ROCm 7.0 · Knowle

  ---

  ## 🎯 What is Sentinel Prime? (Simple Version)

  > **Imagine building a brain from scratch.**
      type: loss
      value: 2.5152
      verified: true
+     - name: Validation Loss (realignment v2 best)
+       type: loss
+       value: 7.5178
+       verified: true
+     - name: Training Loss (realignment v2 latest)
+       type: loss
+       value: 6.96
+       verified: true
+     - name: Validation Loss (realignment v1 best, abandoned)
+       type: loss
+       value: 5.773
+       verified: true
  ---

---

---

## 🌅 Update — April 29, 2026 — Rebirth Edition (v2)

**The realignment was restarted from scratch.** The original v1 run reached `val_loss=5.773` at step 1,000, then collapsed: a SIGTERM crash exposed that our `best.pt` checkpoints had been saved without optimizer state. Five cascading restart attempts each erased AdamW's momentum and variance accumulators, and by step 1,390 the loss had climbed to 8.24 — worse than step 200 of the same run. We killed the patient.
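The failure mode above, resuming from a model-only `best.pt` and silently re-initializing AdamW's moments, is avoided by checkpointing the full training state. A minimal pure-Python sketch of the idea, using a plain dict and `pickle` rather than the project's actual PyTorch code; `save_checkpoint` and the dict layout are illustrative:

```python
import os
import pickle
import tempfile

def save_checkpoint(path, model_state, optim_state, ema_state, step):
    """Persist everything needed to resume training exactly.

    A model-only checkpoint loses the optimizer moments (AdamW's m and v),
    which is the v1 failure mode described above.
    """
    with open(path, "wb") as f:
        pickle.dump({
            "model": model_state,       # network weights
            "optimizer": optim_state,   # AdamW exp_avg / exp_avg_sq
            "ema": ema_state,           # shadow weights for inference
            "step": step,               # resume point for the LR schedule
        }, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy round-trip: the optimizer moments survive a save/load cycle.
ckpt_path = os.path.join(tempfile.mkdtemp(), "latest.pt")
save_checkpoint(
    ckpt_path,
    model_state={"w": [0.1, 0.2]},
    optim_state={"exp_avg": [0.01, 0.02], "exp_avg_sq": [1e-4, 4e-4]},
    ema_state={"w": [0.11, 0.19]},
    step=100,
)
state = load_checkpoint(ckpt_path)
assert state["optimizer"]["exp_avg"] == [0.01, 0.02]  # momentum preserved
```

In the real run this corresponds to the `latest.pt` layout (model + AdamW m/v + EMA + step) that v2 now writes every 100 steps.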
We then ran a forensic analysis with nine "critic personas" grounded in eleven published papers (AdamW bias correction, SGDR period doubling, ST-MoE stability, EMA stabilization, Switch Transformer router auxiliary losses, etc.) and distilled the failure into nine concrete engineering changes. Those changes were baked into **v2 from step zero**:
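One of those changes is a Switch Transformer style router auxiliary loss: with E experts, it is E · Σᵢ fᵢ·Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ the mean router probability mass on expert i, reaching its minimum of 1.0 under perfect balance. A minimal sketch, pure Python and illustrative only (the training code itself is not shown in this README):

```python
def load_balancing_aux_loss(router_probs, expert_assignments, num_experts):
    """Switch Transformer style load-balancing auxiliary loss.

    router_probs:       per-token probability vectors (each of len num_experts)
    expert_assignments: the expert index each token was routed to
    Returns num_experts * sum_i(f_i * P_i); 1.0 under perfect balance.
    """
    n = len(expert_assignments)
    # f_i: fraction of tokens dispatched to expert i
    f = [expert_assignments.count(e) / n for e in range(num_experts)]
    # P_i: mean router probability assigned to expert i
    p = [sum(probs[e] for probs in router_probs) / n for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced 2-expert router hits the minimum:
probs = [[0.5, 0.5], [0.5, 0.5]]
assignments = [0, 1]
aux = load_balancing_aux_loss(probs, assignments, num_experts=2)
# aux == 2 * (0.5*0.5 + 0.5*0.5) == 1.0
```

In v2 this quantity is scaled by the aux_loss coefficient (0.05) before being added to the training loss, so an imbalanced router is actively penalized.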
- ✅ **Full optimizer state saved every 100 steps** (`latest.pt` now contains model + AdamW m/v + EMA + step)
- ✅ **aux_loss boosted 500×** (0.0001 → 0.05) to actively balance the router under frozen experts
- ✅ **Five SGDR cycles with period doubling** (T = 200, 400, 800, 1600, 1500) instead of a single cosine decay
- ✅ **EMA decay 0.9995, updated every 10 steps** for a smooth inference checkpoint
- ✅ **100-step linear warmup** before the 1e-4 LR peak (avoids the Kingma bias-correction trap)
- ✅ **ST-MoE expert LR scale 0.3**, +33% boost during ramps
- ✅ **Per-expert telemetry every 10 steps** (capacity factor, routing percentage, gradient norm)
- ✅ **Three checkpoint kinds** (latest full, best val, EMA inference)
- ✅ **Step-500 watchdog** auto-kills training on gnorm > 20, a loss spike > 1.5×, NaN, or any expert share < 5%
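The warmup and SGDR items above combine into a single learning-rate schedule: 100 linear warmup steps to the 1e-4 peak, then cosine annealing with warm restarts over the listed cycle lengths. A sketch as a plain function of the global step; it assumes each cycle anneals to zero before restarting at the peak (the README does not state a floor LR), and `lr_at` is our name:

```python
import math

PEAK_LR = 1e-4
WARMUP_STEPS = 100
CYCLES = [200, 400, 800, 1600, 1500]  # SGDR periods from the bullet above

def lr_at(step):
    """Learning rate at a global step: linear warmup, then SGDR restarts."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    t = step - WARMUP_STEPS
    for period in CYCLES:
        if t < period:
            # Cosine anneal from PEAK_LR toward 0 within this cycle.
            return 0.5 * PEAK_LR * (1 + math.cos(math.pi * t / period))
        t -= period
    return 0.0  # past the final cycle

assert lr_at(99) == PEAK_LR                     # warmup ends at the peak
assert lr_at(100 + 100) < lr_at(100)            # decaying mid-cycle
assert abs(lr_at(100 + 200) - PEAK_LR) < 1e-12  # restart back to the peak
```

Because the full optimizer state now includes the step counter, a resume lands back on the correct point of this schedule instead of restarting the warmup.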
**Current v2 state (LIVE, step 150):**

| Metric | v1 final (step 1,390) | v2 step 100 | v2 step 150 |
|---|---|---|---|
| train_loss | 8.24 ❌ | 7.6277 | **6.9624** |
| val_loss | ~8.0 ❌ | **7.5178 ★ NEW BEST** | (next eval @ 200) |
| perplexity | ~3,800 | 2,054 | **1,056** |
| gnorm | 9.72 ⚠ | 6.53 (peak LR) | 3.73 |
| optimizer in ckpt | ❌ | ✅ | ✅ |
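The step-500 watchdog's kill conditions can be sketched as a cheap predicate over recent telemetry. Thresholds come from the bullet list above; the function and argument names are illustrative, not the project's actual code:

```python
import math

GNORM_LIMIT = 20.0        # kill if gradient norm exceeds this
LOSS_SPIKE_FACTOR = 1.5   # kill if loss jumps past 1.5x recent average
MIN_EXPERT_SHARE = 0.05   # kill if any expert receives < 5% of tokens

def watchdog_verdict(gnorm, loss, recent_loss_avg, expert_shares):
    """Return a kill reason string, or None if training looks healthy."""
    if math.isnan(loss) or math.isnan(gnorm):
        return "NaN detected"
    if gnorm > GNORM_LIMIT:
        return f"gradient norm {gnorm:.2f} > {GNORM_LIMIT}"
    if loss > LOSS_SPIKE_FACTOR * recent_loss_avg:
        return f"loss spike: {loss:.3f} > {LOSS_SPIKE_FACTOR}x recent average"
    starved = [i for i, s in enumerate(expert_shares) if s < MIN_EXPERT_SHARE]
    if starved:
        return f"experts starved below 5% share: {starved}"
    return None

# Healthy step-150 telemetry (numbers from the table above) passes.
assert watchdog_verdict(3.73, 6.9624, 7.0, [0.26, 0.24, 0.25, 0.25]) is None
# A collapsed router trips the expert-share check.
assert watchdog_verdict(3.73, 6.9624, 7.0, [0.90, 0.04, 0.03, 0.03]) is not None
```

Tripping any check aborts the run early, instead of letting a divergence burn hundreds of GPU-hours as v1 did.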
At one-ninth of v1's step count, v2 has already beaten v1's final losses, and we are still in Phase 1 (frozen experts). The full story — including the verbatim crash logs, the nine-critic analysis, and our **Universal Fusion Thesis** for using Frankenstein-style component-level transplants to cut model-training carbon by 50–100× — is in the [v5 whitepaper](https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/SentinelBrain-14B-MoE-Dashboard/blob/main/static/whitepaper.html#part-vii).

---

## 🎯 What is Sentinel Prime? (Simple Version)

> **Imagine building a brain from scratch.**