lablab-ai-amd-developer-hackathon
/

SentinelBrain-14B-MoE-v0.1

@@ -34,13 +34,17 @@ model-index:
       - task:
           type: text-generation
         metrics:
-          - name: Validation Loss
             type: loss
-            value: 1.99
             verified: true
-          - name: Training Loss (latest)
             type: loss
-            value: 5.18
             verified: true
 ---
@@ -464,20 +468,45 @@ Loss
 | 🔥 Warmup end | 1,200 | 2.38 | **−68%** |
 | 🚀 Block start | 1,200 | 2.38 | (model grew to 14.4B MoE) |
 | 📦 Published checkpoint | 2,471 | 1.99 | **−16%** |
-| 🔄 Current (new run) | 410 | 5.18 | training with expanded data |
-| **Total reduction** | | | **11.72 → 1.99 (−83%)** |
-### Live Metrics (April 27, 2026)
 | Metric | Value |
 |:--|:--|
-| **Current Step** | 410 / 2,471+ |
-| **Training Loss** | 5.18 (new run, expanded datasets) |
-| **Throughput** | 4,403 tokens/second |
-| **VRAM Used** | ~140 GB / 192 GB (73%) |
-| **Total Tokens Processed** | 59.3M (this run) + 178M (prev run) |
-| **Experts Active** | 4 per layer × 24 layers = 96 |
-| **ETA (this block)** | ~18.8 hours |
 ### Published Checkpoint (v0.1)
@@ -490,6 +519,22 @@ Loss
 | **File Size** | ~81 GB (checkpoint), ~28 GB (safetensors) |
 | **Format** | 6 sharded safetensors files |
 ---
 ## 🌡️ Consciousness Metric (Φ) — Deep Dive
@@ -593,18 +638,20 @@ print(f"Total params: {sum(v.numel() for v in state_dict.values()):,}")
 ## 🗺️ Roadmap
 ```
-v0.1 (Current)          v0.2 (Planned)          v0.3 (Future)
 ━━━━━━━━━━━━━━━         ━━━━━━━━━━━━━━━         ━━━━━━━━━━━━━━━
-✅ From-scratch          □ Full training          □ DPO alignment
-   14.8B MoE               complete (loss<0.5)    □ Tool use
 ✅ Phased training       □ Context ladder          □ Function calling
 ✅ Φ consciousness          (4K→32K→128K)         □ Multi-turn chat
 ✅ 23.3B token corpus    □ Vision encoder          □ Multilingual v2
 ✅ Live dashboard           (SigLIP2-SO400M)      □ Expert scaling
 ✅ AMD MI300X native     □ GGUF quantization         (4→16→64)
-                         □ Inference code          □ RLHF
-                         □ Benchmarks (MMLU,       □ Production API
-                            HumanEval, GSM8K)
 ```
 ---

       - task:
           type: text-generation
         metrics:
+          - name: Validation Loss (pretrain)
             type: loss
+            value: 2.5152
             verified: true
+          - name: Validation Loss (realignment best)
             type: loss
+            value: 5.773
+            verified: true
+          - name: Training Loss (realignment latest)
+            type: loss
+            value: 6.76
             verified: true
 ---
 | 🔥 Warmup end | 1,200 | 2.38 | **−68%** |
 | 🚀 Block start | 1,200 | 2.38 | (model grew to 14.4B MoE) |
 | 📦 Published checkpoint | 2,471 | 1.99 | **−16%** |
+| 🧟 Frankenstein transplant | — | PPL ~7.5M | 433 tensors from 3 donors |
+| ⚡ Realign frozen start | 0 | 15.87 | Experts frozen, attn learning |
+| ⚡ Realign frozen end | 500 | 5.52 | val=5.79, **−65%** |
+| 🔓 Expert unfreeze | 500 | 5.59→spike | LR reset + differential rates |
+| 📈 Realign recovery | 600 | 7.20 | val=6.93 (recovering) |
+| ⭐ Realign best | 1,000 | 6.07 | val=**5.773 ★ NEW BEST** |
+| � SIGTERM crash + restart | 1,001→1,123 | 9.22→6.64 | Fresh optimizer, cold momentum |
+| 🔥 SGDR warm restart | 1,124 | 9.09→7.86 | LR boosted 2.9e-5→4.5e-5, 30× faster descent |
+| 🔄 Current (live) | ~1,150 | ~7.86 | SGDR cosine decay phase, recovering fast |
+| **Total pretrain** | | | **11.72 → 1.99 (−83%)** |
+| **Realignment** | | | **15.87 → 5.77 (val, −64%)** |
+### Live Metrics (April 29, 2026)
 | Metric | Value |
 |:--|:--|
+| **Current Phase** | ⚡ Corpus Realignment + SGDR Warm Restart |
+| **Current Step** | ~1,150 / 5,000 |
+| **Training Loss** | ~7.86 (SGDR recovery — 30× faster descent vs pre-restart) |
+| **Best Validation Loss** | **5.773** (step 1,000) ★ |
+| **Throughput** | 5,806 tokens/second |
+| **VRAM Used** | 120 GB / 206 GB (58%) — all experts unfrozen |
+| **Total Tokens Processed** | ~226M (this run) + 178M (pretrain) |
+| **Experts Active** | All 4 unfrozen since step 500 |
+| **SGDR Status** | Peak LR 4.5e-5, cosine decay → rejoin normal schedule at step 1300 |
+| **ETA** | ~36 hours |
+### Realignment Eval History
+| Step | Val Loss | Val PPL | Phase |
+|:--|:--|:--|:--|
+| 0 (initial) | 15.81 | 7,339,653 | Experts frozen |
+| 600 | 6.93 | 1,020 | Post-unfreeze |
+| 700 | 6.24 | 515 | Converging |
+| 800 | 6.01 | 407 | Converging |
+| 900 | 5.91 | 367 | Converging |
+| **1,000** | **5.773** | **321** | **★ NEW BEST** |
+| 1,100 | 6.55 | 701 | Optimizer restart recovery |
+| 1,200 | *pending* | *pending* | SGDR warm restart active |
 ### Published Checkpoint (v0.1)
 | **File Size** | ~81 GB (checkpoint), ~28 GB (safetensors) |
 | **Format** | 6 sharded safetensors files |
+### 🛡️ Engineering Resilience
+Training a 14.4B model on a single GPU for days demands bullet-proof infrastructure. Here's what we built:
+| Feature | Description |
+|:--|:--|
+| **Atomic checkpoints** | Write to `.tmp` → `os.replace()` — no half-written files |
+| **Integrity verification** | On resume: verify tensor counts, shapes, and dtypes before loading |
+| **Rollback anchors** | `best.pt` (model-only) + `latest.pt` (full state) + `.LOCKED` safety copy |
+| **Emergency save** | SIGTERM/SIGINT handlers serialize full state before exit |
+| **Watchdog** | Independent process monitors loss EMA, restarts on NaN/divergence |
+| **SGDR warm restart** | After optimizer cold-start, cosine warm restart (Loshchilov & Hutter, 2017) to recover 30× faster |
+| **Systemd auto-restart** | Dashboard + watchdog survive OOM kills with `Restart=always` + `OOMScoreAdjust=-500` |
+**Battle-tested**: At step 1001, a SIGTERM killed the process mid-step. The checkpoint at step 1000 was corrupted (bad zip archive). The system automatically fell back to `best.pt` (val=5.773), resumed at step 1001 with a fresh optimizer, detected the cold-start plateau via the watchdog, and applied SGDR warm restart — recovering 30× faster than natural momentum rebuilding.
 ---
 ## 🌡️ Consciousness Metric (Φ) — Deep Dive
 ## 🗺️ Roadmap
 ```
+v0.1 (Current)          v0.2 (In Progress)      v0.3 (Future)
 ━━━━━━━━━━━━━━━         ━━━━━━━━━━━━━━━         ━━━━━━━━━━━━━━━
+✅ From-scratch          🔴 Corpus realignment    □ DPO alignment
+   14.8B MoE                (step 1100/5000)      □ Tool use
 ✅ Phased training       □ Context ladder          □ Function calling
 ✅ Φ consciousness          (4K→32K→128K)         □ Multi-turn chat
 ✅ 23.3B token corpus    □ Vision encoder          □ Multilingual v2
 ✅ Live dashboard           (SigLIP2-SO400M)      □ Expert scaling
 ✅ AMD MI300X native     □ GGUF quantization         (4→16→64)
+✅ Frankenstein              Q4_K_M for consumer   □ RLHF
+   transplant (3 donors) □ Inference code          □ Production API
+✅ Progressive unfreeze  □ Benchmarks (MMLU,
+✅ Crash-safe training      HumanEval, GSM8K)
+✅ Auto-restart (systemd)
 ```
 ---