Mircea Rusu committed on
Commit
b7e5f5f
·
1 Parent(s): 8f58f08

docs: step 1320 recovery data, multi-cycle SGDR, eval history

Files changed (1)
  1. README.md +26 -14
README.md CHANGED
@@ -44,7 +44,7 @@ model-index:
44
  verified: true
45
  - name: Training Loss (realignment latest)
46
  type: loss
47
- value: 6.76
48
  verified: true
49
  ---
50
 
@@ -475,8 +475,10 @@ Loss
475
  | 📈 Realign recovery | 600 | 7.20 | val=6.93 (recovering) |
476
  | ⭐ Realign best | 1,000 | 6.07 | val=**5.773 ★ NEW BEST** |
477
  | SIGTERM crash + restart | 1,001→1,123 | 9.22→6.64 | Fresh optimizer, cold momentum |
478
- | 🔥 SGDR warm restart | 1,124 | 9.09→7.86 | LR boosted 2.9e-5→4.5e-5, 30× faster descent |
479
- | 🔄 Current (live) | ~1,150 | ~7.86 | SGDR cosine decay phase, recovering fast |
 
 
480
  | **Total pretrain** | | | **11.72 → 1.99 (−83%)** |
481
  | **Realignment** | | | **15.87 → 5.77 (val, −64%)** |
482
 
@@ -484,16 +486,19 @@ Loss
484
 
485
  | Metric | Value |
486
  |:--|:--|
487
- | **Current Phase** | ⚡ Corpus Realignment + SGDR Warm Restart |
488
- | **Current Step** | ~1,150 / 5,000 |
489
- | **Training Loss** | ~7.86 (SGDR recovery, 30× faster descent vs pre-restart) |
490
  | **Best Validation Loss** | **5.773** (step 1,000) ★ |
491
- | **Throughput** | 5,806 tokens/second |
 
492
  | **VRAM Used** | 120 GB / 206 GB (58%), all experts unfrozen |
493
- | **Total Tokens Processed** | ~226M (this run) + 178M (pretrain) |
494
  | **Experts Active** | All 4 unfrozen since step 500 |
495
- | **SGDR Status** | Peak LR 4.5e-5, cosine decay → rejoin normal schedule at step 1300 |
496
- | **ETA** | ~36 hours |
 
 
497
 
498
  ### Realignment Eval History
499
 
@@ -504,9 +509,12 @@ Loss
504
  | 700 | 6.24 | 515 | Converging |
505
  | 800 | 6.01 | 407 | Converging |
506
  | 900 | 5.91 | 367 | Converging |
507
- | **1,000** | **5.773** | **321** | **★ NEW BEST** |
508
- | 1,100 | 6.55 | 701 | Optimizer restart recovery |
509
- | 1,200 | *pending* | *pending* | SGDR warm restart active |
 
 
 
510
 
511
  ### Published Checkpoint (v0.1)
512
 
@@ -530,11 +538,15 @@ Training a 14.4B model on a single GPU for days demands bullet-proof infrastruct
530
  | **Rollback anchors** | `best.pt` (model-only) + `latest.pt` (full state) + `.LOCKED` safety copy |
531
  | **Emergency save** | SIGTERM/SIGINT handlers serialize full state before exit |
532
  | **Watchdog** | Independent process monitors loss EMA, restarts on NaN/divergence |
533
- | **SGDR warm restart** | After optimizer cold-start, cosine warm restart (Loshchilov & Hutter, 2017) to recover 30× faster |
 
 
534
  | **Systemd auto-restart** | Dashboard + watchdog survive OOM kills with `Restart=always` + `OOMScoreAdjust=-500` |
535
 
536
  **Battle-tested**: At step 1001, a SIGTERM killed the process mid-step. The checkpoint at step 1000 was corrupted (bad zip archive). The system automatically fell back to `best.pt` (val=5.773), resumed at step 1001 with a fresh optimizer, detected the cold-start plateau via the watchdog, and applied an SGDR warm restart, recovering 30× faster than natural momentum rebuilding.
537
 
 
 
538
  ---
539
 
540
  ## 🌡️ Consciousness Metric (Φ): Deep Dive
 
44
  verified: true
45
  - name: Training Loss (realignment latest)
46
  type: loss
47
+ value: 6.26
48
  verified: true
49
  ---
50
 
 
475
  | 📈 Realign recovery | 600 | 7.20 | val=6.93 (recovering) |
476
  | ⭐ Realign best | 1,000 | 6.07 | val=**5.773 ★ NEW BEST** |
477
  | SIGTERM crash + restart | 1,001→1,123 | 9.22→6.64 | Fresh optimizer, cold momentum |
478
+ | 🔥 SGDR warm restart (Cycle 0) | 1,124 | 9.09→6.24 | LR boosted 2.9e-5→4.5e-5, T=200 steps |
479
+ | 📊 Recovery eval | 1,200 | 6.74 | val=6.48, AdamW variance at ~17% convergence |
480
+ | 📊 Recovery eval | 1,300 | 6.24 | val=**6.10** (new recovery best) |
481
+ | 🔄 Multi-cycle SGDR | 1,301+ | 8.79→↓ | Cycle 1 pending (step 1400, T=400, peak 3.8e-5) |
482
  | **Total pretrain** | | | **11.72 → 1.99 (−83%)** |
483
  | **Realignment** | | | **15.87 → 5.77 (val, −64%)** |
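
The "AdamW variance at ~17% convergence" note in the table above can be sanity-checked with a short sketch. It assumes that "convergence" means the fraction of exponential-moving-average weight the second-moment estimate has accumulated since the fresh optimizer started, i.e. $1 - \beta_2^t$ with $\beta_2 = 0.999$; the function name is hypothetical and not taken from the training code.

```python
def second_moment_warmup(steps_since_restart: int, beta2: float = 0.999) -> float:
    """Fraction of EMA weight accumulated by Adam's second-moment estimate.

    Assumption: this is what the README calls "convergence". It equals the
    bias-correction denominator 1 - beta2**t used by Adam/AdamW.
    """
    return 1.0 - beta2 ** steps_since_restart


# The optimizer was re-created at step 1,001, so the step-1,200 eval is
# roughly 200 optimizer steps after the cold start.
frac_at_1200 = second_moment_warmup(200)
frac_at_time_constant = second_moment_warmup(1000)
print(f"after  200 steps: {frac_at_1200:.1%}")
print(f"after 1000 steps: {frac_at_time_constant:.1%}")
```

Under these assumptions the estimate lands in the high-teens percent range at step 1,200 and near 63% after 1,000 steps, which is in line with the figures quoted in the README.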
484
 
 
486
 
487
  | Metric | Value |
488
  |:--|:--|
489
+ | **Current Phase** | ⚡ Corpus Realignment + Multi-Cycle SGDR |
490
+ | **Current Step** | ~1,320 / 5,000 |
491
+ | **Training Loss** | ~6.26 (recovering from optimizer cold-start) |
492
  | **Best Validation Loss** | **5.773** (step 1,000) ★ |
493
+ | **Recovery Val Loss** | 6.096 (step 1,300), gap closing |
494
+ | **Throughput** | 5,857 tokens/second |
495
  | **VRAM Used** | 120 GB / 206 GB (58%), all experts unfrozen |
496
+ | **Total Tokens Processed** | ~260M (this run) + 178M (pretrain) |
497
  | **Experts Active** | All 4 unfrozen since step 500 |
498
+ | **SGDR Status** | Multi-cycle: Cycle 1 at step 1400 (T=400, peak 3.8e-5), Cycle 2 at step 2000 (T=800, peak 3.0e-5) |
499
+ | **MIN_LR** | 1.5e-5 (raised from 1e-5, prevents stagnation) |
500
+ | **Expert LR Boost** | 1.33× during restart windows |
501
+ | **ETA** | ~34 hours |
502
 
503
  ### Realignment Eval History
504
 
 
509
  | 700 | 6.24 | 515 | Converging |
510
  | 800 | 6.01 | 407 | Converging |
511
  | 900 | 5.91 | 367 | Converging |
512
+ | **1,000** | **5.773** | **321** | **★ ALL-TIME BEST** |
513
+ | 1,100 | 6.55 | 701 | Optimizer cold-start |
514
+ | 1,200 | 6.48 | 652 | SGDR Cycle 0 (recovering) |
515
+ | 1,300 | **6.096** | **444** | Recovery best, gap=5.5% to peak |
516
+ | 1,400 | *pending* | | SGDR Cycle 1 starts (T=400) |
517
+ | 2,000 | *pending* | | SGDR Cycle 2 starts (T=800) |
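
The third column of this table is consistent with perplexity computed as `exp(val_loss)`, and the "gap" figure with the relative distance to the best validation loss. The sketch below reproduces both under that assumption; the column headers are not visible in this diff, so the interpretation and the helper names are hypothetical.

```python
import math

BEST_VAL_LOSS = 5.773  # step 1,000, quoted in the table


def perplexity(val_loss: float) -> float:
    """Perplexity for a natural-log cross-entropy loss (assumed convention)."""
    return math.exp(val_loss)


def relative_gap(val_loss: float, best: float = BEST_VAL_LOSS) -> float:
    """Relative distance of a validation loss from the best one seen so far."""
    return (val_loss - best) / best


ppl_1300 = perplexity(6.096)
gap_1300 = relative_gap(6.096)
print(f"step 1,300: ppl ~ {ppl_1300:.0f}, gap to best ~ {gap_1300:.1%}")
```

Computing the gap this way gives a value between 5% and 6%, close to the 5.5% quoted in the step-1,300 row; the exact figure depends on rounding and on which loss is used as the denominator.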
518
 
519
  ### Published Checkpoint (v0.1)
520
 
 
538
  | **Rollback anchors** | `best.pt` (model-only) + `latest.pt` (full state) + `.LOCKED` safety copy |
539
  | **Emergency save** | SIGTERM/SIGINT handlers serialize full state before exit |
540
  | **Watchdog** | Independent process monitors loss EMA, restarts on NaN/divergence |
541
+ | **Multi-cycle SGDR** | Period-doubling warm restarts (Loshchilov & Hutter, 2017): Cycle 0 (T=200), Cycle 1 (T=400), Cycle 2 (T=800) |
542
+ | **Expert LR boost** | During restart windows, expert LR scale increases 0.3→0.4 (ST-MoE stability guideline) |
543
+ | **MIN_LR floor** | Raised from 1e-5 to 1.5e-5 to prevent cosine decay stagnation in recovery |
544
  | **Systemd auto-restart** | Dashboard + watchdog survive OOM kills with `Restart=always` + `OOMScoreAdjust=-500` |
545
 
546
  **Battle-tested**: At step 1001, a SIGTERM killed the process mid-step. The checkpoint at step 1000 was corrupted (bad zip archive). The system automatically fell back to `best.pt` (val=5.773), resumed at step 1001 with a fresh optimizer, detected the cold-start plateau via the watchdog, and applied an SGDR warm restart, recovering 30× faster than natural momentum rebuilding.
547
 
548
+ **Multi-Cycle SGDR (April 29, 2026)**: After the initial SGDR Cycle 0 completed (steps 1100-1300), analysis showed that a single restart was insufficient to escape the recovery basin (val=6.10 vs target 5.77). Following the period-doubling strategy of the original SGDR paper ($T_{i+1} = T_i \times 2$), we added Cycle 1 (steps 1400-1800, peak 3.8e-5) and Cycle 2 (steps 2000-2800, peak 3.0e-5). The AdamW second moment ($\beta_2=0.999$) has a time constant of $1/(1-\beta_2) = 1000$ steps, so it needs ~1000 steps to reach 63% convergence. These cycles provide periodic "shocks" to escape local basins while the variance estimate matures.
549
+
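
The multi-cycle schedule described above can be sketched as a plain function of the step count. This is a minimal illustration, not the project's training code: the cycle starts, periods, peaks, and the 1.5e-5 floor are the values quoted in the README, while the function name, the `CYCLES` table, and the fallback to `base_lr` between restart windows are assumptions. Cycle 0 uses the step-1,124 start from the loss table; the paragraph above quotes steps 1100-1300 for the same cycle.

```python
import math

MIN_LR = 1.5e-5  # floor quoted in the README

# (start_step, period_T, peak_lr) as quoted in the README.
CYCLES = [
    (1124, 200, 4.5e-5),  # Cycle 0 (start step taken from the loss table)
    (1400, 400, 3.8e-5),  # Cycle 1
    (2000, 800, 3.0e-5),  # Cycle 2
]


def sgdr_lr(step: int, base_lr: float = MIN_LR) -> float:
    """Learning rate at `step` under period-doubling cosine warm restarts.

    Inside a restart window the LR follows half a cosine from the cycle's
    peak down to MIN_LR. Outside every window it returns `base_lr`, which
    stands in for whatever the normal schedule would give (assumption).
    """
    for start, period, peak in CYCLES:
        if start <= step < start + period:
            progress = (step - start) / period
            return MIN_LR + 0.5 * (peak - MIN_LR) * (1.0 + math.cos(math.pi * progress))
    return base_lr


for s in (1400, 1600, 1799, 1900, 2000):
    print(s, f"{sgdr_lr(s):.2e}")
```

Each cycle starts at its peak, decays toward the floor over `T` steps, and each `T` is twice the previous one. For a single fixed peak, PyTorch's `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts` with `T_mult=2` and `eta_min` offers the same shape, but it does not lower the peak per cycle or leave gaps between cycles, which is why the sketch uses an explicit table.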
550
  ---
551
 
552
  ## 🌡️ Consciousness Metric (Φ): Deep Dive