Mircea Rusu committed on
Commit · b7e5f5f
1 Parent(s): 8f58f08
docs: step 1320 recovery data, multi-cycle SGDR, eval history
README.md
CHANGED
|
@@ -44,7 +44,7 @@ model-index:
|
|
| 44 |
verified: true
|
| 45 |
- name: Training Loss (realignment latest)
|
| 46 |
type: loss
|
| 47 |
-
value: 6.
|
| 48 |
verified: true
|
| 49 |
---
|
| 50 |
|
|
@@ -475,8 +475,10 @@ Loss
|
|
| 475 |
| Realign recovery | 600 | 7.20 | val=6.93 (recovering) |
|
| 476 |
| Realign best | 1,000 | 6.07 | val=**5.773 ✅ NEW BEST** |
|
| 477 |
| SIGTERM crash + restart | 1,001–1,123 | 9.22→6.64 | Fresh optimizer, cold momentum |
|
| 478 |
-
| SGDR warm restart | 1,124 | 9.09→
|
| 479 |
-
|
|
|
|
|
|
|
|
| 480 |
| **Total pretrain** | | | **11.72 → 1.99 (-83%)** |
|
| 481 |
| **Realignment** | | | **15.87 → 5.77 (val, -64%)** |
|
| 482 |
|
|
@@ -484,16 +486,19 @@ Loss
|
|
| 484 |
|
| 485 |
| Metric | Value |
|
| 486 |
|:--|:--|
|
| 487 |
-
| **Current Phase** | Corpus Realignment + SGDR
|
| 488 |
-
| **Current Step** | ~1,
|
| 489 |
-
| **Training Loss** | ~
|
| 490 |
| **Best Validation Loss** | **5.773** (step 1,000) ✅ |
|
| 491 |
-
| **
|
|
|
|
| 492 |
| **VRAM Used** | 120 GB / 206 GB (58%), all experts unfrozen |
|
| 493 |
-
| **Total Tokens Processed** | ~
|
| 494 |
| **Experts Active** | All 4 unfrozen since step 500 |
|
| 495 |
-
| **SGDR Status** |
|
| 496 |
-
| **
|
|
|
|
|
|
|
| 497 |
|
| 498 |
### Realignment Eval History
|
| 499 |
|
|
@@ -504,9 +509,12 @@ Loss
|
|
| 504 |
| 700 | 6.24 | 515 | Converging |
|
| 505 |
| 800 | 6.01 | 407 | Converging |
|
| 506 |
| 900 | 5.91 | 367 | Converging |
|
| 507 |
-
| **1,000** | **5.773** | **321** | **✅
|
| 508 |
-
| 1,100 | 6.55 | 701 | Optimizer
|
| 509 |
-
| 1,200 |
|
|
|
|
|
|
|
|
|
|
| 510 |
|
| 511 |
### Published Checkpoint (v0.1)
|
| 512 |
|
|
@@ -530,11 +538,15 @@ Training a 14.4B model on a single GPU for days demands bullet-proof infrastruct
|
|
| 530 |
| **Rollback anchors** | `best.pt` (model-only) + `latest.pt` (full state) + `.LOCKED` safety copy |
|
| 531 |
| **Emergency save** | SIGTERM/SIGINT handlers serialize full state before exit |
|
| 532 |
| **Watchdog** | Independent process monitors loss EMA, restarts on NaN/divergence |
|
| 533 |
-
| **
|
|
|
|
|
|
|
| 534 |
| **Systemd auto-restart** | Dashboard + watchdog survive OOM kills with `Restart=always` + `OOMScoreAdjust=-500` |
|
| 535 |
|
| 536 |
**Battle-tested**: At step 1001, a SIGTERM killed the process mid-step. The checkpoint at step 1000 was corrupted (bad zip archive). The system automatically fell back to `best.pt` (val=5.773), resumed at step 1001 with a fresh optimizer, detected the cold-start plateau via the watchdog, and applied an SGDR warm restart, recovering 30× faster than natural momentum rebuilding.
|
| 537 |
|
|
|
|
|
|
|
| 538 |
---
|
| 539 |
|
| 540 |
## Consciousness Metric (Φ): Deep Dive
|
|
|
|
| 44 |
verified: true
|
| 45 |
- name: Training Loss (realignment latest)
|
| 46 |
type: loss
|
| 47 |
+
value: 6.26
|
| 48 |
verified: true
|
| 49 |
---
|
| 50 |
|
|
|
|
| 475 |
| Realign recovery | 600 | 7.20 | val=6.93 (recovering) |
|
| 476 |
| Realign best | 1,000 | 6.07 | val=**5.773 ✅ NEW BEST** |
|
| 477 |
| SIGTERM crash + restart | 1,001–1,123 | 9.22→6.64 | Fresh optimizer, cold momentum |
|
| 478 |
+
| SGDR warm restart (Cycle 0) | 1,124 | 9.09→6.24 | LR boosted 2.9e-5→4.5e-5, T=200 steps |
|
| 479 |
+
| Recovery eval | 1,200 | 6.74 | val=6.48, AdamW variance at ~17% convergence |
|
| 480 |
+
| Recovery eval | 1,300 | 6.24 | val=**6.10** (new recovery best) |
|
| 481 |
+
| Multi-cycle SGDR | 1,301+ | 8.79→ | Cycle 1 pending (step 1400, T=400, peak 3.8e-5) |
|
| 482 |
| **Total pretrain** | | | **11.72 → 1.99 (-83%)** |
|
| 483 |
| **Realignment** | | | **15.87 → 5.77 (val, -64%)** |
|
| 484 |
|
|
|
|
| 486 |
|
| 487 |
| Metric | Value |
|
| 488 |
|:--|:--|
|
| 489 |
+
| **Current Phase** | Corpus Realignment + Multi-Cycle SGDR |
|
| 490 |
+
| **Current Step** | ~1,320 / 5,000 |
|
| 491 |
+
| **Training Loss** | ~6.26 (recovering from optimizer cold-start) |
|
| 492 |
| **Best Validation Loss** | **5.773** (step 1,000) ✅ |
|
| 493 |
+
| **Recovery Val Loss** | 6.096 (step 1,300), gap closing |
|
| 494 |
+
| **Throughput** | 5,857 tokens/second |
|
| 495 |
| **VRAM Used** | 120 GB / 206 GB (58%), all experts unfrozen |
|
| 496 |
+
| **Total Tokens Processed** | ~260M (this run) + 178M (pretrain) |
|
| 497 |
| **Experts Active** | All 4 unfrozen since step 500 |
|
| 498 |
+
| **SGDR Status** | Multi-cycle: Cycle 1 at step 1400 (T=400, peak 3.8e-5), Cycle 2 at step 2000 (T=800, peak 3.0e-5) |
|
| 499 |
+
| **MIN_LR** | 1.5e-5 (raised from 1e-5, prevents stagnation) |
|
| 500 |
+
| **Expert LR Boost** | 1.33× during restart windows |
|
| 501 |
+
| **ETA** | ~34 hours |
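
The ETA row can be sanity-checked from the other rows of this table. The sketch below is a back-of-the-envelope estimate, not the project's own ETA logic: it assumes the ~260M tokens of this run were spread evenly over the first 1,320 steps and that throughput stays at the quoted 5,857 tokens/second. The function name `estimate_eta_hours` is made up for illustration.

```python
def estimate_eta_hours(current_step, total_steps, tokens_so_far, tokens_per_second):
    """Estimate remaining wall-clock hours, assuming a constant token count per step."""
    tokens_per_step = tokens_so_far / current_step  # assumption: uniform steps
    remaining_tokens = (total_steps - current_step) * tokens_per_step
    return remaining_tokens / tokens_per_second / 3600.0


eta = estimate_eta_hours(
    current_step=1320,       # "~1,320 / 5,000"
    total_steps=5000,
    tokens_so_far=260e6,     # "~260M (this run)"
    tokens_per_second=5857,  # "5,857 tokens/second"
)
print(f"ETA: about {eta:.0f} hours")
```

Under those assumptions the estimate comes out at roughly 34 hours, in line with the value in the table.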
|
| 502 |
|
| 503 |
### Realignment Eval History
|
| 504 |
|
|
|
|
| 509 |
| 700 | 6.24 | 515 | Converging |
|
| 510 |
| 800 | 6.01 | 407 | Converging |
|
| 511 |
| 900 | 5.91 | 367 | Converging |
|
| 512 |
+
| **1,000** | **5.773** | **321** | **✅ ALL-TIME BEST** |
|
| 513 |
+
| 1,100 | 6.55 | 701 | Optimizer cold-start |
|
| 514 |
+
| 1,200 | 6.48 | 652 | SGDR Cycle 0 (recovering) |
|
| 515 |
+
| 1,300 | **6.096** | **444** | Recovery best, gap=5.5% to peak |
|
| 516 |
+
| 1,400 | *pending* | | SGDR Cycle 1 starts (T=400) |
|
| 517 |
+
| 2,000 | *pending* | | SGDR Cycle 2 starts (T=800) |
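
The table header falls outside this hunk, but the third column looks like perplexity, i.e. the exponential of the validation loss. Treating it that way is an assumption, as is a loss measured in nats. A minimal sketch to reproduce the column:

```python
import math


def perplexity(loss_nats):
    """Perplexity for a mean cross-entropy loss given in nats."""
    return math.exp(loss_nats)


# Validation losses taken from the rows above.
for step, val_loss in [(1000, 5.773), (1300, 6.096)]:
    print(step, round(perplexity(val_loss), 1))
```

The results land within about one unit of the 321 and 444 listed for steps 1,000 and 1,300. Small differences on other rows would be expected if the table was computed from unrounded losses.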
|
| 518 |
|
| 519 |
### Published Checkpoint (v0.1)
|
| 520 |
|
|
|
|
| 538 |
| **Rollback anchors** | `best.pt` (model-only) + `latest.pt` (full state) + `.LOCKED` safety copy |
|
| 539 |
| **Emergency save** | SIGTERM/SIGINT handlers serialize full state before exit |
|
| 540 |
| **Watchdog** | Independent process monitors loss EMA, restarts on NaN/divergence |
|
| 541 |
+
| **Multi-cycle SGDR** | Period-doubling warm restarts (Loshchilov & Hutter, 2017): Cycle 0 (T=200), Cycle 1 (T=400), Cycle 2 (T=800) |
|
| 542 |
+
| **Expert LR boost** | During restart windows, expert LR scale increases 0.3→0.4 (ST-MoE stability guideline) |
|
| 543 |
+
| **MIN_LR floor** | Raised from 1e-5 to 1.5e-5 to prevent cosine decay stagnation in recovery |
|
| 544 |
| **Systemd auto-restart** | Dashboard + watchdog survive OOM kills with `Restart=always` + `OOMScoreAdjust=-500` |
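
The "Emergency save" row can be illustrated with Python's standard `signal` module. This is a sketch of one common pattern, not the project's actual handler: the handler only sets a flag, and the training loop saves at the next step boundary. The names `request_stop`, `install_handlers`, `train`, `run_step` and `save_state` are hypothetical.

```python
import signal

stop_requested = False  # set by the handler, polled by the training loop


def request_stop(signum, frame):
    """Signal handler: only record the request; heavy work happens in the loop."""
    global stop_requested
    stop_requested = True


def install_handlers():
    """Register the handler for SIGTERM and SIGINT (must run in the main thread)."""
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, request_stop)


def train(num_steps, run_step, save_state):
    """Run up to num_steps steps; save and stop early once a signal has arrived.

    Returns the number of completed steps.
    """
    for step in range(num_steps):
        run_step(step)
        if stop_requested:
            save_state(step)  # serialize full state at a step boundary
            return step + 1
    return num_steps
```

Saving at a step boundary avoids writing a half-updated state. It only helps if the process gets enough time between SIGTERM and SIGKILL to finish the current step and the write.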
|
| 545 |
|
| 546 |
**Battle-tested**: At step 1001, a SIGTERM killed the process mid-step. The checkpoint at step 1000 was corrupted (bad zip archive). The system automatically fell back to `best.pt` (val=5.773), resumed at step 1001 with a fresh optimizer, detected the cold-start plateau via the watchdog, and applied an SGDR warm restart, recovering 30× faster than natural momentum rebuilding.
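
The fallback described above can be sketched with the standard library alone. This is not the project's recovery code. It assumes checkpoints are zip archives (which the "bad zip archive" error suggests) and picks the first candidate that passes an integrity check. The names `is_readable_checkpoint` and `pick_checkpoint` are hypothetical.

```python
import zipfile
from pathlib import Path


def is_readable_checkpoint(path):
    """Return True if path is a zip archive whose members all pass CRC checks."""
    path = Path(path)
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            return archive.testzip() is None  # None means no corrupt member
    except (zipfile.BadZipFile, OSError):
        return False


def pick_checkpoint(candidates):
    """Return the first readable candidate as a Path, or None if all are bad."""
    for candidate in candidates:
        if is_readable_checkpoint(candidate):
            return Path(candidate)
    return None


# Preference order: full state first, model-only rollback anchor second.
# chosen = pick_checkpoint(["latest.pt", "best.pt"])
```

Falling back to a model-only file such as `best.pt` restores the weights but not the optimizer state, which is why the resumed run started with cold momentum.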
|
| 547 |
|
| 548 |
+
**Multi-Cycle SGDR (April 29, 2026)**: After the initial SGDR Cycle 0 completed (steps 1100-1300), analysis showed the single restart was insufficient to escape the recovery basin (val=6.10 vs target 5.77). Based on the original SGDR paper's period-doubling strategy ($T_{i+1} = T_i \times 2$), we added Cycle 1 (steps 1400-1800, peak 3.8e-5) and Cycle 2 (steps 2000-2800, peak 3.0e-5). The AdamW second moment ($\beta_2=0.999$) needs ~1000 steps for 63% convergence; these cycles provide periodic "shocks" to escape local basins while the variance estimate matures.
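
A minimal sketch of the schedule described above, using the cycle boundaries and peaks quoted in this README. Two things are assumptions rather than confirmed behaviour: the plain cosine shape inside each cycle, and holding the LR at `MIN_LR` between cycles. The helper `second_moment_convergence` evaluates $1 - \beta_2^t$, the fraction of its steady-state value that a zero-initialized second-moment EMA reaches after $t$ steps.

```python
import math

# Hypothetical cycle table built from the figures quoted above:
# (start_step, period T, peak learning rate).
CYCLES = [
    (1400, 400, 3.8e-5),  # Cycle 1
    (2000, 800, 3.0e-5),  # Cycle 2 (period doubled)
]
MIN_LR = 1.5e-5  # floor quoted in the README


def sgdr_lr(step, cycles=CYCLES, min_lr=MIN_LR):
    """Cosine-annealed LR with warm restarts; min_lr outside every cycle (assumed)."""
    for start, period, peak in cycles:
        if start <= step < start + period:
            progress = (step - start) / period  # 0.0 at the restart, near 1.0 at the end
            return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * progress))
    return min_lr


def second_moment_convergence(steps, beta2=0.999):
    """Fraction of steady state reached by a zero-initialized EMA: 1 - beta2**steps."""
    return 1.0 - beta2 ** steps
```

With these definitions, `sgdr_lr(1400)` returns the Cycle 1 peak and `sgdr_lr(1900)` returns the floor. `second_moment_convergence(1000)` comes out at about 0.63, matching the "~1000 steps for 63% convergence" figure.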
|
| 549 |
+
|
| 550 |
---
|
| 551 |
|
| 552 |
## Consciousness Metric (Φ): Deep Dive
|