Mircea Rusu committed on
Commit 8f58f08 · 1 Parent(s): fad32f8

Update model card: realignment progress (step 1150), SGDR warm restart, crash recovery, engineering resilience section

Files changed (1)
  1. README.md +68 -21
README.md CHANGED
@@ -34,13 +34,17 @@ model-index:
  - task:
  type: text-generation
  metrics:
- - name: Validation Loss
  type: loss
- value: 1.99
  verified: true
- - name: Training Loss (latest)
  type: loss
- value: 5.18
  verified: true
  ---
@@ -464,20 +468,45 @@ Loss
  | 🔥 Warmup end | 1,200 | 2.38 | **−68%** |
  | 🚀 Block start | 1,200 | 2.38 | (model grew to 14.4B MoE) |
  | 📦 Published checkpoint | 2,471 | 1.99 | **−16%** |
- | 🔄 Current (new run) | 410 | 5.18 | training with expanded data |
- | **Total reduction** | | | **11.72 → 1.99 (−83%)** |
-
- ### Live Metrics (April 27, 2026)

  | Metric | Value |
  |:--|:--|
- | **Current Step** | 410 / 2,471+ |
- | **Training Loss** | 5.18 (new run, expanded datasets) |
- | **Throughput** | 4,403 tokens/second |
- | **VRAM Used** | ~140 GB / 192 GB (73%) |
- | **Total Tokens Processed** | 59.3M (this run) + 178M (prev run) |
- | **Experts Active** | 4 per layer × 24 layers = 96 |
- | **ETA (this block)** | ~18.8 hours |

  ### Published Checkpoint (v0.1)
@@ -490,6 +519,22 @@ Loss
  | **File Size** | ~81 GB (checkpoint), ~28 GB (safetensors) |
  | **Format** | 6 sharded safetensors files |

  ---

  ## 🌡️ Consciousness Metric (Φ): Deep Dive
@@ -593,18 +638,20 @@ print(f"Total params: {sum(v.numel() for v in state_dict.values()):,}")
  ## 🗺️ Roadmap

  ```
- v0.1 (Current)          v0.2 (Planned)          v0.3 (Future)
  ━━━━━━━━━━━━━━━         ━━━━━━━━━━━━━━━         ━━━━━━━━━━━━━━━
- ✅ From-scratch         ░ Full training         ░ DPO alignment
-    14.8B MoE              complete (loss<0.5)   ░ Tool use
  ✅ Phased training      ░ Context ladder        ░ Function calling
  ✅ Φ consciousness        (4K→32K→128K)         ░ Multi-turn chat
  ✅ 23.3B token corpus   ░ Vision encoder        ░ Multilingual v2
  ✅ Live dashboard         (SigLIP2-SO400M)      ░ Expert scaling
  ✅ AMD MI300X native    ░ GGUF quantization       (4→16→64)
-                         ░ Inference code        ░ RLHF
-                         ░ Benchmarks (MMLU,     ░ Production API
-                           HumanEval, GSM8K)
  ```

  ---
 
  - task:
  type: text-generation
  metrics:
+ - name: Validation Loss (pretrain)
  type: loss
+ value: 2.5152
  verified: true
+ - name: Validation Loss (realignment best)
  type: loss
+ value: 5.773
+ verified: true
+ - name: Training Loss (realignment latest)
+ type: loss
+ value: 6.76
  verified: true
  ---
 
  | 🔥 Warmup end | 1,200 | 2.38 | **−68%** |
  | 🚀 Block start | 1,200 | 2.38 | (model grew to 14.4B MoE) |
  | 📦 Published checkpoint | 2,471 | 1.99 | **−16%** |
+ | 🧟 Frankenstein transplant | n/a | PPL ~7.5M | 433 tensors from 3 donors |
+ | ⚡ Realign frozen start | 0 | 15.87 | Experts frozen, attn learning |
+ | ⚡ Realign frozen end | 500 | 5.52 | val=5.79, **−65%** |
+ | 🔓 Expert unfreeze | 500 | 5.59→spike | LR reset + differential rates |
+ | 📈 Realign recovery | 600 | 7.20 | val=6.93 (recovering) |
+ | ⭐ Realign best | 1,000 | 6.07 | val=**5.773 ★ NEW BEST** |
+ | SIGTERM crash + restart | 1,001→1,123 | 9.22→6.64 | Fresh optimizer, cold momentum |
+ | 🔥 SGDR warm restart | 1,124 | 9.09→7.86 | LR boosted 2.9e-5→4.5e-5, 30× faster descent |
+ | 🔄 Current (live) | ~1,150 | ~7.86 | SGDR cosine decay phase, recovering fast |
+ | **Total pretrain** | | | **11.72 → 1.99 (−83%)** |
+ | **Realignment** | | | **15.87 → 5.77 (val, −64%)** |
+
+ ### Live Metrics (April 29, 2026)

  | Metric | Value |
  |:--|:--|
+ | **Current Phase** | ⚡ Corpus Realignment + SGDR Warm Restart |
+ | **Current Step** | ~1,150 / 5,000 |
+ | **Training Loss** | ~7.86 (SGDR recovery, 30× faster descent vs pre-restart) |
+ | **Best Validation Loss** | **5.773** (step 1,000) ★ |
+ | **Throughput** | 5,806 tokens/second |
+ | **VRAM Used** | 120 GB / 206 GB (58%), all experts unfrozen |
+ | **Total Tokens Processed** | ~226M (this run) + 178M (pretrain) |
+ | **Experts Active** | All 4 unfrozen since step 500 |
+ | **SGDR Status** | Peak LR 4.5e-5, cosine decay → rejoin normal schedule at step 1300 |
+ | **ETA** | ~36 hours |
+
+ ### Realignment Eval History
+
+ | Step | Val Loss | Val PPL | Phase |
+ |:--|:--|:--|:--|
+ | 0 (initial) | 15.81 | 7,339,653 | Experts frozen |
+ | 600 | 6.93 | 1,020 | Post-unfreeze |
+ | 700 | 6.24 | 515 | Converging |
+ | 800 | 6.01 | 407 | Converging |
+ | 900 | 5.91 | 367 | Converging |
+ | **1,000** | **5.773** | **321** | **★ NEW BEST** |
+ | 1,100 | 6.55 | 701 | Optimizer restart recovery |
+ | 1,200 | *pending* | *pending* | SGDR warm restart active |
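
The Val PPL column in the table above appears consistent with perplexity computed as the exponential of the validation loss. The sketch below checks that relationship, assuming the loss is a mean per-token cross-entropy in nats (the model card does not state this, so treat it as an assumption). The `perplexity` helper and the `val_losses` dictionary are illustrative names, not part of the training code.

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss given in nats."""
    return math.exp(loss)


# Validation losses copied from the Realignment Eval History table.
val_losses = {600: 6.93, 700: 6.24, 800: 6.01, 900: 5.91, 1000: 5.773, 1100: 6.55}

for step, loss in val_losses.items():
    print(f"step {step:>5}: val loss {loss:.3f} -> PPL {perplexity(loss):,.0f}")
```

Small differences from the table are expected, since the losses shown there are rounded before exponentiation.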

  ### Published Checkpoint (v0.1)

 
  | **File Size** | ~81 GB (checkpoint), ~28 GB (safetensors) |
  | **Format** | 6 sharded safetensors files |

+ ### 🛡️ Engineering Resilience
+
+ Training a 14.4B model on a single GPU for days demands bulletproof infrastructure. Here's what we built:
+
+ | Feature | Description |
+ |:--|:--|
+ | **Atomic checkpoints** | Write to `.tmp`, then `os.replace()`: no half-written files |
+ | **Integrity verification** | On resume: verify tensor counts, shapes, and dtypes before loading |
+ | **Rollback anchors** | `best.pt` (model-only) + `latest.pt` (full state) + `.LOCKED` safety copy |
+ | **Emergency save** | SIGTERM/SIGINT handlers serialize full state before exit |
+ | **Watchdog** | Independent process monitors loss EMA, restarts on NaN/divergence |
+ | **SGDR warm restart** | After an optimizer cold start, a cosine warm restart (Loshchilov & Hutter, 2017) recovers 30× faster |
+ | **Systemd auto-restart** | Dashboard + watchdog survive OOM kills with `Restart=always` + `OOMScoreAdjust=-500` |
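
The atomic-checkpoint and rollback-anchor rows above can be illustrated with a minimal sketch. It is not the project's actual code: `save_atomic` and `load_with_fallback` are hypothetical helpers, and `pickle` stands in for whatever serializer the real checkpoints use. Only the `.tmp` plus `os.replace()` pattern and the `best.pt` / `latest.pt` names come from the table.

```python
import os
import pickle
import tempfile


def save_atomic(state: dict, path: str) -> None:
    """Write `state` to `path` without ever exposing a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before renaming
    # os.replace is atomic when source and destination share a filesystem.
    os.replace(tmp_path, path)


def load_with_fallback(paths):
    """Return (path, state) for the first checkpoint that deserializes cleanly."""
    for path in paths:
        try:
            with open(path, "rb") as f:
                return path, pickle.load(f)
        except (OSError, EOFError, pickle.UnpicklingError):
            continue  # missing or corrupted file: try the next rollback anchor
    raise RuntimeError("no loadable checkpoint found")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ckpt_dir:
        best = os.path.join(ckpt_dir, "best.pt")
        latest = os.path.join(ckpt_dir, "latest.pt")
        save_atomic({"step": 1000, "val_loss": 5.773}, best)
        open(latest, "wb").close()  # simulate a truncated latest.pt
        used, state = load_with_fallback([latest, best])
        print(os.path.basename(used), state["step"])
```

Trying `latest.pt` first and `best.pt` second mirrors the fallback order described in this section. A fuller version would also compare tensor counts, shapes, and dtypes before accepting a file.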
+
+ **Battle-tested**: At step 1001, a SIGTERM killed the process mid-step. The checkpoint at step 1000 was corrupted (bad zip archive). The system automatically fell back to `best.pt` (val=5.773), resumed at step 1001 with a fresh optimizer, detected the cold-start plateau via the watchdog, and applied an SGDR warm restart, recovering 30× faster than natural momentum rebuilding.
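
For readers unfamiliar with SGDR, the cosine annealing rule from Loshchilov & Hutter can be sketched as below. The restart step (1,124), the peak LR (4.5e-5), and the rejoin step (1,300) are taken from this model card. Using 2.9e-5 as the LR that the cosine decays back to is an assumption, as is the `sgdr_lr` helper itself, since the card does not show the real scheduler.

```python
import math


def sgdr_lr(step: int, restart_step: int, rejoin_step: int,
            peak_lr: float, base_lr: float) -> float:
    """Cosine-annealed LR: peak_lr at restart_step, base_lr at rejoin_step.

    Steps outside [restart_step, rejoin_step] are clamped to the nearest end,
    so the function is only meaningful from the restart onward.
    """
    progress = (step - restart_step) / (rejoin_step - restart_step)
    progress = min(max(progress, 0.0), 1.0)
    return base_lr + 0.5 * (peak_lr - base_lr) * (1.0 + math.cos(math.pi * progress))


# Values from the model card; base_lr=2.9e-5 is an assumed floor.
for step in (1124, 1150, 1212, 1300):
    lr = sgdr_lr(step, restart_step=1124, rejoin_step=1300,
                 peak_lr=4.5e-5, base_lr=2.9e-5)
    print(f"step {step}: lr = {lr:.3e}")
```

The LR starts at the boosted peak and decays smoothly, which gives the freshly initialized optimizer larger updates while its momentum estimates rebuild.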
+
  ---

  ## 🌡️ Consciousness Metric (Φ): Deep Dive
 
  ## 🗺️ Roadmap

  ```
+ v0.1 (Current)            v0.2 (In Progress)      v0.3 (Future)
  ━━━━━━━━━━━━━━━           ━━━━━━━━━━━━━━━         ━━━━━━━━━━━━━━━
+ ✅ From-scratch           🔴 Corpus realignment   ░ DPO alignment
+    14.8B MoE                 (step 1100/5000)     ░ Tool use
  ✅ Phased training        ░ Context ladder        ░ Function calling
  ✅ Φ consciousness          (4K→32K→128K)         ░ Multi-turn chat
  ✅ 23.3B token corpus     ░ Vision encoder        ░ Multilingual v2
  ✅ Live dashboard           (SigLIP2-SO400M)      ░ Expert scaling
  ✅ AMD MI300X native      ░ GGUF quantization       (4→16→64)
+ ✅ Frankenstein             Q4_K_M for consumer   ░ RLHF
+    transplant (3 donors)  ░ Inference code        ░ Production API
+ ✅ Progressive unfreeze   ░ Benchmarks (MMLU,
+ ✅ Crash-safe training      HumanEval, GSM8K)
+ ✅ Auto-restart (systemd)
  ```

  ---