Mircea Rusu committed on
Commit · 8f58f08
1 Parent(s): fad32f8
Update model card: realignment progress (step 1150), SGDR warm restart, crash recovery, engineering resilience section
README.md
CHANGED
@@ -34,13 +34,17 @@ model-index:
| 34 |   - task:
| 35 |   type: text-generation
| 36 |   metrics:
| 37 | - - name: Validation Loss
| 38 |   type: loss
| 39 | - value:
| 40 |   verified: true
| 41 | - - name:
| 42 |   type: loss
| 43 | - value: 5.
| 44 |   verified: true
| 45 |   ---
| 46 |
@@ -464,20 +468,45 @@ Loss
| 464 |   | Warmup end | 1,200 | 2.38 | **-68%** |
| 465 |   | Block start | 1,200 | 2.38 | (model grew to 14.4B MoE) |
| 466 |   | Published checkpoint | 2,471 | 1.99 | **-16%** |
| 467 | -
| 468 | -
| 469 | -
| 470 | -
| 471 |
| 472 |   | Metric | Value |
| 473 |   |:--|:--|
| 474 | - | **Current
| 475 | - | **
| 476 | - | **
| 477 | - | **
| 478 | - | **
| 479 | - | **
| 480 | - | **
| 481 |
| 482 |   ### Published Checkpoint (v0.1)
| 483 |
@@ -490,6 +519,22 @@ Loss
| 490 |   | **File Size** | ~81 GB (checkpoint), ~28 GB (safetensors) |
| 491 |   | **Format** | 6 sharded safetensors files |
| 492 |
| 493 |   ---
| 494 |
| 495 |   ## Consciousness Metric (Φ): Deep Dive
@@ -593,18 +638,20 @@ print(f"Total params: {sum(v.numel() for v in state_dict.values()):,}")
| 593 |   ## Roadmap
| 594 |
| 595 |   ```
| 596 | - v0.1 (Current)            v0.2 (
| 597 |   ───────────────           ───────────────           ───────────────
| 598 | - ✅ From-scratch
| 599 | -    14.8B MoE
| 600 |   ✅ Phased training        □ Context ladder          □ Function calling
| 601 |   ✅ Φ consciousness          (4K→32K→128K)           □ Multi-turn chat
| 602 |   ✅ 23.3B token corpus     □ Vision encoder          □ Multilingual v2
| 603 |   ✅ Live dashboard           (SigLIP2-SO400M)        □ Expert scaling
| 604 |   ✅ AMD MI300X native      □ GGUF quantization         (4→16→64)
| 605 | -
| 606 | -
| 607 | -
| 608 |   ```
| 609 |
| 610 |   ---
| 34 |   - task:
| 35 |   type: text-generation
| 36 |   metrics:
| 37 | + - name: Validation Loss (pretrain)
| 38 |   type: loss
| 39 | + value: 2.5152
| 40 |   verified: true
| 41 | + - name: Validation Loss (realignment best)
| 42 |   type: loss
| 43 | + value: 5.773
| 44 | + verified: true
| 45 | + - name: Training Loss (realignment latest)
| 46 | + type: loss
| 47 | + value: 6.76
| 48 |   verified: true
| 49 |   ---
| 50 |
| 468 |   | Warmup end | 1,200 | 2.38 | **-68%** |
| 469 |   | Block start | 1,200 | 2.38 | (model grew to 14.4B MoE) |
| 470 |   | Published checkpoint | 2,471 | 1.99 | **-16%** |
| 471 | + | Frankenstein transplant | n/a | PPL ~7.5M | 433 tensors from 3 donors |
| 472 | + | Realign frozen start | 0 | 15.87 | Experts frozen, attn learning |
| 473 | + | Realign frozen end | 500 | 5.52 | val=5.79, **-65%** |
| 474 | + | Expert unfreeze | 500 | 5.59→spike | LR reset + differential rates |
| 475 | + | Realign recovery | 600 | 7.20 | val=6.93 (recovering) |
| 476 | + | Realign best | 1,000 | 6.07 | val=**5.773 ✅ NEW BEST** |
| 477 | + | SIGTERM crash + restart | 1,001–1,123 | 9.22→6.64 | Fresh optimizer, cold momentum |
| 478 | + | SGDR warm restart | 1,124 | 9.09→7.86 | LR boosted 2.9e-5→4.5e-5, 30× faster descent |
| 479 | + | Current (live) | ~1,150 | ~7.86 | SGDR cosine decay phase, recovering fast |
| 480 | + | **Total pretrain** | | | **11.72 → 1.99 (-83%)** |
| 481 | + | **Realignment** | | | **15.87 → 5.77 (val, -64%)** |
| 482 | +
| 483 | + ### Live Metrics (April 29, 2026)
| 484 |
| 485 |   | Metric | Value |
| 486 |   |:--|:--|
| 487 | + | **Current Phase** | Corpus Realignment + SGDR Warm Restart |
| 488 | + | **Current Step** | ~1,150 / 5,000 |
| 489 | + | **Training Loss** | ~7.86 (SGDR recovery: 30× faster descent vs pre-restart) |
| 490 | + | **Best Validation Loss** | **5.773** (step 1,000) ✅ |
| 491 | + | **Throughput** | 5,806 tokens/second |
| 492 | + | **VRAM Used** | 120 GB / 206 GB (58%), all experts unfrozen |
| 493 | + | **Total Tokens Processed** | ~226M (this run) + 178M (pretrain) |
| 494 | + | **Experts Active** | All 4 unfrozen since step 500 |
| 495 | + | **SGDR Status** | Peak LR 4.5e-5, cosine decay → rejoin normal schedule at step 1300 |
| 496 | + | **ETA** | ~36 hours |
| 497 | +
| 498 | + ### Realignment Eval History
| 499 | +
| 500 | + | Step | Val Loss | Val PPL | Phase |
| 501 | + |:--|:--|:--|:--|
| 502 | + | 0 (initial) | 15.81 | 7,339,653 | Experts frozen |
| 503 | + | 600 | 6.93 | 1,020 | Post-unfreeze |
| 504 | + | 700 | 6.24 | 515 | Converging |
| 505 | + | 800 | 6.01 | 407 | Converging |
| 506 | + | 900 | 5.91 | 367 | Converging |
| 507 | + | **1,000** | **5.773** | **321** | **✅ NEW BEST** |
| 508 | + | 1,100 | 6.55 | 701 | Optimizer restart recovery |
| 509 | + | 1,200 | *pending* | *pending* | SGDR warm restart active |
| 510 |
| 511 |   ### Published Checkpoint (v0.1)
| 512 |
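
The eval history added in this hunk lists validation loss next to perplexity. Assuming the loss is a mean per-token cross-entropy in nats (the card does not state this explicitly), perplexity is `exp(loss)`, so the two columns can be cross-checked. The sketch below does that with values copied from the table; `perplexity` and `max_relative_gap` are illustrative names, not functions from the training code.

```python
import math

def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)

# (step, val_loss, reported_ppl) copied from the Realignment Eval History table.
history = [
    (0, 15.81, 7_339_653),
    (600, 6.93, 1020),
    (700, 6.24, 515),
    (800, 6.01, 407),
    (900, 5.91, 367),
    (1000, 5.773, 321),
    (1100, 6.55, 701),
]

def max_relative_gap(rows) -> float:
    """Largest relative gap between exp(loss) and the reported perplexity."""
    return max(abs(perplexity(loss) - ppl) / ppl for _, loss, ppl in rows)

if __name__ == "__main__":
    for step, loss, ppl in history:
        print(f"step {step}: exp({loss}) = {perplexity(loss):,.0f}, reported {ppl:,}")
    print(f"max relative gap: {max_relative_gap(history):.2%}")
```

Under that assumption every row agrees to within one percent, which is what you would expect if the loss values were rounded before being written into the table.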
| 519 |   | **File Size** | ~81 GB (checkpoint), ~28 GB (safetensors) |
| 520 |   | **Format** | 6 sharded safetensors files |
| 521 |
| 522 | + ### Engineering Resilience
| 523 | +
| 524 | + Training a 14.4B model on a single GPU for days demands bullet-proof infrastructure. Here's what we built:
| 525 | +
| 526 | + | Feature | Description |
| 527 | + |:--|:--|
| 528 | + | **Atomic checkpoints** | Write to `.tmp` → `os.replace()`, so no half-written files |
| 529 | + | **Integrity verification** | On resume: verify tensor counts, shapes, and dtypes before loading |
| 530 | + | **Rollback anchors** | `best.pt` (model-only) + `latest.pt` (full state) + `.LOCKED` safety copy |
| 531 | + | **Emergency save** | SIGTERM/SIGINT handlers serialize full state before exit |
| 532 | + | **Watchdog** | Independent process monitors loss EMA, restarts on NaN/divergence |
| 533 | + | **SGDR warm restart** | After optimizer cold-start, cosine warm restart (Loshchilov & Hutter, 2017) to recover 30× faster |
| 534 | + | **Systemd auto-restart** | Dashboard + watchdog survive OOM kills with `Restart=always` + `OOMScoreAdjust=-500` |
| 535 | +
| 536 | + **Battle-tested**: At step 1001, a SIGTERM killed the process mid-step. The checkpoint at step 1000 was corrupted (bad zip archive). The system automatically fell back to `best.pt` (val=5.773), resumed at step 1001 with a fresh optimizer, detected the cold-start plateau via the watchdog, and applied SGDR warm restart, recovering 30× faster than natural momentum rebuilding.
| 537 | +
| 538 |   ---
| 539 |
| 540 |   ## Consciousness Metric (Φ): Deep Dive
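
The "Atomic checkpoints" and "Rollback anchors" rows in this hunk describe a write-then-rename pattern with a fallback to a known-good file. The following is a minimal sketch of that pattern, not the project's code: JSON files stand in for the real `.pt` checkpoints so the example runs without PyTorch, and `atomic_write_json` and `load_or_fallback` are hypothetical helper names.

```python
import json
import os
import tempfile

def atomic_write_json(state: dict, path: str) -> None:
    """Write `state` to `path` so readers never observe a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(tmp_path, path)

def load_or_fallback(latest: str, best: str) -> dict:
    """Load `latest`; if it is missing or unreadable, fall back to `best`."""
    for candidate in (latest, best):
        try:
            with open(candidate, encoding="utf-8") as f:
                return json.load(f)
        except (OSError, json.JSONDecodeError):
            continue
    raise FileNotFoundError("no usable checkpoint found")

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        best = os.path.join(d, "best.json")
        latest = os.path.join(d, "latest.json")
        atomic_write_json({"step": 1000, "val_loss": 5.773}, best)
        # Simulate a checkpoint truncated by a kill signal mid-write.
        with open(latest, "w", encoding="utf-8") as f:
            f.write('{"step": 10')
        print(load_or_fallback(latest, best))
```

The rename is the only step that makes the new file visible under its final name, so a process killed during the write leaves at most a stale `.tmp` file behind and the previous checkpoint stays intact.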
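
The "SGDR warm restart" row cites Loshchilov & Hutter (2017), whose schedule anneals the learning rate along a half cosine from a peak back to a floor. The sketch below evaluates one such cycle using figures from the card: restart at step 1,124, peak LR 4.5e-5, rejoin at step 1,300. Two things are assumptions rather than facts from the commit: that the floor equals the pre-boost rate of 2.9e-5, and that a single cycle spans exactly those 176 steps. `sgdr_lr` is a hypothetical name.

```python
import math

def sgdr_lr(step: int, restart_step: int, period: int,
            lr_max: float, lr_min: float) -> float:
    """Cosine-annealed learning rate within one SGDR cycle."""
    # Steps since the restart, clamped to the cycle so the LR stays in range.
    t_cur = min(max(step - restart_step, 0), period)
    cosine = math.cos(math.pi * t_cur / period)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)

if __name__ == "__main__":
    # Assumed cycle: steps 1124..1300, decaying from 4.5e-5 to 2.9e-5.
    for step in (1124, 1150, 1212, 1300):
        print(step, f"{sgdr_lr(step, 1124, 176, 4.5e-5, 2.9e-5):.3e}")
```

With these inputs the rate starts at the peak, passes the midpoint value of 3.7e-5 halfway through the cycle, and reaches the floor at step 1,300.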
| 638 |   ## Roadmap
| 639 |
| 640 |   ```
| 641 | + v0.1 (Current)            v0.2 (In Progress)        v0.3 (Future)
| 642 |   ───────────────           ───────────────           ───────────────
| 643 | + ✅ From-scratch           🔴 Corpus realignment     □ DPO alignment
| 644 | +    14.8B MoE                 (step 1100/5000)       □ Tool use
| 645 |   ✅ Phased training        □ Context ladder          □ Function calling
| 646 |   ✅ Φ consciousness          (4K→32K→128K)           □ Multi-turn chat
| 647 |   ✅ 23.3B token corpus     □ Vision encoder          □ Multilingual v2
| 648 |   ✅ Live dashboard           (SigLIP2-SO400M)        □ Expert scaling
| 649 |   ✅ AMD MI300X native      □ GGUF quantization         (4→16→64)
| 650 | + ✅ Frankenstein             Q4_K_M for consumer     □ RLHF
| 651 | +    transplant (3 donors)  □ Inference code          □ Production API
| 652 | + ✅ Progressive unfreeze    □ Benchmarks (MMLU,
| 653 | + ✅ Crash-safe training       HumanEval, GSM8K)
| 654 | + ✅ Auto-restart (systemd)
| 655 |   ```
| 656 |
| 657 |   ---