# adamw_baseline_cmpatino-0

**Status:** Negative result. Did not reach 3.28 in 5625 steps.

## What was tried

A literal reading of the README's "AdamW baseline" line:

> AdamW (lr=0.0015, wd=0.1, betas=0.9/0.95, warmup=250): 5,625 steps

This was implemented as a **single AdamW group covering all parameters** with lr=0.0015, using the same warmup/cooldown schedule as the Muon baseline (warmup=250, cooldown_frac=0.7).
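
As a rough illustration (not the exact contents of `train_gpt_adamw_cmpatino-0.py`), the single-group setup amounts to the sketch below; the placeholder model and the trapezoidal warmup/flat/cooldown shape are assumptions based on the schedule commonly used in modded-nanogpt:

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder; the real run trains the GPT model

total_steps, warmup_steps, cooldown_frac = 5625, 250, 0.7

# One AdamW group covering *all* parameters, per the literal reading above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.0015,
    weight_decay=0.1,
    betas=(0.9, 0.95),
)

def lr_scale(step: int) -> float:
    # Linear warmup, flat middle, then linear cooldown over the final
    # `cooldown_frac` of training (assumed schedule shape).
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    cooldown_start = total_steps * (1 - cooldown_frac)
    if step < cooldown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - cooldown_start))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)
```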

## Result

`val_loss = 3.39869` at step 5625. Far above the 3.28 threshold.

## Why it failed

Reading the upstream reference log
([a63a68d1-...](https://github.com/KellerJordan/modded-nanogpt/blob/master/records/track_3_optimization/results/a63a68d1-24aa-4a22-af9a-224e43209ea4.txt))
shows that the reference "AdamW baseline" is **multi-LR**, using two AdamW optimizers with the following parameter groups:

| Group | LR | wd | betas |
|---|---|---|---|
| `embed.weight` | 0.3 | 0 | (0.8, 0.95) |
| `proj.weight` | 1/320 ≈ 0.003125 | 0 | (0.8, 0.95) |
| Params with ndim < 2 (biases, RMSNorm gains) | 0.01 | 0 | (0.8, 0.95) |
| `blocks.*` with ndim ≥ 2 (the "real" target) | 0.0015 | 0.1 | (0.9, 0.95) |

Init also differs: only `proj` is zeroed; everything else uses default torch init.

A single LR of 0.0015 applied to embed/proj/scalars is dramatically too small (roughly 200× below the reference's 0.3 for `embed.weight`), so those groups never train enough.
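
For comparison, a minimal sketch of how that grouping could be assembled. The toy model and module names (`embed`, `blocks`, `lm_head` standing in for `proj`) are assumptions, not the reference code; only the per-group LRs, betas, weight decay, and the two-optimizer split come from the table above:

```python
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    """Stand-in model; the real run uses the modded-nanogpt GPT."""
    def __init__(self, vocab=256, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])
        self.lm_head = nn.Linear(dim, vocab, bias=False)  # the "proj" matrix
        nn.init.zeros_(self.lm_head.weight)               # only proj is zeroed

model = TinyGPT()

embed_params  = [model.embed.weight]
proj_params   = [model.lm_head.weight]
scalar_params = [p for p in model.parameters() if p.ndim < 2]
hidden_params = [p for p in model.blocks.parameters() if p.ndim >= 2]

# First AdamW: embed / proj / scalar groups, each with its own LR, no weight decay.
adam1 = torch.optim.AdamW(
    [
        dict(params=embed_params,  lr=0.3),
        dict(params=proj_params,   lr=1 / 320),
        dict(params=scalar_params, lr=0.01),
    ],
    betas=(0.8, 0.95),
    weight_decay=0.0,
)

# Second AdamW: the ndim >= 2 block weights (the "real" target of lr=0.0015).
adam2 = torch.optim.AdamW(
    hidden_params, lr=0.0015, betas=(0.9, 0.95), weight_decay=0.1
)

optimizers = [adam1, adam2]  # step both every iteration
```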

## Files

- `train_gpt_adamw_cmpatino-0.py` — single-LR AdamW reproduction
- `train_log_cmpatino-0.txt` — full training log
- `results.json` — machine-readable result

## Follow-up

Corrected reproduction (multi-LR scheme) launched at `artifacts/adamw_baseline_v2_cmpatino-0/`.