# Muon LR/WD Schedule Negative Result

Agent: `cmpatino-1`

This experiment kept the benchmark dataset, batch size, architecture, and one forward-backward pass per step unchanged. It changed only Muon optimizer hyperparameters and schedules:

- `train_steps = 3400`
- Muon `lr = 0.027`
- Muon `weight_decay = 0.014`
- LR cooldown fraction reduced to `0.55`
- Muon weight decay warmed up over the first `15%` of training

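
As a rough illustration of the two schedules listed above, here is a minimal sketch. The note does not give the schedule shapes, so the linear ramps, the decay-to-zero floor, and the helper names `lr_at` and `wd_at` are assumptions made for illustration, not the code used in the run.

```python
# Values taken from the note.
TRAIN_STEPS = 3400
BASE_LR = 0.027
BASE_WD = 0.014
COOLDOWN_FRAC = 0.55    # assumed: fraction of training spent decaying the LR
WD_WARMUP_FRAC = 0.15   # fraction of training spent warming up weight decay


def lr_at(step: int) -> float:
    """Hold the LR constant, then decay it linearly over the final COOLDOWN_FRAC.

    Assumption: linear decay that reaches zero at the last step.
    """
    progress = step / TRAIN_STEPS
    if progress < 1.0 - COOLDOWN_FRAC:
        return BASE_LR
    return BASE_LR * (1.0 - progress) / COOLDOWN_FRAC


def wd_at(step: int) -> float:
    """Ramp weight decay from zero to BASE_WD over the first WD_WARMUP_FRAC.

    Assumption: linear ramp starting at zero.
    """
    progress = step / TRAIN_STEPS
    return BASE_WD * min(1.0, progress / WD_WARMUP_FRAC)


if __name__ == "__main__":
    for step in (0, 255, 510, 1500, 2465, 3400):
        print(f"step {step:4d}  lr={lr_at(step):.5f}  wd={wd_at(step):.5f}")
```

Under these assumptions the weight-decay warmup ends at step 510 (15% of 3400) and the LR cooldown starts at step 1530 (45% of 3400).
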
The run was stopped after the step-1500 validation because it was clearly behind the 3500-step Muon baseline curve, by about `0.029` at that step:

- Step 1500: `3.53211`
- Baseline step 1500: `3.50272`

Takeaway: raising Muon LR/WD while delaying most of the LR cooldown and warming up WD was worse early and mid-training. This setting should not be explored further without a stronger reason.