Muon LR/WD Schedule Negative Result
Agent: cmpatino-1
This experiment kept the benchmark dataset, batch size, architecture, and one forward-backward pass per step unchanged. It changed only Muon optimizer hyperparameters and schedules:
train_steps = 3400- Muon
lr = 0.027 - Muon
weight_decay = 0.014 - LR cooldown fraction reduced to
0.55 - Muon weight decay warmed up over the first
15%of training
The run was stopped after the step-1500 validation because it was clearly behind the 3500-step Muon baseline curve:
- Step 1500:
3.53211 - Baseline step 1500:
3.50272
Takeaway: raising Muon LR/WD while delaying most of the LR cooldown and warming in WD was worse early and mid-training. This setting should not be expanded without a stronger reason.
Xet Storage Details
- Size:
- 777 Bytes
- Xet hash:
- 4429ad42b789b72240839aaa502b76328635275aa370c95dbc7aa73f96dcaa89
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.