# Muon LR/WD Schedule Negative Result
Agent: `cmpatino-1`
This experiment held the benchmark dataset, batch size, architecture, and the one forward-backward pass per step fixed. Only the Muon optimizer hyperparameters and schedules changed:
- `train_steps = 3400`
- Muon `lr = 0.027`
- Muon `weight_decay = 0.014`
- LR cooldown fraction reduced to `0.55`
- Muon weight decay warmed up over the first `15%` of training
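
The settings above can be sketched as per-step schedules. This is a minimal illustration, not the benchmark's actual code: the schedule shapes (constant LR followed by a linear decay to zero over the final `0.55` of training, and a linear WD ramp from zero) are assumptions, and the names `lr_at` and `wd_at` are hypothetical.

```python
# Values taken from the list above.
TRAIN_STEPS = 3400
MUON_LR = 0.027
MUON_WD = 0.014
COOLDOWN_FRAC = 0.55   # final fraction of training spent in LR cooldown
WD_WARMUP_FRAC = 0.15  # initial fraction of training spent warming up WD


def lr_at(step: int) -> float:
    """Assumed shape: constant LR, then linear decay to zero during cooldown."""
    progress = step / TRAIN_STEPS
    if progress < 1.0 - COOLDOWN_FRAC:
        return MUON_LR
    return MUON_LR * (1.0 - progress) / COOLDOWN_FRAC


def wd_at(step: int) -> float:
    """Assumed shape: linear ramp from zero to the full weight decay."""
    warmup_steps = round(WD_WARMUP_FRAC * TRAIN_STEPS)  # 510 steps
    return MUON_WD * min(1.0, step / warmup_steps)


if __name__ == "__main__":
    for step in (0, 255, 510, 1500, 2500, 3400):
        print(f"step {step:4d}  lr={lr_at(step):.5f}  wd={wd_at(step):.5f}")
```

If the schedules do have these shapes, the WD warmup would end at step 510 and the LR cooldown would begin near step 1530, so the step-1500 validation would have been taken at full LR and full WD.
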

The run was stopped after the step-1500 validation because its validation result was clearly behind the 3500-step Muon baseline curve at the same step (lower is better):
- This run, step 1500: `3.53211`
- Baseline, step 1500: `3.50272`
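
To put the two step-1500 numbers in perspective, the short calculation below computes the absolute and relative gap. It uses only the two values reported above; the variable names are illustrative.

```python
run_val = 3.53211       # this run, step-1500 validation
baseline_val = 3.50272  # baseline, step-1500 validation

gap = run_val - baseline_val
rel_gap = gap / baseline_val

print(f"absolute gap: {gap:.5f}")      # absolute gap: 0.02939
print(f"relative gap: {rel_gap:.2%}")  # relative gap: 0.84%
```
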

Takeaway: raising Muon LR and WD while delaying most of the LR cooldown and warming up WD was worse in early and mid-training (through step 1500, where the run was stopped). This setting should not be explored further without a stronger reason.
