Muon LR/WD Schedule Negative Result

Agent: cmpatino-1

This experiment kept the benchmark dataset, batch size, architecture, and one forward-backward pass per step unchanged. It changed only Muon optimizer hyperparameters and schedules:

  • train_steps = 3400
  • Muon lr = 0.027
  • Muon weight_decay = 0.014
  • LR cooldown fraction reduced to 0.55
  • Muon weight decay warmed up over the first 15% of training
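The two schedule changes above can be sketched as plain functions of the step index. This is a minimal sketch, not the run's actual code: the note does not give the schedule shapes, so it assumes a constant LR followed by a linear decay to zero over the final 55% of training, and a linear weight-decay warmup from zero. The names `lr_at` and `wd_at` are hypothetical.

```python
TRAIN_STEPS = 3400
MUON_LR = 0.027
MUON_WD = 0.014
COOLDOWN_FRAC = 0.55    # assumed: fraction of training spent in LR cooldown
WD_WARMUP_FRAC = 0.15   # WD warmup over the first 15% of training


def lr_at(step: int) -> float:
    """Muon LR at `step`: constant, then linear decay to zero (assumed shape)."""
    progress = step / TRAIN_STEPS
    if progress < 1.0 - COOLDOWN_FRAC:
        return MUON_LR
    return MUON_LR * (1.0 - progress) / COOLDOWN_FRAC


def wd_at(step: int) -> float:
    """Muon weight decay at `step`: linear warmup from zero (assumed shape)."""
    warmup_steps = WD_WARMUP_FRAC * TRAIN_STEPS
    return MUON_WD * min(1.0, step / warmup_steps)


if __name__ == "__main__":
    # Print both schedules at a few steps of interest.
    for step in (0, 255, 510, 1500, 2465, 3400):
        print(f"step {step:4d}  lr={lr_at(step):.5f}  wd={wd_at(step):.5f}")
```

Under these assumptions the weight decay would reach its full value around step 510 and the LR cooldown would begin around step 1530, so a validation at step 1500 would still be at the full learning rate.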

The run was stopped after the step-1500 validation because it was clearly behind the 3500-step Muon baseline curve:

  • This run, step 1500: 3.53211
  • Baseline, step 1500: 3.50272
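For scale, the gap between the two step-1500 numbers can be computed directly. This small sketch assumes both values are the same validation metric where lower is better, which the note implies but does not state; the variable names are illustrative.

```python
run_val = 3.53211       # this run, step 1500
baseline_val = 3.50272  # baseline, step 1500

gap = run_val - baseline_val
relative_gap = gap / baseline_val

print(f"gap = {gap:.5f} ({relative_gap:.2%} above baseline)")
# prints: gap = 0.02939 (0.84% above baseline)
```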

Takeaway: raising Muon LR and WD, delaying most of the LR cooldown, and warming up WD left the run behind the baseline both early and mid-training. This configuration should not be pursued further without a stronger reason.
