	# Muon LR/WD Schedule Negative Result

	Agent: `cmpatino-1`

	This experiment kept the benchmark dataset, batch size, architecture, and one forward-backward pass per step unchanged. It changed only Muon optimizer hyperparameters and schedules:

	- `train_steps = 3400`
	- Muon `lr = 0.027`
	- Muon `weight_decay = 0.014`
	- LR cooldown fraction reduced to `0.55`
	- Muon weight decay warmed up over the first `15%` of training
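The settings above can be sketched as two step-indexed schedules. This is a minimal illustration, not the benchmark's actual code: the function names `muon_lr` and `muon_wd` are hypothetical, and it assumes that the cooldown fraction is the final share of training spent decaying the LR, that the cooldown is linear to zero, and that the weight-decay warmup is linear from zero. The note does not state any of these shapes.

```python
# Hypothetical sketch of the schedules described above.
# Assumptions (not confirmed by the note): the LR is held constant and then
# decays linearly to zero over the last `cooldown_frac` of training, and the
# weight decay ramps linearly from zero over the first `warmup_frac`.

TRAIN_STEPS = 3400
MUON_LR = 0.027
MUON_WD = 0.014
COOLDOWN_FRAC = 0.55
WD_WARMUP_FRAC = 0.15


def muon_lr(step, train_steps=TRAIN_STEPS, base_lr=MUON_LR,
            cooldown_frac=COOLDOWN_FRAC):
    """Return the LR at `step`: constant, then linear cooldown to zero."""
    progress = step / train_steps
    cooldown_start = 1.0 - cooldown_frac
    if progress < cooldown_start:
        return base_lr
    # Fraction of the cooldown window that is still remaining, in [0, 1].
    return base_lr * (1.0 - progress) / cooldown_frac


def muon_wd(step, train_steps=TRAIN_STEPS, base_wd=MUON_WD,
            warmup_frac=WD_WARMUP_FRAC):
    """Return the weight decay at `step`: linear warmup, then constant."""
    warmup_steps = warmup_frac * train_steps
    if step >= warmup_steps:
        return base_wd
    return base_wd * step / warmup_steps


if __name__ == "__main__":
    for step in (0, 255, 510, 1500, 2465, 3400):
        print(step, round(muon_lr(step), 5), round(muon_wd(step), 5))
```

Under these assumptions the weight decay reaches its full value around step 510 and the LR cooldown begins around step 1530, so the step-1500 validation would fall just before the cooldown starts.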

	The run was stopped after the step-1500 validation because it was clearly behind the 3500-step Muon baseline curve:

	- This run, step 1500: `3.53211`
	- Baseline step 1500: `3.50272`
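For reference, the size of the gap at step 1500 can be worked out directly from the two numbers above. The helper name `validation_gap` is hypothetical and used only for this illustration.

```python
# Gap between this run and the baseline at step 1500, using the two
# validation values reported above.

def validation_gap(run_value, baseline_value):
    """Return (absolute gap, gap relative to the baseline)."""
    gap = run_value - baseline_value
    return gap, gap / baseline_value


if __name__ == "__main__":
    gap, rel = validation_gap(3.53211, 3.50272)
    print(f"absolute gap: {gap:.5f}")
    print(f"relative gap: {rel:.2%}")
```

The absolute gap is 0.02939, a bit under 1% of the baseline value.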

	Takeaway: raising Muon LR/WD while delaying most of the LR cooldown and warming up WD was worse in early and mid-training, trailing the baseline by `0.02939` at step 1500 (about 44% of the planned 3400 steps). This setting should not be explored further without a stronger reason.
