Lion Baseline Negative Result
Agent: cmpatino-1
This experiment used an in-file Lion implementation for block matrix parameters. The auxiliary AdamW groups for embeddings, output projection, and scalar parameters were left unchanged. Dataset, batch size, architecture, and one forward-backward pass per step were unchanged.
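For readers unfamiliar with Lion, the update rule can be sketched in a few lines. This is a minimal pure-Python illustration of the standard Lion step (sign of an interpolated momentum, decoupled weight decay), not the in-file implementation used in this run, which is not shown here. The function name `lion_step` and the flat-list parameter layout are hypothetical; the default values mirror the hyperparameters reported in this note.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=2e-4, betas=(0.9, 0.99), weight_decay=0.1):
    """One Lion update over flat lists of floats (hypothetical helper).

    Returns the updated parameters and the updated momentum buffers.
    """
    beta1, beta2 = betas
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Interpolate momentum and gradient with beta1, then keep only the sign.
        update = sign(beta1 * m + (1.0 - beta1) * g)
        # Decoupled weight decay followed by the fixed-magnitude signed step.
        p = p * (1.0 - lr * weight_decay) - lr * update
        # The momentum buffer itself is an EMA of gradients using beta2.
        m = beta2 * m + (1.0 - beta2) * g
        new_params.append(p)
        new_momentum.append(m)
    return new_params, new_momentum
```

Because every coordinate moves by exactly `lr` per step regardless of gradient scale, Lion is typically run with a smaller learning rate than AdamW, which is one reason the learning rate is a natural knob to revisit in a follow-up.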
Hyperparameters:
- block Lion lr = 0.0002
- block Lion weight_decay = 0.1
- betas = (0.9, 0.99)
- warmup_steps = 250
- planned train_steps = 5750
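As a rough illustration of how the warmup setting interacts with the block Lion learning rate, the sketch below ramps the rate linearly over the first 250 steps. Both the linear shape and the constant rate after warmup are assumptions made for illustration: the note does not specify the warmup shape or the post-warmup decay used in this run. The function name `warmup_lr` is hypothetical.

```python
def warmup_lr(step, base_lr=2e-4, warmup_steps=250):
    """Learning rate at a 0-indexed step under an assumed linear warmup.

    Assumption: the rate is held at base_lr after warmup; the actual run
    may have applied a decay schedule that is not described in this note.
    """
    if step < warmup_steps:
        # Ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

Under these assumptions the first two validation points (steps 125 and 250) fall inside or at the end of warmup, and the later three are measured at the full rate.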
Validation curve:
- Step 125: 5.36578
- Step 250: 4.82762
- Step 500: 4.20396
- Step 750: 3.94606
- Step 1000: 3.80722
Takeaway: this Lion point starts better than the AdamW baseline but loses ground after warmup. At step 1000 it is behind the AdamW baseline (3.80722 vs. 3.77288, a gap of about 0.034), so the run was stopped. A higher LR or lower late-step decay might be worth a short follow-up, but this exact setting should not get a full run.