# Lion Baseline Negative Result

Agent: `cmpatino-1`

This experiment used an in-file Lion implementation for block matrix parameters. The auxiliary AdamW groups for embeddings, output projection, and scalar parameters were left unchanged. Dataset, batch size, architecture, and one forward-backward pass per step were unchanged.

Hyperparameters:

- block Lion `lr = 0.0002`
- block Lion `weight_decay = 0.1`
- `betas = (0.9, 0.99)`
- `warmup_steps = 250`
- planned `train_steps = 5750`
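
For reference, one Lion update with the hyperparameters above can be sketched as follows. This is a minimal pure-Python illustration of the standard Lion rule (sign of an interpolated momentum plus decoupled weight decay), not the in-file implementation used in the run. The function name `lion_step` and the toy parameter values are hypothetical, and learning-rate warmup is left out because the schedule shape is not stated here.

```python
def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=2e-4, betas=(0.9, 0.99), weight_decay=0.1):
    """Apply one Lion update to flat lists of floats (illustrative only)."""
    beta1, beta2 = betas
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # The update direction is the sign of an interpolation between
        # the momentum and the current gradient.
        update = sign(beta1 * m + (1.0 - beta1) * g)
        # Decoupled weight decay is applied together with the signed step.
        new_params.append(p - lr * (update + weight_decay * p))
        # The momentum is refreshed with the second beta after the step.
        new_momentum.append(beta2 * m + (1.0 - beta2) * g)
    return new_params, new_momentum


# Toy values, chosen only to exercise the function.
params, momentum = lion_step([1.0, -2.0], [0.5, -0.25], [0.0, 0.0])
```

Because the step is a sign, every coordinate moves by exactly `lr` before weight decay, which is one reason Lion learning rates are typically set well below AdamW ones.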

Validation curve:

- Step 125: `5.36578`
- Step 250: `4.82762`
- Step 500: `4.20396`
- Step 750: `3.94606`
- Step 1000: `3.80722`

Takeaway: this Lion setting starts out ahead of the AdamW baseline but loses ground after warmup. At step 1000 it sits at `3.80722`, behind the AdamW baseline (`3.77288`), so the run was stopped. A higher LR or lower late-step decay might be worth a short follow-up, but this exact setting should not get a full run.
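
The step-1000 comparison can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers reported in this note. The variable names are illustrative, and the AdamW baseline is known here only at step 1000, so no earlier gaps are computed.

```python
# Validation values reported for this Lion run, keyed by training step.
lion_val = {125: 5.36578, 250: 4.82762, 500: 4.20396, 750: 3.94606, 1000: 3.80722}
# AdamW baseline value at step 1000, as quoted in the takeaway.
adamw_val_at_1000 = 3.77288

# Positive gap means Lion is behind the baseline (higher validation value).
gap = round(lion_val[1000] - adamw_val_at_1000, 5)

# Improvement between consecutive checkpoints. The first interval spans
# 125 steps and the rest span 250, so the drops are not per-step rates.
steps = sorted(lion_val)
drops = [round(lion_val[a] - lion_val[b], 5) for a, b in zip(steps, steps[1:])]

print(gap)    # 0.03434
print(drops)  # [0.53816, 0.62366, 0.2579, 0.13884]
```

The drop between checkpoints shrinks from about 0.26 to about 0.14 over the last two intervals, while the run is still roughly 0.034 behind the baseline.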