# adamw_tuned_cmpatino-0
**Status:** Negative result. Did not beat v2 baseline.
## What was tried
Used the README's tuning tip ("halve run length, tune all hparams on the shorter run, then scale back up and retune only WD and LR") to pick best (LR, WD) at 2812 steps, then validated at 5625.
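The procedure above can be sketched as a small helper. This is a hedged illustration, not the actual launcher: `train_fn` is a hypothetical hook standing in for whatever script launches a run and returns final val loss, and only the step counts (2812 / 5625) come from this report.

```python
from itertools import product

# Step counts from this report; everything else here is illustrative.
HALF_STEPS, FULL_STEPS = 2812, 5625

def tune_half_then_validate(train_fn, lrs, wds):
    """Sweep (LR, WD) at half length, then validate the winner at full length.

    train_fn(lr, wd, steps) -> val_loss is a hypothetical launcher hook.
    """
    # 1) grid sweep on the cheaper half-length runs
    sweep = {(lr, wd): train_fn(lr, wd, HALF_STEPS)
             for lr, wd in product(lrs, wds)}
    best = min(sweep, key=sweep.get)
    # 2) re-run only the best (LR, WD) at full length
    return best, train_fn(*best, FULL_STEPS)
```

As the report shows, the catch is that step 2 can disagree with step 1: the half-length winner is not guaranteed to win at full length.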
### Half-length sweeps (2812 steps, multi-LR AdamW v2)
| block_wd | block_lr | val_loss @ 2812 |
|---:|---:|---:|
| 0.05 | 0.0015 | 3.44780 |
| 0.10 | 0.0015 | 3.46050 |
| 0.20 | 0.0015 | 3.44864 |
| 0.05 | **0.0010** | **3.43422** ← best |
| 0.05 | 0.0020 | 3.52063 |
| 0.05 | 0.0030 | 3.50910 |

Best half-length config: `block_lr=0.0010, block_wd=0.05` (3.43422), beating the baseline configuration (wd=0.10 / lr=0.0015) at 2812 steps by 0.026.
### Full-length validation (5625 steps, lr=0.0010 / wd=0.05)
`val_loss = 3.30295` at step 5625.
## Why it failed
The improvement at half-length did **not** transfer to full length:
| Run length | wd=0.10 / lr=0.0015 | wd=0.05 / lr=0.0010 | Δ (baseline − new) |
|---|---:|---:|---:|
| 2812 steps (half) | 3.46050 | 3.43422 | +0.026 (better) |
| 5625 steps (full) | 3.28434 | 3.30295 | −0.019 (worse) |

The ranking flipped between run lengths. A likely cause: the lower LR converges more slowly but more stably, which the short-run schedule (cooldown starting at step ~844) rewards; over the long-run schedule (cooldown starting at step ~1687), the higher LR has enough time for its extra movement to pay off.
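Note that ~844 and ~1687 are both roughly 30% of their respective run lengths, so the schedules match in shape but not in absolute duration. A minimal sketch, assuming a constant-then-linear-cooldown shape (the repo's exact schedule may differ):

```python
def lr_multiplier(step: int, total_steps: int, cooldown_start_frac: float = 0.3) -> float:
    """LR multiplier: constant at 1.0, then linear cooldown to 0.

    cooldown_start_frac=0.3 reproduces the cooldown starts mentioned above
    (~844 of 2812 steps, ~1687 of 5625 steps); the shape itself is an assumption.
    """
    start = int(total_steps * cooldown_start_frac)
    if step < start:
        return 1.0
    # linear decay over the remaining steps
    return max(0.0, (total_steps - step) / (total_steps - start))
```

The fractional shape being identical means the half-length run differs only in how many absolute steps each phase gets, which is exactly where the short run can mislead the sweep.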
## Lesson
The README hint that "val loss at step 1,000 does not strongly predict final loss" applies to LR and WD here, even at 50% of the run length. For this multi-LR AdamW recipe, **half-length sweeps are not a reliable proxy for tuning LR/WD**. Full-length runs are needed.
## Files
- `run_validation.sh` — launcher used for the 5625-step run
- `log_lr0.0010_wd0.05_5625_cmpatino-0.txt` — full training log
- `results.json` — machine-readable result
