adamw_tuned_cmpatino-0

Status: Negative result. Did not beat v2 baseline.

What was tried

Followed the README's tuning tip ("halve run length, tune all hparams on the shorter run, then scale back up and retune only WD and LR"): picked the best (LR, WD) pair at 2812 steps, then validated it at 5625 steps.
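The selection step above can be sketched as a small script. The val_loss values are the measured results reported below; the variable names are hypothetical, not from the actual launcher.

```python
# Half-length sweep: pick the (wd, lr) pair with the lowest val_loss at 2812
# steps, then validate only that pair at the full 5625 steps.
HALF_STEPS, FULL_STEPS = 2812, 5625

# (block_wd, block_lr) -> val_loss @ 2812 steps (measured values from this report)
half_length_sweep = {
    (0.05, 0.0015): 3.44780,
    (0.10, 0.0015): 3.46050,
    (0.20, 0.0015): 3.44864,
    (0.05, 0.0010): 3.43422,
    (0.05, 0.0020): 3.52063,
    (0.05, 0.0030): 3.50910,
}

# Select the best config by half-length validation loss.
best_wd, best_lr = min(half_length_sweep, key=half_length_sweep.get)
# best_wd=0.05, best_lr=0.0010 -- the pair that was then run at full length
```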

Half-length sweeps (2812 steps, multi-LR AdamW v2)

| block_wd | block_lr | val_loss @ 2812 |
|----------|----------|-----------------|
| 0.05     | 0.0015   | 3.44780         |
| 0.10     | 0.0015   | 3.46050         |
| 0.20     | 0.0015   | 3.44864         |
| 0.05     | 0.0010   | 3.43422 ← best  |
| 0.05     | 0.0020   | 3.52063         |
| 0.05     | 0.0030   | 3.50910         |

Best half-length config: block_lr=0.0010, block_wd=0.05 (3.43422), beating the baseline-scaled (wd=0.10/lr=0.0015) at 2812 by 0.026.

Full-length validation (5625 steps, lr=0.0010 / wd=0.05)

val_loss = 3.30295 at step 5625.

Why it failed

The improvement at half-length did not transfer to full length:

|                   | wd=0.10 / lr=0.0015 | wd=0.05 / lr=0.0010 | Δ             |
|-------------------|---------------------|---------------------|---------------|
| 2812 steps (half) | 3.46050             | 3.43422             | +0.026 better |
| 5625 steps (full) | 3.28434             | 3.30295             | -0.019 worse  |
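The rank flip in the table can be verified directly from the four measured losses (variable names here are illustrative, not from the training code):

```python
# Measured val_loss values from the comparison table.
half = {"baseline": 3.46050, "tuned": 3.43422}  # 2812 steps
full = {"baseline": 3.28434, "tuned": 3.30295}  # 5625 steps

# Positive delta = tuned config is better (lower loss) than baseline.
half_delta = half["baseline"] - half["tuned"]  # +0.026: tuned wins at half length
full_delta = full["baseline"] - full["tuned"]  # -0.019: tuned loses at full length

ranking_flipped = (half_delta > 0) != (full_delta > 0)  # True
```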

The ranking flipped. Likely cause: the lower LR converges more slowly but more stably. The short-run schedule (cooldown starting at step ~844) rewards that stability, while the long-run schedule (cooldown starting at step ~1687) leaves enough time for the higher LR's extra movement to win out.
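The report doesn't show the schedule itself, but the two cooldown start points are consistent with cooldown beginning at a fixed fraction of total steps (844/2812 ≈ 1687/5625 ≈ 0.30). A minimal sketch under that assumption, with a simple hold-then-linear-decay shape (the real recipe's decay curve may differ):

```python
def lr_at(step: int, total_steps: int, peak_lr: float,
          cooldown_frac: float = 0.30) -> float:
    """Assumed schedule: hold peak_lr, then decay linearly to 0 starting at
    cooldown_frac of the run (so the cooldown start scales with run length)."""
    start = int(cooldown_frac * total_steps)  # ~844 for 2812, ~1687 for 5625
    if step < start:
        return peak_lr
    return peak_lr * (total_steps - step) / (total_steps - start)
```

Because `start` scales with `total_steps`, halving the run also halves the absolute step at which cooldown begins, which changes how much high-LR "extra movement" the schedule tolerates before decay kicks in.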

Lesson

The README hint that "val loss at step 1,000 does not strongly predict final loss" applies to LR and WD here, even at 50% of the run length. For this multi-LR AdamW recipe, half-length sweeps are not a reliable proxy for tuning LR/WD. Full-length runs are needed.

Files

  • run_validation.sh — launcher used for the 5625-step run
  • log_lr0.0010_wd0.05_5625_cmpatino-0.txt — full training log
  • results.json — machine-readable result
