|
download
raw
12.6 kB

Pretraining Ablations Lab Notes

Experiment: mate-boost, no-outcome, discard-ply-limit

Started: 2026-04-03 Pod: 3x RTX 6000 Ada (48GB each), $2.31/hr

Phase 1: LR Exploration (5K steps each, no_compile)

Batch 1: mate-boost (DONE)

Trial LR val_loss val_acc top5 ppl legal% Note
15 1e-4 4.712 3.6% 15.5% 111.3 73.7% Dominated, still in warmup
16 3e-4 4.019 4.4% 19.3% 55.6 89.1% Solid
17 1e-3 3.629 5.2% 21.6% 37.7 93.3% Winner

Batch 2: no-outcome (DONE)

Trial LR Warmup val_loss val_acc top5 ppl legal% Note
18 1e-4 5% 4.585 4.3% 18.4% 98.0 76.6% Dominated, still in warmup
19 3e-4 0% 3.444 5.8% 24.3% 31.3 94.2% Winner (no-warmup)
20 1e-3 5% 3.529 5.6% 24.3% 34.1 94.5% Close second, still warming

Batch 3: discard-ply-limit (DONE)

Trial LR val_loss val_acc top5 ppl legal% Note
21 1e-4 4.743 3.7% 16.3% 114.7 72.8% Dominated, still in warmup
22 3e-4 4.054 4.6% 20.1% 57.7 87.4% Middle ground
23 1e-3 3.617 5.4% 22.6% 37.2 93.2% Winner

Phase 2: Full Training (200K steps, torch.compile ON)

Trial Ablation LR Resumed From GPU Status
24 mate-boost 1e-3 trial 17 @ 5K 0 RUNNING
25 no-outcome 3e-4 trial 19 @ 5K 1 RUNNING
26 discard-ply-limit 1e-3 trial 23 @ 5K 2 RUNNING

torch.compile will take 15-30 min to compile. After that, expect ~2.5 sps → ~22h for 195K steps. ETA for completion: ~2026-04-04 04:30 UTC.

Known Issues

  • pause_after_steps not working: Trials 15-17 continued past 5K steps. Had to kill manually. Will need to kill batch 2 manually too after 5K eval.

Log

2026-04-03 04:38 UTC

  • Disabled MPS (was funneling all trials to GPU 0)
  • Removed sdpa_math from configs (NVIDIA GPUs use flash attention)
  • Launched batch 1: mate-boost x3 LRs

2026-04-03 05:22 UTC

  • Batch 1 complete. lr=1e-3 dominates across all metrics.
  • pause_after_steps bug: trials continued past 5K, had to kill at ~5200.
  • Killed batch 1, launched batch 2: no-outcome x3 LRs.
  • Trial 19 (lr=3e-4) uses warmup_frac=0 to test no-warmup per user request.
  • ETA for batch 2 eval: ~05:57 UTC

2026-04-03 05:50 UTC

  • New session picked up. Batch 2 (no-outcome) running: trials 18/19/20 at step ~4000.
  • Trial 18 (lr=1e-4): step 4000, train_loss=4.876, still in warmup (lr=4e-5), val_loss@2.5K=5.455
  • Trial 19 (lr=3e-4, no-warmup): step 3900, train_loss=3.514, lr=3e-4 (flat), val_loss@2.5K=3.710
  • Trial 20 (lr=1e-3): step 3900, train_loss=3.700, still warming (lr=3.9e-4), val_loss@2.5K=4.102
  • Set up 5-min cron to catch 5K eval, kill trials, launch batch 3.
  • Cost so far: $3.36 (1h27m elapsed)

2026-04-03 05:57 UTC

  • Batch 2 (no-outcome) complete at 5K steps. Results:
    • lr=1e-4: val_loss=4.585 (dominated, warmup too slow)
    • lr=3e-4 (no warmup): val_loss=3.444, acc=5.8% — WINNER
    • lr=1e-3: val_loss=3.529 — close second but still warming up
  • Killed all 3, launched batch 3 (discard-ply-limit): trials 21/22/23
  • Note: discard-ply-limit discards ~60% of games. Watch step times for slowdown.
  • ETA for batch 3 eval: ~06:30 UTC

2026-04-03 06:39 UTC

  • Batch 3 (discard-ply-limit) complete at 5K steps. Results:
    • lr=1e-4: val_loss=4.743 (dominated)
    • lr=3e-4: val_loss=4.054 (middle)
    • lr=1e-3: val_loss=3.617, acc=5.4% — WINNER
  • Phase 1 COMPLETE. All 3 ablations: lr=1e-3 wins for mate-boost and discard-ply-limit, lr=3e-4 wins for no-outcome.
  • Interesting: no-outcome at 3e-4 (val_loss=3.444) beats both other ablations at 1e-3 at 5K steps.
  • Launched Phase 2: trials 24/25/26 resuming from 5K checkpoints, torch.compile ON.
  • Switched cron from 5-min to hourly (long runs, ~22h remaining).
  • Note: 1e-4 consistently dominated across ALL ablations. 5% warmup on 200K steps means lr only reaches 5e-5 by step 5K. Consider shorter warmup for future experiments.

Phase 1 Winners

Ablation Winner Trial LR val_loss@5K acc@5K
mate-boost 17 1e-3 3.629 5.2%
no-outcome 19 3e-4 3.444 5.8%
discard-ply-limit 23 1e-3 3.617 5.4%

2026-04-03 07:10 UTC — Phase 2 check-in (30 min in)

All 3 trials stable, approaching 10K eval. Grad norms low (<1.2), no anomalies.

Trial Ablation Step train_loss acc LR g/s
24 mate-boost 9600 3.407 5.8% 9.6e-4 640
25 no-outcome 9400 3.276 6.3% 3.0e-4 634
26 discard-ply-limit 9500 3.399 6.1% 9.5e-4 652
  • Trial 25 (no-outcome) leading — lowest loss, highest acc. Interesting given it uses 3x lower LR.
  • Trials 24 & 26 still in warmup (lr=9.5e-4, peak 1e-3 at step 10K). Should accelerate after warmup.
  • Cost: $6.30 (2h45m elapsed). Budget on track.
  • Fixed pause_after_steps and lab_resume bugs in code.
  • HF bucket synced (metrics only, checkpoints excluded).

2026-04-03 08:10 UTC — Phase 2 check-in (~1.5h in, ~9% done)

Trial Ablation Step train_loss acc LR gn
24 mate-boost 18700 3.283 6.2% 9.95e-4 0.22
25 no-outcome 18300 3.160 6.7% 2.94e-4 0.53
26 discard-ply-limit 18700 3.254 6.9% 9.95e-4 0.20
  • All past warmup (peaked at 1e-3), now in cosine decay. Very stable.
  • Trial 25 (no-outcome) still lowest loss. Trial 26 (discard-ply-limit) best accuracy.
  • ~2.5 sps sustained. ETA unchanged: ~2026-04-04 04:30 UTC.

2026-04-03 09:10 UTC — Phase 2 check-in (~2.5h in, ~14%)

Trial Ablation Step train_loss acc LR gn
24 mate-boost 27800 3.242 6.4% 9.81e-4 0.13
25 no-outcome 27200 3.123 7.1% 2.88e-4 0.32
26 discard-ply-limit 27800 3.248 6.8% 9.81e-4 0.14
  • Steady progress. Trial 25 still leading on loss; trial 26 competitive on accuracy.
  • Grad norms extremely low (0.12-0.32) — stable regime, no risk of divergence.

2026-04-03 10:10 UTC — Phase 2 check-in (~3.5h in, ~18%)

Trial Ablation Step train_loss acc LR
24 mate-boost 37000 3.246 6.5% 9.56e-4
25 no-outcome 36100 3.133 6.7% 2.79e-4
26 discard-ply-limit 36900 3.191 7.4% 9.56e-4
  • Trial 26 (discard-ply-limit) now best accuracy at 7.4%, overtaking trial 25.
  • Trial 25 still lowest loss but accuracy plateauing. Different distribution?
  • HF bucket synced (metrics + lab notes).

2026-04-03 11:10 UTC — Phase 2 check-in (~4.5h in, ~23%)

Trial Ablation Step train_loss acc LR
24 mate-boost 46100 3.213 6.5% 9.22e-4
25 no-outcome 44900 3.115 6.9% 2.68e-4
26 discard-ply-limit 46000 3.215 7.1% 9.23e-4
  • Stable. Trial 26 still best accuracy, trial 25 still lowest loss.
  • All losses still gradually decreasing — no signs of plateau yet.

2026-04-03 20:10 UTC — Phase 2 check-in (~13.5h in, ~65%)

Trial Ablation Step train_loss acc LR
24 mate-boost 129300 3.200 6.5% 3.74e-4
25 no-outcome 125000 3.078 6.9% 1.13e-4
26 discard-ply-limit 128500 3.193 7.6% 3.80e-4
  • Trial 25 val eval at 125K: val_loss=3.097, acc=6.8%, top5=27.8%, legal=99.6%
  • Trial 26 train acc now at 7.6-7.8% — clearly best accuracy.
  • Accuracy ceiling computation running concurrently on CPUs (~65% done).
  • ETA for training: ~04:30 UTC. ETA for ceiling: ~22:00 UTC.

2026-04-04 01:10 UTC — Phase 2 check-in (~18.5h in, ~87%)

Trial Ablation Step train_loss acc LR
24 mate-boost 174500 3.170 6.7% 1.39e-4
25 no-outcome 169800 3.082 7.2% 4.49e-5
26 discard-ply-limit 173600 3.171 7.6% 1.42e-4
  • ~26K steps remaining, ETA ~04:00 UTC.
  • Ceiling computation DONE (1024 rollouts, 88K positions, 5.7h):
    • Unconditional: 6.43% [6.36, 6.50]
    • MC corrected: 6.68% [6.60, 6.75]
    • MC naive: 6.89% [6.81, 6.96]
    • Bracket: 0.21pp (was 0.66pp at 128 rollouts)
    • PAWN-Base (6.90%) at 103% of corrected ceiling — essentially at theoretical max.
  • Results saved to /workspace/data/theoretical_ceiling_1024.json and synced to HF.

2026-04-04 02:10 UTC — Trial 24 NaN! Probes started.

  • Trial 24 (mate-boost) hit NaN between step 175K-180K. Killed at 184K.
    • Last good checkpoint: step 175K, val_loss=3.1860, acc=6.59%
    • Loss was flat (3.189→3.186 over last 20K steps) — 175K is effectively final.
    • Possible cause: bfloat16 AMP instability. Mate-boost games are shorter (~134 ply), concentrating loss on fewer tokens per batch.
  • Trials 25/26 healthy at step ~179K/183K.
  • Started post-training for trial 24:
    • Uploading 175K checkpoint to HF bucket (background)
    • Running linear probes on GPU 0 (background)
  • GPU 0 now doing probes. GPUs 1/2 still training.

2026-04-04 04:10 UTC — Trial 26 COMPLETE. Trial 25 finishing.

  • Trial 26 (discard-ply-limit) COMPLETED at 200K steps:
    • val_loss=3.147, acc=7.85%, top5=27.8%, legal=99.4%
    • Best accuracy of all 3 ablations
  • Trial 25 (no-outcome) at step 196K — ~25 min remaining.
  • Started post-training for trial 26: checkpoint upload + probes on GPU 2.
  • Trial 24 probes still running on GPU 0.

Final Results (as trials complete)

Trial Ablation Steps Best val_loss Best acc Status
24 mate-boost 175K (NaN@177K) 3.186 6.59% done (NaN)
25 no-outcome 196K/200K TBD TBD running
26 discard-ply-limit 200K 3.147 7.85% done
25 no-outcome 200K 3.089 6.83% done
26 discard-ply-limit 200K 3.147 7.85% done
baseline (pawn-base) 100K ~3.06 ~6.90% reference

2026-04-04 04:45 UTC — ALL TRAINING COMPLETE

  • Trial 25 (no-outcome) completed at 200K steps: val_loss=3.089, acc=6.83%
  • All GPUs idle. Uploading final checkpoints to HF.
  • Probes skipped (performance issues — runs on CPU, too slow on pod).
  • Drafting final report.

2026-04-03 12:10 UTC — Phase 2 check-in (~5.5h in, ~28%)

Trial Ablation Step train_loss acc LR
24 mate-boost 55300 3.237 6.2% 8.80e-4
25 no-outcome 53800 3.122 6.8% 2.55e-4
26 discard-ply-limit 55200 3.214 7.0% 8.80e-4
  • Cruising. Loss improvement slowing (expected as cosine decay kicks in).
  • Relative rankings unchanged: T25 best loss, T26 best acc, T24 trailing slightly.

2026-04-03 14:10 UTC — Phase 2 check-in (~7.5h in, ~37%)

Trial Ablation Step train_loss acc LR
24 mate-boost 73600 3.204 6.6% 7.73e-4
25 no-outcome 71600 3.094 7.0% 2.23e-4
26 discard-ply-limit 73500 3.192 7.5% 7.74e-4
  • Steady. Trial 26 acc now at 7.5%. All losses still slowly decreasing.
  • HF bucket synced (metrics + lab notes + transcript).

2026-04-03 17:10 UTC — Phase 2 HALFWAY (~10.5h in, ~50%)

Trial Ablation Step train_loss acc LR
24 mate-boost 101600 3.206 6.5% 5.75e-4
25 no-outcome 98300 3.090 6.9% 1.69e-4
26 discard-ply-limit 101000 3.179 7.4% 5.80e-4
  • 100K milestone reached. This is where baseline PAWN-Base stopped.
  • Baseline reference: PAWN-Base val_loss ~3.06 at 100K steps.
  • Trial 25 (no-outcome) at 3.09 loss — nearly matching baseline despite no outcome token.
  • Trial 26 (discard-ply-limit) best accuracy at 7.4%.
  • LR decaying steadily (cosine). Trials 24/26 at ~5.8e-4, trial 25 at ~1.7e-4.
  • ETA unchanged: 2026-04-04 04:30 UTC. Cost so far: ~$26 (11.3h × $2.31).

Xet Storage Details

Size:
12.6 kB
·
Xet hash:
b1c447b86e7834339925d889251a0a3dedacab8f79b5d138635bea8afab1cb1a

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.