Pretraining Ablations Lab Notes

Experiment: mate-boost, no-outcome, discard-ply-limit

Started: 2026-04-03
Pod: 3x RTX 6000 Ada (48 GB each), $2.31/hr

Phase 1: LR Exploration (5K steps each, no_compile)

Batch 1: mate-boost (DONE)

Trial	LR	val_loss	val_acc	top5	ppl	legal%	Note
15	1e-4	4.712	3.6%	15.5%	111.3	73.7%	Dominated, still in warmup
16	3e-4	4.019	4.4%	19.3%	55.6	89.1%	Solid
17	1e-3	3.629	5.2%	21.6%	37.7	93.3%	Winner

Batch 2: no-outcome (DONE)

Trial	LR	Warmup	val_loss	val_acc	top5	ppl	legal%	Note
18	1e-4	5%	4.585	4.3%	18.4%	98.0	76.6%	Dominated, still in warmup
19	3e-4	0%	3.444	5.8%	24.3%	31.3	94.2%	Winner (no-warmup)
20	1e-3	5%	3.529	5.6%	24.3%	34.1	94.5%	Close second, still warming

Batch 3: discard-ply-limit (DONE)

Trial	LR	val_loss	val_acc	top5	ppl	legal%	Note
21	1e-4	4.743	3.7%	16.3%	114.7	72.8%	Dominated, still in warmup
22	3e-4	4.054	4.6%	20.1%	57.7	87.4%	Middle ground
23	1e-3	3.617	5.4%	22.6%	37.2	93.2%	Winner
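A quick sanity check on the Phase 1 tables: the ppl column looks like exp(val_loss), with val_loss in nats. That relationship is an inference from the numbers, not something the training code is confirmed to do. The sketch below recomputes it from the rounded table values, so small gaps from rounding are expected.

```python
import math

# (trial, val_loss, logged ppl) copied from the three Phase 1 tables.
PHASE1 = [
    (15, 4.712, 111.3), (16, 4.019, 55.6), (17, 3.629, 37.7),
    (18, 4.585, 98.0), (19, 3.444, 31.3), (20, 3.529, 34.1),
    (21, 4.743, 114.7), (22, 4.054, 57.7), (23, 3.617, 37.2),
]


def perplexity(val_loss: float) -> float:
    """Perplexity of a mean cross-entropy loss given in nats (assumed)."""
    return math.exp(val_loss)


def max_relative_gap(rows) -> float:
    """Largest relative difference between exp(val_loss) and the logged ppl."""
    return max(abs(perplexity(loss) - ppl) / ppl for _, loss, ppl in rows)


if __name__ == "__main__":
    for trial, loss, ppl in PHASE1:
        print(f"trial {trial}: exp({loss}) = {perplexity(loss):.1f} (logged {ppl})")
    print(f"max relative gap: {max_relative_gap(PHASE1):.4f}")
```

If the gap stays well under 1%, the ppl column carries no information beyond val_loss and can be dropped from future tables.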

Phase 2: Full Training (200K steps, torch.compile ON)

Trial	Ablation	LR	Resumed From	GPU	Status
24	mate-boost	1e-3	trial 17 @ 5K	0	RUNNING
25	no-outcome	3e-4	trial 19 @ 5K	1	RUNNING
26	discard-ply-limit	1e-3	trial 23 @ 5K	2	RUNNING

torch.compile will take 15-30 min to compile. After that, expect ~2.5 sps → ~22h for 195K steps. ETA for completion: ~2026-04-04 04:30 UTC.
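The ETA arithmetic above can be reproduced with a few lines. This is a minimal sketch: the launch time is taken from the 06:39 UTC log entry, and the compile overhead is a free parameter (15 to 30 min per the estimate above), so the result is only a rough check on the ~04:30 UTC figure.

```python
from datetime import datetime, timedelta, timezone


def training_hours(steps: int, steps_per_sec: float) -> float:
    """Wall-clock hours needed for `steps` at a sustained step rate."""
    return steps / steps_per_sec / 3600.0


def eta(launch: datetime, steps: int, steps_per_sec: float,
        compile_minutes: float = 0.0) -> datetime:
    """Launch time plus compile overhead plus training time."""
    return launch + timedelta(
        minutes=compile_minutes,
        hours=training_hours(steps, steps_per_sec),
    )


if __name__ == "__main__":
    # Launch time assumed from the log entry timestamp, not from job metadata.
    launch = datetime(2026, 4, 3, 6, 39, tzinfo=timezone.utc)
    print(f"{training_hours(195_000, 2.5):.2f} h of training")
    for compile_min in (15, 30):
        print(compile_min, "min compile ->", eta(launch, 195_000, 2.5, compile_min))
```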

Known Issues

pause_after_steps not working: trials 15-17 continued past 5K steps and had to be killed manually. Batch 2 will need the same manual kill after its 5K eval. (Fixed later; see the 2026-04-03 07:10 UTC entry.)

Log

2026-04-03 04:38 UTC

Disabled MPS (was funneling all trials to GPU 0)
Removed sdpa_math from configs (NVIDIA GPUs use flash attention)
Launched batch 1: mate-boost x3 LRs

2026-04-03 05:22 UTC

Batch 1 complete. lr=1e-3 dominates across all metrics.
pause_after_steps bug: trials continued past 5K, had to kill at ~5200.
Killed batch 1, launched batch 2: no-outcome x3 LRs.
Trial 19 (lr=3e-4) uses warmup_frac=0 to test no-warmup per user request.
ETA for batch 2 eval: ~05:57 UTC

2026-04-03 05:50 UTC

New session picked up. Batch 2 (no-outcome) running: trials 18/19/20 at step ~4000.
Trial 18 (lr=1e-4): step 4000, train_loss=4.876, still in warmup (lr=4e-5), val_loss@2.5K=5.455
Trial 19 (lr=3e-4, no-warmup): step 3900, train_loss=3.514, lr=3e-4 (flat), val_loss@2.5K=3.710
Trial 20 (lr=1e-3): step 3900, train_loss=3.700, still warming (lr=3.9e-4), val_loss@2.5K=4.102
Set up 5-min cron to catch 5K eval, kill trials, launch batch 3.
Cost so far: $3.36 (1h27m elapsed)
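The cost figures in these notes are elapsed pod time multiplied by the hourly rate. A small helper for that, as a sketch: the $2.31/hr rate comes from the header of these notes, and billing granularity is ignored.

```python
def pod_cost(hours: int, minutes: int, rate_per_hour: float = 2.31) -> float:
    """Pod cost in dollars for an elapsed time of hours + minutes."""
    return (hours + minutes / 60.0) * rate_per_hour


if __name__ == "__main__":
    # 1h27m elapsed, as in the entry above; the few-cent gap to the logged
    # figure is presumably rounding of the elapsed time.
    print(f"${pod_cost(1, 27):.2f}")
```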

2026-04-03 05:57 UTC

Batch 2 (no-outcome) complete at 5K steps. Results:
- lr=1e-4: val_loss=4.585 (dominated, warmup too slow)
- lr=3e-4 (no warmup): val_loss=3.444, acc=5.8% — WINNER
- lr=1e-3: val_loss=3.529 — close second but still warming up
Killed all 3, launched batch 3 (discard-ply-limit): trials 21/22/23
Note: discard-ply-limit discards ~60% of games. Watch step times for slowdown.
ETA for batch 3 eval: ~06:30 UTC

2026-04-03 06:39 UTC

Batch 3 (discard-ply-limit) complete at 5K steps. Results:
- lr=1e-4: val_loss=4.743 (dominated)
- lr=3e-4: val_loss=4.054 (middle)
- lr=1e-3: val_loss=3.617, acc=5.4% — WINNER
Phase 1 COMPLETE. All 3 ablations: lr=1e-3 wins for mate-boost and discard-ply-limit, lr=3e-4 wins for no-outcome.
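Interesting: no-outcome at 3e-4 (val_loss=3.444) beats both other ablations at 1e-3 at 5K steps. Caveat: trial 19 is also the only run without warmup, so this comparison confounds the ablation with the warmup schedule.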
Launched Phase 2: trials 24/25/26 resuming from 5K checkpoints, torch.compile ON.
Switched cron from 5-min to hourly (long runs, ~22h remaining).
Note: lr=1e-4 was dominated in every ablation. With 5% warmup on a 200K-step schedule, warmup lasts 10K steps, so at step 5K the lr is only at half of peak (5e-5 for the 1e-4 runs). Consider shorter warmup for future experiments.
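To make the warmup note concrete, here is a sketch of a schedule that reproduces the LR values logged in these notes: linear warmup over the first 5% of steps, then cosine decay. The decay floor of 10% of peak (`min_ratio`) is an assumption inferred by fitting the logged values, not read from the training config, and `lr_at` is a hypothetical helper name.

```python
import math


def lr_at(step: int, peak: float, total_steps: int = 200_000,
          warmup_frac: float = 0.05, min_ratio: float = 0.1) -> float:
    """Linear warmup, then cosine decay from peak to min_ratio * peak.

    min_ratio=0.1 is an inferred assumption, not a confirmed config value.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak * (min_ratio + (1.0 - min_ratio) * cosine)


if __name__ == "__main__":
    # With 5% warmup, a 1e-4 run is only at half of peak by step 5K.
    print(f"{lr_at(5_000, 1e-4):.2e}")
    # Later Phase 2 check-ins, for comparison with the logged LR column.
    print(f"{lr_at(129_300, 1e-3):.2e}")
    print(f"{lr_at(169_800, 3e-4, warmup_frac=0.0):.2e}")
```

Shortening warmup for a future sweep would then just mean lowering `warmup_frac`, for example to 0.01 so that a 5K-step exploration run sees the full peak LR.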

Phase 1 Winners

Ablation	Winner Trial	LR	val_loss@5K	acc@5K
mate-boost	17	1e-3	3.629	5.2%
no-outcome	19	3e-4	3.444	5.8%
discard-ply-limit	23	1e-3	3.617	5.4%

2026-04-03 07:10 UTC — Phase 2 check-in (30 min in)

All 3 trials stable, approaching 10K eval. Grad norms low (<1.2), no anomalies.

Trial	Ablation	Step	train_loss	acc	LR	g/s
24	mate-boost	9600	3.407	5.8%	9.6e-4	640
25	no-outcome	9400	3.276	6.3%	3.0e-4	634
26	discard-ply-limit	9500	3.399	6.1%	9.5e-4	652

Trial 25 (no-outcome) leading — lowest loss, highest acc. Interesting given it uses 3x lower LR.
Trials 24 & 26 still in warmup (lr=9.5e-4, peak 1e-3 at step 10K). Should accelerate after warmup.
Cost: ~$6.30 (~2h45m elapsed). Budget on track.
Fixed pause_after_steps and lab_resume bugs in code.
HF bucket synced (metrics only, checkpoints excluded).

2026-04-03 08:10 UTC — Phase 2 check-in (~1.5h in, ~9% done)

Trial	Ablation	Step	train_loss	acc	LR	grad_norm
24	mate-boost	18700	3.283	6.2%	9.95e-4	0.22
25	no-outcome	18300	3.160	6.7%	2.94e-4	0.53
26	discard-ply-limit	18700	3.254	6.9%	9.95e-4	0.20

Trials 24/26 are past warmup (peaked at 1e-3); trial 25 had no warmup. All three are now in cosine decay. Very stable.
Trial 25 (no-outcome) still lowest loss. Trial 26 (discard-ply-limit) best accuracy.
~2.5 sps sustained. ETA unchanged: ~2026-04-04 04:30 UTC.

2026-04-03 09:10 UTC — Phase 2 check-in (~2.5h in, ~14%)

Trial	Ablation	Step	train_loss	acc	LR	grad_norm
24	mate-boost	27800	3.242	6.4%	9.81e-4	0.13
25	no-outcome	27200	3.123	7.1%	2.88e-4	0.32
26	discard-ply-limit	27800	3.248	6.8%	9.81e-4	0.14

Steady progress. Trial 25 still leading on loss; trial 26 competitive on accuracy.
Grad norms very low (0.13-0.32): stable regime, no sign of divergence so far.

2026-04-03 10:10 UTC — Phase 2 check-in (~3.5h in, ~18%)

Trial	Ablation	Step	train_loss	acc	LR
24	mate-boost	37000	3.246	6.5%	9.56e-4
25	no-outcome	36100	3.133	6.7%	2.79e-4
26	discard-ply-limit	36900	3.191	7.4%	9.56e-4

Trial 26 (discard-ply-limit) now best accuracy at 7.4%, overtaking trial 25.
Trial 25 still lowest loss but accuracy plateauing. Different distribution?
HF bucket synced (metrics + lab notes).

2026-04-03 11:10 UTC — Phase 2 check-in (~4.5h in, ~23%)

Trial	Ablation	Step	train_loss	acc	LR
24	mate-boost	46100	3.213	6.5%	9.22e-4
25	no-outcome	44900	3.115	6.9%	2.68e-4
26	discard-ply-limit	46000	3.215	7.1%	9.23e-4

Stable. Trial 26 still best accuracy, trial 25 still lowest loss.
All losses still gradually decreasing — no signs of plateau yet.

2026-04-03 20:10 UTC — Phase 2 check-in (~13.5h in, ~65%)

Trial	Ablation	Step	train_loss	acc	LR
24	mate-boost	129300	3.200	6.5%	3.74e-4
25	no-outcome	125000	3.078	6.9%	1.13e-4
26	discard-ply-limit	128500	3.193	7.6%	3.80e-4

Trial 25 val eval at 125K: val_loss=3.097, acc=6.8%, top5=27.8%, legal=99.6%
Trial 26 train acc now at 7.6-7.8% — clearly best accuracy.
Accuracy ceiling computation running concurrently on CPUs (~65% done).
ETA for training: ~04:30 UTC. ETA for ceiling: ~22:00 UTC.

2026-04-04 01:10 UTC — Phase 2 check-in (~18.5h in, ~87%)

Trial	Ablation	Step	train_loss	acc	LR
24	mate-boost	174500	3.170	6.7%	1.39e-4
25	no-outcome	169800	3.082	7.2%	4.49e-5
26	discard-ply-limit	173600	3.171	7.6%	1.42e-4

~26K steps remaining, ETA ~04:00 UTC.
Ceiling computation DONE (1024 rollouts, 88K positions, 5.7h):
- Unconditional: 6.43% [6.36, 6.50]
- MC corrected: 6.68% [6.60, 6.75]
- MC naive: 6.89% [6.81, 6.96]
- Bracket: 0.21pp (was 0.66pp at 128 rollouts)
- PAWN-Base (6.90%) at 103% of corrected ceiling — essentially at theoretical max.
Results saved to /workspace/data/theoretical_ceiling_1024.json and synced to HF.
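The bracket and the "103% of corrected ceiling" figure follow directly from the numbers in the list above. A short check, assuming "bracket" means MC naive minus MC corrected (which matches the 0.21pp figure):

```python
# Point estimates from the 1024-rollout ceiling computation, in percent.
CEILING = {"unconditional": 6.43, "mc_corrected": 6.68, "mc_naive": 6.89}
PAWN_BASE_ACC = 6.90  # baseline accuracy quoted in the notes, in percent


def bracket_pp(ceiling: dict) -> float:
    """Gap between the naive and corrected MC estimates, in percentage points."""
    return ceiling["mc_naive"] - ceiling["mc_corrected"]


def fraction_of_ceiling(acc: float, ceiling: float) -> float:
    """Model accuracy as a fraction of a ceiling estimate."""
    return acc / ceiling


if __name__ == "__main__":
    print(f"bracket: {bracket_pp(CEILING):.2f}pp")
    ratio = fraction_of_ceiling(PAWN_BASE_ACC, CEILING["mc_corrected"])
    print(f"PAWN-Base vs corrected ceiling: {ratio:.1%}")
```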

2026-04-04 02:10 UTC — Trial 24 NaN! Probes started.

Trial 24 (mate-boost) hit NaN between step 175K-180K. Killed at 184K.
- Last good checkpoint: step 175K, val_loss=3.1860, acc=6.59%
- Loss was flat (3.189→3.186 over last 20K steps) — 175K is effectively final.
- Possible cause: bfloat16 AMP instability. Mate-boost games are shorter (~134 ply), concentrating loss on fewer tokens per batch.
Trials 25/26 healthy at step ~179K/183K.
Started post-training for trial 24:
- Uploading 175K checkpoint to HF bucket (background)
- Running linear probes on GPU 0 (background)
GPU 0 now doing probes. GPUs 1/2 still training.
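For future runs, locating the divergence point and the last usable checkpoint can be scripted rather than done by hand. The sketch below is hypothetical: the `(step, loss)` record format, the 5K checkpoint interval, and the sample loss values are illustrative assumptions, not the real metrics log.

```python
import math


def first_non_finite(records):
    """Return the first step whose loss is NaN or inf, or None if all are finite."""
    for step, loss in records:
        if not math.isfinite(loss):
            return step
    return None


def last_checkpoint_before(step: int, checkpoint_every: int) -> int:
    """Latest checkpoint step strictly before `step`."""
    return ((step - 1) // checkpoint_every) * checkpoint_every


if __name__ == "__main__":
    # Illustrative values only; not copied from the actual metrics file.
    records = [
        (175_000, 3.186),
        (176_000, 3.190),
        (177_000, float("nan")),
        (178_000, float("nan")),
    ]
    bad = first_non_finite(records)
    if bad is not None:
        good = last_checkpoint_before(bad, 5_000)
        print(f"first non-finite loss at step {bad}; resume from {good}")
```

Running a check like this inside the training loop and stopping on the first non-finite loss would avoid burning GPU time between the NaN and the manual kill.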

2026-04-04 04:10 UTC — Trial 26 COMPLETE. Trial 25 finishing.

Trial 26 (discard-ply-limit) COMPLETED at 200K steps:
- val_loss=3.147, acc=7.85%, top5=27.8%, legal=99.4%
- Best accuracy of all 3 ablations
Trial 25 (no-outcome) at step 196K — ~25 min remaining.
Started post-training for trial 26: checkpoint upload + probes on GPU 2.
Trial 24 probes still running on GPU 0.

Final Results (as trials complete)

Trial	Ablation	Steps	Best val_loss	Best acc	Status
24	mate-boost	175K (NaN@177K)	3.186	6.59%	done (NaN)
25	no-outcome	200K	3.089	6.83%	done
26	discard-ply-limit	200K	3.147	7.85%	done
—	baseline (pawn-base)	100K	~3.06	~6.90%	reference

2026-04-04 04:45 UTC — ALL TRAINING COMPLETE

Trial 25 (no-outcome) completed at 200K steps: val_loss=3.089, acc=6.83%
All GPUs idle. Uploading final checkpoints to HF.
Probes skipped (performance issues — runs on CPU, too slow on pod).
Drafting final report.

2026-04-03 12:10 UTC — Phase 2 check-in (~5.5h in, ~28%)

Trial	Ablation	Step	train_loss	acc	LR
24	mate-boost	55300	3.237	6.2%	8.80e-4
25	no-outcome	53800	3.122	6.8%	2.55e-4
26	discard-ply-limit	55200	3.214	7.0%	8.80e-4

Cruising. Loss improvement slowing (expected as cosine decay kicks in).
Relative rankings unchanged: T25 best loss, T26 best acc, T24 trailing slightly.

2026-04-03 14:10 UTC — Phase 2 check-in (~7.5h in, ~37%)

Trial	Ablation	Step	train_loss	acc	LR
24	mate-boost	73600	3.204	6.6%	7.73e-4
25	no-outcome	71600	3.094	7.0%	2.23e-4
26	discard-ply-limit	73500	3.192	7.5%	7.74e-4

Steady. Trial 26 acc now at 7.5%. All losses still slowly decreasing.
HF bucket synced (metrics + lab notes + transcript).

2026-04-03 17:10 UTC — Phase 2 HALFWAY (~10.5h in, ~50%)

Trial	Ablation	Step	train_loss	acc	LR
24	mate-boost	101600	3.206	6.5%	5.75e-4
25	no-outcome	98300	3.090	6.9%	1.69e-4
26	discard-ply-limit	101000	3.179	7.4%	5.80e-4

100K milestone reached. This is where baseline PAWN-Base stopped.
Baseline reference: PAWN-Base val_loss ~3.06 at 100K steps.
Trial 25 (no-outcome) at 3.09 loss — nearly matching baseline despite no outcome token.
Trial 26 (discard-ply-limit) best accuracy at 7.4%.
LR decaying steadily (cosine). Trials 24/26 at ~5.8e-4, trial 25 at ~1.7e-4.
ETA unchanged: ~2026-04-04 04:30 UTC. Cost so far: ~$26 (~11.3h × $2.31).
