Pretraining Ablations Lab Notes
Experiment: mate-boost, no-outcome, discard-ply-limit
Started: 2026-04-03
Pod: 3x RTX 6000 Ada (48GB each), $2.31/hr
Phase 1: LR Exploration (5K steps each, no_compile)
Batch 1: mate-boost (DONE)
| Trial | LR | val_loss | val_acc | top5 | ppl | legal% | Note |
|---|---|---|---|---|---|---|---|
| 15 | 1e-4 | 4.712 | 3.6% | 15.5% | 111.3 | 73.7% | Dominated, still in warmup |
| 16 | 3e-4 | 4.019 | 4.4% | 19.3% | 55.6 | 89.1% | Solid |
| 17 | 1e-3 | 3.629 | 5.2% | 21.6% | 37.7 | 93.3% | Winner |
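Sanity check on the ppl column: the logged values match exp(val_loss), assuming val_loss is the mean per-token cross-entropy in nats (an assumption; the notes do not state the unit). A minimal sketch, with `perplexity` and `batch1` as illustrative names:

```python
import math


def perplexity(val_loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss (assumed to be in nats)."""
    return math.exp(val_loss)


# (val_loss, logged ppl) pairs copied from the batch 1 table.
batch1 = {15: (4.712, 111.3), 16: (4.019, 55.6), 17: (3.629, 37.7)}

for trial, (loss, logged_ppl) in batch1.items():
    print(f"trial {trial}: exp({loss}) = {perplexity(loss):.2f} (logged {logged_ppl})")
```

Small differences in the last digit are expected because val_loss itself is rounded to three decimals.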
Batch 2: no-outcome (DONE)
| Trial | LR | Warmup | val_loss | val_acc | top5 | ppl | legal% | Note |
|---|---|---|---|---|---|---|---|---|
| 18 | 1e-4 | 5% | 4.585 | 4.3% | 18.4% | 98.0 | 76.6% | Dominated, still in warmup |
| 19 | 3e-4 | 0% | 3.444 | 5.8% | 24.3% | 31.3 | 94.2% | Winner (no-warmup) |
| 20 | 1e-3 | 5% | 3.529 | 5.6% | 24.3% | 34.1 | 94.5% | Close second, still warming |
Batch 3: discard-ply-limit (DONE)
| Trial | LR | val_loss | val_acc | top5 | ppl | legal% | Note |
|---|---|---|---|---|---|---|---|
| 21 | 1e-4 | 4.743 | 3.7% | 16.3% | 114.7 | 72.8% | Dominated, still in warmup |
| 22 | 3e-4 | 4.054 | 4.6% | 20.1% | 57.7 | 87.4% | Middle ground |
| 23 | 1e-3 | 3.617 | 5.4% | 22.6% | 37.2 | 93.2% | Winner |
Phase 2: Full Training (200K steps, torch.compile ON)
| Trial | Ablation | LR | Resumed From | GPU | Status |
|---|---|---|---|---|---|
| 24 | mate-boost | 1e-3 | trial 17 @ 5K | 0 | RUNNING |
| 25 | no-outcome | 3e-4 | trial 19 @ 5K | 1 | RUNNING |
| 26 | discard-ply-limit | 1e-3 | trial 23 @ 5K | 2 | RUNNING |
torch.compile will take 15-30 min to compile. After that, expect ~2.5 sps → ~22h for 195K steps.
ETA for completion: ~2026-04-04 04:30 UTC.
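The ~22h figure is just remaining steps divided by throughput. A rough sketch of that arithmetic; `eta_utc` is a hypothetical helper, and the 06:39 UTC launch time is taken from the log entry that records the Phase 2 launch, so it is only approximate:

```python
from datetime import datetime, timedelta, timezone


def eta_utc(start: datetime, steps_remaining: int, steps_per_sec: float,
            compile_minutes: float = 0.0) -> datetime:
    """Estimated finish time: compile overhead plus steps_remaining / steps_per_sec."""
    seconds = steps_remaining / steps_per_sec + compile_minutes * 60
    return start + timedelta(seconds=seconds)


# Assumed launch time (timestamp of the log entry that mentions the Phase 2 launch).
launch = datetime(2026, 4, 3, 6, 39, tzinfo=timezone.utc)

hours = 195_000 / 2.5 / 3600                      # pure training time in hours
finish = eta_utc(launch, 195_000, 2.5)            # ignoring compile time
finish_with_compile = eta_utc(launch, 195_000, 2.5, compile_minutes=15)

print(f"{hours:.2f} h, finish {finish:%Y-%m-%d %H:%M} UTC, "
      f"with 15 min compile {finish_with_compile:%H:%M} UTC")
```

With 15 minutes of compile time added, the estimate lands within a few minutes of the ~04:30 UTC ETA noted above.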
Known Issues
- pause_after_steps not working: Trials 15-17 continued past 5K steps. Had to kill manually. Will need to kill batch 2 manually too after 5K eval.
Log
2026-04-03 04:38 UTC
- Disabled MPS (was funneling all trials to GPU 0)
- Removed sdpa_math from configs (NVIDIA GPUs use flash attention)
- Launched batch 1: mate-boost x3 LRs
2026-04-03 05:22 UTC
- Batch 1 complete. lr=1e-3 dominates across all metrics.
- pause_after_steps bug: trials continued past 5K, had to kill at ~5200.
- Killed batch 1, launched batch 2: no-outcome x3 LRs.
- Trial 19 (lr=3e-4) uses warmup_frac=0 to test no-warmup per user request.
- ETA for batch 2 eval: ~05:57 UTC
2026-04-03 05:50 UTC
- New session picked up. Batch 2 (no-outcome) running: trials 18/19/20 at step ~4000.
- Trial 18 (lr=1e-4): step 4000, train_loss=4.876, still in warmup (lr=4e-5), val_loss@2.5K=5.455
- Trial 19 (lr=3e-4, no-warmup): step 3900, train_loss=3.514, lr=3e-4 (flat), val_loss@2.5K=3.710
- Trial 20 (lr=1e-3): step 3900, train_loss=3.700, still warming (lr=3.9e-4), val_loss@2.5K=4.102
- Set up 5-min cron to catch 5K eval, kill trials, launch batch 3.
- Cost so far: $3.36 (1h27m elapsed)
2026-04-03 05:57 UTC
- Batch 2 (no-outcome) complete at 5K steps. Results:
  - lr=1e-4: val_loss=4.585 (dominated, warmup too slow)
  - lr=3e-4 (no warmup): val_loss=3.444, acc=5.8% (WINNER)
  - lr=1e-3: val_loss=3.529 (close second, but still warming up)
- Killed all 3, launched batch 3 (discard-ply-limit): trials 21/22/23
- Note: discard-ply-limit discards ~60% of games. Watch step times for slowdown.
- ETA for batch 3 eval: ~06:30 UTC
2026-04-03 06:39 UTC
- Batch 3 (discard-ply-limit) complete at 5K steps. Results:
  - lr=1e-4: val_loss=4.743 (dominated)
  - lr=3e-4: val_loss=4.054 (middle)
  - lr=1e-3: val_loss=3.617, acc=5.4% (WINNER)
- Phase 1 COMPLETE. All 3 ablations: lr=1e-3 wins for mate-boost and discard-ply-limit, lr=3e-4 wins for no-outcome.
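- Interesting: no-outcome at 3e-4 (val_loss=3.444) beats both other ablations at 1e-3 at 5K steps. Caveat: trial 19 ran with warmup_frac=0, so it trained at full LR from step 0 while the 1e-3 runs were still warming up; the comparison is confounded by warmup.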
- Launched Phase 2: trials 24/25/26 resuming from 5K checkpoints, torch.compile ON.
- Switched cron from 5-min to hourly (long runs, ~22h remaining).
- Note: lr=1e-4 was dominated in every ablation. With 5% warmup on a 200K-step schedule (10K warmup steps), a 1e-4 run only reaches lr=5e-5 by step 5K. Consider shorter warmup for future experiments.
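To make the warmup arithmetic concrete, here is a minimal sketch of a linear-warmup plus cosine-decay schedule. The function name `lr_at` and all of its parameters are hypothetical, not the trainer's actual API. The linear warmup matches the values logged during Phase 1 (lr=4e-5 at step 4000 for peak 1e-4). The cosine decay to a floor of 10% of peak is an assumption; it was chosen because it is consistent with LR values logged during Phase 2 (for example 9.95e-4 at step 18700 and 1.39e-4 at step 174500 for peak 1e-3).

```python
import math


def lr_at(step: int, peak_lr: float, total_steps: int = 200_000,
          warmup_frac: float = 0.05, min_lr_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr_ratio * peak_lr.

    min_lr_ratio=0.1 is an assumption inferred from logged values, not a
    documented setting.
    """
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


# LR actually reached at the 5K-step eval for each Phase 1 configuration.
for peak, frac in [(1e-4, 0.05), (3e-4, 0.0), (1e-3, 0.05)]:
    print(f"peak={peak:.0e} warmup_frac={frac}: lr@5K = {lr_at(5_000, peak, warmup_frac=frac):.2e}")
```

Under these assumptions, the 1e-3 runs were evaluated at roughly half their peak LR, while the no-warmup 3e-4 run had been at essentially full LR for all 5K steps.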
Phase 1 Winners
| Ablation | Winner Trial | LR | val_loss@5K | acc@5K |
|---|---|---|---|---|
| mate-boost | 17 | 1e-3 | 3.629 | 5.2% |
| no-outcome | 19 | 3e-4 | 3.444 | 5.8% |
| discard-ply-limit | 23 | 1e-3 | 3.617 | 5.4% |
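The winners were picked by lowest val_loss at 5K steps within each ablation. A small sketch of that selection over the Phase 1 numbers (the `phase1` dict layout is just an illustration, not the format of the actual metrics files):

```python
# val_loss at 5K steps per trial, copied from the three Phase 1 tables.
phase1 = {
    "mate-boost": {15: 4.712, 16: 4.019, 17: 3.629},
    "no-outcome": {18: 4.585, 19: 3.444, 20: 3.529},
    "discard-ply-limit": {21: 4.743, 22: 4.054, 23: 3.617},
}

# For each ablation, pick the trial id with the smallest val_loss.
winners = {ablation: min(trials, key=trials.get) for ablation, trials in phase1.items()}

for ablation, trial in winners.items():
    print(f"{ablation}: trial {trial} (val_loss={phase1[ablation][trial]})")
```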
2026-04-03 07:10 UTC — Phase 2 check-in (30 min in)
All 3 trials stable, approaching 10K eval. Grad norms low (<1.2), no anomalies.
| Trial | Ablation | Step | train_loss | acc | LR | g/s |
|---|---|---|---|---|---|---|
| 24 | mate-boost | 9600 | 3.407 | 5.8% | 9.6e-4 | 640 |
| 25 | no-outcome | 9400 | 3.276 | 6.3% | 3.0e-4 | 634 |
| 26 | discard-ply-limit | 9500 | 3.399 | 6.1% | 9.5e-4 | 652 |
- Trial 25 (no-outcome) leading — lowest loss, highest acc. Interesting given it uses 3x lower LR.
- Trials 24 & 26 still in warmup (lr=9.5e-4, peak 1e-3 at step 10K). Should accelerate after warmup.
- Cost: $6.30 (2h45m elapsed). Budget on track.
- Fixed pause_after_steps and lab_resume bugs in code.
- HF bucket synced (metrics only, checkpoints excluded).
2026-04-03 08:10 UTC — Phase 2 check-in (~1.5h in, ~9% done)
| Trial | Ablation | Step | train_loss | acc | LR | grad_norm |
|---|---|---|---|---|---|---|
| 24 | mate-boost | 18700 | 3.283 | 6.2% | 9.95e-4 | 0.22 |
| 25 | no-outcome | 18300 | 3.160 | 6.7% | 2.94e-4 | 0.53 |
| 26 | discard-ply-limit | 18700 | 3.254 | 6.9% | 9.95e-4 | 0.20 |
- Trials 24/26 are past warmup (peaked at 1e-3); all three trials are now in cosine decay. Very stable.
- Trial 25 (no-outcome) still lowest loss. Trial 26 (discard-ply-limit) best accuracy.
- ~2.5 sps sustained. ETA unchanged: ~2026-04-04 04:30 UTC.
2026-04-03 09:10 UTC — Phase 2 check-in (~2.5h in, ~14%)
| Trial | Ablation | Step | train_loss | acc | LR | grad_norm |
|---|---|---|---|---|---|---|
| 24 | mate-boost | 27800 | 3.242 | 6.4% | 9.81e-4 | 0.13 |
| 25 | no-outcome | 27200 | 3.123 | 7.1% | 2.88e-4 | 0.32 |
| 26 | discard-ply-limit | 27800 | 3.248 | 6.8% | 9.81e-4 | 0.14 |
- Steady progress. Trial 25 still leading on loss; trial 26 competitive on accuracy.
- Grad norms extremely low (0.12-0.32): stable regime, no sign of divergence so far.
2026-04-03 10:10 UTC — Phase 2 check-in (~3.5h in, ~18%)
| Trial | Ablation | Step | train_loss | acc | LR |
|---|---|---|---|---|---|
| 24 | mate-boost | 37000 | 3.246 | 6.5% | 9.56e-4 |
| 25 | no-outcome | 36100 | 3.133 | 6.7% | 2.79e-4 |
| 26 | discard-ply-limit | 36900 | 3.191 | 7.4% | 9.56e-4 |
- Trial 26 (discard-ply-limit) now best accuracy at 7.4%, overtaking trial 25.
- Trial 25 still lowest loss but accuracy plateauing. Different distribution?
- HF bucket synced (metrics + lab notes).
2026-04-03 11:10 UTC — Phase 2 check-in (~4.5h in, ~23%)
| Trial | Ablation | Step | train_loss | acc | LR |
|---|---|---|---|---|---|
| 24 | mate-boost | 46100 | 3.213 | 6.5% | 9.22e-4 |
| 25 | no-outcome | 44900 | 3.115 | 6.9% | 2.68e-4 |
| 26 | discard-ply-limit | 46000 | 3.215 | 7.1% | 9.23e-4 |
- Stable. Trial 26 still best accuracy, trial 25 still lowest loss.
- All losses still gradually decreasing — no signs of plateau yet.
2026-04-03 20:10 UTC — Phase 2 check-in (~13.5h in, ~65%)
| Trial | Ablation | Step | train_loss | acc | LR |
|---|---|---|---|---|---|
| 24 | mate-boost | 129300 | 3.200 | 6.5% | 3.74e-4 |
| 25 | no-outcome | 125000 | 3.078 | 6.9% | 1.13e-4 |
| 26 | discard-ply-limit | 128500 | 3.193 | 7.6% | 3.80e-4 |
- Trial 25 val eval at 125K: val_loss=3.097, acc=6.8%, top5=27.8%, legal=99.6%
- Trial 26 train acc now at 7.6-7.8% — clearly best accuracy.
- Accuracy ceiling computation running concurrently on CPUs (~65% done).
- ETA for training: ~04:30 UTC. ETA for ceiling: ~22:00 UTC.
2026-04-04 01:10 UTC — Phase 2 check-in (~18.5h in, ~87%)
| Trial | Ablation | Step | train_loss | acc | LR |
|---|---|---|---|---|---|
| 24 | mate-boost | 174500 | 3.170 | 6.7% | 1.39e-4 |
| 25 | no-outcome | 169800 | 3.082 | 7.2% | 4.49e-5 |
| 26 | discard-ply-limit | 173600 | 3.171 | 7.6% | 1.42e-4 |
- ~26K steps remaining, ETA ~04:00 UTC.
- Ceiling computation DONE (1024 rollouts, 88K positions, 5.7h):
  - Unconditional: 6.43% [6.36, 6.50]
  - MC corrected: 6.68% [6.60, 6.75]
  - MC naive: 6.89% [6.81, 6.96]
  - Bracket: 0.21pp (was 0.66pp at 128 rollouts)
- PAWN-Base (6.90%) at 103% of corrected ceiling — essentially at theoretical max.
- Results saved to /workspace/data/theoretical_ceiling_1024.json and synced to HF.
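The bracket and the 103% figure can be reproduced from the point estimates above. A minimal sketch, assuming the bracket is the gap between the naive and corrected MC estimates (that reading matches the 0.21pp value but is not spelled out in the notes); variable names are illustrative:

```python
# Point estimates from the 1024-rollout ceiling computation, in percent.
unconditional = 6.43
mc_corrected = 6.68
mc_naive = 6.89
pawn_base_acc = 6.90  # PAWN-Base accuracy, in percent

# Assumed definition: bracket = naive estimate minus corrected estimate.
bracket_pp = mc_naive - mc_corrected

# PAWN-Base accuracy relative to the corrected ceiling.
ratio_to_ceiling = pawn_base_acc / mc_corrected

print(f"bracket = {bracket_pp:.2f}pp, PAWN-Base at {ratio_to_ceiling:.1%} of corrected ceiling")
```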
2026-04-04 02:10 UTC — Trial 24 NaN! Probes started.
- Trial 24 (mate-boost) hit NaN between step 175K-180K. Killed at 184K.
- Last good checkpoint: step 175K, val_loss=3.1860, acc=6.59%
- Loss was flat (3.189→3.186 over last 20K steps) — 175K is effectively final.
- Possible cause: bfloat16 AMP instability. Mate-boost games are shorter (~134 ply), concentrating loss on fewer tokens per batch.
- Trials 25/26 healthy at step ~179K/183K.
- Started post-training for trial 24:
  - Uploading 175K checkpoint to HF bucket (background)
  - Running linear probes on GPU 0 (background)
- GPU 0 now doing probes. GPUs 1/2 still training.
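Trial 24 ran for several thousand steps after the NaN before it was killed. One way to catch this earlier is a per-step finite-loss check that remembers the last good step. This is only a sketch: `FiniteLossGuard` is a hypothetical helper, not part of the training code, and the step/loss pairs in the usage below are illustrative rather than copied from the logs.

```python
import math
from typing import Optional


class FiniteLossGuard:
    """Remembers the last step with a finite loss and the first non-finite step."""

    def __init__(self) -> None:
        self.last_good_step: Optional[int] = None
        self.first_bad_step: Optional[int] = None

    def check(self, step: int, loss: float) -> bool:
        """Return True if training can continue, False once the loss is NaN/inf."""
        if math.isfinite(loss):
            # Only advance the "last good" marker before any failure was seen.
            if self.first_bad_step is None:
                self.last_good_step = step
            return True
        if self.first_bad_step is None:
            self.first_bad_step = step
        return False


guard = FiniteLossGuard()
# Illustrative history: finite losses, then NaN at a made-up step.
history = [(155_000, 3.189), (175_000, 3.186), (177_000, float("nan")), (178_000, float("nan"))]
for step, loss in history:
    if not guard.check(step, loss):
        print(f"non-finite loss at step {step}; last good step {guard.last_good_step}")
        break
```

In a real loop the `False` branch would stop the run (or reload the last good checkpoint) instead of just printing.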
2026-04-04 04:10 UTC — Trial 26 COMPLETE. Trial 25 finishing.
- Trial 26 (discard-ply-limit) COMPLETED at 200K steps:
  - val_loss=3.147, acc=7.85%, top5=27.8%, legal=99.4%
  - Best accuracy of all 3 ablations
- Trial 25 (no-outcome) at step 196K — ~25 min remaining.
- Started post-training for trial 26: checkpoint upload + probes on GPU 2.
- Trial 24 probes still running on GPU 0.
Final Results (as trials complete)
| Trial | Ablation | Steps | Best val_loss | Best acc | Status |
|---|---|---|---|---|---|
| 24 | mate-boost | 175K (NaN@177K) | 3.186 | 6.59% | done (NaN) |
| 25 | no-outcome | 200K | 3.089 | 6.83% | done |
| 26 | discard-ply-limit | 200K | 3.147 | 7.85% | done |
| - | baseline (pawn-base) | 100K | ~3.06 | ~6.90% | reference |
2026-04-04 04:45 UTC — ALL TRAINING COMPLETE
- Trial 25 (no-outcome) completed at 200K steps: val_loss=3.089, acc=6.83%
- All GPUs idle. Uploading final checkpoints to HF.
- Probes skipped (performance issues — runs on CPU, too slow on pod).
- Drafting final report.
2026-04-03 12:10 UTC — Phase 2 check-in (~5.5h in, ~28%)
| Trial | Ablation | Step | train_loss | acc | LR |
|---|---|---|---|---|---|
| 24 | mate-boost | 55300 | 3.237 | 6.2% | 8.80e-4 |
| 25 | no-outcome | 53800 | 3.122 | 6.8% | 2.55e-4 |
| 26 | discard-ply-limit | 55200 | 3.214 | 7.0% | 8.80e-4 |
- Cruising. Loss improvement slowing (expected as cosine decay kicks in).
- Relative rankings unchanged: T25 best loss, T26 best acc, T24 trailing slightly.
2026-04-03 14:10 UTC — Phase 2 check-in (~7.5h in, ~37%)
| Trial | Ablation | Step | train_loss | acc | LR |
|---|---|---|---|---|---|
| 24 | mate-boost | 73600 | 3.204 | 6.6% | 7.73e-4 |
| 25 | no-outcome | 71600 | 3.094 | 7.0% | 2.23e-4 |
| 26 | discard-ply-limit | 73500 | 3.192 | 7.5% | 7.74e-4 |
- Steady. Trial 26 acc now at 7.5%. All losses still slowly decreasing.
- HF bucket synced (metrics + lab notes + transcript).
2026-04-03 17:10 UTC — Phase 2 HALFWAY (~10.5h in, ~50%)
| Trial | Ablation | Step | train_loss | acc | LR |
|---|---|---|---|---|---|
| 24 | mate-boost | 101600 | 3.206 | 6.5% | 5.75e-4 |
| 25 | no-outcome | 98300 | 3.090 | 6.9% | 1.69e-4 |
| 26 | discard-ply-limit | 101000 | 3.179 | 7.4% | 5.80e-4 |
- 100K milestone reached. This is where baseline PAWN-Base stopped.
- Baseline reference: PAWN-Base val_loss ~3.06 at 100K steps.
- Trial 25 (no-outcome) at 3.09 loss — nearly matching baseline despite no outcome token.
- Trial 26 (discard-ply-limit) best accuracy at 7.4%.
- LR decaying steadily (cosine). Trials 24/26 at ~5.8e-4, trial 25 at ~1.7e-4.
- ETA unchanged: 2026-04-04 04:30 UTC. Cost so far: ~$26 (11.3h × $2.31).