Pretraining Ablation Report: mate-boost, no-outcome, discard-ply-limit
Date: 2026-04-04
Pod: 3x RTX 6000 Ada (48GB), RunPod
Total cost: $56 (24.3h x $2.31/hr)
Baseline: PAWN-Base (thomas-schweich/pawn-base), 35.8M params, 100K steps
Summary
Three ablations of PAWN-Base pretraining were run to 200K steps (2x baseline), each modifying one aspect of the random game generation pipeline:
| Ablation | Config | Steps | Best val_loss | Best acc | top5 | ppl | legal% |
|---|---|---|---|---|---|---|---|
| baseline | standard | 100K | ~3.06 | 6.90% | — | — | — |
| mate-boost | `mate_boost: 1.0` | 175K* | 3.186 | 6.59% | 25.3% | 24.2 | 99.3% |
| no-outcome | `no_outcome_token: true` | 200K | 3.089 | 6.83% | 27.6% | 22.0 | 99.6% |
| discard-ply-limit | `discard_ply_limit: true` | 200K | 3.147 | 7.85% | 27.8% | 23.3 | 99.4% |
*mate-boost hit NaN divergence between steps 175K and 180K; the last good checkpoint is at 175K.
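The ppl column is consistent with perplexity as the exponentiated cross-entropy; a quick sanity check with the values from the table above:

```python
import math

# Consistency check: the reported ppl equals exp(best val_loss) for each
# ablation (perplexity is the exponentiated cross-entropy loss).
for name, loss, reported_ppl in [
    ("mate-boost", 3.186, 24.2),
    ("no-outcome", 3.089, 22.0),
    ("discard-ply-limit", 3.147, 23.3),
]:
    print(f"{name}: exp({loss}) = {math.exp(loss):.1f} (reported {reported_ppl})")
```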
Key Findings
1. discard-ply-limit achieves highest accuracy (7.85%)
Training only on naturally-ended games (discarding ply-limit truncations) produced the highest top-1 accuracy at 7.85% — 14% above baseline and above the unconditional theoretical ceiling (6.43%). This model clearly exploits the outcome token more effectively than the baseline.
Against the updated theoretical ceiling (1024 rollouts):
- vs unconditional (6.43%): 122% — well above random legal move prediction
- vs MC corrected (6.68%): 117% — exceeds the corrected Bayes-optimal estimate
- vs MC naive (6.89%): 114% — exceeds even the upward-biased estimate
This is surprising — the model exceeds our Bayes-optimal ceiling estimates. Possible explanations:
- The model exploits sequential structure beyond what the outcome token alone provides
- The theoretical ceiling estimates may be conservative (the corrected estimator is known to be biased downward)
- 200K steps of training on cleaner data (no truncated games) may produce qualitatively different representations
2. no-outcome achieves lowest loss (3.089) but lower accuracy
Stripping the outcome token forces the model to make unconditional predictions. The lower loss makes sense — the model doesn't need to "hedge" based on outcome information, and can simply model P(move | history). But accuracy is lower (6.83% vs 7.85% for discard-ply-limit), confirming the outcome token provides useful signal for move prediction.
Notably, the no-outcome model at 200K steps (loss 3.089) comes within ~0.03 of the baseline at 100K steps (loss ~3.06) despite having strictly less information (no outcome token). The extra 100K steps of training nearly compensate.
3. mate-boost underperforms and is unstable
Always taking mate-in-1 (producing shorter, more decisive games) led to the worst results:
- Highest loss (3.186) and lowest accuracy (6.59%) among the three ablations
- Hit NaN divergence at ~177K steps (the only ablation to do so)
- Training loss was essentially flat from ~100K onward
The shorter games (~134 avg ply vs ~238) may reduce sequence diversity per batch. With fewer distinct positions per game, the model sees less variety, potentially explaining the lower accuracy. The NaN may be related to loss concentration on fewer tokens per sequence.
4. Loss vs accuracy diverge
The no-outcome model has the best loss but discard-ply-limit has the best accuracy. These metrics measure different things:
- Loss rewards well-calibrated probability distributions over all legal moves
- Accuracy rewards putting the most mass on the single correct move
The discard-ply-limit model may be "spikier" in its predictions (confident on the right move, less calibrated overall), while the no-outcome model distributes probability more smoothly.
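The loss/accuracy divergence can be illustrated with a toy example (hypothetical numbers, not drawn from the actual evaluation): a spiky model can win on top-1 accuracy while a smoother, better-calibrated model wins on cross-entropy.

```python
import math

# Toy illustration: two models predicting over 4 legal moves across 3
# positions; the correct move is always index 0.
CORRECT = 0

# "Spiky" model: very confident; its top pick is right 2/3 of the time,
# but when it is wrong it leaves almost no mass on the correct move.
spiky = [
    [0.85, 0.05, 0.05, 0.05],
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
]

# "Smooth" model: hedged probabilities; its top pick is right only 1/3 of
# the time, but it never starves the correct move of probability mass.
smooth = [
    [0.38, 0.40, 0.12, 0.10],
    [0.42, 0.36, 0.12, 0.10],
    [0.38, 0.40, 0.12, 0.10],
]

def accuracy(preds):
    return sum(p.index(max(p)) == CORRECT for p in preds) / len(preds)

def cross_entropy(preds):
    return sum(-math.log(p[CORRECT]) for p in preds) / len(preds)

for name, preds in [("spiky", spiky), ("smooth", smooth)]:
    print(f"{name}: acc={accuracy(preds):.1%}  loss={cross_entropy(preds):.3f}")
```

The spiky model gets higher accuracy but higher (worse) loss, because a single confident miss contributes a large -log term.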
5. All models achieve near-perfect legal move rates (>99%)
All three ablations produce legal moves >99.3% of the time at 200K steps — the model has thoroughly learned the rules of chess regardless of data generation method.
Theoretical Accuracy Ceiling (updated)
Ran with 1024 rollouts/move (8x previous), 5000 games, 7.5% sample rate (88,412 positions):
| Metric | Value | 95% CI |
|---|---|---|
| Unconditional (E[1/N_legal]) | 6.43% | [6.36, 6.50] |
| Naive conditional (1-ply filter) | 6.43% | [6.36, 6.50] |
| MC conditional (naive, biased up) | 6.89% | [6.81, 6.96] |
| MC conditional (corrected, biased down) | 6.68% | [6.60, 6.75] |
| Bias bracket | 0.21pp | — |
The bracket narrowed from 0.66pp (128 rollouts) to 0.21pp — 3x tighter. The corrected estimate barely moved (6.67% -> 6.68%), confirming it was already well-estimated. The naive estimate dropped from 7.34% to 6.89%, revealing significant upward bias in the original.
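The winner's-curse bias in the naive estimator, and one standard correction (sample splitting: select the top move on one half of the rollouts, evaluate its frequency on the other half), can be demonstrated in simulation. This is a sketch of the general technique, not necessarily the correction `compute_theoretical_ceiling.py` implements.

```python
import random
random.seed(0)

# Hypothetical per-position move distribution; the true Bayes-optimal
# accuracy for this position is max(p_true) = 0.30.
p_true = [0.30, 0.25, 0.25, 0.20]
moves = range(len(p_true))

def rollout_freqs(k):
    """Empirical move frequencies from k Monte Carlo rollouts."""
    sample = random.choices(moves, weights=p_true, k=k)
    return [sample.count(m) / k for m in moves]

trials = 2000
naive = corrected = 0.0
for _ in range(trials):
    # Naive: max of empirical frequencies. Selecting and evaluating on the
    # same sample inflates the estimate (winner's curse).
    naive += max(rollout_freqs(64))
    # Sample splitting: pick the argmax on one half, score it on an
    # independent half; no selection bias, slightly pessimistic overall.
    half_a, half_b = rollout_freqs(32), rollout_freqs(32)
    corrected += half_b[half_a.index(max(half_a))]
naive /= trials
corrected /= trials
print(f"true ceiling: 0.300  naive: {naive:.3f}  corrected: {corrected:.3f}")
```

The naive average lands above 0.30 and the split estimate below it, bracketing the true value just as the report's naive/corrected pair does.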
The discard-ply-limit model at 7.85% substantially exceeds all ceiling estimates. This suggests either:
- The model exploits sequential correlations beyond what outcome conditioning captures
- The discard-ply-limit data distribution has different statistical properties than the standard distribution used for ceiling computation
- Both — the cleaner data and the model's sequence modeling reinforce each other
Learning Rate Selection (Phase 1)
Each ablation was explored at 3 learning rates (1e-4, 3e-4, 1e-3) for 5K steps:
| Ablation | LR | val_loss@5K | Selected |
|---|---|---|---|
| mate-boost | 1e-4 / 3e-4 / 1e-3 | 4.71 / 4.02 / 3.63 | 1e-3 |
| no-outcome | 1e-4 / 3e-4 / 1e-3 | 4.59 / 3.44 / 3.53 | 3e-4 |
| discard-ply-limit | 1e-4 / 3e-4 / 1e-3 | 4.74 / 4.05 / 3.62 | 1e-3 |
Note: 1e-4 was dominated across all ablations — 5% warmup over 200K steps means the LR only reaches 5e-5 by step 5K, barely out of the warmup phase.
The no-outcome ablation preferred a lower LR (3e-4), possibly because the shorter effective sequence (no outcome token) changes the loss landscape. The no-outcome trial also used no warmup (warmup_frac=0), which worked well for the lower LR.
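The warmup arithmetic behind the note above can be made explicit. A minimal linear-warmup sketch (the schedule beyond warmup is assumed constant here for simplicity):

```python
def warmup_lr(step, peak_lr, total_steps=200_000, warmup_frac=0.05):
    """Linear warmup: LR ramps from 0 to peak_lr over warmup_frac of the run."""
    warmup_steps = int(total_steps * warmup_frac)  # 10,000 steps here
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

# At the 5K-step Phase 1 cutoff, a 1e-4 trial is only halfway through
# warmup, so its effective LR is ~5e-5.
print(warmup_lr(5_000, 1e-4))   # ~5e-05
print(warmup_lr(5_000, 1e-3))   # ~5e-04
```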
Anomalies
- mate-boost NaN at ~177K steps: Only ablation to diverge. Using bfloat16 AMP. Cause unclear — may relate to shorter game lengths concentrating loss on fewer tokens. The 175K checkpoint is usable as the loss had plateaued.
- pause_after_steps bug: The pretraining trainer didn't implement this feature (only adapter training did). Fixed during the session by adding to TrainingConfig and trainer.py.
- lab_resume checkpoint bug: The MCP server looked for `best`/`final` dirs, but pretraining uses `step_XXXXXXXX` naming. Fixed to fall back to the highest step dir.
- compute_theoretical_ceiling.py bugs: (1) The saved JSON recorded the last batch's n_positions instead of the cumulative total. (2) Dead code `getattr(args, 'batch_games')`. Neither affects results.
Artifacts
All stored in hf://buckets/thomas-schweich/pretraining-ablations:
| Path | Contents |
|---|---|
| `logs/trial_00{24,25,26}/` | metrics.jsonl, configs |
| `checkpoints/trial_0024_mate-boost/step_00175000/` | Best mate-boost checkpoint |
| `checkpoints/trial_0024_mate-boost/step_00100000/` | 100K checkpoint |
| `checkpoints/trial_0025_no-outcome/step_00200000/` | Final no-outcome checkpoint |
| `checkpoints/trial_0025_no-outcome/step_00100000/` | 100K checkpoint |
| `checkpoints/trial_0026_discard-ply-limit/step_00200000/` | Final discard-ply-limit checkpoint |
| `checkpoints/trial_0026_discard-ply-limit/step_00100000/` | 100K checkpoint |
| `data/theoretical_ceiling_1024.json` | Ceiling computation results |
| `lab-notes.md` | Detailed experiment log |
| `chat_transcripts/` | Session transcript |
Next Steps
- Run linear probes on all three final checkpoints to compare internal representations (piece type, check status, castling, material, game phase). The probes were prepared but skipped due to performance issues — the probe evaluation script runs forward passes inefficiently on CPU. Consider keeping hidden states on GPU during probe training.
- Per-ply accuracy evaluation with `scripts/eval_accuracy.py` to understand where each ablation helps/hurts (opening vs middlegame vs endgame).
- Compute a ceiling for the discard-ply-limit distribution — the current ceiling assumes standard random games. The discard-ply-limit distribution is different (no truncated games), so its ceiling may be higher, which could explain the model exceeding our estimates.
- Investigate mate-boost NaN — check gradient norms in the metrics JSONL before the divergence point. Consider rerunning with float32 accumulation or lower LR.
- Adapter finetuning on Lichess data using these ablated pretrained weights as backbones — the key downstream question.
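For the mate-boost NaN investigation, a gradient-norm scan over the metrics JSONL could look like the sketch below. The field names (`step`, `grad_norm`) and the synthetic log lines are assumptions; match them to the actual `metrics.jsonl` schema.

```python
import io
import json

# Synthetic stand-in for the real metrics.jsonl near the divergence point.
log = io.StringIO("\n".join(json.dumps(r) for r in [
    {"step": 176000, "loss": 3.19, "grad_norm": 0.8},
    {"step": 176500, "loss": 3.18, "grad_norm": 7.5},
    {"step": 177000, "loss": float("nan"), "grad_norm": float("nan")},
]))

DIVERGENCE_STEP = 177_000
SPIKE_THRESHOLD = 5.0  # arbitrary cutoff for this sketch

# Collect steps with elevated gradient norms before the NaN appeared.
spikes = []
for line in log:
    rec = json.loads(line)
    if rec["step"] < DIVERGENCE_STEP and rec["grad_norm"] > SPIKE_THRESHOLD:
        spikes.append((rec["step"], rec["grad_norm"]))

print(spikes)
```

A spike shortly before the divergence would support lowering the LR or tightening gradient clipping on a rerun.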