Pretraining Ablation Report: mate-boost, no-outcome, discard-ply-limit
Date: 2026-04-04
Pod: 3x RTX 6000 Ada (48GB), RunPod
Total cost: $56 (24.3h x $2.31/hr)
Baseline: PAWN-Base (thomas-schweich/pawn-base), 35.8M params, 100K steps
Summary
Three ablations of PAWN-Base pretraining were run to 200K steps (2x baseline), each modifying one aspect of the random game generation pipeline:
| Ablation | Config | Steps | Best val_loss | Best acc | top5 | ppl | legal% |
|---|---|---|---|---|---|---|---|
| baseline | standard | 100K | ~3.06 | 6.90% | — | — | — |
| mate-boost | `mate_boost: 1.0` | 175K* | 3.186 | 6.59% | 25.3% | 24.2 | 99.3% |
| no-outcome | `no_outcome_token: true` | 200K | 3.089 | 6.83% | 27.6% | 22.0 | 99.6% |
| discard-ply-limit | `discard_ply_limit: true` | 200K | 3.147 | 7.85% | 27.8% | 23.3 | 99.4% |
*mate-boost hit NaN divergence between steps 175K and 180K; the last good checkpoint is at 175K.
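The ppl column is consistent with perplexity as the exponentiated cross-entropy; a quick sanity check with the values from the table above:

```python
import math

# Consistency check: the reported ppl equals exp(best val_loss) for each
# ablation (perplexity is the exponentiated cross-entropy loss).
for name, loss, reported_ppl in [
    ("mate-boost", 3.186, 24.2),
    ("no-outcome", 3.089, 22.0),
    ("discard-ply-limit", 3.147, 23.3),
]:
    print(f"{name}: exp({loss}) = {math.exp(loss):.1f} (reported {reported_ppl})")
```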
Key Findings
1. discard-ply-limit achieves highest accuracy (7.85%)
Training only on naturally-ended games (discarding ply-limit truncations) produced the highest top-1 accuracy at 7.85% — 14% above baseline and above the unconditional theoretical ceiling (6.43%). This model clearly exploits the outcome token more effectively than the baseline.
Against the updated theoretical ceiling (1024 rollouts):
- vs unconditional (6.43%): 122% — well above random legal move prediction
- vs MC corrected (6.68%): 117% — exceeds the corrected Bayes-optimal estimate
- vs MC naive (6.89%): 114% — exceeds even the upward-biased estimate
This is surprising — the model exceeds our Bayes-optimal ceiling estimates. Possible explanations:
- The model exploits sequential structure beyond what the outcome token alone provides
- The theoretical ceiling estimates may be conservative (the corrected estimator is known to be biased downward)
- 200K steps of training on cleaner data (no truncated games) may produce qualitatively different representations
2. no-outcome achieves lowest loss (3.089) but lower accuracy
Stripping the outcome token forces the model to make unconditional predictions. The lower loss makes sense — the model doesn't need to "hedge" based on outcome information, and can simply model P(move | history). But accuracy is lower (6.83% vs 7.85% for discard-ply-limit), confirming the outcome token provides useful signal for move prediction.
Notably, the no-outcome model at 200K steps (loss 3.089) comes within ~0.03 of the baseline at 100K steps (loss ~3.06) despite having strictly less information (no outcome token). The extra 100K steps of training nearly compensate.
3. mate-boost underperforms and is unstable
Always taking mate-in-1 (producing shorter, more decisive games) led to the worst results:
- Highest loss (3.186) and lowest accuracy (6.59%) among the three ablations
- Hit NaN divergence at ~177K steps (the only ablation to do so)
- Training loss was essentially flat from ~100K onward
The shorter games (~134 avg ply vs ~238) may reduce sequence diversity per batch. With fewer distinct positions per game, the model sees less variety, potentially explaining the lower accuracy. The NaN may be related to loss concentration on fewer tokens per sequence.
4. Loss vs accuracy diverge
The no-outcome model has the best loss but discard-ply-limit has the best accuracy. These metrics measure different things:
- Loss rewards well-calibrated probability distributions over all legal moves
- Accuracy rewards putting the most mass on the single correct move
The discard-ply-limit model may be "spikier" in its predictions (confident on the right move, less calibrated overall), while the no-outcome model distributes probability more smoothly.
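The loss/accuracy divergence can be illustrated with a toy example (hypothetical numbers, not drawn from the actual evaluation): a spiky model can win on top-1 accuracy while a smoother, better-calibrated model wins on cross-entropy.

```python
import math

# Toy illustration: two models predicting over 4 legal moves across 3
# positions; the correct move is always index 0.
CORRECT = 0

# "Spiky" model: very confident; its top pick is right 2/3 of the time,
# but when it is wrong it leaves almost no mass on the correct move.
spiky = [
    [0.85, 0.05, 0.05, 0.05],
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
]

# "Smooth" model: hedged probabilities; its top pick is right only 1/3 of
# the time, but it never starves the correct move of probability mass.
smooth = [
    [0.38, 0.40, 0.12, 0.10],
    [0.42, 0.36, 0.12, 0.10],
    [0.38, 0.40, 0.12, 0.10],
]

def accuracy(preds):
    return sum(p.index(max(p)) == CORRECT for p in preds) / len(preds)

def cross_entropy(preds):
    return sum(-math.log(p[CORRECT]) for p in preds) / len(preds)

for name, preds in [("spiky", spiky), ("smooth", smooth)]:
    print(f"{name}: acc={accuracy(preds):.1%}  loss={cross_entropy(preds):.3f}")
```

The spiky model gets higher accuracy but higher (worse) loss, because a single confident miss contributes a large -log term.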
5. All models achieve near-perfect legal move rates (>99%)
All three ablations produce legal moves >99.3% of the time at 200K steps — the model has thoroughly learned the rules of chess regardless of data generation method.
Theoretical Accuracy Ceiling (updated)
Ran with 1024 rollouts/move (8x previous), 5000 games, 7.5% sample rate (88,412 positions):
| Metric | Value | 95% CI |
|---|---|---|
| Unconditional (E[1/N_legal]) | 6.43% | [6.36, 6.50] |
| Naive conditional (1-ply filter) | 6.43% | [6.36, 6.50] |
| MC conditional (naive, biased up) | 6.89% | [6.81, 6.96] |
| MC conditional (corrected, biased down) | 6.68% | [6.60, 6.75] |
| Bias bracket | 0.21pp | — |
The bracket narrowed from 0.66pp (128 rollouts) to 0.21pp — 3x tighter. The corrected estimate barely moved (6.67% -> 6.68%), confirming it was already well-estimated. The naive estimate dropped from 7.34% to 6.89%, revealing significant upward bias in the original.
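The winner's-curse bias in the naive estimator, and one standard correction (sample splitting: select the top move on one half of the rollouts, evaluate its frequency on the other half), can be demonstrated in simulation. This is a sketch of the general technique, not necessarily the correction `compute_theoretical_ceiling.py` implements.

```python
import random
random.seed(0)

# Hypothetical per-position move distribution; the true Bayes-optimal
# accuracy for this position is max(p_true) = 0.30.
p_true = [0.30, 0.25, 0.25, 0.20]
moves = range(len(p_true))

def rollout_freqs(k):
    """Empirical move frequencies from k Monte Carlo rollouts."""
    sample = random.choices(moves, weights=p_true, k=k)
    return [sample.count(m) / k for m in moves]

trials = 2000
naive = corrected = 0.0
for _ in range(trials):
    # Naive: max of empirical frequencies. Selecting and evaluating on the
    # same sample inflates the estimate (winner's curse).
    naive += max(rollout_freqs(64))
    # Sample splitting: pick the argmax on one half, score it on an
    # independent half; no selection bias, slightly pessimistic overall.
    half_a, half_b = rollout_freqs(32), rollout_freqs(32)
    corrected += half_b[half_a.index(max(half_a))]
naive /= trials
corrected /= trials
print(f"true ceiling: 0.300  naive: {naive:.3f}  corrected: {corrected:.3f}")
```

The naive average lands above 0.30 and the split estimate below it, bracketing the true value just as the report's naive/corrected pair does.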
The discard-ply-limit model at 7.85% substantially exceeds all ceiling estimates. This suggests either:
- The model exploits sequential correlations beyond what outcome conditioning captures
- The discard-ply-limit data distribution has different statistical properties than the standard distribution used for ceiling computation
- Both — the cleaner data and the model's sequence modeling reinforce each other
Learning Rate Selection (Phase 1)
Each ablation was explored at 3 learning rates (1e-4, 3e-4, 1e-3) for 5K steps:
| Ablation | LR | val_loss@5K | Selected |
|---|---|---|---|
| mate-boost | 1e-4 / 3e-4 / 1e-3 | 4.71 / 4.02 / 3.63 | 1e-3 |
| no-outcome | 1e-4 / 3e-4 / 1e-3 | 4.59 / 3.44 / 3.53 | 3e-4 |
| discard-ply-limit | 1e-4 / 3e-4 / 1e-3 | 4.74 / 4.05 / 3.62 | 1e-3 |
Note: 1e-4 was dominated across all ablations — 5% warmup over 200K steps means the LR only reaches 5e-5 by step 5K, barely out of the warmup phase.
The no-outcome ablation preferred a lower LR (3e-4), possibly because the shorter effective sequence (no outcome token) changes the loss landscape. The no-outcome trial also used no warmup (warmup_frac=0), which worked well for the lower LR.
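The warmup arithmetic behind the note above can be made explicit. A minimal linear-warmup sketch (the schedule beyond warmup is assumed constant here for simplicity):

```python
def warmup_lr(step, peak_lr, total_steps=200_000, warmup_frac=0.05):
    """Linear warmup: LR ramps from 0 to peak_lr over warmup_frac of the run."""
    warmup_steps = int(total_steps * warmup_frac)  # 10,000 steps here
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

# At the 5K-step Phase 1 cutoff, a 1e-4 trial is only halfway through
# warmup, so its effective LR is ~5e-5.
print(warmup_lr(5_000, 1e-4))   # ~5e-05
print(warmup_lr(5_000, 1e-3))   # ~5e-04
```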
Anomalies
- mate-boost NaN at ~177K steps: Only ablation to diverge. Using bfloat16 AMP. Cause unclear — may relate to shorter game lengths concentrating loss on fewer tokens. The 175K checkpoint is usable as the loss had plateaued.
- pause_after_steps bug: The pretraining trainer didn't implement this feature (only adapter training did). Fixed during the session by adding to TrainingConfig and trainer.py.
- lab_resume checkpoint bug: The MCP server looked for `best`/`final` dirs, but pretraining uses `step_XXXXXXXX` naming. Fixed to fall back to the highest step dir.
- compute_theoretical_ceiling.py bugs: (1) The saved JSON recorded the last batch's n_positions instead of the cumulative total. (2) Dead code `getattr(args, 'batch_games')`. Neither affects results.
Artifacts
All stored in hf://buckets/thomas-schweich/pretraining-ablations:
| Path | Contents |
|---|---|
| `logs/trial_00{24,25,26}/` | metrics.jsonl, configs |
| `checkpoints/trial_0024_mate-boost/step_00175000/` | Best mate-boost checkpoint |
| `checkpoints/trial_0024_mate-boost/step_00100000/` | 100K checkpoint |
| `checkpoints/trial_0025_no-outcome/step_00200000/` | Final no-outcome checkpoint |
| `checkpoints/trial_0025_no-outcome/step_00100000/` | 100K checkpoint |
| `checkpoints/trial_0026_discard-ply-limit/step_00200000/` | Final discard-ply-limit checkpoint |
| `checkpoints/trial_0026_discard-ply-limit/step_00100000/` | 100K checkpoint |
| `data/theoretical_ceiling_1024.json` | Ceiling computation results |
| `lab-notes.md` | Detailed experiment log |
| `chat_transcripts/` | Session transcript |
Next Steps
- Run linear probes on all three final checkpoints to compare internal representations (piece type, check status, castling, material, game phase). The probes were prepared but skipped due to performance issues — the probe evaluation script runs forward passes inefficiently on CPU. Consider keeping hidden states on GPU during probe training.
- Per-ply accuracy evaluation with `scripts/eval_accuracy.py` to understand where each ablation helps/hurts (opening vs middlegame vs endgame).
- Compute a ceiling for the discard-ply-limit distribution — the current ceiling assumes standard random games. The discard-ply-limit distribution is different (no truncated games), so its ceiling may be higher, which could explain the model exceeding our estimates.
- Investigate mate-boost NaN — check gradient norms in the metrics JSONL before the divergence point. Consider rerunning with float32 accumulation or lower LR.
- Adapter finetuning on Lichess data using these ablated pretrained weights as backbones — the key downstream question.
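For the mate-boost NaN investigation, a gradient-norm scan over the metrics JSONL could look like the sketch below. The field names (`step`, `grad_norm`) and the synthetic log lines are assumptions; match them to the actual `metrics.jsonl` schema.

```python
import io
import json

# Synthetic stand-in for the real metrics.jsonl near the divergence point.
log = io.StringIO("\n".join(json.dumps(r) for r in [
    {"step": 176000, "loss": 3.19, "grad_norm": 0.8},
    {"step": 176500, "loss": 3.18, "grad_norm": 7.5},
    {"step": 177000, "loss": float("nan"), "grad_norm": float("nan")},
]))

DIVERGENCE_STEP = 177_000
SPIKE_THRESHOLD = 5.0  # arbitrary cutoff for this sketch

# Collect steps with elevated gradient norms before the NaN appeared.
spikes = []
for line in log:
    rec = json.loads(line)
    if rec["step"] < DIVERGENCE_STEP and rec["grad_norm"] > SPIKE_THRESHOLD:
        spikes.append((rec["step"], rec["grad_norm"]))

print(spikes)
```

A spike shortly before the divergence would support lowering the LR or tightening gradient clipping on a rerun.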