
PAWN (Playstyle-Agnostic World-model Network for Chess)

A causal transformer trained on random chess games, designed as a testbed for finetuning and augmentation methods at small scales. Apache 2.0.

Repository Structure

pawn/
├── engine/          # Rust chess engine with PyO3 bindings (via shakmaty)
├── pawn/            # Core Python package
│   ├── config.py    # CLMConfig (small/base/large), TrainingConfig
│   ├── model.py     # PAWNCLM transformer (RMSNorm, SwiGLU, RoPE, factored embeddings)
│   ├── data.py      # On-the-fly random game data pipeline
│   ├── lichess_data.py  # Lichess PGN data pipeline + legal mask computation
│   ├── trainer.py   # Pretraining loop
│   ├── gpu.py       # GPU auto-detection (compile/AMP/SDPA backend)
│   ├── logging.py   # MetricsLogger (JSONL output)
│   ├── checkpoint.py # Atomic save/load, .complete sentinel, HF push
│   ├── adapters/    # Bottleneck, LoRA, FiLM, sparse, hybrid
│   ├── eval_suite/  # Probes, generation tests, diagnostics, lichess eval
│   └── dashboard/   # Solara training dashboard (metrics, charts, runner)
├── scripts/         # Training and evaluation entry points
├── tests/           # Unit tests
├── deploy/          # Runpod deployment scripts
└── docs/            # Architecture, training, adapter docs

Building

This is a uv workspace. The root project is the pawn Python package; engine/ is the sole workspace member.

# Build the Rust chess engine (required before anything else)
cd engine && uv run --with maturin maturin develop --release && cd ..

# Install Python deps (dev tools like pytest, seaborn, solara are in base dependencies):
uv sync --extra rocm      # AMD (ROCm 7.1)
uv sync --extra cu128     # NVIDIA (CUDA 12.8)

# Run tests
uv run pytest tests/

# Pretrain from scratch (local dev)
uv run python scripts/train.py --variant base --local-checkpoints

The only extras are GPU backends (rocm or cu128). Everything else (pytest, solara, optuna, seaborn, etc.) is in base dependencies. PyTorch lives in the extras because uv can't resolve CPU/CUDA/ROCm from a single lockfile — always specify --extra rocm or --extra cu128.

GPU requirement: configure_gpu() (called by every training and eval script) raises RuntimeError if no CUDA/ROCm GPU is detected. This prevents accidentally running GPU workloads on CPU, which is almost always a mistake. The environment variable PAWN_ALLOW_CPU=1 overrides this check as a last resort for the rare case where CPU execution is genuinely intended (e.g. a lightweight backfill script). Unit tests do not call configure_gpu() and run fine on CPU without the override.
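The guard-plus-escape-hatch pattern can be sketched as follows (a hypothetical simplification; the real check lives in pawn/gpu.py and inspects the actual torch device state rather than taking a flag):

```python
import os

def require_gpu(gpu_available: bool) -> None:
    """Raise unless a GPU is present or the escape hatch is set.

    Hypothetical sketch of the configure_gpu() guard; the real code
    checks CUDA/ROCm availability itself instead of taking a flag.
    """
    if gpu_available:
        return
    if os.environ.get("PAWN_ALLOW_CPU") == "1":
        return  # explicit opt-in to CPU execution
    raise RuntimeError(
        "No CUDA/ROCm GPU detected. Set PAWN_ALLOW_CPU=1 to run on CPU anyway."
    )
```

The environment variable makes the override deliberate and greppable, rather than a flag that might be left on in a script by accident.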

Engine (engine/)

Single source of truth for all chess logic. All game simulation, move generation, legality checks, tokenization, PGN parsing, and board state extraction happen in Rust. No Python chess libraries.

  • Uses rayon for parallel game generation (~43K games/sec, 150M+/hr)
  • PyO3 bindings expose chess_engine module to Python
  • Key functions: generate_random_games(), parse_pgn_file(), compute_legal_token_masks_sparse(), extract_board_states(), export_move_vocabulary(), compute_accuracy_ceiling()

Model

Architecture

  • Decoder-only transformer, next-token prediction over 4,278 tokens
  • Token vocabulary: 1 PAD + 4,096 grid (64x64 src/dst) + 176 promotions + 5 outcomes
  • Factored embeddings: src_embed[s] + dst_embed[d] + promo_embed[p]
  • Sequence format: [outcome] [ply_1] ... [ply_N] [PAD] ... [PAD] (256 tokens)
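The vocabulary arithmetic above can be made concrete with a small sketch. The exact ordering of the token regions (PAD first, then grid, promotions, outcomes) and the src*64 + dst grid layout are assumptions for illustration; only the region sizes come from the breakdown above:

```python
# Hypothetical token layout consistent with the vocab arithmetic:
# 1 PAD + 64*64 grid tokens + 176 promotion tokens + 5 outcome tokens = 4,278.
PAD = 0
GRID_BASE = 1                       # tokens 1..4096: src*64 + dst (assumed order)
PROMO_BASE = GRID_BASE + 64 * 64    # tokens 4097..4272
OUTCOME_BASE = PROMO_BASE + 176     # tokens 4273..4277
VOCAB_SIZE = OUTCOME_BASE + 5

def grid_token(src: int, dst: int) -> int:
    """Map a (source square, destination square) pair to its grid token id."""
    assert 0 <= src < 64 and 0 <= dst < 64
    return GRID_BASE + src * 64 + dst
```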

Variants

  • CLMConfig.small(): d=256, 8 layers, 4 heads, ~9.5M params
  • CLMConfig.base(): d=512, 8 layers, 8 heads, ~35.8M params (default)
  • CLMConfig.large(): d=640, 10 layers, 8 heads, ~68.4M params
  • CLMConfig.toy(): d=64, 2 layers, for tests only

Training

All training scripts require one of --hf-repo REPO_ID or --local-checkpoints (mutually exclusive). Use --local-checkpoints for local dev; use --hf-repo for any run where you need durable checkpoints.

Pretraining

# Single model
uv run python scripts/train.py --variant base --local-checkpoints

# All three variants simultaneously (shared data batches, sequential GPU)
uv run python scripts/train_all.py --local-checkpoints

# Resume from checkpoint
uv run python scripts/train.py --variant base --resume checkpoints/step_00050000 --local-checkpoints

scripts/train.py key args:

  • --variant {small|base|large|toy} — model size (default: base)
  • --resume PATH — resume from checkpoint directory
  • --total-steps N — training steps (default: 100,000)
  • --batch-size N — batch size (default: 256)
  • --discard-ply-limit — only train on naturally-ended games (no ply-limit truncation)
  • Architecture overrides: --d-model, --n-layers, --n-heads, --d-ff, --lr, --weight-decay, --warmup-steps

scripts/train_all.py additional args:

  • --shm-checkpoints — write checkpoints to /dev/shm (requires --hf-repo, volatile)
  • --run-evals — auto-run probes + diagnostics after training completes
  • --publish-results — push eval results to HF
  • --patience N — per-model early stopping patience (eval intervals without improvement)

Adapter Training

All adapter scripts require --checkpoint PATH (pretrained weights) and --pgn PATH (Lichess PGN file). They freeze the backbone and train only adapter parameters.

# Example: train a LoRA adapter on Lichess 1800-1900 games
uv run python scripts/train_lora.py \
    --checkpoint thomas-schweich/pawn-base \
    --pgn thomas-schweich/pawn-lichess-full --elo-min 1800 --elo-max 1900 \
    --lora-rank 4 --lr 3e-4 --local-checkpoints
| Script | Adapter | Key args | Typical params |
|---|---|---|---|
| train_bottleneck.py | Houlsby MLP | --bottleneck-dim 8 | ~131K |
| train_lora.py | Low-rank attention | --lora-rank 4 --lora-targets qkvo | ~65K |
| train_film.py | Channel-wise affine | --no-output-film | ~17K |
| train_sparse.py | Binary mask | --density 0.01 --sparse-targets qkvo | ~503K-2.7M |
| train_hybrid.py | LoRA + FiLM | --lora-rank 4 --film-lr 1e-3 | ~65K |
| train_tiny.py | None (from scratch) | --d-model 84 --n-layers 2 | ~524K |

Common adapter args: --epochs 50, --batch-size 64, --lr 3e-4, --patience 10, --val-every 1, --max-games 12000, --min-ply 10
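The low-rank idea behind the LoRA adapter can be sketched in a few lines of numpy. This is an illustration of the technique, not the project's implementation; the matrix names and the alpha scaling convention are generic LoRA, and the per-matrix parameter count is just the 2*r*d arithmetic:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x @ (W + alpha * A @ B): frozen base weight W plus a trainable
    low-rank update A (d_in x r) @ B (r x d_out)."""
    return x @ W + alpha * (x @ A) @ B

rng = np.random.default_rng(0)
d, r = 512, 4
W = rng.standard_normal((d, d))          # frozen backbone weight
A = rng.standard_normal((d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d))                     # trainable up-projection, zero-init

x = rng.standard_normal((2, d))
# With B zero-initialized, LoRA starts as an exact no-op on the backbone.
assert np.allclose(lora_forward(x, W, A, B), x @ W)

params_per_matrix = 2 * r * d  # 4,096 trainable params per adapted 512x512 matrix
```

Zero-initializing B is what lets adapter training start from the pretrained model's exact behavior.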

Common CLI Patterns

  • --sdpa-math — force MATH SDPA backend (required for ROCm + torch.compile)
  • --no-compile — disable torch.compile
  • --no-amp — disable mixed precision
  • --num-workers N — DataLoader workers (default: 8 for adapters, 4 for pretraining)
  • --device {cuda|cpu} — device selection
  • --wandb — enable Weights & Biases logging

Evaluation & Metrics

Linear Probes

uv run python scripts/eval_probes.py --log-dir logs --device cuda

Trains linear probes on frozen hidden states to measure internal representations (piece type, check status, castling rights, material count, game phase, etc.). Args: --n-games 4096, --n-val-games 1024, --n-epochs 20, --run RUN_NAME (specific run).
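The probing idea — fit only a linear readout on frozen features and use held-out accuracy as a measure of what the representation encodes — can be sketched on synthetic data. This is not the repo's eval code; the feature matrix and the binary label stand in for real hidden states and a real board property:

```python
import numpy as np

# Minimal linear-probe sketch: the "model" is never updated, only the readout.
rng = np.random.default_rng(0)
d, n = 32, 2000
w_true = rng.standard_normal(d)

H = rng.standard_normal((n, d))         # stand-in for frozen hidden states
y = (H @ w_true > 0).astype(float)      # stand-in binary property, e.g. "in check"

H_tr, H_te = H[:1500], H[1500:]
y_tr, y_te = y[:1500], y[1500:]

# Least-squares linear probe on centered labels.
w, *_ = np.linalg.lstsq(H_tr, y_tr - 0.5, rcond=None)
acc = ((H_te @ w > 0) == (y_te > 0.5)).mean()  # high iff linearly decodable
```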

Move Prediction Accuracy

uv run python scripts/eval_accuracy.py \
    --checkpoint thomas-schweich/pawn-base \
    --pgn thomas-schweich/pawn-lichess-full --elo-min 1800 --elo-max 1900 \
    --adapter-checkpoint logs/run_*/checkpoints/best

MAIA-compatible evaluation with per-phase and per-ply accuracy. Args: --min-eval-ply 10, --max-games 50000, --per-ply.

Theoretical Accuracy Ceilings

uv run python scripts/compute_theoretical_ceiling.py

Computes theoretical accuracy ceilings for random games via Monte Carlo rollouts: unconditional (E[1/N_legal]), naive-conditioned (1-ply filter), and MC-conditioned (Bayes-optimal with outcome knowledge). Reports a bias bracket (naive vs split-half corrected estimates) and bootstrap 95% CIs clustered by game. CPU-intensive.
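The unconditional ceiling is the easiest of the three to see: with N legal moves and uniformly random play, the best possible single-position hit rate is 1/N, so the ceiling is E[1/N_legal]. A toy Monte Carlo sketch (the legal-move counts here are synthetic stand-ins, not real chess statistics):

```python
import random

random.seed(0)

def rollout_legal_counts(n_positions: int):
    # Pretend each position has between 1 and 40 legal moves (illustrative only).
    return [random.randint(1, 40) for _ in range(n_positions)]

counts = rollout_legal_counts(100_000)
# E[1/N_legal]: average of the per-position best-case hit rates.
ceiling = sum(1.0 / n for n in counts) / len(counts)
```

The conditioned variants are harder because they require rolling out games to estimate move probabilities given the outcome, which is where the Monte Carlo cost comes from.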

Export to HuggingFace

uv run python scripts/export_hf_repo.py --run-dir logs/run_YYYYMMDD_HHMMSS

Converts a training run to HuggingFace repo format (safetensors + metrics). Finds best checkpoint by val loss.

Checkpoints

Pre-trained weights are hosted on HuggingFace and loaded directly by repo ID:

  • thomas-schweich/pawn-small — 9.5M params, CLMConfig.small()
  • thomas-schweich/pawn-base — 35.8M params, CLMConfig.base()
  • thomas-schweich/pawn-large — 68.4M params, CLMConfig.large()

All scripts accept HF repo IDs for --checkpoint (e.g. --checkpoint thomas-schweich/pawn-base). Weights are downloaded and cached automatically via huggingface_hub.

Checkpoint Format (safetensors)

Checkpoints are directories, not single files:

step_00065000/
├── model.safetensors        # model weights
├── optimizer.safetensors    # flattened optimizer state
├── training_state.json      # step, scheduler, scaler, RNG (base64)
├── config.json              # model + training config
└── .complete                # SHA-256 hashes of all files (integrity sentinel)

Central module: pawn/checkpoint.py. All save/load goes through this module. Legacy .pt files are still loadable (backward compatible).

Checkpoint Storage Modes

All training scripts require one of:

  • --hf-repo REPO_ID — push checkpoints to a HuggingFace branch as they're written (durable)
  • --local-checkpoints — save locally only (for development without an HF account)

HF mode creates a run/{run_id} branch. HF pushes happen in background threads (one per model slot) so training is not blocked by uploads. Squash-merge into main when satisfied.

Optional: --shm-checkpoints writes checkpoints to /dev/shm (RAM-backed filesystem, instant writes). Requires --hf-repo since /dev/shm is volatile. Old checkpoints are cleaned up after successful HF push, keeping only the latest and the best (by val loss) for post-training evals.
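The keep-latest-plus-best cleanup policy can be sketched as a pure function (a hypothetical simplification of the /dev/shm pruning; the real code also confirms the HF push succeeded before deleting anything):

```python
def prune_checkpoints(ckpt_dirs, val_losses):
    """Return the checkpoint dirs to keep: the latest step and the best
    val loss. Zero-padded step_XXXXXXXX names sort correctly as strings."""
    latest = max(ckpt_dirs)
    best = min(ckpt_dirs, key=lambda d: val_losses[d])
    return {latest, best}

losses = {"step_00010000": 3.4, "step_00015000": 3.1, "step_00020000": 3.2}
keep = prune_checkpoints(list(losses), losses)
# Everything outside `keep` is safe to delete after a successful HF push.
```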

Data Integrity

Every checkpoint write is atomic: files are written to a .tmp directory, then renamed. The .complete sentinel contains SHA-256 hashes of every file in the checkpoint. Hashes are always verified on load — no exceptions.

  • IncompleteCheckpointError — raised when .complete sentinel is missing
  • CheckpointIntegrityError — raised when any hash mismatches
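The write-then-rename-plus-sentinel scheme can be sketched as follows. This is an assumption-level simplification of pawn/checkpoint.py: the real module raises the typed errors above and handles optimizer state, but the atomicity and hashing structure is the same idea:

```python
import hashlib
import json
import os

def write_checkpoint(files: dict, dest: str) -> None:
    """Write files to dest atomically: stage in a .tmp dir, record SHA-256
    hashes in a .complete sentinel, then rename into place."""
    tmp = dest + ".tmp"
    os.makedirs(tmp, exist_ok=True)
    hashes = {}
    for name, data in files.items():
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(data)
        hashes[name] = hashlib.sha256(data).hexdigest()
    with open(os.path.join(tmp, ".complete"), "w") as f:
        json.dump(hashes, f)
    os.rename(tmp, dest)  # atomic on POSIX: readers see all files or none

def verify_checkpoint(dest: str) -> bool:
    sentinel = os.path.join(dest, ".complete")
    if not os.path.exists(sentinel):
        return False  # IncompleteCheckpointError in the real module
    with open(sentinel) as f:
        hashes = json.load(f)
    for name, digest in hashes.items():
        with open(os.path.join(dest, name), "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != digest:
                return False  # CheckpointIntegrityError in the real module
    return True
```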

Never use kill -9 on training processes. SIGTERM is handled gracefully: a flag is set, the training loop checks it between steps, saves a checkpoint, pushes to HF, and exits cleanly.
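The graceful-shutdown flag pattern looks roughly like this (an illustrative sketch, not the trainer's code; the real loop also saves a checkpoint and pushes to HF before exiting):

```python
import signal

stop_requested = False

def _handle_sigterm(signum, frame):
    # Do no real work in the handler; just record the request.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, _handle_sigterm)

def train_loop(max_steps: int) -> int:
    """Run up to max_steps; stop between steps if SIGTERM arrived."""
    for step in range(max_steps):
        if stop_requested:
            # save_checkpoint(); push_to_hf()  # done here in the real trainer
            return step
        pass  # one training step
    return max_steps
```

kill -9 bypasses this entirely (SIGKILL cannot be caught), which is why it risks losing un-pushed checkpoints.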

Never rsync checkpoint files from running pods. Checkpoints are pushed to HuggingFace from the trainer. Load via HF repo ID (e.g. --checkpoint thomas-schweich/pawn-base).

RunPod Operations

Docker Image

A single Docker image (thomasschweich/pawn:latest) is automatically built and pushed to Docker Hub by CI on every merge to main. No manual builds needed.

The image is based on runpod/pytorch (CUDA + SSH + Jupyter) with all Python deps pre-installed. Code lives at /opt/pawn on pods. SSH in and run experiments directly.

To build locally (rarely needed):

docker build --platform linux/amd64 \
    --build-arg GIT_HASH=$(git rev-parse HEAD) \
    -t thomasschweich/pawn:latest .

Pod Lifecycle

Use deploy/pod.sh for all pod management. Requires runpodctl (wget -qO- cli.runpod.net | sudo bash).

# Create a pod
bash deploy/pod.sh create myexp --gpu h100

# SSH into it
bash deploy/pod.sh ssh myexp

# Launch training
bash deploy/pod.sh launch myexp scripts/train_all.py --hf-repo thomas-schweich/pawn-{variant}

# Stop (preserves volume, stops billing)
bash deploy/pod.sh stop myexp

# Delete (destroys everything)
bash deploy/pod.sh delete myexp

GPU shortcuts: a5000, a40, a6000, 4090, 5090, l40s, h100. Pod configs are cached in ~/.config/pawn/pods/<name>.env.

GPU Selection

Benchmarks from pretraining 3 models concurrently (train_all.py, batch=256):

| GPU | VRAM | $/hr | Step time | 100K cost | Notes |
|---|---|---|---|---|---|
| B200 | 192GB | $4.99 | 0.28s | ~$39 | Fastest |
| H200 SXM | 80GB | $3.59 | 0.34s | ~$34 | Best wall-clock/cost balance |
| RTX PRO 6000 | 48GB | $1.89 | 0.62s | ~$33 | Cheapest viable |
| A100 PCIe | 80GB | $1.39 | 0.79s | ~$30 | Cheapest overall |
| L40S | 48GB | $0.86 | 1.37s | ~$33 | Slow but cheap |
| RTX 5090/4090/3090 | 24-32GB | | OOM | | Insufficient VRAM for 3 models |

Total cost is remarkably consistent ($30-39) across viable GPUs, so the real tradeoff is wall-clock time against hourly spend, not total cost. Single-model training fits on 24GB GPUs.

Required Pod Configuration

  • Always attach a network volume. Checkpoints write to disk during atomic rename and HF push. Ephemeral container disk is lost on pod termination.
  • Set HF_TOKEN as a pod environment variable for automatic HuggingFace authentication. The entrypoint persists it to ~/.cache/huggingface/token.
  • PAWN_MODEL=thomas-schweich/pawn-base — auto-pull a checkpoint on startup (runner target).
  • PAWN_CMD — training command to execute (alternative to Docker CMD args).

Pod Safety

  • Stop pods with runpodctl pod stop or bash deploy/pod.sh stop — sends SIGTERM, trainer saves and pushes before exiting.
  • Never runpodctl pod delete while training is running — data loss risk.
  • Never kill -9 training processes — use SIGTERM (plain kill), which triggers graceful shutdown.
  • Never rsync checkpoint files from running pods — load via HF repo ID instead.

HuggingFace Bucket Backups

Use HF buckets (hf://buckets/...) to back up experiment data from pods. Buckets are not datasets or repos — they use the sync_bucket API, not upload_file with repo_type.

Key constraint: upload bandwidth from pods is ~1.8 MB/s. Checkpoints are ~430MB each (143MB model + 287MB optimizer). A full training run with 5K-step checkpoint intervals produces ~17GB per trial. Sync selectively.

During training — sync metrics only (instant):

from huggingface_hub import HfApi
api = HfApi(token=HF_TOKEN)
api.sync_bucket(
    source="/workspace/logs",
    dest="hf://buckets/OWNER/BUCKET/logs",
    exclude=["*/checkpoints/*"],
)

For individual files (lab notes, transcripts) — stage in a temp dir:

import tempfile, shutil, os
with tempfile.TemporaryDirectory() as td:
    shutil.copy("/workspace/runs/lab-notes.md", os.path.join(td, "lab-notes.md"))
    api.sync_bucket(source=td, dest="hf://buckets/OWNER/BUCKET")

upload_file(..., repo_type="bucket") does not work — buckets are not a valid repo type for that API. Always use sync_bucket.

After training — sync only best/final checkpoints:

api.sync_bucket(
    source="/workspace/logs/trial_XXXX/run_.../checkpoints/step_BEST",
    dest="hf://buckets/OWNER/BUCKET/checkpoints/trial_XXXX/best",
)

HF_TOKEN: Stored in /opt/pawn/.env on pods. Source it or export before calling the API.

Monitoring Training Progress

Key Principle: Write Scripts to Disk for Pre-Approval

When setting up recurring monitoring, always write the monitoring script to a file first so the user can review and pre-approve it. This avoids repeated permission prompts when /loop fires.

Pattern:

  1. Write a bash script to disk (e.g., scripts/check_my_run.sh)
  2. User reviews and approves the script
  3. Schedule with /loop 15m bash scripts/check_my_run.sh

Example monitoring script:

#!/usr/bin/env bash
# scripts/check_my_run.sh — monitor a specific training run
set -euo pipefail
bash /home/tas/pawn/scripts/monitor_training.sh <POD_ID>

Or for local-only monitoring:

#!/usr/bin/env bash
set -euo pipefail
bash /home/tas/pawn/scripts/check_progress.sh --sync

Available Monitoring Tools

| Tool | What it does |
|---|---|
| scripts/monitor_training.sh [POD_ID] | SSH to pod, sync metrics via rsync, show per-variant step/loss/acc/ETA, check HF checkpoint branches |
| scripts/check_progress.sh [LOG_DIR] | Show progress from local logs/ directory |
| python -m pawn.dashboard --log-dir logs | Solara web dashboard with interactive charts |

Dashboard

python -m pawn.dashboard --log-dir logs

Reads metrics.jsonl files, no dependency on training packages. Auto-detects run type from config fields. Shows loss curves, accuracy, LR schedules, GPU utilization, patience clocks, and adapter-specific diagnostics. Requires restart for code changes (no hot reload).

Logs

Training metrics in logs/ (gitignored). Each run gets a timestamped directory with metrics.jsonl and a random slug (e.g., run_20260325_140000_zesty-osprey/).

MetricsLogger (pawn/logging.py) writes one JSON object per line. Every record includes timestamp, step, elapsed time, and memory stats. Config records include hostname, git hash, git tag, and run slug.
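The JSONL format is trivially round-trippable, which is what lets the dashboard run without the training packages. A sketch of writing and reading records (field names here are illustrative, not MetricsLogger's exact schema):

```python
import io
import json
import time

# Write: one self-describing JSON object per line.
buf = io.StringIO()  # stands in for an open metrics.jsonl file
for step in (100, 200):
    record = {"timestamp": time.time(), "step": step, "loss": 4.0 - step / 1000}
    buf.write(json.dumps(record) + "\n")

# Read: each line parses independently, so partial files are still usable.
buf.seek(0)
records = [json.loads(line) for line in buf]
last_step = records[-1]["step"]
```

Line-oriented JSON also means a crashed run's log is readable up to the last complete line, with no trailer to repair.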

Hyperparameter Sweeps

Optuna integration via pawn/sweep.py and scripts/sweep.py:

uv run python scripts/sweep.py \
    --adapter lora --n-trials 30 --n-jobs 2 --n-gpus 2 \
    --total-steps 20000 --pruner hyperband \
    --checkpoint thomas-schweich/pawn-base --pgn thomas-schweich/pawn-lichess-full \
    --local-checkpoints

Supports all adapter types + architecture search. GPU affinity assigns CUDA_VISIBLE_DEVICES = trial.number % n_gpus. SQLite-backed study persistence. Pruner options: hyperband, median, none.
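The GPU affinity rule is a one-liner; a sketch of how a sweep runner might apply it when spawning a trial's process (the environment-dict construction is illustrative, not pawn/sweep.py's code):

```python
import os

def gpu_for_trial(trial_number: int, n_gpus: int) -> str:
    """Round-robin GPU affinity: trial N runs on GPU N % n_gpus."""
    return str(trial_number % n_gpus)

# Build the child environment before spawning the trial's worker process;
# CUDA_VISIBLE_DEVICES makes the assigned GPU the only one the trial sees.
env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_for_trial(5, 2))
```

With n_jobs equal to n_gpus, this keeps exactly one trial per GPU without any coordination beyond the trial number.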

Key Patterns & Gotchas

  • DataLoader workers must use multiprocessing_context='spawn' — the Rust engine uses rayon, and fork after rayon init causes deadlocks.
  • SDPA_BACKEND must be set before torch.compile() — compiled code captures the backend at trace time. apply_gpu_config() handles this.
  • ROCm works: The only known ROCm issue is a stride mismatch in flash attention backward when combined with torch.compile + AMP. The workaround is --sdpa-math (use the MATH SDPA backend instead of flash), which configure_gpu() applies automatically on AMD GPUs. Everything else — training, eval, adapters, data loading — works identically on ROCm and CUDA. Do not assume bugs are ROCm-specific. Every other time something has failed on AMD it turned out to be a bug in our code (wrong torch version installed, stale lockfile, missing dependency, etc.), not a ROCm issue.
  • Sparse logit projection: forward_hidden() returns (B,T,d_model), then only loss-masked positions project through lm_head — avoids full (B,T,V) materialization.
  • Legal mask via Rust: LegalMaskBuilder replays games in Rust, returns sparse indices (~2 MB) scattered into a pre-allocated GPU buffer (vs ~70 MB dense).
  • GPU auto-detection: pawn.gpu.configure_gpu() selects compile/AMP/SDPA settings. apply_gpu_config() applies them. NVIDIA uses flash attention + compile; AMD uses MATH SDPA + compile. Both paths are tested and production-validated.
  • Factored embeddings: each move token decomposes into src_embed[s] + dst_embed[d] + promo_embed[p], reducing embedding parameters by ~32x.
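The factored-embedding arithmetic can be checked directly. The sketch below uses the table sizes implied by the vocabulary breakdown (64 source squares, 64 destination squares, a handful of promotion kinds); the 5-row promotion table is an assumption for illustration, and the ~32x figure falls out of the row-count ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
src_embed = rng.standard_normal((64, d))
dst_embed = rng.standard_normal((64, d))
promo_embed = rng.standard_normal((5, d))  # e.g. none/N/B/R/Q (assumed)

def embed_move(src: int, dst: int, promo: int):
    """A move token's embedding is the sum of three small table lookups,
    instead of one row of a full 4,278 x d embedding matrix."""
    return src_embed[src] + dst_embed[dst] + promo_embed[promo]

full_rows = 4278
factored_rows = 64 + 64 + 5
reduction = full_rows / factored_rows  # ~32x fewer embedding rows
```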

Current task

Pretraining Ablations: mate-boost, no-outcome, discard-ply-limit

You are running three pretraining ablation experiments on PAWN-Base. Each ablation modifies one aspect of the random game generation to understand its effect on the learned world model.

You are on a RunPod with 3x RTX 6000 Ada (48GB VRAM each, $0.77/hr each = $2.31/hr total). Use the pawn-lab MCP server and the manage-pod skill workflow.

As your first action, set up the check-in crons per the manage-pod skill, then proceed.

The Three Ablations

| Ablation | Config | Hypothesis |
|---|---|---|
| mate-boost | mate_boost: 1.0 | Always taking mate-in-1 produces shorter, more decisive games (~134 avg ply vs ~238). More checkmate patterns, fewer aimless endgames. |
| no-outcome | no_outcome_token: true | Stripping the outcome token forces the model to infer game result from moves alone. Tests whether outcome conditioning helps or hurts. |
| discard-ply-limit | discard_ply_limit: true | Only naturally-ended games (no truncation at 255 plies). All games have meaningful endings. |

There is no separate baseline run — the existing PAWN-Base checkpoint (thomas-schweich/pawn-base) trained at 100K steps is the baseline reference.

Unified Config

All training uses the unified scripts/train.py with Pydantic RunConfig. Call lab_schema to see all available fields. Configs are JSON dicts passed to lab_launch.

Base config shared by all ablations:

{
  "run_type": "pretrain",
  "variant": "base",
  "total_steps": 200000,
  "batch_size": 256,
  "lr": 3e-4,
  "local_checkpoints": true,
  "amp_dtype": "bfloat16",
  "eval_interval": 5000,
  "num_workers": 4
}

Each ablation adds its flag on top. Example for mate-boost:

{
  "run_type": "pretrain",
  "variant": "base",
  "mate_boost": 1.0,
  "total_steps": 200000,
  "batch_size": 256,
  "lr": 3e-4,
  "local_checkpoints": true,
  "amp_dtype": "bfloat16",
  "eval_interval": 5000,
  "num_workers": 4
}

Procedure

Phase 1: Learning Rate Exploration (~1 hour)

For each of the 3 ablations, launch 3 trials with different learning rates using pause_after_steps:

lab_launch(config={
  "run_type": "pretrain",
  "variant": "base",
  "mate_boost": 1.0,           # ← ablation flag
  "total_steps": 200000,
  "pause_after_steps": 5000,   # ← pause for comparison
  "batch_size": 256,
  "lr": 1e-4,                  # ← vary this: 1e-4, 3e-4, 1e-3
  "local_checkpoints": true,
  "amp_dtype": "bfloat16",
  "no_compile": true,          # ← skip compile for short exploration
  "eval_interval": 2500,
  "num_workers": 4
})

That's 9 trials total (3 LRs × 3 ablations). With 3 GPUs, run 3 trials concurrently (one per GPU). Run each ablation's 3 LRs sequentially on its GPU, or interleave — the lab runner assigns GPUs automatically. Each pauses at 5K steps. Use no_compile: true to skip the 15-30 min compile overhead for these short runs.

After all 9 pause, call lab_results to compare val_loss at 5K steps. Pick the best LR for each ablation.

Phase 2: Full Training (~22 hours)

Resume the best LR for each ablation to 200K steps with torch.compile enabled:

lab_resume(trial_id=BEST_MATE_BOOST)      # no pause → runs to total_steps
lab_resume(trial_id=BEST_NO_OUTCOME)
lab_resume(trial_id=BEST_DISCARD_PLY)

lab_resume clears pause_after_steps by default, so the resumed trials run to completion at total_steps=200000. The resumed trials pick up from the 5K-step checkpoint with optimizer state intact.

Important: Do NOT pass no_compile: true for the resumed runs. Let torch.compile run — the 15-30 min compile overhead amortizes over 195K remaining steps.

With 1 model per GPU (no sharing), expect ~0.4s/step → ~22 hours for 195K steps. All 3 run in parallel. Total cost: ~$51.

Phase 3: Evaluation

After all 3 runs converge, compare:

  • Final val_loss and accuracy vs baseline (PAWN-Base at 100K steps)
  • Learning curves (loss vs step for all 3, plotted together)
  • Per-ply accuracy via scripts/eval_accuracy.py if time permits

Monitoring

Use the manage-pod skill check-in pattern:

  • Mini check-in (every 5 min): lab_events — launch next trial if GPU idle
  • Full check-in (hourly): lab_status + lab_log for each running trial + update lab notes

During Phase 1 (9 short trials), use 5-min check-ins — trials finish fast. During Phase 2 (3 long runs), switch to hourly after confirming stability.

What to watch for

  • NaN loss: Kill immediately. Try lr/3. Use lab_log to check.
  • discard-ply-limit throughput: This ablation discards ~60% of generated games. If step times are much slower than the other two, reduce num_workers to free CPU for data generation.
  • mate-boost game lengths: Check lab_log — mate-boost games are ~134 plies vs ~238 normal. Each batch processes more games, which may cause higher memory usage. If OOM, reduce batch_size to 128.
  • GPU assignment: The lab runner assigns one trial per GPU via CUDA_VISIBLE_DEVICES. If a trial fails and a GPU sits idle, launch a replacement promptly.

Cost Budget

| Phase | Trials | Time | Cost |
|---|---|---|---|
| LR exploration | 9 × 5K steps | ~1h | ~$3 |
| Full training | 3 × 195K steps (parallel) | ~22h | ~$51 |
| Total | | ~23h | ~$53 |

Set cost tracking: lab_set_cost(cost_per_hour=2.31)

Lab Notes

Write experiment state to runs/lab-notes.md (survives context compaction via PostCompact hook). Include:

  • Phase 1 results table (LR × ablation → val_loss at 5K steps)
  • Which LR was selected for each ablation and why
  • Phase 2 progress (step, loss, ETA per ablation)
  • Any anomalies or adjustments made

Current State (as of 2026-04-03 19:30 UTC)

Phase 2: Full Training (RUNNING)

Three ablation runs training to 200K steps, torch.compile ON, resumed from 5K:

| Trial | GPU | Ablation | LR | Step (~19:00) | train_loss | acc |
|---|---|---|---|---|---|---|
| 24 | 0 | mate-boost | 1e-3 | ~120K | 3.19 | 6.8% |
| 25 | 1 | no-outcome | 3e-4 (no warmup) | ~116K | 3.09 | 7.0% |
| 26 | 2 | discard-ply-limit | 1e-3 | ~119K | 3.18 | 7.5% |
  • ETA: 2026-04-04 04:30 UTC (9h remaining)
  • Hourly cron monitors via lab_events + lab_log
  • Baseline ref: PAWN-Base val_loss ~3.06, acc ~6.9% at 100K steps

Concurrent: Accuracy Ceiling Computation (RUNNING)

Running scripts/compute_theoretical_ceiling.py on spare CPUs (RAYON_NUM_THREADS=30) while GPUs train:

RAYON_NUM_THREADS=30 PAWN_ALLOW_CPU=1 /opt/pawn/.venv/bin/python \
    scripts/compute_theoretical_ceiling.py \
    --n-games 5000 --rollouts 1024 --sample-rate 0.075 \
    --model-accuracy 0.069 --output /workspace/data/theoretical_ceiling_1024.json
  • 1024 rollouts/move (8x previous), expected to narrow bias bracket from 0.66pp to ~0.23pp
  • Processes in 10 batches of 500 games, printing intermediate estimates
  • ETA: ~3-4 hours (finishes well before training)
  • Output: /workspace/data/theoretical_ceiling_1024.json

Bug Fixes Applied This Session

  • pause_after_steps: Was missing from trainer.py (only in adapter training). Added to TrainingConfig and trainer.py train loop. Wired through scripts/train.py.
  • lab_resume checkpoint discovery: Was looking for checkpoints/best or checkpoints/final only. Pretraining uses step_XXXXXXXX naming. Added fallback to pick highest step_* dir.

HF Bucket Backup

All experiment data syncs to hf://buckets/thomas-schweich/pretraining-ablations:

  • Metrics/configs synced every ~4 hours (excluding checkpoints)
  • 100K-step checkpoints for all 3 ablations uploaded
  • Lab notes + chat transcript synced periodically
  • HF_TOKEN in /opt/pawn/.env
  • See "HuggingFace Bucket Backups" section above for API patterns

Lab Notes

Detailed log at /workspace/runs/lab-notes.md (symlinked from /opt/pawn/runs/lab-notes.md). Survives context compaction via PostCompact hook. Read this first when resuming.

Post-Training Procedure

When each training run completes (in any order):

  1. Upload best checkpoint to HF bucket in the background (see bucket sync patterns above)
  2. Run linear probes on the best checkpoint in the background:
    PAWN_ALLOW_CPU=1 /opt/pawn/.venv/bin/python scripts/eval_probes.py \
        --checkpoint /workspace/logs/trial_XXXX/run_.../checkpoints/step_BEST \
        --log-dir /workspace/eval_probes/trial_XXXX --device cuda:N
    
    Use the GPU that just freed up. Run in background so other completions aren't blocked.
  3. After the last probe run is kicked off, draft the final report comparing all 3 ablations + baseline:
    • val_loss and accuracy curves
    • Probe results (piece type, check, castling, material, game phase)
    • Accuracy ceiling comparison
    • Key findings and surprises
  4. Update the report with probe results as they finish
  5. Upload everything to HF bucket: final checkpoints, probe results, report, lab notes

Success Criteria

  1. All three ablations train to 200K steps without divergence
  2. Clear comparison of val_loss and accuracy across ablations
  3. At least one ablation shows meaningful difference from baseline (better or worse — both are informative)
