PAWN (Playstyle-Agnostic World-model Network for Chess)
A causal transformer trained on random chess games, designed as a testbed for finetuning and augmentation methods at small scales. Apache 2.0.
Repository Structure
pawn/
├── engine/ # Rust chess engine with PyO3 bindings (via shakmaty)
├── pawn/ # Core Python package
│ ├── config.py # CLMConfig (small/base/large), TrainingConfig
│ ├── model.py # PAWNCLM transformer (RMSNorm, SwiGLU, RoPE, factored embeddings)
│ ├── data.py # On-the-fly random game data pipeline
│ ├── lichess_data.py # Lichess PGN data pipeline + legal mask computation
│ ├── trainer.py # Pretraining loop
│ ├── gpu.py # GPU auto-detection (compile/AMP/SDPA backend)
│ ├── logging.py # MetricsLogger (JSONL output)
│ ├── checkpoint.py # Atomic save/load, .complete sentinel, HF push
│ ├── adapters/ # Bottleneck, LoRA, FiLM, sparse, hybrid
│ ├── eval_suite/ # Probes, generation tests, diagnostics, lichess eval
│ └── dashboard/ # Solara training dashboard (metrics, charts, runner)
├── scripts/ # Training and evaluation entry points
├── tests/ # Unit tests
├── deploy/ # Runpod deployment scripts
└── docs/ # Architecture, training, adapter docs
Building
This is a uv workspace. The root project is the pawn Python package; engine/ is the sole workspace member.
# Build the Rust chess engine (required before anything else)
cd engine && uv run --with maturin maturin develop --release && cd ..
# Install Python deps (dev tools like pytest, seaborn, solara are in base dependencies):
uv sync --extra rocm # AMD (ROCm 7.1)
uv sync --extra cu128 # NVIDIA (CUDA 12.8)
# Run tests
uv run pytest tests/
# Pretrain from scratch (local dev)
uv run python scripts/train.py --variant base --local-checkpoints
The only extras are GPU backends (rocm or cu128). Everything else (pytest, solara, optuna, seaborn, etc.) is in base dependencies. PyTorch lives in the extras because uv can't resolve CPU/CUDA/ROCm from a single lockfile — always specify --extra rocm or --extra cu128.
GPU requirement: configure_gpu() (called by every training and eval script) raises RuntimeError if no CUDA/ROCm GPU is detected. This prevents accidentally running GPU workloads on CPU, which is almost always a mistake. The environment variable PAWN_ALLOW_CPU=1 overrides this check as a last resort for the rare case where CPU execution is genuinely intended (e.g. a lightweight backfill script). Unit tests do not call configure_gpu() and run fine on CPU without the override.
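The guard logic can be sketched roughly as follows (function and message text here are illustrative, not the actual `configure_gpu()` implementation; GPU detection is stubbed as a parameter):

```python
import os

def require_gpu(cuda_available: bool) -> None:
    """Illustrative sketch of the GPU guard: refuse to run on CPU
    unless PAWN_ALLOW_CPU=1 is set explicitly."""
    if cuda_available:
        return
    if os.environ.get("PAWN_ALLOW_CPU") == "1":
        return  # explicit opt-in, e.g. a lightweight backfill script
    raise RuntimeError(
        "No CUDA/ROCm GPU detected. Set PAWN_ALLOW_CPU=1 to run on CPU anyway."
    )
```

The point of the env-var escape hatch is that the default failure mode is loud: forgetting a flag aborts the run rather than silently training on CPU for hours.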
Engine (engine/)
Single source of truth for all chess logic. All game simulation, move generation, legality checks, tokenization, PGN parsing, and board state extraction happen in Rust. No Python chess libraries.
- Uses rayon for parallel game generation (~43K games/sec, 150M+/hr)
- PyO3 bindings expose the `chess_engine` module to Python
- Key functions: `generate_random_games()`, `parse_pgn_file()`, `compute_legal_token_masks_sparse()`, `extract_board_states()`, `export_move_vocabulary()`, `compute_accuracy_ceiling()`
Model
Architecture
- Decoder-only transformer, next-token prediction over 4,278 tokens
- Token vocabulary: 1 PAD + 4,096 grid (64x64 src/dst) + 176 promotions + 5 outcomes
- Factored embeddings: `src_embed[s] + dst_embed[d] + promo_embed[p]`
- Sequence format: `[outcome] [ply_1] ... [ply_N] [PAD] ... [PAD]` (256 tokens)
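As a rough illustration of the factored lookup, here is a toy sketch. The token-id layout (id 0 = PAD, ids 1..4096 = `src*64 + dst`) is an assumption for illustration only, not the real vocabulary ordering, and the tiny list-based tables stand in for learned embedding matrices:

```python
def decode_grid_token(token_id: int) -> tuple[int, int]:
    """Map a grid-move token id to (src_square, dst_square), each 0-63.
    Assumes the hypothetical layout: ids 1..4096 = src*64 + dst."""
    assert 1 <= token_id <= 4096, "not a grid move under this toy layout"
    idx = token_id - 1
    return idx // 64, idx % 64

# Factored embedding: one row per square in each small table, summed per
# token, instead of one row per (src, dst) pair in a 4096-row table.
d_model = 4
src_embed = [[0.01 * s] * d_model for s in range(64)]
dst_embed = [[0.02 * d] * d_model for d in range(64)]

def embed(token_id: int) -> list[float]:
    s, d = decode_grid_token(token_id)
    return [a + b for a, b in zip(src_embed[s], dst_embed[d])]
```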
Variants
- `CLMConfig.small()`: d=256, 8 layers, 4 heads, ~9.5M params
- `CLMConfig.base()`: d=512, 8 layers, 8 heads, ~35.8M params (default)
- `CLMConfig.large()`: d=640, 10 layers, 8 heads, ~68.4M params
- `CLMConfig.toy()`: d=64, 2 layers, for tests only
Training
All training scripts require one of --hf-repo REPO_ID or --local-checkpoints (mutually exclusive). Use --local-checkpoints for local dev; use --hf-repo for any run where you need durable checkpoints.
Pretraining
# Single model
uv run python scripts/train.py --variant base --local-checkpoints
# All three variants simultaneously (shared data batches, sequential GPU)
uv run python scripts/train_all.py --local-checkpoints
# Resume from checkpoint
uv run python scripts/train.py --variant base --resume checkpoints/step_00050000 --local-checkpoints
scripts/train.py key args:
- `--variant {small|base|large|toy}` — model size (default: base)
- `--resume PATH` — resume from a checkpoint directory
- `--total-steps N` — training steps (default: 100,000)
- `--batch-size N` — batch size (default: 256)
- `--discard-ply-limit` — only train on naturally-ended games (no ply-limit truncation)
- Architecture overrides: `--d-model`, `--n-layers`, `--n-heads`, `--d-ff`, `--lr`, `--weight-decay`, `--warmup-steps`
scripts/train_all.py additional args:
- `--shm-checkpoints` — write checkpoints to `/dev/shm` (requires `--hf-repo`; volatile)
- `--run-evals` — auto-run probes + diagnostics after training completes
- `--publish-results` — push eval results to HF
- `--patience N` — per-model early stopping patience (eval intervals without improvement)
Adapter Training
All adapter scripts require --checkpoint PATH (pretrained weights) and --pgn PATH (Lichess PGN file). They freeze the backbone and train only adapter parameters.
# Example: train a LoRA adapter on Lichess 1800-1900 games
uv run python scripts/train_lora.py \
--checkpoint thomas-schweich/pawn-base \
--pgn thomas-schweich/pawn-lichess-full --elo-min 1800 --elo-max 1900 \
--lora-rank 4 --lr 3e-4 --local-checkpoints
| Script | Adapter | Key args | Typical params |
|---|---|---|---|
| `train_bottleneck.py` | Houlsby MLP | `--bottleneck-dim 8` | ~131K |
| `train_lora.py` | Low-rank attention | `--lora-rank 4 --lora-targets qkvo` | ~65K |
| `train_film.py` | Channel-wise affine | `--no-output-film` | ~17K |
| `train_sparse.py` | Binary mask | `--density 0.01 --sparse-targets qkvo` | ~503K-2.7M |
| `train_hybrid.py` | LoRA + FiLM | `--lora-rank 4 --film-lr 1e-3` | ~65K |
| `train_tiny.py` | None (from scratch) | `--d-model 84 --n-layers 2` | ~524K |
Common adapter args: --epochs 50, --batch-size 64, --lr 3e-4, --patience 10, --val-every 1, --max-games 12000, --min-ply 10
Common CLI Patterns
- `--sdpa-math` — force MATH SDPA backend (required for ROCm + torch.compile)
- `--no-compile` — disable torch.compile
- `--no-amp` — disable mixed precision
- `--num-workers N` — DataLoader workers (default: 8 for adapters, 4 for pretraining)
- `--device {cuda|cpu}` — device selection
- `--wandb` — enable Weights & Biases logging
Evaluation & Metrics
Linear Probes
uv run python scripts/eval_probes.py --log-dir logs --device cuda
Trains linear probes on frozen hidden states to measure internal representations (piece type, check status, castling rights, material count, game phase, etc.). Args: --n-games 4096, --n-val-games 1024, --n-epochs 20, --run RUN_NAME (specific run).
Move Prediction Accuracy
uv run python scripts/eval_accuracy.py \
--checkpoint thomas-schweich/pawn-base \
--pgn thomas-schweich/pawn-lichess-full --elo-min 1800 --elo-max 1900 \
--adapter-checkpoint logs/run_*/checkpoints/best
MAIA-compatible evaluation with per-phase and per-ply accuracy. Args: --min-eval-ply 10, --max-games 50000, --per-ply.
Theoretical Accuracy Ceilings
uv run python scripts/compute_theoretical_ceiling.py
Computes theoretical accuracy ceilings for random games via Monte Carlo rollouts: unconditional (E[1/N_legal]), naive-conditioned (1-ply filter), and MC-conditioned (Bayes-optimal with outcome knowledge). Reports a bias bracket (naive vs split-half corrected estimates) and bootstrap 95% CIs clustered by game. CPU-intensive.
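The unconditional ceiling is easy to illustrate: a guesser that knows only the legal-move set picks the played move with probability 1/N_legal, so the ceiling is E[1/N_legal] over positions. A minimal Monte Carlo sketch, with a toy random draw standing in for the Rust engine's rollouts (the real script replays actual games):

```python
import random

random.seed(0)

def toy_legal_move_counts(n_positions: int) -> list[int]:
    """Stand-in for engine rollouts: draw a plausible legal-move count
    per position. Uniform 1..40 is an assumption for illustration."""
    return [random.randint(1, 40) for _ in range(n_positions)]

# Unconditional ceiling: average of 1/N_legal across sampled positions.
counts = toy_legal_move_counts(100_000)
ceiling = sum(1 / n for n in counts) / len(counts)
```

Note that E[1/N] is larger than 1/E[N] (Jensen's inequality): positions with few legal moves are disproportionately easy to guess, which is why the ceiling must be estimated per-position rather than from the average branching factor.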
Export to HuggingFace
uv run python scripts/export_hf_repo.py --run-dir logs/run_YYYYMMDD_HHMMSS
Converts a training run to HuggingFace repo format (safetensors + metrics). Finds best checkpoint by val loss.
Checkpoints
Pre-trained weights are hosted on HuggingFace and loaded directly by repo ID:
- `thomas-schweich/pawn-small` — 9.5M params, `CLMConfig.small()`
- `thomas-schweich/pawn-base` — 35.8M params, `CLMConfig.base()`
- `thomas-schweich/pawn-large` — 68.4M params, `CLMConfig.large()`
All scripts accept HF repo IDs for --checkpoint (e.g. --checkpoint thomas-schweich/pawn-base). Weights are downloaded and cached automatically via huggingface_hub.
Checkpoint Format (safetensors)
Checkpoints are directories, not single files:
step_00065000/
├── model.safetensors # model weights
├── optimizer.safetensors # flattened optimizer state
├── training_state.json # step, scheduler, scaler, RNG (base64)
├── config.json # model + training config
└── .complete # SHA-256 hashes of all files (integrity sentinel)
Central module: pawn/checkpoint.py. All save/load goes through this module.
Legacy .pt files are still loadable (backward compatible).
Checkpoint Storage Modes
All training scripts require one of:
- `--hf-repo REPO_ID` — push checkpoints to a HuggingFace branch as they're written (durable)
- `--local-checkpoints` — save locally only (for development without an HF account)
HF mode creates a run/{run_id} branch. HF pushes happen in background threads (one per model slot) so training is not blocked by uploads. Squash-merge into main when satisfied.
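The non-blocking push pattern can be sketched with a queue and a daemon worker thread (names are hypothetical and the upload itself is stubbed; real code would call the hub upload API inside the worker):

```python
import queue
import threading

# One background worker per model slot drains a queue of checkpoint
# directories, so the training loop never blocks on slow uploads.
push_queue: "queue.Queue[str | None]" = queue.Queue()
uploaded: list[str] = []

def push_worker() -> None:
    while True:
        ckpt_dir = push_queue.get()
        if ckpt_dir is None:  # sentinel: shut down the worker
            break
        uploaded.append(ckpt_dir)  # stub for the actual HF upload call

worker = threading.Thread(target=push_worker, daemon=True)
worker.start()

# The training loop just enqueues and keeps stepping:
push_queue.put("checkpoints/step_00005000")
push_queue.put("checkpoints/step_00010000")
push_queue.put(None)
worker.join()
```

A single worker per slot also serializes uploads, so checkpoints arrive on the branch in order even when training outpaces the network.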
Optional: --shm-checkpoints writes checkpoints to /dev/shm (RAM-backed filesystem, instant writes). Requires --hf-repo since /dev/shm is volatile. Old checkpoints are cleaned up after successful HF push, keeping only the latest and the best (by val loss) for post-training evals.
Data Integrity
Every checkpoint write is atomic: files are written to a .tmp directory, then renamed.
The .complete sentinel contains SHA-256 hashes of every file in the checkpoint.
Hashes are always verified on load — no exceptions.
- `IncompleteCheckpointError` — raised when the `.complete` sentinel is missing
- `CheckpointIntegrityError` — raised when any hash mismatches
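The sentinel write/verify flow can be sketched like this (a simplified stand-in for `pawn/checkpoint.py`, using stdlib exceptions in place of the custom error classes):

```python
import hashlib
import json
import os

def write_sentinel(ckpt_dir: str) -> None:
    """Hash every file in the checkpoint dir, then record the hashes
    in .complete as the final write."""
    hashes = {}
    for name in sorted(os.listdir(ckpt_dir)):
        with open(os.path.join(ckpt_dir, name), "rb") as f:
            hashes[name] = hashlib.sha256(f.read()).hexdigest()
    with open(os.path.join(ckpt_dir, ".complete"), "w") as f:
        json.dump(hashes, f)

def verify_sentinel(ckpt_dir: str) -> None:
    """Re-hash every recorded file and compare against the sentinel."""
    sentinel = os.path.join(ckpt_dir, ".complete")
    if not os.path.exists(sentinel):
        raise FileNotFoundError("missing .complete")  # cf. IncompleteCheckpointError
    with open(sentinel) as f:
        hashes = json.load(f)
    for name, expected in hashes.items():
        with open(os.path.join(ckpt_dir, name), "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual != expected:
            raise ValueError(f"hash mismatch: {name}")  # cf. CheckpointIntegrityError
```

Because the sentinel is written last, its mere presence proves every other file finished writing; the hashes additionally catch silent corruption.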
Never use kill -9 on training processes. SIGTERM is handled gracefully: a flag is set,
the training loop checks it between steps, saves a checkpoint, pushes to HF, and exits cleanly.
Never rsync checkpoint files from running pods. Checkpoints are pushed to HuggingFace
from the trainer. Load via HF repo ID (e.g. --checkpoint thomas-schweich/pawn-base).
RunPod Operations
Docker Image
A single Docker image (thomasschweich/pawn:latest) is automatically built and pushed to Docker Hub by CI on every merge to main. No manual builds needed.
The image is based on runpod/pytorch (CUDA + SSH + Jupyter) with all Python deps pre-installed. Code lives at /opt/pawn on pods. SSH in and run experiments directly.
To build locally (rarely needed):
docker build --platform linux/amd64 \
--build-arg GIT_HASH=$(git rev-parse HEAD) \
-t thomasschweich/pawn:latest .
Pod Lifecycle
Use deploy/pod.sh for all pod management. Requires runpodctl (wget -qO- cli.runpod.net | sudo bash).
# Create a pod
bash deploy/pod.sh create myexp --gpu h100
# SSH into it
bash deploy/pod.sh ssh myexp
# Launch training
bash deploy/pod.sh launch myexp scripts/train_all.py --hf-repo thomas-schweich/pawn-{variant}
# Stop (preserves volume, stops billing)
bash deploy/pod.sh stop myexp
# Delete (destroys everything)
bash deploy/pod.sh delete myexp
GPU shortcuts: a5000, a40, a6000, 4090, 5090, l40s, h100. Pod configs are cached in ~/.config/pawn/pods/<name>.env.
GPU Selection
Benchmarks from pretraining 3 models concurrently (train_all.py, batch=256):
| GPU | VRAM | $/hr | Step time | 100K cost | Notes |
|---|---|---|---|---|---|
| B200 | 192GB | $4.99 | 0.28s | ~$39 | Fastest |
| H200 SXM | 141GB | $3.59 | 0.34s | ~$34 | Best wall-clock/cost balance |
| RTX PRO 6000 | 48GB | $1.89 | 0.62s | ~$33 | Cheapest viable |
| A100 PCIe | 80GB | $1.39 | 0.79s | ~$30 | Cheapest overall |
| L40S | 48GB | $0.86 | 1.37s | ~$33 | Slow but cheap |
| RTX 5090/4090/3090 | 24-32GB | — | OOM | — | Insufficient VRAM for 3 models |
Total cost is remarkably consistent ($30-39) across viable GPUs, so the real tradeoff is wall-clock time versus dollars, not one GPU's cost versus another's. Single-model training fits on 24GB GPUs.
Required Pod Configuration
- Always attach a network volume. Checkpoints write to disk during atomic rename and HF push. Ephemeral container disk is lost on pod termination.
- Set `HF_TOKEN` as a pod environment variable for automatic HuggingFace authentication. The entrypoint persists it to `~/.cache/huggingface/token`.
- `PAWN_MODEL=thomas-schweich/pawn-base` — auto-pull a checkpoint on startup (runner target).
- `PAWN_CMD` — training command to execute (alternative to Docker CMD args).
Pod Safety
- Stop pods with `runpodctl pod stop` or `bash deploy/pod.sh stop` — sends SIGTERM; the trainer saves and pushes before exiting.
- Never `runpodctl pod delete` while training is running — data loss risk.
- Never `kill -9` training processes — use SIGTERM (plain `kill`), which triggers graceful shutdown.
- Never rsync checkpoint files from running pods — load via HF repo ID instead.
HuggingFace Bucket Backups
Use HF buckets (hf://buckets/...) to back up experiment data from pods. Buckets are not datasets or repos — they use the sync_bucket API, not upload_file with repo_type.
Key constraint: upload bandwidth from pods is ~1.8 MB/s. Checkpoints are ~430MB each (143MB model + 287MB optimizer). A full training run with 5K-step checkpoint intervals produces ~17GB per trial. Sync selectively.
During training — sync metrics only (instant):
from huggingface_hub import HfApi
api = HfApi(token=HF_TOKEN)
api.sync_bucket(
source="/workspace/logs",
dest="hf://buckets/OWNER/BUCKET/logs",
exclude=["*/checkpoints/*"],
)
For individual files (lab notes, transcripts) — stage in a temp dir:
import tempfile, shutil, os
with tempfile.TemporaryDirectory() as td:
shutil.copy("/workspace/runs/lab-notes.md", os.path.join(td, "lab-notes.md"))
api.sync_bucket(source=td, dest="hf://buckets/OWNER/BUCKET")
upload_file(..., repo_type="bucket") does not work — buckets are not a valid repo type for that API. Always use sync_bucket.
After training — sync only best/final checkpoints:
api.sync_bucket(
source="/workspace/logs/trial_XXXX/run_.../checkpoints/step_BEST",
dest="hf://buckets/OWNER/BUCKET/checkpoints/trial_XXXX/best",
)
HF_TOKEN: Stored in /opt/pawn/.env on pods. Source it or export before calling the API.
Monitoring Training Progress
Key Principle: Write Scripts to Disk for Pre-Approval
When setting up recurring monitoring, always write the monitoring script to a file first so the user can review and pre-approve it. This avoids repeated permission prompts when /loop fires.
Pattern:
1. Write a bash script to disk (e.g., `scripts/check_my_run.sh`)
2. User reviews and approves the script
3. Schedule with `/loop 15m bash scripts/check_my_run.sh`
Example monitoring script:
#!/usr/bin/env bash
# scripts/check_my_run.sh — monitor a specific training run
set -euo pipefail
bash /home/tas/pawn/scripts/monitor_training.sh <POD_ID>
Or for local-only monitoring:
#!/usr/bin/env bash
set -euo pipefail
bash /home/tas/pawn/scripts/check_progress.sh --sync
Available Monitoring Tools
| Tool | What it does |
|---|---|
| `scripts/monitor_training.sh [POD_ID]` | SSH to pod, sync metrics via rsync, show per-variant step/loss/acc/ETA, check HF checkpoint branches |
| `scripts/check_progress.sh [LOG_DIR]` | Show progress from the local `logs/` directory |
| `python -m pawn.dashboard --log-dir logs` | Solara web dashboard with interactive charts |
Dashboard
python -m pawn.dashboard --log-dir logs
Reads metrics.jsonl files, no dependency on training packages. Auto-detects run type from config fields. Shows loss curves, accuracy, LR schedules, GPU utilization, patience clocks, and adapter-specific diagnostics. Requires restart for code changes (no hot reload).
Logs
Training metrics in logs/ (gitignored). Each run gets a timestamped directory with metrics.jsonl and a random slug (e.g., run_20260325_140000_zesty-osprey/).
MetricsLogger (pawn/logging.py) writes one JSON object per line. Every record includes timestamp, step, elapsed time, and memory stats. Config records include hostname, git hash, git tag, and run slug.
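The JSONL format is trivial to write and read back, which is what makes the dashboard independent of the training stack. A minimal sketch in the spirit of `MetricsLogger` (field names here are illustrative, not the exact schema):

```python
import json
import time

def log_metric(path: str, step: int, **metrics) -> None:
    """Append one JSON object per line, stamped with step and wall time."""
    record = {"timestamp": time.time(), "step": step, **metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def read_metrics(path: str) -> list[dict]:
    """Parse a metrics.jsonl file back into a list of records."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Append-only writes mean a crash mid-run loses at most the last partial line, and any consumer can tail the file while training is still going.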
Hyperparameter Sweeps
Optuna integration via pawn/sweep.py and scripts/sweep.py:
uv run python scripts/sweep.py \
--adapter lora --n-trials 30 --n-jobs 2 --n-gpus 2 \
--total-steps 20000 --pruner hyperband \
--checkpoint thomas-schweich/pawn-base --pgn thomas-schweich/pawn-lichess-full \
--local-checkpoints
Supports all adapter types + architecture search. GPU affinity assigns CUDA_VISIBLE_DEVICES = trial.number % n_gpus. SQLite-backed study persistence. Pruner options: hyperband, median, none.
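The GPU-affinity rule is simple round-robin. A sketch of how an objective might pin each trial to a device (a stub trial object stands in for a real Optuna trial so the example has no optuna dependency; `assign_gpu` is a hypothetical helper, not the actual sweep code):

```python
import os
from types import SimpleNamespace

def assign_gpu(trial_number: int, n_gpus: int) -> str:
    """Round-robin affinity: trial.number % n_gpus, applied via env var
    before any CUDA context is created in the trial process."""
    gpu = str(trial_number % n_gpus)
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu
    return gpu

# Inside a real study this runs at the top of the Optuna objective;
# here stub trials demonstrate the mapping.
trials = [SimpleNamespace(number=i) for i in range(6)]
assignments = [assign_gpu(t.number, n_gpus=2) for t in trials]
```

With `--n-jobs 2 --n-gpus 2`, consecutive trial numbers alternate devices, so the two concurrent workers naturally land on different GPUs.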
Key Patterns & Gotchas
- DataLoader workers must use `multiprocessing_context='spawn'` — the Rust engine uses rayon, and fork after rayon init causes deadlocks.
- `SDPA_BACKEND` must be set before `torch.compile()` — compiled code captures the backend at trace time. `apply_gpu_config()` handles this.
- ROCm works: the only known ROCm issue is a stride mismatch in flash attention backward when combined with `torch.compile` + AMP. The workaround is `--sdpa-math` (use the MATH SDPA backend instead of flash), which `configure_gpu()` applies automatically on AMD GPUs. Everything else — training, eval, adapters, data loading — works identically on ROCm and CUDA. Do not assume bugs are ROCm-specific: every other time something has failed on AMD, it turned out to be a bug in our code (wrong torch version installed, stale lockfile, missing dependency, etc.), not a ROCm issue.
- Sparse logit projection: `forward_hidden()` returns `(B, T, d_model)`, then only loss-masked positions project through `lm_head` — avoids materializing the full `(B, T, V)` logits.
- Legal mask via Rust: `LegalMaskBuilder` replays games in Rust and returns sparse indices (~2 MB) scattered into a pre-allocated GPU buffer (vs ~70 MB dense).
- GPU auto-detection: `pawn.gpu.configure_gpu()` selects compile/AMP/SDPA settings; `apply_gpu_config()` applies them. NVIDIA uses flash attention + compile; AMD uses MATH SDPA + compile. Both paths are tested and production-validated.
- Factored embeddings: each move token decomposes into `src_embed[s] + dst_embed[d] + promo_embed[p]`, reducing embedding parameters by ~32x.
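The ~32x figure checks out with back-of-envelope arithmetic. Since every table shares the same `d_model`, the parameter ratio is just the row-count ratio; the 5-row promotion table is an assumed size for illustration (none + knight/bishop/rook/queen):

```python
# Full embedding table: one row per token in the 4,278-token vocabulary.
vocab_rows = 4278

# Factored tables: one row per source square, per destination square,
# and per promotion choice (5 rows assumed for illustration).
factored_rows = 64 + 64 + 5

reduction = vocab_rows / factored_rows  # row-count ratio = param ratio
```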
Current task
Pretraining Ablations: mate-boost, no-outcome, discard-ply-limit
You are running three pretraining ablation experiments on PAWN-Base. Each ablation modifies one aspect of the random game generation to understand its effect on the learned world model.
You are on a RunPod with 3x RTX 6000 Ada (48GB VRAM each, $0.77/hr each = $2.31/hr total). Use the pawn-lab MCP server and the manage-pod skill workflow.
As your first action, set up the check-in crons per the manage-pod skill, then proceed.
The Three Ablations
| Ablation | Config | Hypothesis |
|---|---|---|
| mate-boost | `mate_boost: 1.0` | Always taking mate-in-1 produces shorter, more decisive games (~134 avg ply vs ~238). More checkmate patterns, fewer aimless endgames. |
| no-outcome | `no_outcome_token: true` | Stripping the outcome token forces the model to infer game result from moves alone. Tests whether outcome conditioning helps or hurts. |
| discard-ply-limit | `discard_ply_limit: true` | Only naturally-ended games (no truncation at 255 plies). All games have meaningful endings. |
There is no separate baseline run — the existing PAWN-Base checkpoint (thomas-schweich/pawn-base) trained at 100K steps is the baseline reference.
Unified Config
All training uses the unified scripts/train.py with Pydantic RunConfig. Call lab_schema to see all available fields. Configs are JSON dicts passed to lab_launch.
Base config shared by all ablations:
{
"run_type": "pretrain",
"variant": "base",
"total_steps": 200000,
"batch_size": 256,
"lr": 3e-4,
"local_checkpoints": true,
"amp_dtype": "bfloat16",
"eval_interval": 5000,
"num_workers": 4
}
Each ablation adds its flag on top. Example for mate-boost:
{
"run_type": "pretrain",
"variant": "base",
"mate_boost": 1.0,
"total_steps": 200000,
"batch_size": 256,
"lr": 3e-4,
"local_checkpoints": true,
"amp_dtype": "bfloat16",
"eval_interval": 5000,
"num_workers": 4
}
Procedure
Phase 1: Learning Rate Exploration (~1 hour)
For each of the 3 ablations, launch 3 trials with different learning rates using pause_after_steps:
lab_launch(config={
"run_type": "pretrain",
"variant": "base",
"mate_boost": 1.0, # ← ablation flag
"total_steps": 200000,
"pause_after_steps": 5000, # ← pause for comparison
"batch_size": 256,
"lr": 1e-4, # ← vary this: 1e-4, 3e-4, 1e-3
"local_checkpoints": true,
"amp_dtype": "bfloat16",
"no_compile": true, # ← skip compile for short exploration
"eval_interval": 2500,
"num_workers": 4
})
That's 9 trials total (3 LRs × 3 ablations). With 3 GPUs, run 3 trials concurrently (one per GPU). Run each ablation's 3 LRs sequentially on its GPU, or interleave — the lab runner assigns GPUs automatically. Each pauses at 5K steps. Use no_compile: true to skip the 15-30 min compile overhead for these short runs.
After all 9 pause, call lab_results to compare val_loss at 5K steps. Pick the best LR for each ablation.
Phase 2: Full Training (~22 hours)
Resume the best LR for each ablation to 200K steps with torch.compile enabled:
lab_resume(trial_id=BEST_MATE_BOOST) # no pause → runs to total_steps
lab_resume(trial_id=BEST_NO_OUTCOME)
lab_resume(trial_id=BEST_DISCARD_PLY)
lab_resume clears pause_after_steps by default, so the resumed trials run to completion at total_steps=200000. The resumed trials pick up from the 5K-step checkpoint with optimizer state intact.
Important: Do NOT pass no_compile: true for the resumed runs. Let torch.compile run — the 15-30 min compile overhead amortizes over 195K remaining steps.
With 1 model per GPU (no sharing), expect ~0.4s/step → ~22 hours for 195K steps. All 3 run in parallel. Total cost: ~$51.
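Spelled out, that estimate is straightforward arithmetic on the numbers above (the small gap to the quoted ~$51 is rounding plus compile overhead):

```python
steps_remaining = 195_000        # 200K total minus the 5K exploration steps
sec_per_step = 0.4               # observed with 1 model per GPU
cost_per_hour = 2.31             # 3x RTX 6000 Ada at $0.77/hr each

hours = steps_remaining * sec_per_step / 3600  # wall-clock, runs in parallel
cost = hours * cost_per_hour                   # all 3 GPUs bill concurrently
```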
Phase 3: Evaluation
After all 3 runs converge, compare:
- Final val_loss and accuracy vs baseline (PAWN-Base at 100K steps)
- Learning curves (loss vs step for all 3, plotted together)
- Per-ply accuracy via
scripts/eval_accuracy.pyif time permits
Monitoring
Use the manage-pod skill check-in pattern:
- Mini check-in (every 5 min): `lab_events` — launch the next trial if a GPU is idle
- Full check-in (hourly): `lab_status` + `lab_log` for each running trial + update lab notes
During Phase 1 (9 short trials), use 5-min check-ins — trials finish fast. During Phase 2 (3 long runs), switch to hourly after confirming stability.
What to watch for
- NaN loss: kill immediately and retry with lr/3. Use `lab_log` to check.
- discard-ply-limit throughput: this ablation discards ~60% of generated games. If step times are much slower than the other two, reduce `num_workers` to free CPU for data generation.
- mate-boost game lengths: check `lab_log` — mate-boost games are ~134 plies vs ~238 normal. Each batch processes more games, which may cause higher memory usage. If OOM, reduce batch_size to 128.
- GPU assignment: the lab runner assigns one trial per GPU via CUDA_VISIBLE_DEVICES. If a trial fails and a GPU sits idle, launch a replacement promptly.
Cost Budget
| Phase | Trials | Time | Cost |
|---|---|---|---|
| LR exploration | 9 × 5K steps | ~1h | ~$3 |
| Full training | 3 × 195K steps (parallel) | ~22h | ~$51 |
| Total | | ~23h | ~$53 |
Set cost tracking: lab_set_cost(cost_per_hour=2.31)
Lab Notes
Write experiment state to runs/lab-notes.md (survives context compaction via PostCompact hook). Include:
- Phase 1 results table (LR × ablation → val_loss at 5K steps)
- Which LR was selected for each ablation and why
- Phase 2 progress (step, loss, ETA per ablation)
- Any anomalies or adjustments made
Current State (as of 2026-04-03 19:30 UTC)
Phase 2: Full Training (RUNNING)
Three ablation runs training to 200K steps, torch.compile ON, resumed from 5K:
| Trial | GPU | Ablation | LR | Step (~19:00) | train_loss | acc |
|---|---|---|---|---|---|---|
| 24 | 0 | mate-boost | 1e-3 | ~120K | 3.19 | 6.8% |
| 25 | 1 | no-outcome | 3e-4 (no warmup) | ~116K | 3.09 | 7.0% |
| 26 | 2 | discard-ply-limit | 1e-3 | ~119K | 3.18 | 7.5% |
- ETA: 2026-04-04 04:30 UTC (9h remaining)
- Hourly cron monitors via `lab_events` + `lab_log`
- Baseline ref: PAWN-Base val_loss ~3.06, acc ~6.9% at 100K steps
Concurrent: Accuracy Ceiling Computation (RUNNING)
Running scripts/compute_theoretical_ceiling.py on spare CPUs (RAYON_NUM_THREADS=30) while GPUs train:
RAYON_NUM_THREADS=30 PAWN_ALLOW_CPU=1 /opt/pawn/.venv/bin/python \
scripts/compute_theoretical_ceiling.py \
--n-games 5000 --rollouts 1024 --sample-rate 0.075 \
--model-accuracy 0.069 --output /workspace/data/theoretical_ceiling_1024.json
- 1024 rollouts/move (8x previous), expected to narrow bias bracket from 0.66pp to ~0.23pp
- Processes in 10 batches of 500 games, printing intermediate estimates
- ETA: ~3-4 hours (finishes well before training)
- Output: `/workspace/data/theoretical_ceiling_1024.json`
Bug Fixes Applied This Session
- `pause_after_steps`: was missing from `trainer.py` (only in adapter training). Added to `TrainingConfig` and the `trainer.py` train loop. Wired through `scripts/train.py`.
- `lab_resume` checkpoint discovery: was looking for `checkpoints/best` or `checkpoints/final` only. Pretraining uses `step_XXXXXXXX` naming. Added a fallback that picks the highest `step_*` dir.
HF Bucket Backup
All experiment data syncs to hf://buckets/thomas-schweich/pretraining-ablations:
- Metrics/configs synced every ~4 hours (excluding checkpoints)
- 100K-step checkpoints for all 3 ablations uploaded
- Lab notes + chat transcript synced periodically
- HF_TOKEN in `/opt/pawn/.env`
- See "HuggingFace Bucket Backups" section above for API patterns
Lab Notes
Detailed log at /workspace/runs/lab-notes.md (symlinked from /opt/pawn/runs/lab-notes.md). Survives context compaction via PostCompact hook. Read this first when resuming.
Post-Training Procedure
When each training run completes (in any order):
- Upload best checkpoint to HF bucket in the background (see bucket sync patterns above)
- Run linear probes on the best checkpoint in the background, using the GPU that just freed up so other completions aren't blocked:

  PAWN_ALLOW_CPU=1 /opt/pawn/.venv/bin/python scripts/eval_probes.py \
    --checkpoint /workspace/logs/trial_XXXX/run_.../checkpoints/step_BEST \
    --log-dir /workspace/eval_probes/trial_XXXX --device cuda:N

- After the last probe run is kicked off, draft the final report comparing all 3 ablations + baseline:
- val_loss and accuracy curves
- Probe results (piece type, check, castling, material, game phase)
- Accuracy ceiling comparison
- Key findings and surprises
- Update the report with probe results as they finish
- Upload everything to HF bucket: final checkpoints, probe results, report, lab notes
Success Criteria
- All three ablations train to 200K steps without divergence
- Clear comparison of val_loss and accuracy across ablations
- At least one ablation shows meaningful difference from baseline (better or worse — both are informative)