
PAWN (Playstyle-Agnostic World-model Network for Chess)

A causal transformer trained on random chess games, designed as a testbed for finetuning and augmentation methods at small scales. Apache 2.0.

Repository Structure

pawn/
├── engine/          # Rust chess engine with PyO3 bindings (via shakmaty)
├── pawn/            # Core Python package
│   ├── config.py    # CLMConfig (small/base/large), TrainingConfig
│   ├── model.py     # PAWNCLM transformer (RMSNorm, SwiGLU, RoPE, factored embeddings)
│   ├── data.py      # On-the-fly random game data pipeline
│   ├── lichess_data.py  # Lichess PGN data pipeline + legal mask computation
│   ├── trainer.py   # Pretraining loop
│   ├── gpu.py       # GPU auto-detection (compile/AMP/SDPA backend)
│   ├── logging.py   # MetricsLogger (JSONL output)
│   ├── checkpoint.py # Atomic save/load, .complete sentinel, HF push
│   ├── adapters/    # Bottleneck, LoRA, FiLM, sparse, hybrid
│   ├── eval_suite/  # Probes, generation tests, diagnostics, lichess eval
│   └── dashboard/   # Solara training dashboard (metrics, charts, runner)
├── scripts/         # Training and evaluation entry points
├── tests/           # Unit tests
├── deploy/          # Runpod deployment scripts
└── docs/            # Architecture, training, adapter docs

Building

This is a uv workspace. The root project is the pawn Python package; engine/ is the sole workspace member.

# Build the Rust chess engine (required before anything else)
cd engine && uv run --with maturin maturin develop --release && cd ..

# Install Python deps (dev tools like pytest, seaborn, solara are in base dependencies):
uv sync --extra rocm      # AMD (ROCm 7.1)
uv sync --extra cu128     # NVIDIA (CUDA 12.8)

# Run tests
uv run pytest tests/

# Pretrain from scratch (local dev)
uv run python scripts/train.py --variant base --local-checkpoints

The only extras are GPU backends (rocm or cu128). Everything else (pytest, solara, optuna, seaborn, etc.) is in base dependencies. PyTorch lives in the extras because uv can't resolve CPU/CUDA/ROCm from a single lockfile — always specify --extra rocm or --extra cu128.

GPU requirement: configure_gpu() (called by every training and eval script) raises RuntimeError if no CUDA/ROCm GPU is detected. This prevents accidentally running GPU workloads on CPU, which is almost always a mistake. The environment variable PAWN_ALLOW_CPU=1 overrides this check as a last resort for the rare case where CPU execution is genuinely intended (e.g. a lightweight backfill script). Unit tests do not call configure_gpu() and run fine on CPU without the override.
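The guard-plus-escape-hatch pattern can be sketched as follows (a hypothetical simplification; the real check lives in pawn/gpu.py and inspects the actual torch device state rather than taking a flag):

```python
import os

def require_gpu(gpu_available: bool) -> None:
    """Raise unless a GPU is present or the escape hatch is set.

    Hypothetical sketch of the configure_gpu() guard; the real code
    checks CUDA/ROCm availability itself instead of taking a flag.
    """
    if gpu_available:
        return
    if os.environ.get("PAWN_ALLOW_CPU") == "1":
        return  # explicit opt-in to CPU execution
    raise RuntimeError(
        "No CUDA/ROCm GPU detected. Set PAWN_ALLOW_CPU=1 to run on CPU anyway."
    )
```

The environment variable makes the override deliberate and greppable, rather than a flag that might be left on in a script by accident.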

Engine (engine/)

Single source of truth for all chess logic. All game simulation, move generation, legality checks, tokenization, PGN parsing, and board state extraction happen in Rust. No Python chess libraries.

  • Uses rayon for parallel game generation (~43K games/sec, 150M+/hr)
  • PyO3 bindings expose chess_engine module to Python
  • Key functions: generate_random_games(), parse_pgn_file(), compute_legal_token_masks_sparse(), extract_board_states(), export_move_vocabulary(), compute_accuracy_ceiling()

Model

Architecture

  • Decoder-only transformer, next-token prediction over 4,278 tokens
  • Token vocabulary: 1 PAD + 4,096 grid (64x64 src/dst) + 176 promotions + 5 outcomes
  • Factored embeddings: src_embed[s] + dst_embed[d] + promo_embed[p]
  • Sequence format: [outcome] [ply_1] ... [ply_N] [PAD] ... [PAD] (256 tokens)
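The vocabulary arithmetic above can be made concrete with a small sketch. The exact ordering of the token regions (PAD first, then grid, promotions, outcomes) and the src*64 + dst grid layout are assumptions for illustration; only the region sizes come from the breakdown above:

```python
# Hypothetical token layout consistent with the vocab arithmetic:
# 1 PAD + 64*64 grid tokens + 176 promotion tokens + 5 outcome tokens = 4,278.
PAD = 0
GRID_BASE = 1                       # tokens 1..4096: src*64 + dst (assumed order)
PROMO_BASE = GRID_BASE + 64 * 64    # tokens 4097..4272
OUTCOME_BASE = PROMO_BASE + 176     # tokens 4273..4277
VOCAB_SIZE = OUTCOME_BASE + 5

def grid_token(src: int, dst: int) -> int:
    """Map a (source square, destination square) pair to its grid token id."""
    assert 0 <= src < 64 and 0 <= dst < 64
    return GRID_BASE + src * 64 + dst
```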

Variants

  • CLMConfig.small(): d=256, 8 layers, 4 heads, ~9.5M params
  • CLMConfig.base(): d=512, 8 layers, 8 heads, ~35.8M params (default)
  • CLMConfig.large(): d=640, 10 layers, 8 heads, ~68.4M params
  • CLMConfig.toy(): d=64, 2 layers, for tests only

Training

All training scripts require one of --hf-repo REPO_ID or --local-checkpoints (mutually exclusive). Use --local-checkpoints for local dev; use --hf-repo for any run where you need durable checkpoints.

Pretraining

# Single model
uv run python scripts/train.py --variant base --local-checkpoints

# All three variants simultaneously (shared data batches, sequential GPU)
uv run python scripts/train_all.py --local-checkpoints

# Resume from checkpoint
uv run python scripts/train.py --variant base --resume checkpoints/step_00050000 --local-checkpoints

scripts/train.py key args:

  • --variant {small|base|large|toy} — model size (default: base)
  • --resume PATH — resume from checkpoint directory
  • --total-steps N — training steps (default: 100,000)
  • --batch-size N — batch size (default: 256)
  • --discard-ply-limit — only train on naturally-ended games (no ply-limit truncation)
  • Architecture overrides: --d-model, --n-layers, --n-heads, --d-ff, --lr, --weight-decay, --warmup-steps

scripts/train_all.py additional args:

  • --shm-checkpoints — write checkpoints to /dev/shm (requires --hf-repo, volatile)
  • --run-evals — auto-run probes + diagnostics after training completes
  • --publish-results — push eval results to HF
  • --patience N — per-model early stopping patience (eval intervals without improvement)

Adapter Training

All adapter scripts require --checkpoint PATH (pretrained weights) and --pgn PATH (Lichess PGN file). They freeze the backbone and train only adapter parameters.

# Example: train a LoRA adapter on Lichess 1800-1900 games
uv run python scripts/train_lora.py \
    --checkpoint thomas-schweich/pawn-base \
    --pgn thomas-schweich/pawn-lichess-full --elo-min 1800 --elo-max 1900 \
    --lora-rank 4 --lr 3e-4 --local-checkpoints
| Script | Adapter | Key args | Typical params |
|---|---|---|---|
| train_bottleneck.py | Houlsby MLP | --bottleneck-dim 8 | ~131K |
| train_lora.py | Low-rank attention | --lora-rank 4 --lora-targets qkvo | ~65K |
| train_film.py | Channel-wise affine | --no-output-film | ~17K |
| train_sparse.py | Binary mask | --density 0.01 --sparse-targets qkvo | ~503K-2.7M |
| train_hybrid.py | LoRA + FiLM | --lora-rank 4 --film-lr 1e-3 | ~65K |
| train_tiny.py | None (from scratch) | --d-model 84 --n-layers 2 | ~524K |

Common adapter args: --epochs 50, --batch-size 64, --lr 3e-4, --patience 10, --val-every 1, --max-games 12000, --min-ply 10
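The low-rank idea behind the LoRA adapter can be sketched in a few lines of numpy. This is an illustration of the technique, not the project's implementation; the matrix names and the alpha scaling convention are generic LoRA, and the per-matrix parameter count is just the 2*r*d arithmetic:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x @ (W + alpha * A @ B): frozen base weight W plus a trainable
    low-rank update A (d_in x r) @ B (r x d_out)."""
    return x @ W + alpha * (x @ A) @ B

rng = np.random.default_rng(0)
d, r = 512, 4
W = rng.standard_normal((d, d))          # frozen backbone weight
A = rng.standard_normal((d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d))                     # trainable up-projection, zero-init

x = rng.standard_normal((2, d))
# With B zero-initialized, LoRA starts as an exact no-op on the backbone.
assert np.allclose(lora_forward(x, W, A, B), x @ W)

params_per_matrix = 2 * r * d  # 4,096 trainable params per adapted 512x512 matrix
```

Zero-initializing B is what lets adapter training start from the pretrained model's exact behavior.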

Common CLI Patterns

  • --sdpa-math — force MATH SDPA backend (required for ROCm + torch.compile)
  • --no-compile — disable torch.compile
  • --no-amp — disable mixed precision
  • --num-workers N — DataLoader workers (default: 8 for adapters, 4 for pretraining)
  • --device {cuda|cpu} — device selection
  • --wandb — enable Weights & Biases logging

Evaluation & Metrics

Linear Probes

uv run python scripts/eval_probes.py --log-dir logs --device cuda

Trains linear probes on frozen hidden states to measure internal representations (piece type, check status, castling rights, material count, game phase, etc.). Args: --n-games 4096, --n-val-games 1024, --n-epochs 20, --run RUN_NAME (specific run).
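The probing idea — fit only a linear readout on frozen features and use held-out accuracy as a measure of what the representation encodes — can be sketched on synthetic data. This is not the repo's eval code; the feature matrix and the binary label stand in for real hidden states and a real board property:

```python
import numpy as np

# Minimal linear-probe sketch: the "model" is never updated, only the readout.
rng = np.random.default_rng(0)
d, n = 32, 2000
w_true = rng.standard_normal(d)

H = rng.standard_normal((n, d))         # stand-in for frozen hidden states
y = (H @ w_true > 0).astype(float)      # stand-in binary property, e.g. "in check"

H_tr, H_te = H[:1500], H[1500:]
y_tr, y_te = y[:1500], y[1500:]

# Least-squares linear probe on centered labels.
w, *_ = np.linalg.lstsq(H_tr, y_tr - 0.5, rcond=None)
acc = ((H_te @ w > 0) == (y_te > 0.5)).mean()  # high iff linearly decodable
```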

Move Prediction Accuracy

uv run python scripts/eval_accuracy.py \
    --checkpoint thomas-schweich/pawn-base \
    --pgn thomas-schweich/pawn-lichess-full --elo-min 1800 --elo-max 1900 \
    --adapter-checkpoint logs/run_*/checkpoints/best

MAIA-compatible evaluation with per-phase and per-ply accuracy. Args: --min-eval-ply 10, --max-games 50000, --per-ply.

Theoretical Accuracy Ceilings

uv run python scripts/compute_theoretical_ceiling.py

Computes theoretical accuracy ceilings for random games via Monte Carlo rollouts: unconditional (E[1/N_legal]), naive-conditioned (1-ply filter), and MC-conditioned (Bayes-optimal with outcome knowledge). Reports a bias bracket (naive vs split-half corrected estimates) and bootstrap 95% CIs clustered by game. CPU-intensive.
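The unconditional ceiling is the easiest of the three to see: with N legal moves and uniformly random play, the best possible single-position hit rate is 1/N, so the ceiling is E[1/N_legal]. A toy Monte Carlo sketch (the legal-move counts here are synthetic stand-ins, not real chess statistics):

```python
import random

random.seed(0)

def rollout_legal_counts(n_positions: int):
    # Pretend each position has between 1 and 40 legal moves (illustrative only).
    return [random.randint(1, 40) for _ in range(n_positions)]

counts = rollout_legal_counts(100_000)
# E[1/N_legal]: average of the per-position best-case hit rates.
ceiling = sum(1.0 / n for n in counts) / len(counts)
```

The conditioned variants are harder because they require rolling out games to estimate move probabilities given the outcome, which is where the Monte Carlo cost comes from.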

Export to HuggingFace

uv run python scripts/export_hf_repo.py --run-dir logs/run_YYYYMMDD_HHMMSS

Converts a training run to HuggingFace repo format (safetensors + metrics). Finds best checkpoint by val loss.

Checkpoints

Pre-trained weights are hosted on HuggingFace and loaded directly by repo ID:

  • thomas-schweich/pawn-small — 9.5M params, CLMConfig.small()
  • thomas-schweich/pawn-base — 35.8M params, CLMConfig.base()
  • thomas-schweich/pawn-large — 68.4M params, CLMConfig.large()

All scripts accept HF repo IDs for --checkpoint (e.g. --checkpoint thomas-schweich/pawn-base). Weights are downloaded and cached automatically via huggingface_hub.

Checkpoint Format (safetensors)

Checkpoints are directories, not single files:

step_00065000/
├── model.safetensors        # model weights
├── optimizer.safetensors    # flattened optimizer state
├── training_state.json      # step, scheduler, scaler, RNG (base64)
├── config.json              # model + training config
└── .complete                # SHA-256 hashes of all files (integrity sentinel)

Central module: pawn/checkpoint.py. All save/load goes through this module. Legacy .pt files are still loadable (backward compatible).

Checkpoint Storage Modes

All training scripts require one of:

  • --hf-repo REPO_ID — push checkpoints to a HuggingFace branch as they're written (durable)
  • --local-checkpoints — save locally only (for development without an HF account)

HF mode creates a run/{run_id} branch. HF pushes happen in background threads (one per model slot) so training is not blocked by uploads. Squash-merge into main when satisfied.

Optional: --shm-checkpoints writes checkpoints to /dev/shm (RAM-backed filesystem, instant writes). Requires --hf-repo since /dev/shm is volatile. Old checkpoints are cleaned up after successful HF push, keeping only the latest and the best (by val loss) for post-training evals.
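The keep-latest-plus-best cleanup policy can be sketched as a pure function (a hypothetical simplification of the /dev/shm pruning; the real code also confirms the HF push succeeded before deleting anything):

```python
def prune_checkpoints(ckpt_dirs, val_losses):
    """Return the checkpoint dirs to keep: the latest step and the best
    val loss. Zero-padded step_XXXXXXXX names sort correctly as strings."""
    latest = max(ckpt_dirs)
    best = min(ckpt_dirs, key=lambda d: val_losses[d])
    return {latest, best}

losses = {"step_00010000": 3.4, "step_00015000": 3.1, "step_00020000": 3.2}
keep = prune_checkpoints(list(losses), losses)
# Everything outside `keep` is safe to delete after a successful HF push.
```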

Data Integrity

Every checkpoint write is atomic: files are written to a .tmp directory, then renamed. The .complete sentinel contains SHA-256 hashes of every file in the checkpoint. Hashes are always verified on load — no exceptions.

  • IncompleteCheckpointError — raised when .complete sentinel is missing
  • CheckpointIntegrityError — raised when any hash mismatches
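The write-then-rename-plus-sentinel scheme can be sketched as follows. This is an assumption-level simplification of pawn/checkpoint.py: the real module raises the typed errors above and handles optimizer state, but the atomicity and hashing structure is the same idea:

```python
import hashlib
import json
import os

def write_checkpoint(files: dict, dest: str) -> None:
    """Write files to dest atomically: stage in a .tmp dir, record SHA-256
    hashes in a .complete sentinel, then rename into place."""
    tmp = dest + ".tmp"
    os.makedirs(tmp, exist_ok=True)
    hashes = {}
    for name, data in files.items():
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(data)
        hashes[name] = hashlib.sha256(data).hexdigest()
    with open(os.path.join(tmp, ".complete"), "w") as f:
        json.dump(hashes, f)
    os.rename(tmp, dest)  # atomic on POSIX: readers see all files or none

def verify_checkpoint(dest: str) -> bool:
    sentinel = os.path.join(dest, ".complete")
    if not os.path.exists(sentinel):
        return False  # IncompleteCheckpointError in the real module
    with open(sentinel) as f:
        hashes = json.load(f)
    for name, digest in hashes.items():
        with open(os.path.join(dest, name), "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != digest:
                return False  # CheckpointIntegrityError in the real module
    return True
```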

Never use kill -9 on training processes. SIGTERM is handled gracefully: a flag is set, the training loop checks it between steps, saves a checkpoint, pushes to HF, and exits cleanly.
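The graceful-shutdown flag pattern looks roughly like this (an illustrative sketch, not the trainer's code; the real loop also saves a checkpoint and pushes to HF before exiting):

```python
import signal

stop_requested = False

def _handle_sigterm(signum, frame):
    # Do no real work in the handler; just record the request.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, _handle_sigterm)

def train_loop(max_steps: int) -> int:
    """Run up to max_steps; stop between steps if SIGTERM arrived."""
    for step in range(max_steps):
        if stop_requested:
            # save_checkpoint(); push_to_hf()  # done here in the real trainer
            return step
        pass  # one training step
    return max_steps
```

kill -9 bypasses this entirely (SIGKILL cannot be caught), which is why it risks losing un-pushed checkpoints.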

Never rsync checkpoint files from running pods. Checkpoints are pushed to HuggingFace from the trainer. Load via HF repo ID (e.g. --checkpoint thomas-schweich/pawn-base).

RunPod Operations

Docker Image

A single Docker image (thomasschweich/pawn:latest) is automatically built and pushed to Docker Hub by CI on every merge to main. No manual builds needed.

The image is based on runpod/pytorch (CUDA + SSH + Jupyter) with all Python deps pre-installed. Code lives at /opt/pawn on pods. SSH in and run experiments directly.

To build locally (rarely needed):

docker build --platform linux/amd64 \
    --build-arg GIT_HASH=$(git rev-parse HEAD) \
    -t thomasschweich/pawn:latest .

Pod Lifecycle

Use deploy/pod.sh for all pod management. Requires runpodctl (wget -qO- cli.runpod.net | sudo bash).

# Create a pod
bash deploy/pod.sh create myexp --gpu h100

# SSH into it
bash deploy/pod.sh ssh myexp

# Launch training
bash deploy/pod.sh launch myexp scripts/train_all.py --hf-repo thomas-schweich/pawn-{variant}

# Stop (preserves volume, stops billing)
bash deploy/pod.sh stop myexp

# Delete (destroys everything)
bash deploy/pod.sh delete myexp

GPU shortcuts: a5000, a40, a6000, 4090, 5090, l40s, h100. Pod configs are cached in ~/.config/pawn/pods/<name>.env.

GPU Selection

Benchmarks from pretraining 3 models concurrently (train_all.py, batch=256):

| GPU | VRAM | $/hr | Step time | 100K cost | Notes |
|---|---|---|---|---|---|
| B200 | 192GB | $4.99 | 0.28s | ~$39 | Fastest |
| H200 SXM | 80GB | $3.59 | 0.34s | ~$34 | Best wall-clock/cost balance |
| RTX PRO 6000 | 48GB | $1.89 | 0.62s | ~$33 | Cheapest viable |
| A100 PCIe | 80GB | $1.39 | 0.79s | ~$30 | Cheapest overall |
| L40S | 48GB | $0.86 | 1.37s | ~$33 | Slow but cheap |
| RTX 5090/4090/3090 | 24-32GB | | OOM | | Insufficient VRAM for 3 models |

Total cost is remarkably consistent ($30-39) across viable GPUs, so the real tradeoff is wall-clock time against hourly spend, not total cost. Single-model training fits on 24GB GPUs.

Required Pod Configuration

  • Always attach a network volume. Checkpoints write to disk during atomic rename and HF push. Ephemeral container disk is lost on pod termination.
  • Set HF_TOKEN as a pod environment variable for automatic HuggingFace authentication. The entrypoint persists it to ~/.cache/huggingface/token.
  • PAWN_MODEL=thomas-schweich/pawn-base — auto-pull a checkpoint on startup (runner target).
  • PAWN_CMD — training command to execute (alternative to Docker CMD args).

Pod Safety

  • Stop pods with runpodctl pod stop or bash deploy/pod.sh stop — sends SIGTERM, trainer saves and pushes before exiting.
  • Never runpodctl pod delete while training is running — data loss risk.
  • Never kill -9 training processes — use SIGTERM (plain kill), which triggers graceful shutdown.
  • Never rsync checkpoint files from running pods — load via HF repo ID instead.

HuggingFace Bucket Backups

Use HF buckets (hf://buckets/...) to back up experiment data from pods. Buckets are not datasets or repos — they use the sync_bucket API, not upload_file with repo_type.

Key constraint: upload bandwidth from pods is ~1.8 MB/s. Checkpoints are ~430MB each (143MB model + 287MB optimizer). A full training run with 5K-step checkpoint intervals produces ~17GB per trial. Sync selectively.

During training — sync metrics only (instant):

from huggingface_hub import HfApi
api = HfApi(token=HF_TOKEN)
api.sync_bucket(
    source="/workspace/logs",
    dest="hf://buckets/OWNER/BUCKET/logs",
    exclude=["*/checkpoints/*"],
)

For individual files (lab notes, transcripts) — stage in a temp dir:

import tempfile, shutil, os
with tempfile.TemporaryDirectory() as td:
    shutil.copy("/workspace/runs/lab-notes.md", os.path.join(td, "lab-notes.md"))
    api.sync_bucket(source=td, dest="hf://buckets/OWNER/BUCKET")

upload_file(..., repo_type="bucket") does not work — buckets are not a valid repo type for that API. Always use sync_bucket.

After training — sync only best/final checkpoints:

api.sync_bucket(
    source="/workspace/logs/trial_XXXX/run_.../checkpoints/step_BEST",
    dest="hf://buckets/OWNER/BUCKET/checkpoints/trial_XXXX/best",
)

HF_TOKEN: Stored in /opt/pawn/.env on pods. Source it or export before calling the API.

Monitoring Training Progress

Key Principle: Write Scripts to Disk for Pre-Approval

When setting up recurring monitoring, always write the monitoring script to a file first so the user can review and pre-approve it. This avoids repeated permission prompts when /loop fires.

Pattern:

  1. Write a bash script to disk (e.g., scripts/check_my_run.sh)
  2. User reviews and approves the script
  3. Schedule with /loop 15m bash scripts/check_my_run.sh

Example monitoring script:

#!/usr/bin/env bash
# scripts/check_my_run.sh — monitor a specific training run
set -euo pipefail
bash /home/tas/pawn/scripts/monitor_training.sh <POD_ID>

Or for local-only monitoring:

#!/usr/bin/env bash
set -euo pipefail
bash /home/tas/pawn/scripts/check_progress.sh --sync

Available Monitoring Tools

| Tool | What it does |
|---|---|
| scripts/monitor_training.sh [POD_ID] | SSH to pod, sync metrics via rsync, show per-variant step/loss/acc/ETA, check HF checkpoint branches |
| scripts/check_progress.sh [LOG_DIR] | Show progress from local logs/ directory |
| python -m pawn.dashboard --log-dir logs | Solara web dashboard with interactive charts |

Dashboard

python -m pawn.dashboard --log-dir logs

Reads metrics.jsonl files, no dependency on training packages. Auto-detects run type from config fields. Shows loss curves, accuracy, LR schedules, GPU utilization, patience clocks, and adapter-specific diagnostics. Requires restart for code changes (no hot reload).

Logs

Training metrics in logs/ (gitignored). Each run gets a timestamped directory with metrics.jsonl and a random slug (e.g., run_20260325_140000_zesty-osprey/).

MetricsLogger (pawn/logging.py) writes one JSON object per line. Every record includes timestamp, step, elapsed time, and memory stats. Config records include hostname, git hash, git tag, and run slug.
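The JSONL format is trivially round-trippable, which is what lets the dashboard run without the training packages. A sketch of writing and reading records (field names here are illustrative, not MetricsLogger's exact schema):

```python
import io
import json
import time

# Write: one self-describing JSON object per line.
buf = io.StringIO()  # stands in for an open metrics.jsonl file
for step in (100, 200):
    record = {"timestamp": time.time(), "step": step, "loss": 4.0 - step / 1000}
    buf.write(json.dumps(record) + "\n")

# Read: each line parses independently, so partial files are still usable.
buf.seek(0)
records = [json.loads(line) for line in buf]
last_step = records[-1]["step"]
```

Line-oriented JSON also means a crashed run's log is readable up to the last complete line, with no trailer to repair.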

Hyperparameter Sweeps

Optuna integration via pawn/sweep.py and scripts/sweep.py:

uv run python scripts/sweep.py \
    --adapter lora --n-trials 30 --n-jobs 2 --n-gpus 2 \
    --total-steps 20000 --pruner hyperband \
    --checkpoint thomas-schweich/pawn-base --pgn thomas-schweich/pawn-lichess-full \
    --local-checkpoints

Supports all adapter types + architecture search. GPU affinity assigns CUDA_VISIBLE_DEVICES = trial.number % n_gpus. SQLite-backed study persistence. Pruner options: hyperband, median, none.
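The GPU affinity rule is a one-liner; a sketch of how a sweep runner might apply it when spawning a trial's process (the environment-dict construction is illustrative, not pawn/sweep.py's code):

```python
import os

def gpu_for_trial(trial_number: int, n_gpus: int) -> str:
    """Round-robin GPU affinity: trial N runs on GPU N % n_gpus."""
    return str(trial_number % n_gpus)

# Build the child environment before spawning the trial's worker process;
# CUDA_VISIBLE_DEVICES makes the assigned GPU the only one the trial sees.
env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_for_trial(5, 2))
```

With n_jobs equal to n_gpus, this keeps exactly one trial per GPU without any coordination beyond the trial number.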

Key Patterns & Gotchas

  • DataLoader workers must use multiprocessing_context='spawn' — the Rust engine uses rayon, and fork after rayon init causes deadlocks.
  • SDPA_BACKEND must be set before torch.compile() — compiled code captures the backend at trace time. apply_gpu_config() handles this.
  • ROCm works: The only known ROCm issue is a stride mismatch in flash attention backward when combined with torch.compile + AMP. The workaround is --sdpa-math (use the MATH SDPA backend instead of flash), which configure_gpu() applies automatically on AMD GPUs. Everything else — training, eval, adapters, data loading — works identically on ROCm and CUDA. Do not assume bugs are ROCm-specific. Every other time something has failed on AMD it turned out to be a bug in our code (wrong torch version installed, stale lockfile, missing dependency, etc.), not a ROCm issue.
  • Sparse logit projection: forward_hidden() returns (B,T,d_model), then only loss-masked positions project through lm_head — avoids full (B,T,V) materialization.
  • Legal mask via Rust: LegalMaskBuilder replays games in Rust, returns sparse indices (~2 MB) scattered into a pre-allocated GPU buffer (vs ~70 MB dense).
  • GPU auto-detection: pawn.gpu.configure_gpu() selects compile/AMP/SDPA settings. apply_gpu_config() applies them. NVIDIA uses flash attention + compile; AMD uses MATH SDPA + compile. Both paths are tested and production-validated.
  • Factored embeddings: each move token decomposes into src_embed[s] + dst_embed[d] + promo_embed[p], reducing embedding parameters by ~32x.
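The factored-embedding arithmetic can be checked directly. The sketch below uses the table sizes implied by the vocabulary breakdown (64 source squares, 64 destination squares, a handful of promotion kinds); the 5-row promotion table is an assumption for illustration, and the ~32x figure falls out of the row-count ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
src_embed = rng.standard_normal((64, d))
dst_embed = rng.standard_normal((64, d))
promo_embed = rng.standard_normal((5, d))  # e.g. none/N/B/R/Q (assumed)

def embed_move(src: int, dst: int, promo: int):
    """A move token's embedding is the sum of three small table lookups,
    instead of one row of a full 4,278 x d embedding matrix."""
    return src_embed[src] + dst_embed[dst] + promo_embed[promo]

full_rows = 4278
factored_rows = 64 + 64 + 5
reduction = full_rows / factored_rows  # ~32x fewer embedding rows
```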

Current task

Pretraining Ablations: mate-boost, no-outcome, discard-ply-limit

You are running three pretraining ablation experiments on PAWN-Base. Each ablation modifies one aspect of the random game generation to understand its effect on the learned world model.

You are on a RunPod with 3x RTX 6000 Ada (48GB VRAM each, $0.77/hr each = $2.31/hr total). Use the pawn-lab MCP server and the manage-pod skill workflow.

As your first action, set up the check-in crons per the manage-pod skill, then proceed.

The Three Ablations

| Ablation | Config | Hypothesis |
|---|---|---|
| mate-boost | mate_boost: 1.0 | Always taking mate-in-1 produces shorter, more decisive games (~134 avg ply vs ~238). More checkmate patterns, fewer aimless endgames. |
| no-outcome | no_outcome_token: true | Stripping the outcome token forces the model to infer game result from moves alone. Tests whether outcome conditioning helps or hurts. |
| discard-ply-limit | discard_ply_limit: true | Only naturally-ended games (no truncation at 255 plies). All games have meaningful endings. |

There is no separate baseline run — the existing PAWN-Base checkpoint (thomas-schweich/pawn-base) trained at 100K steps is the baseline reference.

Unified Config

All training uses the unified scripts/train.py with Pydantic RunConfig. Call lab_schema to see all available fields. Configs are JSON dicts passed to lab_launch.

Base config shared by all ablations:

{
  "run_type": "pretrain",
  "variant": "base",
  "total_steps": 200000,
  "batch_size": 256,
  "lr": 3e-4,
  "local_checkpoints": true,
  "amp_dtype": "bfloat16",
  "eval_interval": 5000,
  "num_workers": 4
}

Each ablation adds its flag on top. Example for mate-boost:

{
  "run_type": "pretrain",
  "variant": "base",
  "mate_boost": 1.0,
  "total_steps": 200000,
  "batch_size": 256,
  "lr": 3e-4,
  "local_checkpoints": true,
  "amp_dtype": "bfloat16",
  "eval_interval": 5000,
  "num_workers": 4
}

Procedure

Phase 1: Learning Rate Exploration (~1 hour)

For each of the 3 ablations, launch 3 trials with different learning rates using pause_after_steps:

lab_launch(config={
  "run_type": "pretrain",
  "variant": "base",
  "mate_boost": 1.0,           # ← ablation flag
  "total_steps": 200000,
  "pause_after_steps": 5000,   # ← pause for comparison
  "batch_size": 256,
  "lr": 1e-4,                  # ← vary this: 1e-4, 3e-4, 1e-3
  "local_checkpoints": true,
  "amp_dtype": "bfloat16",
  "no_compile": true,          # ← skip compile for short exploration
  "eval_interval": 2500,
  "num_workers": 4
})

That's 9 trials total (3 LRs × 3 ablations). With 3 GPUs, run 3 trials concurrently (one per GPU). Run each ablation's 3 LRs sequentially on its GPU, or interleave — the lab runner assigns GPUs automatically. Each pauses at 5K steps. Use no_compile: true to skip the 15-30 min compile overhead for these short runs.

After all 9 pause, call lab_results to compare val_loss at 5K steps. Pick the best LR for each ablation.

Phase 2: Full Training (~22 hours)

Resume the best LR for each ablation to 200K steps with torch.compile enabled:

lab_resume(trial_id=BEST_MATE_BOOST)      # no pause → runs to total_steps
lab_resume(trial_id=BEST_NO_OUTCOME)
lab_resume(trial_id=BEST_DISCARD_PLY)

lab_resume clears pause_after_steps by default, so the resumed trials run to completion at total_steps=200000. The resumed trials pick up from the 5K-step checkpoint with optimizer state intact.

Important: Do NOT pass no_compile: true for the resumed runs. Let torch.compile run — the 15-30 min compile overhead amortizes over 195K remaining steps.

With 1 model per GPU (no sharing), expect ~0.4s/step → ~22 hours for 195K steps. All 3 run in parallel. Total cost: ~$51.

Phase 3: Evaluation

After all 3 runs converge, compare:

  • Final val_loss and accuracy vs baseline (PAWN-Base at 100K steps)
  • Learning curves (loss vs step for all 3, plotted together)
  • Per-ply accuracy via scripts/eval_accuracy.py if time permits

Monitoring

Use the manage-pod skill check-in pattern:

  • Mini check-in (every 5 min): lab_events — launch next trial if GPU idle
  • Full check-in (hourly): lab_status + lab_log for each running trial + update lab notes

During Phase 1 (9 short trials), use 5-min check-ins — trials finish fast. During Phase 2 (3 long runs), switch to hourly after confirming stability.

What to watch for

  • NaN loss: Kill immediately. Try lr/3. Use lab_log to check.
  • discard-ply-limit throughput: This ablation discards ~60% of generated games. If step times are much slower than the other two, reduce num_workers to free CPU for data generation.
  • mate-boost game lengths: Check lab_log — mate-boost games are ~134 plies vs ~238 normal. Each batch processes more games, which may cause higher memory usage. If OOM, reduce batch_size to 128.
  • GPU assignment: The lab runner assigns one trial per GPU via CUDA_VISIBLE_DEVICES. If a trial fails and a GPU sits idle, launch a replacement promptly.

Cost Budget

| Phase | Trials | Time | Cost |
|---|---|---|---|
| LR exploration | 9 × 5K steps | ~1h | ~$3 |
| Full training | 3 × 195K steps (parallel) | ~22h | ~$51 |
| Total | | ~23h | ~$53 |

Set cost tracking: lab_set_cost(cost_per_hour=2.31)

Lab Notes

Write experiment state to runs/lab-notes.md (survives context compaction via PostCompact hook). Include:

  • Phase 1 results table (LR × ablation → val_loss at 5K steps)
  • Which LR was selected for each ablation and why
  • Phase 2 progress (step, loss, ETA per ablation)
  • Any anomalies or adjustments made

Current State (as of 2026-04-03 19:30 UTC)

Phase 2: Full Training (RUNNING)

Three ablation runs training to 200K steps, torch.compile ON, resumed from 5K:

| Trial | GPU | Ablation | LR | Step (~19:00) | train_loss | acc |
|---|---|---|---|---|---|---|
| 24 | 0 | mate-boost | 1e-3 | ~120K | 3.19 | 6.8% |
| 25 | 1 | no-outcome | 3e-4 (no warmup) | ~116K | 3.09 | 7.0% |
| 26 | 2 | discard-ply-limit | 1e-3 | ~119K | 3.18 | 7.5% |
  • ETA: 2026-04-04 04:30 UTC (9h remaining)
  • Hourly cron monitors via lab_events + lab_log
  • Baseline ref: PAWN-Base val_loss ~3.06, acc ~6.9% at 100K steps

Concurrent: Accuracy Ceiling Computation (RUNNING)

Running scripts/compute_theoretical_ceiling.py on spare CPUs (RAYON_NUM_THREADS=30) while GPUs train:

RAYON_NUM_THREADS=30 PAWN_ALLOW_CPU=1 /opt/pawn/.venv/bin/python \
    scripts/compute_theoretical_ceiling.py \
    --n-games 5000 --rollouts 1024 --sample-rate 0.075 \
    --model-accuracy 0.069 --output /workspace/data/theoretical_ceiling_1024.json
  • 1024 rollouts/move (8x previous), expected to narrow bias bracket from 0.66pp to ~0.23pp
  • Processes in 10 batches of 500 games, printing intermediate estimates
  • ETA: ~3-4 hours (finishes well before training)
  • Output: /workspace/data/theoretical_ceiling_1024.json

Bug Fixes Applied This Session

  • pause_after_steps: Was missing from trainer.py (only in adapter training). Added to TrainingConfig and trainer.py train loop. Wired through scripts/train.py.
  • lab_resume checkpoint discovery: Was looking for checkpoints/best or checkpoints/final only. Pretraining uses step_XXXXXXXX naming. Added fallback to pick highest step_* dir.

HF Bucket Backup

All experiment data syncs to hf://buckets/thomas-schweich/pretraining-ablations:

  • Metrics/configs synced every ~4 hours (excluding checkpoints)
  • 100K-step checkpoints for all 3 ablations uploaded
  • Lab notes + chat transcript synced periodically
  • HF_TOKEN in /opt/pawn/.env
  • See "HuggingFace Bucket Backups" section above for API patterns

Lab Notes

Detailed log at /workspace/runs/lab-notes.md (symlinked from /opt/pawn/runs/lab-notes.md). Survives context compaction via PostCompact hook. Read this first when resuming.

Post-Training Procedure

When each training run completes (in any order):

  1. Upload best checkpoint to HF bucket in the background (see bucket sync patterns above)
  2. Run linear probes on the best checkpoint in the background:
    PAWN_ALLOW_CPU=1 /opt/pawn/.venv/bin/python scripts/eval_probes.py \
        --checkpoint /workspace/logs/trial_XXXX/run_.../checkpoints/step_BEST \
        --log-dir /workspace/eval_probes/trial_XXXX --device cuda:N
    
    Use the GPU that just freed up. Run in background so other completions aren't blocked.
  3. After the last probe run is kicked off, draft the final report comparing all 3 ablations + baseline:
    • val_loss and accuracy curves
    • Probe results (piece type, check, castling, material, game phase)
    • Accuracy ceiling comparison
    • Key findings and surprises
  4. Update the report with probe results as they finish
  5. Upload everything to HF bucket: final checkpoints, probe results, report, lab notes

Success Criteria

  1. All three ablations train to 200K steps without divergence
  2. Clear comparison of val_loss and accuracy across ablations
  3. At least one ablation shows meaningful difference from baseline (better or worse — both are informative)
