Chess Transformer 200M v2

A 204M parameter chess transformer trained on Stockfish-labeled positions from Lichess games.

Current Results

  • Best Accuracy: 16.3% (step 0)
  • Total Positions Trained: 0 across 4 GPUs
  • Last Updated: 2026-03-28T15:20:15.471373+00:00

Training

  • Experiment: exp075_ddp_4gpu (Local SGD, 4x NVIDIA A40)
  • Dataset: avewright/chess-positions-lichess-sf (~832M positions, 3275 source parquets)
  • Architecture: FusedBoardEncoder 256d β†’ 1024d transformer, 16 layers, 16 heads, FFN 4Γ—, SpatialPolicyHead
  • Strategy: 4 independent workers each training on 1/4 of data, weights averaged every 500 optimizer steps
  • Batch: 256 Γ— accum 4 = effective 1024 per worker
  • LR: 1e-4 cosine schedule β†’ 5% floor, 1% warmup
  • Parent: Continued from exp074 best checkpoint

Eval History

Step  Positions  Accuracy  Top-3  SF Rank  Value Acc
   0          0     16.3%  41.8%     66.6      78.5%

Architecture

ChessTransformer200M (~204M params)
├── FusedBoardEncoder (embed_dim=256)
├── Linear projection (256 → 1024)
├── CLS token + positional embeddings (68 positions)
├── TransformerEncoder (16 layers, 16 heads, FFN 4096, GELU, norm_first)
├── LayerNorm
├── SpatialPolicyHead (head_dim=512) → 1968 moves
└── Value head (1024 → 512 → 3 WDL)
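As a sanity check on the ~204M figure, the 16-layer encoder stack alone accounts for roughly 201M weights. The arithmetic below counts only the large matrices (attention projections and FFN); biases, LayerNorms, the board encoder, embeddings, and the policy head are omitted because their internals aren't fully specified here.

```python
d_model, n_layers, d_ffn = 1024, 16, 4096

# Per layer: Q, K, V, and output projections (4 * d^2) plus two FFN matrices.
attn = 4 * d_model * d_model   # 4,194,304
mlp = 2 * d_model * d_ffn      # 8,388,608
encoder_params = n_layers * (attn + mlp)

# Value head from the tree above: 1024 -> 512 -> 3 (weight matrices only).
value_head = d_model * 512 + 512 * 3

print(f"{encoder_params:,}")   # ~201M of the ~204M total
```

The remaining ~3M parameters plausibly sit in the board encoder, projection, embeddings, and policy head.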

Files

  • best_model.pt β€” best checkpoint (state_dict only)
  • training_log.json β€” full eval history
  • config.json β€” training configuration
  • train.log β€” aggregated worker logs

Usage

from huggingface_hub import hf_hub_download
import torch

path = hf_hub_download("avewright/chess-transformer-200m-v2", "best_model.pt")
state_dict = torch.load(path, map_location="cpu", weights_only=True)
# Load into ChessTransformer200M architecture
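The policy head emits logits over a fixed 1968-move vocabulary. The index-to-move mapping is defined by the repo's move encoding, so the sketch below assumes you already have a boolean legality mask in that same indexing; `pick_move` is a hypothetical helper, not part of this repo.

```python
import math


def pick_move(logits, legal_mask):
    """Softmax over policy logits with illegal moves masked to -inf,
    returning the index of the most probable legal move."""
    masked = [x if ok else float("-inf") for x, ok in zip(logits, legal_mask)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    return max(range(len(probs)), key=probs.__getitem__)
```

Masking before the softmax (rather than filtering afterwards) keeps the probabilities normalized over legal moves only, which matters if you sample instead of taking the argmax.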