# Chess Transformer 200M v2
A 204M parameter chess transformer trained on Stockfish-labeled positions from Lichess games.
## Current Results
- Best Accuracy: 16.3% (step 0)
- Total Positions Trained: 0 across 4 GPUs
- Last Updated: 2026-03-28T15:20:15.471373+00:00
## Training
- Experiment: exp075_ddp_4gpu (Local SGD, 4x NVIDIA A40)
- Dataset: avewright/chess-positions-lichess-sf (~832M positions, 3275 source parquets)
- Architecture: FusedBoardEncoder 256d → 1024d transformer, 16 layers, 16 heads, FFN 4×, SpatialPolicyHead
- Strategy: 4 independent workers each training on 1/4 of data, weights averaged every 500 optimizer steps
- Batch: 256 × accum 4 = effective 1024 per worker
- LR: 1e-4 cosine schedule → 5% floor, 1% warmup
- Parent: Continued from exp074 best checkpoint
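The Local SGD strategy above means each of the 4 workers trains independently, and every 500 optimizer steps the workers' parameters are replaced by their element-wise mean. A minimal sketch of that averaging step, using plain-Python lists in place of tensors (`average_weights` is a hypothetical helper, not the repo's actual code):

```python
def average_weights(worker_states):
    """Element-wise mean of several workers' parameters.

    worker_states: list of dicts mapping parameter name -> list of floats,
    one dict per worker (a stand-in for a torch state_dict).
    """
    n = len(worker_states)
    return {
        name: [sum(vals) / n for vals in zip(*(s[name] for s in worker_states))]
        for name in worker_states[0]
    }

# Toy example with 2 workers and one 3-element parameter:
avg = average_weights([{"w": [1.0, 2.0, 3.0]}, {"w": [3.0, 4.0, 5.0]}])
# avg["w"] == [2.0, 3.0, 4.0]
```

Averaging only every 500 steps keeps inter-GPU communication rare, which is the usual motivation for Local SGD over per-step gradient all-reduce.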
## Eval History
| Step | Positions | Accuracy | Top-3 | SF Rank | Value Acc |
|---|---|---|---|---|---|
| 0 | 0 | 16.3% | 41.8% | 66.6 | 78.5% |
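"Value Acc" refers to the 3-class WDL (win/draw/loss) value head. A common way to turn its three logits into a single expected score is softmax followed by `P(win) + 0.5 * P(draw)`; the sketch below assumes a (win, draw, loss) ordering, which the model card does not confirm:

```python
import math

def wdl_expected_score(logits):
    """Softmax over (win, draw, loss) logits -> expected score in [0, 1].

    Ordering of the three logits is an assumption for illustration.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    total = sum(exps)
    p_win, p_draw, p_loss = (e / total for e in exps)
    return p_win + 0.5 * p_draw

wdl_expected_score([2.0, 0.0, -2.0])  # well above 0.5: side to move is winning
```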
## Architecture
```
ChessTransformer200M (~204M params)
├── FusedBoardEncoder (embed_dim=256)
├── Linear projection (256 → 1024)
├── CLS token + positional embeddings (68 positions)
├── TransformerEncoder (16 layers, 16 heads, FFN 4096, GELU, norm_first)
├── LayerNorm
├── SpatialPolicyHead (head_dim=512) → 1968 moves
└── Value head (1024 → 512 → 3 WDL)
```
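A rough back-of-the-envelope count shows the ~204M total is dominated by the encoder stack; the estimate below ignores biases, layer norms, embeddings, and the policy/value heads, which is why it lands a few million short:

```python
# Weight-matrix parameter estimate for the transformer trunk only.
d_model, n_layers, d_ffn = 1024, 16, 4096

attn_per_layer = 4 * d_model * d_model  # Q, K, V, and output projections
ffn_per_layer = 2 * d_model * d_ffn     # up-projection + down-projection
trunk_params = n_layers * (attn_per_layer + ffn_per_layer)

print(trunk_params)  # 201326592, i.e. ~201M of the ~204M total
```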
## Files

- `best_model.pt`: best checkpoint (state_dict only)
- `training_log.json`: full eval history
- `config.json`: training configuration
- `train.log`: aggregated worker logs
## Usage
```python
from huggingface_hub import hf_hub_download
import torch

path = hf_hub_download("avewright/chess-transformer-200m-v2", "best_model.pt")
state_dict = torch.load(path, map_location="cpu", weights_only=True)
# Load into ChessTransformer200M architecture
```