# SortGPT Checkpoints

Checkpoints for small decoder-only transformers trained on the integer sorting task.
## Task

The model takes a sequence of k integers drawn from {0, ..., N-1}, followed by a SEP token, and must output the sorted sequence:

`[unsorted_tokens | SEP | sorted_tokens]`

The full sequence length is 2*k + 1. The SEP token has index N (i.e., vocab_size = N + 1).
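As a concrete illustration of the layout above, here is a minimal sketch of building one training example. The `make_example` helper is hypothetical (the released code's actual data pipeline lives in the training scripts, not shown here); only the sequence format is taken from this README.

```python
import random

def make_example(k, N, rng):
    """Return one [unsorted | SEP | sorted] sequence of length 2*k + 1.

    Hypothetical helper illustrating the task format; not part of the
    released code.
    """
    sep = N  # SEP token index is N, so vocab_size = N + 1
    unsorted = [rng.randrange(N) for _ in range(k)]
    return unsorted + [sep] + sorted(unsorted)

rng = random.Random(0)
seq = make_example(4, 128, rng)
print(len(seq))  # 2*4 + 1 = 9
```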
## Grid

| Parameter | Values |
|---|---|
| k (length) | 16, 32 |
| N (vocab) | 128, 256, 512, 1024 |
| Seeds | 1, 2, 3, 4, 5 |
| n_embd | 64 |
| n_layers | 2 |
| n_heads | 1 |
| init_std | 0.01 |
| lr | 0.03 |
| max_iters | 100,000 |
8 configs × 5 seeds = 40 runs, each with 20 checkpoints (every 5,000 steps).
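Combining the grid with the path pattern from the File Structure section, the full set of checkpoint files can be enumerated programmatically (a sketch; only the directory and filename pattern from this README are assumed):

```python
# Enumerate every checkpoint path implied by the grid:
# 2 values of k x 4 values of N = 8 configs, 5 seeds each, 20 checkpoints per run.
paths = [
    f"checkpoints/k{k}_N{N}/seed{s}/std0p01_iseed{s}__ckpt{it}.pt"
    for k in (16, 32)
    for N in (128, 256, 512, 1024)
    for s in range(1, 6)
    for it in range(5000, 100001, 5000)  # every 5,000 steps up to 100,000
]
print(len(paths))  # 8 * 5 * 20 = 800
```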
## Architecture

Small GPT-2-style decoder-only transformer:

- Token embeddings only (no positional embeddings; `without_pos=True`)
- 2 pre-norm transformer blocks, each with causal self-attention + MLP
- Final LayerNorm
- LM head with weights tied to the token embedding
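For a rough sense of scale, here is a back-of-the-envelope parameter count. It assumes standard GPT-2 block shapes (linear layers with biases, 4x MLP expansion); these shapes are an assumption, not read from `model.py`, so treat the numbers as approximate.

```python
def approx_params(N, d=64, n_layers=2):
    """Rough parameter count, assuming GPT-2-style blocks with biases."""
    vocab = N + 1                                  # tokens {0..N-1} plus SEP
    emb = vocab * d                                # token embedding (tied LM head adds nothing)
    ln = 2 * d                                     # LayerNorm scale + bias
    attn = (d * 3 * d + 3 * d) + (d * d + d)       # fused qkv proj + output proj
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)    # up proj + down proj
    block = ln + attn + ln + mlp                   # pre-norm block: two LayerNorms
    return emb + n_layers * block + ln             # plus the final LayerNorm

print(approx_params(512))  # ~133k parameters for the N=512 configs
```

Most of the budget sits in the two transformer blocks; the embedding share grows with N since d is fixed at 64.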
## File Structure

```
checkpoints/
  k{16,32}_N{128,256,512,1024}/
    seed{1,2,3,4,5}/
      std0p01_iseed{S}__ckpt{iter}.pt
model.py   # Model definition + loading utilities
```
## Loading a Checkpoint

Copy `model.py` into your project, then:

```python
from model import load_model_from_checkpoint

model = load_model_from_checkpoint("checkpoints/k32_N512/seed1/std0p01_iseed1__ckpt100000.pt")
```
Each `.pt` file is a dict with keys:

- `model_config`: dict of `GPTConfig` fields
- `model_state_dict`: PyTorch state dict
- `checkpoint_iter`, `init_seed`, `init_std`, `l1_init_scale`
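If you want to inspect a checkpoint without `model.py`, you can load the raw dict with `torch.load(path, map_location="cpu")` and read these keys directly. The sketch below uses a mock dict standing in for a real file; the top-level keys come from the list above, but the field names inside `model_config` are illustrative guesses, not confirmed `GPTConfig` fields.

```python
# Mock checkpoint dict with the documented top-level keys.
# In practice: ckpt = torch.load(path, map_location="cpu")
ckpt = {
    "model_config": {"n_embd": 64, "n_layers": 2, "n_heads": 1},  # field names assumed
    "model_state_dict": {},          # real files hold the PyTorch tensors here
    "checkpoint_iter": 100000,
    "init_seed": 1,
    "init_std": 0.01,
    "l1_init_scale": None,
}

assert ckpt["checkpoint_iter"] % 5000 == 0  # checkpoints are saved every 5,000 steps
print(sorted(ckpt.keys()))
```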
## Training Details

- Optimizer: AdamW (betas = 0.9, 0.95)
- LR schedule: cosine decay with linear warmup
- Batch size: 128
- Data: randomly sampled sorting problems (no duplicates)
- `data_seed`: 1337 (shared across all runs)
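The schedule above can be sketched as follows. This README does not state the warmup length or the final LR, so `warmup_iters` and `min_lr` below are placeholders; only the shape (linear warmup into cosine decay) and the peak lr of 0.03 come from the training details.

```python
import math

def lr_at(step, max_lr=0.03, max_iters=100_000, warmup_iters=1_000, min_lr=0.0):
    """Cosine decay to min_lr with linear warmup (warmup_iters/min_lr assumed)."""
    if step < warmup_iters:
        return max_lr * step / warmup_iters            # linear ramp to peak
    t = (step - warmup_iters) / (max_iters - warmup_iters)  # 0 -> 1 over decay
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

print(lr_at(1_000))  # peak lr right after warmup: 0.03
```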