SortGPT Checkpoints

Checkpoints for small decoder-only transformers trained on the integer sorting task.

Task

The model receives k integers drawn from {0, ..., N-1} followed by a SEP token, and must output the same integers in sorted order:

[unsorted_tokens | SEP | sorted_tokens]

The full sequence length is 2*k + 1. The SEP token has index N (so vocab_size = N + 1).
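The sequence layout can be sketched as follows; `make_example` is a hypothetical helper for illustration, not part of the released code:

```python
import random

def make_example(k: int, N: int, rng: random.Random) -> list[int]:
    """Build one [unsorted | SEP | sorted] sequence.

    Tokens are integers in {0, ..., N-1}; SEP is token index N,
    so vocab_size = N + 1 and the sequence length is 2*k + 1.
    """
    SEP = N
    unsorted = [rng.randrange(N) for _ in range(k)]
    return unsorted + [SEP] + sorted(unsorted)

seq = make_example(k=16, N=128, rng=random.Random(0))
assert len(seq) == 2 * 16 + 1
assert seq[16] == 128  # SEP sits between the two halves
```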

Grid

Parameter     Values
k (length)    16, 32
N (vocab)     128, 256, 512, 1024
Seeds         1, 2, 3, 4, 5
n_embd        64
n_layers      2
n_heads       1
init_std      0.01
lr            0.03
max_iters     100,000

8 configs × 5 seeds = 40 runs, each with 20 checkpoints (every 5,000 steps).
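The grid and the checkpoint paths can be enumerated directly from the values above (path format per the File Structure section below):

```python
from itertools import product

ks = [16, 32]
Ns = [128, 256, 512, 1024]
seeds = [1, 2, 3, 4, 5]

runs = list(product(ks, Ns, seeds))
assert len(runs) == 40  # 8 configs x 5 seeds

# 20 checkpoints per run, every 5,000 steps up to 100,000
ckpt_iters = list(range(5_000, 100_001, 5_000))
assert len(ckpt_iters) == 20

paths = [
    f"checkpoints/k{k}_N{N}/seed{s}/std0p01_iseed{s}__ckpt{it}.pt"
    for (k, N, s) in runs
    for it in ckpt_iters
]
```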

Architecture

Small GPT-2-style decoder-only transformer:

  • Token embeddings (no positional embeddings — without_pos=True)
  • 2 pre-norm transformer blocks, each with causal self-attention + MLP
  • Final LayerNorm + LM head, with weights tied to the token embedding
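A minimal PyTorch sketch of this architecture, assuming standard GPT-2 conventions; the released model.py is authoritative, and all class and field names here are ours:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block: causal self-attention + MLP, residuals around both."""
    def __init__(self, n_embd: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        h = self.ln1(x)
        T = x.size(1)
        # Boolean mask: True = position may NOT attend (future tokens)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a
        x = x + self.mlp(self.ln2(x))
        return x

class SortGPT(nn.Module):
    def __init__(self, vocab_size: int, n_embd=64, n_layers=2, n_heads=1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)  # no positional embeddings
        self.blocks = nn.ModuleList(Block(n_embd, n_heads) for _ in range(n_layers))
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight  # weight tying

    def forward(self, idx):
        x = self.tok_emb(idx)
        for blk in self.blocks:
            x = blk(x)
        return self.head(self.ln_f(x))

model = SortGPT(vocab_size=129)  # N=128 plus SEP
logits = model(torch.zeros(1, 33, dtype=torch.long))  # 2*16 + 1 tokens
assert logits.shape == (1, 33, 129)
```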

File Structure

checkpoints/
  k{16,32}_N{128,256,512,1024}/
    seed{1,2,3,4,5}/
      std0p01_iseed{S}__ckpt{iter}.pt
model.py   # Model definition + loading utilities

Loading a Checkpoint

# Copy model.py to your project, then:
from model import load_model_from_checkpoint

model = load_model_from_checkpoint("checkpoints/k32_N512/seed1/std0p01_iseed1__ckpt100000.pt")

Each .pt file is a dict with keys:

  • model_config: dict of GPTConfig fields
  • model_state_dict: PyTorch state dict
  • checkpoint_iter, init_seed, init_std, l1_init_scale
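A round-trip sketch of this checkpoint format with a dummy state dict; the field values below are illustrative, not taken from any released file:

```python
import os
import tempfile
import torch

# Checkpoint dict with the keys listed above; values here are made up
# for illustration (vocab_size = N + 1 with N=512).
ckpt = {
    "model_config": {"vocab_size": 513, "n_embd": 64, "n_layers": 2, "n_heads": 1},
    "model_state_dict": {"tok_emb.weight": torch.zeros(513, 64)},
    "checkpoint_iter": 100_000,
    "init_seed": 1,
    "init_std": 0.01,
    "l1_init_scale": None,  # assumption: exact type/value varies by run
}

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "ckpt.pt")
    torch.save(ckpt, path)
    loaded = torch.load(path, map_location="cpu", weights_only=False)

assert set(loaded) == set(ckpt)
assert loaded["checkpoint_iter"] == 100_000
```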

Training Details

  • Optimizer: AdamW (betas = (0.9, 0.95))
  • LR schedule: cosine decay with linear warmup
  • Batch size: 128
  • Data: randomly sampled sorting problems (no duplicates)
  • data_seed: 1337 (shared across all runs)
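The stated LR schedule (linear warmup, then cosine decay) can be sketched as below; warmup_iters and min_lr are our assumptions, since the source does not specify them:

```python
import math

def get_lr(it: int, max_lr=0.03, max_iters=100_000,
           warmup_iters=1_000, min_lr=0.0) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    t = (it - warmup_iters) / (max_iters - warmup_iters)  # 0 -> 1 over decay
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * min(t, 1.0)))

assert get_lr(0) < get_lr(999)            # still warming up
assert abs(get_lr(100_000)) < 1e-12       # fully decayed to min_lr
```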