SortGPT Checkpoints

Checkpoints for small decoder-only transformers trained on the integer sorting task.

Task

The model receives k integers drawn from {0, ..., N-1} followed by a SEP token, and must output the same integers in sorted order:

[unsorted_tokens | SEP | sorted_tokens]

The full sequence length is 2*k + 1. The SEP token has index N (so vocab_size = N + 1).
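The sequence layout can be sketched as follows; `make_example` is a hypothetical helper for illustration, not part of the released code:

```python
import random

def make_example(k: int, N: int, rng: random.Random) -> list[int]:
    """Build one [unsorted | SEP | sorted] sequence.

    Tokens are integers in {0, ..., N-1}; SEP is token index N,
    so vocab_size = N + 1 and the sequence length is 2*k + 1.
    """
    SEP = N
    unsorted = [rng.randrange(N) for _ in range(k)]
    return unsorted + [SEP] + sorted(unsorted)

seq = make_example(k=16, N=128, rng=random.Random(0))
assert len(seq) == 2 * 16 + 1
assert seq[16] == 128  # SEP sits between the two halves
```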

Grid

Parameter     Values
k (length)    16, 32
N (vocab)     128, 256, 512, 1024
Seeds         1, 2, 3, 4, 5
n_embd        64
n_layers      2
n_heads       1
init_std      0.01
lr            0.03
max_iters     100,000

8 configs × 5 seeds = 40 runs, each with 20 checkpoints (every 5,000 steps).
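The grid and the checkpoint paths can be enumerated directly from the values above (path format per the File Structure section below):

```python
from itertools import product

ks = [16, 32]
Ns = [128, 256, 512, 1024]
seeds = [1, 2, 3, 4, 5]

runs = list(product(ks, Ns, seeds))
assert len(runs) == 40  # 8 configs x 5 seeds

# 20 checkpoints per run, every 5,000 steps up to 100,000
ckpt_iters = list(range(5_000, 100_001, 5_000))
assert len(ckpt_iters) == 20

paths = [
    f"checkpoints/k{k}_N{N}/seed{s}/std0p01_iseed{s}__ckpt{it}.pt"
    for (k, N, s) in runs
    for it in ckpt_iters
]
```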

Architecture

Small GPT-2-style decoder-only transformer:

  • Token embeddings (no positional embeddings — without_pos=True)
  • 2 pre-norm transformer blocks, each with causal self-attention + MLP
  • Final LayerNorm + LM head, with weights tied to the token embedding
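A minimal PyTorch sketch of this architecture, assuming standard GPT-2 conventions; the released model.py is authoritative, and all class and field names here are ours:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block: causal self-attention + MLP, residuals around both."""
    def __init__(self, n_embd: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        h = self.ln1(x)
        T = x.size(1)
        # Boolean mask: True = position may NOT attend (future tokens)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a
        x = x + self.mlp(self.ln2(x))
        return x

class SortGPT(nn.Module):
    def __init__(self, vocab_size: int, n_embd=64, n_layers=2, n_heads=1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)  # no positional embeddings
        self.blocks = nn.ModuleList(Block(n_embd, n_heads) for _ in range(n_layers))
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight  # weight tying

    def forward(self, idx):
        x = self.tok_emb(idx)
        for blk in self.blocks:
            x = blk(x)
        return self.head(self.ln_f(x))

model = SortGPT(vocab_size=129)  # N=128 plus SEP
logits = model(torch.zeros(1, 33, dtype=torch.long))  # 2*16 + 1 tokens
assert logits.shape == (1, 33, 129)
```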

File Structure

checkpoints/
  k{16,32}_N{128,256,512,1024}/
    seed{1,2,3,4,5}/
      std0p01_iseed{S}__ckpt{iter}.pt
model.py   # Model definition + loading utilities

Loading a Checkpoint

# Copy model.py to your project, then:
from model import load_model_from_checkpoint

model = load_model_from_checkpoint("checkpoints/k32_N512/seed1/std0p01_iseed1__ckpt100000.pt")

Each .pt file is a dict with keys:

  • model_config: dict of GPTConfig fields
  • model_state_dict: PyTorch state dict
  • checkpoint_iter, init_seed, init_std, l1_init_scale
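A round-trip sketch of this checkpoint format with a dummy state dict; the field values below are illustrative, not taken from any released file:

```python
import os
import tempfile
import torch

# Checkpoint dict with the keys listed above; values here are made up
# for illustration (vocab_size = N + 1 with N=512).
ckpt = {
    "model_config": {"vocab_size": 513, "n_embd": 64, "n_layers": 2, "n_heads": 1},
    "model_state_dict": {"tok_emb.weight": torch.zeros(513, 64)},
    "checkpoint_iter": 100_000,
    "init_seed": 1,
    "init_std": 0.01,
    "l1_init_scale": None,  # assumption: exact type/value varies by run
}

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "ckpt.pt")
    torch.save(ckpt, path)
    loaded = torch.load(path, map_location="cpu", weights_only=False)

assert set(loaded) == set(ckpt)
assert loaded["checkpoint_iter"] == 100_000
```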

Training Details

  • Optimizer: AdamW (betas = (0.9, 0.95))
  • LR schedule: cosine decay with linear warmup
  • Batch size: 128
  • Data: randomly sampled sorting problems (no duplicates)
  • data_seed: 1337 (shared across all runs)
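The stated LR schedule (linear warmup, then cosine decay) can be sketched as below; warmup_iters and min_lr are our assumptions, since the source does not specify them:

```python
import math

def get_lr(it: int, max_lr=0.03, max_iters=100_000,
           warmup_iters=1_000, min_lr=0.0) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    t = (it - warmup_iters) / (max_iters - warmup_iters)  # 0 -> 1 over decay
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * min(t, 1.0)))

assert get_lr(0) < get_lr(999)            # still warming up
assert abs(get_lr(100_000)) < 1e-12       # fully decayed to min_lr
```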