GPT-X-125M

A modern Llama-style language model trained from scratch: 125M parameters, 15B tokens of FineWeb-Edu. It outperforms GPT-3 (125M) on HellaSwag (36.46% vs. 33.70%) while using 20x fewer training tokens (15B vs. 300B).

Results

Evaluated using EleutherAI/lm-evaluation-harness

Company	Model	HellaSwag	ARC (Average)	PIQA	LogicMark	Winogrande	ArithMark	Average	Training tokens
HuggingFace	SmolLM2-135M	42.10%	43.90%	68.40%	42.78%	51.30%	33.20%	46.95%	2T
Axiomic Labs	GPT-X2-125M (in training)	35.17%	38.52%	66.05%	45.90%	48.54%	34.16%	44.72%	75B
OpenAI	GPT-2 Medium (355M)	37.57%	33.56%	67.08%	44.44%	49.09%	34.44%	44.36%	~100B
Facebook	MobileLLM-125M	38.90%	35.50%	65.30%	42.04%	53.10%	31.16%	44.33%	1T
Axiomic Labs	GPT-X-125M	36.46%	38.69%	64.85%	43.52%	50.91%	30.52%	44.16%	15B
OpenAI	GPT-3 (125M)	33.70%	35.10%	64.60%	NA	52.00%	NA	NA	300B
OpenAI	GPT-2 (124M)	29.53%	30.44%	62.46%	44.06%	50.51%	31.60%	41.43%	~100B
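
The Average column appears to be the unweighted mean of the six benchmark scores. This is inferred from the numbers in the table, not stated by the authors. A small sketch that recomputes it for three rows (the `scores` dict and `mean_score` helper are illustrative names, not part of the release):

```python
# Scores copied from the table above, in percent, in column order:
# HellaSwag, ARC (Average), PIQA, LogicMark, Winogrande, ArithMark.
scores = {
    "SmolLM2-135M": [42.10, 43.90, 68.40, 42.78, 51.30, 33.20],
    "GPT-X-125M": [36.46, 38.69, 64.85, 43.52, 50.91, 30.52],
    "GPT-2 (124M)": [29.53, 30.44, 62.46, 44.06, 50.51, 31.60],
}


def mean_score(values):
    """Unweighted mean of the benchmark scores, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


for name, values in scores.items():
    print(f"{name}: {mean_score(values):.2f}%")
```

Rows with an NA entry, such as GPT-3 (125M), cannot be averaged this way, which is consistent with their NA in the Average column.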

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True lets transformers load custom model code from the repository.
model = AutoModelForCausalLM.from_pretrained(
    "Datdanboi25/GPT-X-125M",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Datdanboi25/GPT-X-125M")

# Tokenize the prompt into PyTorch tensors.
inputs = tokenizer("The future of AI is", return_tensors="pt", add_special_tokens=True)

# Sample up to 50 new tokens; EOS is reused as the pad token id.
output = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Architecture

GPT-X replaces GPT-2's position encoding, normalization, feed-forward, and attention components with modern alternatives, most of which are used at scale by Llama 3, Mistral, and Gemma 2.

Component	Details
Position encoding	RoPE (rotary, zero extra params)
Normalization	RMSNorm (float32 upcast)
Feed-forward	SwiGLU (3-matrix gated MLP)
Attention	Grouped Query Attention — 9Q / 3KV (3:1)
QK stability	QK-Norm (RMSNorm per head, before RoPE)
Bias	None (all layers bias-free)
Embedding	sqrt(d_model) scaling + weight tying
Auxiliary loss	z-loss (1e-4 on logit magnitudes)
Depth	27 layers x 576 hidden (deep & narrow)
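
To make the QK-Norm row concrete, here is a minimal pure-Python sketch of RMSNorm applied to a single head vector followed by RoPE. It is not the model's actual code: the function names, the `eps` value, and the pairing of adjacent dimensions in `rope` are assumptions (some implementations pair dimension i with i + d/2 instead), and the float32 upcast is not modelled.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of x, then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]


def rope(x, position, theta=10_000.0):
    """Rotate each pair (x[i], x[i+1]), i even, by position * theta**(-i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        angle = position * theta ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out


def qk_norm_then_rope(head_vec, weight, position):
    """QK-Norm ordering from the table: normalize the head first, then rotate."""
    return rope(rms_norm(head_vec, weight), position)


head_dim = 64
q = [0.1 * (i + 1) for i in range(head_dim)]  # toy query vector for one head
gain = [1.0] * head_dim                       # RMSNorm gain at initialization
q_out = qk_norm_then_rope(q, gain, position=5)
print(math.sqrt(sum(v * v for v in q_out)))
```

Because RoPE is a pure rotation, the vector length set by the normalization (about sqrt(64) = 8 with unit gain) is unchanged by position, which is what keeps the attention logits bounded.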

Config

vocab_size     = 50,304    (GPT-2 BPE, padded)
n_layer        = 27
n_head         = 9         (query heads)
n_kv_heads     = 3         (key-value heads, 3:1 GQA)
n_embd         = 576
head_dim       = 64
intermediate   = 1,536     (SwiGLU, 2.67x ratio)
block_size     = 1,024
rope_theta     = 10,000
total params   = 124,561,728
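
The 9 query heads and 3 key-value heads in this config imply that each KV head is shared by 3 query heads. The sketch below shows one common way to assign them, with contiguous groups. The grouping order and the helper name are assumptions, since the config alone does not say how GPT-X maps heads.

```python
def kv_head_for_query_head(q_head, n_head=9, n_kv_heads=3):
    """Index of the key/value head shared by a given query head (contiguous groups)."""
    if n_head % n_kv_heads != 0:
        raise ValueError("n_head must be a multiple of n_kv_heads")
    group_size = n_head // n_kv_heads  # 3 query heads per KV head
    return q_head // group_size


mapping = {q: kv_head_for_query_head(q) for q in range(9)}
print(mapping)

# Widths implied by the config: Q projects to 9 * 64 = 576 columns,
# while K and V each project to only 3 * 64 = 192 columns.
head_dim = 64
q_width = 9 * head_dim
kv_width = 3 * head_dim
print(q_width, kv_width)
```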

Parameter Breakdown

Component	Params
Token embeddings (50304 x 576)	28,975,104
Per block (x27): attention + SwiGLU + norms	3,540,224
27 transformer blocks	95,586,048
Final RMSNorm	576
LM head (tied with embeddings)	0
Total	124,561,728
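
The breakdown above can be reproduced from the config values. In the sketch below, the split of the per-block figure into its parts is an inference that happens to match the stated 3,540,224. In particular, it assumes QK-Norm contributes two gain vectors of size head_dim per block, shared across heads. The function name is illustrative.

```python
def count_params(vocab=50_304, n_layer=27, n_head=9, n_kv_heads=3,
                 n_embd=576, head_dim=64, intermediate=1_536):
    """Parameter count for a bias-free, weight-tied GQA + SwiGLU transformer."""
    embed = vocab * n_embd                        # tied with the LM head
    q_proj = n_embd * n_head * head_dim
    kv_proj = 2 * n_embd * n_kv_heads * head_dim  # K and V projections
    o_proj = n_head * head_dim * n_embd
    swiglu = 3 * n_embd * intermediate            # gate, up, and down matrices
    block_norms = 2 * n_embd                      # pre-attention and pre-MLP RMSNorm
    qk_norm = 2 * head_dim                        # assumption: one gain each for Q and K
    per_block = q_proj + kv_proj + o_proj + swiglu + block_norms + qk_norm
    total = embed + n_layer * per_block + n_embd  # + final RMSNorm
    return {"embed": embed, "per_block": per_block, "total": total}


gqa = count_params()
mha = count_params(n_kv_heads=9)  # hypothetical full multi-head variant
print(gqa)
print(mha["total"] - gqa["total"])
```

Under these assumptions, switching to 9 KV heads would add 442,368 parameters per block, or 11,943,936 in total, which is the budget the 3:1 GQA ratio frees up.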

Training

Data

Dataset: FineWeb-Edu sample-100BT (educational web text)
Tokens: 15B (30,500 steps x 524,288 tokens/step, which multiplies out to 15.99B)
Tokenizer: GPT-2 BPE (tiktoken, 50,257 vocab padded to 50,304)
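
The token budget and the padded vocabulary size both follow from simple arithmetic. The sketch below assumes the vocabulary was rounded up to a multiple of 64, a common convention for GPU efficiency. The card does not state the multiple, and 128 gives the same result.

```python
def pad_to_multiple(n, multiple=64):
    """Round n up to the next multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


steps, tokens_per_step = 30_500, 524_288
total_tokens = steps * tokens_per_step
print(f"{total_tokens:,}")     # 15,990,784,000
print(pad_to_multiple(50_257)) # 50304
```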

Optimization

Optimizer: AdamW (betas=0.9/0.95, weight_decay=0.1)
Learning rate: 6e-4 max, 6e-5 min
Schedule: WSD — 1,000 step warmup, stable phase, linear decay over final 20%
Batch size: 524,288 tokens (micro_batch=8, seq_len=1024, 64 gradient accumulation steps on the single GPU)
Precision: bfloat16 mixed precision
Gradient clipping: 1.0
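
The schedule and batch settings above can be sketched as follows. This is an illustration under stated assumptions, not the training script: the warmup is taken to be linear from near zero, the decay is taken to end exactly at the minimum LR on the last step, and `wsd_lr` is a hypothetical name.

```python
def wsd_lr(step, total_steps=30_500, warmup_steps=1_000,
           max_lr=6e-4, min_lr=6e-5, decay_frac=0.2):
    """Warmup-stable-decay learning rate for a given optimizer step."""
    decay_steps = round(total_steps * decay_frac)  # final 20% of training
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup (assumed shape) up to the peak LR.
        return max_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase at the peak LR.
        return max_lr
    # Linear decay from max_lr to min_lr over the last decay_steps steps.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return max_lr + (min_lr - max_lr) * progress


# Gradient accumulation needed to reach the global batch on one GPU.
micro_batch, seq_len, batch_tokens = 8, 1_024, 524_288
accum_steps = batch_tokens // (micro_batch * seq_len)

for s in (0, 999, 10_000, 24_400, 27_450, 30_500):
    print(s, wsd_lr(s))
print(accum_steps)
```

With these numbers the stable phase runs from step 1,000 to step 24,400, and each optimizer step accumulates 64 micro-batches.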

Hardware

1x NVIDIA RTX 3080 Ti (training)
1x Intel i9-13900K (data tokenization)
Training time: ~4.5 days

Design Decisions

27 layers x 576 hidden: SmolLM-135M and MobileLLM-125M showed that deep and narrow configurations work well at the 125M scale. 27 layers is 2.25x the depth of GPT-2's 12.
GQA 3:1: Saves attention parameters, which are reinvested into a larger SwiGLU. Negligible quality loss at this ratio.
SwiGLU: Gated MLP with SiLU activation, adopted by Llama, PaLM, and Mistral in place of the GELU MLP.
QK-Norm: Prevents attention logit explosion in deep models. RMSNorm is applied per head to queries and keys before RoPE, the ordering used by Gemma 3 and Qwen3.
z-loss: Prevents logit magnitude drift during training (PaLM, T5).
WSD schedule: After warmup, holds the peak LR until 80% of training, then decays linearly to the minimum over the final 20%. Reported to beat cosine when the token budget is limited.
No bias: Bias terms add little measurable quality in modern transformers, and most recent open LLMs (Llama, Mistral, Gemma) omit them.
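
As an illustration of the z-loss item, the sketch below adds a penalty of 1e-4 times the squared log-partition function to the cross-entropy for a single token. This is the PaLM-style formulation and is assumed here, since the card gives only the coefficient. The function names are illustrative.

```python
import math


def logsumexp(logits):
    """Numerically stable log of the softmax normalizer Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def loss_with_z(logits, target, z_coeff=1e-4):
    """Cross-entropy for one token plus the z-loss penalty on log Z."""
    log_z = logsumexp(logits)
    cross_entropy = log_z - logits[target]
    z_loss = z_coeff * log_z ** 2
    return cross_entropy + z_loss, z_loss


base = [0.0, 0.0, 0.0, 0.0]
shifted = [v + 10.0 for v in base]  # same softmax, larger logit magnitudes
print(loss_with_z(base, target=0))
print(loss_with_z(shifted, target=0))
```

Shifting every logit by a constant leaves the cross-entropy unchanged but raises the penalty, so the term pulls log Z toward zero without changing which token is predicted.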

Limitations

Small model: 125M parameters limit reasoning and factual recall
Educational data only: Trained on FineWeb-Edu; not representative of general web text
Not instruction-tuned: Base model only, not aligned for chat
English only
1024 context window

License

Apache 2.0

Citation

@misc{gptx2025,
  title={GPT-X: A Modern Llama-Style Language Model at 125M Scale},
  author={Datdanboi25},
  year={2025},
  url={https://huggingface.co/Datdanboi25/GPT-X-125m-15bt}
}
