GPT-X-125M

A modern Llama-style language model trained from scratch: 125M parameters, 15B tokens of FineWeb-Edu. It scores 36.46% on HellaSwag versus 33.70% for GPT-3 (125M), using 20x fewer training tokens (15B vs. 300B).

Results

Evaluated using EleutherAI/lm-evaluation-harness

| Company | Model | HellaSwag | ARC (Average) | PIQA | LogicMark | Winogrande | ArithMark | Average | Training tokens |
|---|---|---|---|---|---|---|---|---|---|
| HuggingFace | SmolLM2-135M | 42.10% | 43.90% | 68.40% | 42.78% | 51.30% | 33.20% | 46.95% | 2T |
| Axiomic Labs | GPT-X2-125M (in training) | 35.17% | 38.52% | 66.05% | 45.90% | 48.54% | 34.16% | 44.72% | 75B |
| OpenAI | GPT-2 Medium (355M) | 37.57% | 33.56% | 67.08% | 44.44% | 49.09% | 34.44% | 44.36% | ~100B |
| Facebook | MobileLLM-125M | 38.90% | 35.50% | 65.30% | 42.04% | 53.10% | 31.16% | 44.33% | 1T |
| Axiomic Labs | GPT-X-125M | 36.46% | 38.69% | 64.85% | 43.52% | 50.91% | 30.52% | 44.16% | 15B |
| OpenAI | GPT-3 (125M) | 33.70% | 35.10% | 64.60% | NA | 52.00% | NA | NA | 300B |
| OpenAI | GPT-2 (124M) | 29.53% | 30.44% | 62.46% | 44.06% | 50.51% | 31.60% | 41.43% | ~100B |
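
The Average column appears to be the unweighted mean of the six benchmark scores. The sketch below recomputes it from the numbers in the table; the dictionary and function names are illustrative and not part of any released code. GPT-3 (125M) is left out because two of its scores are NA.

```python
# Scores copied from the results table, in column order:
# HellaSwag, ARC (Average), PIQA, LogicMark, Winogrande, ArithMark.
SCORES = {
    "SmolLM2-135M": [42.10, 43.90, 68.40, 42.78, 51.30, 33.20],
    "GPT-X2-125M (in training)": [35.17, 38.52, 66.05, 45.90, 48.54, 34.16],
    "GPT-2 Medium (355M)": [37.57, 33.56, 67.08, 44.44, 49.09, 34.44],
    "MobileLLM-125M": [38.90, 35.50, 65.30, 42.04, 53.10, 31.16],
    "GPT-X-125M": [36.46, 38.69, 64.85, 43.52, 50.91, 30.52],
    "GPT-2 (124M)": [29.53, 30.44, 62.46, 44.06, 50.51, 31.60],
}


def benchmark_average(scores):
    """Unweighted mean of the benchmark scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


for model, scores in SCORES.items():
    print(f"{model}: {benchmark_average(scores):.2f}%")
```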

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True lets transformers run the custom model code shipped in the repo.
model = AutoModelForCausalLM.from_pretrained(
    "Datdanboi25/GPT-X-125M",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Datdanboi25/GPT-X-125M")

inputs = tokenizer("The future of AI is", return_tensors="pt", add_special_tokens=True)
# Sample up to 50 new tokens; EOS is reused as the pad token.
output = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1
)
# output[0] holds the prompt followed by the generated continuation.
print(tokenizer.decode(output[0], skip_special_tokens=True))

Architecture

GPT-X replaces every major component of GPT-2 with modern alternatives proven at scale by Llama 3, Mistral, and Gemma 2.

| Component | Details |
|---|---|
| Position encoding | RoPE (rotary, zero extra params) |
| Normalization | RMSNorm (float32 upcast) |
| Feed-forward | SwiGLU (3-matrix gated MLP) |
| Attention | Grouped Query Attention, 9Q / 3KV (3:1) |
| QK stability | QK-Norm (RMSNorm per head, before RoPE) |
| Bias | None (all layers bias-free) |
| Embedding | sqrt(d_model) scaling + weight tying |
| Auxiliary loss | z-loss (1e-4 on logit magnitudes) |
| Depth | 27 layers x 576 hidden (deep & narrow) |
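
To make the attention rows concrete, here is a dependency-free sketch of three pieces: the 9Q / 3KV head grouping, RMSNorm, and RoPE applied after QK-Norm. It is not the model's implementation. The interleaved (even, odd) pairing in `rope`, the `eps` value, the omission of the learned RMSNorm gain, and all function names are assumptions made for illustration.

```python
import math

N_HEAD, N_KV_HEADS, HEAD_DIM, ROPE_THETA = 9, 3, 64, 10_000.0


def kv_head_for(q_head: int) -> int:
    """Grouped Query Attention: consecutive query heads share one KV head."""
    group_size = N_HEAD // N_KV_HEADS  # 3 query heads per KV head
    return q_head // group_size


def rms_norm(x, eps=1e-6):
    """RMSNorm without the learned gain: divide by the root mean square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def rope(x, pos, theta=ROPE_THETA):
    """Rotate each (even, odd) pair of x by an angle proportional to pos."""
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        angle = pos * theta ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


def prepare_head(vec, pos):
    """QK-Norm first, then RoPE, matching the order given in the table."""
    return rope(rms_norm(vec), pos)


print([kv_head_for(h) for h in range(N_HEAD)])  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Because RoPE is a pure rotation, it leaves the norm set by QK-Norm unchanged, which is one reason normalizing before the rotation is a convenient order.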

Config

vocab_size     = 50,304    (GPT-2 BPE, padded)
n_layer        = 27
n_head         = 9         (query heads)
n_kv_heads     = 3         (key-value heads, 3:1 GQA)
n_embd         = 576
head_dim       = 64
intermediate   = 1,536     (SwiGLU, 2.67x ratio)
block_size     = 1,024
rope_theta     = 10,000
total params   = 124,561,728
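
The listed values are mutually consistent, which a small check can confirm. The dataclass below is a hypothetical container, not the configuration class shipped with the model; its field names follow the listing, and `base_vocab_size` is taken from the tokenizer note in the Training section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTXConfig:
    """Hypothetical config holder; values copied from the Config listing."""
    vocab_size: int = 50_304
    base_vocab_size: int = 50_257  # GPT-2 BPE vocabulary before padding
    n_layer: int = 27
    n_head: int = 9
    n_kv_heads: int = 3
    n_embd: int = 576
    head_dim: int = 64
    intermediate: int = 1_536
    block_size: int = 1_024
    rope_theta: float = 10_000.0

    def validate(self) -> None:
        """Raise ValueError if the dimensions do not fit together."""
        if self.n_head * self.head_dim != self.n_embd:
            raise ValueError("n_head * head_dim must equal n_embd")
        if self.n_head % self.n_kv_heads != 0:
            raise ValueError("n_head must be a multiple of n_kv_heads")
        if self.vocab_size < self.base_vocab_size:
            raise ValueError("padded vocab cannot be smaller than the base vocab")

    @property
    def gqa_ratio(self) -> int:
        return self.n_head // self.n_kv_heads

    @property
    def mlp_ratio(self) -> float:
        return self.intermediate / self.n_embd

    @property
    def padding_tokens(self) -> int:
        return self.vocab_size - self.base_vocab_size


cfg = GPTXConfig()
cfg.validate()
print(cfg.gqa_ratio, round(cfg.mlp_ratio, 2), cfg.padding_tokens)  # 3 2.67 47
```

The padded vocabulary size, 50,304, is divisible by 64 and 128, which is the usual motivation for this kind of padding.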

Parameter Breakdown

| Component | Params |
|---|---|
| Token embeddings (50304 x 576) | 28,975,104 |
| Per block (x27): attention + SwiGLU + norms | 3,540,224 |
| 27 transformer blocks | 95,586,048 |
| Final RMSNorm | 576 |
| LM head (tied with embeddings) | 0 |
| Total | 124,561,728 |
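
The breakdown can be reproduced from the config values. The sketch below is a reconstruction and rests on one assumption: QK-Norm contributes one `head_dim`-sized gain vector for queries and one for keys per block, shared across heads. With that assumption the per-block and total counts match the table exactly. The function name and dictionary keys are illustrative.

```python
def count_params(vocab_size=50_304, n_layer=27, n_head=9, n_kv_heads=3,
                 n_embd=576, head_dim=64, intermediate=1_536):
    """Count parameters for a bias-free, weight-tied GPT-X-style model."""
    q_proj = n_embd * n_head * head_dim       # 576 x 576
    k_proj = n_embd * n_kv_heads * head_dim   # 576 x 192
    v_proj = n_embd * n_kv_heads * head_dim   # 576 x 192
    o_proj = n_head * head_dim * n_embd       # 576 x 576
    attention = q_proj + k_proj + v_proj + o_proj

    qk_norm = 2 * head_dim                    # assumption: one gain each for Q and K
    swiglu = 3 * n_embd * intermediate        # gate, up, and down matrices
    block_norms = 2 * n_embd                  # pre-attention and pre-MLP RMSNorm

    per_block = attention + qk_norm + swiglu + block_norms
    embeddings = vocab_size * n_embd          # also serves as the tied LM head
    final_norm = n_embd

    return {
        "embeddings": embeddings,
        "per_block": per_block,
        "blocks": n_layer * per_block,
        "final_norm": final_norm,
        "total": embeddings + n_layer * per_block + final_norm,
    }


counts = count_params()
print(f"{counts['per_block']:,}")  # 3,540,224
print(f"{counts['total']:,}")      # 124,561,728
```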

Training

Data

  • Dataset: FineWeb-Edu sample-100BT (educational web text)
  • Tokens: 15B (30,500 steps x 524,288 tokens/step)
  • Tokenizer: GPT-2 BPE (tiktoken, 50,257 vocab padded to 50,304)
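
The token budget follows directly from the step count and the batch size. The short calculation below uses only figures from this card; the variable names are illustrative. Note that the product comes to about 15.99B tokens, slightly above the rounded 15B headline figure, so the 20x ratio against GPT-3 is based on the rounded value.

```python
STEPS = 30_500
TOKENS_PER_STEP = 524_288
HEADLINE_TOKENS = 15e9      # the "15B" figure quoted in this card
GPT3_TOKENS = 300e9         # GPT-3 (125M) row of the results table

tokens_processed = STEPS * TOKENS_PER_STEP
ratio_headline = GPT3_TOKENS / HEADLINE_TOKENS
ratio_processed = GPT3_TOKENS / tokens_processed

print(f"tokens processed: {tokens_processed:,}")    # 15,990,784,000
print(f"GPT-3 / headline: {ratio_headline:.1f}x")   # 20.0x
print(f"GPT-3 / processed: {ratio_processed:.1f}x")
```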

Optimization

  • Optimizer: AdamW (betas=0.9/0.95, weight_decay=0.1)
  • Learning rate: 6e-4 max, 6e-5 min
  • Schedule: WSD (warmup-stable-decay), with a 1,000-step warmup, a stable phase at peak LR, and linear decay over the final 20% of steps
  • Batch size: 524,288 tokens (micro_batch=8, seq_len=1024, grad accumulation)
  • Precision: bfloat16 mixed precision
  • Gradient clipping: 1.0
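
The schedule and batch settings above can be sketched as a plain function of the step index. This is a minimal illustration, not the training code: it assumes warmup is linear from zero, that the decay runs from the max to the min learning rate over the last 20% of the 30,500 steps, and that all accumulation happens on the single GPU listed under Hardware. The function and constant names are hypothetical.

```python
MAX_LR, MIN_LR = 6e-4, 6e-5
WARMUP_STEPS, TOTAL_STEPS = 1_000, 30_500
DECAY_FRACTION = 0.20

DECAY_STEPS = round(TOTAL_STEPS * DECAY_FRACTION)  # 6,100 steps
DECAY_START = TOTAL_STEPS - DECAY_STEPS            # step 24,400


def wsd_lr(step: int) -> float:
    """Warmup-stable-decay learning rate for a 0-indexed optimizer step."""
    if step < WARMUP_STEPS:
        # Assumption: linear warmup that reaches MAX_LR on the last warmup step.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step < DECAY_START:
        return MAX_LR
    progress = min(1.0, (step - DECAY_START) / DECAY_STEPS)
    return MAX_LR + (MIN_LR - MAX_LR) * progress


# Micro-batches accumulated per optimizer step on one GPU.
MICRO_BATCH, SEQ_LEN, TOKENS_PER_STEP = 8, 1_024, 524_288
grad_accum_steps = TOKENS_PER_STEP // (MICRO_BATCH * SEQ_LEN)

print(grad_accum_steps)  # 64
for step in (0, 999, 10_000, 24_400, 27_450, 30_500):
    print(step, f"{wsd_lr(step):.2e}")
```

With these assumptions the learning rate sits at 6e-4 from step 999 through step 24,400 and reaches 6e-5 at the final step.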

Hardware

  • 1x NVIDIA RTX 3080 Ti (training)
  • 1x Intel i9-13900K (data tokenization)
  • Training time: ~4.5 days
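
For a rough sense of throughput, the training time can be combined with the step count and batch size from the Data and Optimization sections. Since the duration is quoted as "~4.5 days", the results are only approximate; the variable names are illustrative.

```python
TOTAL_STEPS = 30_500
TOKENS_PER_STEP = 524_288
TRAINING_DAYS = 4.5  # approximate, per the Hardware list

seconds = TRAINING_DAYS * 24 * 60 * 60
tokens_per_second = TOTAL_STEPS * TOKENS_PER_STEP / seconds
seconds_per_step = seconds / TOTAL_STEPS

print(f"~{tokens_per_second:,.0f} tokens/s")
print(f"~{seconds_per_step:.2f} s per optimizer step")
```

Under these figures the run averages on the order of 41,000 tokens per second, or a little under 13 seconds per optimizer step.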

Design Decisions

  • 27 layers x 576 hidden: SmolLM-135M and MobileLLM-125M showed that deep, narrow models are state of the art at the 125M scale. This is 2.25x the depth of GPT-2's 12 layers.
  • GQA 3:1: Shrinks the key and value projections; the saved parameters are reinvested in a larger SwiGLU. Quality loss at this ratio is negligible.
  • SwiGLU: A gated MLP with SiLU activation, adopted in place of a GELU MLP by Llama, PaLM, and Mistral.
  • QK-Norm: Prevents attention logit explosion in deep models. Applied before RoPE, the same ordering used by Gemma 3 and Qwen3.
  • z-loss: Prevents logit magnitude drift during training (PaLM, T5).
  • WSD schedule: Holds at peak LR until the final 20% of training, then decays linearly. Beats cosine when tokens are limited.
  • No bias: Bias terms bring no measurable quality benefit in modern transformers, and most recent open LLMs (Llama and Mistral among them) omit them.
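
As an illustration of the z-loss item, the sketch below adds the penalty to a single-token cross-entropy in plain Python. It follows the PaLM-style formulation, in which the penalty is the coefficient times the squared log of the softmax normalizer; the 1e-4 coefficient comes from the Architecture section, while the function names are hypothetical and the model's actual loss code may differ.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log of the softmax normalizer Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def cross_entropy_with_z_loss(logits, target, z_coeff=1e-4):
    """Return (total, cross_entropy, z_loss) for one token position."""
    log_z = log_sum_exp(logits)
    cross_entropy = log_z - logits[target]
    z_loss = z_coeff * log_z ** 2  # pushes log Z toward zero
    return cross_entropy + z_loss, cross_entropy, z_loss


base = [0.0, 0.0, 0.0, 0.0]
shifted = [v + 10.0 for v in base]  # same softmax, larger logit magnitudes

print(cross_entropy_with_z_loss(base, target=0))
print(cross_entropy_with_z_loss(shifted, target=0))
```

Shifting every logit by a constant leaves the cross-entropy unchanged but increases the z-loss term, which is how the penalty discourages logit magnitudes from drifting.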

Limitations

  • Small model: 125M parameters limit reasoning and factual recall
  • Educational data only: Trained on FineWeb-Edu; not representative of general web text
  • Not instruction-tuned: Base model only, not aligned for chat
  • English only
  • 1,024-token context window

License

Apache 2.0


Citation

@misc{gptx2025,
  title={GPT-X: A Modern Llama-Style Language Model at 125M Scale},
  author={Datdanboi25},
  year={2025},
  url={https://huggingface.co/Datdanboi25/GPT-X-125m-15bt}
}