GPT-2 XL Revived
Continued pretraining and instruction tuning of GPT-2 XL

Base Model: GPT-2 XL
Status: Ongoing research project


Keep in mind that this project will take time and isn't guaranteed to work.


About

This is an experimental revival of GPT-2 XL (1.5B parameters) using modern training practices.

The model is not trained from scratch. Instead, GPT-2 XL is continued-pretrained on more diverse data and later instruction-tuned to improve conversational behavior.

The goal is to see how much performance can be recovered from an older architecture when trained with better data.


Model Details

  • Architecture: GPT-2 XL (decoder-only transformer)
  • Parameters: ~1.5B
  • Tokenizer: GPT-2 BPE
  • Original context length: 1024 tokens
  • Target context length: 2048 tokens (experimental)

No architectural changes are made.


Training

Continued Pretraining

The model is continued-pretrained on a diverse, lightly curated dataset, including:

  • Samples from The Pile (web text, forums, books, Q&A)
  • Math and reasoning datasets with step-by-step solutions

Aggressive filtering is avoided to preserve data diversity.

Instruction Tuning

Instruction tuning uses UltraChat-200k, focusing on multi-turn conversations.


Context Length

Context length extension is attempted by resizing absolute positional embeddings.

If native extension is unstable, a sliding context window is used at inference time.
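The embedding resize can be sketched as follows. This is an illustrative helper (not the project's exact code), assuming a PyTorch-style model where absolute positions live in a learned embedding table such as GPT-2's `wpe`: trained rows are copied verbatim, and the new positions are seeded by interpolating the old table.

```python
import torch
import torch.nn as nn

def resize_positional_embeddings(old_emb: nn.Embedding, new_len: int) -> nn.Embedding:
    """Grow a learned absolute positional embedding table.

    Trained rows are copied unchanged; the extra positions are initialized
    by linearly interpolating the old table so fine-tuning starts close to
    the learned manifold. Illustrative sketch only.
    """
    old_len, dim = old_emb.weight.shape
    new_emb = nn.Embedding(new_len, dim)
    with torch.no_grad():
        new_emb.weight[:old_len] = old_emb.weight
        # Stretch the old table across the new length, then take the tail
        # rows as initialization for the added positions.
        stretched = torch.nn.functional.interpolate(
            old_emb.weight.T.unsqueeze(0),      # (1, dim, old_len)
            size=new_len, mode="linear", align_corners=True,
        ).squeeze(0).T                          # (new_len, dim)
        new_emb.weight[old_len:] = stretched[old_len:]
    return new_emb

wpe = nn.Embedding(1024, 1600)                  # GPT-2 XL: 1024 positions, d_model 1600
wpe_2048 = resize_positional_embeddings(wpe, 2048)
```

The copied rows keep short-context behavior intact; only the new positions need to be trained in, which is why a very low learning rate is used.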


Evaluation

Evaluation is primarily qualitative and comparative, focusing on:

  • conversation quality
  • instruction following
  • math reasoning
  • multi-turn context retention

Comparisons include base GPT-2 XL and small modern instruction-tuned models.


Intended Use

This model is intended for:

  • research
  • experimentation
  • educational purposes

It is not intended for safety-critical or production use.


Limitations

  • Limited by GPT-2 architecture
  • Absolute positional embeddings restrict very long contexts
  • No modern alignment or safety fine-tuning

Project Status

This is an ongoing project. Checkpoints and documentation will be updated as training progresses.


Roadmap

Goal: see how far GPT-2 XL (1.5B) can be pushed using continued pretraining, modest context extension, and instruction tuning, without changing the architecture.

Constraints

  • Single consumer GPU (RTX 3060-class)
  • Long-running training
  • Checkpoint early, checkpoint often

Phase 0: Baseline

Purpose: Know what "bad" looks like.

  • Run stock GPT-2 XL
  • Test: chat, math, multi-turn, code
  • Save generations and notes

Output: baseline.md


Phase 1: Data Prep

Principle: Variety > cleanliness.

  • Sample The Pile (web, forums, books, Q&A)
  • Add math datasets with step-by-step reasoning
  • Light dedup only
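The light-dedup step could look like the following sketch (a hypothetical helper, not the project's actual pipeline): hash a whitespace- and case-normalized form of each document and keep only the first occurrence, so near-duplicates survive and the mix stays diverse.

```python
import hashlib

def light_dedup(docs):
    """Drop exact duplicates after trivial normalization.

    Only whitespace and case are normalized, so near-duplicates are
    deliberately kept (variety > cleanliness).
    """
    seen = set()
    kept = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

corpus = ["Hello  world", "hello world", "A different document"]
deduped = light_dedup(corpus)
```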

Output: dataset mix + token estimate


Phase 2: Continued Pretraining

Purpose: Pay off data debt.

  • fp16 + 8-bit optimizer
  • Gradient accumulation
  • Frequent checkpoints
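The accumulation pattern can be sketched with a toy model (illustration only; on the actual run the optimizer would be an 8-bit AdamW such as bitsandbytes' `AdamW8bit`, with fp16 autocast, but the loop structure is the same):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model and data.
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

accum_steps = 8                        # effective batch = micro-batch x accum_steps
micro_batches = [torch.randn(4, 16) for _ in range(32)]
num_updates = 0

optimizer.zero_grad()
for step, batch in enumerate(micro_batches, start=1):
    loss = model(batch).pow(2).mean()
    (loss / accum_steps).backward()    # scale so summed grads match one big batch
    if step % accum_steps == 0:
        optimizer.step()               # one optimizer step per accum_steps micro-batches
        optimizer.zero_grad()
        num_updates += 1
```

Gradient accumulation is what makes a 1.5B-parameter run fit on a single consumer GPU: the micro-batch stays small while the effective batch stays large.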

Scale

  • Minimum: 5-10B tokens
  • Stretch: 20-30B tokens

Watch for

  • loss plateaus
  • repetition
  • instability

Output: pretrained checkpoints + logs


Phase 3: Context Extension

Primary

  • Resize positional embeddings
  • 1024 → 2048 (4096 only if stable)
  • Very low LR
  • Mostly short context, some long

Fallback

  • Stop at 2048 if unstable
  • Use sliding context window at inference
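The fallback is simple to state in code. As a minimal sketch over raw token ids (hypothetical helper, independent of any particular inference library): keep only the most recent tokens, reserving room for generation, and accept that earlier context is dropped.

```python
def sliding_window(ids, window=2048, reserve=128):
    """Trim a token-id list so prompt + generation fit the model's window.

    Keeps the most recent tokens, reserving `reserve` slots for newly
    generated tokens. Earlier context is simply dropped (the trade-off
    of this fallback).
    """
    budget = window - reserve
    return ids[-budget:] if len(ids) > budget else ids

ctx = list(range(5000))      # pretend token ids from a long conversation
trimmed = sliding_window(ctx)
```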

Output: long-context samples + failure notes


Phase 4: Quantization

Purpose: Make instruction tuning affordable.

  • Quantize base model to int4
  • Freeze base weights
  • Verify quality holds
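To show the idea behind the int4 step, here is a symmetric per-group quantizer in plain PyTorch (illustration only; the actual run would more likely use an off-the-shelf scheme such as bitsandbytes NF4 or GPTQ):

```python
import torch

def quantize_int4(w: torch.Tensor, group: int = 64):
    """Symmetric per-group int4 quantization: one fp scale per group of
    `group` weights, integer codes in [-8, 7]."""
    flat = w.reshape(-1, group)
    scale = flat.abs().amax(dim=1, keepdim=True) / 7   # map the group max to code 7
    q = torch.clamp((flat / scale).round(), -8, 7).to(torch.int8)
    return q, scale

def dequantize_int4(q, scale, shape):
    """Reconstruct approximate fp weights from codes and scales."""
    return (q.float() * scale).reshape(shape)

w = torch.randn(128, 128)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s, w.shape)
```

The reconstruction error per weight is bounded by half a quantization step, which is what the "verify quality holds" check is guarding.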

Output: quantized base checkpoint


Phase 5: Instruction Tuning

Dataset

  • UltraChat-200k (multi-turn)

Method

  • LoRA / QLoRA
  • Small LR
  • Few epochs
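The core of LoRA fits in a few lines. A minimal sketch of the idea (the actual run would use the `peft` library on top of the quantized base, i.e. QLoRA): freeze the base linear layer and learn only a low-rank update.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x. Minimal sketch of the LoRA idea."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # base stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init:
        self.scale = alpha / r                        # update starts at zero

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(1600, 1600))             # d_model 1600 as in GPT-2 XL
x = torch.randn(2, 1600)
out = layer(x)
```

With `B` initialized to zero, the adapted layer starts out exactly equal to the frozen base, so tuning begins from the pretrained behavior; only the small `A` and `B` matrices receive gradients.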

Output: instruction-tuned model


Phase 6: Evaluation

Compare

  • Base GPT-2 XL
  • Alpaca-tuned Pythia-1B
  • Qualitative reference to modern 7B

Focus

  • conversation quality
  • instruction following
  • math reasoning
  • context retention

Output: final_eval.md


Phase 7: Write-Up

Purpose: Make it matter.

  • What worked
  • What failed
  • Why data mattered
  • Why architecture wasn't the bottleneck

Output: case study + repo documentation
