GPT-2 XL Revived
Continued pretraining and instruction tuning of GPT-2 XL
Base model: GPT-2 XL
Status: Ongoing research project
Keep in mind that this project will take time and is not guaranteed to work.
About
This is an experimental revival of GPT-2 XL (1.5B parameters) using modern training practices.
The model is not trained from scratch. Instead, GPT-2 XL is continued-pretrained on more diverse data and later instruction-tuned to improve conversational behavior.
The goal is to see how much performance can be recovered from an older architecture when trained with better data.
Model Details
- Architecture: GPT-2 XL (decoder-only transformer)
- Parameters: ~1.5B
- Tokenizer: GPT-2 BPE
- Original context length: 1024 tokens
- Target context length: 2048 tokens (experimental)
No architectural changes are made.
Training
Continued Pretraining
The model is continued-pretrained on a diverse, lightly curated dataset, including:
- Samples from The Pile (web text, forums, books, Q&A)
- Math and reasoning datasets with step-by-step solutions
Aggressive filtering is avoided to preserve data diversity.
Instruction Tuning
Instruction tuning uses UltraChat-200k, focusing on multi-turn conversations.
Context Length
Context length extension is attempted by resizing absolute positional embeddings.
If native extension is unstable, a sliding context window is used at inference time.
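One way to picture the resize is linear interpolation of the learned position table, a minimal numpy sketch (in a real run you would operate on `model.transformer.wpe.weight` in Hugging Face Transformers and then fine-tune at a very low learning rate; interpolation is only one of several reasonable initializations):

```python
import numpy as np

def extend_positional_embeddings(wpe: np.ndarray, new_len: int) -> np.ndarray:
    """Stretch an absolute positional embedding table to new_len rows
    by linear interpolation along the position axis."""
    old_len, dim = wpe.shape
    # Map each new position onto the old [0, old_len - 1] range.
    positions = np.linspace(0, old_len - 1, new_len)
    lo = np.floor(positions).astype(int)
    hi = np.minimum(lo + 1, old_len - 1)
    frac = (positions - lo)[:, None]
    return (1.0 - frac) * wpe[lo] + frac * wpe[hi]

# Toy table: 1024 positions, 16-dim embeddings (GPT-2 XL's real hidden size is 1600).
wpe = np.random.randn(1024, 16)
wpe_2048 = extend_positional_embeddings(wpe, 2048)
print(wpe_2048.shape)  # (2048, 16)
```

The interpolated table keeps the original first and last positions exactly, so short-context behavior is preserved while the new intermediate positions start from plausible values.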
Evaluation
Evaluation is primarily qualitative and comparative, focusing on:
- conversation quality
- instruction following
- math reasoning
- multi-turn context retention
Comparisons include base GPT-2 XL and small modern instruction-tuned models.
Intended Use
This model is intended for:
- research
- experimentation
- educational purposes
It is not intended for safety-critical or production use.
Limitations
- Limited by GPT-2 architecture
- Absolute positional embeddings restrict very long contexts
- No modern alignment or safety fine-tuning
Project Status
This is an ongoing project. Checkpoints and documentation will be updated as training progresses.
Roadmap
Goal: See how far GPT-2 XL (1.5B) can be pushed using continued pretraining, modest context extension, and instruction tuning, without changing the architecture.
Constraints
- Single consumer GPU (RTX 3060-class)
- Long-running training
- Checkpoint early, checkpoint often
Phase 0: Baseline
Purpose: Know what "bad" looks like.
- Run stock GPT-2 XL
- Test: chat, math, multi-turn, code
- Save generations and notes
Output: baseline.md
Phase 1: Data Prep
Principle: Variety > cleanliness.
- Sample The Pile (web, forums, books, Q&A)
- Add math datasets with step-by-step reasoning
- Light dedup only
Output: dataset mix + token estimate
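A back-of-the-envelope token estimate can come straight from raw text size, using the common heuristic of roughly 4 characters per GPT-2 BPE token on English text. The mix sizes below are hypothetical placeholders, not the project's actual numbers:

```python
# Rough token budgeting for the pretraining mix, assuming the common
# ~4 characters per token heuristic for GPT-2 BPE on English text.
CHARS_PER_TOKEN = 4  # heuristic, not exact

# Hypothetical mix: raw UTF-8 text size in GB per source.
mix_gb = {
    "pile_web": 20,
    "pile_books": 10,
    "math_reasoning": 5,
}

def estimate_tokens(gigabytes: float, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Approximate token count from raw text size (1 byte ~ 1 char for English)."""
    return int(gigabytes * 1e9 / chars_per_token)

total_tokens = sum(estimate_tokens(gb) for gb in mix_gb.values())
print(f"{total_tokens / 1e9:.2f}B tokens")  # 8.75B for this toy mix
```

For an exact count, tokenize a sample of each source with the GPT-2 tokenizer and scale up; the heuristic is only for first-pass planning against the 5-10B token minimum.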
Phase 2: Continued Pretraining
Purpose: Pay off data debt.
- fp16 + 8-bit optimizer
- Gradient accumulation
- Frequent checkpoints
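Gradient accumulation is what makes a 1.5B model trainable on a 12 GB card: the optimizer steps only after several micro-batches, so the effective batch size stays large. A schematic with hypothetical batch sizes (the real loop would use PyTorch with fp16 autocast and a bitsandbytes 8-bit optimizer):

```python
# Schematic of gradient accumulation: the optimizer steps every
# `accum_steps` micro-batches, so the effective batch size is
# micro_batch_size * accum_steps even on a single consumer GPU.
micro_batch_size = 2   # hypothetical: what fits in VRAM at 1024 tokens
accum_steps = 32       # accumulate gradients before each optimizer step
effective_batch = micro_batch_size * accum_steps  # 64 sequences per update

optimizer_steps = 0
for step in range(1, 129):       # 128 micro-batches
    # forward(); loss.backward() would run here, accumulating gradients
    if step % accum_steps == 0:
        optimizer_steps += 1     # optimizer.step(); optimizer.zero_grad()

print(effective_batch, optimizer_steps)  # 64 4
```

Note that with accumulation the loss must be divided by `accum_steps` before `backward()` so the accumulated gradient matches a true large batch.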
Scale
- Minimum: 5-10B tokens
- Stretch: 20-30B tokens
Watch for
- loss plateaus
- repetition
- instability
Output: pretrained checkpoints + logs
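A plateau is easy to eyeball on a loss curve but also easy to automate. A minimal sketch, assuming a simple two-window moving-average check (window size and threshold are illustrative, not tuned values):

```python
def is_plateaued(losses, window=100, min_improvement=0.01):
    """Flag a plateau when the mean of the last `window` losses improves
    on the previous window's mean by less than `min_improvement`."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    prev = sum(losses[-2 * window:-window]) / window
    curr = sum(losses[-window:]) / window
    return (prev - curr) < min_improvement

# Steadily falling loss -> no plateau; a flat tail -> plateau.
falling = [3.0 - 0.005 * i for i in range(400)]
flat = falling[:200] + [2.0] * 200
print(is_plateaued(falling), is_plateaued(flat))  # False True
```

A check like this can gate automatic actions, such as lowering the learning rate or rotating in a fresh data shard, instead of burning GPU-days on a stalled run.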
Phase 3: Context Extension
Primary
- Resize positional embeddings
- 1024 → 2048 (4096 only if stable)
- Very low LR
- Mostly short context, some long
Fallback
- Stop at 2048 if unstable
- Use sliding context window at inference
Output: long-context samples + failure notes
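The sliding-window fallback operates on token ids, not model weights: when the prompt exceeds the model's position limit, drop tokens from the middle. A minimal sketch of one common variant that preserves a short prefix (e.g. system or instruction tokens); the `keep_prefix` idea is an assumption, not something the fallback above commits to:

```python
def sliding_window(tokens, max_context=1024, keep_prefix=64):
    """Truncate a prompt for a model capped at `max_context` positions:
    keep a short prefix plus the most recent tokens that still fit."""
    if len(tokens) <= max_context:
        return tokens
    tail = max_context - keep_prefix
    return tokens[:keep_prefix] + tokens[-tail:]

tokens = list(range(3000))  # stand-in for token ids
window = sliding_window(tokens)
print(len(window), window[0], window[-1])  # 1024 0 2999
```

The trade-off is obvious but worth stating: anything scrolled out of the window is simply gone, which is exactly what the multi-turn context-retention evaluation should expose.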
Phase 4: Quantization
Purpose: Make instruction tuning affordable.
- Quantize base model to int4
- Freeze base weights
- Verify quality holds
Output: quantized base checkpoint
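To make the int4 step concrete, here is a deliberately simplified numpy sketch of symmetric per-tensor 4-bit quantization. Real QLoRA-style tooling (e.g. bitsandbytes) uses blockwise NF4 with per-block scales, which loses far less quality; this only illustrates the core round-trip:

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor int4 quantization: map weights onto the
    integer levels [-8, 7] with a single float scale."""
    scale = float(np.abs(w).max()) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit values in an int8 container
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())
print(q.min(), q.max())  # error stays within half a quantization step
```

"Verify quality holds" then means comparing generations (and ideally perplexity on a held-out set) between `w` and `w_hat`-style weights before freezing the quantized base.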
Phase 5: Instruction Tuning
Dataset
- UltraChat-200k (multi-turn)
Method
- LoRA / QLoRA
- Small LR
- Few epochs
Output: instruction-tuned model
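The LoRA idea fits in a few lines of math: the frozen base weight W stays fixed (and, under QLoRA, int4-quantized) while only a low-rank update (alpha / r) * B @ A is trained. A numpy sketch; d = 1600 is GPT-2 XL's hidden size, while r and alpha are typical but illustrative choices:

```python
import numpy as np

d, r, alpha = 1600, 8, 16   # hidden size of GPT-2 XL; rank and scaling are illustrative

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)).astype(np.float32)          # frozen base weight
A = rng.standard_normal((r, d)).astype(np.float32) * 0.01   # trainable, small random init
B = np.zeros((d, r), dtype=np.float32)                      # trainable, zero-init

def lora_forward(x):
    # Base projection plus the scaled low-rank correction.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((2, d)).astype(np.float32)
# With B zero-initialized, the adapted layer starts out identical to the base layer.
print(np.allclose(lora_forward(x), x @ W.T))  # True
```

The trainable footprint is 2 * d * r parameters per adapted matrix, here 25,600 versus 2,560,000 frozen, which is why a few LoRA epochs over UltraChat-200k are affordable on a single consumer GPU.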
Phase 6: Evaluation
Compare
- Base GPT-2 XL
- Alpaca-tuned Pythia-1B
- Qualitative reference to a modern 7B model
Focus
- conversation quality
- instruction following
- math reasoning
- context retention
Output: final_eval.md
Phase 7: Write-Up
Purpose: Make it matter.
- What worked
- What failed
- Why data mattered
- Why architecture wasn't the bottleneck
Output: case study + repo documentation
Model tree for hashtagg1/gpt-2-xl-revival
Base model: openai-community/gpt2-xl