GPT-2 XL Revived
Continued pretraining and instruction tuning of GPT-2 XL
Base model: GPT-2 XL
Status: Ongoing research project
Keep in mind that this project will take time and is not guaranteed to work.
About
This is an experimental revival of GPT-2 XL (1.5B parameters) using modern training practices.
The model is not trained from scratch. Instead, GPT-2 XL is continued-pretrained on more diverse data and later instruction-tuned to improve conversational behavior.
The goal is to see how much performance can be recovered from an older architecture when trained with better data.
Model Details
- Architecture: GPT-2 XL (decoder-only transformer)
- Parameters: ~1.5B
- Tokenizer: GPT-2 BPE
- Original context length: 1024 tokens
- Target context length: 2048 tokens (experimental)
No architectural changes are made.
Training
Continued Pretraining
The model is continued-pretrained on a diverse, lightly curated dataset, including:
- Samples from The Pile (web text, forums, books, Q&A)
- Math and reasoning datasets with step-by-step solutions
Aggressive filtering is avoided to preserve data diversity.
Instruction Tuning
Instruction tuning uses UltraChat-200k, focusing on multi-turn conversations.
Context Length
Context length extension is attempted by resizing absolute positional embeddings.
If native extension is unstable, a sliding context window is used at inference time.
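One way to picture the resize is linear interpolation of the learned position table, a minimal numpy sketch (in a real run you would operate on `model.transformer.wpe.weight` in Hugging Face Transformers and then fine-tune at a very low learning rate; interpolation is only one of several reasonable initializations):

```python
import numpy as np

def extend_positional_embeddings(wpe: np.ndarray, new_len: int) -> np.ndarray:
    """Stretch an absolute positional embedding table to new_len rows
    by linear interpolation along the position axis."""
    old_len, dim = wpe.shape
    # Map each new position onto the old [0, old_len - 1] range.
    positions = np.linspace(0, old_len - 1, new_len)
    lo = np.floor(positions).astype(int)
    hi = np.minimum(lo + 1, old_len - 1)
    frac = (positions - lo)[:, None]
    return (1.0 - frac) * wpe[lo] + frac * wpe[hi]

# Toy table: 1024 positions, 16-dim embeddings (GPT-2 XL's real hidden size is 1600).
wpe = np.random.randn(1024, 16)
wpe_2048 = extend_positional_embeddings(wpe, 2048)
print(wpe_2048.shape)  # (2048, 16)
```

The interpolated table keeps the original first and last positions exactly, so short-context behavior is preserved while the new intermediate positions start from plausible values.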
Evaluation
Evaluation is primarily qualitative and comparative, focusing on:
- conversation quality
- instruction following
- math reasoning
- multi-turn context retention
Comparisons include base GPT-2 XL and small modern instruction-tuned models.
Intended Use
This model is intended for:
- research
- experimentation
- educational purposes
It is not intended for safety-critical or production use.
Limitations
- Limited by GPT-2 architecture
- Absolute positional embeddings restrict very long contexts
- No modern alignment or safety fine-tuning
Project Status
This is an ongoing project. Checkpoints and documentation will be updated as training progresses.
Roadmap
Goal: See how far GPT-2 XL (1.5B) can be pushed using continued pretraining, modest context extension, and instruction tuning, without changing the architecture.
Constraints
- Single consumer GPU (RTX 3060-class)
- Long-running training
- Checkpoint early, checkpoint often
Phase 0: Baseline
Purpose: Know what "bad" looks like.
- Run stock GPT-2 XL
- Test: chat, math, multi-turn, code
- Save generations and notes
Output: baseline.md
Phase 1: Data Prep
Principle: Variety > cleanliness.
- Sample The Pile (web, forums, books, Q&A)
- Add math datasets with step-by-step reasoning
- Light dedup only
Output: dataset mix + token estimate
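A back-of-the-envelope token estimate can come straight from raw text size, using the common heuristic of roughly 4 characters per GPT-2 BPE token on English text. The mix sizes below are hypothetical placeholders, not the project's actual numbers:

```python
# Rough token budgeting for the pretraining mix, assuming the common
# ~4 characters per token heuristic for GPT-2 BPE on English text.
CHARS_PER_TOKEN = 4  # heuristic, not exact

# Hypothetical mix: raw UTF-8 text size in GB per source.
mix_gb = {
    "pile_web": 20,
    "pile_books": 10,
    "math_reasoning": 5,
}

def estimate_tokens(gigabytes: float, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Approximate token count from raw text size (1 byte ~ 1 char for English)."""
    return int(gigabytes * 1e9 / chars_per_token)

total_tokens = sum(estimate_tokens(gb) for gb in mix_gb.values())
print(f"{total_tokens / 1e9:.2f}B tokens")  # 8.75B for this toy mix
```

For an exact count, tokenize a sample of each source with the GPT-2 tokenizer and scale up; the heuristic is only for first-pass planning against the 5-10B token minimum.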
Phase 2: Continued Pretraining
Purpose: Pay off data debt.
- fp16 + 8-bit optimizer
- Gradient accumulation
- Frequent checkpoints
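Gradient accumulation is what makes a 1.5B model trainable on a 12 GB card: the optimizer steps only after several micro-batches, so the effective batch size stays large. A schematic with hypothetical batch sizes (the real loop would use PyTorch with fp16 autocast and a bitsandbytes 8-bit optimizer):

```python
# Schematic of gradient accumulation: the optimizer steps every
# `accum_steps` micro-batches, so the effective batch size is
# micro_batch_size * accum_steps even on a single consumer GPU.
micro_batch_size = 2   # hypothetical: what fits in VRAM at 1024 tokens
accum_steps = 32       # accumulate gradients before each optimizer step
effective_batch = micro_batch_size * accum_steps  # 64 sequences per update

optimizer_steps = 0
for step in range(1, 129):       # 128 micro-batches
    # forward(); loss.backward() would run here, accumulating gradients
    if step % accum_steps == 0:
        optimizer_steps += 1     # optimizer.step(); optimizer.zero_grad()

print(effective_batch, optimizer_steps)  # 64 4
```

Note that with accumulation the loss must be divided by `accum_steps` before `backward()` so the accumulated gradient matches a true large batch.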
Scale
- Minimum: 5-10B tokens
- Stretch: 20-30B tokens
Watch for
- loss plateaus
- repetition
- instability
Output: pretrained checkpoints + logs
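A plateau is easy to eyeball on a loss curve but also easy to automate. A minimal sketch, assuming a simple two-window moving-average check (window size and threshold are illustrative, not tuned values):

```python
def is_plateaued(losses, window=100, min_improvement=0.01):
    """Flag a plateau when the mean of the last `window` losses improves
    on the previous window's mean by less than `min_improvement`."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    prev = sum(losses[-2 * window:-window]) / window
    curr = sum(losses[-window:]) / window
    return (prev - curr) < min_improvement

# Steadily falling loss -> no plateau; a flat tail -> plateau.
falling = [3.0 - 0.005 * i for i in range(400)]
flat = falling[:200] + [2.0] * 200
print(is_plateaued(falling), is_plateaued(flat))  # False True
```

A check like this can gate automatic actions, such as lowering the learning rate or rotating in a fresh data shard, instead of burning GPU-days on a stalled run.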
Phase 3: Context Extension
Primary
- Resize positional embeddings
- 1024 → 2048 (4096 only if stable)
- Very low LR
- Mostly short context, some long
Fallback
- Stop at 2048 if unstable
- Use sliding context window at inference
Output: long-context samples + failure notes
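The sliding-window fallback operates on token ids, not model weights: when the prompt exceeds the model's position limit, drop tokens from the middle. A minimal sketch of one common variant that preserves a short prefix (e.g. system or instruction tokens); the `keep_prefix` idea is an assumption, not something the fallback above commits to:

```python
def sliding_window(tokens, max_context=1024, keep_prefix=64):
    """Truncate a prompt for a model capped at `max_context` positions:
    keep a short prefix plus the most recent tokens that still fit."""
    if len(tokens) <= max_context:
        return tokens
    tail = max_context - keep_prefix
    return tokens[:keep_prefix] + tokens[-tail:]

tokens = list(range(3000))  # stand-in for token ids
window = sliding_window(tokens)
print(len(window), window[0], window[-1])  # 1024 0 2999
```

The trade-off is obvious but worth stating: anything scrolled out of the window is simply gone, which is exactly what the multi-turn context-retention evaluation should expose.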
Phase 4: Quantization
Purpose: Make instruction tuning affordable.
- Quantize base model to int4
- Freeze base weights
- Verify quality holds
Output: quantized base checkpoint
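To make the int4 step concrete, here is a deliberately simplified numpy sketch of symmetric per-tensor 4-bit quantization. Real QLoRA-style tooling (e.g. bitsandbytes) uses blockwise NF4 with per-block scales, which loses far less quality; this only illustrates the core round-trip:

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor int4 quantization: map weights onto the
    integer levels [-8, 7] with a single float scale."""
    scale = float(np.abs(w).max()) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit values in an int8 container
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())
print(q.min(), q.max())  # error stays within half a quantization step
```

"Verify quality holds" then means comparing generations (and ideally perplexity on a held-out set) between `w` and `w_hat`-style weights before freezing the quantized base.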
Phase 5: Instruction Tuning
Dataset
- UltraChat-200k (multi-turn)
Method
- LoRA / QLoRA
- Small LR
- Few epochs
Output: instruction-tuned model
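The LoRA idea fits in a few lines of math: the frozen base weight W stays fixed (and, under QLoRA, int4-quantized) while only a low-rank update (alpha / r) * B @ A is trained. A numpy sketch; d = 1600 is GPT-2 XL's hidden size, while r and alpha are typical but illustrative choices:

```python
import numpy as np

d, r, alpha = 1600, 8, 16   # hidden size of GPT-2 XL; rank and scaling are illustrative

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)).astype(np.float32)          # frozen base weight
A = rng.standard_normal((r, d)).astype(np.float32) * 0.01   # trainable, small random init
B = np.zeros((d, r), dtype=np.float32)                      # trainable, zero-init

def lora_forward(x):
    # Base projection plus the scaled low-rank correction.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((2, d)).astype(np.float32)
# With B zero-initialized, the adapted layer starts out identical to the base layer.
print(np.allclose(lora_forward(x), x @ W.T))  # True
```

The trainable footprint is 2 * d * r parameters per adapted matrix, here 25,600 versus 2,560,000 frozen, which is why a few LoRA epochs over UltraChat-200k are affordable on a single consumer GPU.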
Phase 6: Evaluation
Compare
- Base GPT-2 XL
- Alpaca-tuned Pythia-1B
- Qualitative reference to a modern 7B model
Focus
- conversation quality
- instruction following
- math reasoning
- context retention
Output: final_eval.md
Phase 7: Write-Up
Purpose: Make it matter.
- What worked
- What failed
- Why data mattered
- Why architecture wasn't the bottleneck
Output: case study + repo documentation
Model tree for hashtagg1/gpt-2-xl-revival
Base model: openai-community/gpt2-xl