Adaptive-RETRO-GPT-1B

Adaptive-RETRO-GPT-1B is a RETRO-inspired, retrieval-pretrained, decoder-only language model. Unlike a standard RAG system, which only injects retrieved text at inference time, this model is trained with retrieved chunks available during next-token language modeling.
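The card does not spell out the retrieval architecture beyond the bullets below, but the general pattern it names (cross-attention into encoded retrieved chunks, scaled by a learned adaptive gate) can be sketched as follows. All module and argument names, the per-token sigmoid gate, and the near-closed initialization are illustrative assumptions, not this repo's actual code.

```python
import torch
import torch.nn as nn

class GatedRetrievalCrossAttention(nn.Module):
    """Illustrative retrieval block: decoder hidden states cross-attend into
    encoded retrieved chunks, and a learned per-token gate decides how much
    of the retrieved signal enters the residual stream."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate biased strongly negative so training starts close to the dense
        # (no-retrieval) model and only opens where retrieval lowers the loss.
        self.gate_proj = nn.Linear(d_model, 1)
        nn.init.constant_(self.gate_proj.bias, -10.0)

    def forward(self, hidden: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # hidden:    (batch, seq_len, d_model)            decoder states
        # retrieved: (batch, k * retrieval_len, d_model)  encoded retrieved chunks
        attn_out, _ = self.cross_attn(self.norm(hidden), retrieved, retrieved)
        gate = torch.sigmoid(self.gate_proj(hidden))      # (batch, seq_len, 1)
        # The mean of `gate` is the kind of quantity reported as gate_mean below.
        return hidden + gate * attn_out
```

A near-closed gate at initialization would be consistent with the very small gate_mean values in the metrics below, but the actual gate parameterization used in this run is not documented in the card.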

Training Setup

  • Objective: next-token language modeling
  • Backbone: decoder-only GPT
  • Retrieval: external chunk datastore, top-k 2, retrieval sequence length 512
  • Retrieval mechanism: cross-attention layers plus learned adaptive retrieval gate
  • Retrieval regularization: retrieval budget loss 0.001
  • Retrieval robustness: no-retrieval probability 0.1, random-retrieval probability 0.1 (see the training-step sketch after this list)
  • Retrieval layers: 5, 11, 17
  • Pretraining dataset: HuggingFaceFW/fineweb-edu / sample-10BT
  • Datastore dataset: wikimedia/wikipedia / 20231101.en
  • Sequence length: 2048
  • Parameters: 1,172,146,179
  • Weight format: BF16 (Safetensors)
  • Checkpoint step: 20000
  • Related corpus repo: kyLELEng/adaptive-retro-gpt-1b-corpus
  • Related datastore repo: kyLELEng/adaptive-retro-gpt-1b-datastore
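As referenced in the robustness bullet above, the sketch below shows one way the no-retrieval probability, random-retrieval probability, and retrieval budget loss could enter a training step. `model`, `datastore`, and their methods are hypothetical stand-ins, not this repo's API; only the three constants come from the list above.

```python
import random

NO_RETRIEVAL_P = 0.1      # no-retrieval probability
RANDOM_RETRIEVAL_P = 0.1  # random-retrieval probability
BUDGET_WEIGHT = 0.001     # retrieval budget loss (assumed to be a loss coefficient)

def training_step(model, batch, datastore, optimizer):
    # Mix retrieval conditions so the model stays usable with retrieval off
    # and does not learn to trust irrelevant chunks.
    r = random.random()
    if r < NO_RETRIEVAL_P:
        retrieved = None                                             # retrieval off
    elif r < NO_RETRIEVAL_P + RANDOM_RETRIEVAL_P:
        retrieved = datastore.sample_random_chunks(k=2, length=512)  # noisy retrieval
    else:
        retrieved = datastore.retrieve(batch["input_ids"], k=2, length=512)

    out = model(batch["input_ids"], retrieved_chunks=retrieved)

    # Next-token LM loss plus a small penalty on mean gate openness, nudging
    # the model to rely on retrieval only where it actually helps.
    loss = out.lm_loss + BUDGET_WEIGHT * out.gate_mean
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss
```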

Latest Metrics

```json
{
  "step": 20000,
  "retrieval_on": {
    "loss": 1.7580267190933228,
    "lm_loss": 1.7580267190933228,
    "ppl": 5.800979131574639,
    "gate_mean": 1.749867806211114e-06
  },
  "retrieval_off": {
    "loss": 1.7650717496871948,
    "lm_loss": 1.7650717496871948,
    "ppl": 5.841991504112031,
    "gate_mean": 0.0
  },
  "random_retrieval": {
    "loss": 1.7536429166793823,
    "lm_loss": 1.7536429166793823,
    "ppl": 5.775604444698179,
    "gate_mean": 1.7668644431978464e-06
  },
  "delta_lm_loss_off_minus_on": 0.00704503059387207,
  "delta_lm_loss_random_minus_on": -0.00438380241394043
}
```

The evaluation compares retrieval-on, retrieval-off, and random-retrieval modes. This is the main ablation for checking whether the trained model uses retrieved context productively and whether it is robust to noisy retrieval.
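The deltas in the metrics block follow directly from the per-mode lm_loss values, and the reported perplexities are consistent with ppl = exp(lm_loss); a quick check:

```python
import math

# Reported lm_loss values at step 20000.
on, off, rnd = 1.7580267190933228, 1.7650717496871948, 1.7536429166793823

print(math.exp(on))  # 5.8009... matches the retrieval_on ppl
print(off - on)      # 0.00704... = delta_lm_loss_off_minus_on
print(rnd - on)      # -0.00438... = delta_lm_loss_random_minus_on
```

At this checkpoint the three modes differ by less than 0.01 in LM loss and gate_mean is on the order of 1e-6, which suggests the retrieval gate is still essentially closed at step 20000.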

Research Use

This is an experimental RETRO-style pretraining run for comparing retrieval-pretrained GPT models against dense GPT baselines at similar training budgets. It is not instruction tuned and should not be used as a factual assistant without further evaluation.
