---
library_name: pytorch
tags:
- text-generation
- causal-lm
- retrieval-augmented
- retro
- pretraining
- adaptive-retrieval
datasets:
- HuggingFaceFW/fineweb-edu
- wikimedia/wikipedia
---

# Adaptive-RETRO-GPT-1B

Adaptive-RETRO-GPT-1B is a RETRO-inspired retrieval-pretrained decoder-only language model. Unlike a standard RAG system that only adds retrieved text at inference time, this model is trained with retrieved chunks available during next-token language modeling.

## Training Setup

- Objective: next-token language modeling
- Backbone: decoder-only GPT
- Retrieval: external chunk datastore, top-k `2`, retrieval sequence length `512`
- Retrieval mechanism: cross-attention layers plus learned adaptive retrieval gate
- Retrieval regularization: retrieval budget loss `0.001`
- Retrieval robustness: no-retrieval probability `0.1`, random-retrieval probability `0.1`
- Retrieval layers: `5,11,17`
- Pretraining dataset: `HuggingFaceFW/fineweb-edu` / `sample-10BT`
- Datastore dataset: `wikimedia/wikipedia` / `20231101.en`
- Sequence length: `2048`
- Parameters: `1,172,146,179`
- Checkpoint step: `20000`
- Related corpus repo: [`kyLELEng/adaptive-retro-gpt-1b-corpus`](https://huggingface.co/datasets/kyLELEng/adaptive-retro-gpt-1b-corpus)
- Related datastore repo: [`kyLELEng/adaptive-retro-gpt-1b-datastore`](https://huggingface.co/datasets/kyLELEng/adaptive-retro-gpt-1b-datastore)
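
The retrieval robustness and regularization settings can be illustrated with a minimal sketch. It assumes the mode is drawn independently per training example and that the budget term is the listed weight times the mean gate value; neither detail is confirmed by this card, and the names `RetrievalConfig`, `sample_retrieval_mode`, and `training_loss` are hypothetical.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    # Values copied from the training setup; the field names are hypothetical.
    top_k: int = 2
    retrieval_seq_len: int = 512
    retrieval_layers: tuple = (5, 11, 17)
    budget_loss_weight: float = 0.001
    p_no_retrieval: float = 0.1
    p_random_retrieval: float = 0.1


def sample_retrieval_mode(cfg: RetrievalConfig, rng: random.Random) -> str:
    """Draw the retrieval mode for one training example (assumed scheme)."""
    u = rng.random()  # uniform in [0, 1)
    if u < cfg.p_no_retrieval:
        return "off"  # retrieved chunks are dropped entirely
    if u < cfg.p_no_retrieval + cfg.p_random_retrieval:
        return "random"  # retrieved chunks are replaced by unrelated ones
    return "on"  # true nearest-neighbour chunks are used


def training_loss(lm_loss: float, gate_mean: float, cfg: RetrievalConfig) -> float:
    """Assumed objective: LM loss plus a weighted penalty on the mean gate value."""
    return lm_loss + cfg.budget_loss_weight * gate_mean


cfg = RetrievalConfig()
rng = random.Random(0)
counts = Counter(sample_retrieval_mode(cfg, rng) for _ in range(100_000))

# Empirical mode frequencies; they should sit near 0.1 / 0.1 / 0.8.
print({mode: n / 100_000 for mode, n in counts.items()})
# Example of the assumed combined objective for made-up inputs.
print(training_loss(lm_loss=2.0, gate_mean=0.5, cfg=cfg))
```

Under these assumptions roughly 80% of examples see true retrieved chunks, and the budget term only changes the objective noticeably when the gate opens widely.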

## Latest Metrics

```json
{
  "step": 20000,
  "retrieval_on": {
    "loss": 1.7580267190933228,
    "lm_loss": 1.7580267190933228,
    "ppl": 5.800979131574639,
    "gate_mean": 1.749867806211114e-06
  },
  "retrieval_off": {
    "loss": 1.7650717496871948,
    "lm_loss": 1.7650717496871948,
    "ppl": 5.841991504112031,
    "gate_mean": 0.0
  },
  "random_retrieval": {
    "loss": 1.7536429166793823,
    "lm_loss": 1.7536429166793823,
    "ppl": 5.775604444698179,
    "gate_mean": 1.7668644431978464e-06
  },
  "delta_lm_loss_off_minus_on": 0.00704503059387207,
  "delta_lm_loss_random_minus_on": -0.00438380241394043
}
```
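
The evaluation compares three modes: retrieval-on, retrieval-off, and random-retrieval. This is the main ablation for whether the trained model uses retrieved context productively and whether it is robust to noisy retrieval. At step 20000 the loss gaps are small (retrieval-off is about 0.007 higher than retrieval-on, and random-retrieval is about 0.004 lower), and the mean gate value is on the order of 1e-6 when retrieval is enabled.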
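
The reported perplexities and deltas can be re-derived from the `lm_loss` values. The sketch below assumes perplexity is `exp(lm_loss)` with the loss in nats, which is consistent with the numbers in the JSON but is not stated by the card; the helper names are hypothetical.

```python
import math

# lm_loss and ppl values copied from the metrics JSON.
reported = {
    "retrieval_on": {"lm_loss": 1.7580267190933228, "ppl": 5.800979131574639},
    "retrieval_off": {"lm_loss": 1.7650717496871948, "ppl": 5.841991504112031},
    "random_retrieval": {"lm_loss": 1.7536429166793823, "ppl": 5.775604444698179},
}


def perplexity(lm_loss: float) -> float:
    """Perplexity under the assumption that the loss is mean NLL in nats."""
    return math.exp(lm_loss)


def loss_delta(mode: str, baseline: str = "retrieval_on") -> float:
    """lm_loss of `mode` minus lm_loss of `baseline`."""
    return reported[mode]["lm_loss"] - reported[baseline]["lm_loss"]


for mode, m in reported.items():
    # Recomputed perplexity next to the reported one.
    print(f"{mode}: recomputed={perplexity(m['lm_loss']):.6f} reported={m['ppl']:.6f}")

print(f"off - on:    {loss_delta('retrieval_off'):+.6f}")
print(f"random - on: {loss_delta('random_retrieval'):+.6f}")
```

A positive off-minus-on delta means the model does better with retrieval enabled; a negative random-minus-on delta means random chunks scored slightly better than true neighbours on this evaluation.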

## Research Use

This is an experimental RETRO-style pretraining run for comparing retrieval-pretrained GPT models against dense GPT baselines at similar training budgets. It is not instruction tuned and should not be used as a factual assistant without further evaluation.