FAAST-GPT2-XL
faast-gpt2-xl is an extension of gpt2-xl equipped with the FAAST module. The original GPT2-XL language model parameters are frozen, and only the FAAST readout projections are trained.
Model Description
FAAST-GPT2-XL augments GPT2-XL with FAAST modules for efficient test-time supervised learning. During FAAST pretraining, the parameters of GPT2-XL are kept frozen, while the FAAST readout projections are trained on OpenWebText2.
The model is designed to support learning at test time through fast weights, allowing adaptation without backpropagation or gradient descent.
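To give an intuition for closed-form fast-weight adaptation, the sketch below fits a generic ridge-regression readout over frozen hidden states in a single forward pass. This is an illustrative toy only, under the assumption that a linear readout is fit from test-time examples; it does not reproduce the actual FAAST update rule, which is described in the paper.

```python
# Illustrative sketch: a generic closed-form fast-weight readout, NOT the exact
# FAAST algorithm. A linear readout W over frozen hidden states is fit by ridge
# regression from a handful of test-time examples (no backpropagation).
import numpy as np

def fit_fast_weights(H, Y, lam=1.0):
    """Closed-form ridge solution W = (H^T H + lam I)^{-1} H^T Y.

    H: (n, d) frozen-backbone hidden states for n test-time examples.
    Y: (n, k) supervision targets (e.g., one-hot labels).
    Returns W: (d, k) fast-weight readout.
    """
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ Y)

# Toy usage: 5 "support" examples with d=8 features and k=2 classes.
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))
Y = np.eye(2)[rng.integers(0, 2, size=5)]
W = fit_fast_weights(H, Y)
logits = rng.standard_normal((1, 8)) @ W  # predict on a new query state
print(logits.argmax(axis=-1))
```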
Training Details
- Base model: GPT2-XL
- Trainable parameters: FAAST readout projections
- Frozen parameters: All GPT2-XL parameters
- Training corpus: OpenWebText2
- Training objective: FAAST readout projection learning on top of the frozen GPT2-XL backbone (a minimal freezing sketch follows this list)
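The frozen-backbone setup above can be sketched as follows, assuming a standard Hugging Face GPT2-XL checkpoint; the readout module shown here is a hypothetical stand-in, not the released FAAST implementation.

```python
# Sketch of the freezing setup: load GPT2-XL, freeze everything, and attach a
# trainable readout. The nn.Linear readout is a hypothetical placeholder for
# the FAAST readout projections.
import torch.nn as nn
from transformers import GPT2LMHeadModel

backbone = GPT2LMHeadModel.from_pretrained("gpt2-xl")
for p in backbone.parameters():
    p.requires_grad = False  # all GPT2-XL parameters stay frozen

hidden = backbone.config.n_embd  # 1600 for GPT2-XL
readout = nn.Linear(hidden, hidden, bias=False)  # stand-in readout projection
n_trainable = sum(p.numel() for p in readout.parameters() if p.requires_grad)
print(n_trainable, "trainable parameters")
```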
Intended Use
This model is intended for research on:
- Test-time learning
- Fast-weight adaptation
- Efficient language model adaptation
- Sequence modeling
- Few-shot learning
- Frozen-backbone adaptation methods
Evaluation Results
Sentiment Classification
Sentiment classification accuracy (%) on SST-2 and IMDB, reported with 95% confidence intervals.
| Dataset | Method | 1-shot | 5-shot | Full |
|---|---|---|---|---|
| SST-2 | GPT2-XL zero-shot | - | - | 74.3 ± 1.2 |
| SST-2 | In-Context Learning | 59.6 ± 1.4 | 71.6 ± 1.3 | - |
| SST-2 | FAAST | 78.5 ± 1.1 | 80.8 ± 1.1 | 87.5 ± 0.9 |
| Dataset | Method | 1-shot | 2-shot | Full |
|---|---|---|---|---|
| IMDB | GPT2-XL zero-shot | - | - | 85.7 ± 1.0 |
| IMDB | In-Context Learning | 70.1 ± 1.3 | 56.0 ± 1.4 | - |
| IMDB | FAAST | 86.7 ± 0.9 | 87.4 ± 0.9 | 90.4 ± 0.8 |
Sequence Modeling on WikiText-103
Sequence modeling adaptation results with GPT2-XL. The +Compute and +Memory columns report only the additional inference cost over the GPT2-XL base model; Learn Cost is wall-clock hours.
| Method | Adaptation | Test-time Learning | WikiText-103 PPL ↓ | +Compute | +Memory | Learn Cost |
|---|---|---|---|---|---|---|
| GPT2-XL zero-shot lower bound | No adaptation | No | 17.41 | - | - | - |
| Linear Projection upper bound | Backprop | No | 13.60 | O(L d_x²) | 112 MB | 3 hrs |
| LoRA w/ same number of params | Backprop | No | 13.57 | - | - | 3 hrs |
| In-Context Learning | Context | Yes | Not applicable | O(N L d_x) | - | - |
| kNN-LM | Memory | Yes | 12.70 | O(N d_x) | 307 GB | 16 hrs |
| FAAST | Fast weights | Yes | 15.35 | O(L d_x²) | 112 MB | 0.2 hrs |
The WikiText-103 setup uses hidden size d_x = 1600, L = 48 layers, and N = 1.03 × 10^8 training tokens. The in-context learning cost is a theoretical estimate, and its perplexity is listed as not applicable, because the base GPT2-XL model does not support such long contexts.
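As a back-of-the-envelope check on the table, the snippet below recomputes two of the scaling terms from the values above. The fp16 key storage for kNN-LM and the one-projection-per-layer reading of the O(L d_x²) term are assumptions for illustration, not details from the model card.

```python
# Sanity check of the +Compute / +Memory scaling terms in the table above.
# Assumptions: kNN-LM stores one fp16 key (2 bytes per dimension) per training
# token, and O(L * d_x^2) corresponds to one d_x x d_x projection per layer.
d_x = 1600    # GPT2-XL hidden size
L = 48        # number of transformer layers
N = 1.03e8    # WikiText-103 training tokens

knn_bytes = N * d_x * 2
print(f"kNN-LM datastore ~ {knn_bytes / 2**30:.0f} GiB")  # ~307 GiB

extra_macs = L * d_x**2
print(f"O(L * d_x^2) ~ {extra_macs:.2e} MACs per token")  # ~1.23e+08
```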
Limitations
- The model is trained on English web text and is primarily intended for English-language tasks.
- Test-time learning behavior depends on the quality and distribution of the test-time examples.
Citation
If you use this model, please cite the corresponding FAAST paper:
@article{bao2026faast,
  title={FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation},
  author={Bao, Guangsheng and Zhang, Hongbo and Cui, Han and Sun, Ke and Zhao, Yanbin and He, Juncai and Zhang, Yue},
  journal={arXiv preprint arXiv:2605.04651},
  year={2026}
}