FAAST-GPT2-XL

faast-gpt2-xl is an extension of gpt2-xl equipped with the FAAST module. The original GPT2-XL language model parameters are frozen, and only the FAAST readout projections are trained.

Model Description

FAAST-GPT2-XL augments GPT2-XL with FAAST modules for efficient test-time supervised learning. During FAAST pretraining, the parameters of GPT2-XL are kept frozen, while the FAAST readout projections are trained on OpenWebText2.

The model is designed to support learning at test time through fast weights, allowing adaptation without backpropagation or gradient descent.
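The sketch below illustrates the general idea of a forward-only, closed-form fast-weight update on top of frozen features. It is a minimal sketch, not the official FAAST module: the function name fit_fast_weights, the ridge term lam, and the ridge-regression-style solve are illustrative assumptions.

# Minimal sketch of forward-only, closed-form fast-weight adaptation on top of
# frozen features (an illustrative assumption, not the official FAAST module).
import torch

def fit_fast_weights(H: torch.Tensor, Y: torch.Tensor, lam: float = 1e-2) -> torch.Tensor:
    """Closed-form ridge solve W = (H^T H + lam I)^{-1} H^T Y; no backprop involved."""
    d = H.shape[1]
    gram = H.T @ H + lam * torch.eye(d, dtype=H.dtype, device=H.device)
    return torch.linalg.solve(gram, H.T @ Y)                # (d, num_classes) fast weights

# Usage with random stand-ins for frozen GPT2-XL features (hidden size 1600):
H_support = torch.randn(16, 1600)                           # features of labeled support examples
Y_support = torch.nn.functional.one_hot(torch.randint(0, 2, (16,)), num_classes=2).float()
W_fast = fit_fast_weights(H_support, Y_support)             # single closed-form update at test time
preds = (torch.randn(4, 1600) @ W_fast).argmax(dim=-1)      # apply to new (test) features

Because the update is a single linear solve over support-set features, adaptation costs one extra forward pass plus the solve, with no optimizer state or gradient computation.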

Training Details

  • Base model: GPT2-XL
  • Trainable parameters: FAAST readout projections
  • Frozen parameters: All GPT2-XL parameters
  • Training corpus: OpenWebText2
  • Training objective: FAAST readout projection learning with the frozen GPT2-XL backbone (see the setup sketch below)
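As a rough illustration of this setup, the snippet below freezes every GPT2-XL parameter and adds a separate trainable projection as a stand-in for the FAAST readout. The projection shape, optimizer, and learning rate are assumptions for the sketch, not details taken from the FAAST paper.

# Minimal sketch (not the official FAAST training code): freeze all GPT2-XL
# parameters and train only a newly added readout projection.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

backbone = GPT2LMHeadModel.from_pretrained("gpt2-xl")
for p in backbone.parameters():
    p.requires_grad = False                                  # all GPT2-XL weights stay frozen

d_x = backbone.config.hidden_size                            # 1600 for GPT2-XL
readout = torch.nn.Linear(d_x, d_x, bias=False)              # hypothetical stand-in for the FAAST readout
optimizer = torch.optim.AdamW(readout.parameters(), lr=1e-4) # only the readout is optimized

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
batch = tokenizer("An OpenWebText2-style training chunk.", return_tensors="pt")
with torch.no_grad():                                        # frozen backbone forward pass
    hidden = backbone.transformer(**batch).last_hidden_state
features = readout(hidden)                                   # only this projection receives gradients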

Intended Use

This model is intended for research on:

  • Test-time learning
  • Fast-weight adaptation
  • Efficient language model adaptation
  • Sequence modeling
  • Few-shot learning
  • Frozen-backbone adaptation methods

Evaluation Results

Sentiment Classification

Sentiment classification accuracy on SST-2 and IMDB with 95% confidence intervals.

Dataset   Method                 1-shot        5-shot        Full
SST-2     GPT2-XL zero-shot      -             -             74.3 ± 1.2
SST-2     In-Context Learning    59.6 ± 1.4    71.6 ± 1.3    -
SST-2     FAAST                  78.5 ± 1.1    80.8 ± 1.1    87.5 ± 0.9

Dataset   Method                 1-shot        2-shot        Full
IMDB      GPT2-XL zero-shot      -             -             85.7 ± 1.0
IMDB      In-Context Learning    70.1 ± 1.3    56.0 ± 1.4    -
IMDB      FAAST                  86.7 ± 0.9    87.4 ± 0.9    90.4 ± 0.8
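The ± values above are 95% confidence intervals; this card does not state how they were computed. For reference, a generic normal-approximation binomial interval for an accuracy estimate looks like the sketch below (an assumption for illustration, not necessarily the protocol used for the table).

# Generic normal-approximation 95% interval for classification accuracy
# (illustrative only; the card does not specify its CI method).
import math

def accuracy_ci95(correct: int, total: int) -> tuple[float, float]:
    p = correct / total
    half_width = 1.96 * math.sqrt(p * (1.0 - p) / total)     # binomial normal approximation
    return p, half_width

acc, hw = accuracy_ci95(correct=700, total=800)
print(f"accuracy = {100 * acc:.1f} ± {100 * hw:.1f} (95% CI)")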

Sequence Modeling on WikiText-103

Sequence modeling adaptation results using GPT2-XL. The +Compute and +Memory columns report only the additional inference cost over the GPT2-XL base model.

Method                           Adaptation      Test-time Learning   WikiText-103 PPL ↓   +Compute      +Memory   Learn Cost
GPT2-XL zero-shot (lower bound)  No adaptation   No                   17.41                -             -         -
Linear Projection (upper bound)  Backprop        No                   13.60                O(L d_x²)     112 MB    3 hrs
LoRA w/ same number of params    Backprop        No                   13.57                -             -         3 hrs
In-Context Learning              Context         Yes                  Not applicable       O(N L d_x)    -         -
kNN-LM                           Memory          Yes                  12.70                O(N d_x)      307 GB    16 hrs
FAAST                            Fast weights    Yes                  15.35                O(L d_x²)     112 MB    0.2 hrs

The WikiText-103 setup uses hidden size d_x = 1600, number of layers L = 48, and number of training tokens N = 1.03 × 10^8. The in-context learning cost is a theoretical estimate because the base GPT2-XL model does not support such long contexts.
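Plugging these constants into the complexity expressions from the table gives a rough sense of scale; the snippet below reports orders of magnitude only and does not attempt to derive the concrete memory figures, since constant factors are hidden by the O(·) notation.

# Order-of-magnitude comparison of the +Compute column using the stated constants.
d_x, L, N = 1600, 48, 1.03e8      # hidden size, number of layers, training tokens

faast_extra = L * d_x ** 2        # O(L * d_x^2)   ≈ 1.2e8
icl_extra = N * L * d_x           # O(N * L * d_x) ≈ 7.9e12
knn_extra = N * d_x               # O(N * d_x)     ≈ 1.6e11
print(f"FAAST {faast_extra:.1e}  ICL {icl_extra:.1e}  kNN-LM {knn_extra:.1e}")

The roughly four-orders-of-magnitude gap between the FAAST and in-context-learning terms is consistent with the note that the in-context-learning cost is only a theoretical estimate.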

Limitations

  • The model is trained on English web text and is primarily intended for English-language tasks.
  • Test-time learning behavior depends on the quality and distribution of the test-time examples.

Citation

If you use this model, please cite the FAAST paper:

@article{bao2026faast,
  title={FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation},
  author={Bao, Guangsheng and Zhang, Hongbo and Cui, Han and Sun, Ke and Zhao, Yanbin and He, Juncai and Zhang, Yue},
  journal={arXiv preprint arXiv:2605.04651},
  year={2026}
}