FAAST-GPT2-XL

faast-gpt2-xl is an extension of gpt2-xl equipped with the FAAST module. The original GPT2-XL language model parameters are frozen, and only the FAAST readout projections are trained.

Model Description

FAAST-GPT2-XL augments GPT2-XL with FAAST modules for efficient test-time supervised learning. During FAAST pretraining, the parameters of GPT2-XL are kept frozen, while the FAAST readout projections are trained on OpenWebText2.

The model is designed to support learning at test time through fast weights, allowing adaptation without backpropagation or gradient descent.
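The sketch below illustrates the general idea of a forward-only, closed-form fast-weight update on top of frozen features. It is a minimal sketch, not the official FAAST module: the function name fit_fast_weights, the ridge term lam, and the ridge-regression-style solve are illustrative assumptions.

# Minimal sketch of forward-only, closed-form fast-weight adaptation on top of
# frozen features (an illustrative assumption, not the official FAAST module).
import torch

def fit_fast_weights(H: torch.Tensor, Y: torch.Tensor, lam: float = 1e-2) -> torch.Tensor:
    """Closed-form ridge solve W = (H^T H + lam I)^{-1} H^T Y; no backprop involved."""
    d = H.shape[1]
    gram = H.T @ H + lam * torch.eye(d, dtype=H.dtype, device=H.device)
    return torch.linalg.solve(gram, H.T @ Y)                # (d, num_classes) fast weights

# Usage with random stand-ins for frozen GPT2-XL features (hidden size 1600):
H_support = torch.randn(16, 1600)                           # features of labeled support examples
Y_support = torch.nn.functional.one_hot(torch.randint(0, 2, (16,)), num_classes=2).float()
W_fast = fit_fast_weights(H_support, Y_support)             # single closed-form update at test time
preds = (torch.randn(4, 1600) @ W_fast).argmax(dim=-1)      # apply to new (test) features

Because the update is a single linear solve over support-set features, adaptation costs one extra forward pass plus the solve, with no optimizer state or gradient computation.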

Training Details

  • Base model: GPT2-XL
  • Trainable parameters: FAAST readout projections
  • Frozen parameters: All GPT2-XL parameters
  • Training corpus: OpenWebText2
  • Training objective: FAAST readout projection learning with the frozen GPT2-XL backbone (see the setup sketch below)
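As a rough illustration of this setup, the snippet below freezes every GPT2-XL parameter and adds a separate trainable projection as a stand-in for the FAAST readout. The projection shape, optimizer, and learning rate are assumptions for the sketch, not details taken from the FAAST paper.

# Minimal sketch (not the official FAAST training code): freeze all GPT2-XL
# parameters and train only a newly added readout projection.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

backbone = GPT2LMHeadModel.from_pretrained("gpt2-xl")
for p in backbone.parameters():
    p.requires_grad = False                                  # all GPT2-XL weights stay frozen

d_x = backbone.config.hidden_size                            # 1600 for GPT2-XL
readout = torch.nn.Linear(d_x, d_x, bias=False)              # hypothetical stand-in for the FAAST readout
optimizer = torch.optim.AdamW(readout.parameters(), lr=1e-4) # only the readout is optimized

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
batch = tokenizer("An OpenWebText2-style training chunk.", return_tensors="pt")
with torch.no_grad():                                        # frozen backbone forward pass
    hidden = backbone.transformer(**batch).last_hidden_state
features = readout(hidden)                                   # only this projection receives gradients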

Intended Use

This model is intended for research on:

  • Test-time learning
  • Fast-weight adaptation
  • Efficient language model adaptation
  • Sequence modeling
  • Few-shot learning
  • Frozen-backbone adaptation methods

Evaluation Results

Sentiment Classification

Sentiment classification accuracy on SST-2 and IMDB with 95% confidence intervals.

Dataset   Method                 1-shot        5-shot        Full
SST-2     GPT2-XL zero-shot      -             -             74.3 ± 1.2
SST-2     In-Context Learning    59.6 ± 1.4    71.6 ± 1.3    -
SST-2     FAAST                  78.5 ± 1.1    80.8 ± 1.1    87.5 ± 0.9

Dataset   Method                 1-shot        2-shot        Full
IMDB      GPT2-XL zero-shot      -             -             85.7 ± 1.0
IMDB      In-Context Learning    70.1 ± 1.3    56.0 ± 1.4    -
IMDB      FAAST                  86.7 ± 0.9    87.4 ± 0.9    90.4 ± 0.8
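The ± values above are 95% confidence intervals; this card does not state how they were computed. For reference, a generic normal-approximation binomial interval for an accuracy estimate looks like the sketch below (an assumption for illustration, not necessarily the protocol used for the table).

# Generic normal-approximation 95% interval for classification accuracy
# (illustrative only; the card does not specify its CI method).
import math

def accuracy_ci95(correct: int, total: int) -> tuple[float, float]:
    p = correct / total
    half_width = 1.96 * math.sqrt(p * (1.0 - p) / total)     # binomial normal approximation
    return p, half_width

acc, hw = accuracy_ci95(correct=700, total=800)
print(f"accuracy = {100 * acc:.1f} ± {100 * hw:.1f} (95% CI)")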

Sequence Modeling on WikiText-103

Sequence modeling adaptation results using GPT2-XL. The +Compute and +Memory columns report only the additional inference cost over the GPT2-XL base model.

Method                           Adaptation      Test-time Learning   WikiText-103 PPL ↓   +Compute      +Memory   Learn Cost
GPT2-XL zero-shot (lower bound)  No adaptation   No                   17.41                -             -         -
Linear Projection (upper bound)  Backprop        No                   13.60                O(L d_x²)     112 MB    3 hrs
LoRA w/ same number of params    Backprop        No                   13.57                -             -         3 hrs
In-Context Learning              Context         Yes                  Not applicable       O(N L d_x)    -         -
kNN-LM                           Memory          Yes                  12.70                O(N d_x)      307 GB    16 hrs
FAAST                            Fast weights    Yes                  15.35                O(L d_x²)     112 MB    0.2 hrs

The WikiText-103 setup uses hidden size d_x = 1600, number of layers L = 48, and number of training tokens N = 1.03 × 10^8. The in-context learning cost is a theoretical estimate because the base GPT2-XL model does not support such long contexts.
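Plugging these constants into the complexity expressions from the table gives a rough sense of scale; the snippet below reports orders of magnitude only and does not attempt to derive the concrete memory figures, since constant factors are hidden by the O(·) notation.

# Order-of-magnitude comparison of the +Compute column using the stated constants.
d_x, L, N = 1600, 48, 1.03e8      # hidden size, number of layers, training tokens

faast_extra = L * d_x ** 2        # O(L * d_x^2)   ≈ 1.2e8
icl_extra = N * L * d_x           # O(N * L * d_x) ≈ 7.9e12
knn_extra = N * d_x               # O(N * d_x)     ≈ 1.6e11
print(f"FAAST {faast_extra:.1e}  ICL {icl_extra:.1e}  kNN-LM {knn_extra:.1e}")

The roughly four-orders-of-magnitude gap between the FAAST and in-context-learning terms is consistent with the note that the in-context-learning cost is only a theoretical estimate.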

Limitations

  • The model is trained on English web text and is primarily intended for English-language tasks.
  • Test-time learning behavior depends on the quality and distribution of the test-time examples.

Citation

If you use this model, please cite the FAAST paper:

@article{bao2026faast,
  title={FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation},
  author={Bao, Guangsheng and Zhang, Hongbo and Cui, Han and Sun, Ke and Zhao, Yanbin and He, Juncai and Zhang, Yue},
  journal={arXiv preprint arXiv:2605.04651},
  year={2026}
}