Is SWA used during pretraining?
#113
by EarthWorm001 - opened
I have two questions about sliding window attention (SWA):
1: Is SWA used throughout pretraining, i.e., in every pretraining step?
2: If not, is SWA introduced at a certain stage of pretraining? For example, pretraining the model with full attention for some time and then switching to SWA.
Thanks!
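To make the distinction in the question concrete, here is a minimal sketch (not the model's actual implementation) of how a sliding-window attention mask differs from a full causal mask; the `window` parameter and function name are illustrative assumptions:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # True where query position i may attend to key position j:
    # causal (j <= i) and within the sliding window (i - j < window).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# With window == seq_len the mask reduces to ordinary full causal
# attention, so in principle a staged schedule (question 2) would
# amount to changing `window` partway through pretraining.
full_mask = sliding_window_mask(6, 6)   # full causal attention
swa_mask = sliding_window_mask(6, 3)    # sliding-window attention
```

This only illustrates the masking difference; whether the model actually alternates or stages the two during pretraining is exactly what the question asks.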