Is SWA used during pretraining?
#113
by EarthWorm001 - opened
I have two questions about sliding window attention (SWA):
1: Is SWA used throughout pretraining, i.e., in every pretraining step?
2: If not, is SWA introduced at a certain stage of pretraining? For example, pretraining the model with full attention for some time and then switching to SWA.
Thanks!
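To make the distinction in the question concrete, here is a minimal sketch (not the model's actual implementation) of how a sliding-window attention mask differs from a full causal mask; the `window` parameter and function name are illustrative assumptions:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # True where query position i may attend to key position j:
    # causal (j <= i) and within the sliding window (i - j < window).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# With window == seq_len the mask reduces to ordinary full causal
# attention, so in principle a staged schedule (question 2) would
# amount to changing `window` partway through pretraining.
full_mask = sliding_window_mask(6, 6)   # full causal attention
swa_mask = sliding_window_mask(6, 3)    # sliding-window attention
```

This only illustrates the masking difference; whether the model actually alternates or stages the two during pretraining is exactly what the question asks.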