To match GPT-3, your implementation needs alternating attention patterns.
by User8213 - opened
It's literally GPT-2, but with alternating dense and locally banded sparse attention patterns in the transformer layers, and a feedforward layer 4x the hidden size. That's all.
Thanks!
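To make the attention point concrete, here is a minimal sketch of what "alternating dense and locally banded sparse" means in terms of per-layer attention masks. The parity schedule (even layers dense, odd layers banded) and the `window` value are assumptions for illustration; the GPT-3 paper only states that the two patterns alternate, following the Sparse Transformer.

```python
def dense_mask(n):
    # Standard causal mask: position i may attend to every position j <= i.
    return [[j <= i for j in range(n)] for i in range(n)]

def banded_mask(n, window):
    # Locally banded causal mask: position i attends only to the last
    # `window` positions, i.e. j in (i - window, i].
    return [[(j <= i) and (j > i - window) for j in range(n)] for i in range(n)]

def layer_mask(layer_idx, n, window=2):
    # Alternate dense and banded by layer parity. Which parity gets which
    # pattern, and the window size, are assumptions here; the paper does
    # not specify the exact schedule.
    return dense_mask(n) if layer_idx % 2 == 0 else banded_mask(n, window)
```

In practice these boolean masks would be converted to additive `-inf` masks and passed to the attention op per layer, rather than using one shared causal mask for the whole stack.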
I think the feedforward layer is already 4x the hidden size, since that's inherited from GPT-2, but you have a valid point about the attention. I did indeed miss this; I'll rework the architecture and train a new model.
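For reference, a quick sanity check of that 4x inner dimension, using GPT-2 small's hidden size of 768 as the example value (the 4x multiplier, not the specific size, is what carries over to GPT-3's configs):

```python
d_model = 768        # GPT-2 small hidden size (example value)
d_ff = 4 * d_model   # feedforward inner dimension, 4x the hidden size
print(d_ff)          # prints 3072
```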
And huge thanks for the data on GPT-4 btw!