To match GPT-3, your implementation needs alternating attention patterns.
by User8213 - opened
It's literally GPT-2, but with alternating dense and locally banded sparse attention patterns in the transformer layers, and a feedforward layer 4x the hidden size. That's all.
Thanks!
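To make the attention point concrete, here is a minimal sketch of what "alternating dense and locally banded sparse" means in terms of per-layer attention masks. The parity schedule (even layers dense, odd layers banded) and the `window` value are assumptions for illustration; the GPT-3 paper only states that the two patterns alternate, following the Sparse Transformer.

```python
def dense_mask(n):
    # Standard causal mask: position i may attend to every position j <= i.
    return [[j <= i for j in range(n)] for i in range(n)]

def banded_mask(n, window):
    # Locally banded causal mask: position i attends only to the last
    # `window` positions, i.e. j in (i - window, i].
    return [[(j <= i) and (j > i - window) for j in range(n)] for i in range(n)]

def layer_mask(layer_idx, n, window=2):
    # Alternate dense and banded by layer parity. Which parity gets which
    # pattern, and the window size, are assumptions here; the paper does
    # not specify the exact schedule.
    return dense_mask(n) if layer_idx % 2 == 0 else banded_mask(n, window)
```

In practice these boolean masks would be converted to additive `-inf` masks and passed to the attention op per layer, rather than using one shared causal mask for the whole stack.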
I think the feedforward layer is already 4x the hidden size, since that's inherited from GPT-2, but you have a valid point about the attention. I did indeed miss this; I'll rework the architecture and train a new model.
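For reference, a quick sanity check of that 4x inner dimension, using GPT-2 small's hidden size of 768 as the example value (the 4x multiplier, not the specific size, is what carries over to GPT-3's configs):

```python
d_model = 768        # GPT-2 small hidden size (example value)
d_ff = 4 * d_model   # feedforward inner dimension, 4x the hidden size
print(d_ff)          # prints 3072
```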
And huge thanks for the data on GPT-4 btw!