How were the Gemma 4 models trained?

#9 · opened by guiopen

In the papers for past Gemma models, it was explained that only the largest model was trained on the full training dataset, while the smaller models were instead trained on a smaller number of tokens by distilling the logits of their larger teacher.
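For context, the logit-distillation setup described above can be sketched roughly as follows. This is only a minimal illustration of the general technique, not the Gemma team's actual training code; the function names and the temperature value are hypothetical:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; a higher temperature softens the distribution.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) over the vocabulary, averaged over positions.
    # The student is pushed to match the teacher's full output distribution,
    # which carries more signal per token than the one-hot next-token label.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return kl.mean()
```

When the student's logits match the teacher's exactly, the loss is zero; any divergence in the distributions makes it positive, which is what lets a smaller model learn from fewer tokens.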

Was Gemma 4 trained the same way, or was each model trained independently? Did all the models receive the same number of training tokens, or did the bigger ones receive more?

Thanks to the Gemma team for this amazing release!

Google org

Hi @guiopen
Please check out this deep-dive into Gemma 4’s architecture and training process https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4.

Thanks
