How were the Gemma 4 models trained?
#9
by guiopen
In the papers on past Gemma models, it was explained that only the largest model was trained directly on the full training dataset, while the smaller models were instead trained on a smaller number of tokens via distillation of logits from their larger teacher.
Was Gemma 4 trained the same way, or was each model trained independently? Did all models receive the same number of training tokens, or did the larger ones receive more?
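For anyone unfamiliar with the logit-distillation setup the question refers to, here is a minimal sketch of the usual temperature-softened KL objective (as in Hinton et al.'s distillation formulation). The function names, toy logits, and temperature value are illustrative, not taken from any Gemma paper:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature: higher T yields a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 to keep gradient magnitudes comparable across temperatures."""
    p = softmax(teacher_logits, temperature)   # soft targets from the teacher
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

# Toy example: one position over a 5-token vocabulary (made-up logits).
teacher = [2.0, 1.0, 0.5, -1.0, -2.0]
student = [1.5, 1.2, 0.3, -0.5, -1.8]
print(distillation_loss(student, teacher) >= 0.0)  # KL divergence is non-negative
```

The student is trained to match the teacher's full output distribution at each position rather than only the one-hot next token, which is why a small model can learn from far fewer tokens than its teacher saw.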
Thanks, Gemma team, for this amazing release!
Hi @guiopen
Please check out this deep-dive into Gemma 4’s architecture and training process https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4.
Thanks