How were the Gemma 4 models trained?
#9
by guiopen
In the papers on past Gemma models, it was explained that only the largest model was trained directly on the full training dataset, while the smaller models were instead trained on a smaller number of tokens via distillation of logits from their larger teacher.
Was Gemma 4 trained the same way, or was each model trained independently? Did all models receive the same number of training tokens, or did the larger ones receive more?
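For anyone unfamiliar with the logit-distillation setup the question refers to, here is a minimal sketch of the usual temperature-softened KL objective (as in Hinton et al.'s distillation formulation). The function names, toy logits, and temperature value are illustrative, not taken from any Gemma paper:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature: higher T yields a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 to keep gradient magnitudes comparable across temperatures."""
    p = softmax(teacher_logits, temperature)   # soft targets from the teacher
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

# Toy example: one position over a 5-token vocabulary (made-up logits).
teacher = [2.0, 1.0, 0.5, -1.0, -2.0]
student = [1.5, 1.2, 0.3, -0.5, -1.8]
print(distillation_loss(student, teacher) >= 0.0)  # KL divergence is non-negative
```

The student is trained to match the teacher's full output distribution at each position rather than only the one-hot next token, which is why a small model can learn from far fewer tokens than its teacher saw.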
Thanks, Gemma team, for this amazing release!
Hi @guiopen
Please check out this deep-dive into Gemma 4’s architecture and training process https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4.
Thanks