Fine-Tuning Gemma 4 on Day Zero: 3 Bugs in 30 Minutes
Google released Gemma 4 today under Apache 2.0. We had a QLoRA training run stepping on a B200 within hours. Here's what broke.
Bug 1: "Transformers does not recognize this architecture"
Stable Transformers (5.4.0) doesn't include the gemma4 model type yet.
Fix: pip install git+https://github.com/huggingface/transformers.git
Bug 2: "Target module Gemma4ClippableLinear is not supported"
Gemma 4 introduces Gemma4ClippableLinear for its vision encoder. PEFT doesn't recognize it - even from source. The layer wraps nn.Linear with input/output clamping, but inherits from nn.Module instead of nn.Linear, so PEFT's type check rejects it.
Fix: Monkey-patch it to inherit from nn.Linear. PEFT then treats it normally.
Bug 3: "mm_token_type_ids is required"
Gemma 3 needed token_type_ids. Gemma 4 adds mm_token_type_ids (multimodal). Both need to be in the data collator, even for text-only training.
Fix: Custom collator with both fields initialized to zeros.
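A minimal sketch of what that collator can look like, in plain Python (the padding logic and pad_token_id default here are illustrative, not the exact collator from our run; the field names token_type_ids / mm_token_type_ids are the ones from the error):

```python
def collate_text_only(batch, pad_token_id=0):
    """Pad input_ids and add the token-type fields Gemma expects.

    For text-only training, both token_type_ids and mm_token_type_ids
    are all zeros: no position in the sequence is an image/audio token.
    """
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask, labels = [], [], []
    for ex in batch:
        ids = ex["input_ids"]
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        labels.append(ids + [-100] * pad)  # -100 = ignored by the loss
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
        # Zero-initialized: every position is treated as a text token.
        "token_type_ids": [[0] * max_len for _ in batch],
        "mm_token_type_ids": [[0] * max_len for _ in batch],
    }
```

In a real run you'd return tensors rather than lists, but the point is the last two keys: without them the forward pass raises the error above even when no images are in the batch.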
Result: 31B model training at 4.5s/step on 1x B200. 534M trainable params (1.68% QLoRA). None of these issues are avoidable with experience - they're day-zero discovery problems.
Could you go into more detail about how you exactly resolved the issue with the Gemma4ClippableLinear, please?
Sure. @MetaMind42 The issue is that Gemma4ClippableLinear inherits from nn.Module instead of nn.Linear. PEFT checks the type of every module before applying LoRA, and it only recognizes nn.Linear, nn.Embedding, nn.Conv1d/2d/3d, and nn.MultiheadAttention. Since Gemma4ClippableLinear isn't any of those, PEFT rejects it - even if you use exclude_modules to skip it, because the type check runs first.
The fix is to replace the class definition before loading the model:
import torch
import torch.nn as nn
from transformers.models.gemma4 import modeling_gemma4

class PatchedClippableLinear(nn.Linear):
    """Drop-in replacement that inherits from nn.Linear, so PEFT's
    type check accepts it, while keeping the clamping behavior."""

    def __init__(self, config, in_features, out_features):
        nn.Linear.__init__(self, in_features, out_features, bias=False)
        self.use_clipped_linears = getattr(config, "use_clipped_linears", False)
        if self.use_clipped_linears:
            # Bounds default to +/- inf (a no-op clamp) until the
            # checkpoint's values are loaded into these buffers.
            self.register_buffer("input_min", torch.tensor(-float("inf")))
            self.register_buffer("input_max", torch.tensor(float("inf")))
            self.register_buffer("output_min", torch.tensor(-float("inf")))
            self.register_buffer("output_max", torch.tensor(float("inf")))

    def forward(self, x):
        if self.use_clipped_linears:
            x = torch.clamp(x, self.input_min, self.input_max)
        out = nn.Linear.forward(self, x)
        if self.use_clipped_linears:
            out = torch.clamp(out, self.output_min, self.output_max)
        return out

# Replace the class before the model is instantiated.
modeling_gemma4.Gemma4ClippableLinear = PatchedClippableLinear
Place this before any AutoModelForCausalLM.from_pretrained() call. The patched class preserves the clamping behavior but now inherits from nn.Linear, so PEFT recognizes it.
Also use exclude_modules=["vision_tower", "multi_modal_projector", "audio_tower"] in your LoRA config to skip applying LoRA to the vision/audio layers entirely.
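For reference, a sketch of where that exclude_modules list goes (the rank, alpha, and target_modules values below are placeholder choices, not the settings from the run described above):

```python
from peft import LoraConfig

# QLoRA config that skips the multimodal towers entirely.
# exclude_modules filters out matching module names before
# LoRA adapters are attached.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    exclude_modules=["vision_tower", "multi_modal_projector", "audio_tower"],
    task_type="CAUSAL_LM",
)
```

Note that excluding the towers alone isn't enough without the patch above, since PEFT's type check on Gemma4ClippableLinear fires before the exclusion list is consulted.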
I filed this as huggingface/peft#3129 - Benjamin Bossan suggested targeting the inner .linear layer as an alternative approach, but I haven't tested that yet.
Will try! Thanks a lot.
@MetaMind42 let me know how it works out. Also check out Benjamin's .linear layer approach. I just haven't had a moment to run it yet.
No luck with Benjamin's workaround. But yours worked for me. Thanks!