Fine-Tuning Gemma 4 on Day Zero: 3 Bugs in 30 Minutes

#3
by Nathan-Maine - opened

Google released Gemma 4 today under Apache 2.0. We had a QLoRA training run stepping on a B200 within hours. Here's what broke.

Bug 1: "Transformers does not recognize this architecture"

Stable Transformers (5.4.0) doesn't include the gemma4 model type yet.

Fix: pip install git+https://github.com/huggingface/transformers.git
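If you'd rather fail fast than hit the architecture error mid-script, compare version tuples before loading. A minimal sketch; source installs carry a .dev suffix, and the exact version strings here are illustrative:

```python
def version_tuple(v: str) -> tuple:
    """Parse a version like '5.5.0.dev0' into a comparable tuple of ints."""
    core = v.split(".dev")[0].split("+")[0]  # drop dev/local-build suffixes
    return tuple(int(p) for p in core.split("."))

# A source install (e.g. '5.5.0.dev0') should compare newer than stable 5.4.0.
assert version_tuple("5.5.0.dev0") > version_tuple("5.4.0")
```

In a real script you'd feed `transformers.__version__` into the same check.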

Bug 2: "Target module Gemma4ClippableLinear is not supported"

Gemma 4 introduces Gemma4ClippableLinear for its vision encoder. PEFT doesn't recognize it - even from source. The layer wraps nn.Linear with input/output clamping, but inherits from nn.Module instead of nn.Linear, so PEFT's type check rejects it.

Fix: Monkey-patch it to inherit from nn.Linear. PEFT then treats it normally.

Bug 3: "mm_token_type_ids is required"

Gemma 3 required token_type_ids. Gemma 4 adds mm_token_type_ids for multimodal inputs. Both fields must be present in the batches your data collator produces, even for text-only training.

Fix: Custom collator with both fields initialized to zeros.
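Here's roughly what that collator looks like, sketched with plain Python lists so the padding logic is visible. In a real collator these lists become tensors and you'd reuse the tokenizer's padding; the field names are the ones the processor complains about. The pad_token_id default and the -100 label masking are the usual HF conventions, not anything Gemma-specific:

```python
def collate_with_mm_fields(features, pad_token_id=0):
    """Pad a batch of {"input_ids": [...]} features and add the
    token_type_ids / mm_token_type_ids fields, zeroed for text-only data."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": [],
             "token_type_ids": [], "mm_token_type_ids": []}
    for f in features:
        ids = f["input_ids"]
        pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
        batch["labels"].append(ids + [-100] * pad)  # mask loss on padding
        batch["token_type_ids"].append([0] * max_len)
        batch["mm_token_type_ids"].append([0] * max_len)
    return batch

batch = collate_with_mm_fields([{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}])
```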

Result: 31B model training at 4.5s/step on 1x B200. 534M trainable params (1.68% QLoRA). None of these issues are avoidable with experience - they're day-zero discovery problems.

Could you go into more detail about how exactly you resolved the issue with Gemma4ClippableLinear, please?

Sure. @MetaMind42 The issue is that Gemma4ClippableLinear inherits from nn.Module instead of nn.Linear. PEFT checks the type of every module before applying LoRA, and it only recognizes nn.Linear, nn.Embedding, nn.Conv1d/2d/3d, and nn.MultiheadAttention. Since Gemma4ClippableLinear isn't any of those, PEFT rejects it - even if you use exclude_modules to skip it, because the type check runs first.
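The mechanism is just Python's type system. A dependency-free sketch with stand-in classes (not the real PEFT code) shows why wrapping fails where inheriting passes:

```python
class Linear:
    """Stand-in for nn.Linear."""

class WrappedLinear:
    """Wraps a Linear but does not inherit from it - an isinstance check fails."""
    def __init__(self):
        self.linear = Linear()

class PatchedLinear(Linear):
    """Inherits from Linear, so an isinstance check passes."""

assert not isinstance(WrappedLinear(), Linear)  # PEFT-style check rejects this
assert isinstance(PatchedLinear(), Linear)      # ...and accepts this
```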

The fix is to replace the class definition before loading the model:

import torch
import torch.nn as nn
from transformers.models.gemma4 import modeling_gemma4

class PatchedClippableLinear(nn.Linear):
    def __init__(self, config, in_features, out_features):
        nn.Linear.__init__(self, in_features, out_features, bias=False)
        self.use_clipped_linears = getattr(config, "use_clipped_linears", False)
        if self.use_clipped_linears:
            # No-op clamp bounds by default; the checkpoint's state dict
            # overwrites these buffers on load.
            self.register_buffer("input_min", torch.tensor(-float("inf")))
            self.register_buffer("input_max", torch.tensor(float("inf")))
            self.register_buffer("output_min", torch.tensor(-float("inf")))
            self.register_buffer("output_max", torch.tensor(float("inf")))

    def forward(self, x):
        if self.use_clipped_linears:
            x = torch.clamp(x, self.input_min, self.input_max)
        out = nn.Linear.forward(self, x)
        if self.use_clipped_linears:
            out = torch.clamp(out, self.output_min, self.output_max)
        return out

modeling_gemma4.Gemma4ClippableLinear = PatchedClippableLinear
Place this before any AutoModelForCausalLM.from_pretrained() call. The patched class preserves the clamping behavior but now inherits from nn.Linear, so PEFT recognizes it.

Also use exclude_modules=["vision_tower", "multi_modal_projector", "audio_tower"] in your LoRA config to skip applying LoRA to the vision/audio layers entirely.
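For reference, that LoRA config looks something like the following. The rank/alpha values are illustrative, and the target module names are the usual Gemma projection layers - verify them against the actual model's named_modules() before trusting this sketch:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,            # illustrative rank - tune for your run
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    exclude_modules=["vision_tower", "multi_modal_projector", "audio_tower"],
    task_type="CAUSAL_LM",
)
```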

I filed this as huggingface/peft#3129 - Benjamin Bossan suggested targeting the inner .linear layer as an alternative approach, but I haven't tested that yet.

Will try! Thanks a lot.

@MetaMind42 let me know how it works out. And do try Benjamin's .linear layer approach too - I just haven't had a moment to run it yet... πŸ™ƒ

No luck with Benjamin's workaround. But yours worked for me. Thanks!
