Fine-Tuning Gemma 4 on Day Zero: 3 Bugs in 30 Minutes

#3
by Nathan-Maine - opened

Google released Gemma 4 today under Apache 2.0. We had a QLoRA training run stepping on a B200 within hours. Here's what broke.

Bug 1: "Transformers does not recognize this architecture"

Stable Transformers (5.4.0) doesn't include the gemma4 model type yet.

Fix: pip install git+https://github.com/huggingface/transformers.git
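If you'd rather fail fast than hit the architecture error mid-script, compare version tuples before loading. A minimal sketch; source installs carry a .dev suffix, and the exact version strings here are illustrative:

```python
def version_tuple(v: str) -> tuple:
    """Parse a version like '5.5.0.dev0' into a comparable tuple of ints."""
    core = v.split(".dev")[0].split("+")[0]  # drop dev/local-build suffixes
    return tuple(int(p) for p in core.split("."))

# A source install (e.g. '5.5.0.dev0') should compare newer than stable 5.4.0.
assert version_tuple("5.5.0.dev0") > version_tuple("5.4.0")
```

In a real script you'd feed `transformers.__version__` into the same check.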

Bug 2: "Target module Gemma4ClippableLinear is not supported"

Gemma 4 introduces Gemma4ClippableLinear for its vision encoder. PEFT doesn't recognize it - even from source. The layer wraps nn.Linear with input/output clamping, but inherits from nn.Module instead of nn.Linear, so PEFT's type check rejects it.

Fix: Monkey-patch it to inherit from nn.Linear. PEFT then treats it normally.

Bug 3: "mm_token_type_ids is required"

Gemma 3 required token_type_ids. Gemma 4 adds mm_token_type_ids for multimodal inputs. Both fields must be present in the batches your data collator produces, even for text-only training.

Fix: Custom collator with both fields initialized to zeros.
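Here's roughly what that collator looks like, sketched with plain Python lists so the padding logic is visible. In a real collator these lists become tensors and you'd reuse the tokenizer's padding; the field names are the ones the processor complains about. The pad_token_id default and the -100 label masking are the usual HF conventions, not anything Gemma-specific:

```python
def collate_with_mm_fields(features, pad_token_id=0):
    """Pad a batch of {"input_ids": [...]} features and add the
    token_type_ids / mm_token_type_ids fields, zeroed for text-only data."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": [],
             "token_type_ids": [], "mm_token_type_ids": []}
    for f in features:
        ids = f["input_ids"]
        pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
        batch["labels"].append(ids + [-100] * pad)  # mask loss on padding
        batch["token_type_ids"].append([0] * max_len)
        batch["mm_token_type_ids"].append([0] * max_len)
    return batch

batch = collate_with_mm_fields([{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}])
```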

Result: 31B model training at 4.5s/step on 1x B200. 534M trainable params (1.68% QLoRA). None of these issues are avoidable with experience - they're day-zero discovery problems.

Could you go into more detail about how exactly you resolved the issue with Gemma4ClippableLinear, please?

Sure. @MetaMind42 The issue is that Gemma4ClippableLinear inherits from nn.Module instead of nn.Linear. PEFT checks the type of every module before applying LoRA, and it only recognizes nn.Linear, nn.Embedding, nn.Conv1d/2d/3d, and nn.MultiheadAttention. Since Gemma4ClippableLinear isn't any of those, PEFT rejects it - even if you use exclude_modules to skip it, because the type check runs first.
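The mechanism is just Python's type system. A dependency-free sketch with stand-in classes (not the real PEFT code) shows why wrapping fails where inheriting passes:

```python
class Linear:
    """Stand-in for nn.Linear."""

class WrappedLinear:
    """Wraps a Linear but does not inherit from it - an isinstance check fails."""
    def __init__(self):
        self.linear = Linear()

class PatchedLinear(Linear):
    """Inherits from Linear, so an isinstance check passes."""

assert not isinstance(WrappedLinear(), Linear)  # PEFT-style check rejects this
assert isinstance(PatchedLinear(), Linear)      # ...and accepts this
```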

The fix is to replace the class definition before loading the model:

import torch
import torch.nn as nn
from transformers.models.gemma4 import modeling_gemma4

class PatchedClippableLinear(nn.Linear):
    def __init__(self, config, in_features, out_features):
        nn.Linear.__init__(self, in_features, out_features, bias=False)
        self.use_clipped_linears = getattr(config, "use_clipped_linears", False)
        if self.use_clipped_linears:
            # No-op clamp bounds by default; the checkpoint's state dict
            # overwrites these buffers on load.
            self.register_buffer("input_min", torch.tensor(-float("inf")))
            self.register_buffer("input_max", torch.tensor(float("inf")))
            self.register_buffer("output_min", torch.tensor(-float("inf")))
            self.register_buffer("output_max", torch.tensor(float("inf")))

    def forward(self, x):
        if self.use_clipped_linears:
            x = torch.clamp(x, self.input_min, self.input_max)
        out = nn.Linear.forward(self, x)
        if self.use_clipped_linears:
            out = torch.clamp(out, self.output_min, self.output_max)
        return out

modeling_gemma4.Gemma4ClippableLinear = PatchedClippableLinear
Place this before any AutoModelForCausalLM.from_pretrained() call. The patched class preserves the clamping behavior but now inherits from nn.Linear, so PEFT recognizes it.

Also use exclude_modules=["vision_tower", "multi_modal_projector", "audio_tower"] in your LoRA config to skip applying LoRA to the vision/audio layers entirely.
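For reference, that LoRA config looks something like the following. The rank/alpha values are illustrative, and the target module names are the usual Gemma projection layers - verify them against the actual model's named_modules() before trusting this sketch:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,            # illustrative rank - tune for your run
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    exclude_modules=["vision_tower", "multi_modal_projector", "audio_tower"],
    task_type="CAUSAL_LM",
)
```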

I filed this as huggingface/peft#3129 - Benjamin Bossan suggested targeting the inner .linear layer as an alternative approach, but I haven't tested that yet.

Will try! Thanks a lot.

@MetaMind42 let me know how it works out. And do try Benjamin's .linear layer approach too - I just haven't had a moment to run it yet... πŸ™ƒ

No luck with Benjamin's workaround. But yours worked for me. Thanks!
