ColGemma-4

ColGemma4-E4B-IT-Base

See also: ColGemma4-E2B-IT-Base (smaller 2.3B-effective variant)

ColGemma4 is a visual document retrieval model built on Google's Gemma 4 E4B (9B params, 4.5B effective). It generates ColBERT-style multi-vector representations of document images and text queries for late-interaction retrieval.

A single-seed LoRA adapter trained with ColbertLoss: no hard negatives, no model merging.

Built following the ColPali recipe, adapted to Gemma 4's multimodal architecture.

Model Description

| Property | Value |
|---|---|
| Base model | google/gemma-4-E4B-it (9B total, 4.5B effective) |
| Architecture | ColBERT late-interaction over Gemma 4 VLM |
| Embedding dim | 128 |
| Visual tokens | 1120 (max soft tokens) |
| Fine-tuning | LoRA (r=32, alpha=32, dropout=0.1) |
| Trainable params | 73.4M (0.95% of total) |
| Projection | Random-init linear (hidden_size -> 128), not trained |
| Training loss | ColbertLoss (temperature=0.02, in-batch negatives only) |
| Precision | BF16 |

Benchmark Results

Scores are nDCG@5 and nDCG@10 on the ViDoRe benchmark suite.

ViDoRe V1

| Task | nDCG@5 | nDCG@10 |
|---|---|---|
| ArxivQA | 84.53 | 85.57 |
| DocVQA | 57.70 | 59.87 |
| InfoVQA | 90.84 | 91.12 |
| ShiftProject | 84.45 | 85.17 |
| SyntheticDocQA - AI | 97.89 | 97.89 |
| SyntheticDocQA - Energy | 94.40 | 95.01 |
| SyntheticDocQA - Government | 96.58 | 96.58 |
| SyntheticDocQA - Healthcare | 97.89 | 97.89 |
| Tabfquad | 92.37 | 92.84 |
| Tatdqa | 76.72 | 78.93 |
| Average | 87.34 | 88.09 |

ViDoRe V2

| Task | nDCG@5 | nDCG@10 |
|---|---|---|
| BioMedical Lectures | 57.06 | 59.48 |
| ESG Reports - HL | 57.96 | 61.07 |
| ESG Reports | 47.51 | 51.59 |
| Economics Reports | 39.76 | 42.46 |
| Average | 50.57 | 53.65 |

ViDoRe V3

| Task | nDCG@5 | nDCG@10 |
|---|---|---|
| Computer Science | 63.55 | 67.76 |
| Energy | 59.64 | 62.55 |
| Finance En | 51.05 | 53.96 |
| Finance Fr | 41.05 | 44.64 |
| HR | 52.60 | 55.17 |
| Pharmaceuticals | 55.98 | 57.42 |
| Physics | 43.37 | 46.11 |
| Industrial | 41.52 | 42.98 |
| Average | 51.10 | 53.82 |

Usage

Installation

```bash
pip install colpali-engine transformers torch peft
```

Loading the Model

```python
import torch
from colgemma4 import ColGemma4, ColGemma4Processor

model = ColGemma4.from_pretrained(
    "athrael-soju/ColGemma4-E4B-IT-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
    ignore_mismatched_sizes=True,
)

processor = ColGemma4Processor.from_pretrained(
    "athrael-soju/ColGemma4-E4B-IT-Base",
    max_num_visual_tokens=1120,
)
```

Encoding Documents (Images)

```python
from PIL import Image

images = [Image.open("page1.png"), Image.open("page2.png")]
batch_doc = processor.process_images(images)
batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}

with torch.no_grad():
    doc_embeddings = model(**batch_doc)
```

Encoding Queries

```python
queries = ["What is the revenue for Q3 2024?"]
batch_query = processor.process_queries(queries)
batch_query = {k: v.to(model.device) for k, v in batch_query.items()}

with torch.no_grad():
    query_embeddings = model(**batch_query)
```

Scoring (MaxSim)

```python
scores = processor.score(query_embeddings, doc_embeddings)
```
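Under the hood, the "Scoring (MaxSim)" step is late interaction: each query token is matched to its most similar document token, and those per-token maxima are summed. A minimal sketch in plain PyTorch, assuming L2-normalized, padded embedding tensors (processor.score may handle masking and batching differently):

```python
import torch

def maxsim_scores(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) scoring.

    query_embs: (num_queries, q_len, dim) L2-normalized token embeddings
    doc_embs:   (num_docs, d_len, dim) L2-normalized token embeddings
    Returns a (num_queries, num_docs) score matrix.
    """
    # Token-level similarity tensor: (num_queries, num_docs, q_len, d_len)
    sim = torch.einsum("qld,kmd->qklm", query_embs, doc_embs)
    # For each query token, keep its best-matching doc token, then sum over query tokens.
    return sim.max(dim=-1).values.sum(dim=-1)
```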

Training Configuration

Base model: google/gemma-4-E4B-it
Loss: ColbertLoss (temperature=0.02)
Hard negatives: none
Batch size per GPU: 64
GPUs: 8
Gradient accumulation: 1
Effective batch size: 512 (64 x 8)
In-batch negatives: 512
LoRA:
  r: 32
  alpha: 32
  dropout: 0.1
  target_modules: "language_model.*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj)"
Learning rate: 2e-4 (cosine schedule, 8% warmup)
Weight decay: 0.02
Epochs: 1
Steps: 1,512
Visual tokens: 1120
Attention: Bidirectional (all 42 layers patched)
Gradient checkpointing: enabled
Precision: BF16
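The loss in the table above can be sketched as in-batch contrastive cross-entropy over MaxSim scores: positives sit on the diagonal, every other document in the batch is a negative, and the temperature (0.02 here) sharpens the softmax. A minimal sketch; the actual ColbertLoss in colpali-engine may differ in details such as padding masks and score normalization:

```python
import torch
import torch.nn.functional as F

def colbert_in_batch_loss(query_embs, doc_embs, temperature: float = 0.02):
    """In-batch contrastive loss over late-interaction (MaxSim) scores.

    query_embs: (batch, q_len, dim); doc_embs: (batch, d_len, dim).
    Query i's positive document is doc i; the other batch docs are negatives.
    """
    # (batch, batch) score matrix via MaxSim
    sim = torch.einsum("qld,kmd->qklm", query_embs, doc_embs)
    scores = sim.max(dim=-1).values.sum(dim=-1)
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores / temperature, labels)
```

Because the negatives come only from the batch, the effective batch size (512) directly sets the number of negatives per query, which is why the configuration keeps gradient accumulation at 1.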

Training Data

Trained on ~774K query-document pairs from publicly available datasets:

  • vidore/colpali_train_set
  • openbmb/VisRAG-Ret-Train-Synthetic-data
  • openbmb/VisRAG-Ret-Train-In-domain-data
  • llamaindex/vdr-multilingual-train (en/de/es/fr/it subsets)
  • vidore/tatdqa_train
  • vidore/tabfquad_train_set

Troubleshooting & Fine-tuning Guide

If you're building on this model or training your own ColGemma4, here's what we learned along the way.

Gemma 4 Architecture Gotchas

  • Position embedding memory blow-up - The vision encoder uses F.one_hot(positions, num_classes=10240) which allocates ~314 GB at batch=32 with 1120 tokens. Replacing with F.embedding is mathematically identical and saves ~106 GB/GPU. Required for batch sizes above 8.
    # In Gemma4VisionEncoder._position_embeddings:
    # Replace: one_hot = F.one_hot(clamped_positions, num_classes=self.position_embedding_size)
    # With:    return F.embedding(clamped_positions, self.position_embedding_table)
    
  • KV-sharing and use_cache - Gemma 4 E4B has 18 of 42 text layers (24-41) that reuse K/V from 2 donor layers (22 and 23) when caching is enabled. During training, always set use_cache=False to ensure every layer computes its own K/V and all LoRA weights are active. At inference time, set use_cache=True so the KV-sharing architecture works as designed.
  • Flash Attention is incompatible - Gemma 4 has head_dim 256 (sliding) and 512 (global). FA v2 caps at 256, FA v4 at 128. Use attn_implementation="sdpa" instead.
  • Gradient checkpointing is mandatory - At 1120 visual tokens, activations from 42 layers consume well over 200 GB. Even with the position patch, disabling gradient checkpointing will OOM.
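The one_hot-to-embedding swap from the first bullet is easy to verify in isolation: a one-hot matrix multiplied by the position table selects exactly the rows that F.embedding gathers directly, without materializing the huge intermediate. A toy check, with tensor names standing in for (not taken from) Gemma 4's actual vision-encoder attributes:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the vision encoder's position table and clamped indices.
position_embedding_table = torch.randn(10240, 64)        # (num_classes, hidden)
clamped_positions = torch.randint(0, 10240, (2, 1120))   # (batch, tokens)

# Original path: materializes a (2, 1120, 10240) one-hot tensor.
one_hot = F.one_hot(clamped_positions, num_classes=10240).to(position_embedding_table.dtype)
out_one_hot = one_hot @ position_embedding_table

# Patched path: direct row gather, no giant intermediate.
out_embedding = F.embedding(clamped_positions, position_embedding_table)

assert torch.allclose(out_one_hot, out_embedding)
```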

Training Tips

  • Leave custom_text_proj alone - The 128-dim projection is randomly initialized and works best untrained. Both LoRA-targeting and modules_to_save caused regressions in our experiments. The random projection provides a consistent mapping without overfitting.
  • Keep grad_accum=1 with contrastive losses - all_gather only collects the current micro-batch, so accumulation steps halve your in-batch negatives. Training loss looks deceptively good but eval regresses. Use the largest batch that fits with grad_accum=1.
  • Avoid torch.compile - It adds _orig_mod. to weight keys, breaking PEFT adapter loading at eval time. Scores drop to near-zero despite healthy training loss.
  • Watch for silent weight randomization - ignore_mismatched_sizes=True silently re-initializes any mismatched weights at random, with no error or warning. Sanity check: loss at step 0 should be near log(batch_size) (~6.2 for batch 512), and grad norms should be 5-20. If grad norms are in the thousands, weights didn't load correctly.
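The step-0 sanity check in the last bullet follows from cross-entropy over near-uniform scores at random init: with N in-batch candidates the expected loss is log(N). A quick way to compute the target for your own batch size:

```python
import math

def expected_initial_loss(num_in_batch_candidates: int) -> float:
    """Cross-entropy at random init is ~uniform over the in-batch candidates,
    so the expected starting loss is log(batch_size)."""
    return math.log(num_in_batch_candidates)

print(round(expected_initial_loss(512), 2))  # ~6.24 for an effective batch of 512
```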

HydraGemma4 (Dual-Head Variant)

This repo includes lm_head.pt, the saved language model head from the base model. Combined with the LoRA adapter, this enables the HydraGemma4 dual-head architecture supporting both retrieval (ColBERT embeddings) and generation (text output) from the same base model.
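The dual-head idea can be illustrated with a toy module: one shared trunk feeds both a retrieval projection (ColBERT embeddings) and a generation head (vocabulary logits). All module names and shapes below are illustrative, not HydraGemma4's actual code:

```python
import torch
import torch.nn as nn

class ToyDualHead(nn.Module):
    """Toy dual-head sketch: shared backbone, retrieval + generation heads."""

    def __init__(self, feat=32, hidden=64, emb_dim=128, vocab=1000):
        super().__init__()
        self.backbone = nn.Linear(feat, hidden)              # stand-in for the VLM trunk
        self.custom_text_proj = nn.Linear(hidden, emb_dim)   # retrieval head (128-dim)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)  # generation head

    def forward(self, x, mode="retrieval"):
        h = self.backbone(x)
        if mode == "retrieval":
            # ColBERT-style: L2-normalized multi-vector embeddings per token
            e = self.custom_text_proj(h)
            return e / e.norm(dim=-1, keepdim=True)
        return self.lm_head(h)  # logits for text generation

model = ToyDualHead()
x = torch.randn(2, 10, 32)             # (batch, tokens, features)
emb = model(x, mode="retrieval")       # (2, 10, 128)
logits = model(x, mode="generation")   # (2, 10, 1000)
```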

Limitations

  • Single-seed, no model merging
  • No hard negative mining (relies entirely on in-batch negatives)
  • English-centric training data (some multilingual from VDR)
  • Visual token budget fixed at 1120

Citation

```bibtex
@misc{colgemma4,
  title={ColGemma4: Visual Document Retrieval with Gemma 4},
  author={Athrael Soju},
  year={2026},
  url={https://huggingface.co/athrael-soju/ColGemma4-E4B-IT-Base}
}
```
