arxiv:2407.07726

PaliGemma: A versatile 3B VLM for transfer

Published on Jul 10, 2024 · Submitted by AK on Jul 11, 2024

#1 Paper of the day

Authors:

Lucas Beyer, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Daniel Keysers, Skanda Koppula, Fangyu Liu, Alexey Gritsenko, Neil Houlsby, Keran Rong, Julian Eisenschlos, Rishabh Kabra

Abstract

PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

AI-generated summary

PaliGemma, a versatile Vision-Language Model based on SigLIP-So400m and Gemma-2B, demonstrates strong performance across numerous open-world tasks, including specialized areas like remote sensing and segmentation.