arxiv:2507.05513

Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model

Published on Jul 7, 2025
Abstract

Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model achieves state-of-the-art performance, scoring an NDCG@5 of 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach leverages the NVIDIA Eagle2 Vision-Language Model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency. We provide a comprehensive analysis of these trade-offs. Additionally, we adopt a two-stage training strategy to enhance the model's retrieval capabilities.

AI-generated summary

A unified text-image retrieval model using a modified NVIDIA Eagle2 VLM with bidirectional attention and ColBERT-style late interaction achieves state-of-the-art performance across benchmarks.
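The ColBERT-style late interaction mentioned in the abstract scores a query-document pair by comparing every query token embedding against every document token embedding, keeping the best match per query token (MaxSim) and summing. A minimal NumPy sketch of that scoring rule follows; the function name and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def late_interaction_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """ColBERT-style MaxSim score.

    query_embs: (num_query_tokens, dim) token embeddings for the query.
    doc_embs:   (num_doc_tokens, dim) token embeddings for the document
                (in a text-image model such as this one, the document side
                may be patch/token embeddings of a page image).
    """
    # L2-normalize so the dot product is cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    # Pairwise similarities: (num_query_tokens, num_doc_tokens).
    sim = q @ d.T
    # For each query token, take its best-matching document token, then sum.
    return float(sim.max(axis=1).sum())
```

Because every document token embedding must be stored for this comparison, late interaction trades index size and scoring cost for the fine-grained accuracy the abstract describes.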

