arxiv:2605.24938

Your Embedding Model is SMARTer Than You Think

Published on May 24

Submitted by Harris Zhang on May 26


Authors: Jianrui Zhang, Sukanta Ganguly

Abstract

SMART enhances multimodal retrieval by leveraging latent multi-vector capabilities from single-vector models through contrastive training and late-interaction inference, achieving state-of-the-art performance with reduced computational costs.

AI-generated summary

Multimodal retrieval relies heavily on single-vector retrievers, which compress rich token sequences into a single global representation. While efficient, they discard fine-grained, local evidence critical for dense retrieval tasks. Multi-vector approaches were introduced as a solution, but they always require training, and many ignore the need for a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of the preceding hidden states via gradient flow. By applying late interaction directly over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, further improving even state-of-the-art models on MMEB-V2. We also show that simple, lightweight post-training not only saves time and compute but also yields further gains on Visual Document retrieval, allowing a single-vector model to outperform SoTA multi-vector counterparts. Ultimately, SMART offers both a highly efficient inference enhancement and a powerful finetuning technique for multimodal retrieval. We open source our code and weights at https://github.com/HanSolo9682/SMART.
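To make the contrast between a pooled single-vector score and late interaction concrete, here is a minimal, dependency-free sketch. It assumes mean pooling for the single-vector baseline and ColBERT-style MaxSim (sum over query tokens of the best-matching document token) for late interaction. The abstract does not spell out SMART's exact pooling or scoring rule, so both choices, the function names (`pooled_score`, `maxsim_score`), and the toy 2-dimensional "hidden states" are illustrative assumptions, not the released implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(states):
    """Average token vectors into one vector (assumed pooling, for illustration)."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


def pooled_score(query_states, doc_states):
    """Single-vector baseline: cosine similarity of the pooled representations."""
    return dot(normalize(mean_pool(query_states)), normalize(mean_pool(doc_states)))


def maxsim_score(query_states, doc_states):
    """Late interaction (MaxSim): each query token keeps its best cosine match
    among the document tokens, and the per-token maxima are summed."""
    q = [normalize(v) for v in query_states]
    d = [normalize(v) for v in doc_states]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Toy per-token hidden states (hypothetical values, not model outputs).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # matches each query token exactly
doc_b = [[1.0, 1.0], [1.0, 1.0]]  # matches only "on average"

print("pooled:", pooled_score(query, doc_a), pooled_score(query, doc_b))
print("maxsim:", maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

In this toy case the pooled scores for `doc_a` and `doc_b` are equal up to floating-point error, because both documents average to the same direction, while MaxSim ranks `doc_a` higher since it matches every query token individually. That is the kind of fine-grained, local evidence a single pooled vector can lose.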


Community

HanSolo9682

Paper author Paper submitter about 15 hours ago

Our work introduces SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models for multimodal retrieval. While single-vector retrievers are highly efficient, they often discard the fine-grained, local evidence critical for dense retrieval tasks. To address this, SMART applies direct late-interaction over the frozen hidden states of your model during inference, acting as a plug-and-play upgrade that consistently improves performance across diverse modalities.

Not only does this approach push state-of-the-art performance further on MMEB-V2, but with simple, lightweight post-training, it also enables single-vector models to outperform heavily trained multi-vector counterparts on Visual Document retrieval. Ultimately, it offers both a highly efficient inference enhancement and a powerful finetuning technique to get the absolute most out of existing embedding models, saving both time and compute. Code and weights are open-sourced!

ZebangCheng

about 15 hours ago

Good job!!!

ZebangCheng

about 15 hours ago

Very interesting idea! I’m curious whether SMART works equally well for smaller or older embedding models, or if its gains mainly depend on strong modern backbones like Qwen3-VL-Embedding.

HanSolo9682

Paper author about 15 hours ago (edited about 14 hours ago)

Great question! We did try other models, such as VLM2Vec and GME, to showcase the generalizability of our method. Table 1 is the place to look!