LCO-Embedding: Scaling Language-Centric Omnimodal Representation Learning
We are thrilled to release LCO-Embedding, a language-centric omnimodal representation learning framework, together with the LCO-Embedding model family!
This model implements the framework presented in the paper Scaling Language-Centric Omnimodal Representation Learning, accepted to NeurIPS 2025.
Project Page: https://huggingface.co/LCO-Embedding
Github Repository: https://github.com/LCO-Embedding/LCO-Embedding
Quick Start
Note: We use only the thinker component of Qwen2.5-Omni and drop the talker component.
Using Sentence Transformers
Install Sentence Transformers:
pip install "sentence_transformers[image]"
import torch
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
    "LCO-Embedding/LCO-Embedding-Omni-3B",
    trust_remote_code=True,
    model_kwargs={"dtype": torch.bfloat16},
)
# The same "Summarize the above <modality> in one word:" instruction used in
# the paper is baked into the chat template, so encode() takes plain text or
# multimodal dicts directly.
texts = [
    "The capital of France is Paris.",
    "Paris is the capital city of France.",
    "The Eiffel Tower is located in Paris.",
    "Berlin is the capital of Germany.",
]
text_embeddings = model.encode(texts)
print(text_embeddings.shape)
# (4, 2048)
text_similarities = model.similarity(text_embeddings, text_embeddings)
print(text_similarities)
# tensor([[1.0000, 0.9538, 0.6566, 0.5988],
#         [0.9538, 1.0000, 0.7059, 0.5932],
#         [0.6566, 0.7059, 1.0000, 0.4198],
#         [0.5988, 0.5932, 0.4198, 1.0000]])
# Encoding images (text, audio, and video also work, individually or combined using a dict input):
image_embeddings = model.encode([
    "path/to/image_1.png",
    "path/to/image_2.png",
])
print(image_embeddings.shape)
# (2, 2048)
# Multimodal inputs can mix modalities via dicts (text + image + audio + video):
queries = ["A diagram of the Qwen2.5-Omni architecture"]
documents = [
    {"image": "path/to/qwen_diagram.png"},
    {"text": "Llama 4 architecture overview", "image": "path/to/llama_diagram.png"},
]
query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities.shape)
# torch.Size([1, 2])
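By default, `model.similarity` computes cosine similarity. A minimal sketch of the equivalent computation on raw embeddings (plain `torch`, no model needed):

```python
import torch

def cosine_similarity_matrix(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # L2-normalize each row, then a matrix product yields pairwise cosine similarity.
    a = torch.nn.functional.normalize(a, p=2, dim=1)
    b = torch.nn.functional.normalize(b, p=2, dim=1)
    return a @ b.T

embeddings = torch.randn(4, 2048)
sims = cosine_similarity_matrix(embeddings, embeddings)
print(sims.shape)  # torch.Size([4, 4])
# Each embedding has cosine similarity 1 with itself (up to float error).
print(torch.allclose(sims.diagonal(), torch.ones(4), atol=1e-5))  # True
```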
Using Transformers
import torch
from tqdm import tqdm
from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

processor = Qwen2_5OmniProcessor.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")  # or pass `max_pixels=1280*28*28` for more efficient encoding
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "LCO-Embedding/LCO-Embedding-Omni-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
Text Batch Encoding:
texts = ["some random text", "a second random text", "a third random text"] * 30
batch_size = 8
text_prompt = "{}\nSummarize the above text in one word:"

all_text_embeddings = []
with torch.no_grad():
    for i in tqdm(range(0, len(texts), batch_size)):
        batch_texts = texts[i : i + batch_size]
        batch_texts = [text_prompt.format(text) for text in batch_texts]
        messages = [[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                ],
            }
        ] for text in batch_texts]
        text_inputs = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        text_inputs = processor(
            text=text_inputs,
            padding=True,
            return_tensors="pt",
        )
        text_inputs = text_inputs.to("cuda")
        text_outputs = model(
            **text_inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_text_embeddings.append(text_outputs.to(torch.float16).cpu())

all_text_embeddings = torch.cat(all_text_embeddings, dim=0)
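A caveat on the `[:, -1, :]` pooling above: it reads the final position of each padded sequence, which is the real last token only when padding is on the left. If your tokenizer pads on the right, a mask-aware variant is safer; a small sketch (the `last_token_pool` helper is hypothetical, not part of the released code):

```python
import torch

def last_token_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Locate the last non-padding token in each row, regardless of padding side.
    seq_len = attention_mask.size(1)
    last_idx = seq_len - 1 - attention_mask.flip(dims=[1]).argmax(dim=1)
    rows = torch.arange(hidden.size(0))
    return hidden[rows, last_idx]

hidden = torch.arange(2 * 5 * 3, dtype=torch.float32).reshape(2, 5, 3)
mask = torch.tensor([[1, 1, 1, 0, 0],   # right-padded: last real token at index 2
                     [0, 0, 1, 1, 1]])  # left-padded: last real token at index 4
pooled = last_token_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 3])
```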
Image Batch Encoding:
images = [some_pil_image] * 100  # placeholder: better to load these with a DataLoader; see the MIEB evaluation pipeline
image_prompt = "\nSummarize the above image in one word:"
batch_size = 8

all_image_embeddings = []
with torch.no_grad():
    for i in tqdm(range(0, len(images), batch_size)):
        batch_images = images[i : i + batch_size]
        messages = [[
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": image_prompt},
                ],
            }
        ] for image in batch_images]
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(messages, use_audio_in_video=True)
        inputs = processor(
            text=text,
            audio=audio_inputs,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding=True,
        )
        inputs = inputs.to("cuda")
        image_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_image_embeddings.append(image_outputs.to(torch.float16).cpu())

all_image_embeddings = torch.cat(all_image_embeddings, dim=0)
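As the placeholder comment above suggests, decoding images inline is a bottleneck; wrapping them in a `torch.utils.data.DataLoader` lets worker processes handle decoding. A hedged sketch (the `ImagePathDataset` class is hypothetical; real code would open the image in `__getitem__` and set `num_workers > 0`):

```python
from torch.utils.data import DataLoader, Dataset

class ImagePathDataset(Dataset):
    """Hypothetical dataset yielding image paths (or decoded PIL images)."""
    def __init__(self, paths):
        self.paths = list(paths)
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, idx):
        return self.paths[idx]  # real code could return Image.open(...) here

paths = [f"image_{i}.png" for i in range(10)]
# collate_fn=list keeps each batch a plain Python list for the processor.
loader = DataLoader(ImagePathDataset(paths), batch_size=4, collate_fn=list)
batch_sizes = [len(batch) for batch in loader]
print(batch_sizes)  # [4, 4, 2]
```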
Audio Batch Encoding:
import logging
logging.getLogger("root").setLevel(logging.ERROR)
# Set this to silence the Qwen Omni system-prompt mismatch warning.

batch_size = 4
audio_prompt = "\nSummarize the above audio in one word:"
audios = [some_audio] * 1000  # placeholder

all_audio_embeddings = []
with torch.no_grad():
    for i in tqdm(range(0, len(audios), batch_size)):
        torch.cuda.empty_cache()
        batch_audios = audios[i : i + batch_size]
        messages = [[
            {
                "role": "user",
                "content": [
                    {"type": "audio", "audio": audio},
                    {"type": "text", "text": audio_prompt},
                ],
            }
        ] for audio in batch_audios]
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(
            messages, use_audio_in_video=False
        )
        inputs = processor(
            text=text,
            audio=audio_inputs,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding=True,
        )
        inputs = inputs.to("cuda")
        audio_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_audio_embeddings.append(audio_outputs.to(torch.float16).cpu())
        del inputs, audio_outputs
        torch.cuda.empty_cache()

all_audio_embeddings = torch.cat(all_audio_embeddings, dim=0)
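`process_mm_info` accepts file paths or waveform arrays; Qwen2.5-Omni's audio encoder expects 16 kHz mono input, so resample first if your audio uses another rate. A naive linear-interpolation sketch with numpy (prefer `librosa` or `torchaudio` for production-quality resampling with anti-aliasing):

```python
import numpy as np

def resample_linear(wave: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    # Crude linear-interpolation resampling; no anti-aliasing filter.
    duration = len(wave) / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(wave), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, wave).astype(np.float32)

# One second of a 440 Hz tone sampled at 44.1 kHz -> 16 kHz.
wave_44k = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100).astype(np.float32)
wave_16k = resample_linear(wave_44k, orig_sr=44100)
print(wave_16k.shape)  # (16000,)
```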
Video Batch Encoding:
videos = [some_video] * 1000  # placeholder
video_prompt = "\nSummarize the above video in one word:"
batch_size = 4
long_video = False
# If True, the branch below applies some example hyperparameters that save RAM
# on long videos. They are not optimal; tune them case by case.

all_video_embeddings = []
with torch.no_grad():
    for i in tqdm(range(0, len(videos), batch_size)):
        torch.cuda.empty_cache()
        batch_videos = videos[i : i + batch_size]
        if long_video:
            messages = [[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "video",
                            "video": video,
                            "max_pixels": 224 * 224,
                            "fps": 1,
                            "max_frames": 10,
                        },
                        {"type": "text", "text": video_prompt},
                    ],
                }
            ] for video in batch_videos]
        else:
            messages = [[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "video",
                            "video": video,
                        },
                        {"type": "text", "text": video_prompt},
                    ],
                }
            ] for video in batch_videos]
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(
            messages, use_audio_in_video=False
        )
        inputs = processor(
            text=text,
            audio=audio_inputs,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding=True,
        )
        inputs = inputs.to("cuda")
        video_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_video_embeddings.append(video_outputs.to(torch.float16).cpu())
        del inputs, video_outputs
        torch.cuda.empty_cache()

all_video_embeddings = torch.cat(all_video_embeddings, dim=0)
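To see why the long-video caps tame memory: assuming, as in the Qwen2.5-VL family, roughly one visual token per 28x28 pixel patch after merging (a simplification that ignores temporal merging), a quick back-of-envelope count:

```python
# Back-of-envelope visual token count under the long-video settings above.
PIXELS_PER_TOKEN = 28 * 28            # one visual token per 28x28 patch (assumed)
max_pixels = 224 * 224                # per-frame pixel cap from the snippet
max_frames = 10

tokens_per_frame = max_pixels // PIXELS_PER_TOKEN
total_visual_tokens = tokens_per_frame * max_frames
print(tokens_per_frame, total_visual_tokens)  # 64 640
```

Without the caps, a single high-resolution frame can consume thousands of visual tokens, so ten frames at 224x224 is a large saving.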
Overview
We introduce LCO-Embedding, a language-centric omnimodal representation learning method, and the LCO-Embedding model family, setting a new state of the art on MIEB (the Massive Image Embedding Benchmark) while also supporting audio and video.
This work also introduces the Generation-Representation Scaling Law, linking a model's generative capability to its representation upper bound. Furthermore, we introduce SeaDoc, a challenging visual document retrieval task in Southeast Asian languages, and show that continual generative pretraining before contrastive learning raises the representation upper bound.

Evaluation Results
We evaluate LCO-Embedding against state-of-the-art embedding models, including E5-V, Voyage Multimodal 3, mmE5, and GME, on the MIEB-Lite benchmark (51 tasks), broken down by task category.

LCO-Embedding is also state-of-the-art on MAEB (the Massive Audio Embedding Benchmark) without even training on audio. Screenshot from the MAEB paper.
Performance and efficiency comparisons of different training strategies using 3B and 7B variants of Qwen2.5-VL backbones.

Scaling relationship between generation benchmark performance (X-axis) and representation benchmark performance after language-centric contrastive learning (Y-axis).

Citation
If you find LCO-Embedding useful for your research and applications, please cite using this BibTeX:
@article{xiao2025scaling,
  title={Scaling Language-Centric Omnimodal Representation Learning},
  author={Xiao, Chenghao and Chan, Hou Pong and Zhang, Hao and Xu, Weiwen and Aljunied, Mahani and Rong, Yu},
  journal={arXiv preprint arXiv:2510.11693},
  year={2025}
}