Integrate with Sentence Transformers v5.4

#1
by tomaarsen HF Staff - opened

Hello!

Preface

Congratulations on your #1 on MAEB!
For context, Sentence Transformers recently added multimodality support (https://huggingface.co/blog/multimodal-sentence-transformers), and I'd love to incorporate your models into the suite of supported models.

Pull Request overview

  • Integrate this model as a Sentence Transformers SentenceTransformer (targeting v5.4)

Details

This PR adds the configuration files needed to load this model directly as a SentenceTransformer via Sentence Transformers. The model uses a feature-extraction Transformer with ["hidden_states", -1] output extraction (since Qwen2_5OmniThinkerForConditionalGeneration returns hidden_states rather than last_hidden_state), followed by last-token pooling and L2 normalization to produce 2048-dimensional embeddings.
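The pooling step can be sketched as follows. This is a minimal NumPy illustration of last-token pooling plus L2 normalization, not the actual Sentence Transformers implementation; the right-padding assumption and the helper name are mine:

```python
import numpy as np

def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of each sequence's last non-padding token.

    hidden_states:  (batch, seq_len, dim) final-layer states
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    # Index of the last real token per sequence (assumes right padding).
    last_idx = attention_mask.sum(axis=1) - 1
    pooled = hidden_states[np.arange(hidden_states.shape[0]), last_idx]
    # L2-normalize so cosine similarity reduces to a plain dot product.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Toy batch: 2 sequences, 4 token positions, 2048-dim states.
states = np.random.randn(2, 4, 2048)
mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]])
emb = last_token_pool(states, mask)
print(emb.shape)  # (2, 2048)
```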

The paper's instruction suffix ("Summarize the above <modality> in one word:") is baked into a dedicated Sentence Transformers chat template (additional_chat_templates/sentence_transformers.jinja), so model.encode() takes plain text or multimodal dicts without any manual formatting.

Because the upstream qwen2_5_omni_thinker model type is shipped in transformers but not registered in AutoConfig, a two-line modeling_lco_omni.py re-exports Qwen2_5OmniThinkerConfig and Qwen2_5OmniThinkerForConditionalGeneration from transformers. An auto_map entry in config.json points at these re-exports so AutoConfig/AutoModel resolve correctly when trust_remote_code=True is set. The original model_type and architectures values are preserved, since auto_map takes precedence during loading when trust_remote_code=True.
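The re-export and the auto_map wiring look roughly like this (a sketch only; the exact file contents are in the PR diff, and the class names come from transformers):

```python
# modeling_lco_omni.py (sketch): re-export the thinker classes so the
# repo's auto_map can resolve them with trust_remote_code=True.
from transformers import (  # noqa: F401
    Qwen2_5OmniThinkerConfig,
    Qwen2_5OmniThinkerForConditionalGeneration,
)

# Corresponding config.json entry (sketch):
# "auto_map": {
#     "AutoConfig": "modeling_lco_omni.Qwen2_5OmniThinkerConfig",
#     "AutoModel": "modeling_lco_omni.Qwen2_5OmniThinkerForConditionalGeneration"
# }
```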

To keep the processor's "System prompt modified, audio output may not work as expected" warning quiet, config_sentence_transformers.json sets Qwen2.5-Omni's default system string as the ST default prompt. Sentence Transformers injects it as a system message on every encode() call, which satisfies the processor's check; the custom chat template then discards it and always renders "You are a helpful assistant." (the original system prompt), so the tokenized output is unchanged from the paper's Quick Start.
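The relevant part of config_sentence_transformers.json would look roughly like this (a sketch; the field names follow the Sentence Transformers config format, and the Qwen default system string is abbreviated here rather than quoted in full):

```json
{
  "model_type": "SentenceTransformer",
  "similarity_fn_name": "cosine",
  "prompts": {
    "default": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, ..."
  },
  "default_prompt_name": "default"
}
```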

Added files:

  • modules.json: pipeline with Transformer, Pooling(lasttoken) and Normalize modules
  • sentence_bert_config.json: feature-extraction task with ["hidden_states", -1] output extraction, structured message format, and a custom chat template reference with add_generation_prompt=True
  • config_sentence_transformers.json: SentenceTransformer model type with cosine similarity and a single default prompt set to Qwen2.5-Omni's default system string. This is injected as a system message on every encode() call purely to satisfy the processor's "System prompt modified, audio output may not work as expected" check; the custom chat template discards it and renders "You are a helpful assistant." instead.
  • 1_Pooling/config.json: last-token pooling with 2048 embedding dimension
  • chat_template.jinja: the upstream Qwen2.5-Omni chat template (byte-identical to the legacy chat_template.json's value), converted to the modern .jinja format so it coexists with additional_chat_templates/
  • additional_chat_templates/sentence_transformers.jinja: chat template that auto-appends "\nSummarize the above <modality> in one word:" to the user content based on whether the message contains text, image, audio, or video, and discards any input system message while emitting a fixed "You are a helpful assistant." system block
  • modeling_lco_omni.py: two-line re-export of Qwen2_5OmniThinkerConfig / Qwen2_5OmniThinkerForConditionalGeneration from transformers so auto_map can resolve them (the config's model_type and architectures fields stay unchanged)
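The suffix selection done by the Sentence Transformers chat template can be illustrated in plain Python. This is a sketch of the logic only: the shipped version is Jinja, and the helper name, the modality priority order, and the exact wording per branch are my assumptions:

```python
def instruction_suffix(message: dict) -> str:
    """Pick the paper's one-word summarization suffix based on which
    modality keys appear in a multimodal input dict (hypothetical helper
    mirroring additional_chat_templates/sentence_transformers.jinja)."""
    # Non-text modalities take priority so mixed inputs such as
    # text + image still get a single, specific suffix.
    for modality in ("audio", "video", "image", "text"):
        if modality in message:
            return f"\nSummarize the above {modality} in one word:"
    return "\nSummarize the above text in one word:"

print(repr(instruction_suffix({"text": "hello"})))
# '\nSummarize the above text in one word:'
print(repr(instruction_suffix({"text": "caption", "image": "cat.png"})))
# '\nSummarize the above image in one word:'
```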

Modified files:

  • README.md: switched library_name from transformers to sentence-transformers, added a tags: block (transformers, sentence-transformers, feature-extraction, multimodal-embedding), and added a "Using Sentence Transformers" subsection with text, image, and multimodal-dict examples. The existing Quick Start code is retained as a "Using Transformers" subsection; pipeline_tag is unchanged.
  • config.json: added an auto_map entry pointing at modeling_lco_omni; model_type and architectures are unchanged

Removed files:

  • chat_template.json: replaced by the modern chat_template.jinja (required for additional_chat_templates/ to be picked up by the processor)

Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "LCO-Embedding/LCO-Embedding-Omni-3B",
    trust_remote_code=True,
    model_kwargs={"dtype": torch.bfloat16},
    revision="refs/pr/1",
)

texts = [
    "The capital of France is Paris.",
    "Paris is the capital city of France.",
    "The Eiffel Tower is located in Paris.",
    "Berlin is the capital of Germany.",
]
text_embeddings = model.encode(texts)
print(text_embeddings.shape)
# (4, 2048)

text_similarities = model.similarity(text_embeddings, text_embeddings)
print(text_similarities)
# tensor([[1.0000, 0.9538, 0.6566, 0.5988],
#         [0.9538, 1.0000, 0.7059, 0.5932],
#         [0.6566, 0.7059, 1.0000, 0.4198],
#         [0.5988, 0.5932, 0.4198, 1.0000]])

image_embeddings = model.encode([
    "path/to/image_1.png",
    "path/to/image_2.png",
])
print(image_embeddings.shape)
# (2, 2048)

# Multimodal inputs can mix modalities via dicts (text + image + audio + video):
queries = ["A diagram of the Qwen2.5-Omni architecture"]
documents = [
    {"image": "path/to/qwen_diagram.png"},
    {"text": "Llama 4 architecture overview", "image": "path/to/llama_diagram.png"},
]
query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities.shape)
# torch.Size([1, 2])

And after merging, the revision argument can be dropped.

Note that none of the old behaviour is affected or changed; this PR only adds an additional way to run the model in a familiar and common format. Sadly, trust_remote_code=True will be needed for now, as the thinker component can't currently be loaded standalone. I'll also be opening a PR for the 7B model with the same changes. Let me know if you have any questions!

  • Tom Aarsen
tomaarsen changed pull request status to open
LCO-Embedding org

Hi Tom,

Thanks so much for the integration, and congrats on the multimodality support in Sentence Transformers! I've been following the latest release. This looks great to me; merging it now. We are releasing a pro version of the model soon; I'll let you know, and I look forward to integrating it as well.

Chenghao

gowitheflow changed pull request status to merged

Very exciting! Feel free to ping me when you do; perhaps I can help spread the word on social media or something along those lines 🤗

  • Tom Aarsen
