Integrate with Sentence Transformers v5.4
Hello!
Preface
Congratulations on your #1 on MAEB!
For context, Sentence Transformers recently added multimodality support (https://huggingface.co/blog/multimodal-sentence-transformers), and I'd love to incorporate your models into the suite of supported models.
Pull Request overview
- Integrate this model as a Sentence Transformers `SentenceTransformer` (targeting v5.4)
Details
This PR adds the configuration files needed to load this model directly as a SentenceTransformer via Sentence Transformers. The model uses a feature-extraction Transformer with ["hidden_states", -1] output extraction (since Qwen2_5OmniThinkerForConditionalGeneration returns hidden_states rather than last_hidden_state), followed by last-token pooling and L2 normalization to produce 2048-dimensional embeddings.
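The last-token pooling and L2 normalization steps can be sketched in plain NumPy (a minimal illustration of the pipeline's final two modules; the toy hidden states and attention mask are made-up values, not the model's actual outputs):

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of the last non-padding token in each sequence."""
    # Index of the last token with attention_mask == 1 in each row.
    last_idx = attention_mask.sum(axis=1) - 1  # shape: (batch,)
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]  # (batch, dim)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm, as the Normalize module does."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy batch: 2 sequences, 5 tokens, 2048-dim hidden states (matching the model's dim).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 2048))
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])  # first sequence has 3 real tokens

emb = l2_normalize(last_token_pool(hidden, mask))
print(emb.shape)  # (2, 2048)
print(np.linalg.norm(emb, axis=1))  # ~[1. 1.]
```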
The paper's instruction suffix ("Summarize the above <modality> in one word:") is baked into a dedicated Sentence Transformers chat template (additional_chat_templates/sentence_transformers.jinja), so model.encode() takes plain text or multimodal dicts without any manual formatting.
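The suffix logic amounts to picking the modality keyword from the message content. A minimal Python sketch of that behaviour (the function name, dict format, and modality priority order here are illustrative; the actual logic lives in the Jinja template):

```python
def instruction_suffix(message: dict) -> str:
    """Mirror the template's '\\nSummarize the above <modality> in one word:' suffix."""
    # Priority order is illustrative; the real template keys off the message content.
    for modality in ("audio", "video", "image", "text"):
        if modality in message:
            return f"\nSummarize the above {modality} in one word:"
    return ""

print(instruction_suffix({"text": "The capital of France is Paris."}))
# Summarize the above text in one word:
print(instruction_suffix({"image": "path/to/image.png"}))
# Summarize the above image in one word:
```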
Because the upstream `qwen2_5_omni_thinker` model type ships in transformers but is not registered in `AutoConfig`, a two-line `modeling_lco_omni.py` re-exports `Qwen2_5OmniThinkerConfig` and `Qwen2_5OmniThinkerForConditionalGeneration` from transformers. An `auto_map` entry in `config.json` points at these re-exports so that `AutoConfig`/`AutoModel` resolve correctly when `trust_remote_code=True` is set; since `auto_map` takes precedence during loading, the original `model_type` and `architectures` values can be preserved.
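For reference, the `auto_map` wiring looks roughly like this (a sketch of the `config.json` addition; the exact dotted paths are assumptions based on the re-export module described above, and unrelated config keys are omitted):

```json
{
  "model_type": "qwen2_5_omni_thinker",
  "architectures": ["Qwen2_5OmniThinkerForConditionalGeneration"],
  "auto_map": {
    "AutoConfig": "modeling_lco_omni.Qwen2_5OmniThinkerConfig",
    "AutoModel": "modeling_lco_omni.Qwen2_5OmniThinkerForConditionalGeneration"
  }
}
```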
To keep the processor's "System prompt modified, audio output may not work as expected" warning quiet, config_sentence_transformers.json sets Qwen2.5-Omni's default system string as the ST default prompt. Sentence Transformers injects it as a system message on every encode() call, which satisfies the processor's check; the custom chat template then discards it and always renders "You are a helpful assistant." (the original system prompt), so the tokenized output is unchanged from the paper's Quick Start.
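The effect of this handshake can be sketched as follows (a toy Python reconstruction of the template's system-message behaviour, not the actual Jinja code; the placeholder string stands in for Qwen2.5-Omni's default system prompt):

```python
FIXED_SYSTEM = "You are a helpful assistant."

def apply_template(messages: list[dict]) -> list[dict]:
    """Discard any incoming system message and emit the fixed system block,
    mirroring the behaviour of sentence_transformers.jinja described above."""
    body = [m for m in messages if m["role"] != "system"]
    return [{"role": "system", "content": FIXED_SYSTEM}] + body

# The injected ST default prompt (placeholder string here) is silently replaced:
out = apply_template([
    {"role": "system", "content": "<Qwen2.5-Omni default system string>"},
    {"role": "user", "content": "The capital of France is Paris."},
])
print(out[0])  # {'role': 'system', 'content': 'You are a helpful assistant.'}
```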
Added files:
- `modules.json`: pipeline with `Transformer`, `Pooling` (last token) and `Normalize` modules
- `sentence_bert_config.json`: `feature-extraction` task with `["hidden_states", -1]` output extraction, structured message format, and a custom chat template reference with `add_generation_prompt=True`
- `config_sentence_transformers.json`: `SentenceTransformer` model type with cosine similarity and a single `default` prompt set to Qwen2.5-Omni's default system string. This is injected as a system message on every `encode()` call purely to satisfy the processor's "System prompt modified, audio output may not work as expected" check; the custom chat template discards it and renders "You are a helpful assistant." instead.
- `1_Pooling/config.json`: last-token pooling with 2048 embedding dimension
- `chat_template.jinja`: the upstream Qwen2.5-Omni chat template (byte-identical to the legacy `chat_template.json`'s value), converted to the modern `.jinja` format so it coexists with `additional_chat_templates/`
- `additional_chat_templates/sentence_transformers.jinja`: chat template that auto-appends "\nSummarize the above <modality> in one word:" to the user content based on whether the message contains text, image, audio, or video, and discards any input system message while emitting a fixed "You are a helpful assistant." system block
- `modeling_lco_omni.py`: two-line re-export of `Qwen2_5OmniThinkerConfig`/`Qwen2_5OmniThinkerForConditionalGeneration` from transformers so `auto_map` can resolve them (the config's `model_type` and `architectures` fields stay unchanged)
Modified files:
- `README.md`: switched `library_name` from `transformers` to `sentence-transformers`, added a `tags:` block (`transformers`, `sentence-transformers`, `feature-extraction`, `multimodal-embedding`), and added a "Using Sentence Transformers" subsection with text, image, and multimodal-dict examples. The existing Quick Start code is retained as a "Using Transformers" subsection; `pipeline_tag` is unchanged.
- `config.json`: added an `auto_map` entry pointing at `modeling_lco_omni`; `model_type` and `architectures` are unchanged
Removed files:
- `chat_template.json`: replaced by the modern `chat_template.jinja` (required for `additional_chat_templates/` to be picked up by the processor)
Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:
```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "LCO-Embedding/LCO-Embedding-Omni-3B",
    trust_remote_code=True,
    model_kwargs={"dtype": torch.bfloat16},
    revision="refs/pr/1",
)

texts = [
    "The capital of France is Paris.",
    "Paris is the capital city of France.",
    "The Eiffel Tower is located in Paris.",
    "Berlin is the capital of Germany.",
]
text_embeddings = model.encode(texts)
print(text_embeddings.shape)
# (4, 2048)

text_similarities = model.similarity(text_embeddings, text_embeddings)
print(text_similarities)
# tensor([[1.0000, 0.9538, 0.6566, 0.5988],
#         [0.9538, 1.0000, 0.7059, 0.5932],
#         [0.6566, 0.7059, 1.0000, 0.4198],
#         [0.5988, 0.5932, 0.4198, 1.0000]])

image_embeddings = model.encode([
    "path/to/image_1.png",
    "path/to/image_2.png",
])
print(image_embeddings.shape)
# (2, 2048)

# Multimodal inputs can mix modalities via dicts (text + image + audio + video):
queries = ["A diagram of the Qwen2.5-Omni architecture"]
documents = [
    {"image": "path/to/qwen_diagram.png"},
    {"text": "Llama 4 architecture overview", "image": "path/to/llama_diagram.png"},
]
query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities.shape)
# torch.Size([1, 2])
```
And after merging, the `revision` argument can be dropped.
Note that none of the old behaviour is affected or changed; this PR only adds an additional way to run the model in a familiar and common format. I'll also be opening a PR for the 7B model with the same changes. Let me know if you have any questions! Sadly, `trust_remote_code=True` will be needed, as the thinker component can't currently be loaded standalone.
- Tom Aarsen
Hi Tom,
Thanks so much for the integration! Congrats on the multimodality support in Sentence Transformers; I've been following the latest release. This looks great to me, and I'm merging it now. We are releasing a pro version of the model soon; I'll let you know, and I look forward to integrating it as well.
Chenghao
Very exciting! Feel free to ping me when you do; perhaps I can help with spreading the word on social media or something along those lines 🤗
- Tom Aarsen