Integrate with Sentence Transformers v5.4
Hello!
Pull Request overview
- Integrate this model as a Sentence Transformers `SentenceTransformer`
Details
This PR adds the configuration files needed to load this model directly as a `SentenceTransformer` via Sentence Transformers. The model uses a feature-extraction `Transformer` with `["hidden_states", -1]` output extraction (since the custom `NVOmniEmbedModel` returns `hidden_states` rather than `last_hidden_state`), followed by mean pooling and normalization to produce 2048-dimensional embeddings.
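The pooling step described above can be sketched as follows. This is an illustrative snippet, not the model's actual code: the function name, shapes, and values here are made up, but the operations (mean pooling over non-padded tokens of the last hidden state, then L2 normalization) match the pipeline.

```python
import torch

def mean_pool_and_normalize(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    # Sum token embeddings where the attention mask is 1, then divide by the
    # number of real (non-padding) tokens to get a mean.
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    embeddings = summed / counts
    # L2-normalize so that dot product equals cosine similarity.
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)

# Illustrative shapes: batch of 2 sequences, 5 tokens each, 2048-dim hidden states.
hidden = torch.randn(2, 5, 2048)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
emb = mean_pool_and_normalize(hidden, mask)
print(emb.shape)  # torch.Size([2, 2048])
```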
The model supports text, image, audio, and video inputs, both individually and in combination. A dedicated Sentence Transformers chat template (`additional_chat_templates/sentence_transformers.jinja`) merges the `query: ` / `passage: ` prompt prefixes directly into the user message content, matching the format expected by the model.
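To illustrate the template's effect, here is a hypothetical Python sketch (not the actual Jinja code): rather than emitting the prompt prefix as a separate system message, it is folded into the user message text itself.

```python
# Hypothetical sketch of what the chat template does with a prompt prefix.
# The function name and guard logic are illustrative assumptions.
def merge_prompt(prompt_prefix: str, user_text: str) -> str:
    # Avoid double-prefixing if the text already carries the prefix.
    if user_text.startswith(prompt_prefix):
        return user_text
    return prompt_prefix + user_text

print(merge_prompt("query: ", "Drawing of a cat"))
# query: Drawing of a cat
```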
Added files:
- `modules.json`: pipeline with `Transformer`, `Pooling` (mean) & `Normalize` modules
- `sentence_bert_config.json`: `feature-extraction` task with `["hidden_states", -1]` output extraction, structured message format, `unpad_inputs: false` (required for multimodal inputs), and a custom chat template reference
- `config_sentence_transformers.json`: `SentenceTransformer` model type with `query: `/`passage: ` prompts and cosine similarity
- `1_Pooling/config.json`: mean pooling with 2048 embedding dimension
- `additional_chat_templates/sentence_transformers.jinja`: simplified chat template that flattens system message prompts into the user message content
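For reference, a `modules.json` pipeline with these three modules typically looks like the following. This is a hedged sketch of the standard Sentence Transformers format; the exact `path` values in this repository may differ.

```json
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]
```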
Changed files:
- `README.md`: added a `sentence-transformers` tag and a "Using Sentence Transformers" section with a text + video + audio example
- `tokenizers.json`: updated `"max_length": 900`
Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nvidia/omni-embed-nemotron-3b", trust_remote_code=True, revision="refs/pr/3")

# Text queries
queries = ["Drawing of a cat", "Drawing of a guitar", "Man walking down the street"]

# Document: text + video + audio (a video of someone drawing a guitar)
documents = [
    {
        "text": "This is a passage to be embedded",
        "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4",
        "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4",
    },
]

query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.3532],
#         [0.5502],
#         [0.2215]])
```
And after merging, the `revision` argument can be dropped.
Note that none of the existing behaviour is affected or changed; this PR only adds an additional way to run the model in a familiar, common format.
- Tom Aarsen