Integrate with Sentence Transformers v5.4

#3
by tomaarsen (HF Staff) - opened

Hello!

Pull Request overview

  • Integrate this model as a Sentence Transformers SentenceTransformer

Details

This PR adds the configuration files needed to load this model directly as a SentenceTransformer via Sentence Transformers. The model uses a feature-extraction Transformer with ["hidden_states", -1] output extraction (since the custom NVOmniEmbedModel returns hidden_states rather than last_hidden_state), followed by mean pooling and normalization to produce 2048-dimensional embeddings.
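The pooling and normalization steps above can be sketched in plain NumPy. This is an illustrative sketch with made-up shapes, not the Sentence Transformers implementation: it takes last-layer hidden states, mean-pools over non-padding tokens, and L2-normalizes so that dot products equal cosine similarities.

```python
import numpy as np

# Hypothetical batch: 2 sequences, 5 tokens each, 2048-dim hidden states
# (corresponding to hidden_states[-1] from the model).
hidden = np.random.randn(2, 5, 2048)
attention_mask = np.array([[1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1]], dtype=float)[..., None]

# Mean pooling: average only over real (non-padding) tokens.
pooled = (hidden * attention_mask).sum(axis=1) / attention_mask.sum(axis=1)

# L2 normalization: unit-length vectors, so dot product == cosine similarity.
emb = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
print(emb.shape)  # (2, 2048)
```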

The model supports text, image, audio, and video inputs, both individually and in combination. A dedicated Sentence Transformers chat template (additional_chat_templates/sentence_transformers.jinja) merges the query: / passage: prompt prefixes directly into the user message content, matching the format expected by the model.
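The flattening that the template performs can be illustrated in Python. This is a simplified sketch of the idea (not the actual Jinja template), using plain-string content for brevity: a leading system prompt such as "query: " is folded into the following user message rather than kept as a separate turn.

```python
def flatten_messages(messages):
    """Merge any pending system prompt into the next user message's content."""
    merged = []
    pending_prefix = ""
    for msg in messages:
        if msg["role"] == "system":
            pending_prefix += msg["content"]
        else:
            merged.append({"role": msg["role"],
                           "content": pending_prefix + msg["content"]})
            pending_prefix = ""
    return merged

messages = [
    {"role": "system", "content": "query: "},
    {"role": "user", "content": "Drawing of a cat"},
]
print(flatten_messages(messages))
# [{'role': 'user', 'content': 'query: Drawing of a cat'}]
```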

Added files:

  • modules.json: pipeline with Transformer, Pooling(mean) & Normalize modules
  • sentence_bert_config.json: feature-extraction task with ["hidden_states", -1] output extraction, structured message format, unpad_inputs: false (required for multimodal inputs), and a custom chat template reference
  • config_sentence_transformers.json: SentenceTransformer model type with query: / passage: prompts and cosine similarity
  • 1_Pooling/config.json: mean pooling with 2048 embedding dimension
  • additional_chat_templates/sentence_transformers.jinja: simplified chat template that flattens system message prompts into the user message content
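For reference, a modules.json for a Transformer → Pooling → Normalize pipeline typically has the following shape (shown here as the standard Sentence Transformers layout; the exact paths in this PR may differ):

```json
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]
```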

Changed files:

  • README.md: added sentence-transformers tag and a "Using Sentence Transformers" section with a text + video + audio example
  • tokenizers.json: updated "max_length" to 900

Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nvidia/omni-embed-nemotron-3b", trust_remote_code=True, revision="refs/pr/3")

# Text queries
queries = ["Drawing of a cat", "Drawing of a guitar", "Man walking down the street"]

# Document: text + video + audio (a video of someone drawing a guitar)
documents = [
    {
        "text": "This is a passage to be embedded",
        "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4",
        "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4",
    },
]

query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.3532],
#         [0.5502],
#         [0.2215]])
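Given the query-by-document similarity matrix above, the best-matching query can be picked with a simple argmax. The scores below are copied from the example output; this is just a small post-processing sketch, not part of the PR.

```python
# queries x documents similarity scores, copied from the example output above.
similarities = [[0.3532], [0.5502], [0.2215]]
queries = ["Drawing of a cat", "Drawing of a guitar", "Man walking down the street"]

scores = [row[0] for row in similarities]       # scores against the one document
best = max(range(len(scores)), key=scores.__getitem__)
print(queries[best])  # Drawing of a guitar
```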

And after merging, the revision argument can be dropped.

Note that none of the existing behaviour is affected or changed; this PR only adds an additional way to run the model in a familiar, common format.

  • Tom Aarsen
tomaarsen changed pull request status to open
nvidia-oliver-holworthy changed pull request status to merged
