Integrate with Sentence Transformers v5.4

#3
by tomaarsen (HF Staff) - opened

Hello!

Pull Request overview

  • Integrate this model as a Sentence Transformers SentenceTransformer

Details

This PR adds the configuration files needed to load this model directly as a SentenceTransformer via Sentence Transformers. The model uses a feature-extraction Transformer with ["hidden_states", -1] output extraction (since the custom NVOmniEmbedModel returns hidden_states rather than last_hidden_state), followed by mean pooling and normalization to produce 2048-dimensional embeddings.
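The pooling and normalization steps above can be sketched in plain NumPy. This is an illustrative sketch with made-up shapes, not the Sentence Transformers implementation: it takes last-layer hidden states, mean-pools over non-padding tokens, and L2-normalizes so that dot products equal cosine similarities.

```python
import numpy as np

# Hypothetical batch: 2 sequences, 5 tokens each, 2048-dim hidden states
# (corresponding to hidden_states[-1] from the model).
hidden = np.random.randn(2, 5, 2048)
attention_mask = np.array([[1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1]], dtype=float)[..., None]

# Mean pooling: average only over real (non-padding) tokens.
pooled = (hidden * attention_mask).sum(axis=1) / attention_mask.sum(axis=1)

# L2 normalization: unit-length vectors, so dot product == cosine similarity.
emb = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
print(emb.shape)  # (2, 2048)
```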

The model supports text, image, audio, and video inputs, both individually and in combination. A dedicated Sentence Transformers chat template (additional_chat_templates/sentence_transformers.jinja) merges the query: / passage: prompt prefixes directly into the user message content, matching the format expected by the model.
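The flattening that the template performs can be illustrated in Python. This is a simplified sketch of the idea (not the actual Jinja template), using plain-string content for brevity: a leading system prompt such as "query: " is folded into the following user message rather than kept as a separate turn.

```python
def flatten_messages(messages):
    """Merge any pending system prompt into the next user message's content."""
    merged = []
    pending_prefix = ""
    for msg in messages:
        if msg["role"] == "system":
            pending_prefix += msg["content"]
        else:
            merged.append({"role": msg["role"],
                           "content": pending_prefix + msg["content"]})
            pending_prefix = ""
    return merged

messages = [
    {"role": "system", "content": "query: "},
    {"role": "user", "content": "Drawing of a cat"},
]
print(flatten_messages(messages))
# [{'role': 'user', 'content': 'query: Drawing of a cat'}]
```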

Added files:

  • modules.json: pipeline with Transformer, Pooling(mean) & Normalize modules
  • sentence_bert_config.json: feature-extraction task with ["hidden_states", -1] output extraction, structured message format, unpad_inputs: false (required for multimodal inputs), and a custom chat template reference
  • config_sentence_transformers.json: SentenceTransformer model type with query: / passage: prompts and cosine similarity
  • 1_Pooling/config.json: mean pooling with 2048 embedding dimension
  • additional_chat_templates/sentence_transformers.jinja: simplified chat template that flattens system message prompts into the user message content
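For reference, a modules.json for a Transformer → Pooling → Normalize pipeline typically has the following shape (shown here as the standard Sentence Transformers layout; the exact paths in this PR may differ):

```json
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]
```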

Changed files:

  • README.md: added sentence-transformers tag and a "Using Sentence Transformers" section with a text + video + audio example
  • tokenizers.json: updated "max_length" to 900

Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nvidia/omni-embed-nemotron-3b", trust_remote_code=True, revision="refs/pr/3")

# Text queries
queries = ["Drawing of a cat", "Drawing of a guitar", "Man walking down the street"]

# Document: text + video + audio (a video of someone drawing a guitar)
documents = [
    {
        "text": "This is a passage to be embedded",
        "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4",
        "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4",
    },
]

query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.3532],
#         [0.5502],
#         [0.2215]])
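Given the query-by-document similarity matrix above, the best-matching query can be picked with a simple argmax. The scores below are copied from the example output; this is just a small post-processing sketch, not part of the PR.

```python
# queries x documents similarity scores, copied from the example output above.
similarities = [[0.3532], [0.5502], [0.2215]]
queries = ["Drawing of a cat", "Drawing of a guitar", "Man walking down the street"]

scores = [row[0] for row in similarities]       # scores against the one document
best = max(range(len(scores)), key=scores.__getitem__)
print(queries[best])  # Drawing of a guitar
```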

And after merging, the revision argument can be dropped.

Note that none of the existing behaviour is affected or changed; this PR only adds an additional way to run the model in a familiar, common format.

  • Tom Aarsen
tomaarsen changed pull request status to open
nvidia-oliver-holworthy changed pull request status to merged
