nomic-embed-multimodal-v1.5

This is a multimodal Sentence Transformers model that combines nomic-embed-text-v1.5 and nomic-embed-vision-v1.5 into a single model using a Router. It maps text and images to a shared 768-dimensional embedding space, enabling cross-modal retrieval.

Text inputs are automatically routed to the text encoder, and image inputs (PIL images or URLs) are routed to the vision encoder.

Usage

Install Sentence Transformers:

pip install "sentence-transformers[image]"

Encoding text and images

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-multimodal-v1.5", trust_remote_code=True)

# Text encoding - encode_query applies the "search_query: " prefix
text_embeddings = model.encode_query([
    "What are cute animals to cuddle with?",
    "What do cats look like?",
])
print(text_embeddings.shape)
# (2, 768)

# Image encoding - also supports URLs and PIL images
img_embeddings = model.encode("http://images.cocodataset.org/val2017/000000039769.jpg")
print(img_embeddings.shape)
# (768,)

Cross-modal retrieval

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-multimodal-v1.5", trust_remote_code=True)

queries = ["What are cute animals to cuddle with?", "What do cats look like?"]

img_embeddings = model.encode("http://images.cocodataset.org/val2017/000000039769.jpg")
query_embeddings = model.encode_query(queries)

print(model.similarity(img_embeddings, query_embeddings))
# tensor([[0.0751, 0.0684]])

Model Details

  • Text encoder: nomic-embed-text-v1.5 (768-dim, mean pooling + LayerNorm + L2 normalize)
  • Image encoder: nomic-embed-vision-v1.5 (768-dim, CLS pooling + L2 normalize)
  • Similarity Function: Cosine Similarity
  • Supported Modalities: Text, Image
  • Maximum Sequence Length: 8192 tokens

Full Model Architecture

SentenceTransformer(
  (0): Router(
    default_route='text'
    (sub_modules): ModuleDict(
      (text): Sequential(
        (0): Transformer({..., 'architecture': 'NomicBertModel'})
        (1): Pooling({'embedding_dimension': 768, 'pooling_mode': 'mean'})
        (2): LayerNorm({'dimension': 768})
        (3): Normalize({})
      )
      (image): Sequential(
        (0): Transformer({..., 'architecture': 'NomicVisionModel'})
        (1): Pooling({'embedding_dimension': 768, 'pooling_mode': 'cls'})
        (2): Normalize({})
      )
    )
  )
)

Notes

  • The text encoder requires prefixes. Use encode_query() / encode_document() for search, or prompt_name="classification" / prompt_name="clustering" for other tasks. See the nomic-embed-text-v1.5 model card for details.
  • The text pipeline includes a LayerNorm layer to align the text embeddings with the vision embedding space, matching the original F.layer_norm step from the nomic-embed-vision-v1.5 usage instructions.
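For reference, the text pipeline's post-processing can be reproduced with plain PyTorch; a sketch of the mean pool → LayerNorm → L2 normalize sequence, using a random tensor as a stand-in for the encoder's token outputs (a real implementation would mask padding tokens before averaging):

```python
import torch
import torch.nn.functional as F

# Stand-in for the text encoder's token outputs: (batch, seq_len, 768)
token_embeddings = torch.randn(2, 10, 768)

# Mean pooling over the sequence dimension (simplified: no attention mask),
# then LayerNorm to align with the vision space, then L2 normalization -
# the same steps as the Pooling/LayerNorm/Normalize modules above
pooled = token_embeddings.mean(dim=1)
aligned = F.layer_norm(pooled, normalized_shape=(768,))
embeddings = F.normalize(aligned, p=2, dim=1)

print(embeddings.shape)
# torch.Size([2, 768])
```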
