Nomic Embed Vision: Expanding the Latent Space
Paper: arXiv 2406.18587
This is a multimodal Sentence Transformers model that combines nomic-embed-text-v1.5 and nomic-embed-vision-v1.5 into a single model using a Router. It maps text and images to a shared 768-dimensional embedding space, enabling cross-modal retrieval.
Text inputs are automatically routed to the text encoder, and image inputs (PIL images or URLs) are routed to the vision encoder.
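The routing rule above can be illustrated with a small stand-alone sketch. This is not the actual Router module from Sentence Transformers, just a simplified mirror of the decision it makes: PIL images and URL strings go to the vision branch, everything else to the text branch.

```python
from PIL import Image

def route(inp):
    """Illustrative only: choose an encoder branch the way the Router does.

    Simplified assumption: URL-like strings are fetched and treated as
    images; all other strings are treated as text.
    """
    if isinstance(inp, Image.Image):
        return "image"
    if isinstance(inp, str) and inp.startswith(("http://", "https://")):
        return "image"
    return "text"

print(route("What do cats look like?"))
# text
print(route("http://images.cocodataset.org/val2017/000000039769.jpg"))
# image
```

In the real model the dispatch happens inside the Router module, so you never call anything like this yourself; `model.encode(...)` handles it transparently.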
Install Sentence Transformers:

```bash
pip install sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-multimodal-v1.5", trust_remote_code=True)

# Text encoding - encode_query applies the "search_query: " prefix
text_embeddings = model.encode_query([
    "What are cute animals to cuddle with?",
    "What do cats look like?",
])
print(text_embeddings.shape)
# (2, 768)

# Image encoding - also accepts URLs and PIL images
img_embeddings = model.encode("http://images.cocodataset.org/val2017/000000039769.jpg")
print(img_embeddings.shape)
# (768,)
```
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-multimodal-v1.5", trust_remote_code=True)

queries = ["What are cute animals to cuddle with?", "What do cats look like?"]
img_embeddings = model.encode("http://images.cocodataset.org/val2017/000000039769.jpg")
query_embeddings = model.encode_query(queries)

print(model.similarity(img_embeddings, query_embeddings))
# tensor([[0.0751, 0.0684]])
```
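Because both branches end in a Normalize module, the embeddings are unit-length and the default `model.similarity` (cosine similarity) reduces to a dot product. A toy sketch of that computation, using 2-dimensional stand-ins for the real 768-dimensional embeddings:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: normalize to unit length, then take the dot product.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

img = np.array([0.6, 0.8])    # toy "image" embedding
query = np.array([1.0, 0.0])  # toy "query" embedding
print(round(cosine(img, query), 4))
# 0.6
```

Scores closer to 1 indicate closer query/image matches; in the example above, the cat image scores slightly higher against the first query.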
```
SentenceTransformer(
  (0): Router(
    default_route='text'
    (sub_modules): ModuleDict(
      (text): Sequential(
        (0): Transformer({..., 'architecture': 'NomicBertModel'})
        (1): Pooling({'embedding_dimension': 768, 'pooling_mode': 'mean'})
        (2): LayerNorm({'dimension': 768})
        (3): Normalize({})
      )
      (image): Sequential(
        (0): Transformer({..., 'architecture': 'NomicVisionModel'})
        (1): Pooling({'embedding_dimension': 768, 'pooling_mode': 'cls'})
        (2): Normalize({})
      )
    )
  )
)
```
Use encode_query() / encode_document() for search, or prompt_name="classification" / prompt_name="clustering" for other tasks; see the nomic-embed-text-v1.5 model card for details. The text branch includes a LayerNorm layer to align the text embeddings with the vision embedding space, matching the original F.layer_norm step from the nomic-embed-vision-v1.5 usage instructions.
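The alignment step amounts to layer normalization over the embedding dimension. A minimal NumPy sketch of what torch's F.layer_norm computes (here without learned affine parameters, which is an assumption of this illustration; the Sentence Transformers LayerNorm module wraps the same normalization):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last dimension: zero mean, unit variance per row.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0]])  # toy 3-d stand-in for a 768-d embedding
print(layer_norm(x).round(4))
```

After this step each text embedding has zero mean and unit variance across its 768 dimensions, which is what puts it on the same footing as the vision embeddings before the final Normalize module.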