Integrate with Sentence Transformers v5.4
Hello!
Pull Request overview
- Integrate E5-V with Sentence Transformers v5.4
Details
The integration uses a Transformer (feature-extraction) -> Pooling (lasttoken) -> Normalize pipeline. A custom `chat_template.jinja` was added that automatically applies the correct instruction suffix depending on whether the input is text-only or contains an image, so users simply pass raw text strings or PIL images to `model.encode()`. The processor config was updated with the `patch_size` and `vision_feature_select_strategy` fields required by newer transformers versions (the model was originally published for transformers 4.41.2).
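For intuition, the Pooling (lasttoken) -> Normalize steps amount to selecting the hidden state of the last non-padded token in each sequence and L2-normalizing it. The following is a minimal sketch with dummy tensors, not the actual Sentence Transformers internals; the helper name and small dimensions are illustrative (the real embedding dimension is 4096):

```python
import torch

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Select the hidden state of the last non-padded token per sequence."""
    # Index of the last attended token in each sequence
    last_idx = attention_mask.sum(dim=1) - 1  # (batch,)
    batch_idx = torch.arange(hidden_states.size(0))
    return hidden_states[batch_idx, last_idx]  # (batch, hidden_dim)

# Dummy batch: 2 sequences, 5 tokens, hidden dim 8
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]])

pooled = last_token_pool(hidden, mask)
# The Normalize module L2-normalizes each embedding
embeddings = torch.nn.functional.normalize(pooled, p=2, dim=1)
print(embeddings.shape)        # (2, 8)
print(embeddings.norm(dim=1))  # each ~1.0
```

Because the embeddings are unit-normalized, cosine similarity reduces to a plain dot product, which is what `model.similarity()` computes below.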
Added files:
- `modules.json`: Defines the Transformer -> Pooling -> Normalize module pipeline
- `config_sentence_transformers.json`: ST model metadata (cosine similarity, no prompts)
- `sentence_bert_config.json`: Transformer config with feature-extraction task, modality config for text/image/message, and `add_generation_prompt` processing kwarg
- `1_Pooling/config.json`: Lasttoken pooling with embedding dimension 4096
- `chat_template.jinja`: Custom chat template that wraps text with "Summary above sentence in one word:" and images with "Summary above image in one word:", using the LLaMA 3 format
- `processor_config.json`: Adds `patch_size: 14` and `vision_feature_select_strategy: "full"` required by transformers v5
- `assets/dog.jpg`, `assets/cat.jpg`: Example images for the usage snippet; these are the same as the Wikipedia ones, but Wikipedia requires User-Agents now
- `test_baseline.py`: Baseline script using transformers directly
- `test_st.py`: Sentence Transformers integration test script
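The wrapping behavior of the chat template can be sketched in plain Python. The LLaMA 3 special tokens below are written from memory and the exact layout is an assumption; `chat_template.jinja` in this PR is the authoritative version, and this function only illustrates the branching on text vs. image inputs:

```python
def wrap_prompt(content: str, is_image: bool) -> str:
    """Approximate the instruction wrapping applied by the chat template.

    The <|start_header_id|>/<|eot_id|> tokens follow the LLaMA 3 chat format;
    the exact layout here is an assumption, see chat_template.jinja for the
    real template.
    """
    if is_image:
        suffix = "\nSummary above image in one word: "
    else:
        suffix = "\nSummary above sentence in one word: "
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        + content
        + suffix
        + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(wrap_prompt("A dog sitting in the grass.", is_image=False))
print(wrap_prompt("<image>", is_image=True))
```

This is what lets `model.encode()` accept raw strings and images directly: the template picks the correct instruction suffix per input, so no manual prompt formatting is needed.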
Modified files:
- `README.md`: Added the `sentence-transformers` library tag, the `sentence-similarity` pipeline tag, and a "Using Sentence Transformers" usage section. Updated image URLs to use valid Wikipedia thumbnail sizes.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("royokong/e5-v")

# Encode text inputs
texts = [
    "A dog sitting in the grass.",
    "A dog standing in the snow.",
    "A cat sitting in the grass.",
    "A cat standing in the snow.",
]
text_embeddings = model.encode(texts)
print(text_embeddings.shape)
# (4, 4096)

# Encode image inputs
images = [
    "https://huggingface.co/royokong/e5-v/resolve/main/assets/dog.jpg",
    "https://huggingface.co/royokong/e5-v/resolve/main/assets/cat.jpg",
]
image_embeddings = model.encode(images)
print(image_embeddings.shape)
# (2, 4096)

# Compute text-image similarities
similarities = model.similarity(text_embeddings, image_embeddings)
print(similarities)
# tensor([[0.7183, 0.3579],
#         [0.5806, 0.5522],
#         [0.4714, 0.6479],
#         [0.4150, 0.8081]])
```
This script won't work until the PR is merged, as the assets don't exist on the main branch yet. To test it locally, replace the image URLs with local paths to the downloaded images, and load the model with `SentenceTransformer("royokong/e5-v", revision="refs/pr/12")` to try it out before merging.
Note: the image similarity scores from the Sentence Transformers snippet differ slightly from the transformers snippet (e.g. 0.7183 vs 0.7275 for the first pair). This is because the transformers example scores were produced with transformers 4.41.2 (the version the model was originally published with), while Sentence Transformers uses a newer transformers version (v5.x) which has changes in the image processing pipeline. Running the transformers example with v5.x produces scores that match the Sentence Transformers output. Text-only embeddings are identical across both versions.
Note that none of the existing behaviour is affected or changed; this PR only adds an additional way to run the model in a familiar and common format.
- Tom Aarsen