Integrate with Sentence Transformers v5.4
Hello!
Pull Request overview
- Integrate this model with Sentence Transformers v5.4
Details
This PR adds the configuration files needed to load this model directly as a SentenceTransformer. The model is a LLaVA-NeXT (Mistral-7B) based multimodal embedding model that uses an any-to-any Transformer with last-token pooling and normalization, producing 4096-dimensional embeddings. It supports text, image, and composed image+text inputs.
This model uses a specific input format with [INST]...[/INST] wrapping, <instruct> tokens for task instructions, and <query> tokens for query text. Rather than requiring users to manually construct these formatted strings, I've added a chat_template.jinja that handles the formatting automatically. Passing a prompt (task instruction) formats the input as a query; omitting it formats as a candidate. The original chat template in tokenizer_config.json was a generic Mistral Instruct template that didn't handle multimodal content or the query/candidate distinction, so it has been removed in favour of the .jinja file.
I've also added a processor_config.json with the required LlavaNext processor settings (patch_size, vision_feature_select_strategy, num_additional_image_tokens) so the processor is correctly configured out of the box.
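For reference, a `processor_config.json` along these lines would look roughly as follows. The concrete values are assumptions based on typical LLaVA-NeXT setups (CLIP ViT-L/14 vision tower), not copied from the PR; the actual file in this PR is authoritative.

```json
{
  "patch_size": 14,
  "vision_feature_select_strategy": "default",
  "num_additional_image_tokens": 1,
  "processor_class": "LlavaNextProcessor"
}
```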
Added files:
- `modules.json`: pipeline: Transformer, Pooling (lasttoken) & Normalize
- `sentence_bert_config.json`: any-to-any task, multimodal config with `forward` + `hidden_states[-1]` for text, image, image+text, and message modalities
- `config_sentence_transformers.json`: cosine similarity
- `chat_template.jinja`: query/candidate formatting via system message presence
- `processor_config.json`: LlavaNext processor settings
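As a sketch, a Sentence Transformers `modules.json` for a Transformer → Pooling (lasttoken) → Normalize pipeline typically looks like the following. The `path` and `type` values here follow the standard library layout and are assumptions; the multimodal Transformer module in this PR may use a different type string.

```json
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]
```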
Modified files:
- `tokenizer_config.json`: removed inline chat template (replaced by `chat_template.jinja`)
- `README.md`: added Sentence Transformers usage section, `library_name`, tags
Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:
```python
import torch
from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer(
    "BAAI/BGE-VL-MLLM-S2",
    model_kwargs={"torch_dtype": torch.float16},
    revision="refs/pr/5",
)

# Composed image retrieval: text + image query
query_img = Image.open("./assets/cir_query.png").resize((512, 512))
query_emb = model.encode(
    {"text": "Make the background dark, as if the camera has taken the photo at night", "image": query_img},
    prompt="Retrieve the target image that best meets the combined criteria by using both the provided image and the image retrieval instructions: ",
)

# Image-only candidates (no prompt = candidate format)
candidate_imgs = [
    Image.open("./assets/cir_candi_1.png").resize((512, 512)),
    Image.open("./assets/cir_candi_2.png").resize((512, 512)),
]
candidate_embs = model.encode(candidate_imgs, batch_size=1)
print(candidate_embs.shape)
# (2, 4096)

similarities = model.similarity(query_emb, candidate_embs)
print(similarities)
# tensor([[0.4158, 0.2188]], dtype=torch.float16)
```
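Because the pipeline ends in a Normalize module, the cosine similarity computed by `model.similarity` reduces to a plain dot product on the unit-length embeddings. A small self-contained sketch of that property, using random vectors in place of real model outputs:

```python
import torch

# Simulate unit-normalized embeddings, as produced by the Normalize module.
torch.manual_seed(0)
emb = torch.randn(3, 4096)
emb = emb / emb.norm(dim=1, keepdim=True)

# On normalized vectors, cosine similarity is just the dot product.
cosine = emb @ emb.T
print(torch.allclose(cosine.diagonal(), torch.ones(3), atol=1e-5))  # True: self-similarity is 1
```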
And after merging, the revision argument can be dropped.
Note that none of the existing behaviour is affected or changed; this PR only adds an additional way to run the model in a familiar and common format.
- Tom Aarsen