Integrate with Sentence Transformers v5.4

#5
by tomaarsen HF Staff - opened

Hello!

Pull Request overview

  • Integrate this model with Sentence Transformers v5.4

Details

This PR adds the configuration files needed to load this model directly as a SentenceTransformer. The model is a LLaVA-NeXT (Mistral-7B)-based multimodal embedding model: an any-to-any Transformer with last-token pooling and normalization, producing 4096-dimensional embeddings. It supports text, image, and composed image+text inputs.

This model uses a specific input format with [INST]...[/INST] wrapping, <instruct> tokens for task instructions, and <query> tokens for query text. Rather than requiring users to manually construct these formatted strings, I've added a chat_template.jinja that handles the formatting automatically. Passing a prompt (task instruction) formats the input as a query; omitting it formats as a candidate. The original chat template in tokenizer_config.json was a generic Mistral Instruct template that didn't handle multimodal content or the query/candidate distinction, so it has been removed in favour of the .jinja file.
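To illustrate the query/candidate branching described above, here is a minimal Python sketch of the formatting logic. The exact token order and spacing live in the committed chat_template.jinja; this function is an assumption-laden mock-up, not a copy of the template:

```python
# Illustrative sketch of the chat template's branching: passing a prompt
# (task instruction) marks the input as a query; omitting it yields the
# plain candidate format. Token placement here is an assumption.
from typing import Optional


def format_input(text: str, has_image: bool, prompt: Optional[str] = None) -> str:
    """Mimic the query/candidate distinction handled by chat_template.jinja."""
    image_tok = "<image>" if has_image else ""
    if prompt is not None:
        # Query: task instruction behind <instruct>, query text behind <query>
        body = f"<instruct> {prompt} <query> {text} {image_tok}".strip()
    else:
        # Candidate: raw content, no <instruct>/<query> markers
        body = f"{text} {image_tok}".strip()
    return f"[INST] {body} [/INST]"


print(format_input("a red car", has_image=True, prompt="Find matching images: "))
print(format_input("a red car", has_image=False))
```
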

I've also added a processor_config.json with the required LlavaNext processor settings (patch_size, vision_feature_select_strategy, num_additional_image_tokens) so the processor is correctly configured out of the box.
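For reference, the file takes roughly this shape; the values shown are those typical for LLaVA-NeXT Mistral-7B checkpoints and should be confirmed against the committed processor_config.json:

```json
{
  "patch_size": 14,
  "vision_feature_select_strategy": "default",
  "num_additional_image_tokens": 1,
  "processor_class": "LlavaNextProcessor"
}
```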

Added files:

  • modules.json: pipeline: Transformer, Pooling (lasttoken) & Normalize
  • sentence_bert_config.json: any-to-any task, multimodal config with forward + hidden_states[-1] for text, image, image+text, and message modalities
  • config_sentence_transformers.json: cosine similarity
  • chat_template.jinja: query/candidate formatting via system message presence
  • processor_config.json: LlavaNext processor settings
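The modules.json pipeline listed above follows the standard Sentence Transformers module layout; a sketch of what it might contain (field values are assumptions, check the committed file):

```json
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]
```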

Modified files:

  • tokenizer_config.json: removed inline chat template (replaced by chat_template.jinja)
  • README.md: added Sentence Transformers usage section, library_name, tags

Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:

import torch
from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer("BAAI/BGE-VL-MLLM-S2", model_kwargs={"torch_dtype": torch.float16}, revision="refs/pr/5")

# Composed image retrieval: text + image query
query_img = Image.open("./assets/cir_query.png").resize((512, 512))
query_emb = model.encode(
    {"text": "Make the background dark, as if the camera has taken the photo at night", "image": query_img},
    prompt="Retrieve the target image that best meets the combined criteria by using both the provided image and the image retrieval instructions: ",
)

# Image-only candidates (no prompt = candidate format)
candidate_imgs = [
    Image.open("./assets/cir_candi_1.png").resize((512, 512)),
    Image.open("./assets/cir_candi_2.png").resize((512, 512)),
]
candidate_embs = model.encode(candidate_imgs, batch_size=1)
print(candidate_embs.shape)
# (2, 4096)

similarities = model.similarity(query_emb, candidate_embs)
print(similarities)
# tensor([[0.4158, 0.2188]], dtype=torch.float16)
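Because the embeddings are L2-normalized and config_sentence_transformers.json selects cosine similarity, the similarity call above reduces to a dot product. A minimal NumPy sketch of that computation, using toy 4-dimensional unit vectors in place of the real 4096-dimensional embeddings:

```python
import numpy as np

# Toy stand-ins for the real 4096-dim, L2-normalized embeddings
query_emb = np.array([[0.5, 0.5, 0.5, 0.5]])  # unit norm
candidate_embs = np.array([
    [0.5, 0.5, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
])

# Cosine similarity of unit vectors is just the dot product
sims = query_emb @ candidate_embs.T
print(sims)  # [[1.  0.5]]
```
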

And after merging, the revision argument can be dropped.

Note that none of the existing behaviour is affected or changed; this PR only adds an additional way to run the model in a familiar, common format.

  • Tom Aarsen
tomaarsen changed pull request status to open
JUNJIE99 changed pull request status to merged
