Integrate with Sentence Transformers v5.4

#5
by tomaarsen HF Staff - opened

Hello!

Pull Request overview

  • Integrate this model with Sentence Transformers v5.4

Details

This PR adds the configuration files needed to load this model directly as a SentenceTransformer. The model is a LLaVA-NeXT (Mistral-7B)-based multimodal embedding model: an any-to-any Transformer with last-token pooling and normalization, producing 4096-dimensional embeddings. It supports text, image, and composed image+text inputs.

This model uses a specific input format with [INST]...[/INST] wrapping, <instruct> tokens for task instructions, and <query> tokens for query text. Rather than requiring users to manually construct these formatted strings, I've added a chat_template.jinja that handles the formatting automatically. Passing a prompt (task instruction) formats the input as a query; omitting it formats as a candidate. The original chat template in tokenizer_config.json was a generic Mistral Instruct template that didn't handle multimodal content or the query/candidate distinction, so it has been removed in favour of the .jinja file.
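To illustrate the query/candidate branching described above, here is a minimal Python sketch of the formatting logic. The exact token order and spacing live in the committed chat_template.jinja; this function is an assumption-laden mock-up, not a copy of the template:

```python
# Illustrative sketch of the chat template's branching: passing a prompt
# (task instruction) marks the input as a query; omitting it yields the
# plain candidate format. Token placement here is an assumption.
from typing import Optional


def format_input(text: str, has_image: bool, prompt: Optional[str] = None) -> str:
    """Mimic the query/candidate distinction handled by chat_template.jinja."""
    image_tok = "<image>" if has_image else ""
    if prompt is not None:
        # Query: task instruction behind <instruct>, query text behind <query>
        body = f"<instruct> {prompt} <query> {text} {image_tok}".strip()
    else:
        # Candidate: raw content, no <instruct>/<query> markers
        body = f"{text} {image_tok}".strip()
    return f"[INST] {body} [/INST]"


print(format_input("a red car", has_image=True, prompt="Find matching images: "))
print(format_input("a red car", has_image=False))
```
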

I've also added a processor_config.json with the required LlavaNext processor settings (patch_size, vision_feature_select_strategy, num_additional_image_tokens) so the processor is correctly configured out of the box.
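For reference, the file takes roughly this shape; the values shown are those typical for LLaVA-NeXT Mistral-7B checkpoints and should be confirmed against the committed processor_config.json:

```json
{
  "patch_size": 14,
  "vision_feature_select_strategy": "default",
  "num_additional_image_tokens": 1,
  "processor_class": "LlavaNextProcessor"
}
```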

Added files:

  • modules.json: pipeline: Transformer, Pooling (lasttoken) & Normalize
  • sentence_bert_config.json: any-to-any task, multimodal config with forward + hidden_states[-1] for text, image, image+text, and message modalities
  • config_sentence_transformers.json: cosine similarity
  • chat_template.jinja: query/candidate formatting via system message presence
  • processor_config.json: LlavaNext processor settings
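The modules.json pipeline listed above follows the standard Sentence Transformers module layout; a sketch of what it might contain (field values are assumptions, check the committed file):

```json
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]
```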

Modified files:

  • tokenizer_config.json: removed inline chat template (replaced by chat_template.jinja)
  • README.md: added Sentence Transformers usage section, library_name, tags

Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:

import torch
from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer("BAAI/BGE-VL-MLLM-S2", model_kwargs={"torch_dtype": torch.float16}, revision="refs/pr/5")

# Composed image retrieval: text + image query
query_img = Image.open("./assets/cir_query.png").resize((512, 512))
query_emb = model.encode(
    {"text": "Make the background dark, as if the camera has taken the photo at night", "image": query_img},
    prompt="Retrieve the target image that best meets the combined criteria by using both the provided image and the image retrieval instructions: ",
)

# Image-only candidates (no prompt = candidate format)
candidate_imgs = [
    Image.open("./assets/cir_candi_1.png").resize((512, 512)),
    Image.open("./assets/cir_candi_2.png").resize((512, 512)),
]
candidate_embs = model.encode(candidate_imgs, batch_size=1)
print(candidate_embs.shape)
# (2, 4096)

similarities = model.similarity(query_emb, candidate_embs)
print(similarities)
# tensor([[0.4158, 0.2188]], dtype=torch.float16)
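Because the embeddings are L2-normalized and config_sentence_transformers.json selects cosine similarity, the similarity call above reduces to a dot product. A minimal NumPy sketch of that computation, using toy 4-dimensional unit vectors in place of the real 4096-dimensional embeddings:

```python
import numpy as np

# Toy stand-ins for the real 4096-dim, L2-normalized embeddings
query_emb = np.array([[0.5, 0.5, 0.5, 0.5]])  # unit norm
candidate_embs = np.array([
    [0.5, 0.5, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
])

# Cosine similarity of unit vectors is just the dot product
sims = query_emb @ candidate_embs.T
print(sims)  # [[1.  0.5]]
```
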

And after merging, the revision argument can be dropped.

Note that none of the existing behaviour is affected or changed; this PR only adds an additional way to run the model in a familiar, common format.

  • Tom Aarsen
tomaarsen changed pull request status to open
JUNJIE99 changed pull request status to merged
