Integrate with Sentence Transformers v5.4

#7
by tomaarsen - opened

Hello!

Pull Request overview

  • Integrate this model using a Sentence Transformers SentenceTransformer

Details

This PR adds the configuration files needed to load this model directly as a SentenceTransformer via Sentence Transformers. The model uses a feature-extraction Transformer with a Normalize module, producing 512-dimensional normalized embeddings via CLIP's projection layers (get_text_features/get_image_features). The model supports text, image, and composed image+text inputs.

Because this model supports composed image+text retrieval (summing the projected text and image embeddings), I've included a small custom BGEVLCLIPTransformer module (bge_vl_clip_transformer.py) that subclasses Sentence Transformers' Transformer. For the ("image", "text") compound modality, it runs the text and image through their respective forward paths and sums the resulting embeddings. Text-only and image-only inputs are handled directly by the parent class. Note that this requires trust_remote_code=True when loading the model with Sentence Transformers.
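For illustration, the late-fusion step for composed inputs can be sketched as below. This is a minimal standalone sketch using NumPy, not the actual module code; the function names are hypothetical, and in the real pipeline the summed embeddings come from CLIP's get_text_features/get_image_features and are normalized by the Normalize module.

```python
import numpy as np

def fuse_composed(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    # Late fusion for ("image", "text") inputs: sum the two projected embeddings.
    return text_emb + image_emb

def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Mirrors the Normalize module at the end of the pipeline.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy stand-ins for 512-dimensional projected CLIP embeddings.
rng = np.random.default_rng(0)
text_emb = rng.standard_normal((2, 512))
image_emb = rng.standard_normal((2, 512))

fused = l2_normalize(fuse_composed(text_emb, image_emb))
print(fused.shape)  # (2, 512)
```

Summing before normalization means the fused query lives in the same embedding space as the text-only and image-only candidates, so cosine similarity works uniformly across all three input types.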

The custom module also overrides load to force trust_remote_code=False for the underlying AutoModel: the repo's custom modeling_MMRet_CLIP.py has a non-persistent position_ids buffer issue on transformers v5+, whereas the standard CLIPModel loads these weights without problems.

Added files:

  • modules.json: pipeline: BGEVLCLIPTransformer & Normalize
  • sentence_bert_config.json: feature-extraction task, multimodal config with get_text_features/get_image_features
  • config_sentence_transformers.json: cosine similarity
  • bge_vl_clip_transformer.py: custom Transformer subclass for composed image+text late fusion
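As a rough sketch of how the pipeline wires together, the modules.json likely resembles the standard Sentence Transformers two-module layout below; the exact paths and type strings are assumptions, not copied from the PR:

```json
[
  {"idx": 0, "name": "0", "path": "", "type": "bge_vl_clip_transformer.BGEVLCLIPTransformer"},
  {"idx": 1, "name": "1", "path": "1_Normalize", "type": "sentence_transformers.models.Normalize"}
]
```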

Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/BGE-VL-base", trust_remote_code=True, revision="refs/pr/7")

query_image = "https://huggingface.co/BAAI/BGE-VL-base/resolve/main/assets/cir_query.png"
candidate_1 = "https://huggingface.co/BAAI/BGE-VL-base/resolve/main/assets/cir_candi_1.png"
candidate_2 = "https://huggingface.co/BAAI/BGE-VL-base/resolve/main/assets/cir_candi_2.png"

# Encode text
text_embeddings = model.encode(["A dog sitting on a bench", "A cat sleeping on a couch"])
print(text_embeddings.shape)
# (2, 512)

# Encode images
image_embeddings = model.encode([query_image, candidate_1])
print(image_embeddings.shape)
# (2, 512)

# Composed image retrieval: encode image+text query, compare with image candidates
query_embeddings = model.encode([{
    "image": query_image,
    "text": "Make the background dark, as if the camera has taken the photo at night",
}])
candidate_embeddings = model.encode([candidate_1, candidate_2])
scores = model.similarity(query_embeddings, candidate_embeddings)
print(scores)
# tensor([[0.2645, 0.1251]])

And after merging, the revision argument can be dropped.

Note that none of the existing behaviour is affected or changed; this only adds an additional way to run the model in a familiar and common format.

  • Tom Aarsen
tomaarsen changed pull request status to open
JUNJIE99 changed pull request status to merged
