Integrate with Sentence Transformers v5.4

#3
opened by tomaarsen

Hello!

Pull Request overview

  • Integrate this model with Sentence Transformers v5.4

Details

This PR adds the configuration files needed to load this model directly as a SentenceTransformer. The model is a Qwen2.5-VL-3B based multimodal embedding model that uses a feature-extraction Transformer with last-token pooling and normalization, producing 2048-dimensional cosine-similarity embeddings. It supports text, image, and composed image+text inputs for visualized information retrieval (querying and retrieving screenshots).

The model's custom code (modeling_bge_vl_screenshot.py) uses a specific prompt template where the last token is <|endoftext|> (the pad token, id 151643), not the standard eos token. The default Qwen2.5-VL chat template with add_generation_prompt=True ends with <|im_start|>assistant\n, which would pool from the wrong token. To handle this, I've added a custom chat_template.jinja that appends <|endoftext|> after the generation prompt, matching the trained format exactly.
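As an illustrative sketch (not the committed file contents), the tail of such a custom chat_template.jinja would extend the standard Qwen2.5 generation prompt by one token:

```jinja
{#- Illustrative fragment: after the standard generation prompt, -#}
{#- append <|endoftext|> so last-token pooling reads the trained position -#}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {{- '<|endoftext|>' }}
{%- endif %}
```

With the default template, the last token would be the `\n` after `assistant`, so pooling would read an embedding the model was never trained to produce there.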

The model uses different system prompts for queries vs candidates: queries use a task instruction (e.g. "Represent the given image with the given query.") and candidates use a description prompt ("Represent the given text-rich image, focusing on extracting and interpreting both its rich text content and visual features."). These are configured as ST prompts with default_prompt_name: "document" for candidates.
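As a sketch, the relevant portion of config_sentence_transformers.json might look like the following (field names follow the standard Sentence Transformers config format; the exact committed file may differ):

```json
{
  "prompts": {
    "query": "Represent the given image with the given query.",
    "document": "Represent the given text-rich image, focusing on extracting and interpreting both its rich text content and visual features."
  },
  "default_prompt_name": "document"
}
```

With `default_prompt_name: "document"`, a plain `model.encode(...)` call uses the candidate prompt unless the query prompt is explicitly selected (e.g. via `encode_query`).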

I've also updated preprocessor_config.json to match the pixel limits used by the model's custom set_processor method (min_pixels=50176, max_pixels=1960000), which differ from the Qwen2.5-VL defaults. Without this, images are processed at the wrong resolution.
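These limits bound the total pixel count of each image after the processor's resizing step. A quick back-of-the-envelope check of what they correspond to for a square image:

```python
import math

# Pixel-count bounds from the model's set_processor method
min_pixels = 50_176     # matches a 224x224 square image
max_pixels = 1_960_000  # matches a 1400x1400 square image

print(math.isqrt(min_pixels))  # 224
print(math.isqrt(max_pixels))  # 1400
```

The Qwen2.5-VL default of max_pixels=12845056 would allow images roughly 6.5x larger (by pixel count) than the model was trained on.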

Note: the existing trust_remote_code usage (via modeling_bge_vl_screenshot.py) only works with transformers v4.51.x, as Qwen2_5_VLForConditionalGeneration was significantly restructured in v5.x (.visual moved into .model, get_rope_index now requires mm_token_type_ids, etc.). The Sentence Transformers integration does not use the custom modeling code and works with current transformers versions.

There is also a small (~0.04) cosine similarity shift between transformers v4.51.3 and v5.3+ baselines due to https://github.com/huggingface/transformers/pull/43972 ("Unify 3D position ids"), which refactored get_rope_index from token-ID scanning to mm_token_type_ids-based position computation. Rankings are fully preserved. This is inherent to the transformers version change (v5.2 -> v5.3) and not caused by the ST integration.
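To illustrate why a small, roughly uniform score shift is harmless for retrieval: ranking only depends on the ordering of scores, not their absolute values. A minimal sketch with made-up numbers (not actual model outputs):

```python
# Hypothetical scores for one query under the two transformers baselines
old = [0.5725, 0.3449, 0.1913, 0.1497]   # e.g. v4.51.3
new = [s - 0.04 for s in old]            # illustrative ~0.04 shift under v5.3+

# Candidate indices sorted by descending score
def rank(scores):
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)

print(rank(old) == rank(new))  # True: the ordering is unchanged
```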

Added files:

  • modules.json: pipeline: Transformer -> Pooling (lasttoken, dim 2048) -> Normalize
  • sentence_bert_config.json: feature-extraction task, modality config for text/image/image+text/message with last_hidden_state, processing_kwargs with add_generation_prompt: true
  • config_sentence_transformers.json: cosine similarity, prompts for query and document, default_prompt_name: "document"
  • 1_Pooling/config.json: lasttoken pooling with include_prompt: true
  • chat_template.jinja: Qwen2.5-style chat template that appends <|endoftext|> in the generation prompt for correct last-token pooling
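For reference, a three-module modules.json for this pipeline would look roughly like the following (illustrative sketch in the standard Sentence Transformers format; the committed file may differ in details):

```json
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]
```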

Modified files:

  • preprocessor_config.json: updated min_pixels (3136 -> 50176) and max_pixels (12845056 -> 1960000) to match the model's intended image resolution limits
  • README.md: added sentence-transformers library tag, pipeline_tag: sentence-similarity, and Sentence Transformers usage section
Usage example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/BGE-VL-Screenshot")

# Queries: composed image + text inputs (prefix text with "Query:")
query_inputs = [
    {"text": "Query:After a 17% drop, what is Nvidia's closing stock price?", "image": "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/query_1.png"},
    {"text": "Query:I would like to see a detailed and intuitive performance comparison between the two models.", "image": "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/query_2.png"},
]
query_embeddings = model.encode_query(query_inputs)
print(query_embeddings.shape)
# (2, 2048)

# Candidates: screenshot images
candidate_inputs = [
    "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/positive_1.jpeg",
    "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/neg_1.jpeg",
    "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/positive_2.jpeg",
    "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/neg_2.jpeg",
]
candidate_embeddings = model.encode_document(candidate_inputs)
print(candidate_embeddings.shape)
# (4, 2048)

similarities = model.similarity(query_embeddings, candidate_embeddings)
print(similarities)
# tensor([[0.5725, 0.3449, 0.1913, 0.1497],
#         [0.1457, 0.0795, 0.4243, 0.4177]])

Note that none of the existing behaviour is affected or changed; this only adds an additional way to run the model in a familiar and common format.

  • Tom Aarsen
tomaarsen changed pull request status to open

Friendly ping @JUNJIE99 !

  • Tom Aarsen
Beijing Academy of Artificial Intelligence org

Thank you for your excellent work!

JUNJIE99 changed pull request status to merged
