Integrate with Sentence Transformers v5.4
Hello!
Pull Request overview
- Integrate this model with Sentence Transformers v5.4
Details
This PR adds the configuration files needed to load this model directly as a SentenceTransformer. The model is a Qwen2.5-VL-3B-based multimodal embedding model that uses a feature-extraction Transformer with last-token pooling and normalization, producing 2048-dimensional embeddings compared via cosine similarity. It supports text, image, and composed image+text inputs for visualized information retrieval (querying and retrieving screenshots).
The model's custom code (modeling_bge_vl_screenshot.py) uses a specific prompt template where the last token is <|endoftext|> (the pad token, id 151643), not the standard eos token. The default Qwen2.5-VL chat template with add_generation_prompt=True ends with <|im_start|>assistant\n, which would pool from the wrong token. To handle this, I've added a custom chat_template.jinja that appends <|endoftext|> after the generation prompt, matching the trained format exactly.
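To illustrate why the final token matters here, below is a rough sketch of last-token pooling over a padded batch (illustrative only, using toy data; the actual pooling is handled by Sentence Transformers' Pooling module):

```python
import numpy as np

# Toy batch: 2 sequences, max length 5, hidden size 4.
# Decoder-style embedding models are typically left-padded, so the
# final position holds the token that gets pooled -- which is why the
# template must end on the trained <|endoftext|> token.
hidden_states = np.arange(2 * 5 * 4, dtype=np.float32).reshape(2, 5, 4)
attention_mask = np.array([[0, 0, 1, 1, 1],
                           [0, 1, 1, 1, 1]])

# Index of the last attended token in each sequence (works for both
# left and right padding).
last_idx = attention_mask.shape[1] - attention_mask[:, ::-1].argmax(axis=1) - 1
pooled = hidden_states[np.arange(2), last_idx]  # shape (2, 4)

# Normalize to unit length so dot products equal cosine similarities.
pooled = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
print(pooled.shape)
# (2, 4)
```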
The model uses different system prompts for queries vs candidates: queries use a task instruction (e.g. "Represent the given image with the given query.") and candidates use a description prompt ("Represent the given text-rich image, focusing on extracting and interpreting both its rich text content and visual features."). These are configured as ST prompts with default_prompt_name: "document" for candidates.
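For reference, the prompt section of `config_sentence_transformers.json` would look roughly like this (a sketch following the standard ST config keys; the `query` value shown is the example task instruction from above, not a fixed prompt):

```json
{
  "prompts": {
    "query": "Represent the given image with the given query.",
    "document": "Represent the given text-rich image, focusing on extracting and interpreting both its rich text content and visual features."
  },
  "default_prompt_name": "document",
  "similarity_fn_name": "cosine"
}
```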
I've also updated preprocessor_config.json to match the pixel limits used by the model's custom set_processor method (min_pixels=50176, max_pixels=1960000), which differ from the Qwen2.5-VL defaults. Without this, images are processed at the wrong resolution.
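As a quick sanity check on those limits (my arithmetic, not taken from the PR files): the pixel counts correspond to square image sides of 224 and 1400, versus the Qwen2.5-VL defaults of 56 and 3584:

```python
import math

# Custom limits from the model's set_processor vs. Qwen2.5-VL defaults.
limits = {
    "custom min_pixels": 50176,      # 224 * 224
    "custom max_pixels": 1960000,    # 1400 * 1400
    "default min_pixels": 3136,      # 56 * 56
    "default max_pixels": 12845056,  # 3584 * 3584
}
for name, pixels in limits.items():
    side = math.isqrt(pixels)
    assert side * side == pixels  # all four are perfect squares
    print(f"{name}: {pixels} px (~{side}x{side})")
```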
Note: the existing trust_remote_code usage (via modeling_bge_vl_screenshot.py) only works with transformers v4.51.x, as Qwen2_5_VLForConditionalGeneration was significantly restructured in v5.x (.visual moved into .model, get_rope_index now requires mm_token_type_ids, etc.). The Sentence Transformers integration does not use the custom modeling code and works with current transformers versions.
There is also a small (~0.04) cosine similarity shift between transformers v4.51.3 and v5.3+ baselines due to https://github.com/huggingface/transformers/pull/43972 ("Unify 3D position ids"), which refactored get_rope_index from token-ID scanning to mm_token_type_ids-based position computation. Rankings are fully preserved. This is inherent to the transformers version change (v5.2 -> v5.3) and not caused by the ST integration.
Added files:
- `modules.json`: pipeline: Transformer -> Pooling (lasttoken, dim 2048) -> Normalize
- `sentence_bert_config.json`: feature-extraction task, modality config for text/image/image+text/message with `last_hidden_state`, `processing_kwargs` with `add_generation_prompt: true`
- `config_sentence_transformers.json`: cosine similarity, prompts for query and document, `default_prompt_name: "document"`
- `1_Pooling/config.json`: lasttoken pooling with `include_prompt: true`
- `chat_template.jinja`: Qwen2.5-style chat template that appends `<|endoftext|>` in the generation prompt for correct last-token pooling
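For context, a `modules.json` describing that Transformer -> Pooling -> Normalize pipeline typically looks like the following (a sketch of the usual Sentence Transformers layout, not a verbatim copy of the added file):

```json
[
  { "idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer" },
  { "idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling" },
  { "idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize" }
]
```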
Modified files:
- `preprocessor_config.json`: updated `min_pixels` (3136 -> 50176) and `max_pixels` (12845056 -> 1960000) to match the model's intended image resolution limits
- `README.md`: added `sentence-transformers` library tag, `pipeline_tag: sentence-similarity`, and a Sentence Transformers usage section
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/BGE-VL-Screenshot")

# Queries: composed image + text inputs (prefix text with "Query:")
query_inputs = [
    {"text": "Query:After a 17% drop, what is Nvidia's closing stock price?", "image": "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/query_1.png"},
    {"text": "Query:I would like to see a detailed and intuitive performance comparison between the two models.", "image": "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/query_2.png"},
]
query_embeddings = model.encode_query(query_inputs)
print(query_embeddings.shape)
# (2, 2048)

# Candidates: screenshot images
candidate_inputs = [
    "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/positive_1.jpeg",
    "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/neg_1.jpeg",
    "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/positive_2.jpeg",
    "https://huggingface.co/BAAI/BGE-VL-Screenshot/resolve/main/assets/neg_2.jpeg",
]
candidate_embeddings = model.encode_document(candidate_inputs)
print(candidate_embeddings.shape)
# (4, 2048)

similarities = model.similarity(query_embeddings, candidate_embeddings)
print(similarities)
# tensor([[0.5725, 0.3449, 0.1913, 0.1497],
#         [0.1457, 0.0795, 0.4243, 0.4177]])
```
Note that none of the existing behaviour is affected or changed; this PR only adds an additional way to run this model in a familiar and common format.
- Tom Aarsen
Thank you for your excellent work!