# SeqResize-Qwen2.5-VL-3B
This model uses SeqResize (sequence resizing) to compress multi-vector video representations for ColBERT-style late-interaction retrieval. Weights are initialized from Qwen2.5-VL-3B-Instruct and fine-tuned on MSR-VTT for text-to-video retrieval with bidirectional attention.

SeqResize compresses roughly 1,300 video token vectors into a fixed budget of 32 vectors (32 / 1318 ≈ 2.4% of the original, i.e. 97.6% compression) by projecting along the sequence dimension.
## Method Overview
SeqResize is a simple compression baseline: an MLP (or a single linear layer) projects the sequence dimension from a fixed input length to a fixed output length. Variable-length token sequences are trimmed or padded to the input length, then projected down to the target number of vectors for ColBERT-style MaxSim retrieval. It is not the main method of our paper; we include it as a baseline.
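The projection above can be sketched in a few lines of PyTorch. This is an illustrative re-implementation, not the training code: the class name, the zero-padding choice, and the GELU activation in the bottleneck are our assumptions; only the sizes (1024 → 256 → 32) come from the model card.

```python
import torch
import torch.nn as nn


class SequenceResizer(nn.Module):
    """Illustrative SeqResize sketch: an MLP applied along the SEQUENCE axis,
    mapping a fixed input length down to a fixed output length. The embedding
    dimension is untouched."""

    def __init__(self, in_len=1024, out_len=32, hidden=256):
        super().__init__()
        # Linear layers act on the sequence axis: in_len -> hidden -> out_len.
        self.mlp = nn.Sequential(
            nn.Linear(in_len, hidden),
            nn.GELU(),  # assumed nonlinearity; the paper's choice may differ
            nn.Linear(hidden, out_len),
        )
        self.in_len = in_len

    def forward(self, x):
        # x: (batch, seq_len, dim), seq_len may vary.
        b, s, d = x.shape
        if s >= self.in_len:  # trim long sequences
            x = x[:, : self.in_len, :]
        else:                 # pad short sequences (zeros assumed here)
            x = torch.cat([x, x.new_zeros(b, self.in_len - s, d)], dim=1)
        # Swap seq and dim so the MLP projects the sequence axis, then swap back.
        return self.mlp(x.transpose(1, 2)).transpose(1, 2)  # (batch, out_len, dim)


resizer = SequenceResizer()
tokens = torch.randn(1, 1318, 2048)  # ~1300 video token vectors
out = resizer(tokens)
print(out.shape)  # torch.Size([1, 32, 2048])
```

Because the two linear layers only see the sequence axis, the same resizer works for any embedding dimension, and sequences shorter than 1024 are handled by the padding branch.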
## Results on MSR-VTT
| Method | Tokens | R@1 | R@10 | nDCG@10 |
|---|---|---|---|---|
| OmniEmbed-7B | 1 | 51.5 | 83.2 | 67.1 |
| Video-ColBERT | 26 | 51.5 | 85.5 | 67.7 |
| Baseline (Ours, uncompressed) | 1318 | 55.7 | 88.3 | 71.9 |
| SeqResize (This model) | 32 | 53.3 | 86.9 | 69.9 |
| MemTok | 32 | 54.2 | 86.4 | 69.9 |
| H-Pool | 32 | 54.1 | 87.3 | 70.4 |
| AGC | 32* | 56.9 | 87.0 | 71.5 |
## Model Details

| Property | Value |
|---|---|
| Initial weights | Qwen2.5-VL-3B-Instruct |
| Architecture | Qwen2.5-VL with bidirectional attention |
| Hidden dimension | 2048 |
| Budget | 32 vectors per document |
| Compression method | SeqResize (learned sequence projection) |
| Resizer input size | 1024 (fixed sequence length before projection) |
| Resizer output size | 32 (Budget) |
| Resizer hidden size | 256 (MLP bottleneck) |
| Scoring | ColBERT-style MaxSim (late interaction) |
| Normalization | L2-normalized embeddings |
| Query prefix | "Query: " |
| Passage prefix | "Passage: " |
| Precision | bfloat16 |
| Training video frames | 24 |
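The ColBERT-style MaxSim scoring listed above can be sketched as follows. This is a generic late-interaction reference, not the model's `compute_similarity` implementation; the function name and mask handling are our assumptions, and embeddings are assumed already L2-normalized (as the table states).

```python
import torch


def maxsim_score(q, d, q_mask, d_mask):
    """Late-interaction score: for each query vector, take its maximum
    similarity over all document vectors, then sum over query vectors.
    q: (Bq, Lq, D), d: (Bd, Ld, D), L2-normalized; masks: (B, L)."""
    sim = torch.einsum("qld,kmd->qklm", q, d)  # (Bq, Bd, Lq, Ld)
    # Exclude padded document vectors from the max.
    sim = sim.masked_fill(~d_mask[None, :, None, :].bool(), float("-inf"))
    sim = sim.max(dim=-1).values               # best doc vector per query vector
    # Zero out padded query vectors before summing.
    sim = sim.masked_fill(~q_mask[:, None, :].bool(), 0.0)
    return sim.sum(dim=-1)                     # (Bq, Bd) score matrix


q = torch.nn.functional.normalize(torch.randn(1, 8, 2048), dim=-1)
d = torch.nn.functional.normalize(torch.randn(1, 32, 2048), dim=-1)
scores = maxsim_score(q, d, torch.ones(1, 8), torch.ones(1, 32))
print(scores.shape)  # torch.Size([1, 1])
```

Scoring a query against itself yields the number of query vectors (each vector's best match is itself, at similarity 1), which is a quick sanity check for MaxSim implementations.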
## Usage
**Important:** When loading this model, you must set `resizer_input_size`, `resizer_output_size`, and `resizer_hidden_size` to match the trained checkpoint (1024, 32, and 256 for this release). The `extra_encoder_state.safetensors` file must also be placed in the model directory so that the sequence-resizer weights are loaded.
```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

from src.arguments import ModelArguments
from src.encoder.resize_encoder import SequenceResizerEncoder
from src.models.qwen2_5_vl_embed.qwen2_5_vl_embed import Qwen2_5ForEmbedding

MODEL_ID = "hltcoe/SeqResize_qwen2.5-vl_msrvtt"
VIDEO_PATH = "PLACEHOLDER"
RESIZER_INPUT_SIZE = 1024
RESIZER_OUTPUT_SIZE = 32
RESIZER_HIDDEN_SIZE = 256

# --- Setup ---
model_args = ModelArguments(
    model_name_or_path=MODEL_ID,
    pooling="resize",
    normalize=True,
    resizer_input_size=RESIZER_INPUT_SIZE,
    resizer_output_size=RESIZER_OUTPUT_SIZE,
    resizer_hidden_size=RESIZER_HIDDEN_SIZE,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = SequenceResizerEncoder.load(
    Qwen2_5ForEmbedding,
    model_args,
    attn_implementation=model_args.attn_implementation,
    dtype=torch.bfloat16,
)
model = model.to("cuda").eval()

# --- Encode a video document ---
passage_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Passage: "},
            {"type": "video", "video": VIDEO_PATH, "nframes": 24, "max_pixels": 84672, "min_pixels": 75264},
        ],
    }
]
text = processor.apply_chat_template(passage_messages, tokenize=False, add_generation_prompt=False)
image_inputs, video_inputs = process_vision_info(passage_messages)
passage_inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt",
).to("cuda")
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        doc_embeddings, doc_mask = model.encode(passage_inputs, is_query=False)
print(doc_embeddings.shape)
# doc_embeddings: (1, 32, 2048), i.e. 32 compressed vectors

# --- Encode a text query ---
query_messages = [{"role": "user", "content": [{"type": "text", "text": "Query: a person is cooking"}]}]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=False)
query_inputs = processor(text=[query_text], padding=True, return_tensors="pt").to("cuda")
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        query_embeddings, query_mask = model.encode(query_inputs, is_query=True)
print(query_embeddings.shape)

# --- ColBERT MaxSim scoring ---
score = model.compute_similarity(query_embeddings, doc_embeddings, query_mask, doc_mask)
print(f"Similarity score: {score.item():.4f}")
```
## Command-line usage
For running inference and evaluation from the command line, see the Quick Start section.
## Citation

```bibtex
@misc{qin2026multivectorindexcompressionmodality,
      title={Multi-Vector Index Compression in Any Modality},
      author={Hanxiang Qin and Alexander Martin and Rohan Jha and Chunsheng Zuo and Reno Kriz and Benjamin Van Durme},
      year={2026},
      eprint={2602.21202},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2602.21202},
}
```