SeqResize-Qwen2.5-VL-3B

This model uses SeqResize (sequence resizing) to compress multi-vector video representations for ColBERT-style late interaction retrieval. Model weights are initialized from Qwen2.5-VL-3B-Instruct and finetuned on MSR-VTT for text-to-video retrieval with bidirectional attention.

SeqResize compresses ~1300 video token vectors into a fixed budget of 32 vectors (97.6% compression) through projection along the sequence dimension.


Method Overview

SeqResize is a simple compression baseline: an MLP (or linear) layer projects the sequence dimension from a fixed input length to a fixed output length. Variable-length token sequences are trimmed or padded to the input length, then projected down to the target number of vectors for ColBERT-style MaxSim retrieval. It is not the main method of our paper; we include it as a baseline.
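The mechanism can be sketched as follows. This is an illustrative re-implementation, not the checkpoint's actual module; the class name `SequenceResizer` and the use of GELU in the bottleneck are assumptions, while the sizes (input 1024, hidden 256, output 32) come from this card's configuration.

```python
import torch
import torch.nn as nn

class SequenceResizer(nn.Module):
    """Sketch of SeqResize: an MLP that projects the *sequence* axis
    (not the hidden axis) from a fixed input length to a fixed budget.
    Sizes follow this card's config; activation choice is an assumption."""

    def __init__(self, input_size=1024, hidden_size=256, output_size=32):
        super().__init__()
        self.input_size = input_size
        self.proj = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, output_size),
        )

    def forward(self, x):
        # x: (batch, seq_len, hidden_dim) with variable seq_len
        batch, seq_len, dim = x.shape
        if seq_len > self.input_size:          # trim long sequences
            x = x[:, :self.input_size, :]
        elif seq_len < self.input_size:        # pad short ones with zeros
            pad = x.new_zeros(batch, self.input_size - seq_len, dim)
            x = torch.cat([x, pad], dim=1)
        # Swap sequence and hidden axes so the MLP acts on the sequence
        # dimension, then swap back: (B, L, D) -> (B, D, 32) -> (B, 32, D).
        return self.proj(x.transpose(1, 2)).transpose(1, 2)
```

With these sizes, a ~1318-token video sequence is trimmed to 1024 positions and compressed to 32 output vectors, each still in the model's hidden dimension.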


Results on MSR-VTT

| Method | Tokens | R@1 | R@10 | nDCG@10 |
|---|---|---|---|---|
| OmniEmbed-7B | 1 | 51.5 | 83.2 | 67.1 |
| Video-ColBERT | 26 | 51.5 | 85.5 | 67.7 |
| Baseline (Ours, uncompressed) | 1318 | 55.7 | 88.3 | 71.9 |
| SeqResize (This model) | 32 | 53.3 | 86.9 | 69.9 |
| MemTok | 32 | 54.2 | 86.4 | 69.9 |
| H-Pool | 32 | 54.1 | 87.3 | 70.4 |
| AGC | 32* | 56.9 | 87.0 | 71.5 |

Model Details

| Property | Value |
|---|---|
| Initial weights | Qwen2.5-VL-3B-Instruct |
| Architecture | Qwen2.5-VL with bidirectional attention |
| Hidden dimension | 2048 |
| Budget | 32 vectors per document |
| Compression method | SeqResize (learned sequence projection) |
| Resizer input size | 1024 (fixed sequence length before projection) |
| Resizer output size | 32 (budget) |
| Resizer hidden size | 256 (MLP bottleneck) |
| Scoring | ColBERT-style MaxSim (late interaction) |
| Normalization | L2-normalized embeddings |
| Query prefix | `"Query: "` |
| Passage prefix | `"Passage: "` |
| Precision | bfloat16 |
| Training video frames | 24 |
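ColBERT-style MaxSim scores a query against a document by matching each query vector to its most similar document vector and summing those maxima. A minimal sketch, assuming L2-normalized embeddings (so dot products are cosine similarities); the function name `maxsim_score` is illustrative, not the repository's API:

```python
import torch

def maxsim_score(query_emb, doc_emb, query_mask=None):
    """Late-interaction MaxSim (sketch).
    query_emb: (num_query_tokens, dim), doc_emb: (num_doc_vectors, dim),
    both L2-normalized. query_mask optionally zeroes out padding tokens."""
    sim = query_emb @ doc_emb.T            # pairwise cosine similarities
    max_per_query = sim.max(dim=1).values  # best doc vector per query token
    if query_mask is not None:
        max_per_query = max_per_query * query_mask
    return max_per_query.sum()             # sum over query tokens
```

Because each of this model's documents contributes only 32 vectors, the `sim` matrix for a query of length Q is just Q x 32, which is what makes the compressed index cheap to score.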

Usage

Important: When loading this model, you must set `resizer_input_size`, `resizer_output_size`, and `resizer_hidden_size` to match the trained checkpoint (1024, 32, and 256 for this release). `extra_encoder_state.safetensors` must also be placed in the model directory so that the sequence resizer weights are loaded.

```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

from src.arguments import ModelArguments
from src.encoder.resize_encoder import SequenceResizerEncoder
from src.models.qwen2_5_vl_embed.qwen2_5_vl_embed import Qwen2_5ForEmbedding

MODEL_ID = "hltcoe/SeqResize_qwen2.5-vl_msrvtt"
VIDEO_PATH = "PLACEHOLDER"
RESIZER_INPUT_SIZE = 1024
RESIZER_OUTPUT_SIZE = 32
RESIZER_HIDDEN_SIZE = 256

# --- Setup ---
model_args = ModelArguments(
    model_name_or_path=MODEL_ID,
    pooling="resize",
    normalize=True,
    resizer_input_size=RESIZER_INPUT_SIZE,
    resizer_output_size=RESIZER_OUTPUT_SIZE,
    resizer_hidden_size=RESIZER_HIDDEN_SIZE,
    attn_implementation="flash_attention_2",
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = SequenceResizerEncoder.load(
    Qwen2_5ForEmbedding,
    model_args,
    attn_implementation=model_args.attn_implementation,
    dtype=torch.bfloat16,
)
model = model.to("cuda").eval()

# --- Encode a video document ---
passage_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Passage: "},
            {"type": "video", "video": VIDEO_PATH, "nframes": 24, "max_pixels": 84672, "min_pixels": 75264},
        ],
    }
]
text = processor.apply_chat_template(passage_messages, tokenize=False, add_generation_prompt=False)
image_inputs, video_inputs = process_vision_info(passage_messages)
passage_inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt",
).to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        doc_embeddings, doc_mask = model.encode(passage_inputs, is_query=False)
        print(doc_embeddings.shape)
        # doc_embeddings: (1, 32, 2048) — 32 compressed vectors

# --- Encode a text query ---
query_messages = [{"role": "user", "content": [{"type": "text", "text": "Query: a person is cooking"}]}]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=False)
query_inputs = processor(text=[query_text], padding=True, return_tensors="pt").to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        query_embeddings, query_mask = model.encode(query_inputs, is_query=True)
        print(query_embeddings.shape)

# --- ColBERT MaxSim scoring ---
score = model.compute_similarity(query_embeddings, doc_embeddings, query_mask, doc_mask)
print(f"Similarity score: {score.item():.4f}")
```

Command line usage

To run inference and evaluation from the command line, see the Quick Start section of the accompanying GitHub repository.

Citation

@misc{qin2026multivectorindexcompressionmodality,
      title={Multi-Vector Index Compression in Any Modality}, 
      author={Hanxiang Qin and Alexander Martin and Rohan Jha and Chunsheng Zuo and Reno Kriz and Benjamin Van Durme},
      year={2026},
      eprint={2602.21202},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2602.21202}, 
}