Full ColBERT & H-Pool — Qwen2.5-Omni-3B
This checkpoint supports two inference modes from the same weights: (1) Full ColBERT (uncompressed), which uses all token-level vectors for late interaction; and (2) H-Pool, a parameter-free compression that applies Ward hierarchical clustering at inference time to reduce each document's vectors to a fixed budget (e.g. 64). No extra parameters are involved; you switch behavior by setting pooling="colbert" or pooling="hierarchical_clustering". The weights are initialized from the thinker component of Qwen2.5-Omni-3B-Instruct, run with bidirectional attention, fine-tuned on RankVideo-Dataset, and evaluated on MultiVENT 2.0 for audiovisual text-to-video retrieval.
Method Overview
Full ColBERT keeps the full multi-vector representation (all token embeddings) and scores with ColBERT-style MaxSim. H-Pool compresses the document tokens to a fixed number of vectors via Ward hierarchical clustering: pairwise cosine similarities are converted to distances, the tokens are clustered, and the vectors within each cluster are averaged. Queries stay uncompressed. Both modes use the same checkpoint; only the pooling option changes.
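The H-Pool steps described above can be sketched with NumPy and SciPy. This is a minimal illustration, not the checkpoint's actual implementation: the function name `hpool_sketch`, the `1 - cosine` distance, and the re-normalization of cluster means are assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def hpool_sketch(doc_vectors: np.ndarray, budget: int = 64) -> np.ndarray:
    """Pool (n_tokens, dim) vectors down to at most `budget` unit vectors.

    Hypothetical helper for illustration; assumes no all-zero token vectors.
    """
    vectors = doc_vectors.astype(np.float64)
    # L2-normalize so that dot products are cosine similarities.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    if unit.shape[0] <= budget:
        return unit  # already within budget, nothing to compress

    # Cosine similarity -> distance (assumed here to be 1 - similarity).
    dist = 1.0 - unit @ unit.T
    dist = np.clip((dist + dist.T) / 2.0, 0.0, None)  # symmetric, non-negative
    np.fill_diagonal(dist, 0.0)

    # Ward linkage on the condensed distance matrix, cut into <= budget clusters.
    tree = linkage(squareform(dist, checks=False), method="ward")
    labels = fcluster(tree, t=budget, criterion="maxclust")

    # Average the vectors in each cluster, then re-normalize (an assumption).
    pooled = np.stack([unit[labels == c].mean(axis=0) for c in np.unique(labels)])
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(200, 32))  # 200 synthetic token vectors
pooled = hpool_sketch(tokens, budget=64)
print(pooled.shape)
```

Because the clustering runs purely at inference time on the encoder's outputs, nothing here is learned, which is consistent with the parameter-free description above.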
Results on MultiVENT 2.0
H-Pool in the table below is from this checkpoint.
| Method | Tokens | R@10 | nDCG@10 |
|---|---|---|---|
| SeqResize | 64 | 41.1 | 38.5 |
| MemTok | 64 | 48.7 | 44.8 |
| H-Pool (this checkpoint) | 64 | 49.2 | 46.5 |
| AGC | 64 | 49.6 | 46.3 |
Model Details
| Property | Value |
|---|---|
| Initial weights | Qwen2.5-Omni-3B-Instruct (thinker) |
| Architecture | Qwen2.5-Omni (thinker) with bidirectional attention |
| Hidden dimension | 2048 |
| Pooling | colbert (full) or hierarchical_clustering (H-Pool) |
| Budget | H-Pool: 64 vectors per document |
| Scoring | ColBERT-style MaxSim (late interaction) |
| Normalization | L2-normalized embeddings |
| Query prefix | "Query: " |
| Passage prefix | "Passage: " |
| Precision | bfloat16 |
| Training video frames | 24 |
| Audio sampling rate | 4 kHz |
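As a rough back-of-the-envelope check on what the H-Pool budget means for index size, the arithmetic below combines the hidden dimension, precision, and budget from the table. It assumes vectors are stored as raw bfloat16 with no further quantization; `index_bytes` is a hypothetical helper, and the 1,000-token document is an illustrative length, not a measured one.

```python
def index_bytes(num_vectors: int, dim: int = 2048, bytes_per_value: int = 2) -> int:
    """Bytes needed to store one document's vectors (bfloat16 = 2 bytes per value)."""
    return num_vectors * dim * bytes_per_value


hpool = index_bytes(64)
print(hpool, hpool / 1024)  # 262144 bytes, i.e. 256.0 KiB per document

# Hypothetical uncompressed document with 1,000 token vectors, for comparison.
hypothetical_full = index_bytes(1000)
print(hypothetical_full / hpool)  # 15.625
```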
Usage
Use Full ColBERT (uncompressed) with pooling="colbert", or H-Pool with pooling="hierarchical_clustering" and num_repr_vectors=64. Same checkpoint; only the pooling argument changes.
import torch
from transformers import AutoProcessor
from qwen_omni_utils import process_mm_info
from src.arguments import ModelArguments
from src.encoder.multivec_encoder import MultiVecEncoder
from src.models.qwen2_5_omni_embed.qwen2_5_omni_embed import Qwen2_5OmniForEmbedding
MODEL_ID = "PLACEHOLDER"
VIDEO_PATH = "PLACEHOLDER"
AUDIO_PATH = "PLACEHOLDER"
# Full (uncompressed) ColBERT:
# model_args = ModelArguments(model_name_or_path=MODEL_ID, pooling="colbert", normalize=True, attn_implementation="flash_attention_2")
# H-Pool (64 vectors):
model_args = ModelArguments(
    model_name_or_path=MODEL_ID,
    pooling="hierarchical_clustering",
    normalize=True,
    num_repr_vectors=64,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = MultiVecEncoder.load(
    Qwen2_5OmniForEmbedding,
    model_args,
    attn_implementation=model_args.attn_implementation,
    dtype=torch.bfloat16,
)
model = model.to("cuda").eval()
# --- Encode a video+audio document ---
passage_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Passage: "},
            {"type": "video", "video": VIDEO_PATH, "nframes": 24, "max_pixels": 75264, "min_pixels": 65856},
            {"type": "audio", "audio": AUDIO_PATH},
        ],
    }
]
text = processor.apply_chat_template(passage_messages, tokenize=False, add_generation_prompt=False)
audio_inputs, image_inputs, video_inputs = process_mm_info([passage_messages], use_audio_in_video=False)
passage_inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, audio=audio_inputs, padding=True, return_tensors="pt",
).to("cuda")
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        doc_embeddings, doc_mask = model.encode(passage_inputs, is_query=False)
print(doc_embeddings.shape)
# colbert: (1, seq_len, 2048); hierarchical_clustering: (1, 64, 2048)
# --- Encode a text query ---
query_messages = [{"role": "user", "content": [{"type": "text", "text": "Query: a person is cooking"}]}]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=False)
query_inputs = processor(text=[query_text], padding=True, return_tensors="pt").to("cuda")
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        query_embeddings, query_mask = model.encode(query_inputs, is_query=True)
# --- ColBERT MaxSim scoring ---
score = model.compute_similarity(query_embeddings, doc_embeddings, query_mask, doc_mask)
print(f"Similarity score: {score.item():.4f}")
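The `compute_similarity` call above belongs to this repository, and its internals are not shown here. As a hedged illustration of what ColBERT-style MaxSim generally computes, the NumPy sketch below takes, for each query vector, the maximum dot product over a document's vectors and sums those maxima. The name `maxsim_sketch`, the mask convention (1 for real vectors, 0 for padding), and the batched output shape are assumptions for this example, and each document is assumed to have at least one unmasked vector.

```python
import numpy as np


def maxsim_sketch(q, d, q_mask, d_mask):
    """q: (Bq, Lq, D), d: (Bd, Ld, D); masks are (B, L) with 1 = real vector.

    Returns a (Bq, Bd) matrix of late-interaction scores. Hypothetical helper.
    """
    # All pairwise dot products between query and document vectors.
    sim = np.einsum("qid,pjd->qpij", q, d)  # (Bq, Bd, Lq, Ld)
    # Padded document vectors must never win the max.
    sim = np.where(d_mask.astype(bool)[None, :, None, :], sim, -np.inf)
    best = sim.max(axis=-1)  # (Bq, Bd, Lq)
    # Padded query vectors contribute nothing to the sum.
    best = np.where(q_mask.astype(bool)[:, None, :], best, 0.0)
    return best.sum(axis=-1)


# One query with two unit vectors; two documents, each padded to three vectors.
q = np.array([[[1.0, 0.0], [0.0, 1.0]]])
d = np.array([
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
])
q_mask = np.array([[1, 1]])
d_mask = np.array([[1, 1, 0], [1, 1, 0]])  # third vector of each doc is padding
scores = maxsim_sketch(q, d, q_mask, d_mask)
print(scores)  # doc 0 scores 2.0, doc 1 scores 1.0
```

With L2-normalized embeddings the dot products are cosine similarities, so each query vector contributes at most 1 to the score.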
Command line usage
For running inference and evaluation from the command line, see the Quick Start section.
Citation
@misc{qin2026multivectorindexcompressionmodality,
  title={Multi-Vector Index Compression in Any Modality},
  author={Hanxiang Qin and Alexander Martin and Rohan Jha and Chunsheng Zuo and Reno Kriz and Benjamin Van Durme},
  year={2026},
  eprint={2602.21202},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2602.21202},
}
