🎬 Tempo-6B: Efficient Query-Aware Long Video Understanding


Tempo-6B is an efficient, query-aware Multimodal Large Language Model (MLLM) designed specifically for extreme-long video understanding. It was presented in the paper Small Vision-Language Models are Smart Compressors for Long Video Understanding.

Tempo resolves the structural mismatch between massive video streams and bounded LLM context windows by acting as a smart temporal compressor: it performs early cross-modal distillation, producing compact, intent-aligned video representations in a single forward pass.

πŸ—οΈ Architecture

Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global Large Language Model (LLM).

  • Local Compressor: Qwen3-VL-2B-Instruct
  • Global LLM: Qwen/Qwen3-4B
  • Total Parameters: ~6B

✨ Key Features

  • Adaptive Token Allocation (ATA): Acts as a training-free, O(1) dynamic router. It allocates dense representational bandwidth only to query-critical segments.
  • Token Efficiency: Achieves aggressive dynamic compression (0.5–16 tokens/frame), maintaining global causality while discarding redundancies.
  • Hour-Long Video Capability: Effectively processes and answers complex queries for videos over an hour long without hitting context limits.
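To make the Adaptive Token Allocation idea concrete, here is a minimal, hypothetical sketch (not the official implementation; the function name, softmax weighting, and budget numbers are assumptions) of how a training-free router might distribute a fixed token budget across video segments according to their similarity to the query, clamped to the 0.5–16 tokens/frame range stated above:

```python
import numpy as np

def allocate_tokens(similarities, total_budget, min_tokens=0.5, max_tokens=16.0):
    """Hypothetical query-aware allocator: split a fixed token budget across
    segments in proportion to softmax(query-segment similarity), then clamp
    each segment's share to the [min_tokens, max_tokens] per-frame range."""
    weights = np.exp(similarities - similarities.max())  # numerically stable softmax
    weights /= weights.sum()
    raw = weights * total_budget
    return np.clip(raw, min_tokens, max_tokens)

# Four segments; the query matches segments 1 and 3 most strongly.
sims = np.array([0.1, 0.9, 0.2, 0.8])
budget = allocate_tokens(sims, total_budget=24.0)
# Query-relevant segments receive the densest representation.
```

The key property, regardless of the exact scoring function, is that dense representational bandwidth flows to query-critical segments while background segments are compressed aggressively.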

πŸš€ Quick Start

1. Installation

# Clone the repository
git clone https://github.com/FeiElysia/Tempo.git
cd Tempo

# Create environment
conda create -n tempo python=3.12 -y
conda activate tempo

# Install dependencies
pip install -r requirements.txt

2. Prepare Checkpoints

To run the inference script successfully, you need to download both the Tempo-6B weights and the base Qwen3-VL model for architecture initialization.

mkdir -p checkpoints

# 1. Download the final Tempo-6B model
huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B

# 2. Download the base Qwen3-VL model
huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct

3. Inference

Launch Gradio Web UI:

python app.py

CLI Inference:

python infer.py \
    --model_path "./checkpoints/Tempo-6B" \
    --video_path "/path/to/your/video.mp4" \
    --query "Describe the video in detail."

(Note: Since Tempo relies on custom routing mechanisms, downloading the weights directly via transformers without the official codebase will not work out-of-the-box.)

πŸ† Performance

Tempo-6B achieves state-of-the-art performance on extreme-long video tasks. On LVBench (average video length 4101s), Tempo-6B scores 52.3, outperforming proprietary baselines like GPT-4o and Gemini 1.5 Pro.
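A rough back-of-envelope calculation (with assumed numbers: 1 fps sampling and 256 uncompressed tokens per frame are illustrative, not from the paper) shows why this level of compression matters at LVBench's average video length:

```python
video_seconds = 4101           # average LVBench video length (from the text)
fps = 1.0                      # assumed sampling rate (hypothetical)
frames = int(video_seconds * fps)

dense_tokens_per_frame = 256   # typical uncompressed visual token count (assumption)
compressed_per_frame = 2.0     # a point within Tempo's reported 0.5-16 tokens/frame range

dense_total = frames * dense_tokens_per_frame       # over a million tokens
compressed_total = int(frames * compressed_per_frame)

print(f"uncompressed: {dense_total:,} tokens, compressed: {compressed_total:,} tokens")
```

Under these assumptions the uncompressed stream exceeds a million visual tokens, far beyond typical LLM context windows, while the compressed stream fits comfortably.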

πŸ“‘ Citation

@article{fei2026small,
  title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and Liu, Shuming and Wu, Lemeng and Krishnamoorthi, Raghuraman and Chandra, Vikas and Elhoseiny, Mohamed and Zhu, Chenchen},
  journal={arXiv preprint arXiv:2604.08120},
  year={2026}
}