🎬 Tempo-6B: Efficient Query-Aware Long Video Understanding


Tempo-6B is an efficient, query-aware Multimodal Large Language Model (MLLM) designed specifically for extreme-long video understanding. It was presented in the paper Small Vision-Language Models are Smart Compressors for Long Video Understanding.

Tempo resolves the structural mismatch between massive video streams and bounded LLM context windows by acting as a smart temporal compressor: it performs early cross-modal distillation, producing compact, intent-aligned video representations in a single forward pass.

πŸ—οΈ Architecture

Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global Large Language Model (LLM).

  • Local Compressor: Qwen3-VL-2B-Instruct
  • Global LLM: Qwen/Qwen3-4B
  • Total Parameters: ~6B

✨ Key Features

  • Adaptive Token Allocation (ATA): Acts as a training-free, O(1) dynamic router. It allocates dense representational bandwidth only to query-critical segments.
  • Token Efficiency: Achieves aggressive dynamic compression (0.5–16 tokens/frame), maintaining global causality while discarding redundancies.
  • Hour-Long Video Capability: Effectively processes and answers complex queries for videos over an hour long without hitting context limits.
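To make the Adaptive Token Allocation idea concrete, here is a minimal, hypothetical sketch (not the official implementation; the function name, softmax weighting, and budget numbers are assumptions) of how a training-free router might distribute a fixed token budget across video segments according to their similarity to the query, clamped to the 0.5–16 tokens/frame range stated above:

```python
import numpy as np

def allocate_tokens(similarities, total_budget, min_tokens=0.5, max_tokens=16.0):
    """Hypothetical query-aware allocator: split a fixed token budget across
    segments in proportion to softmax(query-segment similarity), then clamp
    each segment's share to the [min_tokens, max_tokens] per-frame range."""
    weights = np.exp(similarities - similarities.max())  # numerically stable softmax
    weights /= weights.sum()
    raw = weights * total_budget
    return np.clip(raw, min_tokens, max_tokens)

# Four segments; the query matches segments 1 and 3 most strongly.
sims = np.array([0.1, 0.9, 0.2, 0.8])
budget = allocate_tokens(sims, total_budget=24.0)
# Query-relevant segments receive the densest representation.
```

The key property, regardless of the exact scoring function, is that dense representational bandwidth flows to query-critical segments while background segments are compressed aggressively.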

πŸš€ Quick Start

1. Installation

# Clone the repository
git clone https://github.com/FeiElysia/Tempo.git
cd Tempo

# Create environment
conda create -n tempo python=3.12 -y
conda activate tempo

# Install dependencies
pip install -r requirements.txt

2. Prepare Checkpoints

To run the inference script successfully, you need to download both the Tempo-6B weights and the base Qwen3-VL model for architecture initialization.

mkdir -p checkpoints

# 1. Download the final Tempo-6B model
huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B

# 2. Download the base Qwen3-VL model
huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct

3. Inference

Launch Gradio Web UI:

python app.py

CLI Inference:

python infer.py \
    --model_path "./checkpoints/Tempo-6B" \
    --video_path "/path/to/your/video.mp4" \
    --query "Describe the video in detail."

(Note: Since Tempo relies on custom routing mechanisms, downloading the weights directly via transformers without the official codebase will not work out-of-the-box.)

πŸ† Performance

Tempo-6B achieves state-of-the-art performance on extreme-long video tasks. On LVBench (average video length 4101s), Tempo-6B scores 52.3, outperforming proprietary baselines like GPT-4o and Gemini 1.5 Pro.
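A rough back-of-envelope calculation (with assumed numbers: 1 fps sampling and 256 uncompressed tokens per frame are illustrative, not from the paper) shows why this level of compression matters at LVBench's average video length:

```python
video_seconds = 4101           # average LVBench video length (from the text)
fps = 1.0                      # assumed sampling rate (hypothetical)
frames = int(video_seconds * fps)

dense_tokens_per_frame = 256   # typical uncompressed visual token count (assumption)
compressed_per_frame = 2.0     # a point within Tempo's reported 0.5-16 tokens/frame range

dense_total = frames * dense_tokens_per_frame       # over a million tokens
compressed_total = int(frames * compressed_per_frame)

print(f"uncompressed: {dense_total:,} tokens, compressed: {compressed_total:,} tokens")
```

Under these assumptions the uncompressed stream exceeds a million visual tokens, far beyond typical LLM context windows, while the compressed stream fits comfortably.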

πŸ“‘ Citation

@article{fei2026small,
  title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and Liu, Shuming and Wu, Lemeng and Krishnamoorthi, Raghuraman and Chandra, Vikas and Elhoseiny, Mohamed and Zhu, Chenchen},
  journal={arXiv preprint arXiv:2604.08120},
  year={2026}
}