
Tarsier Model Card

Introduction

We propose Tarsier2-7B (0115 release) as the latest member of the Tarsier series. Tarsier2-7B sets new state-of-the-art results across 16 public video understanding benchmarks, spanning tasks such as video captioning, video question-answering, video grounding, and hallucination testing. On the Tarsier series' signature capability, detailed video description, Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in both automatic metrics and human evaluation.

Compared to Tarsier-7B, Tarsier2-7B is comprehensively upgraded in its base model (Qwen2-VL-7B) and in its training data and stages:

  • Pre-train: We scale the training data up to 40M video-text pairs, expanding both volume and diversity.
  • SFT: Fine-grained temporal alignment is performed during supervised fine-tuning.
  • DPO: Preference data is automatically constructed via model-based sampling, and the model is then optimized with DPO training (the standard objective is sketched after this list).
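
For reference, below is the standard DPO objective (Rafailov et al., 2023) that this stage optimizes; the model-based sampling and filtering recipe itself is described in the paper. Here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen SFT checkpoint, and $(y_w, y_l)$ a preferred/rejected description pair for prompt $x$:

```latex
% Standard DPO loss over a preference dataset D; beta is a temperature
% hyperparameter controlling deviation from the reference policy.
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```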

Model details

  • Base Model: Qwen2-VL-7B-Instruct
  • Training Data:
    • Pre-train: Over 40M samples mixing video, image, and text data, of which 20.4M are open-source and 19.8M in-house, summarized below:

      Figure 1: Summary of datasets used in the pre-training stage of Tarsier2.
    • Post-train: 150K human-annotated detailed video descriptions for SFT and 20K automatically sampled and filtered preference pairs for DPO.

Model date: Tarsier2-7B-0115 was trained in December 2024.

Paper or resources for more information: https://arxiv.org/abs/2501.07888 and https://github.com/bytedance/tarsier

Performance

Tarsier2-7B excels in a variety of video understanding tasks, including video captioning, video question-answering, video grounding, and hallucination testing.


Figure 2: Performance comparison of Tarsier2 with previous SOTA models at 7B-scale and GPT-4o.

License

This model inherits the license of its base model, Qwen/Qwen2-VL-7B-Instruct.

Intended use

Primary intended uses: The primary use of Tarsier is research on large multimodal models, especially video description.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

How to Use

See the usage guide in the official repository: https://github.com/bytedance/tarsier?tab=readme-ov-file#usage.
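
For quick experimentation, the sketch below assumes the checkpoint loads with the standard Qwen2-VL classes from transformers, since the base model is Qwen2-VL-7B-Instruct; the video path and prompt are placeholders, and the repository's usage guide above remains authoritative.

```python
# Minimal inference sketch, assuming the checkpoint is compatible with the
# standard Qwen2-VL classes in transformers (see the official repo for the
# supported usage path).
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "omni-research/Tarsier2-7b-0115"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# "video.mp4" is a placeholder path; replace it with your own clip.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "video.mp4"},
        {"type": "text", "text": "Describe the video in detail."},
    ],
}]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding the generated description.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

qwen-vl-utils handles frame extraction from the video path; increase max_new_tokens if longer descriptions are truncated.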

Where to send questions or comments about the model: https://github.com/bytedance/tarsier/issues

Citation

If you find our work helpful, feel free to cite us as:

@misc{yuan2025tarsier2advancinglargevisionlanguage,
      title={Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding}, 
      author={Liping Yuan and Jiawei Wang and Haomiao Sun and Yuchen Zhang and Yuan Lin},
      year={2025},
      eprint={2501.07888},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.07888}, 
}