VidEoMT-L on YouTube-VIS 2019

This repository contains the Hugging Face Transformers conversion of the official VidEoMT checkpoint yt_2019_vit_large_68.6.pth from tue-mps/VidEoMT.

Model details

Architecture: VidEoMT with a DINOv2 ViT-L/14 backbone (4 register tokens)
Task: video instance segmentation
Dataset: YouTube-VIS 2019
Input resolution: 640 x 640
Number of frames: 2
Paper: Your ViT is Secretly Also a Video Segmentation Model

Reported metrics

Metric	Value
AP	68.6
AR@10	73.9
FPS	160

The metrics above are the numbers reported by the authors in the official model zoo.

Usage

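from transformers import AutoModelForUniversalSegmentation, AutoVideoProcessor

# Hub repository that hosts the converted checkpoint.
model_id = "tue-mps/videomt-dinov2-large-ytvis2019"
# Load the video processor and the segmentation model from the same repository.
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModelForUniversalSegmentation.from_pretrained(model_id)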

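To convert model outputs into segmentation results, use processor.post_process_instance_segmentation, processor.post_process_panoptic_segmentation, or processor.post_process_semantic_segmentation, depending on the target task. This checkpoint was trained for video instance segmentation, so post_process_instance_segmentation is the natural match.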
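To make the role of post-processing more concrete, the sketch below illustrates what instance-style post-processing conceptually does with query-based outputs. It assumes a Mask2Former-style layout (per-query class logits with a trailing "no object" slot, plus per-query mask logits), which is an assumption about this model's outputs and not something stated in this card. The helper name `toy_instance_postprocess`, the threshold, and the toy values are hypothetical, and plain Python lists stand in for tensors. It is not the processor's actual implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_instance_postprocess(class_logits, mask_logits, threshold=0.5):
    """Hypothetical helper: turn per-query logits into scored binary masks.

    class_logits: one list per query; the LAST entry is assumed to be "no object".
    mask_logits:  one 2D list (H x W) of mask logits per query.
    """
    results = []
    for query_classes, query_mask in zip(class_logits, mask_logits):
        probs = softmax(query_classes)
        object_probs = probs[:-1]  # drop the assumed "no object" slot
        label = max(range(len(object_probs)), key=object_probs.__getitem__)
        score = object_probs[label]
        if score < threshold:
            continue  # query did not confidently predict any object
        # Binarize the mask: a pixel belongs to the instance if sigmoid > 0.5.
        binary_mask = [[int(sigmoid(v) > 0.5) for v in row] for row in query_mask]
        results.append({"label_id": label, "score": score, "mask": binary_mask})
    return results


# Two toy queries over a 2x2 frame: the first predicts class 1, the second "no object".
class_logits = [[0.0, 4.0, 0.0], [0.0, 0.0, 5.0]]
mask_logits = [[[3.0, -3.0], [2.0, -1.0]], [[0.0, 0.0], [0.0, 0.0]]]
segments = toy_instance_postprocess(class_logits, mask_logits)
```

In Mask2Former-style pipelines, semantic and panoptic post-processing start from the same per-query scores and masks but combine them differently, for example by assigning each pixel to a single segment. That difference is the usual reason for having a separate method per task.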

Safetensors

Model size: 0.3B params
Tensor type: F32
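
As a rough sense of scale, the weight footprint can be estimated from the two figures above. This is back-of-the-envelope arithmetic on the rounded 0.3B parameter count, assuming 4 bytes per F32 parameter. It ignores activations and runtime overhead, and the variable names are illustrative.

```python
params = 0.3e9        # rounded parameter count listed above (assumed exact here)
bytes_per_param = 4   # an F32 value occupies 4 bytes

weight_bytes = params * bytes_per_param
weight_gb = weight_bytes / 1e9      # decimal gigabytes
weight_gib = weight_bytes / 2**30   # binary gibibytes

print(f"~{weight_gb:.1f} GB (~{weight_gib:.2f} GiB) of weights")
```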

Paper

VidEoMT: Your ViT is Secretly Also a Video Segmentation Model (arXiv:2602.17807, published Feb 19)