VideoMAEv2_Hashtag2Action

This repository provides VideoMAE V2 weights pre-trained on the Hashtag2Action (H2A) dataset, a large-scale collection of short-form social-media videos curated for action recognition research. We release two backbones:

  • ViT-B (Vision Transformer-Base)
  • ViT-Giant

We also provide fine-tuned weights on Kinetics-400 for both backbones, together with training logs and metadata when available.
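Checkpoints like these are usually loaded with torch.load and then key-cleaned before being fed to a bare ViT backbone. The sketch below shows only the key-cleanup step; the "model"/"module."/"encoder." wrapper names are common MAE-checkpoint conventions, not confirmed details of this release:

```python
# Sketch only: MAE-style checkpoints are often saved under a "model" key with
# "module." / "encoder." prefixes from DataParallel or the pre-training wrapper.
# The prefix names here are common conventions, not confirmed for this release.
def clean_state_dict(ckpt, prefixes=("module.", "encoder.")):
    """Unwrap a checkpoint dict and strip wrapper prefixes from weight keys."""
    state = ckpt.get("model", ckpt.get("module", ckpt))
    cleaned = {}
    for key, value in state.items():
        for p in prefixes:
            if key.startswith(p):
                key = key[len(p):]
        cleaned[key] = value
    return cleaned

# Toy demonstration (in practice: ckpt = torch.load(path, map_location="cpu")):
ckpt = {"model": {"encoder.blocks.0.attn.qkv.weight": [0.0]}}
print(clean_state_dict(ckpt))  # {'blocks.0.attn.qkv.weight': [0.0]}
```

Any keys that still fail to match your backbone (e.g. classification heads) can then be dropped or loaded with strict=False.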

Model Description

Hashtag2Action (H2A) is an ethically curated short-form video dataset built from public social-media clips for self-supervised action-recognition pre-training.
The full dataset contains 283,582 clips spanning 386 action categories. The H2A pipeline combines:

  • adaptive hashtag mining
  • metadata filtering
  • vision-based frame validation

Using this dataset, we pre-train VideoMAE V2 in a self-supervised manner and then fine-tune on downstream action-recognition benchmarks. The resulting models show that carefully curated, weakly labeled short-form videos can support strong downstream performance without additional manual annotation.
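As a rough illustration of the masked-autoencoder objective used in this kind of pre-training, VideoMAE-style encoders drop a high ratio of spatiotemporal tokens via "tube" masking, where one spatial mask is shared across all temporal slices. This is a minimal sketch under assumed token-layout conventions, not the exact VideoMAE V2 dual-masking scheme:

```python
import random

def tube_mask(time_slices, spatial_tokens, mask_ratio, seed=None):
    """Sample one spatial mask and replicate it across every temporal slice.

    Returns flat indices of MASKED tokens, where token index
    t * spatial_tokens + s denotes temporal slice t, spatial position s.
    """
    rng = random.Random(seed)
    n_masked = int(spatial_tokens * mask_ratio)
    masked_spatial = set(rng.sample(range(spatial_tokens), n_masked))
    return [t * spatial_tokens + s
            for t in range(time_slices)
            for s in range(spatial_tokens)
            if s in masked_spatial]

# 8 temporal slices over a 14x14 patch grid, 90% masking:
mask = tube_mask(8, 196, 0.9, seed=0)
print(len(mask))  # 1408 masked tokens = 8 slices * 176 spatial positions
```

Because the spatial pattern repeats in every slice, the model cannot recover a masked patch by copying it from a neighboring frame, which is what makes the high masking ratio workable for video.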

Intended Use

These weights are intended for:

  • action-recognition research
  • self-supervised video representation learning
  • transfer learning to downstream video understanding tasks
  • benchmarking short-form video pre-training strategies

These weights do not imply endorsement of any original platform content and should be used in accordance with applicable dataset, platform, privacy, and research-ethics requirements.

Training Data

The pre-training data comes from Hashtag2Action (H2A), which contains:

  • 283,582 curated short-form video clips
  • 386 action categories
  • data assembled through a pipeline designed to reduce noise and improve action relevance

As described in the paper, the dataset is derived from short-form public videos and curated through a multi-stage data-engineering pipeline rather than direct manual dense labeling.
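The staged curation described above could be sketched as composable filters. Everything below (field names, thresholds, the tag set, which checks belong to which stage) is invented for illustration and does not reflect the actual H2A implementation:

```python
# Illustrative only: field names, thresholds, and tags are hypothetical.
ACTION_TAGS = {"#pushup", "#skateboarding", "#juggling"}

def hashtag_ok(clip):
    """Stage 1: hashtag mining (reduced here to a simple membership test)."""
    return any(tag in ACTION_TAGS for tag in clip["hashtags"])

def metadata_ok(clip):
    """Stage 2: metadata filtering, e.g. plausible duration for one action."""
    return 2.0 <= clip["duration_s"] <= 60.0

def curate(clips):
    # Stage 3 (vision-based frame validation) would run a frame classifier
    # over sampled frames; it is omitted from this sketch.
    return [c for c in clips if hashtag_ok(c) and metadata_ok(c)]

clips = [
    {"hashtags": ["#pushup", "#fitness"], "duration_s": 15.0},
    {"hashtags": ["#travel"], "duration_s": 15.0},   # no action hashtag
    {"hashtags": ["#juggling"], "duration_s": 600.0},  # too long
]
print(len(curate(clips)))  # 1
```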

Fine-Tuning

After pre-training on H2A, the models were fine-tuned on standard action-recognition benchmarks, including:

  • UCF101
  • HMDB51
  • Kinetics-400
  • Something-Something V2

Results

Using the ViT-Giant backbone, the model achieves the following top-1 accuracies, as reported in the published paper:

  • UCF101: 99.1%
  • HMDB51: 86.1%
  • Kinetics-400: 85.5%
  • Something-Something V2 (SSv2): 74.3%

These results were obtained with only 20% of the original VideoMAE V2 pre-training data volume, highlighting the value of high-quality short-form video curation for self-supervised learning.
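For reference, top-1 accuracy is simply the fraction of clips whose highest-scoring class matches the ground-truth label; a minimal sketch (evaluation protocols for these benchmarks additionally average scores over multiple temporal/spatial views, which is omitted here):

```python
def top1_accuracy(logits, labels):
    """Fraction of examples where the argmax class equals the label."""
    correct = sum(
        1 for row, y in zip(logits, labels)
        if max(range(len(row)), key=row.__getitem__) == y
    )
    return correct / len(labels)

# Two clips over three classes: first predicted correctly, second not.
print(top1_accuracy([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]], [1, 2]))  # 0.5
```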

Files Provided

This repository may include:

  • pre-trained weights on Hashtag2Action
  • fine-tuned weights on Kinetics-400
  • log.txt containing fine-tuning details
  • metadata files and related resources for reproducibility

For training scripts and implementation details of the backbone architecture, please refer to the official VideoMAE V2 repository.

Citation

If you use VideoMAE V2, please cite:

@InProceedings{wang2023videomaev2,
  author    = {Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
  title     = {VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2023},
  pages     = {14549--14560}
}

@misc{videomaev2,
  title         = {VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
  author        = {Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
  year          = {2023},
  eprint        = {2303.16727},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}

If you use this repository, the Hashtag2Action dataset, or the released weights, please cite the published paper:

@InProceedings{Qian_2025_ICCV,
  author    = {Qian, Yang and Kargarandehkordi, Ali and Sun, Yinan and Azizian, Parnian and Mutlu, Onur Cezmi and Surabhi, Saimourya and Jabbar, Zain and Wall, Dennis and Washington, Peter and Chen, Huaijin},
  title     = {Hashtag2Action: Data Engineering and Self-Supervised Pre-Training for Action Recognition in Short-Form Videos},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {2965--2975}
}

Earlier Version

An earlier preprint associated with this project is:

@article{qian2024actionrecognition,
  author  = {Qian, Yang and Sun, Yinan and Kargarandehkordi, Ali and Azizian, Parnian and Mutlu, Onur Cezmi and Surabhi, Saimourya and Chen, Pingyi and Jabbar, Zain and Wall, Dennis Paul and Washington, Peter},
  title   = {Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos},
  journal = {arXiv preprint arXiv:2402.08875},
  year    = {2024}
}

For new citations, please prefer the published ICCV 2025 Workshops paper above.

License

This project is licensed under the MIT License.

Acknowledgments

This repository builds on the official VideoMAE V2 framework and extends it with short-form-video data engineering and self-supervised pre-training on Hashtag2Action. Please also acknowledge the original VideoMAE V2 authors when using the released models.
