VideoMAEv2_Hashtag2Action

This repository provides VideoMAE V2 weights pre-trained on the Hashtag2Action (H2A) dataset, a large-scale collection of short-form social-media videos curated for action recognition research. We release two backbones:

  • ViT-B (Vision Transformer-Base)
  • ViT-Giant

We also provide fine-tuned weights on Kinetics-400 for both backbones, together with training logs and metadata when available.
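Checkpoints like these are usually loaded with torch.load and then key-cleaned before being fed to a bare ViT backbone. The sketch below shows only the key-cleanup step; the "model"/"module."/"encoder." wrapper names are common MAE-checkpoint conventions, not confirmed details of this release:

```python
# Sketch only: MAE-style checkpoints are often saved under a "model" key with
# "module." / "encoder." prefixes from DataParallel or the pre-training wrapper.
# The prefix names here are common conventions, not confirmed for this release.
def clean_state_dict(ckpt, prefixes=("module.", "encoder.")):
    """Unwrap a checkpoint dict and strip wrapper prefixes from weight keys."""
    state = ckpt.get("model", ckpt.get("module", ckpt))
    cleaned = {}
    for key, value in state.items():
        for p in prefixes:
            if key.startswith(p):
                key = key[len(p):]
        cleaned[key] = value
    return cleaned

# Toy demonstration (in practice: ckpt = torch.load(path, map_location="cpu")):
ckpt = {"model": {"encoder.blocks.0.attn.qkv.weight": [0.0]}}
print(clean_state_dict(ckpt))  # {'blocks.0.attn.qkv.weight': [0.0]}
```

Any keys that still fail to match your backbone (e.g. classification heads) can then be dropped or loaded with strict=False.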

Model Description

Hashtag2Action (H2A) is an ethically curated short-form video dataset built from public social-media clips for self-supervised action-recognition pre-training.
The full dataset contains 283,582 clips spanning 386 action categories. The H2A pipeline combines:

  • adaptive hashtag mining
  • metadata filtering
  • vision-based frame validation

Using this dataset, we pre-train VideoMAE V2 in a self-supervised manner and then fine-tune on downstream action-recognition benchmarks. The resulting models show that carefully curated, weakly labeled short-form videos can support strong downstream performance without additional manual annotation.
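As a rough illustration of the masked-autoencoder objective used in this kind of pre-training, VideoMAE-style encoders drop a high ratio of spatiotemporal tokens via "tube" masking, where one spatial mask is shared across all temporal slices. This is a minimal sketch under assumed token-layout conventions, not the exact VideoMAE V2 dual-masking scheme:

```python
import random

def tube_mask(time_slices, spatial_tokens, mask_ratio, seed=None):
    """Sample one spatial mask and replicate it across every temporal slice.

    Returns flat indices of MASKED tokens, where token index
    t * spatial_tokens + s denotes temporal slice t, spatial position s.
    """
    rng = random.Random(seed)
    n_masked = int(spatial_tokens * mask_ratio)
    masked_spatial = set(rng.sample(range(spatial_tokens), n_masked))
    return [t * spatial_tokens + s
            for t in range(time_slices)
            for s in range(spatial_tokens)
            if s in masked_spatial]

# 8 temporal slices over a 14x14 patch grid, 90% masking:
mask = tube_mask(8, 196, 0.9, seed=0)
print(len(mask))  # 1408 masked tokens = 8 slices * 176 spatial positions
```

Because the spatial pattern repeats in every slice, the model cannot recover a masked patch by copying it from a neighboring frame, which is what makes the high masking ratio workable for video.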

Intended Use

These weights are intended for:

  • action-recognition research
  • self-supervised video representation learning
  • transfer learning to downstream video understanding tasks
  • benchmarking short-form video pre-training strategies

These weights do not imply endorsement of any original platform content and should be used in accordance with applicable dataset, platform, privacy, and research-ethics requirements.

Training Data

The pre-training data comes from Hashtag2Action (H2A), which contains:

  • 283,582 curated short-form video clips
  • 386 action categories
  • data assembled through a pipeline designed to reduce noise and improve action relevance

As described in the paper, the dataset is derived from short-form public videos and curated through a multi-stage data-engineering pipeline rather than direct manual dense labeling.
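The staged curation described above could be sketched as composable filters. Everything below (field names, thresholds, the tag set, which checks belong to which stage) is invented for illustration and does not reflect the actual H2A implementation:

```python
# Illustrative only: field names, thresholds, and tags are hypothetical.
ACTION_TAGS = {"#pushup", "#skateboarding", "#juggling"}

def hashtag_ok(clip):
    """Stage 1: hashtag mining (reduced here to a simple membership test)."""
    return any(tag in ACTION_TAGS for tag in clip["hashtags"])

def metadata_ok(clip):
    """Stage 2: metadata filtering, e.g. plausible duration for one action."""
    return 2.0 <= clip["duration_s"] <= 60.0

def curate(clips):
    # Stage 3 (vision-based frame validation) would run a frame classifier
    # over sampled frames; it is omitted from this sketch.
    return [c for c in clips if hashtag_ok(c) and metadata_ok(c)]

clips = [
    {"hashtags": ["#pushup", "#fitness"], "duration_s": 15.0},
    {"hashtags": ["#travel"], "duration_s": 15.0},   # no action hashtag
    {"hashtags": ["#juggling"], "duration_s": 600.0},  # too long
]
print(len(curate(clips)))  # 1
```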

Fine-Tuning

After pre-training on H2A, the models were fine-tuned on standard action-recognition benchmarks, including:

  • UCF101
  • HMDB51
  • Kinetics-400
  • Something-Something V2

Results

Using the ViT-Giant backbone, the model achieves the following top-1 accuracies, as reported in the published paper:

  • UCF101: 99.1%
  • HMDB51: 86.1%
  • Kinetics-400: 85.5%
  • Something-Something V2 (SSv2): 74.3%

These results were obtained with only 20% of the original VideoMAE V2 pre-training data volume, highlighting the value of high-quality short-form video curation for self-supervised learning.
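For reference, top-1 accuracy is simply the fraction of clips whose highest-scoring class matches the ground-truth label; a minimal sketch (evaluation protocols for these benchmarks additionally average scores over multiple temporal/spatial views, which is omitted here):

```python
def top1_accuracy(logits, labels):
    """Fraction of examples where the argmax class equals the label."""
    correct = sum(
        1 for row, y in zip(logits, labels)
        if max(range(len(row)), key=row.__getitem__) == y
    )
    return correct / len(labels)

# Two clips over three classes: first predicted correctly, second not.
print(top1_accuracy([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]], [1, 2]))  # 0.5
```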

Files Provided

This repository may include:

  • pre-trained weights on Hashtag2Action
  • fine-tuned weights on Kinetics-400
  • log.txt containing fine-tuning details
  • metadata files and related resources for reproducibility

For training scripts and implementation details of the backbone architecture, please refer to the official VideoMAE V2 repository.

Citation

If you use VideoMAE V2, please cite:

@InProceedings{wang2023videomaev2,
  author    = {Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
  title     = {VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2023},
  pages     = {14549--14560}
}

@misc{videomaev2,
  title         = {VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
  author        = {Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
  year          = {2023},
  eprint        = {2303.16727},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}

If you use this repository, the Hashtag2Action dataset, or the released weights, please cite the published paper:

@InProceedings{Qian_2025_ICCV,
  author    = {Qian, Yang and Kargarandehkordi, Ali and Sun, Yinan and Azizian, Parnian and Mutlu, Onur Cezmi and Surabhi, Saimourya and Jabbar, Zain and Wall, Dennis and Washington, Peter and Chen, Huaijin},
  title     = {Hashtag2Action: Data Engineering and Self-Supervised Pre-Training for Action Recognition in Short-Form Videos},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {2965--2975}
}

Earlier Version

An earlier preprint associated with this project is:

@article{qian2024actionrecognition,
  author  = {Qian, Yang and Sun, Yinan and Kargarandehkordi, Ali and Azizian, Parnian and Mutlu, Onur Cezmi and Surabhi, Saimourya and Chen, Pingyi and Jabbar, Zain and Wall, Dennis Paul and Washington, Peter},
  title   = {Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos},
  journal = {arXiv preprint arXiv:2402.08875},
  year    = {2024}
}

For new citations, please prefer the published ICCV 2025 Workshops paper above.

License

This project is licensed under the MIT License.

Acknowledgments

This repository builds on the official VideoMAE V2 framework and extends it with short-form-video data engineering and self-supervised pre-training on Hashtag2Action. Please also acknowledge the original VideoMAE V2 authors when using the released models.
