---
license: mit
---
# 🧾 Model Card: VideoMAE-DeepFake-Detector-v1

## 🧠 Model Overview
VideoMAE-DeepFake-Detector-v1 is a fine-tuned video deepfake detection model trained to distinguish between authentic and manipulated facial videos. The model builds upon the pretrained VideoMAE architecture and adapts it for binary classification of real versus synthetic videos.
The base model was originally trained on large-scale video action datasets, enabling strong spatiotemporal feature understanding. It was further fine-tuned on the FaceForensics++ dataset to detect visual artifacts, temporal inconsistencies, and manipulation signatures commonly found in deepfake videos.
By leveraging transformer-based video representation learning, the model captures both frame-level visual cues and motion patterns across time, allowing it to identify subtle manipulations that traditional image-based detectors may miss.
The model is designed for applications in media verification, misinformation detection, and AI-generated content monitoring.
## 🏋️ Training Details

- **Base Model:** MCG-NJU/videomae-base-finetuned-kinetics
- **Framework:** Hugging Face Transformers + PyTorch
- **Training Hardware:** NVIDIA T4 GPU (Kaggle)
- **Epochs:** 15
- **Batch Size:** 4
- **Learning Rate:** 2e-5
- **Optimizer:** AdamW
- **Video Sampling:** 16 frames per video clip
- **Resolution:** 224 × 224
- **Training Strategy:** Transfer learning with partial freezing:
  - ~70% of VideoMAE backbone layers frozen
  - Final transformer layers + classifier head fine-tuned
- **Dataset:** FaceForensics++ (C23 compression level)
- **Classes:**
  - 🟢 Real Video
  - 🔴 Deepfake Video
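A minimal sketch of the partial-freezing setup described above, assuming the attribute names (`videomae.embeddings`, `videomae.encoder.layer`) of the Transformers VideoMAE implementation. A randomly initialised config stands in for the pretrained checkpoint so the snippet runs without downloading weights; in actual training the backbone would be loaded with `from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics", num_labels=2, ignore_mismatched_sizes=True)`:

```python
from transformers import VideoMAEConfig, VideoMAEForVideoClassification

# Random-weight stand-in for the Kinetics-pretrained checkpoint,
# with a fresh binary (real/fake) classification head
config = VideoMAEConfig(num_labels=2)
model = VideoMAEForVideoClassification(config)

# Freeze the patch embeddings
for p in model.videomae.embeddings.parameters():
    p.requires_grad = False

# Freeze roughly the first 70% of the encoder blocks;
# the final blocks and the classifier head stay trainable
blocks = model.videomae.encoder.layer
n_frozen = int(len(blocks) * 0.7)  # 8 of 12 blocks for the base config
for block in blocks[:n_frozen]:
    for p in block.parameters():
        p.requires_grad = False
```

Freezing most of the backbone keeps the pretrained spatiotemporal features intact while adapting only the top layers to the deepfake-detection task, which also cuts memory use enough to train with batch size 4 on a single T4.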
## 📊 Dataset Description
The model was trained using the FaceForensics++ dataset, a widely used benchmark for deepfake detection research.
FaceForensics++ contains manipulated videos generated using multiple facial manipulation techniques, including deepfake generation and facial reenactment.
For this model version, training used a subset consisting of:

- Original videos (real)
- Deepfakes-manipulated videos (fake)
Each video was processed by sampling 16 frames uniformly across its duration to capture both spatial and temporal artifacts.
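The uniform sampling step can be sketched with NumPy (the 300-frame clip length below is illustrative):

```python
import numpy as np

# For a hypothetical 300-frame clip, pick 16 evenly spaced frame indices
total_frames, num_frames = 300, 16
indices = np.linspace(0, total_frames - 1, num_frames).astype(int)
print(indices)
```

The first and last frames are always included, so the sampled frames span the entire clip regardless of its length.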
| Label | Description |
|-------|-------------|
| Real  | Authentic, unmodified video |
| Fake  | Video manipulated using deepfake synthesis techniques |
## 🎯 Evaluation Metrics
Evaluation was performed on a held-out validation split of the dataset.
| Metric          | Score |
|-----------------|-------|
| Train Loss      | 0.303 |
| Validation Loss | 0.506 |
| Accuracy        | 88.0% |
| F1 Score        | 0.742 |
| AUC             | 0.836 |
The model demonstrates strong ability to distinguish between authentic and manipulated videos using temporal visual patterns.
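For reference, metrics of this kind can be computed on any set of predictions with scikit-learn. The labels and scores below are toy values for illustration only, not the actual validation data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy ground truth and model scores (1 = fake, 0 = real)
y_true  = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2])  # P(fake)
y_pred  = (y_score >= 0.5).astype(int)               # hard labels at 0.5

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:      ", f1_score(y_true, y_pred))
print("AUC:     ", roc_auc_score(y_true, y_score))   # uses raw scores, not labels
```

Note that AUC is threshold-independent (it ranks the raw scores), while accuracy and F1 depend on the chosen decision threshold.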
## 🔬 Example Usage

```python
import torch
import numpy as np
from decord import VideoReader, cpu
from PIL import Image
from transformers import VideoMAEForVideoClassification, VideoMAEImageProcessor

model = VideoMAEForVideoClassification.from_pretrained(
    "your_username/videomae-deepfake-detector"
)
processor = VideoMAEImageProcessor.from_pretrained(
    "your_username/videomae-deepfake-detector"
)

def load_video_frames(video_path, num_frames=16):
    """Sample num_frames uniformly across the clip's duration."""
    vr = VideoReader(video_path, ctx=cpu(0))
    total_frames = len(vr)
    indices = np.linspace(0, total_frames - 1, num_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()
    return [Image.fromarray(f) for f in frames]

@torch.no_grad()
def predict(video_path):
    frames = load_video_frames(video_path)
    inputs = processor(frames, return_tensors="pt")
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)[0]
    return {
        "real": float(probs[0]),
        "fake": float(probs[1]),
    }

print(predict("sample_video.mp4"))
```

Output example:

```python
{'real': 0.96, 'fake': 0.04}
```
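Since the model returns class probabilities, downstream code typically applies a decision threshold to obtain a hard label. The 0.5 operating point below is a hypothetical default; it should be tuned on your own validation data to balance false positives against false negatives:

```python
# Example output from predict(), as shown above
result = {"real": 0.96, "fake": 0.04}

THRESHOLD = 0.5  # hypothetical operating point, tune per deployment
label = "fake" if result["fake"] >= THRESHOLD else "real"
print(label)
```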
## 🧩 Intended Use

- Deepfake detection in video content
- Media authenticity verification
- AI-generated video detection pipelines
- Research on manipulated media detection
- Integration into misinformation monitoring systems
## ⚠️ Limitations

The model was trained on a subset of FaceForensics++ and may not generalize perfectly to unseen deepfake generation techniques.

Performance may degrade on:

- heavily compressed social media videos
- unseen manipulation methods
- partial face occlusions
- extremely short clips

This model should be used as an assistive forensic tool, not as a definitive authenticity guarantee.
## 🧑‍💻 Developer

- **Author:** Vansh Momaya
- **Institution:** D. J. Sanghvi College of Engineering
- **Focus Area:** Computer Vision, AI Safety, Deepfake Detection, Video Understanding
- **Email:** vanshmomaya9@gmail.com
## 📖 Citation

If you use this model in research or projects:

```bibtex
@online{momaya2025videomaedeepfake,
  author      = {Vansh Momaya},
  title       = {VideoMAE-DeepFake-Detector-v1},
  year        = {2025},
  version     = {v1},
  url         = {https://huggingface.co/Vansh180/VideoMae-deepfake-detector},
  institution = {D. J. Sanghvi College of Engineering},
  note        = {Fine-tuned VideoMAE model for detecting deepfake videos using FaceForensics++},
  license     = {MIT}
}
```
## 🙏 Acknowledgements

- **VideoMAE** – Base architecture for video representation learning
- **FaceForensics++** – Deepfake detection dataset benchmark
- **Hugging Face Transformers** – Training and deployment framework