# Multimodal Deepfake Detection - Data Card

## Datasets Used
### Visual: `Hemg/deepfake-and-real-images`

- Source: https://huggingface.co/datasets/Hemg/deepfake-and-real-images
- Size: ~528K images (140K+ used for training)
- Labels: Real=1, Fake=0 (flipped to Real=0, Fake=1 during training)
- Content: face images (real photographs vs. deepfakes)
- Preprocessing: Resize(224x224), ImageNet normalization, augmentation pipeline (flip, rotation, color jitter, Gaussian blur)
### Text: `artem9k/ai-text-detection-pile`
## Results

| Component | Validation Accuracy | Notes |
|---|---|---|
| Visual Branch (EfficientNet-B0) | ~73% | 3 epochs on 1K-image subset |
| Text Branch (RoBERTa-base) | ~75% | 3 epochs on 500-text subset |
| Multimodal Ensemble | - | Combines visual + text with learnable weights; fusion trained on a CPU subset |
## Architecture

- Visual: EfficientNet-B0 (5.3M params) + L2 norm + dropout + linear classifier
- Text: RoBERTa-base (125M params) + mean pooling + MLP head
- Fusion: learnable weighted averaging, with optional cross-modal attention
- Explainability: GradCAM on the last EfficientNet convolutional block
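The "learnable weighted averaging" fusion can be sketched as a tiny PyTorch module. This is an illustrative assumption about how the ensemble combines branch logits (the class name and the softmax parameterization are mine, and the optional cross-modal attention is omitted), not the code from `model.py`.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Sketch of learnable weighted averaging over two branch logits.

    One learnable logit per branch; a softmax keeps the mixing weights
    positive and summing to 1. With zero initialization both branches
    start at weight 0.5.
    """

    def __init__(self):
        super().__init__()
        self.branch_logits = nn.Parameter(torch.zeros(2))

    def forward(self, visual_logit: torch.Tensor, text_logit: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.branch_logits, dim=0)
        return w[0] * visual_logit + w[1] * text_logit
```

During ensemble training, gradients flow into `branch_logits`, so the model learns how much to trust each modality.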
## Training Configuration

| Parameter | Visual | Text |
|---|---|---|
| Backbone | EfficientNet-B0 | RoBERTa-base |
| Optimizer | Adam | AdamW |
| Learning Rate | 1e-4 | 2e-5 |
| Weight Decay | 1e-4 | 0.01 |
| Epochs | 8 (full), 3 (compact) | 5 (full), 3 (compact) |
| Batch Size | 32 | 16 |
| Image Size | 224x224 | - |
| Max Text Length (tokens) | - | 512 |
| Augmentation | Flip, Rotation, ColorJitter, GaussianBlur, RandomErasing | - |
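Wiring the table's optimizer hyperparameters into PyTorch looks like the following. The `visual_model` and `text_model` placeholders stand in for the actual EfficientNet-B0 and RoBERTa-base branches; only the optimizer settings come from the table.

```python
import torch

# Placeholder modules standing in for the two branches.
visual_model = torch.nn.Linear(8, 2)
text_model = torch.nn.Linear(8, 2)

# Visual branch: Adam, lr 1e-4, weight decay 1e-4 (per the table above).
visual_opt = torch.optim.Adam(visual_model.parameters(), lr=1e-4, weight_decay=1e-4)

# Text branch: AdamW, lr 2e-5, weight decay 0.01 -- the usual
# fine-tuning regime for RoBERTa-scale transformers.
text_opt = torch.optim.AdamW(text_model.parameters(), lr=2e-5, weight_decay=0.01)
```

Note the branches use different optimizers: plain Adam couples weight decay into the adaptive update, while AdamW decouples it, which is the standard choice for transformer fine-tuning.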
## Inference API

```python
from inference import load_model, classify_image, classify_text, classify_video, classify_multimodal

model, config = load_model('multimodal_ensemble.pt')

result = classify_image(model, 'face.jpg', return_gradcam=True)
result = classify_text(model, 'This essay was written by...')
result = classify_video(model, 'video.mp4', num_frames=32, aggregation='mean')
result = classify_multimodal(model, image_path_or_pil='face.jpg', text='Caption...')
```
## Files in this Repository

| File | Description |
|---|---|
| `model.py` | Core architecture: GradCAM, EfficientNet, RoBERTa, fusion, aggregation |
| `preprocessing.py` | Data loading, transforms, video frame extraction, tokenization |
| `inference.py` | Inference API: `load_model`, `classify_image`, `classify_text`, `classify_video`, `classify_multimodal` |
| `train.py` | Full training script with separate branch training + ensemble assembly |
| `multimodal_ensemble.pt` | Full ensemble checkpoint (493 MB) |
| `visual_branch.pt` | Visual-only checkpoint (15.6 MB) |
| `text_branch.pt` | Text-only checkpoint (476 MB) |
| `config.json` | Training configuration |
| `requirements.txt` | Python dependencies |
| `gradcam_examples/` | GradCAM explainability visualizations |
## Literature References

- AWARE-NET (arXiv:2505.00312): two-tier weighted ensemble
- CLIP Deepfake Detection (arXiv:2503.19683): L2-normalized features
- DeTeCtive (arXiv:2410.20964): RoBERTa-based AI text detection

## License

Apache-2.0