# Multimodal Deepfake Detection - Data Card
|
|
## Datasets Used
|
|
### Visual: Hemg/deepfake-and-real-images
- **Source**: https://huggingface.co/datasets/Hemg/deepfake-and-real-images
- **Size**: ~528K images (140K+ used for training)
- **Labels**: Real=1, Fake=0 in the source data (remapped to Real=0, Fake=1 during training)
- **Content**: Face images (real photographs vs. deepfakes)
- **Preprocessing**: Resize(224x224), ImageNet normalization, augmentation pipeline (flip, rotation, color jitter, Gaussian blur)
|
|
### Text: artem9k/ai-text-detection-pile
- **Source**: https://huggingface.co/datasets/artem9k/ai-text-detection-pile
- **Size**: 1.88GB, ~1M samples (20K+ used for training)
- **Labels**: human, ai
- **Content**: Essays, reports, and news articles from human authors and AI models
- **Preprocessing**: RoBERTa tokenizer, max_length=512, truncation/padding

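After tokenization, every input is truncated or padded to a fixed length of 512. A minimal sketch of that step (the helper name is illustrative; real runs use the Hugging Face RoBERTa tokenizer, whose pad token id is 1):

```python
# Illustrative sketch of the truncation/padding step; the actual
# pipeline calls the Hugging Face RoBERTa tokenizer with max_length=512.
MAX_LEN = 512
PAD_ID = 1  # RoBERTa's pad token id

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    ids = token_ids[:max_len]                        # truncate long inputs
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [pad_id] * (max_len - len(ids))      # pad short inputs
    return ids, mask
```

The attention mask marks real tokens with 1 and padding with 0, so the transformer ignores the padded positions.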
## Results

| Component | Validation Accuracy | Notes |
|-----------|--------------------|-------|
| Visual Branch (EfficientNet-B0) | ~73% | 3 epochs on a 1K-image subset |
| Text Branch (RoBERTa-base) | ~75% | 3 epochs on a 500-sample text subset |
| Multimodal Ensemble | N/A | Combines visual + text with learnable weights; fusion trained on a CPU subset |

## Architecture

- **Visual**: EfficientNet-B0 (5.3M params) + L2 Norm + Dropout + Linear Classifier
- **Text**: RoBERTa-base (125M params) + Mean Pooling + MLP Head
- **Fusion**: Learnable weighted averaging with optional cross-modal attention
- **Explainability**: GradCAM on the last EfficientNet convolutional block

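A minimal sketch of the learnable weighted-averaging fusion; the class and attribute names are assumptions, and the actual `model.py` may differ:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Learnable weighted average over per-modality logits (illustrative)."""
    def __init__(self, num_modalities: int = 2):
        super().__init__()
        # One learnable scalar per modality, normalized via softmax.
        self.raw_weights = nn.Parameter(torch.ones(num_modalities))

    def forward(self, visual_logits: torch.Tensor, text_logits: torch.Tensor):
        w = torch.softmax(self.raw_weights, dim=0)
        return w[0] * visual_logits + w[1] * text_logits

fusion = WeightedFusion()
fused = fusion(torch.tensor([0.8]), torch.tensor([0.3]))
```

With equal initial weights the softmax yields 0.5/0.5, so fusion starts as a plain average and drifts toward the more reliable modality as training updates the weights.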
## Training Configuration

| Parameter | Visual | Text |
|-----------|--------|------|
| Backbone | EfficientNet-B0 | RoBERTa-base |
| Optimizer | Adam | AdamW |
| Learning Rate | 1e-4 | 2e-5 |
| Weight Decay | 1e-4 | 0.01 |
| Epochs | 8 (full), 3 (compact) | 5 (full), 3 (compact) |
| Batch Size | 32 | 16 |
| Image Size | 224x224 | - |
| Text Length | - | 512 |
| Augmentation | Flip, Rotation, ColorJitter, GaussianBlur, RandomErasing | - |

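The optimizer rows in the table translate directly to PyTorch. The model objects below are stand-ins, not the real branches:

```python
import torch

# Stand-in modules; the real script builds EfficientNet-B0 and RoBERTa-base.
visual_model = torch.nn.Linear(8, 2)
text_model = torch.nn.Linear(8, 2)

# Hyperparameters taken from the table above.
visual_opt = torch.optim.Adam(visual_model.parameters(),
                              lr=1e-4, weight_decay=1e-4)
text_opt = torch.optim.AdamW(text_model.parameters(),
                             lr=2e-5, weight_decay=0.01)
```

AdamW decouples weight decay from the gradient update, which is the conventional choice when fine-tuning transformer backbones like RoBERTa.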
## Inference API

```python
from inference import load_model, classify_image, classify_text, classify_video, classify_multimodal

model, config = load_model('multimodal_ensemble.pt')

# Image + GradCAM explainability
result = classify_image(model, 'face.jpg', return_gradcam=True)
# result: {prediction, confidence, gradcam: (224, 224)}

# Text
result = classify_text(model, 'This essay was written by...')
# result: {prediction, confidence}

# Video (aggregated from frame classifications)
result = classify_video(model, 'video.mp4', num_frames=32, aggregation='mean')
# result: {prediction, confidence, frame_scores}

# Multimodal (both modalities with learned fusion weights)
result = classify_multimodal(model, image_path_or_pil='face.jpg', text='Caption...')
# result: {prediction, confidence, modality_scores, fusion_weights}
```

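The `aggregation` parameter of `classify_video` controls how per-frame fake scores collapse into one video-level verdict. A plain-Python sketch of that step (function name, threshold, and supported modes are illustrative; the real logic lives in `model.py`/`inference.py`):

```python
# Illustrative frame-score aggregation; scores near 1.0 mean "fake",
# matching the Real=0/Fake=1 label convention used in training.
def aggregate_frame_scores(frame_scores, aggregation="mean", threshold=0.5):
    if aggregation == "mean":
        score = sum(frame_scores) / len(frame_scores)
    elif aggregation == "max":         # flag the video if any frame looks fake
        score = max(frame_scores)
    else:
        raise ValueError(f"unknown aggregation: {aggregation!r}")
    return {"prediction": "fake" if score >= threshold else "real",
            "confidence": score}

result = aggregate_frame_scores([0.9, 0.7, 0.8])
```

Mean aggregation smooths out per-frame noise, while max aggregation is more sensitive to short manipulated segments inside an otherwise real video.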
## Files in this Repository

| File | Description |
|------|-------------|
| `model.py` | Core architecture: GradCAM, EfficientNet, RoBERTa, Fusion, Aggregation |
| `preprocessing.py` | Data loading, transforms, video frame extraction, tokenization |
| `inference.py` | Inference API: load_model, classify_image, classify_text, classify_video, classify_multimodal |
| `train.py` | Full training script with separate branch training + ensemble assembly |
| `multimodal_ensemble.pt` | Full ensemble checkpoint (493MB) |
| `visual_branch.pt` | Visual-only checkpoint (15.6MB) |
| `text_branch.pt` | Text-only checkpoint (476MB) |
| `config.json` | Training configuration |
| `requirements.txt` | Python dependencies |
| `gradcam_examples/` | GradCAM explainability visualizations |

## Literature References

- **AWARE-NET** (arXiv:2505.00312): Two-tier weighted ensemble
- **CLIP Deepfake Detection** (arXiv:2503.19683): L2-normalized features
- **DeTeCtive** (arXiv:2410.20964): RoBERTa-based AI text detection
## License

Apache-2.0