# Multimodal Deepfake Detection - Data Card

## Datasets Used

### Visual: Hemg/deepfake-and-real-images
- **Source**: https://huggingface.co/datasets/Hemg/deepfake-and-real-images
- **Size**: ~528K images (140K+ used for training)
- **Labels**: Real=1, Fake=0 (flipped to Real=0, Fake=1 during training)
- **Content**: Face images (real photographs vs. deepfakes)
- **Preprocessing**: Resize(224x224), ImageNet normalization, augmentation pipeline (flip, rotation, color jitter, Gaussian blur)

### Text: artem9k/ai-text-detection-pile
- **Source**: https://huggingface.co/datasets/artem9k/ai-text-detection-pile
- **Size**: 1.88 GB, ~1M samples (20K+ used for training)
- **Labels**: human, ai
- **Content**: Essays, reports, and news articles from human authors and AI models
- **Preprocessing**: RoBERTa tokenizer, max_length=512, truncation/padding

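The text preprocessing above can be sketched with the stock `roberta-base` tokenizer. Assumes the default Hugging Face tokenizer configuration; the repo's `preprocessing.py` may set it up differently:

```python
from transformers import AutoTokenizer

# Sketch of the described text preprocessing: RoBERTa tokenizer,
# max_length=512, truncation and padding to a fixed length.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

enc = tokenizer(
    "This essay was written by...",
    max_length=512,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # torch.Size([1, 512])
```
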
## Results

| Component | Validation Accuracy | Notes |
|-----------|---------------------|-------|
| Visual Branch (EfficientNet-B0) | ~73% | 3 epochs on a 1K-image subset |
| Text Branch (RoBERTa-base) | ~75% | 3 epochs on a 500-text subset |
| Multimodal Ensemble | - | Combines visual and text scores with learnable weights; fusion trained on a CPU subset |

## Architecture

- **Visual**: EfficientNet-B0 (5.3M params) + L2 norm + dropout + linear classifier
- **Text**: RoBERTa-base (125M params) + mean pooling + MLP head
- **Fusion**: Learnable weighted averaging, with optional cross-modal attention
- **Explainability**: GradCAM on the last EfficientNet convolutional block

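The learnable weighted-average fusion can be sketched as a tiny module. This is a minimal illustration of the idea, not the repo's actual `model.py` code; the cross-modal attention variant is omitted:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Learnable weighted average of per-modality scores (illustrative)."""

    def __init__(self):
        super().__init__()
        # one learnable logit per modality, softmaxed into fusion weights
        self.logits = nn.Parameter(torch.zeros(2))

    def forward(self, visual_score, text_score):
        w = torch.softmax(self.logits, dim=0)
        return w[0] * visual_score + w[1] * text_score, w

fusion = WeightedFusion()
score, weights = fusion(torch.tensor(0.8), torch.tensor(0.6))
print(float(score))  # ≈ 0.7 at initialization (equal 0.5/0.5 weights)
```

Because the weights come from a softmax over trainable logits, they always sum to 1 and can be read off directly as the `fusion_weights` reported by the inference API.
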
## Training Configuration

| Parameter | Visual | Text |
|-----------|--------|------|
| Backbone | EfficientNet-B0 | RoBERTa-base |
| Optimizer | Adam | AdamW |
| Learning Rate | 1e-4 | 2e-5 |
| Weight Decay | 1e-4 | 0.01 |
| Epochs | 8 (full), 3 (compact) | 5 (full), 3 (compact) |
| Batch Size | 32 | 16 |
| Image Size | 224x224 | - |
| Text Length (tokens) | - | 512 |
| Augmentation | Flip, Rotation, ColorJitter, GaussianBlur, RandomErasing | - |

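A minimal sketch of the optimizer settings from the table, using stand-in linear heads for the two branches (the real models are built in `train.py`):

```python
import torch
from torch import nn, optim

# Stand-ins for the two branch heads; dimensions match the backbones'
# feature sizes (1280 for EfficientNet-B0, 768 for RoBERTa-base).
visual_model = nn.Linear(1280, 2)
text_model = nn.Linear(768, 2)

# Adam for the visual branch, AdamW for the text branch, per the table.
visual_opt = optim.Adam(visual_model.parameters(), lr=1e-4, weight_decay=1e-4)
text_opt = optim.AdamW(text_model.parameters(), lr=2e-5, weight_decay=0.01)
```
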
## Inference API

```python
from inference import load_model, classify_image, classify_text, classify_video, classify_multimodal

model, config = load_model('multimodal_ensemble.pt')

# Image + GradCAM explainability
result = classify_image(model, 'face.jpg', return_gradcam=True)
# result: {prediction, confidence, gradcam: (224, 224)}

# Text
result = classify_text(model, 'This essay was written by...')
# result: {prediction, confidence}

# Video (aggregated from frame classifications)
result = classify_video(model, 'video.mp4', num_frames=32, aggregation='mean')
# result: {prediction, confidence, frame_scores}

# Multimodal (both modalities with learned fusion weights)
result = classify_multimodal(model, image_path_or_pil='face.jpg', text='Caption...')
# result: {prediction, confidence, modality_scores, fusion_weights}
```

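The `'mean'` aggregation used by `classify_video` reduces per-frame scores to one clip-level decision. A sketch with illustrative frame scores (the real scores come from running the visual branch on extracted frames; the 0.5 threshold is an assumption):

```python
import statistics

# Hypothetical per-frame fake probabilities for a 5-frame sample.
frame_scores = [0.92, 0.88, 0.95, 0.10, 0.91]

# 'mean' aggregation: average frame scores, then threshold.
clip_score = statistics.fmean(frame_scores)
prediction = "fake" if clip_score > 0.5 else "real"
print(round(clip_score, 3), prediction)  # 0.752 fake
```

Mean aggregation is robust to a few misclassified frames (the 0.10 outlier above barely moves the clip score), which is why it is a common default over `max`-style aggregation.
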
## Files in this Repository

| File | Description |
|------|-------------|
| `model.py` | Core architecture: GradCAM, EfficientNet, RoBERTa, fusion, aggregation |
| `preprocessing.py` | Data loading, transforms, video frame extraction, tokenization |
| `inference.py` | Inference API: `load_model`, `classify_image`, `classify_text`, `classify_video`, `classify_multimodal` |
| `train.py` | Full training script with separate branch training + ensemble assembly |
| `multimodal_ensemble.pt` | Full ensemble checkpoint (493MB) |
| `visual_branch.pt` | Visual-only checkpoint (15.6MB) |
| `text_branch.pt` | Text-only checkpoint (476MB) |
| `config.json` | Training configuration |
| `requirements.txt` | Python dependencies |
| `gradcam_examples/` | GradCAM explainability visualizations |

## Literature References

- **AWARE-NET** (arXiv:2505.00312): two-tier weighted ensemble
- **CLIP Deepfake Detection** (arXiv:2503.19683): L2-normalized features
- **DeTeCtive** (arXiv:2410.20964): RoBERTa-based AI text detection

## License

Apache-2.0