---
license: apache-2.0
tags:
- deepfake-detection
- multimodal
- image-classification
- text-classification
- ensemble
- gradcam
- explainability
datasets:
- Hemg/deepfake-and-real-images
- artem9k/ai-text-detection-pile
pipeline_tag: image-classification
---

# Multimodal Deepfake Detection Model

A multimodal ensemble model that classifies images, video frames, and text as **real** or **AI-generated/fake**, with confidence scores and GradCAM explainability maps.

## Architecture

- **Visual Branch**: EfficientNet-B0 (ImageNet-pretrained) with L2-normalized features, used for image and video-frame classification
- **Text Branch**: RoBERTa-base with mean pooling and an MLP head, used for AI-generated text detection
- **Fusion Layer**: Learnable weighted late ensemble that combines the visual and text probabilities
- **Explainability**: GradCAM heatmaps computed on EfficientNet convolutional layers
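
The fusion layer can be pictured with a small framework-free sketch. It assumes the learnable weights are two raw scores passed through a softmax so that they stay positive and sum to 1; the README does not state the actual parametrization. `fuse_probabilities`, its arguments, and the `"fake"`/`"real"` label strings are hypothetical and are not taken from `model.py`.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_probabilities(p_fake_visual, p_fake_text, raw_weights=(0.0, 0.0)):
    """Late fusion: weighted average of the per-branch P(fake) values.

    raw_weights stands in for the learnable parameters (an assumption, not
    the repository's code); training would update them by gradient descent.
    """
    w_visual, w_text = softmax(list(raw_weights))
    p_fake = w_visual * p_fake_visual + w_text * p_fake_text
    return {
        "prediction": "fake" if p_fake >= 0.5 else "real",
        "confidence": max(p_fake, 1.0 - p_fake),
        "fusion_weights": {"visual": w_visual, "text": w_text},
    }


result = fuse_probabilities(0.9, 0.3)
print(result["prediction"], result["fusion_weights"])
```

Because the fusion operates on branch outputs rather than on features, either branch can in principle be evaluated on its own when only one modality is available.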

## Usage

```python
from inference import load_model, classify_image, classify_text, classify_video, classify_multimodal

model, config = load_model('multimodal_ensemble.pt', device='cuda')

# Image with GradCAM explainability
result = classify_image(model, 'face.jpg', device='cuda', return_gradcam=True)
print(f"Prediction: {result['prediction']} (confidence: {result['confidence']:.2%})")
# result['gradcam'] contains the explainability heatmap

# Text classification
result = classify_text(model, 'This text was written by...')
print(f"Prediction: {result['prediction']} (confidence: {result['confidence']:.2%})")

# Video classification
result = classify_video(model, 'video.mp4', num_frames=32, aggregation='mean')
print(f"Video: {result['prediction']} (confidence: {result['confidence']:.2%})")

# Multimodal (image + text)
result = classify_multimodal(model, image_path_or_pil='face.jpg', text='Caption...')
print(f"Combined: {result['prediction']}, weights: {result['fusion_weights']}")
```
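
For video, the `num_frames=32` and `aggregation='mean'` arguments suggest that frames are sampled from the clip, scored one by one, and averaged. The sketch below illustrates that idea under the assumption of evenly spaced sampling. `sample_frame_indices` and `aggregate_frame_scores` are hypothetical helpers, and `inference.py` may work differently.

```python
def sample_frame_indices(total_frames, num_frames=32):
    """Pick up to num_frames evenly spaced frame indices from a clip."""
    if total_frames <= 0:
        return []
    count = min(num_frames, total_frames)
    # Take the centre of each of `count` equal-length segments.
    return [int((i + 0.5) * total_frames / count) for i in range(count)]


def aggregate_frame_scores(frame_probs, aggregation="mean"):
    """Combine per-frame P(fake) values into one video-level score."""
    if not frame_probs:
        raise ValueError("no frame scores to aggregate")
    if aggregation == "mean":
        return sum(frame_probs) / len(frame_probs)
    if aggregation == "max":
        # Assumed alternative (not confirmed by the README): flag the clip
        # if any single frame looks fake.
        return max(frame_probs)
    raise ValueError(f"unknown aggregation: {aggregation}")


indices = sample_frame_indices(total_frames=300, num_frames=4)
video_score = aggregate_frame_scores([0.2, 0.8, 0.6, 0.4])
print(indices, round(video_score, 2))
```

Mean aggregation smooths out isolated misclassified frames, which is why it is a common default for frame-level detectors.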

## Training

### Datasets
- **Visual**: [Hemg/deepfake-and-real-images](https://huggingface.co/datasets/Hemg/deepfake-and-real-images), 140K+ face images (real vs. deepfake)
- **Text**: [artem9k/ai-text-detection-pile](https://huggingface.co/datasets/artem9k/ai-text-detection-pile), 1.9 GB of human-written vs. AI-generated text

### Recipe
| Component | Config |
|-----------|--------|
| Visual backbone | EfficientNet-B0 |
| Visual optimizer | Adam, lr=1e-4, cosine annealing, 8 epochs |
| Text backbone | RoBERTa-base |
| Text optimizer | AdamW, lr=2e-5, warmup+cosine, 5 epochs |
| Augmentations | RandomFlip, Rotation, ColorJitter, GaussianBlur, RandomErasing |
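
The "warmup+cosine" entry for the text branch can be illustrated with a plain-Python schedule. This is a sketch under assumptions: the table gives neither the warmup length nor the final learning rate, so `warmup_steps` and `min_lr` below are placeholder values, and `warmup_cosine_lr` is an illustrative name rather than a function from `train.py`.

```python
import math


def warmup_cosine_lr(step, total_steps, base_lr=2e-5, warmup_steps=100, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly from base_lr / warmup_steps to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 1000
for step in (0, 99, 100, 550, 1000):
    print(step, f"{warmup_cosine_lr(step, total):.2e}")
```

In PyTorch, a function like this could be wired into `torch.optim.lr_scheduler.LambdaLR` by returning the multiplier (the value divided by `base_lr`) instead of the absolute learning rate.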

### Based on Research
- **AWARE-NET** (arXiv:2505.00312): Learnable weighted fusion
- **CLIP Deepfake** (arXiv:2503.19683): L2-normalized feature spaces
- **DeTeCtive** (arXiv:2410.20964): RoBERTa for AI text detection
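
Two of the building blocks named in this README, L2-normalized features and mean pooling over token embeddings, are simple enough to show without a deep-learning framework. The sketch below uses plain Python lists in place of tensors. The function names are illustrative and do not come from `model.py`, and the handling of padding through an attention mask is an assumption about how the pooling is done.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    kept = [emb for emb, m in zip(token_embeddings, attention_mask) if m]
    if not kept:
        return [0.0] * dim
    return [sum(emb[d] for emb in kept) / len(kept) for d in range(dim)]


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row is padding
pooled = masked_mean_pool(tokens, attention_mask=[1, 1, 0])
print(pooled)                    # [2.0, 3.0]
print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```

After L2 normalization, the dot product of two feature vectors equals their cosine similarity, so feature magnitude no longer influences the comparison.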

## Files
- `model.py`: Architecture (GradCAM, EfficientNet, RoBERTa, Fusion)
- `preprocessing.py`: Data pipeline (images, video frames, text)
- `inference.py`: Inference API (single-modality, multimodal, video)
- `train.py`: Training script

## License
Apache-2.0