alianassmaaa committed on
Commit
796428c
·
verified ·
1 Parent(s): cd95a81

Upload README.md

Files changed (1)
  1. README.md +80 -0
README.md ADDED
@@ -0,0 +1,80 @@
+ ---
+ license: apache-2.0
+ tags:
+ - deepfake-detection
+ - multimodal
+ - image-classification
+ - text-classification
+ - ensemble
+ - gradcam
+ - explainability
+ datasets:
+ - Hemg/deepfake-and-real-images
+ - artem9k/ai-text-detection-pile
+ pipeline_tag: image-classification
+ ---
+
+ # Multimodal Deepfake Detection Model
+
+ A multimodal ensemble model that classifies images, video frames, and text as **real** or **AI-generated/fake**, with confidence scores and GradCAM explainability maps.
+
+ ## Architecture
+
+ - **Visual Branch**: EfficientNet-B0 (ImageNet-pretrained) with L2-normalized features for image and video-frame classification
+ - **Text Branch**: RoBERTa-base with mean pooling and an MLP head for AI-generated text detection
+ - **Fusion Layer**: Learnable weighted late ensemble combining the visual and text probabilities
+ - **Explainability**: GradCAM heatmaps computed on the EfficientNet convolutional layers
+
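To make the fusion layer concrete, here is a minimal sketch of a weighted late ensemble over two branch probabilities. It assumes the learnable weights are stored as two logits and normalized with a softmax; the actual parameterization in `model.py` may differ, and the names `softmax`, `fuse_probabilities`, and `fusion_logits` are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_probabilities(p_visual, p_text, fusion_logits):
    """Combine per-branch 'fake' probabilities with normalized weights.

    fusion_logits is a stand-in (assumption) for the learnable parameters;
    the softmax keeps both weights positive and makes them sum to 1.
    """
    w_visual, w_text = softmax(fusion_logits)
    p_fused = w_visual * p_visual + w_text * p_text
    return p_fused, {"visual": w_visual, "text": w_text}


# Equal logits give equal weights, so the fused score is the plain average.
p_fused, weights = fuse_probabilities(0.9, 0.6, [0.0, 0.0])
label = "fake" if p_fused >= 0.5 else "real"
```

During training, the logits would be updated by gradient descent together with the rest of the model, so a more reliable branch can end up with a larger weight.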
+ ## Usage
+
+ ```python
+ from inference import load_model, classify_image, classify_text, classify_video, classify_multimodal
+
+ model, config = load_model('multimodal_ensemble.pt', device='cuda')
+
+ # Image with GradCAM explainability
+ result = classify_image(model, 'face.jpg', device='cuda', return_gradcam=True)
+ print(f"Prediction: {result['prediction']} (confidence: {result['confidence']:.2%})")
+ # result['gradcam'] contains the explainability heatmap
+
+ # Text classification
+ result = classify_text(model, 'This text was written by...')
+ print(f"Prediction: {result['prediction']} (confidence: {result['confidence']:.2%})")
+
+ # Video classification
+ result = classify_video(model, 'video.mp4', num_frames=32, aggregation='mean')
+ print(f"Video: {result['prediction']} (confidence: {result['confidence']:.2%})")
+
+ # Multimodal (image + text)
+ result = classify_multimodal(model, image_path_or_pil='face.jpg', text='Caption...')
+ print(f"Combined: {result['prediction']} (fusion weights: {result['fusion_weights']})")
+ ```
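The `num_frames` and `aggregation` arguments of `classify_video` suggest that a fixed number of frames is sampled and the per-frame scores are then pooled. The sketch below shows one plausible way to do that, under the assumption that frames are sampled evenly and that each frame yields a "fake" probability. The helpers `sample_frame_indices` and `aggregate_frame_scores`, the 0.5 threshold, and the `"max"` option are hypothetical and not taken from `inference.py`.

```python
def sample_frame_indices(total_frames, num_frames=32):
    """Pick up to num_frames indices spread evenly across the video."""
    if total_frames <= 0:
        return []
    count = min(num_frames, total_frames)
    step = total_frames / count
    # Use the middle frame of each of the `count` equal-length segments.
    return [int(step * i + step / 2) for i in range(count)]


def aggregate_frame_scores(fake_probs, aggregation="mean"):
    """Pool per-frame 'fake' probabilities into one video-level result."""
    if not fake_probs:
        raise ValueError("no frame scores to aggregate")
    if aggregation == "mean":
        score = sum(fake_probs) / len(fake_probs)
    elif aggregation == "max":  # hypothetical alternative to 'mean'
        score = max(fake_probs)
    else:
        raise ValueError(f"unknown aggregation: {aggregation}")
    prediction = "fake" if score >= 0.5 else "real"
    confidence = score if prediction == "fake" else 1.0 - score
    return {"prediction": prediction, "confidence": confidence}


indices = sample_frame_indices(total_frames=100, num_frames=4)
video_result = aggregate_frame_scores([0.1, 0.2, 0.9], aggregation="mean")
```

Mean pooling dampens the effect of a single misclassified frame, while max pooling reacts to any one suspicious frame; which behavior is preferable depends on how localized the manipulation is expected to be.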
+
+ ## Training
+
+ ### Datasets
+ - **Visual**: [Hemg/deepfake-and-real-images](https://huggingface.co/datasets/Hemg/deepfake-and-real-images): 140K+ face images (real vs. deepfake)
+ - **Text**: [artem9k/ai-text-detection-pile](https://huggingface.co/datasets/artem9k/ai-text-detection-pile): 1.9 GB of human-written vs. AI-generated text
+
+ ### Recipe
+ | Component | Config |
+ |-----------|--------|
+ | Visual backbone | EfficientNet-B0 |
+ | Visual optimizer | Adam, lr=1e-4, cosine annealing, 8 epochs |
+ | Text backbone | RoBERTa-base |
+ | Text optimizer | AdamW, lr=2e-5, warmup + cosine, 5 epochs |
+ | Augmentations | RandomFlip, Rotation, ColorJitter, GaussianBlur, RandomErasing |
+
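The two learning-rate schedules in the recipe can be illustrated with a single function. This is a sketch only: the table gives the peak learning rates, the schedule types, and the epoch counts, but not the warmup length, the minimum learning rate, or whether the schedule is stepped per epoch or per batch, so `warmup_steps`, `min_lr`, and the step counts below are assumptions, and `lr_at_step` is a hypothetical helper rather than code from `train.py`.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Visual branch: cosine annealing from 1e-4, stepped once per epoch (assumed).
visual_lrs = [lr_at_step(epoch, 8, 1e-4) for epoch in range(8)]

# Text branch: warmup + cosine from 2e-5; 100 steps with 10 warmup steps
# are placeholder values chosen for illustration.
text_lrs = [lr_at_step(s, 100, 2e-5, warmup_steps=10) for s in range(100)]
```

With `warmup_steps=0` the function reduces to plain cosine annealing, which is why one helper covers both rows of the table.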
+ ### Based on Research
+ - **AWARE-NET** (arxiv:2505.00312): Learnable weighted fusion
+ - **CLIP Deepfake** (arxiv:2503.19683): L2-normalized feature spaces
+ - **DeTeCtive** (arxiv:2410.20964): RoBERTa for AI text detection
+
+ ## Files
+ - `model.py`: Architecture (GradCAM, EfficientNet, RoBERTa, Fusion)
+ - `preprocessing.py`: Data pipeline (images, video frames, text)
+ - `inference.py`: Inference API (single-modality, multimodal, video)
+ - `train.py`: Training script
+
+ ## License
+ Apache-2.0