alianassmaaa committed on
Commit
fffc143
·
verified ·
1 Parent(s): 4633878

Add data card with metrics and usage

  1. data_card.md +94 -0
data_card.md ADDED
@@ -0,0 +1,94 @@
+ # Multimodal Deepfake Detection - Data Card
+
+ ## Datasets Used
+
+ ### Visual: Hemg/deepfake-and-real-images
+ - **Source**: https://huggingface.co/datasets/Hemg/deepfake-and-real-images
+ - **Size**: ~528K images (140K+ used for training)
+ - **Labels**: Real=1, Fake=0 (flipped to Real=0, Fake=1 during training)
+ - **Content**: Face images (real photographs vs deepfakes)
+ - **Preprocessing**: Resize(224x224), ImageNet normalization, augmentation pipeline (flip, rotation, color jitter, Gaussian blur)
+
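The label flip and ImageNet normalization listed above can be illustrated with a small, dependency-free sketch. The helper names `flip_label` and `normalize_pixel` are hypothetical and not taken from `preprocessing.py`; the mean and standard deviation values are the standard ImageNet channel statistics, and the actual pipeline presumably applies them to whole tensors rather than single pixels.

```python
# Hypothetical helpers for illustration only; names are not from the repository.
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # standard ImageNet channel means (RGB)
IMAGENET_STD = (0.229, 0.224, 0.225)   # standard ImageNet channel stds (RGB)


def flip_label(source_label: int) -> int:
    """Map the dataset convention (Real=1, Fake=0) to the training one (Real=0, Fake=1)."""
    if source_label not in (0, 1):
        raise ValueError(f"expected a binary label, got {source_label!r}")
    return 1 - source_label


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )
```

After the flip, a positive prediction (label 1) means "fake", which matches the usual detection convention where the positive class is the one being detected.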
+ ### Text: artem9k/ai-text-detection-pile
+ - **Source**: https://huggingface.co/datasets/artem9k/ai-text-detection-pile
+ - **Size**: 1.88GB, ~1M samples (20K+ used for training)
+ - **Labels**: human, ai
+ - **Content**: Essays, reports, news articles from human authors and AI models
+ - **Preprocessing**: RoBERTa tokenizer, max_length=512, truncation/padding
+
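To make the truncation/padding step concrete, here is a minimal sketch that operates on already-encoded token ids. It is an illustration under assumptions, not the repository's tokenization code: `pad_or_truncate` is a hypothetical helper, and it only mimics the right-truncation and right-padding that a RoBERTa tokenizer would apply with `max_length=512`. The special token ids (0, 1, 2) are those of the RoBERTa-base vocabulary.

```python
# RoBERTa-base special token ids: <s> = 0, <pad> = 1, </s> = 2
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2


def pad_or_truncate(token_ids, max_length=512):
    """Wrap token ids in <s> ... </s>, truncate to max_length, then right-pad.

    Returns (input_ids, attention_mask), both of length max_length.
    Hypothetical helper; a real pipeline would call the tokenizer directly.
    """
    body = list(token_ids)[: max_length - 2]  # leave room for <s> and </s>
    input_ids = [BOS_ID] + body + [EOS_ID]
    attention_mask = [1] * len(input_ids)     # 1 marks real tokens
    padding = max_length - len(input_ids)
    return input_ids + [PAD_ID] * padding, attention_mask + [0] * padding
```

The attention mask matters downstream: a mean-pooling head should average only over positions where the mask is 1, so padding does not dilute the sentence representation.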
+ ## Results
+
+ | Component | Validation Accuracy | Notes |
+ |-----------|---------------------|-------|
+ | Visual Branch (EfficientNet-B0) | ~73% | 3 epochs on a 1K-image subset |
+ | Text Branch (RoBERTa-base) | ~75% | 3 epochs on a 500-sample text subset |
+ | Multimodal Ensemble | Not reported | Combines visual + text with learnable weights; fusion on CPU subset |
+
+ ## Architecture
+
+ - **Visual**: EfficientNet-B0 (5.3M params) + L2 Norm + Dropout + Linear Classifier
+ - **Text**: RoBERTa-base (125M params) + Mean Pooling + MLP Head
+ - **Fusion**: Learnable weighted averaging with cross-modal attention (optional)
+ - **Explainability**: GradCAM on last EfficientNet convolutional block
+
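The fusion step can be sketched in plain Python. This is an assumption-laden illustration rather than the code in `model.py`: the card does not say how the fusion weights are parametrized, so the sketch assumes two learnable logits passed through a softmax, and it leaves out the optional cross-modal attention. The names `fuse` and `weight_logits` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(visual_prob, text_prob, weight_logits=(0.0, 0.0)):
    """Weighted average of the per-branch fake probabilities.

    weight_logits stands in for the learnable parameters (assumed form);
    the softmax keeps the two weights positive and summing to 1.
    """
    w_visual, w_text = softmax(list(weight_logits))
    fused = w_visual * visual_prob + w_text * text_prob
    return fused, (w_visual, w_text)
```

With equal logits the result is a plain average of the two branches; during training, gradient updates to the logits would shift weight toward whichever branch is more reliable.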
+ ## Training Configuration
+
+ | Parameter | Visual | Text |
+ |-----------|--------|------|
+ | Backbone | EfficientNet-B0 | RoBERTa-base |
+ | Optimizer | Adam | AdamW |
+ | Learning Rate | 1e-4 | 2e-5 |
+ | Weight Decay | 1e-4 | 0.01 |
+ | Epochs | 8 (full), 3 (compact) | 5 (full), 3 (compact) |
+ | Batch Size | 32 | 16 |
+ | Image Size | 224x224 | - |
+ | Text Length | - | 512 |
+ | Augmentation | Flip, Rotation, ColorJitter, GaussianBlur, RandomErasing | - |
+
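As a rough sanity check on the scale of the compact runs, the number of optimizer steps follows from the subset size and batch size. The sketch below assumes that the subsets quoted in the Results table (1K images, 500 texts) were trained with the batch sizes listed here and that the final partial batch is kept; neither is stated in the card.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Batches per epoch when the last partial batch is kept (assumption)."""
    return math.ceil(num_samples / batch_size)


# Subset sizes from the Results table, batch sizes from the table above.
visual_steps = steps_per_epoch(1000, 32) * 3  # 3 compact epochs
text_steps = steps_per_epoch(500, 16) * 3     # 3 compact epochs
```

Under these assumptions both branches take the same number of steps per epoch, since 1000/32 and 500/16 both round up to 32.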
+ ## Inference API
+
+ ```python
+ from inference import load_model, classify_image, classify_text, classify_video, classify_multimodal
+
+ model, config = load_model('multimodal_ensemble.pt')
+
+ # Image + GradCAM explainability
+ result = classify_image(model, 'face.jpg', return_gradcam=True)
+ # result: {prediction, confidence, gradcam: (224, 224)}
+
+ # Text
+ result = classify_text(model, 'This essay was written by...')
+ # result: {prediction, confidence}
+
+ # Video (aggregated from frame classifications)
+ result = classify_video(model, 'video.mp4', num_frames=32, aggregation='mean')
+ # result: {prediction, confidence, frame_scores}
+
+ # Multimodal (both modalities with learned fusion weights)
+ result = classify_multimodal(model, image_path_or_pil='face.jpg', text='Caption...')
+ # result: {prediction, confidence, modality_scores, fusion_weights}
+ ```
+
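The video path aggregates per-frame classifications into one verdict. A minimal sketch of how that could work is shown below; it is not the implementation in `inference.py`. The helper names, the even frame sampling, the 0.5 threshold, and the reading of each score as a "fake" probability are all assumptions, and only the `'mean'` aggregation shown in the API example is covered.

```python
def sample_frame_indices(total_frames, num_frames=32):
    """Pick up to num_frames indices spread evenly over the video (assumed strategy)."""
    if total_frames <= 0:
        return []
    count = min(num_frames, total_frames)
    step = total_frames / count
    return [int(i * step) for i in range(count)]


def aggregate_frame_scores(frame_scores, aggregation="mean", threshold=0.5):
    """Combine per-frame fake probabilities into a single result dict.

    Mirrors the documented result keys (prediction, confidence, frame_scores);
    the threshold and the confidence definition are assumptions.
    """
    if not frame_scores:
        raise ValueError("no frame scores to aggregate")
    if aggregation != "mean":
        raise ValueError("only 'mean' is sketched here")
    score = sum(frame_scores) / len(frame_scores)
    prediction = "fake" if score >= threshold else "real"
    confidence = score if prediction == "fake" else 1.0 - score
    return {
        "prediction": prediction,
        "confidence": confidence,
        "frame_scores": list(frame_scores),
    }
```

Mean aggregation smooths out isolated misclassified frames, at the cost of possibly diluting a manipulation that appears in only a short segment of the video.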
+ ## Files in this Repository
+
+ | File | Description |
+ |------|-------------|
+ | `model.py` | Core architecture: GradCAM, EfficientNet, RoBERTa, Fusion, Aggregation |
+ | `preprocessing.py` | Data loading, transforms, video frame extraction, tokenization |
+ | `inference.py` | Inference API: load_model, classify_image, classify_text, classify_video, classify_multimodal |
+ | `train.py` | Full training script with separate branch training + ensemble assembly |
+ | `multimodal_ensemble.pt` | Full ensemble checkpoint (493MB) |
+ | `visual_branch.pt` | Visual-only checkpoint (15.6MB) |
+ | `text_branch.pt` | Text-only checkpoint (476MB) |
+ | `config.json` | Training configuration |
+ | `requirements.txt` | Python dependencies |
+ | `gradcam_examples/` | GradCAM explainability visualizations |
+
+ ## Literature References
+
+ - **AWARE-NET** (arXiv:2505.00312): Two-tier weighted ensemble
+ - **CLIP Deepfake Detection** (arXiv:2503.19683): L2-normalized features
+ - **DeTeCtive** (arXiv:2410.20964): RoBERTa-based AI text detection
+
+ ## License
+ Apache-2.0