# Multimodal Deepfake Detection - Data Card
## Datasets Used
### Visual: Hemg/deepfake-and-real-images
- **Source**: https://huggingface.co/datasets/Hemg/deepfake-and-real-images
- **Size**: ~528K images (140K+ used for training)
- **Labels**: Real=1, Fake=0 (flipped to Real=0, Fake=1 during training)
- **Content**: Face images (real photographs vs deepfakes)
- **Preprocessing**: Resize(224x224), ImageNet normalization, augmentation pipeline (flip, rotation, color jitter, Gaussian blur)
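A minimal `torchvision` sketch of this pipeline; the augmentation parameters below are illustrative assumptions, not the repo's exact values:

```python
from torchvision import transforms

# Illustrative values; see preprocessing.py for the actual pipeline.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.GaussianBlur(kernel_size=3),
    transforms.ToTensor(),
    # ImageNet channel statistics
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```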
### Text: artem9k/ai-text-detection-pile
- **Source**: https://huggingface.co/datasets/artem9k/ai-text-detection-pile
- **Size**: 1.88GB, ~1M samples (20K+ used for training)
- **Labels**: human, ai
- **Content**: Essays, reports, news articles from human authors and AI models
- **Preprocessing**: RoBERTa tokenizer, max_length=512, truncation/padding
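A minimal sketch of this tokenization step with the Hugging Face `transformers` library (the sample string is illustrative):

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
encoded = tokenizer(
    "Sample essay text...",
    max_length=512,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
# encoded["input_ids"] and encoded["attention_mask"] feed the text branch.
```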
## Results
| Component | Validation Accuracy | Notes |
|-----------|--------------------|-------|
| Visual Branch (EfficientNet-B0) | ~73% | 3 epochs on a 1K-image subset |
| Text Branch (RoBERTa-base) | ~75% | 3 epochs on a 500-text subset |
| Multimodal Ensemble | – | Combines visual and text predictions with learnable weights; fusion trained on a CPU subset |
## Architecture
- **Visual**: EfficientNet-B0 (5.3M params) + L2 Norm + Dropout + Linear Classifier
- **Text**: RoBERTa-base (125M params) + Mean Pooling + MLP Head
- **Fusion**: Learnable weighted averaging, with optional cross-modal attention (see the sketch after this list)
- **Explainability**: GradCAM on last EfficientNet convolutional block
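A minimal sketch of the learnable weighted-average fusion; the class and attribute names here are assumptions, and the optional cross-modal attention is not reproduced:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Learnable weighted average of per-modality scores (illustrative)."""

    def __init__(self, num_modalities: int = 2):
        super().__init__()
        # One learnable logit per modality; softmax keeps the weights
        # positive and summing to 1.
        self.logits = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, visual_score: torch.Tensor, text_score: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.logits, dim=0)
        return w[0] * visual_score + w[1] * text_score
```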
## Training Configuration
| Parameter | Visual | Text |
|-----------|--------|------|
| Backbone | EfficientNet-B0 | RoBERTa-base |
| Optimizer | Adam | AdamW |
| Learning Rate | 1e-4 | 2e-5 |
| Weight Decay | 1e-4 | 0.01 |
| Epochs | 8 (full), 3 (compact) | 5 (full), 3 (compact) |
| Batch Size | 32 | 16 |
| Image Size | 224x224 | - |
| Text Length | - | 512 |
| Augmentation | Flip, Rot, ColorJitter, GaussianBlur, RandomErasing | - |
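The optimizer rows translate directly to PyTorch; a sketch with stand-in modules (`visual_model` and `text_model` here are placeholders for the actual branches in `train.py`):

```python
import torch
import torch.nn as nn

# Stand-in modules for illustration only.
visual_model = nn.Linear(1280, 2)  # EfficientNet-B0 features -> 2 classes
text_model = nn.Linear(768, 2)     # RoBERTa-base features -> 2 classes

visual_opt = torch.optim.Adam(visual_model.parameters(), lr=1e-4, weight_decay=1e-4)
text_opt = torch.optim.AdamW(text_model.parameters(), lr=2e-5, weight_decay=0.01)
```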
## Inference API
```python
from inference import load_model, classify_image, classify_text, classify_video, classify_multimodal
model, config = load_model('multimodal_ensemble.pt')
# Image + GradCAM explainability
result = classify_image(model, 'face.jpg', return_gradcam=True)
# result: {prediction, confidence, gradcam}; gradcam is a 224x224 heatmap
# Text
result = classify_text(model, 'This essay was written by...')
# result: {prediction, confidence}
# Video (aggregated from frame classifications)
result = classify_video(model, 'video.mp4', num_frames=32, aggregation='mean')
# result: {prediction, confidence, frame_scores}
# Multimodal (both modalities with learned fusion weights)
result = classify_multimodal(model, image_path_or_pil='face.jpg', text='Caption...')
# result: {prediction, confidence, modality_scores, fusion_weights}
```
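For video, `classify_video` aggregates per-frame scores into a single prediction. A sketch of what `aggregation='mean'` could look like (illustrative values, not the repo's implementation):

```python
import numpy as np

# Hypothetical fake-probabilities for a few sampled frames.
frame_scores = np.array([0.12, 0.85, 0.91, 0.77, 0.64])
video_score = frame_scores.mean()                # aggregation='mean'
prediction = "fake" if video_score >= 0.5 else "real"
print(prediction, round(float(video_score), 2))  # fake 0.66
```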
## Files in this Repository
| File | Description |
|------|-------------|
| `model.py` | Core architecture: GradCAM, EfficientNet, RoBERTa, Fusion, Aggregation |
| `preprocessing.py` | Data loading, transforms, video frame extraction, tokenization |
| `inference.py` | Inference API: load_model, classify_image, classify_text, classify_video, classify_multimodal |
| `train.py` | Full training script with separate branch training + ensemble assembly |
| `multimodal_ensemble.pt` | Full ensemble checkpoint (493MB) |
| `visual_branch.pt` | Visual-only checkpoint (15.6MB) |
| `text_branch.pt` | Text-only checkpoint (476MB) |
| `config.json` | Training configuration |
| `requirements.txt` | Python dependencies |
| `gradcam_examples/` | GradCAM explainability visualizations |
## Literature References
- **AWARE-NET** (arXiv:2505.00312): Two-tier weighted ensemble
- **CLIP Deepfake Detection** (arXiv:2503.19683): L2-normalized features
- **DeTeCtive** (arXiv:2410.20964): RoBERTa-based AI text detection
## License
Apache-2.0