	---
	license: apache-2.0
	language:
	- en
	tags:
	- audio-classification
	- deepfake-detection
	- audio-deepfake
	- anti-spoofing
	- wav2vec2
	- asvspoof
	datasets:
	- ASVspoof2019
	pipeline_tag: audio-classification
	library_name: pytorch
	---

	# Deepfake Audio Detection — Wav2Vec 2.0 Fine-tuned

	Fine-tuned Wav2Vec 2.0 model for detecting synthetic (deepfake) speech.
	Trained on ASVspoof 2019 LA and evaluated cross-dataset on ASVspoof 2021 LA and WaveFake.

	## Headline Results

	| Evaluation | Equal Error Rate (EER) |
	|---|---|
	| ASVspoof 2019 LA dev (seen attacks A01-A06) | 0.69% |
	| ASVspoof 2019 LA eval (unseen attacks A07-A19) | 5.55% |
	| ASVspoof 2021 LA eval (codec-degraded) | 9.09% |
	| WaveFake (LJSpeech vocoders, mean) | 29.4% |

	On ASVspoof 2021 LA the model's 9.09% EER is on par with the two strongest official
	challenge baselines (LFCC-LCNN at 9.26%, RawNet2 at 9.50%), without codec-specific training augmentation.
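For readers unfamiliar with the metric, the EER is the operating point where the false acceptance rate (spoof accepted as bonafide) equals the false rejection rate (bonafide rejected). The sketch below is a minimal illustration, not the evaluation script used for the numbers above. It assumes that a higher score means "more likely bonafide", and the function name `compute_eer` is hypothetical.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold). Assumes higher score = more likely bonafide."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: a bonafide utterance scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed utterance scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, made up for illustration only.
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

This brute-force scan is quadratic in the number of scores. It is fine for a quick check, but a sorted cumulative sweep would be the better choice on full evaluation sets.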

	## Architecture

	- Backbone: facebook/wav2vec2-base (95M params, 12 transformer layers)
	- Input: raw waveform at 16 kHz, 4-second windows (64,000 samples)
	- Head: mean-pool over time + linear classifier (768 -> 2)
	- Stage 1 training: frozen backbone, classifier head only (1,538 trainable params)
	- Stage 2 training (this checkpoint): top 2 transformer layers + final LayerNorm unfrozen (~14M trainable params)
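The figures in this list can be checked with a little arithmetic. The sketch below assumes the standard wav2vec2-base layer shapes (hidden size 768, feed-forward size 3072, biased linear projections) and assumes that the classifier head stays trainable in stage 2. Neither assumption is stated in this card.

```python
SAMPLE_RATE = 16_000   # Hz
WINDOW_SECONDS = 4
HIDDEN = 768           # wav2vec2-base hidden size
FFN = 3072             # feed-forward inner size (assumed standard base config)
NUM_CLASSES = 2


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm_params(dim):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


window_samples = SAMPLE_RATE * WINDOW_SECONDS

# Stage 1: only the linear head (768 -> 2) is trained.
head = linear_params(HIDDEN, NUM_CLASSES)

# One transformer layer: q, k, v, out projections, a two-layer
# feed-forward block, and two LayerNorms.
attention = 4 * linear_params(HIDDEN, HIDDEN)
feed_forward = linear_params(HIDDEN, FFN) + linear_params(FFN, HIDDEN)
per_layer = attention + feed_forward + 2 * layer_norm_params(HIDDEN)

# Stage 2: top 2 layers + final LayerNorm + head (head assumed trainable).
stage2 = 2 * per_layer + layer_norm_params(HIDDEN) + head

print(window_samples, head, per_layer, stage2)
```

Under these assumptions the head comes to 1,538 parameters and stage 2 to about 14.2M, consistent with the figures quoted in the list.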

	## Training Details

	- Dataset: ASVspoof 2019 LA training partition (25,380 utterances)
	- Class weighting: bonafide=4.92, spoof=0.56 (compensates for ~9:1 spoof:bonafide ratio)
	- Optimizer: AdamW
	- Learning rate: 1e-5 with 10% warmup + linear decay
	- Batch size: 16
	- Mixed precision: fp16
	- Gradient clipping: 1.0
	- Epochs: 10 (best at epoch 9)
	- Wall clock: 2h 56m on a single T4 GPU
	- Best dev EER: 0.69%
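The class weights and learning-rate schedule above can be reproduced with a short sketch. The per-class counts (2,580 bonafide and 22,800 spoof) come from the ASVspoof 2019 LA training protocol rather than from this card. The "balanced" weighting formula, the step count (no dropped last batch), and the helper names are assumptions for illustration, not the training code itself.

```python
import math


def balanced_class_weights(counts):
    """Weight each class by total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


def lr_at_step(step, total_steps, base_lr=1e-5, warmup_frac=0.10):
    """Linear warmup to base_lr, then linear decay to zero."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / decay_steps)


# Assumed per-class counts; they sum to the 25,380 utterances listed above.
train_counts = {"bonafide": 2580, "spoof": 22800}
weights = balanced_class_weights(train_counts)

# Assumed step count: batch size 16, 10 epochs, last partial batch kept.
steps_per_epoch = math.ceil(sum(train_counts.values()) / 16)
total_steps = steps_per_epoch * 10

print({label: round(w, 2) for label, w in weights.items()})
print(lr_at_step(total_steps // 10, total_steps))
```

With these counts the weights round to the 4.92 and 0.56 listed above, which suggests (but does not confirm) that a balanced weighting of this kind was used.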

	## Usage

	```python
	from huggingface_hub import hf_hub_download

	# DeepfakeDetector lives in the source repo (linked below); clone it first
	# and make sure `src` is importable from where you run this script.
	from src.inference.predict import DeepfakeDetector

	# Download the checkpoint
	ckpt_path = hf_hub_download(
	    repo_id="Sara1708/deepfake-audio-wav2vec2",
	    filename="stage2_best.pt",
	)

	# Load using the inference wrapper from the source repo
	detector = DeepfakeDetector(checkpoint_path=ckpt_path, device="cpu")
	result = detector.predict("path/to/audio.wav")
	print(result)
	```
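The wrapper above takes a file path. If you feed waveforms to the model directly, they need to match the 16 kHz, 4-second input format described under Architecture. The sketch below shows one possible way to cut a mono clip into 64,000-sample windows with zero-padding. It is an assumption for illustration, not the repository's preprocessing, and `split_into_windows` is a hypothetical helper.

```python
import numpy as np

SAMPLE_RATE = 16_000
WINDOW_SAMPLES = 4 * SAMPLE_RATE  # 64,000 samples per window


def split_into_windows(waveform, window=WINDOW_SAMPLES):
    """Split a mono 16 kHz waveform into fixed windows, zero-padding the tail."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim != 1:
        raise ValueError("expected a mono 1-D waveform")
    # Ceiling division, with at least one window even for empty input.
    n_windows = max(1, -(-len(waveform) // window))
    padded = np.zeros(n_windows * window, dtype=np.float32)
    padded[: len(waveform)] = waveform
    return padded.reshape(n_windows, window)


# A 6.25-second dummy clip becomes two windows; the second is partly padding.
clip = np.ones(100_000, dtype=np.float32)
print(split_into_windows(clip).shape)
```

How per-window predictions are combined into a clip-level score (averaging, max, and so on) is a separate design choice that this card does not specify.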

	The full source code, training notebooks, and evaluation scripts are at:
	[github.com/Saracasm/deepfake-audio-detection](https://github.com/Saracasm/deepfake-audio-detection)

	Live demo: [huggingface.co/spaces/Sara1708/deepfake-audio-detector](https://huggingface.co/spaces/Sara1708/deepfake-audio-detector)
	(Space link will be live after deployment.)

	## Limitations

	- WaveFake performance is poor (~29% EER on LJSpeech-based vocoders). This model was trained only on ASVspoof attack types and does not generalize well to standalone neural vocoder pipelines (HiFi-GAN, MelGAN, WaveGlow, etc.).
	- Codec sensitivity: aggressive lossy compression (GSM, PSTN telephone codecs) raises EER by about 6 percentage points relative to uncompressed audio.
	- The A10 attack is a known weakness (15.54% EER on this attack alone).
	- This is a research artifact, not a production deepfake detector. Real-world deepfakes may use synthesis methods this model has never seen.

	## Citation

	If you use this model, please cite the underlying datasets:

	- ASVspoof 2019: Wang et al., 2020. "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech." Computer Speech & Language.
	- ASVspoof 2021: Yamagishi et al., 2021. "ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection."
	- WaveFake: Frank & Schönherr, 2021. "WaveFake: A Data Set to Facilitate Audio Deepfake Detection."
	- Wav2Vec 2.0: Baevski et al., 2020. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations."

	## Authors

	Sara Iqbal (23K-0669) and Areeba Arif (23K-0618).
	Spring 2026 Deep Learning Project at FAST-NUCES.