---
license: apache-2.0
language:
- en
tags:
- audio-classification
- deepfake-detection
- audio-deepfake
- anti-spoofing
- wav2vec2
- asvspoof
datasets:
- ASVspoof2019
pipeline_tag: audio-classification
library_name: pytorch
---

# Deepfake Audio Detection — Wav2Vec 2.0 Fine-tuned

A fine-tuned Wav2Vec 2.0 model for detecting synthetic (deepfake) speech.
It was trained on ASVspoof 2019 LA and evaluated cross-dataset on ASVspoof 2021 LA and WaveFake.

## Headline Results

| Evaluation | Equal Error Rate (EER) |
|---|---|
| ASVspoof 2019 LA dev (seen attacks A01-A06) | **0.69%** |
| ASVspoof 2019 LA eval (unseen attacks A07-A19) | **5.55%** |
| ASVspoof 2021 LA eval (codec-degraded) | **9.09%** |
| WaveFake (LJSpeech vocoders, mean) | 29.4% |

On ASVspoof 2021 LA the model is on par with two published baselines
(LFCC-LCNN at 9.26% EER, RawNet2 at 9.50% EER) without codec-specific training augmentation.
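
For readers unfamiliar with the metric, the sketch below shows one common way to compute an Equal Error Rate from detector scores. It is an illustration only, not the evaluation script used for the table above. The function name `compute_eer` and the convention that a higher score means "more likely bonafide" are assumptions.

```python
from bisect import bisect_left


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming higher score = more likely bonafide.

    A trial is accepted as bonafide when score >= threshold.
    """
    bona = sorted(bonafide_scores)
    spoof = sorted(spoof_scores)
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = sorted(set(bona) | set(spoof)) + [float("inf")]

    best = None
    for t in thresholds:
        # False rejection rate: bonafide trials scoring below the threshold.
        frr = bisect_left(bona, t) / len(bona)
        # False acceptance rate: spoof trials scoring at or above the threshold.
        far = (len(spoof) - bisect_left(spoof, t)) / len(spoof)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)

    # EER is reported at the threshold where FAR and FRR are closest.
    return best[1], best[2]


if __name__ == "__main__":
    # Toy scores, not model outputs.
    eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

On a finite score set the two error rates rarely cross exactly, so this sketch averages them at the closest point; other toolkits interpolate instead and may report slightly different values.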

## Architecture

- **Backbone:** facebook/wav2vec2-base (95M params, 12 transformer layers)
- **Input:** raw waveform at 16 kHz, 4-second windows (64,000 samples)
- **Head:** mean-pool over time + linear classifier (768 -> 2)
- **Stage 1 training:** frozen backbone, classifier head only (1,538 trainable params)
- **Stage 2 training (this checkpoint):** top 2 transformer layers + final LayerNorm unfrozen (~14M trainable params)
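
The figures in the list above can be sanity-checked with some arithmetic. The sketch below assumes the standard wav2vec2-base dimensions (hidden size 768, feed-forward size 3072), a conventional transformer layer with four attention projections and two LayerNorms, and that the classifier head stays trainable in Stage 2. The helper names are hypothetical.

```python
SAMPLE_RATE = 16_000   # Hz
WINDOW_SECONDS = 4
HIDDEN = 768           # wav2vec2-base hidden size
FFN = 3072             # wav2vec2-base feed-forward size (assumed standard config)
NUM_CLASSES = 2        # bonafide vs. spoof


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm_params(dim):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


def transformer_layer_params(hidden=HIDDEN, ffn=FFN):
    """Parameters in one encoder layer: attention, feed-forward, two norms."""
    attention = 4 * linear_params(hidden, hidden)  # query, key, value, output
    feed_forward = linear_params(hidden, ffn) + linear_params(ffn, hidden)
    norms = 2 * layer_norm_params(hidden)
    return attention + feed_forward + norms


window_samples = SAMPLE_RATE * WINDOW_SECONDS
head = linear_params(HIDDEN, NUM_CLASSES)
stage1_trainable = head
stage2_trainable = (
    2 * transformer_layer_params() + layer_norm_params(HIDDEN) + head
)

print(window_samples)     # 64000
print(stage1_trainable)   # 1538
print(stage2_trainable)   # 14178818, i.e. roughly 14M
```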

## Training Details

- **Dataset:** ASVspoof 2019 LA training partition (25,380 utterances)
- **Class weighting:** bonafide=4.92, spoof=0.56 (compensates for ~9:1 spoof:bonafide ratio)
- **Optimizer:** AdamW
- **Learning rate:** 1e-5 with 10% warmup + linear decay
- **Batch size:** 16
- **Mixed precision:** fp16
- **Gradient clipping:** 1.0
- **Epochs:** 10 (best at epoch 9)
- **Wall clock:** 2h 56m on a single T4 GPU
- **Best dev EER:** 0.69%
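
As an illustration of two of the settings above, the sketch below reproduces the class weights with the common "balanced" formula and evaluates a warmup plus linear-decay schedule. It is not the training script. The per-class counts (2,580 bonafide and 22,800 spoof, which sum to the 25,380 utterances listed), the balanced formula, the step count (no gradient accumulation, last partial batch kept), and the function names are all assumptions.

```python
import math

N_BONAFIDE = 2_580   # assumed bonafide count in the training partition
N_SPOOF = 22_800     # assumed spoof count in the training partition
BATCH_SIZE = 16
EPOCHS = 10


def balanced_class_weights(counts):
    """weight_c = total / (num_classes * count_c), the 'balanced' heuristic."""
    total = sum(counts.values())
    return {name: total / (len(counts) * n) for name, n in counts.items()}


def lr_at_step(step, total_steps, base_lr=1e-5, warmup_frac=0.10):
    """Linear warmup to base_lr, then linear decay to zero."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


weights = balanced_class_weights({"bonafide": N_BONAFIDE, "spoof": N_SPOOF})
steps_per_epoch = math.ceil((N_BONAFIDE + N_SPOOF) / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print({name: round(w, 2) for name, w in weights.items()})
for step in (0, total_steps // 10, total_steps // 2, total_steps):
    print(step, lr_at_step(step, total_steps))
```

Under these assumptions the weights round to the 4.92 and 0.56 listed above, and the learning rate peaks at 1e-5 once the first 10% of steps have elapsed.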

## Usage

```python
from huggingface_hub import hf_hub_download

# DeepfakeDetector is the inference wrapper from the source repo (linked below);
# run this from a clone of that repo so that `src` is importable.
from src.inference.predict import DeepfakeDetector

# Download the Stage 2 checkpoint from the Hugging Face Hub
ckpt_path = hf_hub_download(
    repo_id="Sara1708/deepfake-audio-wav2vec2",
    filename="stage2_best.pt",
)

# Load the checkpoint on CPU and classify a single audio file
detector = DeepfakeDetector(checkpoint_path=ckpt_path, device="cpu")
result = detector.predict("path/to/audio.wav")
print(result)
```

The full source code, training notebooks, and evaluation scripts are at:
[github.com/Saracasm/deepfake-audio-detection](https://github.com/Saracasm/deepfake-audio-detection)

Live demo: [huggingface.co/spaces/Sara1708/deepfake-audio-detector](https://huggingface.co/spaces/Sara1708/deepfake-audio-detector)
*(Space link will be live after deployment.)*

## Limitations

- **WaveFake performance is poor (~29% EER on LJSpeech-based vocoders).** This model was trained only on ASVspoof attack types and does not generalize well to standalone neural vocoder pipelines (HiFi-GAN, MelGAN, WaveGlow, etc.).
- **Codec sensitivity:** aggressive lossy compression (GSM, PSTN telephone codecs) raises EER by roughly 6 percentage points relative to uncompressed audio.
- **A10 attack family is a known weakness** (15.54% EER on this attack alone).
- **This is a research artifact, not a production deepfake detector.** Real-world deepfakes may use synthesis methods this model has never seen.

## Citation

If you use this model, please cite the underlying datasets and backbone model:

- ASVspoof 2019: Wang et al., 2020. "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech." Computer Speech & Language.
- ASVspoof 2021: Yamagishi et al., 2021. "ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection."
- WaveFake: Frank & Schönherr, 2021. "WaveFake: A Data Set to Facilitate Audio Deepfake Detection."
- Wav2Vec 2.0: Baevski et al., 2020. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations."

## Authors

Sara Iqbal (23K-0669) and Areeba Arif (23K-0618).
Spring 2026 Deep Learning Project at FAST-NUCES.