---
license: apache-2.0
language:
- en
tags:
- audio-classification
- deepfake-detection
- audio-deepfake
- anti-spoofing
- wav2vec2
- asvspoof
datasets:
- ASVspoof2019
pipeline_tag: audio-classification
library_name: pytorch
---

# Deepfake Audio Detection: Wav2Vec 2.0 Fine-tuned

A fine-tuned Wav2Vec 2.0 model for detecting synthetic (deepfake) speech.
It was trained on ASVspoof 2019 LA and evaluated cross-dataset on ASVspoof 2021 LA and WaveFake.

## Headline Results

| Evaluation | Equal Error Rate (EER) |
|---|---|
| ASVspoof 2019 LA dev (seen attacks A01-A06) | **0.69%** |
| ASVspoof 2019 LA eval (unseen attacks A07-A19) | **5.55%** |
| ASVspoof 2021 LA eval (codec-degraded) | **9.09%** |
| WaveFake (LJSpeech vocoders, mean) | 29.4% |

On ASVspoof 2021 LA, the model's 9.09% EER is on par with the strongest published baselines
(LFCC-LCNN at 9.26%, RawNet2 at 9.50%), achieved without codec-specific training augmentation.

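For readers unfamiliar with the metric, the EER is the operating point at which the false-acceptance rate (spoof accepted as bonafide) equals the false-rejection rate (bonafide rejected). The following is a minimal, dependency-free sketch of how it can be computed from per-utterance scores. The function name and the convention that a higher score means "more likely bonafide" are assumptions for illustration; this is not the repository's evaluation script.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming higher scores mean 'more likely bonafide'.

    A trial is accepted as bonafide when score >= threshold.
    """
    labelled = sorted(
        [(s, True) for s in bonafide_scores] + [(s, False) for s in spoof_scores],
        key=lambda pair: pair[0],
    )
    n_bona, n_spoof = len(bonafide_scores), len(spoof_scores)

    # Start with the threshold at the lowest score: every trial is accepted.
    rejected_bona, accepted_spoof = 0, n_spoof
    best = (float("inf"), 1.0, labelled[0][0])  # (|FAR - FRR|, eer, threshold)

    i = 0
    while True:
        threshold = labelled[i][0] if i < len(labelled) else float("inf")
        frr = rejected_bona / n_bona    # bonafide trials rejected
        far = accepted_spoof / n_spoof  # spoof trials accepted
        gap = abs(far - frr)
        if gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
        if i >= len(labelled):
            break
        # Raise the threshold past every trial tied at this score.
        while i < len(labelled) and labelled[i][0] == threshold:
            if labelled[i][1]:
                rejected_bona += 1
            else:
                accepted_spoof -= 1
            i += 1
    return best[1], best[2]


# Toy scores (made up for illustration, not model outputs).
eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {thr}")
```

On finite data the two error rates rarely cross exactly, so this sketch reports the midpoint of FAR and FRR at the threshold where they are closest. Official ASVspoof tooling may interpolate differently.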
## Architecture

- **Backbone:** facebook/wav2vec2-base (95M params, 12 transformer layers)
- **Input:** raw waveform at 16 kHz, 4-second windows (64,000 samples)
- **Head:** mean-pool over time + linear classifier (768 -> 2)
- **Stage 1 training:** frozen backbone, classifier head only (1,538 trainable params)
- **Stage 2 training (this checkpoint):** top 2 transformer layers + final LayerNorm unfrozen (~14M trainable params)

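The trainable-parameter figures above can be checked with a little arithmetic. The sketch below assumes the standard wav2vec2-base dimensions (hidden size 768, feed-forward width 3,072), that "final LayerNorm" refers to the single LayerNorm at the encoder output, and that the classifier head remains trainable in Stage 2. The card does not spell out the last two points, so treat the Stage 2 total as an estimate.

```python
HIDDEN = 768      # wav2vec2-base hidden size (input width of the 768 -> 2 head)
FFN = 3072        # wav2vec2-base feed-forward width
NUM_CLASSES = 2   # bonafide vs. spoof


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layernorm_params(dim):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


# Stage 1: only the linear classifier head is trained.
head = linear_params(HIDDEN, NUM_CLASSES)

# One transformer layer: four attention projections (q, k, v, out),
# a two-layer feed-forward block, and two LayerNorms.
attention = 4 * linear_params(HIDDEN, HIDDEN)
feed_forward = linear_params(HIDDEN, FFN) + linear_params(FFN, HIDDEN)
layer = attention + feed_forward + 2 * layernorm_params(HIDDEN)

# Stage 2 (assumed): top two layers + encoder-level LayerNorm + head.
stage2 = 2 * layer + layernorm_params(HIDDEN) + head

print(f"head:            {head:>12,}")    # 1,538
print(f"one layer:       {layer:>12,}")   # 7,087,872
print(f"stage 2 (est.):  {stage2:>12,}")  # 14,178,818
```

The head count reproduces the 1,538 figure exactly, and the Stage 2 estimate is consistent with the stated ~14M.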
## Training Details

- **Dataset:** ASVspoof 2019 LA training partition (25,380 utterances)
- **Class weighting:** bonafide=4.92, spoof=0.56 (compensates for ~9:1 spoof:bonafide ratio)
- **Optimizer:** AdamW
- **Learning rate:** 1e-5 with 10% warmup + linear decay
- **Batch size:** 16
- **Mixed precision:** fp16
- **Gradient clipping:** 1.0
- **Epochs:** 10 (best at epoch 9)
- **Wall clock:** 2h 56m on a single T4 GPU
- **Best dev EER:** 0.69%

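Two of the settings above, the class weights and the learning-rate schedule, can be illustrated with a short, dependency-free sketch. It assumes inverse-frequency ("balanced") weighting and uses the per-class counts from the ASVspoof 2019 LA training partition description (2,580 bonafide, 22,800 spoof), which are not stated in this card. The schedule function, its name, and the step count (which assumes the last partial batch is kept) are likewise illustrative rather than taken from the training code.

```python
import math


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    total = sum(counts.values())
    return {name: total / (len(counts) * n) for name, n in counts.items()}


def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_frac=0.10):
    """Linear warmup to peak_lr, then linear decay to zero."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed per-class counts; they sum to the 25,380 utterances listed above.
weights = balanced_class_weights({"bonafide": 2580, "spoof": 22800})
print({name: round(w, 2) for name, w in weights.items()})

# 10 epochs at batch size 16, keeping the final partial batch (assumption).
steps_per_epoch = math.ceil(25380 / 16)
total_steps = 10 * steps_per_epoch
for step in (0, total_steps // 10, total_steps // 2, total_steps):
    print(step, lr_at_step(step, total_steps))
```

Under these assumptions the weights round to 4.92 and 0.56, matching the values listed above, and the learning rate peaks at 1e-5 after the first tenth of training.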
## Usage

```python
from huggingface_hub import hf_hub_download

# Inference wrapper from the source repo; clone the GitHub repository linked
# below and run this from its root so that `src` is importable.
from src.inference.predict import DeepfakeDetector

# Download the Stage 2 checkpoint from the Hugging Face Hub
ckpt_path = hf_hub_download(
    repo_id="Sara1708/deepfake-audio-wav2vec2",
    filename="stage2_best.pt",
)

# Load the checkpoint on CPU and classify a single audio file
detector = DeepfakeDetector(checkpoint_path=ckpt_path, device="cpu")
result = detector.predict("path/to/audio.wav")
print(result)
```

The full source code, training notebooks, and evaluation scripts are at:
[github.com/Saracasm/deepfake-audio-detection](https://github.com/Saracasm/deepfake-audio-detection)

Live demo: [huggingface.co/spaces/Sara1708/deepfake-audio-detector](https://huggingface.co/spaces/Sara1708/deepfake-audio-detector)
*(Space link will be live after deployment.)*

## Limitations

- **WaveFake performance is poor (~29% EER on LJSpeech-based vocoders).** This model was trained only on ASVspoof attack types and does not generalize well to standalone neural vocoder pipelines (HiFi-GAN, MelGAN, WaveGlow, etc.).
- **Codec sensitivity:** aggressive lossy compression (GSM, PSTN telephone codecs) degrades performance by about 6 percentage points relative to uncompressed audio.
- **The A10 attack family is a known weakness** (15.54% EER on this attack alone).
- **This is a research artifact, not a production deepfake detector.** Real-world deepfakes may use synthesis methods this model has never seen.

## Citation

If you use this model, please cite the underlying datasets and backbone model:

- ASVspoof 2019: Wang et al., 2020. "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech." Computer Speech & Language.
- ASVspoof 2021: Yamagishi et al., 2021. "ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection."
- WaveFake: Frank & Schönherr, 2021. "WaveFake: A Data Set to Facilitate Audio Deepfake Detection."
- Wav2Vec 2.0: Baevski et al., 2020. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations."

## Authors

Sara Iqbal (23K-0669) and Areeba Arif (23K-0618).
Spring 2026 Deep Learning Project at FAST-NUCES.