Quran Reciter Identification β€” Fine-tuned ECAPA-TDNN

Identifies which of 362 Quran reciters is speaking in an audio clip, using a fine-tuned ECAPA-TDNN speaker encoder with cosine similarity against per-reciter sub-centroids.

Model Description

  • Architecture: ECAPA-TDNN (SpeechBrain spkrec-ecapa-voxceleb) fine-tuned with AAM-Softmax loss
  • Embedding dimension: 192
  • Inference method: Cosine similarity against K=3 sub-centroids per reciter (captures different vocal conditions: neutral, emotional, different acoustics)
  • Multi-crop inference: Averages embeddings from multiple 20-second crops for robustness
  • Training data: 8,800+ audio files across 362 reciters from MP3Quran.net
  • Validation accuracy: 92.7% on 20-second clips
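The multi-crop inference mentioned above can be sketched as follows. This is a minimal illustration, not the exact training-repo code: the function name and the `n_crops=4` default are assumptions, and `encode` stands in for any callable (such as the encoder below) that maps a `(1, samples)` waveform to a flat embedding.

```python
import torch
import torch.nn.functional as F

def multi_crop_embedding(waveform, encode, crop_s=20, sr=16000, n_crops=4):
    """Average L2-normalized embeddings over several random 20 s crops.

    Illustrative sketch: `encode` maps a (1, samples) waveform to a
    (dim,) embedding; crop length matches the 20-second setting above.
    """
    crop_len = crop_s * sr
    total = waveform.shape[-1]
    if total <= crop_len:
        crops = [waveform]  # clip shorter than one crop: use it whole
    else:
        starts = torch.randint(0, total - crop_len, (n_crops,))
        crops = [waveform[..., s:s + crop_len] for s in starts]
    embs = [F.normalize(encode(c), p=2, dim=-1) for c in crops]
    # Re-normalize the mean so the result is again a unit vector
    return F.normalize(torch.stack(embs).mean(dim=0), p=2, dim=-1)
```

Averaging normalized crop embeddings smooths over local variation (pauses, melodic shifts) within a long recitation.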

Files

  • encoder.pth β€” Fine-tuned ECAPA-TDNN encoder weights
  • centroids.pt β€” Sub-centroids tensor, shape (362, 3, 192)
  • metadata.json β€” Reciter ID to name mapping
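The card states only the shape of centroids.pt, not how it was built. A plausible reconstruction, assuming the K=3 sub-centroids come from k-means over each reciter's normalized training embeddings (function name and iteration count are hypothetical):

```python
import torch
import torch.nn.functional as F

def sub_centroids(embeddings, k=3, iters=20):
    """K-means over one reciter's embeddings -> (k, dim) unit centroids.

    Hypothetical sketch: the model card documents the (362, 3, 192)
    shape of centroids.pt but not the clustering procedure.
    """
    x = F.normalize(embeddings, p=2, dim=1)           # (n, dim), unit norm
    cent = x[torch.randperm(x.shape[0])[:k]].clone()  # random init from data
    for _ in range(iters):
        assign = torch.cdist(x, cent).argmin(dim=1)   # nearest centroid
        for j in range(k):
            mask = assign == j
            if mask.any():
                cent[j] = x[mask].mean(dim=0)
        cent = F.normalize(cent, p=2, dim=1)          # keep unit norm for cosine scoring
    return cent
```

Keeping the centroids unit-normalized is what makes the plain matrix product in the Usage section a cosine similarity.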

Usage

import torch
import torch.nn.functional as F
from speechbrain.inference.speaker import EncoderClassifier

# Load encoder
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
state_dict = torch.load("encoder.pth", map_location="cpu")
encoder.mods.load_state_dict(state_dict)

# Load centroids and metadata
import json

centroids = torch.load("centroids.pt")  # (362, 3, 192)
with open("metadata.json") as f:
    metadata = json.load(f)
id_to_reciter = metadata["id_to_reciter"]

# Identify from audio (16kHz mono waveform)
waveform = ...  # torch.Tensor, shape (1, samples)
with torch.no_grad():
    embedding = encoder.encode_batch(waveform).squeeze()
    embedding = F.normalize(embedding, p=2, dim=0)

# Cosine similarity against sub-centroids (max over K=3)
sims = torch.matmul(centroids, embedding)  # (362, 3)
scores = sims.max(dim=1).values  # (362,)
best_id = scores.argmax().item()
print(f"Reciter: {id_to_reciter[str(best_id)]}")
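To get a ranked shortlist instead of a single best match, the same scores tensor supports torch.topk. A minimal sketch with random stand-in tensors (in practice, use the centroids and embedding loaded above):

```python
import torch
import torch.nn.functional as F

# Stand-in tensors with the documented shapes; replace with the real
# loaded centroids and computed embedding.
centroids = F.normalize(torch.randn(362, 3, 192), p=2, dim=-1)
embedding = F.normalize(torch.randn(192), p=2, dim=0)

sims = torch.matmul(centroids, embedding)  # (362, 3) cosine similarities
scores = sims.max(dim=1).values            # best sub-centroid per reciter
top = torch.topk(scores, k=5)              # five highest-scoring reciters
for rank, (score, idx) in enumerate(zip(top.values, top.indices), start=1):
    print(f"{rank}. reciter_id={idx.item()} cosine={score.item():.3f}")
```

Inspecting the gap between the top two cosine scores is a simple way to flag low-confidence predictions.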

Training Details

  • Base model: speechbrain/spkrec-ecapa-voxceleb
  • Loss: AAM-Softmax (margin=0.2, scale=30)
  • Optimizer: AdamW with dual learning rates (encoder: 1e-4, head: 1e-3)
  • Scheduler: CosineAnnealingWarmRestarts
  • Epochs: up to 20, with early stopping (patience=8)
  • Batch size: 8
  • Clip duration: 20 seconds (random crop during training)
  • Augmentation: Speed perturbation (0.9x-1.1x)
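The AAM-Softmax loss above (margin=0.2, scale=30) adds an angular margin to the target-class logit before scaling. A self-contained sketch of such a training head; the class name and initialization are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxHead(nn.Module):
    """Additive angular margin softmax (sketch; margin=0.2, scale=30 as above)."""

    def __init__(self, dim=192, n_classes=362, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb, labels):
        # Cosine of the angle between each embedding and each class weight
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.shape[1]).bool()
        # Add the margin only on the ground-truth class angle
        logits = torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(self.scale * logits, labels)
```

The margin forces same-reciter embeddings to cluster tightly on the hypersphere, which is what makes cosine scoring against centroids work at inference time.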

Evaluation

Tested on client-provided YouTube clips of 9 different reciters (53 test cases):

  Metric                              Score
  Validation accuracy (clean audio)   92.7%
  YouTube test accuracy               96.2% (51/53)

Limitations

  • Optimized for 20+ second clips; shorter clips may have lower accuracy
  • Emotional/crying recitation may reduce accuracy for some reciters
  • Trained on studio recordings; very noisy environments may degrade performance

Citation

If you use this model, please cite:

@misc{quran-reciter-id-2026,
  title={Quran Reciter Identification using Fine-tuned ECAPA-TDNN},
  author={Arham Anwaar},
  year={2026},
  url={https://huggingface.co/iarhamanwaar/quran-reciter-id-ecapa}
}