Quran Reciter Identification β€” Fine-tuned ECAPA-TDNN

Identifies which of 362 Quran reciters is speaking in an audio clip, using a fine-tuned ECAPA-TDNN speaker encoder with cosine similarity against per-reciter sub-centroids.

Model Description

  • Architecture: ECAPA-TDNN (SpeechBrain spkrec-ecapa-voxceleb) fine-tuned with AAM-Softmax loss
  • Embedding dimension: 192
  • Inference method: Cosine similarity against K=3 sub-centroids per reciter (captures different vocal conditions: neutral, emotional, different acoustics)
  • Multi-crop inference: Averages embeddings from multiple 20-second crops for robustness
  • Training data: 8,800+ audio files across 362 reciters from MP3Quran.net
  • Validation accuracy: 92.7% on 20-second clips
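The multi-crop inference mentioned above can be sketched as follows. This is a minimal illustration, not the exact training-repo code: the function name and the `n_crops=4` default are assumptions, and `encode` stands in for any callable (such as the encoder below) that maps a `(1, samples)` waveform to a flat embedding.

```python
import torch
import torch.nn.functional as F

def multi_crop_embedding(waveform, encode, crop_s=20, sr=16000, n_crops=4):
    """Average L2-normalized embeddings over several random 20 s crops.

    Illustrative sketch: `encode` maps a (1, samples) waveform to a
    (dim,) embedding; crop length matches the 20-second setting above.
    """
    crop_len = crop_s * sr
    total = waveform.shape[-1]
    if total <= crop_len:
        crops = [waveform]  # clip shorter than one crop: use it whole
    else:
        starts = torch.randint(0, total - crop_len, (n_crops,))
        crops = [waveform[..., s:s + crop_len] for s in starts]
    embs = [F.normalize(encode(c), p=2, dim=-1) for c in crops]
    # Re-normalize the mean so the result is again a unit vector
    return F.normalize(torch.stack(embs).mean(dim=0), p=2, dim=-1)
```

Averaging normalized crop embeddings smooths over local variation (pauses, melodic shifts) within a long recitation.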

Files

  • encoder.pth β€” Fine-tuned ECAPA-TDNN encoder weights
  • centroids.pt β€” Sub-centroids tensor, shape (362, 3, 192)
  • metadata.json β€” Reciter ID to name mapping
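The card states only the shape of centroids.pt, not how it was built. A plausible reconstruction, assuming the K=3 sub-centroids come from k-means over each reciter's normalized training embeddings (function name and iteration count are hypothetical):

```python
import torch
import torch.nn.functional as F

def sub_centroids(embeddings, k=3, iters=20):
    """K-means over one reciter's embeddings -> (k, dim) unit centroids.

    Hypothetical sketch: the model card documents the (362, 3, 192)
    shape of centroids.pt but not the clustering procedure.
    """
    x = F.normalize(embeddings, p=2, dim=1)           # (n, dim), unit norm
    cent = x[torch.randperm(x.shape[0])[:k]].clone()  # random init from data
    for _ in range(iters):
        assign = torch.cdist(x, cent).argmin(dim=1)   # nearest centroid
        for j in range(k):
            mask = assign == j
            if mask.any():
                cent[j] = x[mask].mean(dim=0)
        cent = F.normalize(cent, p=2, dim=1)          # keep unit norm for cosine scoring
    return cent
```

Keeping the centroids unit-normalized is what makes the plain matrix product in the Usage section a cosine similarity.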

Usage

import torch
import torch.nn.functional as F
from speechbrain.inference.speaker import EncoderClassifier

# Load encoder
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
state_dict = torch.load("encoder.pth", map_location="cpu")
encoder.mods.load_state_dict(state_dict)

# Load centroids and metadata
import json

centroids = torch.load("centroids.pt")  # (362, 3, 192)
with open("metadata.json") as f:
    metadata = json.load(f)
id_to_reciter = metadata["id_to_reciter"]

# Identify from audio (16kHz mono waveform)
waveform = ...  # torch.Tensor, shape (1, samples)
with torch.no_grad():
    embedding = encoder.encode_batch(waveform).squeeze()
    embedding = F.normalize(embedding, p=2, dim=0)

# Cosine similarity against sub-centroids (max over K=3)
sims = torch.matmul(centroids, embedding)  # (362, 3)
scores = sims.max(dim=1).values  # (362,)
best_id = scores.argmax().item()
print(f"Reciter: {id_to_reciter[str(best_id)]}")
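To get a ranked shortlist instead of a single best match, the same scores tensor supports torch.topk. A minimal sketch with random stand-in tensors (in practice, use the centroids and embedding loaded above):

```python
import torch
import torch.nn.functional as F

# Stand-in tensors with the documented shapes; replace with the real
# loaded centroids and computed embedding.
centroids = F.normalize(torch.randn(362, 3, 192), p=2, dim=-1)
embedding = F.normalize(torch.randn(192), p=2, dim=0)

sims = torch.matmul(centroids, embedding)  # (362, 3) cosine similarities
scores = sims.max(dim=1).values            # best sub-centroid per reciter
top = torch.topk(scores, k=5)              # five highest-scoring reciters
for rank, (score, idx) in enumerate(zip(top.values, top.indices), start=1):
    print(f"{rank}. reciter_id={idx.item()} cosine={score.item():.3f}")
```

Inspecting the gap between the top two cosine scores is a simple way to flag low-confidence predictions.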

Training Details

  • Base model: speechbrain/spkrec-ecapa-voxceleb
  • Loss: AAM-Softmax (margin=0.2, scale=30)
  • Optimizer: AdamW with dual learning rates (encoder: 1e-4, head: 1e-3)
  • Scheduler: CosineAnnealingWarmRestarts
  • Epochs: up to 20, with early stopping (patience=8)
  • Batch size: 8
  • Clip duration: 20 seconds (random crop during training)
  • Augmentation: Speed perturbation (0.9x-1.1x)
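The AAM-Softmax loss above (margin=0.2, scale=30) adds an angular margin to the target-class logit before scaling. A self-contained sketch of such a training head; the class name and initialization are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxHead(nn.Module):
    """Additive angular margin softmax (sketch; margin=0.2, scale=30 as above)."""

    def __init__(self, dim=192, n_classes=362, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb, labels):
        # Cosine of the angle between each embedding and each class weight
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.shape[1]).bool()
        # Add the margin only on the ground-truth class angle
        logits = torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(self.scale * logits, labels)
```

The margin forces same-reciter embeddings to cluster tightly on the hypersphere, which is what makes cosine scoring against centroids work at inference time.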

Evaluation

Tested on client-provided YouTube clips of 9 different reciters (53 test cases):

  Metric                              Score
  Validation accuracy (clean audio)   92.7%
  YouTube test accuracy               96.2% (51/53)

Limitations

  • Optimized for 20+ second clips; shorter clips may have lower accuracy
  • Emotional/crying recitation may reduce accuracy for some reciters
  • Trained on studio recordings; very noisy environments may degrade performance

Citation

If you use this model, please cite:

@misc{quran-reciter-id-2026,
  title={Quran Reciter Identification using Fine-tuned ECAPA-TDNN},
  author={Arham Anwaar},
  year={2026},
  url={https://huggingface.co/iarhamanwaar/quran-reciter-id-ecapa}
}