ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
Paper: arXiv:2005.07143
Identifies which of 362 Quran reciters is speaking from an audio clip, using a fine-tuned ECAPA-TDNN speaker encoder with cosine similarity against per-reciter sub-centroids.
Base model: [speechbrain/spkrec-ecapa-voxceleb](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb), fine-tuned with AAM-Softmax loss.

Files:
- `encoder.pth`: fine-tuned ECAPA-TDNN encoder weights
- `centroids.pt`: sub-centroids tensor, shape `(362, 3, 192)`
- `metadata.json`: reciter ID to name mapping

```python
import json

import torch
import torch.nn.functional as F
from speechbrain.inference.speaker import EncoderClassifier

# Load the base encoder, then swap in the fine-tuned weights
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
state_dict = torch.load("encoder.pth", map_location="cpu")
encoder.mods.load_state_dict(state_dict)

# Load sub-centroids and the reciter ID-to-name mapping
centroids = torch.load("centroids.pt")  # (362, 3, 192)
with open("metadata.json") as f:
    metadata = json.load(f)
id_to_reciter = metadata["id_to_reciter"]

# Identify from audio (16 kHz mono waveform)
waveform = ...  # torch.Tensor, shape (1, samples)
with torch.no_grad():
    embedding = encoder.encode_batch(waveform).squeeze()
embedding = F.normalize(embedding, p=2, dim=0)

# Cosine similarity against sub-centroids (max over K=3 per reciter)
sims = torch.matmul(centroids, embedding)  # (362, 3)
scores = sims.max(dim=1).values  # (362,)
best_id = scores.argmax().item()
print(f"Reciter: {id_to_reciter[str(best_id)]}")
```
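The same scoring step extends naturally to a ranked shortlist of candidates rather than a single argmax. A minimal sketch, with random tensors standing in for the real `centroids.pt` and encoder embedding (shapes match the shipped artifacts; the values here are synthetic):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for the real artifacts: 362 reciters x 3 sub-centroids x 192 dims
centroids = F.normalize(torch.randn(362, 3, 192), p=2, dim=-1)
embedding = F.normalize(torch.randn(192), p=2, dim=0)

# Max cosine similarity over each reciter's K=3 sub-centroids
sims = torch.matmul(centroids, embedding)  # (362, 3)
scores = sims.max(dim=1).values  # (362,)

# Top-3 candidate reciter IDs, best first
top = torch.topk(scores, k=3)
for rank, (score, idx) in enumerate(zip(top.values, top.indices), start=1):
    print(f"{rank}. reciter {idx.item()}: cosine {score.item():.3f}")
```

Taking the max over a reciter's sub-centroids lets one identity cover several recording conditions (studio vs. live, different maqamat) without blurring them into a single mean embedding.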
Tested on client-provided YouTube clips of 9 different reciters (53 test cases):
| Metric | Score |
|---|---|
| Validation accuracy (clean audio) | 92.7% |
| YouTube test accuracy | 96.2% (51/53) |
If you use this model, please cite:
```bibtex
@misc{quran-reciter-id-2026,
  title={Quran Reciter Identification using Fine-tuned ECAPA-TDNN},
  author={Arham Anwaar},
  year={2026},
  url={https://huggingface.co/iarhamanwaar/quran-reciter-id-ecapa}
}
```