# Wav2Vec2-XLS-R-300M Fine-tuned for Qur'anic Mispronunciation Detection
This model is a fine-tuned version of `facebook/wav2vec2-xls-r-300m`, adapted for the Iqra'Eval 2026 Shared Task. It performs phoneme-level Automatic Speech Recognition (ASR) to detect and diagnose mispronunciations in Modern Standard Arabic (MSA) readings of Qur'anic texts.
## Model Description
*Figure: single-stage fine-tuning pipeline for Wav2Vec2 acoustic models.*
A single-stage fine-tuning strategy was utilized to adapt the generalized cross-lingual speech representations to the phonetic distribution of Qur'anic recitation. The model was trained end-to-end using Connectionist Temporal Classification (CTC) loss. The CNN feature extractor was frozen during training to prevent catastrophic forgetting of the pre-trained acoustic representations, allowing only the Transformer layers to be updated.
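The freezing step described above can be sketched with the Transformers API. For brevity this builds a small randomly initialized model from a config; the actual run would instead load the checkpoint with `Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-xls-r-300m")`.

```python
from transformers import Wav2Vec2Config, Wav2Vec2ForCTC

# Illustrative config; the real run loads the pre-trained 300M checkpoint.
config = Wav2Vec2Config(
    vocab_size=74,               # 68 phonemes + 6 special tokens (see below)
    ctc_loss_reduction="mean",
)
model = Wav2Vec2ForCTC(config)

# Freeze the CNN feature extractor so only the Transformer stack is updated.
model.freeze_feature_encoder()

frozen = all(not p.requires_grad
             for p in model.wav2vec2.feature_extractor.parameters())
print(frozen)  # True: feature-extractor weights are excluded from training
```

`freeze_feature_encoder()` simply sets `requires_grad = False` on the convolutional front-end, which is what prevents catastrophic forgetting of the pre-trained acoustic representations.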
## Training Data
The model was fine-tuned on a combined dataset totaling approximately 159 hours of Arabic speech, sourced directly from the official Iqra'Eval Hugging Face repositories:
- **Native Speech (Pseudo-Labeled):** `IqraEval/Iqra_train`
  - Volume: ~79 hours
  - Description: Recordings from native MSA speakers. This subset is treated as "golden" data with pseudo-labels, since the speakers are assumed to pronounce the text correctly.
- **Synthetic Mispronunciations:** `IqraEval/Iqra_TTS`
  - Volume: ~80 hours
  - Description: Synthetic speech generated with trained TTS systems, in which mispronunciations (substitutions, deletions, insertions) were deliberately introduced into the input text to teach the model explicit error patterns.
## Data Preprocessing and Phoneme Vocabulary
The model was trained on audio resampled to 16 kHz. Utterances shorter than 0.3 seconds or longer than 15.0 seconds were filtered out during preprocessing to maintain batch stability.
The vocabulary consists of 74 tokens, built upon the 68 MSA phonemes defined by the Nawar Halabi phonetizer (including explicit tokens for gemination, e.g., /bb/, and for emphatic consonants). Six special tokens (`<pad>`, `<unk>`, `|`, `<ctc>`, `<s>`, `</s>`) were appended to facilitate CTC decoding.
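The duration filter described above can be sketched as a simple predicate over sample counts at the 16 kHz target rate. The function name is illustrative, not from the released training code.

```python
# Keep only utterances between 0.3 s and 15.0 s at 16 kHz, as described above.
SAMPLE_RATE = 16_000
MIN_SECONDS, MAX_SECONDS = 0.3, 15.0

def keep_utterance(num_samples: int) -> bool:
    """Return True if the utterance falls inside the stable-batch window."""
    duration = num_samples / SAMPLE_RATE
    return MIN_SECONDS <= duration <= MAX_SECONDS

# A 0.2 s clip is dropped, a 5 s clip is kept, a 20 s clip is dropped.
lengths = [int(0.2 * SAMPLE_RATE), 5 * SAMPLE_RATE, 20 * SAMPLE_RATE]
print([keep_utterance(n) for n in lengths])  # [False, True, False]
```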
## Training Hyperparameters
Fine-Tuning Configuration:
- Epochs: 10
- Learning Rate: 3e-5 (Cosine Scheduler)
- Warmup Ratio: 0.1
- Effective Batch Size: 32 (8 per device × 4 gradient accumulation steps)
- Weight Decay: 0.01
- SpecAugment: Enabled (mask_time_prob=0.05, mask_time_length=10)
- Precision: bf16
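The configuration above maps onto Hugging Face `TrainingArguments` roughly as follows; the output directory is illustrative, and the SpecAugment settings live on the model config rather than the trainer.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="wav2vec2-xls-r-300m-iqraeval",  # illustrative path
    num_train_epochs=10,
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size: 8 × 4 = 32
    weight_decay=0.01,
    bf16=True,
)
# SpecAugment is set on the model config, not TrainingArguments:
# Wav2Vec2Config(mask_time_prob=0.05, mask_time_length=10, ...)
```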
## Usage
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torchaudio
import torch

processor = Wav2Vec2Processor.from_pretrained("FatimahEmadEldin/wav2vec2-xls-r-300m-iqraeval")
model = Wav2Vec2ForCTC.from_pretrained("FatimahEmadEldin/wav2vec2-xls-r-300m-iqraeval")

# Load audio and resample to the 16 kHz rate the model expects
wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(wav)

inputs = processor(wav.squeeze(), sampling_rate=16000, return_tensors="pt")

# Greedy CTC decoding: take the argmax token per frame, then let
# batch_decode collapse repeats and strip blank tokens
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)[0]
print(transcription)
```
## Blind Test Performance
- **Blind test dataset:** `IqraEval/QuranMB.v2`
- **Submission format:** CSV with columns `ID` and `Labels` (space-separated phoneme predictions)
- **Leaderboard:** IqraEval Leaderboard
- **F1:** 0.2020
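Writing the submission file in the format described above can be sketched with the standard library; the utterance IDs and phoneme strings here are dummy data for illustration.

```python
import csv

# Illustrative predictions: utterance ID -> space-separated phoneme string
predictions = {
    "utt_0001": "b i s m i",
    "utt_0002": "q u l",
}

# Write the required two-column CSV with header "ID,Labels"
with open("submission.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "Labels"])
    for utt_id, phonemes in predictions.items():
        writer.writerow([utt_id, phonemes])
```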
## Citation
```bibtex
@inproceedings{eldin2026zero,
  title={Qari at Iqra'Eval 2026: Zero-Shot HuBERT Inference for Quranic Pronunciation Evaluation},
  author={Eldin, Fatimah Emad},
  booktitle={Proceedings of Interspeech},
  year={2026}
}
```