Wav2Vec2-XLS-R-300M Fine-tuned for Qur'anic Mispronunciation Detection

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m specifically adapted for the Iqra'Eval 2026 Shared Task. It is designed to perform phoneme-level Automatic Speech Recognition (ASR) to detect and diagnose mispronunciations in Modern Standard Arabic (MSA) readings of Qur'anic texts.

Model Description

[Image of single-stage fine-tuning pipeline for Wav2Vec2 acoustic models]

A single-stage fine-tuning strategy was utilized to adapt the generalized cross-lingual speech representations to the phonetic distribution of Qur'anic recitation. The model was trained end-to-end using Connectionist Temporal Classification (CTC) loss. The CNN feature extractor was frozen during training to prevent catastrophic forgetting of the pre-trained acoustic representations, allowing only the Transformer layers to be updated.
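The freezing step can be sketched in a few lines. TinyWav2Vec2 below is a stand-in module invented for illustration; in the actual transformers API the same effect comes from calling Wav2Vec2ForCTC.freeze_feature_encoder() on the loaded checkpoint.

```python
import torch.nn as nn

# Minimal sketch of the freezing strategy: the CNN feature extractor's
# parameters are excluded from gradient updates, while the Transformer
# encoder stays trainable. TinyWav2Vec2 is a hypothetical stand-in.
class TinyWav2Vec2(nn.Module):
    def __init__(self):
        super().__init__()
        self.feature_extractor = nn.Conv1d(1, 8, kernel_size=3)  # stands in for the CNN stack
        self.encoder = nn.Linear(8, 8)                           # stands in for the Transformer

    def freeze_feature_encoder(self):
        for p in self.feature_extractor.parameters():
            p.requires_grad = False

model = TinyWav2Vec2()
model.freeze_feature_encoder()
frozen = all(not p.requires_grad for p in model.feature_extractor.parameters())
trainable = all(p.requires_grad for p in model.encoder.parameters())
```

With gradients disabled on the feature extractor, the optimizer only updates Transformer weights, which is what protects the pre-trained acoustic representations from catastrophic forgetting.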

Training Data

The model was fine-tuned on a combined dataset totaling approximately 159 hours of Arabic speech, sourced directly from the official Iqra'Eval Hugging Face repositories:

  • Native Speech (Pseudo-Labeled): IqraEval/Iqra_train
    • Volume: ~79 hours
    • Description: Recordings from native MSA speakers. This subset is treated as "golden" data: phoneme labels are pseudo-labels derived from the reference text, since native speakers are assumed to pronounce it correctly.
  • Synthetic Mispronunciations: IqraEval/Iqra_TTS
    • Volume: ~80 hours
    • Description: Synthetic speech generated using trained TTS systems where mispronunciations (substitutions, deletions, insertions) were deliberately introduced into the input text to teach the model explicit error patterns.

Data Preprocessing and Phoneme Vocabulary

The model was trained on audio resampled to 16 kHz. Utterances shorter than 0.3 seconds or longer than 15.0 seconds were filtered out during preprocessing to maintain batch stability.
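The duration filter described above can be expressed directly; the thresholds come from this card, while the `keep` helper is a hypothetical name for illustration.

```python
# Keep utterances between 0.3 s and 15.0 s at 16 kHz (thresholds from this card).
MIN_SEC, MAX_SEC, SR = 0.3, 15.0, 16_000

def keep(num_samples: int) -> bool:
    """Return True if an utterance of `num_samples` audio samples passes the filter."""
    dur = num_samples / SR
    return MIN_SEC <= dur <= MAX_SEC

durations = [0.2, 0.3, 5.0, 15.0, 15.1]  # seconds
kept = [d for d in durations if keep(round(d * SR))]  # -> [0.3, 5.0, 15.0]
```

Filtering on sample counts rather than raw durations avoids re-decoding audio and keeps batches free of very short clips (unstable CTC alignments) and very long ones (memory spikes).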

The vocabulary consists of 74 tokens: the 68 MSA phonemes defined by the Nawar Halabi phonetizer (including explicit tokens for gemination, e.g., /bb/, and for emphatic consonants), plus six special tokens (<pad>, <unk>, |, <ctc>, <s>, </s>) appended to facilitate CTC decoding.
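An illustrative construction of that vocabulary is shown below. The phoneme list is truncated to a few entries (the real set has 68), and placing the special tokens first is an assumption about ordering, not something this card specifies.

```python
# Illustrative CTC vocabulary assembly: special tokens plus phoneme tokens.
# The phoneme list is truncated; the real set has 68 entries. The ordering
# (specials first) is an assumption for illustration.
phonemes = ["b", "bb", "t", "tt"]  # ... 68 phoneme tokens in the real set
specials = ["<pad>", "<unk>", "|", "<ctc>", "<s>", "</s>"]

vocab = {tok: i for i, tok in enumerate(specials + phonemes)}
```

A mapping like this, serialized to vocab.json, is what a Wav2Vec2CTCTokenizer consumes; with the full 68-phoneme list the dictionary reaches the 74 entries reported above.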

Training Hyperparameters

Fine-Tuning Configuration:

  • Epochs: 10
  • Learning Rate: 3e-5 (Cosine Scheduler)
  • Warmup Ratio: 0.1
  • Effective Batch Size: 32 (8 per device × 4 gradient accumulation steps)
  • Weight Decay: 0.01
  • SpecAugment: Enabled (mask_time_prob=0.05, mask_time_length=10)
  • Precision: bf16
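The hyperparameters above can be collected into a single configuration using Hugging Face TrainingArguments-style names. The original training script is not published on this card, so treat this as an illustrative restatement rather than the exact setup.

```python
# Fine-tuning configuration from this card, as a plain dict with
# TrainingArguments-style key names (illustrative, not the exact script).
train_config = {
    "num_train_epochs": 10,
    "learning_rate": 3e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "weight_decay": 0.01,
    "mask_time_prob": 0.05,   # SpecAugment time masking
    "mask_time_length": 10,
    "bf16": True,
}

# Effective batch size = per-device batch x gradient accumulation steps
effective_batch = (train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])  # -> 32
```

Gradient accumulation over 4 steps is what lifts the per-device batch of 8 to the effective batch size of 32 reported above without requiring more GPU memory.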

Usage

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torchaudio
import torch

processor = Wav2Vec2Processor.from_pretrained("FatimahEmadEldin/wav2vec2-xls-r-300m-iqraeval")
model = Wav2Vec2ForCTC.from_pretrained("FatimahEmadEldin/wav2vec2-xls-r-300m-iqraeval")

# Load audio and resample to the 16 kHz rate the model expects
wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(wav)
wav = wav.mean(dim=0)  # mix down to mono if the file is stereo

inputs = processor(wav, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)[0]
print(transcription)

Blind Test Performance

Blind test dataset: IqraEval/QuranMB.v2
Submission format: CSV with columns ID and Labels (space-separated phoneme predictions)
Leaderboard: IqraEval Leaderboard

F1: 0.2020
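A submission file in the format described above can be produced with the standard csv module. The utterance IDs and phoneme sequences below are invented placeholders, not real predictions.

```python
# Write a submission CSV with columns ID and Labels, where Labels holds
# space-separated phoneme predictions. IDs and phonemes are placeholders.
import csv

predictions = {
    "utt_0001": ["b", "i", "s", "m", "i"],
    "utt_0002": ["q", "u", "l"],
}

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "Labels"])
    for utt_id, phones in predictions.items():
        writer.writerow([utt_id, " ".join(phones)])
```

Joining the per-utterance phoneme list with single spaces matches the "space-separated phoneme predictions" requirement for the Labels column.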

Citation

@inproceedings{eldin2026zero,
  title={Qari at Iqra'Eval 2026: Zero-Shot HuBERT Inference for Quranic Pronunciation Evaluation},
  author={Eldin, Fatimah Emad},
  booktitle={Proceedings of Interspeech},
  year={2026}
}