Whisper Small Quran (LoRA Fine-Tuned)

This is a specialized Automatic Speech Recognition (ASR) model for Quranic recitation with tashkeel (diacritics). It is a fine-tuned version of openai/whisper-small, optimized to recognize Quranic Arabic with high accuracy while maintaining fast inference.

Model Performance

  • Word Error Rate (WER): Achieved 10.8% on a randomly sampled 20% subset of the ahishamm/QURANICWhisperDataset test set.
  • Accuracy: The model demonstrates high precision in capturing Quranic vocabulary, standard Imla'i script nuances, and correct Tashkeel placement.
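For reference, Word Error Rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal, dependency-free sketch of the metric (not the evaluation script used for this model):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion over a 4-word reference -> 0.5
print(wer("a b c d", "a x c"))
```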

Architecture & Trade-offs

This model utilizes the standard Whisper Small architecture, featuring 12 encoder layers and 12 decoder layers (244M parameters).

  • Pros: It offers a great balance between speed and stability. It is significantly faster to run than the Medium or Large architectures, while its 12-layer decoder maintains much stronger contextual and grammatical stability compared to heavily pruned models.
  • Cons: It lacks the acoustic depth of the Large models (24- or 32-layer variants), so it occasionally misplaces tashkeel (diacritics) or makes minor phonetic substitutions on acoustically similar letters. It is also prone to hallucinations on silences unless generation is configured carefully.
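Hallucinations on silence can usually be mitigated with Whisper's temperature-fallback thresholds. A hedged sketch of generation settings (parameter names follow the `transformers` Whisper generation config; the threshold values are common defaults, not values verified for this model, and temperature fallback requires `return_timestamps=True`):

```python
# Illustrative anti-hallucination generation settings for Whisper in
# Hugging Face transformers; verify names against your installed version.
anti_hallucination_kwargs = {
    "task": "transcribe",
    "language": "arabic",
    # Temperature fallback: retry at higher temperatures only when the
    # low-temperature pass fails the quality thresholds below.
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "compression_ratio_threshold": 2.4,  # reject repetitive (looping) output
    "logprob_threshold": -1.0,           # reject low-confidence segments
    "no_speech_threshold": 0.6,          # treat a segment as silence above this
}
```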

Training Details

The model was trained using LoRA (Low-Rank Adaptation) in a multi-stage curriculum learning process to ensure stability and precision.
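Conceptually, LoRA freezes the pretrained weight W and learns a low-rank update, so the effective weight becomes W + (alpha / r) * (B @ A). A minimal numpy illustration of this idea (not the actual training code; dimensions and scaling are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 16

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus scaled low-rank adapter path.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B initialized to zero, the adapted layer reproduces the frozen base,
# so training starts from the pretrained behavior.
assert np.allclose(lora_forward(x), W @ x)
```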

Datasets

The training and evaluation process utilized a comprehensive mix of professional recitations:

  1. Training & Validation:
  2. Testing:

Methodology

  • Curriculum Learning: The model was trained gradually across these datasets to refine its understanding of Tajweed and Quranic sentence structures.
  • Data Augmentation: To ensure the model remains robust against real-world conditions (non-studio microphones, background noise, varying volumes), diverse audio augmentations including gain adjustments and spectral masking were applied during the training process.
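The two augmentations mentioned above can be sketched as follows (a hedged illustration, not the actual training pipeline; gain ranges, mask widths, and array shapes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def random_gain(wave, min_db=-6.0, max_db=6.0):
    # Scale the waveform by a random gain drawn in decibels.
    gain_db = rng.uniform(min_db, max_db)
    return wave * (10.0 ** (gain_db / 20.0))

def freq_mask(spec, max_width=8):
    # Spectral masking (SpecAugment-style): zero out a random
    # contiguous band of frequency bins.
    spec = spec.copy()
    n_freq = spec.shape[0]
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, n_freq - width + 1))
    spec[start:start + width, :] = 0.0
    return spec

wave = rng.normal(size=16000)              # 1 s of synthetic 16 kHz audio
spec = np.abs(rng.normal(size=(80, 100)))  # synthetic 80-bin mel spectrogram

aug_wave = random_gain(wave)
aug_spec = freq_mask(spec)
```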

Usage

This model is fully compatible with the Hugging Face transformers pipeline.

from transformers import pipeline

# Load the pipeline with automatic device mapping for multi-GPU setups
pipe = pipeline(
    "automatic-speech-recognition",
    model="MaddoggProduction/whisper-small-quran-lora-dataset-mix",
    device_map="auto" # auto for multiple GPUs, n: for GPU n, -1: for CPU
)

# Transcribe audio with chunking
result = pipe(
    "path_to_audio.mp3",
    chunk_length_s=30, 
    stride_length_s=3, 
    batch_size=8,
    return_timestamps=True,
    generate_kwargs={
        "task": "transcribe",
        "language": "arabic",
        #"temperature": 0.0,      # Greedy decoding, recommended for stability. Adjust as needed
        #"num_beams": 5,           # Adjust as needed, 5 for stability
    }
)

print(result["text"])