# Whisper Quran v1 — Arabic Quranic Speech Recognition
The first open-source, production-tested Whisper Large-v3 model fine-tuned specifically for Quran recitation.
Fine-tuned on professional recitations from 5 world-renowned Quran reciters, achieving 5.35% WER on held-out evaluation data. Tested across 19 reciters (including 14 completely unseen) with strong generalization. Deployed and validated in a real-time verse-tracking system on smart glasses.
## Key Results
| Metric | Value |
|---|---|
| Word Error Rate (WER) | 5.35% (best) / 5.74% (this checkpoint) |
| Cross-reciter generalization | 19/19 reciters produce usable transcription |
| Short ayah accuracy (<10 words) | ~97% across all 19 reciters |
| Medium ayah accuracy (10-20 words) | ~88% across all 19 reciters |
| Real-time verse matching | Score 1.00 (perfect) on live recitation |
| Base model | openai/whisper-large-v3 (1.55B parameters) |
| Training data | 5 reciters × 6,236 ayahs from Buraaq/quran-md-ayahs |
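For context, WER is the standard word error rate (substitutions + insertions + deletions over reference words). A minimal way to compute it is with the Hugging Face `evaluate` library; the exact text normalization behind the reported numbers is not documented here, so treat this as illustrative:

```python
import evaluate

wer_metric = evaluate.load("wer")  # requires the `jiwer` backend

references = ["بسم الله الرحمن الرحيم"]
predictions = ["بسم الله الرحمن الرحيم"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2%}")  # 0.00% for an exact match
```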
## What Makes This Model Different
| | Tarteel AI | Academic Papers | Other HF Models | This Model |
|---|---|---|---|---|
| Architecture | whisper-base/tiny | Various | whisper-large-v3 | whisper-large-v3 |
| Training data | Crowd-sourced amateurs | Small / private | Tarteel everyayah | Professional reciters |
| Reciter diversity | 1,200+ amateurs | 2-10 | 36 pros | 5 trained + 14 unseen validated |
| Open weights | ✅ | ❌ | ✅ (no WER docs) | ✅ with full pipeline + benchmarks |
| Production tested | App only | Research | No | Real-time smart glasses system |
| Reproducible | ❌ | ❌ | Partial | ✅ full training notebook |
## Training Details
### Reciters (Phase 1)
| # | Reciter | Style | Origin |
|---|---|---|---|
| 1 | Mishary Rashid Alafasy | Modern Murattal | Kuwait |
| 2 | Abdurrahman As-Sudais | Haramain Imam | Saudi Arabia |
| 3 | Mahmoud Khalil Al-Husary | Classical Murattal | Egypt |
| 4 | Mahmoud Khalil Al-Husary | Mujawwad (slow, melodic) | Egypt |
| 5 | Muhammad Siddiq Al-Minshawy | Classical Murattal | Egypt |
This covers Gulf, Saudi Haramain, and Egyptian classical styles — the three dominant recitation traditions worldwide.
### Training Configuration
| Parameter | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Dataset | Buraaq/quran-md-ayahs (ayah-level aligned) |
| Training samples | ~31,180 (5 reciters × 6,236 ayahs) |
| Evaluation samples | 115 (held-out split) |
| Max steps | 5,000 |
| Best checkpoint | Step 3,500 (5.35% WER) |
| Released checkpoint | Step 4,500 (5.74% WER) |
| Learning rate | 1e-5 with warmup |
| Batch size | 8 (effective, with gradient accumulation) |
| Precision | bf16 |
| GPU | AMD Radeon RX 7900 XTX (24 GB) |
| Training time | ~38 hours |
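For reference, a minimal sketch of training arguments matching the table (the per-device batch size, accumulation split, and warmup length are assumptions; the released training notebook is the authoritative source):

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the configuration table: 5,000 steps, lr 1e-5 with warmup,
# bf16, and effective batch size 8 via gradient accumulation.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-quran-v1",
    per_device_train_batch_size=2,   # assumed split: 2 x 4 accumulation = 8 effective
    gradient_accumulation_steps=4,   # assumed
    learning_rate=1e-5,
    warmup_steps=500,                # "warmup" is stated but not quantified
    max_steps=5000,
    bf16=True,
    eval_strategy="steps",
    eval_steps=500,                  # matches the 500-step WER progression below
    save_steps=500,
    predict_with_generate=True,
)
```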
### WER Progression
| Step | WER | Notes |
|---|---|---|
| 1,000 | 34.46% | Early training |
| 1,500 | 15.84% | Rapid improvement |
| 2,000 | 10.89% | |
| 2,500 | 7.92% | |
| 3,000 | 9.90% | Eval variance (small eval set) |
| 3,500 | 5.35% | Best checkpoint |
| 4,000 | 9.11% | Eval variance |
| 4,500 | 5.74% | Released checkpoint |
The oscillation between checkpoints reflects the small evaluation set (115 samples); both the step-3,500 and step-4,500 checkpoints are strong.
## Cross-Reciter Evaluation (19 Reciters)
Tested on 10 ayahs per reciter spanning short, medium, and long verses. Reciters marked with ✅ were in the training set; all others are completely unseen.
### Trained Reciters (5)
| Reciter | Short Ayahs | Medium Ayahs | Long Ayahs |
|---|---|---|---|
| ✅ Alafasy | Perfect | Perfect | Core words correct, tail drift |
| ✅ As-Sudais | Perfect | Perfect | Core words correct, tail drift |
| ✅ Al-Husary | Perfect | Perfect | Core words correct, tail drift |
| ✅ Al-Husary (Mujawwad) | Perfect | Perfect | Core words correct, tail drift |
| ✅ Al-Minshawy | Perfect | Perfect | Core words correct, tail drift |
### Unseen Reciters (14) — Zero-Shot Generalization
| Reciter | Short Ayahs | Medium Ayahs | Long Ayahs |
|---|---|---|---|
| Abdul Basit (Egyptian) | Near perfect | Good | Drift |
| Maher Al-Muaiqly (Saudi) | Near perfect | Good | Drift |
| Muhammad Jibreel (Egyptian) | Good | Good | Drift |
| Ali Jaber (Saudi) | Good | Good | Drift |
| Saood Ash-Shuraym (Saudi) | Good | Good | Drift |
| Ayman Sowaid (Average style) | Near perfect | Near perfect | Drift |
| Abdullah Basfar (Saudi) | Excellent | Good | Drift |
| Hani Ar-Rifai (Saudi) | Minor diacritics | Good | More drift |
| Fares Abbad (Saudi) | Minor errors | Good | Drift |
| Nasser Al-Qatami (Kuwaiti) | Good | Good | Drift |
| Yasser Ad-Dossary (Saudi) | Good | Good | Drift |
| Ghamadi (Saudi) | Good | Good | Drift |
| Alafasy 64kbps (low bitrate) | Identical to 128kbps | — | — |
| Abu Bakr Ash-Shatri (Saudi) | Good | Good | Drift |
**Key finding:** Ayman Sowaid, whose recitation style is closer to an everyday reciter than to a professional performance, scored near perfect. This suggests the model works for everyday reciters, not just professionals.
Long ayah drift is expected and is solved in production by chunking audio into 10-second windows.
## Live Production Results
This model has been deployed in a real-time Quran verse tracking system running on G2 smart glasses. The pipeline processes 10-second audio chunks through Whisper, then matches output against all 6,236 Quran ayahs using Token F1 + IDF-weighted scoring.
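The matching code itself is not included in this repo; the sketch below shows one way to implement "Token F1 + IDF-weighted scoring" as described (the whitespace tokenization, 50/50 blend, and normalization are assumptions, not the deployed implementation):

```python
import math
from collections import Counter

def idf_weights(ayah_token_lists):
    """Corpus IDF over all ayahs: rare words contribute more to a match."""
    n = len(ayah_token_lists)
    df = Counter(tok for toks in ayah_token_lists for tok in set(toks))
    return {tok: math.log(n / df[tok]) for tok in df}

def match_score(hyp_tokens, ref_tokens, idf):
    """Blend token-level F1 with IDF-weighted overlap (50/50 blend assumed)."""
    common = Counter(hyp_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    idf_hit = sum(idf.get(t, 0.0) * c for t, c in common.items())
    idf_ref = sum(idf.get(t, 0.0) for t in ref_tokens) or 1.0
    return 0.5 * f1 + 0.5 * (idf_hit / idf_ref)

# For each 10-second transcription, pick the best of all 6,236 ayahs:
# best_idx = max(range(len(ayahs)), key=lambda i: match_score(hyp, ayahs[i], idf))
```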
### Real-Time Matching Scores (Live Recitation)
| Surah | Verses Tracked | Match Quality | Notes |
|---|---|---|---|
| Al-Fatiha (1) | 1:1 → 1:7 | Score 0.69–1.00 | Perfect tracking |
| Ar-Rahman (55) | 55:37 → 55:56 | Score 1.00 | Perfect on refrain verse |
| An-Nazi'at (79) | 79:12 → 79:46 | Score 0.43–0.88 | Full surah tracked |
Example from the live log:

```
Whisper: "فَبِأَيِّ آلَاءِ رَبِّكُمَا تُكَذِّبَانِ"
Match: [55:13] score=1.00 F1=1.00 IDF=1.00 coverage=100%
```
The 10-second chunking eliminates the long-ayah drift problem entirely — every chunk contains 5-15 words, which is the model's sweet spot.
## Usage
### Hugging Face Inference Endpoint (custom handler, recommended)
This repo now includes a custom `handler.py` for HF Inference Endpoints that supports:
- micro-batching of concurrent requests on one GPU worker
- low-latency queue window (`ASR_BATCH_WINDOW_MS`)
- bounded batching (`ASR_MAX_BATCH_SIZE`)
- output shape compatible with clients expecting `{ text, chunks }`
#### Endpoint configuration
Set your endpoint to use this repository revision and custom handler.
Optional environment variables:
| Variable | Default | Description |
|---|---|---|
| `ASR_BATCH_WINDOW_MS` | 35 | Queue window (ms) to coalesce near-simultaneous requests into one forward pass |
| `ASR_MAX_BATCH_SIZE` | 4 | Max requests per micro-batch |
| `ASR_REQUEST_TIMEOUT_S` | 45 | Per-request queue wait timeout (seconds) |
The handler accepts raw `audio/wav` bytes or JSON payloads with `inputs` (base64/string/bytes audio) and optional `parameters` (`language`, `task`, `return_timestamps`, `chunk_length_s`, `temperature`).
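A minimal client sketch against such an endpoint (the endpoint URL and token below are placeholders, not real values):

```python
import base64
import requests

ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_..."  # placeholder

with open("recitation.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "inputs": audio_b64,
    "parameters": {"language": "ar", "task": "transcribe"},
}
resp = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text"])  # handler returns { text, chunks }
```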
### Basic Transcription
```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="wasimlhr/whisper-quran-v1",
    device=0  # GPU, or -1 for CPU
)

result = pipe(
    "recitation.wav",
    generate_kwargs={"language": "ar", "task": "transcribe"}
)
print(result["text"])
```
### From Audio Array (No ffmpeg Required)
```python
import librosa

# Reuses `pipe` from the Basic Transcription example above.
audio, sr = librosa.load("recitation.wav", sr=16000)

result = pipe(
    {"array": audio, "sampling_rate": 16000},
    generate_kwargs={"language": "ar", "task": "transcribe"}
)
print(result["text"])
```
### Direct Model Usage (Full Control)
```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("wasimlhr/whisper-quran-v1")
model = WhisperForConditionalGeneration.from_pretrained(
    "wasimlhr/whisper-quran-v1",
    torch_dtype=torch.float16
).to("cuda")

audio, sr = librosa.load("recitation.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda").half()

with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        language="ar",
        task="transcribe",
    )

text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(text)
```
### Production Chunking (Recommended)
For real-time applications, send audio in 5-10 second chunks rather than full recordings. This eliminates long-ayah drift and keeps transcription in the model's optimal accuracy range:
```python
import librosa

CHUNK_SECONDS = 10
SAMPLE_RATE = 16000

audio, sr = librosa.load("long_recitation.wav", sr=SAMPLE_RATE)

# Split into fixed-size 10-second windows (the last chunk may be shorter).
chunks = [
    audio[i:i + CHUNK_SECONDS * SAMPLE_RATE]
    for i in range(0, len(audio), CHUNK_SECONDS * SAMPLE_RATE)
]

for i, chunk in enumerate(chunks):
    result = pipe(  # `pipe` from the Basic Transcription example above
        {"array": chunk, "sampling_rate": SAMPLE_RATE},
        generate_kwargs={"language": "ar", "task": "transcribe"}
    )
    print(f"Chunk {i}: {result['text']}")
```
## Known Limitations
- **Long ayahs (20+ words):** Transcription drifts after ~15 words when processing full-length audio. Solved by chunking in production.
- **Systematic training artifact:** The phrase "وَإِيَّاكَ نَسْتَعِينُ" (Al-Fatiha 1:5) is consistently transcribed as "وَإِنَّيِّيكَنَ اسْتَعِينِ" across all reciters.
- **Diacritics:** The model sometimes produces slightly incorrect tashkeel (vowel marks). The core consonantal text is almost always correct, which is sufficient for verse identification; see the sketch after this list.
- **Non-Quran audio:** The model is specialized for Quran recitation, so general Arabic speech will produce poor results. Whisper's hallucination tokens ("شكرا", "موسيقى") appear on non-speech audio.
- **Single qira'a:** Trained primarily on the Hafs 'an 'Asim reading. Warsh, Qalun, and other qira'at are not specifically covered.
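Since the consonantal skeleton is what matters for verse identification, a matcher can strip tashkeel before comparing. A minimal sketch (the Unicode range covers the common Arabic diacritics; adapt as needed):

```python
import re

# Arabic tashkeel: fathatan through sukun (U+064B-U+0652), plus superscript alef (U+0670).
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")

def strip_tashkeel(text: str) -> str:
    """Remove vowel marks so matching depends only on the consonantal text."""
    return TASHKEEL.sub("", text)

print(strip_tashkeel("فَبِأَيِّ آلَاءِ رَبِّكُمَا تُكَذِّبَانِ"))
# فبأي آلاء ربكما تكذبان
```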
## Deployment Options
| Option | Cost | Latency | Best For |
|---|---|---|---|
| HuggingFace Inference Endpoint (GPU) | ~$0.60/hr | ~2-3s per chunk | Production apps |
| HuggingFace Inference Endpoint (CPU) | ~$0.06/hr | ~8-10s per chunk | Budget / low traffic |
| Replicate / Modal (serverless) | ~$0.0002/sec | ~2-3s per chunk | Intermittent use |
| Self-hosted (Docker + GPU) | Hardware cost | ~2-3s per chunk | Privacy, offline |
| faster-whisper (CTranslate2, CPU) | VPS ~$5/mo | ~10s per chunk | Single user, budget |
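To use the faster-whisper option, the checkpoint must first be converted to CTranslate2 format. No converted weights ship with this repo, so the conversion step and quantization choice below are assumptions:

```python
# One-time conversion (shell):
#   ct2-transformers-converter --model wasimlhr/whisper-quran-v1 \
#       --output_dir whisper-quran-ct2 --quantization int8

from faster_whisper import WhisperModel

# int8 on CPU trades a little accuracy for lower memory and faster inference.
model = WhisperModel("whisper-quran-ct2", device="cpu", compute_type="int8")

segments, info = model.transcribe("recitation.wav", language="ar", task="transcribe")
for segment in segments:
    print(segment.text)
```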
## Citation
```bibtex
@misc{whisper-quran-v1,
  title={Whisper Quran v1: Fine-tuned Whisper Large-v3 for Quranic Arabic Speech Recognition},
  author={Abdul Rahman Nasim},
  year={2026},
  url={https://huggingface.co/wasimlhr/whisper-quran-v1},
  note={Fine-tuned on Buraaq/quran-md-ayahs dataset}
}

@article{radford2022whisper,
  title={Robust speech recognition via large-scale weak supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}
```
## Acknowledgments
- OpenAI for the Whisper architecture and pre-trained weights
- Buraaq for the comprehensive ayah-level Quran recitation dataset
- The Quran reciter community for preserving and sharing these recordings
## License
This project and the fine-tuned Whisper model (wasimlhr/whisper-quran-v1) are free for personal, educational, and non-commercial use only. This includes mosques, madaris, Islamic schools, and non-profit organizations. Commercial use is strictly prohibited without prior written permission. This applies to the model weights, the application code, and any derivative works. For commercial licensing inquiries, contact via GitHub: wasimlhr/taraweeh-companion-g2