Tadabur: Quran Speech Recognition
Fine-tuned Whisper Medium on the Tadabur dataset for Quran ASR, Surah/Ayah identification, and reciter recognition.
CS465 Machine Learning Project, Spring 2026
What This Model Does
Given a Quran audio recitation, the pipeline returns:
- Arabic transcription: 6.26% WER on unseen data
- Surah & Ayah identification: fuzzy matched against all 6,236 ayahs
- Reciter name: identified from 335 supported reciters at 98.47% accuracy
Performance
ASR Results (500 held-out test samples)
| Model | WER (%) | CER (%) |
|---|---|---|
| Whisper Medium Vanilla | 41.10% | 11.47% |
| Tadabur-Whisper-Small (Author) | 47.06% | 12.28% |
| This model | 6.26% | 4.41% |
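For reference, both metrics are edit-distance ratios: WER counts word-level substitutions, insertions, and deletions against the reference transcript, CER does the same at character level. A minimal pure-Python sketch (not the evaluation code used for the table above):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, via a rolling DP row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # old dp[j] = delete, new dp[j-1] = insert, prev = substitute/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(ref, hyp):
    """Word error rate: edit distance over word tokens / reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: edit distance over characters / reference length."""
    return edit_distance(ref, hyp) / len(ref)
```

So one substituted word in a three-word reference gives a WER of 1/3.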
Reciter Classifier
| Metric | Value |
|---|---|
| Supported reciters | 335 |
| Validation accuracy | 98.47% |
| Training accuracy | 98.71% |
Files in This Repository
| File | Size | Description |
|---|---|---|
| `model.safetensors` | 3.06 GB | Fine-tuned Whisper Medium weights |
| `reciter_classifier.pt` | 2.76 MB | MLP reciter classifier |
| `reciter_idx_to_id.json` | 1.25 KB | Classifier index → reciter ID |
| `reciter_id_to_idx.json` | 1.25 KB | Reciter ID → classifier index |
| `sheikh_dict.json` | 2.7 KB | Reciter ID → Arabic name |
| `surah_dict.json` | 2.7 KB | Surah index → Arabic name |
| `quran_simple.json` | ~3 MB | Full Quran text for matching |
| `supported_reciters.txt` | – | List of all 335 supported reciters |
Quick Start
Install
```bash
pip install transformers torch librosa rapidfuzz huggingface_hub
```
Transcription only
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa, torch

MODEL = "rakansuliman/tadabur-whisper-medium"
processor = WhisperProcessor.from_pretrained(MODEL)
model = WhisperForConditionalGeneration.from_pretrained(MODEL)
model.eval()

audio, _ = librosa.load("recitation.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

with torch.no_grad():
    ids = model.generate(
        inputs,
        language="arabic",
        task="transcribe",
        max_new_tokens=225,
        suppress_tokens=[],
        forced_decoder_ids=None,
    )
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```
Full pipeline (transcription + reciter)
```python
from huggingface_hub import hf_hub_download
import torch, torch.nn as nn, json

MODEL = "rakansuliman/tadabur-whisper-medium"

# Download classifier files
hf_hub_download(MODEL, "reciter_classifier.pt", local_dir="./")
hf_hub_download(MODEL, "reciter_idx_to_id.json", local_dir="./")
hf_hub_download(MODEL, "sheikh_dict.json", local_dir="./")

# Define classifier (must match training architecture)
class ReciterClassifier(nn.Module):
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.net(x)

# Load mappings
with open("reciter_idx_to_id.json") as f:
    idx_to_id = {int(k): int(v) for k, v in json.load(f).items()}
with open("sheikh_dict.json", encoding="utf-8-sig") as f:
    sheikh = {int(v): k for k, v in json.load(f).items()}

# Load classifier
clf = ReciterClassifier(1024, len(idx_to_id))
clf.load_state_dict(torch.load("reciter_classifier.pt", map_location="cpu"))
clf.eval()

# Run encoder + classify reciter
# ("model" and "inputs" come from the transcription snippet above)
with torch.no_grad():
    encoder_out = model.model.encoder(inputs)
    embedding = encoder_out.last_hidden_state.mean(dim=1).float()
    logits = clf(embedding)

pred_idx = logits.argmax(dim=1).item()
confidence = torch.softmax(logits, dim=1).max().item()
reciter_id = idx_to_id[pred_idx]
reciter_name = sheikh.get(reciter_id, f"ID {reciter_id}")
print(f"Reciter: {reciter_name} ({confidence*100:.1f}%)")
```
Architecture
```text
Audio Input (mic / file / video)
            │
     Whisper Encoder  (runs once, shared)
      ├── Whisper Decoder → Arabic text
      └── MLP Classifier → Reciter name
            │
RapidFuzz matching against 6,236 ayahs
            │
Surah name + Ayah number + confidence
```
Reciter Classifier Architecture
```text
Linear(1024→512) → BatchNorm → ReLU → Dropout(0.3)
  → Linear(512→256) → BatchNorm → ReLU → Dropout(0.2)
  → Linear(256→335)
```
Training Details
ASR Fine-tuning
- Base model: `openai/whisper-medium`
- Dataset: 9,432 samples (1 shard of Tadabur)
- Hardware: NVIDIA RTX 4090 (24 GB VRAM)
- Batch size: 8 × 4 gradient accumulation = 32 effective
- Learning rate: 1e-5, cosine schedule with 500 warmup steps
- Precision: FP16
- Best checkpoint: step 10,000
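The hyperparameters above map onto a Hugging Face `Seq2SeqTrainingArguments` configuration roughly as follows. This is a sketch, not the exact training script; the output directory is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./tadabur-whisper-medium",  # placeholder path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,          # 8 x 4 = 32 effective batch size
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    fp16=True,
    max_steps=10_000,                       # best checkpoint reported at step 10,000
)
```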
Reciter Classifier
- Training data: 500 shards (~325k samples, 335 reciters)
- Phase 1: Extract Whisper encoder embeddings shard-by-shard
- Phase 2: Train MLP on pre-extracted embeddings (15 min)
- Optimizer: AdamW with cosine annealing
- Epochs: 20, Batch size: 256
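Because embeddings are pre-extracted in Phase 1, Phase 2 reduces to training a small MLP on fixed 1024-dim vectors, which is why it finishes in minutes. A minimal sketch of Phase 2, with random tensors standing in for the real pooled encoder embeddings and labels (the epoch count is shortened here for illustration):

```python
import torch, torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for Phase 1 output: mean-pooled 1024-dim Whisper encoder embeddings.
NUM_CLASSES, DIM = 335, 1024
emb = torch.randn(2048, DIM)
labels = torch.randint(0, NUM_CLASSES, (2048,))
loader = DataLoader(TensorDataset(emb, labels), batch_size=256, shuffle=True)

# Same MLP shape as the repository's reciter classifier
clf = nn.Sequential(
    nn.Linear(DIM, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(256, NUM_CLASSES),
)
opt = torch.optim.AdamW(clf.parameters(), lr=1e-3)  # lr is an assumption
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=20)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(2):  # the card reports 20 epochs
    for x, y in loader:
        opt.zero_grad()
        loss_fn(clf(x), y).backward()
        opt.step()
    sched.step()
```

Freezing the encoder and touching only these ~700k MLP weights is the main design choice: it keeps ASR quality intact while adding reciter recognition cheaply.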
Supported Reciters
See `supported_reciters.txt` for the full list of 335 supported reciters, including:
عبد الباسط عبد الصمد، محمد صديق المنشاوي، ياسر الدوسري، سعود الشريم، ماهر المعيقلي، عبدالرحمن السديس، and 329 more.
Limitations
- ASR trained on 1 shard only; may have reduced generalization on rare recitation styles
- Reciter classifier covers 335 of 671 total reciters in the dataset
- Surah/Ayah matching accuracy depends on transcription quality
- Model optimized for standard Hafs recitation style
Citation
```bibtex
@misc{suliman2026tadabur,
  author = {Suliman, Rakan and Mamdoh, Abdulrahman and Aldosari, Hussam},
  title  = {Tadabur: Quran ASR with Surah/Ayah Identification and Reciter Recognition},
  year   = {2026},
  url    = {https://huggingface.co/rakansuliman/tadabur-whisper-medium}
}
```
License
CC BY-NC 4.0. Research and educational use only. Please engage with Quran content respectfully.