# Whisper Medium Fine-tuned for Arabic ASR on SADA
`openai/whisper-medium` fine-tuned on the full SADA22 dataset (~420 hours of Saudi Arabic speech) for Arabic automatic speech recognition (ASR).
Used as Baseline 2 in experiments on predicting the Arabic Level of Dialectness (ALDi) from speech: the transcript produced by this model is fed into a text-based ALDi classifier to obtain a dialect score.
## Training details
| Setting | Value |
|---|---|
| Base model | openai/whisper-medium (~764M parameters) |
| Dataset | SADA22 (full, ~420 h of Saudi Arabic) |
| Language | Arabic |
| Task | transcribe |
| Epochs | 4 |
| Learning rate | 1e-5 |
| Batch size | 8 |
| Gradient accumulation steps | 1 |
| Warmup ratio | 0.1 |
| FP16 | yes |
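The table above maps onto a `Seq2SeqTrainingArguments` configuration roughly as sketched below. Only the listed values come from the table; the output directory and every unlisted argument are assumptions, not the exact settings used for this model:

```python
from transformers import Seq2SeqTrainingArguments

# Values from the training-details table; everything else is an assumed default.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-sada",  # hypothetical path
    num_train_epochs=4,
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    warmup_ratio=0.1,
    fp16=True,
)
```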
## Quick Start

### 1. Install dependencies

```bash
pip install torch "transformers>=4.27" torchaudio safetensors
```
### 2. Transcribe an audio file

```python
import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("wageehkhad/whisper-medium-finetuned-sada-asr")
processor = WhisperProcessor.from_pretrained("wageehkhad/whisper-medium-finetuned-sada-asr")

def transcribe(audio_path: str, device: str = "cpu") -> str:
    """
    Transcribe an Arabic audio file.

    Accepts any format supported by torchaudio (WAV, FLAC, MP3, etc.).
    """
    wav, sr = torchaudio.load(audio_path)
    # Whisper expects 16 kHz mono input.
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(0).numpy()
    inputs = processor(
        wav,
        sampling_rate=16_000,
        return_tensors="pt",
        return_attention_mask=True,
    ).to(device)
    model.to(device)
    with torch.no_grad():
        token_ids = model.generate(
            inputs.input_features,
            attention_mask=inputs.attention_mask,
            language="arabic",
            task="transcribe",
        )
    return processor.batch_decode(token_ids, skip_special_tokens=True)[0]

# Example
print(transcribe("example.wav"))
```
If you get the error

```
ImportError: TorchCodec is required for load_with_torchcodec
```

install the missing backend:

```bash
pip install torchcodec
```
### 3. Using as part of the Baseline 2 ALDi pipeline

This model is the ASR component of a two-step pipeline:

```
Audio → [this model] → Arabic transcript → [AMR-KELEG/Sentence-ALDi] → ALDi score
```
```python
import torch
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

# Step 1: transcribe
asr = pipeline(
    "automatic-speech-recognition",
    model="wageehkhad/whisper-medium-finetuned-sada-asr",
)
result = asr("speech.wav", generate_kwargs={"language": "arabic", "task": "transcribe"})
transcript = result["text"]

# Step 2: score dialect level
aldi_tok = AutoTokenizer.from_pretrained("AMR-KELEG/Sentence-ALDi")
aldi_mdl = AutoModelForSequenceClassification.from_pretrained("AMR-KELEG/Sentence-ALDi")
aldi_mdl.eval()

enc = aldi_tok(transcript, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    score = float(aldi_mdl(**enc).logits.squeeze())

print(f"Transcript : {transcript}")
print(f"ALDi score : {score:.3f}")  # 0.0 = MSA, 1.0 = heavy dialect
```
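For downstream reporting it can help to bucket the continuous ALDi score into coarse labels. The helper below is a sketch with purely illustrative thresholds (they are not defined by the ALDi paper); adjust the cutoffs to your data:

```python
def dialectness_label(score: float) -> str:
    """Map a continuous ALDi score in [0, 1] to a coarse, illustrative label."""
    # A regression head can emit values slightly outside [0, 1]; clamp first.
    score = max(0.0, min(1.0, score))
    if score < 0.2:
        return "mostly MSA"
    if score < 0.6:
        return "mixed"
    return "heavily dialectal"

print(dialectness_label(0.05))  # mostly MSA
print(dialectness_label(0.85))  # heavily dialectal
```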
## Limitations
- Fine-tuned on Saudi Arabic (SADA22). WER on other Arabic dialects or MSA broadcasts will be higher.
- Optimised for speech segments up to 30 seconds (Whisper's native window). Longer files are chunked automatically by the pipeline.
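For long recordings, the `transformers` ASR pipeline's chunking can also be requested explicitly via `chunk_length_s`, as in this sketch (the filename is a placeholder; this downloads the model and is not runnable offline):

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="wageehkhad/whisper-medium-finetuned-sada-asr",
    chunk_length_s=30,  # Whisper's native window
)
result = asr("long_interview.wav",  # placeholder path
             generate_kwargs={"language": "arabic", "task": "transcribe"})
print(result["text"])
```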
## Citation

If you use this model, please cite the ALDi paper and the SADA dataset:
```bibtex
@inproceedings{keleg2023aldi,
  title     = {ALDi: Quantifying the Arabic Level of Dialectness of Text},
  author    = {Keleg, Amr and Goldwater, Sharon and Magdy, Walid},
  booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2023},
  publisher = {Association for Computational Linguistics},
  address   = {Singapore},
  url       = {https://aclanthology.org/2023.emnlp-main.655}
}

@misc{sada22,
  author       = {Al-Gamdi, Ahmed and others},
  title        = {SADA: Saudi Audio Dataset for Arabic},
  year         = {2022},
  howpublished = {\url{https://huggingface.co/datasets/MohamedRashad/SADA22}},
  note         = {Accessed 2026}
}
```