Whisper Medium: Fine-tuned for Arabic ASR on SADA

openai/whisper-medium fine-tuned on the full SADA22 dataset (~420 hours of Saudi Arabic speech) for Arabic automatic speech recognition (ASR).

Used as Baseline 2 in experiments on predicting the Arabic Level of Dialectness (ALDi) from speech: the transcript produced by this model is fed into a text-based ALDi classifier to obtain a dialect score.


Training details

| Setting | Value |
|---|---|
| Base model | openai/whisper-medium (~764M parameters) |
| Dataset | SADA22 (full, ~420 h of Saudi Arabic) |
| Language | Arabic |
| Task | transcribe |
| Epochs | 4 |
| Learning rate | 1e-5 |
| Batch size | 8 |
| Gradient accumulation steps | 1 |
| Warmup ratio | 0.1 |
| FP16 | yes |
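The hyperparameters above map onto a `transformers` `Seq2SeqTrainingArguments` object roughly as follows. This is a sketch: `output_dir` and `predict_with_generate` are assumptions on my part, not settings stated in the card.

```python
from transformers import Seq2SeqTrainingArguments

# Reconstruction of the training configuration from the table above.
# output_dir and predict_with_generate are assumed, not documented.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-medium-sada",   # assumption
    num_train_epochs=4,
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    warmup_ratio=0.1,
    fp16=True,
    predict_with_generate=True,         # assumption: usual for Whisper fine-tunes
)
```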

Quick Start

1. Install dependencies

pip install torch "transformers>=4.27" torchaudio safetensors

2. Transcribe an audio file

import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("wageehkhad/whisper-medium-finetuned-sada-asr")
processor = WhisperProcessor.from_pretrained("wageehkhad/whisper-medium-finetuned-sada-asr")
model.eval()  # inference only: disable dropout


def transcribe(audio_path: str, device: str = "cpu") -> str:
    """
    Transcribe an Arabic audio file.
    Accepts any format supported by torchaudio (WAV, FLAC, MP3, etc.).
    """
    wav, sr = torchaudio.load(audio_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(0).numpy()

    inputs = processor(
        wav,
        sampling_rate=16_000,
        return_tensors="pt",
        return_attention_mask=True,
    ).to(device)
    model.to(device)

    with torch.no_grad():
        token_ids = model.generate(
            inputs.input_features,
            attention_mask=inputs.attention_mask,
            language="arabic",
            task="transcribe",
        )

    return processor.batch_decode(token_ids, skip_special_tokens=True)[0]

# Example
print(transcribe("example.wav"))

If loading the audio raises

ImportError: TorchCodec is required for load_with_torchcodec

install the optional decoding backend:

pip install torchcodec

3. Using as part of the Baseline 2 ALDi pipeline

This model is the ASR component of a two-step pipeline:

Audio → [this model] → Arabic transcript → [AMR-KELEG/Sentence-ALDi] → ALDi score

from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
import torch

# Step 1: transcribe
asr = pipeline("automatic-speech-recognition",
               model="wageehkhad/whisper-medium-finetuned-sada-asr")
result = asr("speech.wav", generate_kwargs={"language": "arabic", "task": "transcribe"})
transcript = result["text"]

# Step 2: score dialect level
aldi_tok = AutoTokenizer.from_pretrained("AMR-KELEG/Sentence-ALDi")
aldi_mdl = AutoModelForSequenceClassification.from_pretrained("AMR-KELEG/Sentence-ALDi")
aldi_mdl.eval()

enc = aldi_tok(transcript, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    score = float(aldi_mdl(**enc).logits.squeeze())

print(f"Transcript : {transcript}")
print(f"ALDi score : {score:.3f}")  # 0.0 = MSA, 1.0 = heavy dialect
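The Sentence-ALDi regressor emits a single unbounded logit; to my understanding the ALDi work clips scores to the [0, 1] range at inference, so it may be worth clamping the raw value before reporting it. A small helper (not part of either model's API):

```python
def clamp_aldi(raw: float) -> float:
    """Clip a raw Sentence-ALDi regression output to the [0, 1] ALDi range."""
    return min(max(raw, 0.0), 1.0)

print(clamp_aldi(1.37))   # 1.0  (heavy dialect, saturated)
print(clamp_aldi(-0.02))  # 0.0  (MSA)
print(clamp_aldi(0.64))   # 0.64 (unchanged: already in range)
```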

Limitations

  • Fine-tuned on Saudi Arabic (SADA22). WER on other Arabic dialects or MSA broadcasts will be higher.
  • Optimised for speech segments up to 30 seconds (Whisper's native window). Longer files are chunked automatically by the pipeline.
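The `transformers` pipeline handles long files for you (e.g. via its `chunk_length_s` argument), but if you call the model directly as in the `transcribe()` example above, you would need to split the waveform yourself. A minimal sketch of computing 30-second window boundaries with a small overlap so words at the seams are not lost; the 5 s overlap is an illustrative choice, not from the model card:

```python
def chunk_bounds(n_samples: int, sr: int = 16_000,
                 window_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) sample indices that cover a waveform of
    n_samples in fixed windows with an overlap between neighbours."""
    window = int(window_s * sr)
    step = int((window_s - overlap_s) * sr)
    bounds = []
    start = 0
    while start < n_samples:
        bounds.append((start, min(start + window, n_samples)))
        if start + window >= n_samples:
            break
        start += step
    return bounds

# 70 s of 16 kHz audio -> three overlapping windows
print(chunk_bounds(70 * 16_000))
# [(0, 480000), (400000, 880000), (800000, 1120000)]
```

Each `(start, end)` slice can then be passed through `transcribe()` and the partial transcripts concatenated (deduplicating overlap text is left to the caller).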

Citation

If you use this model, please cite the ALDi paper and the SADA dataset:

@inproceedings{keleg2023aldi,
  title     = {ALDi: Quantifying the Arabic Level of Dialectness of Text},
  author    = {Keleg, Amr and Goldwater, Sharon and Magdy, Walid},
  booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2023},
  publisher = {Association for Computational Linguistics},
  address   = {Singapore},
  url       = {https://aclanthology.org/2023.emnlp-main.655}
}

@misc{sada22,
  author       = {Al-Gamdi, Ahmed and others},
  title        = {SADA: Saudi Audio Dataset for Arabic},
  year         = {2022},
  howpublished = {\url{https://huggingface.co/datasets/MohamedRashad/SADA22}},
  note         = {Accessed 2026}
}