Whisper Small Darija ASR trained on merged diarized datasets

This repository hosts a fine-tuned Whisper checkpoint for Moroccan Darija ASR. The model was fine-tuned on a merged corpus built from two Hugging Face datasets (TTS-clean metadata and speech-to-text metadata), enriched with diarization-derived analysis features and filtered with ASR usability signals.

Training data

This model was trained on a merged Darija speech corpus built from two Hugging Face datasets, identified throughout this card as darija_tts_clean_metadata_full and darija_speech_to_text_metadata_full.

Data preparation and analysis workflow

The merged corpus was not treated as a plain list of audio/transcript pairs. It was enriched with diarization-derived analysis features and ASR usability signals before training and evaluation. The workflow used the following ideas:

  • Speaker structure analysis: number of speakers, speaker turns, turns per minute, dominant-speaker ratio, second-speaker ratio, speaker-balance score, and speaker entropy.
  • Overlap analysis: overlap duration, overlap ratio, number of overlap regions, mean/max overlap region duration, and maximum concurrent speakers.
  • Transcript-length and pacing analysis: character length, approximate word length, token proxies, characters per second, and tokens per second.
  • Usability filtering: the corpus includes fields such as asr_usability_score, usability_score_custom, quality_bin, and trainability/filter flags. These were used to monitor whether a segment was likely to be useful for ASR fine-tuning.
  • Export control: training/evaluation used the 16 kHz successful audio-export paths (audio_path_16k) and focused on samples that were successfully prepared for training.
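The usability-filtering and export-control steps above can be sketched as a simple row filter. The field names (asr_usability_score, quality_bin, audio_path_16k) come from this card; the threshold, the "reject" label, and the record layout are illustrative assumptions, not the project's actual code.

```python
# Hypothetical sketch of the usability filtering described above.
# Threshold and labels are assumptions; field names come from the card.

def keep_for_training(row, min_score=0.5):
    """Keep a segment only if it exported at 16 kHz and looks usable for ASR."""
    return (
        row.get("audio_path_16k") is not None       # successful 16 kHz export
        and row.get("quality_bin") != "reject"      # assumed reject label
        and row.get("asr_usability_score", 0.0) >= min_score
    )

corpus = [
    {"audio_path_16k": "a.wav", "quality_bin": "high",   "asr_usability_score": 0.9},
    {"audio_path_16k": None,    "quality_bin": "high",   "asr_usability_score": 0.9},
    {"audio_path_16k": "b.wav", "quality_bin": "reject", "asr_usability_score": 0.2},
]
train_rows = [r for r in corpus if keep_for_training(r)]
```

Only the first toy row survives the filter: it has a 16 kHz export path, a non-reject quality bin, and a score above the threshold.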

Why diarization and usability scores matter

These metadata fields were used to understand where the model fails, not only how much. In practice, they let us stratify results by:

  • single-speaker vs multi-speaker speech,
  • overlap vs no overlap,
  • slow / medium / fast speaking rate,
  • short / medium / long transcripts,
  • source dataset and original split,
  • harder vs easier segments according to the usability-related fields.

This makes the evaluation much more informative than a single global WER/CER number.

Training setup

  • Base checkpoint: openai/whisper-small
  • Fine-tuned model size: small
  • Total update steps used in the reported run: 10000
  • Training schedule note from the experiment log: 5000 × 2 steps, corresponding to roughly 10 epochs in the author’s setup
  • Training corpus: merged Darija datasets listed above
  • Task: Moroccan Darija ASR / transcription

Important note about epochs

The exact number of effective epochs in sequence-to-sequence training depends on the effective batch size, including gradient accumulation. In this card, the run is described using the author’s experiment note: 5000 × 2 steps, around 10 epochs.
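The epoch arithmetic referred to above can be made concrete. All values below except total_steps are assumptions chosen so the result matches the author's "around 10 epochs" note; the card does not state the batch size, gradient-accumulation factor, or corpus size.

```python
# Illustrative only: effective-epoch arithmetic with assumed values.
dataset_size = 32_000        # assumed number of training segments
per_device_batch = 16        # assumed
grad_accum = 2               # assumed
total_steps = 10_000         # from the card (5000 x 2 steps)

effective_batch = per_device_batch * grad_accum
samples_seen = total_steps * effective_batch
epochs = samples_seen / dataset_size   # -> 10.0 under these assumptions
```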

Evaluation protocol

Both checkpoints (this fine-tuned model and the comparison checkpoint) were evaluated with the same decoding and scoring pipeline on the held-out merged test split.

Inference settings

  • audio loaded from the 16 kHz mono waveform path,
  • Whisper processor used with padding and attention masks,
  • generation run with:
    • language="arabic"
    • task="transcribe"
    • max_new_tokens=225
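The generation settings above can be wired into a minimal transformers inference sketch. The GEN_KWARGS values come from this card; everything else (the lazy imports, the helper name, the default model id) is an assumption, and this sketch has not been run against this specific checkpoint.

```python
# Generation settings taken from the card.
GEN_KWARGS = {"language": "arabic", "task": "transcribe", "max_new_tokens": 225}

def transcribe(audio_16k_mono, model_id="openai/whisper-small"):
    """Sketch: decode one 16 kHz mono waveform (a 1-D float array)."""
    # Imported lazily so the settings above can be inspected without torch installed.
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    inputs = processor(
        audio_16k_mono, sampling_rate=16000,
        return_tensors="pt", return_attention_mask=True,
    )
    with torch.no_grad():
        ids = model.generate(
            inputs.input_features,
            attention_mask=inputs.get("attention_mask"),
            **GEN_KWARGS,
        )
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```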

Text normalization

Before scoring, references and predictions were normalized conservatively by:

  • converting to string,
  • trimming leading/trailing whitespace,
  • collapsing repeated whitespace into a single space.
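The three normalization steps above amount to a one-line helper; this is a straightforward reading of the card's description rather than the project's exact code.

```python
import re

def normalize(text):
    """Conservative scoring normalization: cast to str, trim leading/trailing
    whitespace, and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", str(text).strip())
```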

Metrics

The main reported metrics are:

  • WER: word error rate,
  • CER: character error rate,
  • exact-match rate.
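The card does not name the scoring library, so the metrics above can be illustrated with a minimal edit-distance implementation. WER counts word-level edits over reference words (which is why it can exceed 100 on short references), CER counts character-level edits over reference characters, and exact match is a strict string comparison.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def exact_match(ref, hyp):
    return float(ref == hyp)
```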

Stratified testing

After global evaluation, results were re-aggregated by:

  • duration_bucket
  • speaker_bucket
  • overlap_bucket
  • turn_rate_bucket
  • speech_rate_bucket
  • text_length_bucket
  • quality_bucket
  • source_dataset_bucket
  • source_split_bucket
  • multiple-speaker and overlap flags.

This was done to separate acoustic difficulty, segmentation difficulty, transcript difficulty, and source-domain effects.
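The re-aggregation step can be sketched as pooling per-sample error counts by bucket and reporting one corpus-level WER per group. The per-sample fields and toy values here are illustrative assumptions; the "corpus_wer" columns below suggest this pooled (rather than per-sample-averaged) style of aggregation.

```python
from collections import defaultdict

def corpus_wer_by(rows, key):
    """Pool word errors and reference lengths per bucket -> corpus WER (%)."""
    pooled = defaultdict(lambda: [0, 0])   # bucket -> [errors, ref words]
    for r in rows:
        pooled[r[key]][0] += r["word_errors"]
        pooled[r[key]][1] += r["ref_words"]
    return {k: 100.0 * e / n for k, (e, n) in pooled.items()}

# Toy per-sample results, not real project data.
rows = [
    {"speech_rate_bucket": "slow_speech",   "word_errors": 8, "ref_words": 10},
    {"speech_rate_bucket": "medium_speech", "word_errors": 2, "ref_words": 10},
    {"speech_rate_bucket": "medium_speech", "word_errors": 1, "ref_words": 10},
]
by_rate = corpus_wer_by(rows, "speech_rate_bucket")
# slow_speech -> 80.0, medium_speech -> 15.0 on this toy data
```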

Main results

Global metrics

  • Test samples: 1878
  • WER: 40.52
  • CER: 19.37
  • Exact-match rate: 6.18%

Key findings

  • Global test performance: WER 40.52 and CER 19.37 on 1878 test utterances.
  • Exact-match rate on the test set: 6.18%.
  • The main duration buckets are well populated: short clips (709 samples), medium clips (816 samples), and very long clips (347 samples).
  • The 8–15s bucket has only 6 samples, so that bucket should be interpreted cautiously.
  • The hardest large speech-rate condition is slow_speech (WER 78.78), while medium_speech is much easier (WER 37.92).
  • fast_speech remains challenging but is still better than slow speech in this evaluation (WER 44.26).
  • Very short transcripts are the hardest text condition: very_short_text reaches WER 115.74, much worse than short_text at WER 44.42.
  • Longer transcripts are more stable than the very-short-text regime in this setup (long_text WER 37.93).
  • Source-domain breakdown is informative: darija_tts_clean_metadata_full contributes 81.58% of the test set, while darija_speech_to_text_metadata_full contributes 18.42%.
  • Overlap is not the dominant factor in this test set because no_overlap accounts for 90.20% of samples, while heavy_overlap has only 35 samples.

Duration breakdown

group_value        n_samples   corpus_wer   corpus_cer   exact_match_rate
medium_3_8s              816        44.92        23.00               7.11
short_<3s                709        42.22        18.97               8.18
very_long_>=15s          347        38.89        18.46               0.00
long_8_15s                 6        87.84        53.21               0.00

Speech-rate breakdown

group_value        n_samples   corpus_wer   corpus_cer   exact_match_rate
medium_speech           1341        37.92        17.10               5.59
slow_speech              291        78.78        51.81               8.93
fast_speech              246        44.26        23.33               6.10

Text-length breakdown

group_value        n_samples   corpus_wer   corpus_cer   exact_match_rate
short_text              1392        44.42        22.10               7.04
long_text                341        37.93        17.55               0.00
very_short_text          125       115.74        81.92              14.40
medium_text               20        49.75        26.90               0.00

Source-dataset breakdown

group_value                           n_samples   corpus_wer   corpus_cer   exact_match_rate
darija_tts_clean_metadata_full             1532        46.04        23.41               7.57
darija_speech_to_text_metadata_full         346        37.99        17.62               0.00

Speaker breakdown

group_value        n_samples   corpus_wer   corpus_cer   exact_match_rate
single_speaker          1692        40.63        19.30               6.68
two_speakers             186        40.24        19.57               1.61

Overlap breakdown

group_value        n_samples   corpus_wer   corpus_cer   exact_match_rate
no_overlap              1694        40.61        19.31               6.67
light_overlap            149        39.02        18.69               2.01
heavy_overlap             35        45.36        23.03               0.00

Comparison with the base checkpoint

On the same test split, the small checkpoint outperforms the base checkpoint by:

  • 12.63 WER points
  • 6.38 CER points

This indicates that, under this training setup and this merged dataset, the small checkpoint was the stronger fine-tuned model.

Intended use

This model is intended for automatic speech recognition of Moroccan Darija on audio that is reasonably close to the training domain represented by the merged datasets.

Recommended use cases

  • research on Moroccan Darija ASR,
  • baseline or fine-tuning starting point for Darija transcription,
  • analysis of model behavior under different diarization/usability regimes.

Less reliable cases

  • wrong-language or strongly non-Darija audio,
  • aggressive code-switching,
  • extremely short reference segments,
  • highly mismatched segmentation between audio and transcript,
  • very rare bucket conditions with only a few evaluation samples.

Limitations and interpretation notes

  • Very small buckets should not be over-interpreted. In this project, some strata contain only a handful of samples.
  • WER can be much higher than CER because a near-miss word can still count as a full word error while remaining close at the character level.
  • Very short transcripts can produce unstable WER values and may amplify insertion-heavy failures.
  • Language mismatch / code-switching / annotation mismatch can inflate error rates and should be checked during listening-based analysis.
  • Because Moroccan Darija orthography is not fully standardized, part of the residual error may reflect spelling variation rather than a true semantic ASR failure.

Repository contents

This repo contains the trained checkpoint and, when available, evaluation artifacts such as:

  • test_group_summaries.json
  • test_metrics.json
  • test_predictions_preview.json
  • test_predictions_detailed.csv
  • test_predictions_detailed.jsonl

If test_group_summaries.json and test_metrics.json are present, they provide the full numeric breakdown behind the summary tables shown in this model card.

Model: EtMmohammedHafsati/whisper-small-darija-merged-diarized (Safetensors, ~0.2B parameters, F32).