Whisper Small Darija ASR trained on merged diarized datasets
This repository hosts a fine-tuned Whisper checkpoint for Moroccan Darija ASR. It was trained on a merged corpus built from two Hugging Face datasets, enriched with diarization-derived analysis features, and filtered for ASR usability:
- EtMmohammedHafsati/darija_tts_clean_metadata_full
- EtMmohammedHafsati/darija_speech_to_text_metadata_full
Training data
The training corpus is the merged Darija speech corpus built from the two Hugging Face datasets listed above.
Data preparation and analysis workflow
The merged corpus was not treated as a plain list of audio/transcript pairs. Before training and evaluation it was enriched with diarization-derived analysis features and ASR usability signals. The workflow covered:
- Speaker structure analysis: number of speakers, speaker turns, turns per minute, dominant-speaker ratio, second-speaker ratio, speaker-balance score, and speaker entropy.
- Overlap analysis: overlap duration, overlap ratio, number of overlap regions, mean/max overlap region duration, and maximum concurrent speakers.
- Transcript-length and pacing analysis: character length, approximate word length, token proxies, characters per second, and tokens per second.
- Usability filtering: the corpus includes fields such as asr_usability_score, usability_score_custom, quality_bin, and trainability/filter flags. These were used to monitor whether a segment was likely to be useful for ASR fine-tuning.
- Export control: training/evaluation used the 16 kHz successful audio-export paths (audio_path_16k) and focused on samples that were successfully prepared for training.
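The speaker-structure features above can be sketched from raw diarization output. This is a minimal illustration, not the project's published pipeline; the segment representation and function name are assumptions, while the output keys mirror the corpus fields listed above.

```python
import math

def speaker_features(segments, total_duration):
    """Sketch of speaker-structure features from diarization output.

    `segments` is a hypothetical list of (speaker_id, start, end) tuples;
    the returned keys mirror the corpus metadata fields described above.
    """
    # Total speaking time per speaker
    talk = {}
    for spk, start, end in segments:
        talk[spk] = talk.get(spk, 0.0) + (end - start)

    total_talk = sum(talk.values())
    ratios = sorted((t / total_talk for t in talk.values()), reverse=True)

    # Speaker turns: count speaker changes between consecutive segments
    ordered = sorted(segments, key=lambda s: s[1])
    turns = sum(1 for a, b in zip(ordered, ordered[1:]) if a[0] != b[0])

    # Shannon entropy over speaking-time shares (0 for a single speaker)
    entropy = -sum(r * math.log2(r) for r in ratios if r > 0)

    return {
        "num_speakers": len(talk),
        "speaker_turns": turns,
        "turns_per_minute": turns / (total_duration / 60.0),
        "dominant_speaker_ratio": ratios[0],
        "second_speaker_ratio": ratios[1] if len(ratios) > 1 else 0.0,
        "speaker_entropy": entropy,
    }
```

Overlap features would follow the same pattern, scanning for time regions where two or more segments intersect.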
Why diarization and usability scores matter
These metadata were used to understand where the model fails, not only how much it fails. In practice, they let us stratify results by:
- single-speaker vs multi-speaker speech,
- overlap vs no overlap,
- slow / medium / fast speaking rate,
- short / medium / long transcripts,
- source dataset and original split,
- harder vs easier segments according to the usability-related fields.
This makes the evaluation much more informative than a single global WER/CER number.
Training setup
- Base checkpoint: openai/whisper-small
- Fine-tuned model size: small
- Total update steps used in the reported run: 10000
- Training schedule note from the experiment log: 5000 × 2 steps, corresponding to roughly 10 epochs in the author’s setup
- Training corpus: merged Darija datasets listed above
- Task: Moroccan Darija ASR / transcription
Important note about epochs
The exact number of effective epochs in sequence-to-sequence training depends on the effective batch size, including gradient accumulation. In this card, the run is described using the author’s experiment note: 5000 × 2 steps, around 10 epochs.
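The relation between update steps and epochs can be made concrete with a back-of-envelope calculation. The batch size and dataset size are not reported in this card, so the numbers below are purely illustrative:

```python
def approx_epochs(update_steps, per_device_batch, grad_accum, dataset_size):
    """Effective epochs = total samples seen / dataset size."""
    effective_batch = per_device_batch * grad_accum
    return update_steps * effective_batch / dataset_size

# Illustrative values only: 10000 update steps with an effective batch
# of 16 over a hypothetical ~16000-sample corpus gives ~10 epochs,
# consistent with the author's note.
print(approx_epochs(10000, 8, 2, 16000))  # 10.0
```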
Evaluation protocol
Both checkpoints were evaluated with the same decoding and scoring pipeline on the held-out merged test split.
Inference settings
- audio loaded from the 16 kHz mono waveform path,
- Whisper processor used with padding and attention masks,
- generation run with:
  - language="arabic"
  - task="transcribe"
  - max_new_tokens=225
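The inference settings above can be sketched as a single helper. This is an assumed reconstruction using the transformers and librosa libraries, not the author's exact evaluation script:

```python
def transcribe(audio_path,
               model_id="EtMmohammedHafsati/whisper-small-darija-merged-diarized"):
    """Sketch of the decoding settings above; assumes transformers,
    librosa, and torch are installed."""
    import librosa
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)

    # Load the 16 kHz mono waveform (matching the audio_path_16k exports)
    waveform, _ = librosa.load(audio_path, sr=16000, mono=True)

    # Processor with padding and attention mask, as described above
    inputs = processor(waveform, sampling_rate=16000,
                       return_tensors="pt", return_attention_mask=True)
    with torch.no_grad():
        ids = model.generate(inputs.input_features,
                             attention_mask=inputs.attention_mask,
                             language="arabic", task="transcribe",
                             max_new_tokens=225)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```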
Text normalization
Before scoring, references and predictions were normalized conservatively by:
- converting to string,
- trimming leading/trailing whitespace,
- collapsing repeated whitespace into a single space.
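The three normalization steps above amount to a small helper like the following (the function name is illustrative):

```python
import re

def normalize_text(text):
    """Conservative normalization applied to references and predictions."""
    text = str(text)                  # convert to string
    text = text.strip()               # trim leading/trailing whitespace
    text = re.sub(r"\s+", " ", text)  # collapse repeated whitespace
    return text
```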
Metrics
The main reported metrics are:
- WER: word error rate,
- CER: character error rate,
- exact-match rate.
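In practice these metrics are usually computed with a library such as jiwer; a self-contained sketch using edit distance looks like this:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def exact_match(ref, hyp):
    return float(ref == hyp)
```

For example, the near-miss pair "salam alaykum" vs "salam alikum" scores WER 0.5 (one of two words wrong) but CER only about 0.15, which is why WER can sit far above CER.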
Stratified testing
After global evaluation, results were re-aggregated by:
- duration_bucket
- speaker_bucket
- overlap_bucket
- turn_rate_bucket
- speech_rate_bucket
- text_length_bucket
- quality_bucket
- source_dataset_bucket
- source_split_bucket
- multiple-speaker and overlap flags.
This was done to separate acoustic difficulty, segmentation difficulty, transcript difficulty, and source-domain effects.
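The re-aggregation by bucket amounts to a group-by over per-sample results. A minimal pure-Python sketch, with illustrative row fields (`wer_errors`, `ref_words`, `exact`) that are not the project's actual schema:

```python
from collections import defaultdict

def stratify(rows, bucket_field):
    """Aggregate per-sample metrics into corpus-level metrics per bucket."""
    groups = defaultdict(lambda: {"n": 0, "errors": 0, "words": 0, "exact": 0})
    for row in rows:
        g = groups[row[bucket_field]]
        g["n"] += 1
        g["errors"] += row["wer_errors"]
        g["words"] += row["ref_words"]
        g["exact"] += row["exact"]
    # Corpus-level WER per bucket: total word errors / total reference words
    return {k: {"n_samples": v["n"],
                "corpus_wer": 100.0 * v["errors"] / v["words"],
                "exact_match_rate": 100.0 * v["exact"] / v["n"]}
            for k, v in groups.items()}
```

Computing corpus-level WER from pooled error and word counts (rather than averaging per-utterance WERs) matches how the breakdown tables below are usually produced.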
Main results
Global metrics
- Test samples: 1878
- WER: 40.52
- CER: 19.37
- Exact-match rate: 6.18%
Key findings
- Global test performance: WER 40.52 and CER 19.37 on 1878 test utterances.
- Exact-match rate on the test set: 6.18%.
- The main duration buckets are well populated and behave consistently: short clips (709 samples), medium clips (816 samples), and very long clips (347 samples) all fall in a similar WER range.
- The 8–15s bucket has only 6 samples, so that bucket should be interpreted cautiously.
- Among the well-populated speech-rate buckets, slow_speech is the hardest (WER 78.78), while medium_speech is much easier (WER 37.92).
- fast_speech remains challenging but still fares better than slow speech in this evaluation (WER 44.26).
- Very short transcripts are the hardest text condition: very_short_text reaches WER 115.74, much worse than short_text at WER 44.42.
- Longer transcripts are more stable than the very-short-text regime in this setup (long_text WER 37.93).
- Source-domain breakdown is informative: darija_tts_clean_metadata_full contributes 81.58% of the test set, while darija_speech_to_text_metadata_full contributes 18.42%.
- Overlap is not the dominant factor in this test set because no_overlap accounts for 90.20% of samples, while heavy_overlap has only 35 samples.
Duration breakdown
| group_value | n_samples | corpus_wer | corpus_cer | exact_match_rate |
|---|---|---|---|---|
| medium_3_8s | 816 | 44.92 | 23.00 | 7.11 |
| short_<3s | 709 | 42.22 | 18.97 | 8.18 |
| very_long_>=15s | 347 | 38.89 | 18.46 | 0.00 |
| long_8_15s | 6 | 87.84 | 53.21 | 0.00 |
Speech-rate breakdown
| group_value | n_samples | corpus_wer | corpus_cer | exact_match_rate |
|---|---|---|---|---|
| medium_speech | 1341 | 37.92 | 17.10 | 5.59 |
| slow_speech | 291 | 78.78 | 51.81 | 8.93 |
| fast_speech | 246 | 44.26 | 23.33 | 6.10 |
Text-length breakdown
| group_value | n_samples | corpus_wer | corpus_cer | exact_match_rate |
|---|---|---|---|---|
| short_text | 1392 | 44.42 | 22.10 | 7.04 |
| long_text | 341 | 37.93 | 17.55 | 0.00 |
| very_short_text | 125 | 115.74 | 81.92 | 14.40 |
| medium_text | 20 | 49.75 | 26.90 | 0.00 |
Source-dataset breakdown
| group_value | n_samples | corpus_wer | corpus_cer | exact_match_rate |
|---|---|---|---|---|
| darija_tts_clean_metadata_full | 1532 | 46.04 | 23.41 | 7.57 |
| darija_speech_to_text_metadata_full | 346 | 37.99 | 17.62 | 0.00 |
Speaker breakdown
| group_value | n_samples | corpus_wer | corpus_cer | exact_match_rate |
|---|---|---|---|---|
| single_speaker | 1692 | 40.63 | 19.30 | 6.68 |
| two_speakers | 186 | 40.24 | 19.57 | 1.61 |
Overlap breakdown
| group_value | n_samples | corpus_wer | corpus_cer | exact_match_rate |
|---|---|---|---|---|
| no_overlap | 1694 | 40.61 | 19.31 | 6.67 |
| light_overlap | 149 | 39.02 | 18.69 | 2.01 |
| heavy_overlap | 35 | 45.36 | 23.03 | 0.00 |
Comparison with the base checkpoint
On the same test split, this fine-tuned whisper-small checkpoint outperforms the base-sized fine-tuned checkpoint by:
- 12.63 WER points
- 6.38 CER points
This indicates that, under this training setup and this merged dataset, the small checkpoint was the stronger fine-tuned model.
Intended use
This model is intended for automatic speech recognition of Moroccan Darija on audio that is reasonably close to the training domain represented by the merged datasets.
Recommended use cases
- research on Moroccan Darija ASR,
- baseline or fine-tuning starting point for Darija transcription,
- analysis of model behavior under different diarization/usability regimes.
Less reliable cases
- wrong-language or strongly non-Darija audio,
- aggressive code-switching,
- extremely short reference segments,
- highly mismatched segmentation between audio and transcript,
- very rare bucket conditions with only a few evaluation samples.
Limitations and interpretation notes
- Very small buckets should not be over-interpreted. In this project, some strata contain only a handful of samples.
- WER can be much higher than CER because a near-miss word can still count as a full word error while remaining close at the character level.
- Very short transcripts can produce unstable WER values and may amplify insertion-heavy failures.
- Language mismatch / code-switching / annotation mismatch can inflate error rates and should be checked during listening-based analysis.
- Because Moroccan Darija orthography is not fully standardized, part of the residual error may reflect spelling variation rather than a true semantic ASR failure.
Repository contents
This repo contains the trained checkpoint and, when available, evaluation artifacts such as:
- test_group_summaries.json
- test_metrics.json
- test_predictions_preview.json
- test_predictions_detailed.csv
- test_predictions_detailed.jsonl
If test_group_summaries.json and test_metrics.json are present, they provide the full numeric breakdown behind the summary tables shown in this model card.
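When those files are present, they can be inspected with the standard library alone. A small sketch (the helper name is illustrative; the filenames come from the list above):

```python
import json
from pathlib import Path

def load_eval_artifacts(repo_dir):
    """Load the JSON evaluation artifacts listed above, if present."""
    artifacts = {}
    for name in ("test_metrics.json", "test_group_summaries.json"):
        path = Path(repo_dir) / name
        if path.exists():
            artifacts[name] = json.loads(path.read_text(encoding="utf-8"))
    return artifacts
```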
Model: EtMmohammedHafsati/whisper-small-darija-merged-diarized
- Base model: openai/whisper-small