Whisper Small Darija ASR trained on merged diarized datasets
This repository hosts a fine-tuned Whisper checkpoint for Moroccan Darija ASR. It was trained on a merged corpus built from two Hugging Face datasets, enriched with diarization-derived analysis features, and filtered for ASR usability:
- EtMmohammedHafsati/darija_tts_clean_metadata_full
- EtMmohammedHafsati/darija_speech_to_text_metadata_full
Training data
The training corpus is the merged Darija speech corpus built from the two Hugging Face datasets listed above.
Data preparation and analysis workflow
The merged corpus was not treated as a plain list of audio/transcript pairs. Before training and evaluation it was enriched with diarization-derived analysis features and ASR usability signals. The workflow covered:
- Speaker structure analysis: number of speakers, speaker turns, turns per minute, dominant-speaker ratio, second-speaker ratio, speaker-balance score, and speaker entropy.
- Overlap analysis: overlap duration, overlap ratio, number of overlap regions, mean/max overlap region duration, and maximum concurrent speakers.
- Transcript-length and pacing analysis: character length, approximate word length, token proxies, characters per second, and tokens per second.
- Usability filtering: the corpus includes fields such as asr_usability_score, usability_score_custom, quality_bin, and trainability/filter flags. These were used to monitor whether a segment was likely to be useful for ASR fine-tuning.
- Export control: training/evaluation used the 16 kHz successful audio-export paths (audio_path_16k) and focused on samples that were successfully prepared for training.
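The speaker-structure features above can be sketched from raw diarization output. This is a minimal illustration, not the project's published pipeline; the segment representation and function name are assumptions, while the output keys mirror the corpus fields listed above.

```python
import math

def speaker_features(segments, total_duration):
    """Sketch of speaker-structure features from diarization output.

    `segments` is a hypothetical list of (speaker_id, start, end) tuples;
    the returned keys mirror the corpus metadata fields described above.
    """
    # Total speaking time per speaker
    talk = {}
    for spk, start, end in segments:
        talk[spk] = talk.get(spk, 0.0) + (end - start)

    total_talk = sum(talk.values())
    ratios = sorted((t / total_talk for t in talk.values()), reverse=True)

    # Speaker turns: count speaker changes between consecutive segments
    ordered = sorted(segments, key=lambda s: s[1])
    turns = sum(1 for a, b in zip(ordered, ordered[1:]) if a[0] != b[0])

    # Shannon entropy over speaking-time shares (0 for a single speaker)
    entropy = -sum(r * math.log2(r) for r in ratios if r > 0)

    return {
        "num_speakers": len(talk),
        "speaker_turns": turns,
        "turns_per_minute": turns / (total_duration / 60.0),
        "dominant_speaker_ratio": ratios[0],
        "second_speaker_ratio": ratios[1] if len(ratios) > 1 else 0.0,
        "speaker_entropy": entropy,
    }
```

Overlap features would follow the same pattern, scanning for time regions where two or more segments intersect.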
Why diarization and usability scores matter
These metadata were used to understand where the model fails, not only how much it fails. In practice, they let us stratify results by:
- single-speaker vs multi-speaker speech,
- overlap vs no overlap,
- slow / medium / fast speaking rate,
- short / medium / long transcripts,
- source dataset and original split,
- harder vs easier segments according to the usability-related fields.
This makes the evaluation much more informative than a single global WER/CER number.
Training setup
- Base checkpoint: openai/whisper-small
- Fine-tuned model size: small
- Total update steps used in the reported run: 10000
- Training schedule note from the experiment log: 5000 × 2 steps, corresponding to roughly 10 epochs in the author’s setup
- Training corpus: merged Darija datasets listed above
- Task: Moroccan Darija ASR / transcription
Important note about epochs
The exact number of effective epochs in sequence-to-sequence training depends on the effective batch size, including gradient accumulation. In this card, the run is described using the author’s experiment note: 5000 × 2 steps, around 10 epochs.
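The relation between update steps and epochs can be made concrete with a back-of-envelope calculation. The batch size and dataset size are not reported in this card, so the numbers below are purely illustrative:

```python
def approx_epochs(update_steps, per_device_batch, grad_accum, dataset_size):
    """Effective epochs = total samples seen / dataset size."""
    effective_batch = per_device_batch * grad_accum
    return update_steps * effective_batch / dataset_size

# Illustrative values only: 10000 update steps with an effective batch
# of 16 over a hypothetical ~16000-sample corpus gives ~10 epochs,
# consistent with the author's note.
print(approx_epochs(10000, 8, 2, 16000))  # 10.0
```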
Evaluation protocol
Both checkpoints were evaluated with the same decoding and scoring pipeline on the held-out merged test split.
Inference settings
- audio loaded from the 16 kHz mono waveform path,
- Whisper processor used with padding and attention masks,
- generation run with:
  - language="arabic"
  - task="transcribe"
  - max_new_tokens=225
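The inference settings above can be sketched as a single helper. This is an assumed reconstruction using the transformers and librosa libraries, not the author's exact evaluation script:

```python
def transcribe(audio_path,
               model_id="EtMmohammedHafsati/whisper-small-darija-merged-diarized"):
    """Sketch of the decoding settings above; assumes transformers,
    librosa, and torch are installed."""
    import librosa
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)

    # Load the 16 kHz mono waveform (matching the audio_path_16k exports)
    waveform, _ = librosa.load(audio_path, sr=16000, mono=True)

    # Processor with padding and attention mask, as described above
    inputs = processor(waveform, sampling_rate=16000,
                       return_tensors="pt", return_attention_mask=True)
    with torch.no_grad():
        ids = model.generate(inputs.input_features,
                             attention_mask=inputs.attention_mask,
                             language="arabic", task="transcribe",
                             max_new_tokens=225)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```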
Text normalization
Before scoring, references and predictions were normalized conservatively by:
- converting to string,
- trimming leading/trailing whitespace,
- collapsing repeated whitespace into a single space.
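The three normalization steps above amount to a small helper like the following (the function name is illustrative):

```python
import re

def normalize_text(text):
    """Conservative normalization applied to references and predictions."""
    text = str(text)                  # convert to string
    text = text.strip()               # trim leading/trailing whitespace
    text = re.sub(r"\s+", " ", text)  # collapse repeated whitespace
    return text
```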
Metrics
The main reported metrics are:
- WER: word error rate,
- CER: character error rate,
- exact-match rate.
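In practice these metrics are usually computed with a library such as jiwer; a self-contained sketch using edit distance looks like this:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def exact_match(ref, hyp):
    return float(ref == hyp)
```

For example, the near-miss pair "salam alaykum" vs "salam alikum" scores WER 0.5 (one of two words wrong) but CER only about 0.15, which is why WER can sit far above CER.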
Stratified testing
After global evaluation, results were re-aggregated by:
- duration_bucket
- speaker_bucket
- overlap_bucket
- turn_rate_bucket
- speech_rate_bucket
- text_length_bucket
- quality_bucket
- source_dataset_bucket
- source_split_bucket
- multiple-speaker and overlap flags.
This was done to separate acoustic difficulty, segmentation difficulty, transcript difficulty, and source-domain effects.
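The re-aggregation by bucket amounts to a group-by over per-sample results. A minimal pure-Python sketch, with illustrative row fields (`wer_errors`, `ref_words`, `exact`) that are not the project's actual schema:

```python
from collections import defaultdict

def stratify(rows, bucket_field):
    """Aggregate per-sample metrics into corpus-level metrics per bucket."""
    groups = defaultdict(lambda: {"n": 0, "errors": 0, "words": 0, "exact": 0})
    for row in rows:
        g = groups[row[bucket_field]]
        g["n"] += 1
        g["errors"] += row["wer_errors"]
        g["words"] += row["ref_words"]
        g["exact"] += row["exact"]
    # Corpus-level WER per bucket: total word errors / total reference words
    return {k: {"n_samples": v["n"],
                "corpus_wer": 100.0 * v["errors"] / v["words"],
                "exact_match_rate": 100.0 * v["exact"] / v["n"]}
            for k, v in groups.items()}
```

Computing corpus-level WER from pooled error and word counts (rather than averaging per-utterance WERs) matches how the breakdown tables below are usually produced.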
Main results
Global metrics
- Test samples: 1878
- WER: 40.52
- CER: 19.37
- Exact-match rate: 6.18%
Key findings
- Global test performance: WER 40.52 and CER 19.37 on 1878 test utterances.
- Exact-match rate on the test set: 6.18%.
- The main duration buckets are well populated and behave consistently: short clips (709 samples), medium clips (816 samples), and very long clips (347 samples) all fall in a similar WER range.
- The 8–15s bucket has only 6 samples, so that bucket should be interpreted cautiously.
- Among the well-populated speech-rate buckets, slow_speech is the hardest (WER 78.78), while medium_speech is much easier (WER 37.92).
- fast_speech remains challenging but still fares better than slow speech in this evaluation (WER 44.26).
- Very short transcripts are the hardest text condition: very_short_text reaches WER 115.74, much worse than short_text at WER 44.42.
- Longer transcripts are more stable than the very-short-text regime in this setup (long_text WER 37.93).
- Source-domain breakdown is informative: darija_tts_clean_metadata_full contributes 81.58% of the test set, while darija_speech_to_text_metadata_full contributes 18.42%.
- Overlap is not the dominant factor in this test set because no_overlap accounts for 90.20% of samples, while heavy_overlap has only 35 samples.
Duration breakdown
| group_value | n_samples | corpus_wer | corpus_cer | exact_match_rate |
|---|---|---|---|---|
| medium_3_8s | 816 | 44.92 | 23.00 | 7.11 |
| short_<3s | 709 | 42.22 | 18.97 | 8.18 |
| very_long_>=15s | 347 | 38.89 | 18.46 | 0.00 |
| long_8_15s | 6 | 87.84 | 53.21 | 0.00 |
Speech-rate breakdown
| group_value | n_samples | corpus_wer | corpus_cer | exact_match_rate |
|---|---|---|---|---|
| medium_speech | 1341 | 37.92 | 17.10 | 5.59 |
| slow_speech | 291 | 78.78 | 51.81 | 8.93 |
| fast_speech | 246 | 44.26 | 23.33 | 6.10 |
Text-length breakdown
| group_value | n_samples | corpus_wer | corpus_cer | exact_match_rate |
|---|---|---|---|---|
| short_text | 1392 | 44.42 | 22.10 | 7.04 |
| long_text | 341 | 37.93 | 17.55 | 0.00 |
| very_short_text | 125 | 115.74 | 81.92 | 14.40 |
| medium_text | 20 | 49.75 | 26.90 | 0.00 |
Source-dataset breakdown
| group_value | n_samples | corpus_wer | corpus_cer | exact_match_rate |
|---|---|---|---|---|
| darija_tts_clean_metadata_full | 1532 | 46.04 | 23.41 | 7.57 |
| darija_speech_to_text_metadata_full | 346 | 37.99 | 17.62 | 0.00 |
Speaker breakdown
| group_value | n_samples | corpus_wer | corpus_cer | exact_match_rate |
|---|---|---|---|---|
| single_speaker | 1692 | 40.63 | 19.30 | 6.68 |
| two_speakers | 186 | 40.24 | 19.57 | 1.61 |
Overlap breakdown
| group_value | n_samples | corpus_wer | corpus_cer | exact_match_rate |
|---|---|---|---|---|
| no_overlap | 1694 | 40.61 | 19.31 | 6.67 |
| light_overlap | 149 | 39.02 | 18.69 | 2.01 |
| heavy_overlap | 35 | 45.36 | 23.03 | 0.00 |
Comparison with the base checkpoint
On the same test split, this fine-tuned whisper-small checkpoint outperforms the base-sized fine-tuned checkpoint by:
- 12.63 WER points
- 6.38 CER points
This indicates that, under this training setup and this merged dataset, the small checkpoint was the stronger fine-tuned model.
Intended use
This model is intended for automatic speech recognition of Moroccan Darija on audio that is reasonably close to the training domain represented by the merged datasets.
Recommended use cases
- research on Moroccan Darija ASR,
- baseline or fine-tuning starting point for Darija transcription,
- analysis of model behavior under different diarization/usability regimes.
Less reliable cases
- wrong-language or strongly non-Darija audio,
- aggressive code-switching,
- extremely short reference segments,
- highly mismatched segmentation between audio and transcript,
- very rare bucket conditions with only a few evaluation samples.
Limitations and interpretation notes
- Very small buckets should not be over-interpreted. In this project, some strata contain only a handful of samples.
- WER can be much higher than CER because a near-miss word can still count as a full word error while remaining close at the character level.
- Very short transcripts can produce unstable WER values and may amplify insertion-heavy failures.
- Language mismatch / code-switching / annotation mismatch can inflate error rates and should be checked during listening-based analysis.
- Because Moroccan Darija orthography is not fully standardized, part of the residual error may reflect spelling variation rather than a true semantic ASR failure.
Repository contents
This repo contains the trained checkpoint and, when available, evaluation artifacts such as:
- test_group_summaries.json
- test_metrics.json
- test_predictions_preview.json
- test_predictions_detailed.csv
- test_predictions_detailed.jsonl
If test_group_summaries.json and test_metrics.json are present, they provide the full numeric breakdown behind the summary tables shown in this model card.
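When those files are present, they can be inspected with the standard library alone. A small sketch (the helper name is illustrative; the filenames come from the list above):

```python
import json
from pathlib import Path

def load_eval_artifacts(repo_dir):
    """Load the JSON evaluation artifacts listed above, if present."""
    artifacts = {}
    for name in ("test_metrics.json", "test_group_summaries.json"):
        path = Path(repo_dir) / name
        if path.exists():
            artifacts[name] = json.loads(path.read_text(encoding="utf-8"))
    return artifacts
```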
Model: EtMmohammedHafsati/whisper-small-darija-merged-diarized
- Base model: openai/whisper-small