Whisper Shamiri - Swahili/English ASR

A Whisper Small model fine-tuned on Kenyan audio with natural Swahili-English code-switching.

Model Details

  • Base model: openai/whisper-small
  • Architecture: WhisperForConditionalGeneration (12 encoder layers and 12 decoder layers, hidden size 768)
  • Parameters: ~242M
  • Training steps: 200
  • Batch size: 16
  • Learning rate: 1e-5
  • Final loss: 1.1068

Evaluation

Note: All metrics below were computed on the training split (4,010 segments from 9 recordings), so they likely overstate performance on unseen audio. Evaluate on a held-out test set before making deployment decisions.

Performance Summary

Metric    All Samples    Clean Samples*
WER       74.97%         34.68%
CER       56.41%         19.64%

*Clean samples exclude predictions with WER ≥ 90% (catastrophic failures)
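
The clean-sample split can be illustrated with a minimal sketch. It assumes WER is computed per segment on whitespace-separated words and that the summary figures are plain averages of those per-segment rates; the card does not state how the published numbers were aggregated, and the function names here are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """WER (unit="word") or CER (unit="char") for one segment."""
    split = str.split if unit == "word" else list
    ref, hyp = split(reference), split(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def summarize(pairs, clean_threshold=0.90):
    """Average WER over all segments and over segments below the threshold.

    Assumption: averaging per-segment WER; a corpus-level WER
    (total edits / total reference words) would give different numbers.
    """
    wers = [error_rate(ref, hyp) for ref, hyp in pairs]
    clean = [w for w in wers if w < clean_threshold]
    return {
        "wer_all": sum(wers) / len(wers),
        "wer_clean": sum(clean) / len(clean) if clean else None,
        "n_clean": len(clean),
    }
```

Note that WER can exceed 100% when the prediction contains many inserted words, so a handful of runaway outputs can dominate the all-samples average.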

Understanding the Results

Why the gap? The dataset contains naturalistic therapy recordings with:

  • Background noise and room acoustics
  • Overlapping speech between therapist and client
  • Mumbled or unclear pronunciation
  • Very short utterances (< 1 second)

What this means for users:

  • On clear audio: expect roughly 35% WER, the clean-sample figure above (competitive for code-switching ASR)
  • On noisy/difficult audio: many segments fail outright (WER ≥ 90%), which pulls the overall WER up to about 75%
  • Best suited for: clear speech, minimal background noise

Similarity Metrics

Metric                 Score    Interpretation
Semantic Similarity    0.779    Meaning is largely preserved
ROUGE-L F1             0.607    Moderate phrase overlap
Jaccard Similarity     0.722    Good token overlap
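
For readers unfamiliar with the two overlap metrics, the sketch below shows one common way to compute them for a single reference/prediction pair. It assumes lowercased, whitespace-split tokens; the card does not say which tokenization or library produced the scores above, so treat this as an illustration and not as the evaluation code. Semantic similarity is typically an embedding-based score and is not shown.

```python
def tokens(text):
    """Lowercase and split on whitespace (an assumed tokenization)."""
    return text.lower().split()


def jaccard(reference, hypothesis):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    ref, hyp = set(tokens(reference)), set(tokens(hypothesis))
    if not ref and not hyp:
        return 1.0
    return len(ref & hyp) / len(ref | hyp)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, hypothesis):
    """F1 of LCS-based precision and recall."""
    ref, hyp = tokens(reference), tokens(hypothesis)
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Jaccard ignores word order and repetition, while ROUGE-L rewards words that appear in the same order, which is why the two scores can differ on the same transcript.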

Usage

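from transformers import pipeline

# Build an ASR pipeline; the checkpoint is downloaded from the Hub on first use
pipe = pipeline("automatic-speech-recognition", model="shamiri-ai/whisper-shamiri-200")

# Transcribe a local audio file (decoding a file path requires ffmpeg)
result = pipe("audio.mp3")
print(result["text"])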

Or load directly:

from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("shamiri-ai/whisper-shamiri-200")
processor = WhisperProcessor.from_pretrained("shamiri-ai/whisper-shamiri-200")
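
Loading the model and processor does not transcribe anything by itself. The helper below is a minimal sketch of the remaining steps, assuming the audio is already a mono waveform sampled at 16 kHz (the rate Whisper feature extractors expect); the name `transcribe` is hypothetical and not part of the library.

```python
SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz mono audio


def transcribe(model, processor, waveform, sampling_rate=SAMPLE_RATE):
    """Transcribe one mono waveform (1-D sequence of floats) and return the text."""
    # Convert raw audio into log-mel input features
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    # Autoregressively decode token ids from the features
    predicted_ids = model.generate(inputs.input_features)
    # Turn token ids back into text, dropping special tokens
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

With the `model` and `processor` loaded above, `transcribe(model, processor, waveform)` returns a string. This direct path handles a single clip of up to 30 seconds, since Whisper pads or truncates its input to that window; for longer recordings the pipeline interface is the simpler option.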

Training Data

Fine-tuned on therapy session recordings (4,010 segments from 9 recordings) containing natural Swahili-English code-switching typical of Kenyan conversational speech.

Intended Use

This model is designed for transcribing Kenyan audio that contains:

  • Swahili language
  • English language
  • Natural code-switching between Swahili and English

Limitations

  • Trained on limited domain data (therapy sessions)
  • May not generalize well to other audio types or accents
  • Early checkpoint (200 steps) - may benefit from further training
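
A quick back-of-the-envelope check puts the "early checkpoint" caveat in numbers. It assumes the listed batch size of 16 is the effective batch size (single device, no gradient accumulation) and that the training set is the 4,010 segments mentioned in the evaluation note; neither is confirmed by the card.

```python
TRAINING_STEPS = 200    # from Model Details
BATCH_SIZE = 16         # assumed to be the effective batch size
TRAIN_SEGMENTS = 4_010  # training split size from the evaluation note

samples_seen = TRAINING_STEPS * BATCH_SIZE
epochs = samples_seen / TRAIN_SEGMENTS
print(f"{samples_seen} samples seen, about {epochs:.2f} epochs")
# prints "3200 samples seen, about 0.80 epochs"
```

Under these assumptions the checkpoint has seen less than one full pass over the training data, which is consistent with the suggestion that further training may help.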