Whisper Shamiri - Swahili/English ASR
A Whisper Small model fine-tuned on Kenyan audio with natural Swahili-English code-switching.
Model Details
- Base model: openai/whisper-small
- Architecture: WhisperForConditionalGeneration (12 encoder layers, 12 decoder layers, hidden size 768)
- Parameters: ~242M
- Training steps: 200
- Batch size: 16
- Learning rate: 1e-5
- Final loss: 1.1068
Evaluation
Note: These metrics were computed on the training split (4,010 segments from 9 recordings), so they likely overstate performance on unseen audio. Evaluate on a held-out test set before making deployment decisions.
Performance Summary
| Metric | All Samples | Clean Samples* |
|---|---|---|
| WER | 74.97% | 34.68% |
| CER | 56.41% | 19.64% |
*Clean samples exclude predictions with WER ≥ 90% (catastrophic failures)
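As a rough illustration of how the two columns relate, the sketch below computes per-sample WER and CER with a plain Levenshtein distance, then drops samples at or above the 90% WER threshold. It assumes the reported figures are means of per-sample scores; the card does not say whether they are per-sample means or corpus-level rates, and the function names here are hypothetical, not the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def summarize(pairs, threshold=0.90):
    """Mean WER over all samples and over 'clean' samples (WER < threshold).

    Assumption: scores are averaged per sample, not pooled over the corpus.
    """
    scores = [wer(ref, hyp) for ref, hyp in pairs]
    clean = [s for s in scores if s < threshold]
    return {
        "wer_all": sum(scores) / len(scores),
        "wer_clean": sum(clean) / len(clean) if clean else None,
        "n_clean": len(clean),
    }
```

Because WER can exceed 100% when the hypothesis is longer than the reference, a handful of hallucinated outputs can pull the all-samples mean far above the clean-sample mean.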
Understanding the Results
Why the gap? The dataset contains naturalistic therapy recordings with:
- Background noise and room acoustics
- Overlapping speech between therapist and client
- Mumbled or unclear pronunciation
- Very short utterances (< 1 second)
What this means for users:
- On clear audio: expect roughly 35% WER, in line with the clean-sample figure above (competitive for code-switching ASR)
- On noisy/difficult audio: performance degrades significantly
- Best suited for: clear speech, minimal background noise
Similarity Metrics
| Metric | Score | Interpretation |
|---|---|---|
| Semantic Similarity | 0.779 | Meaning is largely preserved |
| ROUGE-L F1 | 0.607 | Moderate phrase overlap |
| Jaccard Similarity | 0.722 | Good token overlap |
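For readers unfamiliar with the token-overlap metrics, here is a minimal sketch of Jaccard similarity and ROUGE-L F1 over whitespace tokens. The tokenization and lowercasing are assumptions, since the card does not describe its preprocessing, and the function names are illustrative. Semantic similarity is left out because it depends on an embedding model the card does not name.

```python
def jaccard(reference, hypothesis):
    """Jaccard similarity: shared unique tokens over all unique tokens."""
    ref = set(reference.lower().split())
    hyp = set(hypothesis.lower().split())
    if not ref and not hyp:
        return 1.0  # two empty strings are treated as identical
    return len(ref & hyp) / len(ref | hyp)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, hypothesis):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Jaccard ignores word order and repetition, while ROUGE-L rewards tokens that appear in the same order, which may explain why the two scores in the table differ.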
Usage
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="shamiri-ai/whisper-shamiri-200")
result = pipe("audio.mp3")
print(result["text"])
Or load directly:
from transformers import WhisperForConditionalGeneration, WhisperProcessor
model = WhisperForConditionalGeneration.from_pretrained("shamiri-ai/whisper-shamiri-200")
processor = WhisperProcessor.from_pretrained("shamiri-ai/whisper-shamiri-200")
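Once the model and processor are loaded, transcription might look like the sketch below. It assumes the input is a 1-D float waveform already resampled to 16 kHz mono; the `transcribe` helper and the constants are illustrative names, not part of the released model.

```python
MODEL_ID = "shamiri-ai/whisper-shamiri-200"
SAMPLE_RATE = 16_000  # Whisper feature extractors expect 16 kHz audio


def transcribe(waveform, model_id=MODEL_ID):
    """Transcribe a 1-D float waveform sampled at 16 kHz.

    Assumption: the clip fits in Whisper's 30-second input window.
    """
    # Heavy imports stay inside the function so importing this module is cheap.
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    # Convert raw audio into the log-mel input features the encoder consumes.
    inputs = processor(waveform, sampling_rate=SAMPLE_RATE, return_tensors="pt")

    # Decode token ids without tracking gradients, then map them back to text.
    with torch.no_grad():
        predicted_ids = model.generate(inputs.input_features)

    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

A waveform can be obtained with, for example, `librosa.load("audio.mp3", sr=16000)`, which returns the samples and the sampling rate. For recordings longer than about 30 seconds, the pipeline shown earlier is likely the simpler route, since a single pass through the processor only covers one input window.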
Training Data
Fine-tuned on therapy session recordings containing natural Swahili-English code-switching typical of Kenyan conversational speech.
Intended Use
This model is designed for transcribing Kenyan audio that contains:
- Swahili language
- English language
- Natural code-switching between Swahili and English
Limitations
- Trained on limited domain data (therapy sessions)
- May not generalize well to other audio types or accents
- Early checkpoint (200 steps) - may benefit from further training