Whisper Shamiri - Swahili/English ASR

Whisper Small model finetuned for Kenyan audio with natural Swahili-English code-switching.

Model Details

Base model: openai/whisper-small
Architecture: WhisperForConditionalGeneration (12 encoder/decoder layers, 768 hidden dim)
Parameters: ~242M
Training steps: 200
Batch size: 16
Learning rate: 1e-5
Final loss: 1.1068
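
To put the step count in context, the short calculation below estimates how much data the model saw during finetuning. It is a rough sketch, not the authors' training log. It assumes one optimizer step consumes exactly one batch (no gradient accumulation, single device) and that the 4,010 segments reported in the Evaluation section make up the full training set.

```python
# Values taken from the model details above.
training_steps = 200
batch_size = 16

# Segment count reported in the Evaluation section; assumed here to be
# the full training set.
num_segments = 4010

# Assumption: one optimizer step consumes exactly one batch
# (no gradient accumulation, single device).
samples_seen = training_steps * batch_size
epochs = samples_seen / num_segments

print(f"samples seen: {samples_seen}")
print(f"approx. epochs: {epochs:.2f}")
```

Under these assumptions the checkpoint corresponds to less than one full pass over the data.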

Evaluation

Note: All metrics below were computed on the training split (4,010 segments from 9 recordings), so they likely overstate performance on unseen audio. Evaluate on a held-out test set before making deployment decisions.

Performance Summary

Metric	All Samples	Clean Samples*
WER	74.97%	34.68%
CER	56.41%	19.64%

*Clean samples exclude predictions with WER ≥ 90% (catastrophic failures)
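
The two columns can be reproduced in spirit with a small script. The sketch below is illustrative only: it computes per-sample WER with a plain Levenshtein distance over whitespace tokens, then averages it with and without the 90% cutoff. The card does not say whether scores were averaged per sample or pooled across the corpus, nor which text normalization was used, so both choices here are assumptions. The function names and example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def summarize(pairs, threshold=0.90):
    """Mean per-sample WER over all pairs and over pairs below the threshold."""
    scores = [wer(ref, hyp) for ref, hyp in pairs]
    clean = [s for s in scores if s < threshold]  # drop catastrophic failures
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return {"all": mean(scores), "clean": mean(clean), "n_clean": len(clean)}


# Hypothetical (reference, prediction) pairs, not taken from the dataset.
pairs = [
    ("niko sawa today", "niko sawa today"),
    ("how are you feeling", "how are you feeling leo"),
    ("asante sana", "the weather is nice"),
]
summary = summarize(pairs)
print(summary)
```

Note that WER can exceed 100% when the prediction is much longer than the reference, which is why a handful of failed predictions can pull the all-samples average far above the clean-sample figure.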

Understanding the Results

Why the gap? The dataset contains naturalistic therapy recordings with:

Background noise and room acoustics
Overlapping speech between therapist and client
Mumbled or unclear pronunciation
Very short utterances (< 1 second)

What this means for users:

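On clear audio: expect roughly 35% WER, the clean-sample figure above (measured on the training split), which the authors consider competitive for code-switching ASR
On noisy/difficult audio: performance degrades sharply; the all-samples WER of about 75% includes many predictions with WER ≥ 90%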
Best suited for: clear speech, minimal background noise
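
Since very short utterances are listed among the failure cases, one option is to flag them before transcription. The sketch below is a minimal example of that idea, not part of the released model or its pipeline. The 1-second threshold comes from the list above; the 16 kHz sampling rate, the `(segment_id, num_samples)` input shape, and all names are assumptions made for illustration.

```python
MIN_DURATION_S = 1.0  # threshold taken from the "< 1 second" note above


def duration_seconds(num_samples, sampling_rate):
    """Length of a mono segment in seconds."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


def split_by_duration(segments, sampling_rate=16000, min_duration_s=MIN_DURATION_S):
    """Separate segment ids into those long enough to transcribe and those to flag.

    `segments` is a list of (segment_id, num_samples) tuples.
    """
    keep, flagged = [], []
    for segment_id, num_samples in segments:
        long_enough = duration_seconds(num_samples, sampling_rate) >= min_duration_s
        (keep if long_enough else flagged).append(segment_id)
    return keep, flagged


# Hypothetical segments: 3.0 s, 0.6 s, and exactly 1.0 s at 16 kHz.
segments = [("seg-001", 48000), ("seg-002", 9600), ("seg-003", 16000)]
keep, flagged = split_by_duration(segments)
print("transcribe:", keep)
print("review manually:", flagged)
```

Flagged segments could be merged with their neighbours or reviewed by hand rather than discarded; which option fits depends on the downstream use.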

Similarity Metrics

Metric	Score	Interpretation
Semantic Similarity	0.779	Meaning is largely preserved
ROUGE-L F1	0.607	Moderate phrase overlap
Jaccard Similarity	0.722	Good token overlap
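
For readers unfamiliar with the two overlap metrics, the sketch below shows one common way to compute them for a single reference/prediction pair. It assumes lowercased whitespace tokenization, which may differ from what was used to produce the scores above. Semantic similarity is not covered because it depends on an embedding model that the card does not name.

```python
def jaccard(reference, hypothesis):
    """Jaccard similarity between the sets of unique tokens."""
    ref = set(reference.lower().split())
    hyp = set(hypothesis.lower().split())
    if not ref and not hyp:
        return 1.0  # two empty strings are treated as identical
    return len(ref & hyp) / len(ref | hyp)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, hypothesis):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example pair, not taken from the dataset.
reference = "nimefika home sasa hivi"
prediction = "nimefika home sasa"
print(f"Jaccard:    {jaccard(reference, prediction):.3f}")
print(f"ROUGE-L F1: {rouge_l_f1(reference, prediction):.3f}")
```

Both metrics reward shared tokens, so they can stay fairly high even when WER is poor, for example when a prediction contains the right words plus extra hallucinated ones.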

Usage

from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="shamiri-ai/whisper-shamiri-200")
result = pipe("audio.mp3")
print(result["text"])

Or load directly:

from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("shamiri-ai/whisper-shamiri-200")
processor = WhisperProcessor.from_pretrained("shamiri-ai/whisper-shamiri-200")
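
Loading the model and processor is only the first step. A possible way to run inference with them is sketched below. It assumes the caller already has a mono waveform as a 1-D float array sampled at 16 kHz, the rate Whisper feature extractors expect; the `transcribe` helper and its arguments are illustrative names, not part of the model repository. `torch` and `transformers` are imported inside the function so the sampling-rate check works even before they are loaded.

```python
MODEL_ID = "shamiri-ai/whisper-shamiri-200"
EXPECTED_SAMPLING_RATE = 16_000  # Whisper feature extractors expect 16 kHz audio


def transcribe(waveform, sampling_rate, model=None, processor=None):
    """Transcribe a mono waveform (1-D float array) with the finetuned model."""
    if sampling_rate != EXPECTED_SAMPLING_RATE:
        raise ValueError(
            f"expected {EXPECTED_SAMPLING_RATE} Hz audio, got {sampling_rate} Hz; "
            "resample before calling transcribe()"
        )

    # Heavy dependencies are imported lazily, after input validation.
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    # Reuse a preloaded model/processor when given, otherwise load from the Hub.
    if model is None:
        model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)
    if processor is None:
        processor = WhisperProcessor.from_pretrained(MODEL_ID)

    # Convert raw audio to log-mel input features.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")

    # Generate token ids without tracking gradients, then decode to text.
    with torch.no_grad():
        predicted_ids = model.generate(inputs.input_features)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

When transcribing many files, load the model and processor once and pass them in through the `model` and `processor` arguments to avoid reloading the weights on every call.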

Training Data

Finetuned on therapy session recordings containing natural Swahili-English code-switching typical of Kenyan conversational speech.

Intended Use

This model is designed for transcribing Kenyan audio that contains:

Swahili language
English language
Natural code-switching between Swahili and English

Limitations

Trained on limited domain data (therapy sessions)
May not generalize well to other audio types or accents
Early checkpoint (200 steps) - may benefit from further training
