# Whisper Small – Krio Speech-to-Text 🇸🇱
The first dedicated speech-to-text model for Krio, the national language of Sierra Leone. Fine-tuned from `openai/whisper-small` on 30 hours of Krio speech data.

## Performance
| Metric | Score |
|---|---|
| Word Error Rate (WER) | 5.05% |
| Character Error Rate (CER) | 2.18% |
| Training samples | 6,126 |
| Test samples | 681 |
A 5.05% WER means roughly 95 out of every 100 words are transcribed correctly on clean Krio speech.
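For reference, WER is the word-level edit distance between the model output and the ground-truth transcript, divided by the reference length. A minimal pure-Python sketch (illustrative only, not the evaluation script used for the scores above; the Krio phrases are made-up examples):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("aw di bodi de", "aw di bodi de"))  # 0.0 – perfect match
print(wer("aw di bodi de", "aw di badi de"))  # 0.25 – one substitution in four words
```

CER is computed the same way over characters instead of words.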
## Training Data

- Dataset: `MosesJoshuaCoker/30_hours_krio_voice`
- Duration: ~30 hours of Krio speech
- Language: Krio (ISO 639-3: `kri`)
## How to Use

### Option 1 – Python Environment (Local)

**Step 1: Install dependencies**

```bash
pip install transformers torch librosa soundfile
```
**Step 2: Transcribe an audio file**

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
import torch

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("MosesJoshuaCoker/best_for_now")
processor = WhisperProcessor.from_pretrained("MosesJoshuaCoker/best_for_now")
model.eval()

# Load your audio file (wav, mp3, m4a, flac, etc.)
# librosa resamples to 16 kHz automatically
audio_array, sr = librosa.load("your_audio.wav", sr=16000)

# Transcribe
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        # Krio is not in Whisper's original language list, so the English
        # language token is used; the fine-tuned model still emits Krio text.
        forced_decoder_ids=processor.get_decoder_prompt_ids(
            language="english", task="transcribe"
        ),
    )
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")
```
**Convert m4a to wav (if needed):**

```bash
# Install ffmpeg
sudo apt-get install ffmpeg   # Linux
brew install ffmpeg           # macOS

# -ar 16000 resamples to 16 kHz, -ac 1 converts to mono
ffmpeg -i your_audio.m4a -ar 16000 -ac 1 output.wav
```
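If you want to confirm a converted file really is 16 kHz mono before feeding it to the model, the standard-library `wave` module is enough. A small self-contained sketch (the test tone is generated only so the checker has input; normally you would point `is_model_ready` at your own converted file):

```python
import math
import struct
import wave

def write_test_tone(path: str, seconds: float = 1.0, rate: int = 16000) -> None:
    """Write a mono 16-bit 440 Hz sine tone (illustrative input for the check below)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)  # mono
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(rate)
        n = int(seconds * rate)
        frames = b"".join(
            struct.pack("<h", int(10000 * math.sin(2 * math.pi * 440 * t / rate)))
            for t in range(n)
        )
        wav.writeframes(frames)

def is_model_ready(path: str) -> bool:
    """True if the .wav file is 16 kHz mono, as the model expects."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1

write_test_tone("tone.wav")
print(is_model_ready("tone.wav"))  # True
```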
### Option 2 – Google Colab

**Step 1: Install dependencies**

```python
!pip install transformers torch librosa
!apt-get install -y ffmpeg
```
**Step 2: Upload and convert your audio**

```python
from google.colab import files

# Opens a file picker – select any .wav, .mp3, or .m4a file
uploaded = files.upload()
audio_file = list(uploaded.keys())[0]

# Convert to 16 kHz mono wav (skip if already a .wav)
!ffmpeg -i "{audio_file}" -ar 16000 -ac 1 output.wav -y
print("✅ Ready for transcription")
```
**Step 3: Transcribe**

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
import torch

# Load model
model = WhisperForConditionalGeneration.from_pretrained("MosesJoshuaCoker/best_for_now")
processor = WhisperProcessor.from_pretrained("MosesJoshuaCoker/best_for_now")
model.eval()

# Load audio
audio_array, sr = librosa.load("output.wav", sr=16000)
print(f"Audio duration: {len(audio_array)/16000:.2f} seconds")

# Transcribe
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        forced_decoder_ids=processor.get_decoder_prompt_ids(
            language="english", task="transcribe"
        ),
    )
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"\nKrio Transcription: {transcription}")
```
### Option 3 – Test with the Training Dataset

```python
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import numpy as np
import torch

# Load a sample directly from the training dataset
dataset = load_dataset("MosesJoshuaCoker/30_hours_krio_voice", split="train")
sample = dataset[0]  # change the index to test different samples
audio_array = np.array(sample["audio_path"]["array"], dtype=np.float32)
reference = sample["transcriptions"]

# Load model
model = WhisperForConditionalGeneration.from_pretrained("MosesJoshuaCoker/best_for_now")
processor = WhisperProcessor.from_pretrained("MosesJoshuaCoker/best_for_now")
model.eval()

# Transcribe
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        forced_decoder_ids=processor.get_decoder_prompt_ids(
            language="english", task="transcribe"
        ),
    )
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Reference:  {reference}")
print(f"Prediction: {transcription}")
```
## Important Notes

- **Audio format:** The model expects 16 kHz mono audio. Use `librosa.load(file, sr=16000)` or ffmpeg to resample.
- **Best results:** Works best on clean, close-microphone Krio speech.
- **Language forcing:** Although the decoder prompt uses the English language token (Krio is not in Whisper's original language list), the fine-tuned model outputs Krio text, even when the audio contains English-sounding words.
- **Supported formats:** `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg` (via librosa or ffmpeg conversion)
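One more practical note: Whisper processes audio in 30-second windows, so the single `generate` call shown above only covers roughly the first 30 seconds. For longer recordings, one simple approach is to split the sample array into 30-second chunks and transcribe each one. A sketch of the splitting step (an assumption about usage, not part of the released pipeline):

```python
import numpy as np

def split_into_chunks(audio: np.ndarray, sr: int = 16000, chunk_s: int = 30):
    """Split a 1-D audio array into consecutive chunks of at most chunk_s seconds."""
    step = chunk_s * sr
    return [audio[i:i + step] for i in range(0, len(audio), step)]

# 75 seconds of silence -> three chunks of 30 s, 30 s, 15 s
audio = np.zeros(75 * 16000, dtype=np.float32)
chunks = split_into_chunks(audio)
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 15.0]
```

Each chunk can then be passed through the processor/model calls shown above, and the resulting texts joined with spaces.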
## Training Details
| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Epochs | 5 |
| Batch size | 16 (effective) |
| Learning rate | 1e-5 |
| Warmup steps | 100 |
| Hardware | Tesla T4 |
| Training time | 9h 44m |
| Best checkpoint | Step 1150 |
| Framework | HuggingFace Transformers |
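For readers who want to reproduce a similar fine-tune, the table roughly maps onto a `Seq2SeqTrainingArguments` configuration like the one below. This is a sketch under assumptions: the actual training script is not published, and the split of the effective batch size 16 into per-device batch × gradient accumulation is a guess.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only – values mirror the table above; the batch-size split
# and fp16 flag are assumptions, not the author's published config.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-krio",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # 8 x 2 = 16 effective
    learning_rate=1e-5,
    warmup_steps=100,
    fp16=True,                      # Tesla T4 supports fp16
    predict_with_generate=True,
)
```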
## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{whisper-krio-2025,
  author = {Moses Joshua Coker},
  title  = {Whisper Small Fine-tuned for Krio Speech-to-Text},
  year   = {2025},
  url    = {https://huggingface.co/MosesJoshuaCoker/best_for_now}
}
```
## Limitations

- Performance may degrade with heavy background noise
- Regional Krio dialect variations that are under-represented in the training data may score lower
- This is a beta model – feedback from native Krio speakers is welcome

Built for Sierra Leone 🇸🇱 – the first open-source Krio STT model.