Whisper Small β€” Krio Speech-to-Text πŸ‡ΈπŸ‡±

The first dedicated Speech-to-Text model for Krio, the national language of Sierra Leone. Fine-tuned from openai/whisper-small on 30 hours of Krio speech data.


Performance

| Metric | Score |
|---|---|
| Word Error Rate (WER) | 5.05% |
| Character Error Rate (CER) | 2.18% |
| Training samples | 6,126 |
| Test samples | 681 |

~95 out of every 100 words transcribed correctly on clean Krio speech.
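For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words, and CER is the same computation at character level. A minimal pure-Python sketch of these metrics (not the evaluation script used to produce the numbers above):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("aw di bodi", "aw di body"))  # 1 substituted word out of 3
```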


Training Data

Fine-tuned on roughly 30 hours of Krio speech from the MosesJoshuaCoker/30_hours_krio_voice dataset, split into 6,126 training and 681 test samples.

How to Use

Option 1 β€” Python Environment (Local)

Step 1: Install dependencies

pip install transformers torch librosa soundfile

Step 2: Transcribe an audio file

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
import torch

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("MosesJoshuaCoker/best_for_now")
processor = WhisperProcessor.from_pretrained("MosesJoshuaCoker/best_for_now")
model.eval()

# Load your audio file (wav, mp3, m4a, flac etc.)
# Audio is automatically resampled to 16kHz
audio_array, sr = librosa.load("your_audio.wav", sr=16000)

# Transcribe
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        forced_decoder_ids=processor.get_decoder_prompt_ids(
            language="english", task="transcribe"
        )
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

Convert m4a to wav (if needed):

# Install ffmpeg
sudo apt-get install ffmpeg        # Linux
brew install ffmpeg                # Mac

# Convert β€” -ar 16000 resamples to 16kHz, -ac 1 converts to mono
ffmpeg -i your_audio.m4a -ar 16000 -ac 1 output.wav
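After converting, you can sanity-check that the result really is 16kHz mono using only the standard-library wave module (a quick check that assumes a PCM .wav file):

```python
import wave

def check_whisper_ready(path):
    """Verify a PCM .wav file is 16kHz mono, as the model expects."""
    with wave.open(path, "rb") as wav:
        rate, channels = wav.getframerate(), wav.getnchannels()
    if rate != 16000 or channels != 1:
        raise ValueError(f"Expected 16kHz mono, got {rate} Hz / {channels} channel(s)")
    return True

# Usage: check_whisper_ready("output.wav")
```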

Option 2 β€” Google Colab

Step 1: Install dependencies

!pip install transformers torch librosa
!apt-get install -y ffmpeg

Step 2: Upload and convert your audio

from google.colab import files

# Opens a file picker β€” select any .wav, .mp3, or .m4a file
uploaded = files.upload()
audio_file = list(uploaded.keys())[0]

# Convert to 16kHz mono wav (skip if already a .wav)
!ffmpeg -i "{audio_file}" -ar 16000 -ac 1 output.wav -y
print("βœ… Ready for transcription")

Step 3: Transcribe

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
import torch

# Load model
model = WhisperForConditionalGeneration.from_pretrained("MosesJoshuaCoker/best_for_now")
processor = WhisperProcessor.from_pretrained("MosesJoshuaCoker/best_for_now")
model.eval()

# Load audio
audio_array, sr = librosa.load("output.wav", sr=16000)
print(f"Audio duration: {len(audio_array)/16000:.2f} seconds")

# Transcribe
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        forced_decoder_ids=processor.get_decoder_prompt_ids(
            language="english", task="transcribe"
        )
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"\nπŸ“ Krio Transcription: {transcription}")

Option 3 β€” Test with Training Dataset

from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import numpy as np
import torch

# Load a sample directly from the training dataset
dataset = load_dataset("MosesJoshuaCoker/30_hours_krio_voice", split="train")
sample = dataset[0]  # change index to test different samples

audio_array = np.array(sample["audio_path"]["array"], dtype=np.float32)
reference = sample["transcriptions"]

# Load model
model = WhisperForConditionalGeneration.from_pretrained("MosesJoshuaCoker/best_for_now")
processor = WhisperProcessor.from_pretrained("MosesJoshuaCoker/best_for_now")
model.eval()

# Transcribe
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        forced_decoder_ids=processor.get_decoder_prompt_ids(
            language="english", task="transcribe"
        )
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(f"Reference:   {reference}")
print(f"Prediction:  {transcription}")

Important Notes

  • Audio format: The model expects 16kHz mono audio. Use librosa.load(file, sr=16000) or ffmpeg to resample.
  • Best results: Works best on clean, close-microphone Krio speech.
  • Language forcing: decoding uses Whisper's English language token as a stand-in (Whisper has no Krio token, as shown in the examples above); the fine-tuned model still outputs Krio text, even when the audio contains English-sounding words.
  • Supported formats: .wav, .mp3, .m4a, .flac, .ogg (via librosa or ffmpeg conversion)
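If librosa is unavailable, a crude fallback for the 16kHz requirement is linear-interpolation resampling with numpy. This applies no anti-aliasing filter, so prefer librosa or ffmpeg for real use; it is only a sketch:

```python
import numpy as np

def resample_linear(audio, orig_sr, target_sr=16000):
    """Naive linear-interpolation resample; no low-pass filtering."""
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    old_t = np.arange(len(audio)) / orig_sr   # original sample times (s)
    new_t = np.arange(n_out) / target_sr      # target sample times (s)
    return np.interp(new_t, old_t, audio).astype(np.float32)
```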

Training Details

| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Parameters | ~0.2B (F32) |
| Epochs | 5 |
| Batch size | 16 (effective) |
| Learning rate | 1e-5 |
| Warmup steps | 100 |
| Hardware | Tesla T4 |
| Training time | 9h 44m |
| Best checkpoint | Step 1150 |
| Framework | HuggingFace Transformers |

Citation

If you use this model in your research or application, please cite:

@misc{whisper-krio-2025,
  author = {Moses Joshua Coker},
  title  = {Whisper Small Fine-tuned for Krio Speech-to-Text},
  year   = {2025},
  url    = {https://huggingface.co/MosesJoshuaCoker/best_for_now}
}

Limitations

  • Performance may degrade with heavy background noise
  • Regional Krio dialect variations that are underrepresented in the training data may transcribe less accurately
  • This is a beta model β€” feedback from native Krio speakers is welcome

Built for Sierra Leone πŸ‡ΈπŸ‡± β€” the first open-source Krio STT model.
