Whisper Small β€” Krio Speech-to-Text πŸ‡ΈπŸ‡±

The first dedicated Speech-to-Text model for Krio, the national language of Sierra Leone. Fine-tuned from openai/whisper-small on 30 hours of Krio speech data.


Performance

| Metric | Score |
|---|---|
| Word Error Rate (WER) | 5.05% |
| Character Error Rate (CER) | 2.18% |
| Training samples | 6,126 |
| Test samples | 681 |

~95 out of every 100 words transcribed correctly on clean Krio speech.
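For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words, and CER is the same computation at character level. A minimal pure-Python sketch of these metrics (not the evaluation script used to produce the numbers above):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("aw di bodi", "aw di body"))  # 1 substituted word out of 3
```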


Training Data

Fine-tuned on roughly 30 hours of Krio speech from the MosesJoshuaCoker/30_hours_krio_voice dataset, split into 6,126 training and 681 test samples.

How to Use

Option 1 β€” Python Environment (Local)

Step 1: Install dependencies

pip install transformers torch librosa soundfile

Step 2: Transcribe an audio file

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
import torch

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("MosesJoshuaCoker/best_for_now")
processor = WhisperProcessor.from_pretrained("MosesJoshuaCoker/best_for_now")
model.eval()

# Load your audio file (wav, mp3, m4a, flac etc.)
# Audio is automatically resampled to 16kHz
audio_array, sr = librosa.load("your_audio.wav", sr=16000)

# Transcribe
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        forced_decoder_ids=processor.get_decoder_prompt_ids(
            language="english", task="transcribe"
        )
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

Convert m4a to wav (if needed):

# Install ffmpeg
sudo apt-get install ffmpeg        # Linux
brew install ffmpeg                # Mac

# Convert β€” -ar 16000 resamples to 16kHz, -ac 1 converts to mono
ffmpeg -i your_audio.m4a -ar 16000 -ac 1 output.wav
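After converting, you can sanity-check that the result really is 16kHz mono using only the standard-library wave module (a quick check that assumes a PCM .wav file):

```python
import wave

def check_whisper_ready(path):
    """Verify a PCM .wav file is 16kHz mono, as the model expects."""
    with wave.open(path, "rb") as wav:
        rate, channels = wav.getframerate(), wav.getnchannels()
    if rate != 16000 or channels != 1:
        raise ValueError(f"Expected 16kHz mono, got {rate} Hz / {channels} channel(s)")
    return True

# Usage: check_whisper_ready("output.wav")
```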

Option 2 β€” Google Colab

Step 1: Install dependencies

!pip install transformers torch librosa
!apt-get install -y ffmpeg

Step 2: Upload and convert your audio

from google.colab import files

# Opens a file picker β€” select any .wav, .mp3, or .m4a file
uploaded = files.upload()
audio_file = list(uploaded.keys())[0]

# Convert to 16kHz mono wav (skip if already a .wav)
!ffmpeg -i "{audio_file}" -ar 16000 -ac 1 output.wav -y
print("βœ… Ready for transcription")

Step 3: Transcribe

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
import torch

# Load model
model = WhisperForConditionalGeneration.from_pretrained("MosesJoshuaCoker/best_for_now")
processor = WhisperProcessor.from_pretrained("MosesJoshuaCoker/best_for_now")
model.eval()

# Load audio
audio_array, sr = librosa.load("output.wav", sr=16000)
print(f"Audio duration: {len(audio_array)/16000:.2f} seconds")

# Transcribe
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        forced_decoder_ids=processor.get_decoder_prompt_ids(
            language="english", task="transcribe"
        )
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"\nπŸ“ Krio Transcription: {transcription}")

Option 3 β€” Test with Training Dataset

from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import numpy as np
import torch

# Load a sample directly from the training dataset
dataset = load_dataset("MosesJoshuaCoker/30_hours_krio_voice", split="train")
sample = dataset[0]  # change index to test different samples

audio_array = np.array(sample["audio_path"]["array"], dtype=np.float32)
reference = sample["transcriptions"]

# Load model
model = WhisperForConditionalGeneration.from_pretrained("MosesJoshuaCoker/best_for_now")
processor = WhisperProcessor.from_pretrained("MosesJoshuaCoker/best_for_now")
model.eval()

# Transcribe
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        forced_decoder_ids=processor.get_decoder_prompt_ids(
            language="english", task="transcribe"
        )
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(f"Reference:   {reference}")
print(f"Prediction:  {transcription}")

Important Notes

  • Audio format: The model expects 16kHz mono audio. Use librosa.load(file, sr=16000) or ffmpeg to resample.
  • Best results: Works best on clean, close-microphone Krio speech.
  • Language forcing: decoding uses Whisper's English language token as a stand-in (Whisper has no Krio token, as shown in the examples above); the fine-tuned model still outputs Krio text, even when the audio contains English-sounding words.
  • Supported formats: .wav, .mp3, .m4a, .flac, .ogg (via librosa or ffmpeg conversion)
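If librosa is unavailable, a crude fallback for the 16kHz requirement is linear-interpolation resampling with numpy. This applies no anti-aliasing filter, so prefer librosa or ffmpeg for real use; it is only a sketch:

```python
import numpy as np

def resample_linear(audio, orig_sr, target_sr=16000):
    """Naive linear-interpolation resample; no low-pass filtering."""
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    old_t = np.arange(len(audio)) / orig_sr   # original sample times (s)
    new_t = np.arange(n_out) / target_sr      # target sample times (s)
    return np.interp(new_t, old_t, audio).astype(np.float32)
```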

Training Details

| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Parameters | ~0.2B (F32) |
| Epochs | 5 |
| Batch size | 16 (effective) |
| Learning rate | 1e-5 |
| Warmup steps | 100 |
| Hardware | Tesla T4 |
| Training time | 9h 44m |
| Best checkpoint | Step 1150 |
| Framework | HuggingFace Transformers |

Citation

If you use this model in your research or application, please cite:

@misc{whisper-krio-2025,
  author = {Moses Joshua Coker},
  title  = {Whisper Small Fine-tuned for Krio Speech-to-Text},
  year   = {2025},
  url    = {https://huggingface.co/MosesJoshuaCoker/best_for_now}
}

Limitations

  • Performance may degrade with heavy background noise
  • Regional Krio dialect variations that are underrepresented in the training data may transcribe less accurately
  • This is a beta model β€” feedback from native Krio speakers is welcome

Built for Sierra Leone πŸ‡ΈπŸ‡± β€” the first open-source Krio STT model.
