Model Description

This model is a fine-tuned version of openai/whisper-base for Kikuyu automatic speech recognition (ASR), trained on 100 hours of transcribed Kikuyu speech from the MCAA1-MSU/anv_data_ke dataset. The model achieved a Word Error Rate (WER) of 36% and a Character Error Rate (CER) of 10%.

  • Developed by: Mary Kariuki
  • Model type: Automatic Speech Recognition (ASR)
  • Language: Kikuyu
  • Finetuned from: openai/whisper-base

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa

model_id = "DeKUT-DSAIL/Tunuh-whisper-base-Kikuyu-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model and processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True, 
    use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Preprocess audio
audio_path = "path_to_your_audio.wav"
audio, sampling_rate = librosa.load(audio_path, sr=16000, mono=True)
inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

# Generate transcript
predicted_ids = model.generate(input_features, max_length=448)

# Decode output
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

Training Details

Preprocessing

The preprocessing steps included:

  • Removing corrupted files
  • Removing audio longer than 30 seconds
  • Removing unnecessary columns
  • Resampling audio to 16 kHz
  • Tokenization
  • Feature extraction
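The duration filter above can be sketched as a small predicate; this assumes a Hugging Face `datasets`-style example dict whose "audio" field holds the decoded waveform and its sampling rate (the field names are assumptions, not confirmed by the card):

```python
MAX_SECONDS = 30.0

def is_short_enough(example):
    """Keep only clips of at most MAX_SECONDS of audio."""
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    return duration <= MAX_SECONDS

# With the `datasets` library, the same filter (plus 16 kHz resampling)
# could be applied as:
# ds = ds.cast_column("audio", Audio(sampling_rate=16000)).filter(is_short_enough)
```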

Training Hyperparameters

The following parameters were used during training:

  • Learning rate: 1e-05
  • Training batch size: 64
  • Training steps: 2000
  • Warmup steps: 500
  • Evaluation strategy: steps
  • Evaluation steps: 200
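The hyperparameters above map onto Hugging Face Seq2SeqTrainingArguments roughly as follows. This is a sketch, not the exact training script: the output directory is an assumption, and the batch size is taken as per-device:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-base-kikuyu",  # assumed output path
    per_device_train_batch_size=64,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=2000,
    eval_strategy="steps",
    eval_steps=200,
    predict_with_generate=True,  # generate transcripts during evaluation for WER/CER
)
```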

Training Results

Training Loss   Epoch   WER (%)   CER (%)
0.7982          1.0     55.6971   14.4809
0.4751          2.0     46.5992   13.0708
0.3492          3.0     40.3700   11.11
0.3483          4.0     37.5377   10.2413
0.2764          5.0     36.5935   10.1114

Metrics

The following metrics were used during model evaluation:

  • WER (Word Error Rate): the percentage of words that were incorrectly predicted compared to the reference transcript.

  • CER (Character Error Rate): the percentage of characters that were incorrectly predicted compared to the reference transcript.
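Both metrics reduce to an edit distance divided by the reference length, computed over words for WER and over characters for CER. A minimal sketch (not the evaluation code used for this model):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, keeping one DP row at a time."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution (free on match)
    return dp[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by words in the reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```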

Evaluation Samples

Below are examples of Kikuyu audio transcribed using this model.

Human transcription | Model prediction
Itonga cirĩ arata aingĩ mũno. | Itonga cirĩ arata aingĩ mũno.
Nĩacoketie ciira ũcio igooti-inĩ nĩũndũ wa akĩoho gĩake. | Nĩacoketie cira ũcio igooti-inĩ nĩũndũ wa akĩohwo gĩake.
Mathomo maingĩ intaneti-inĩ. | Mathomo maingĩ intaneti-inĩ

Downstream use

  • Transcription services for Kikuyu content.
  • Development of language learning apps for Kikuyu speakers.
  • Subtitle generation for Kikuyu audio and video content.