Model Description

This model is a fine-tuned version of openai/whisper-base for Kikuyu automatic speech recognition (ASR), trained on 100 hours of transcribed Kikuyu speech from the MCAA1-MSU/anv_data_ke dataset. The model achieved a Word Error Rate (WER) of 36% and a Character Error Rate (CER) of 10%.

  • Developed by: Mary Kariuki
  • Model type: Automatic Speech Recognition (ASR)
  • Language: Kikuyu
  • Finetuned from: openai/whisper-base

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa

model_id = "DeKUT-DSAIL/Tunuh-whisper-base-Kikuyu-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model and processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True, 
    use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Preprocess audio
audio_path = "path_to_your_audio.wav"
audio, sampling_rate = librosa.load(audio_path, sr=16000, mono=True)
inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

# Generate transcript
predicted_ids = model.generate(input_features, max_length=448)

# Decode output
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

Training Details

Preprocessing

The preprocessing steps included:

  • Removing corrupted files
  • Removing audio longer than 30 seconds
  • Removing unnecessary columns
  • Resampling audio to 16 kHz
  • Tokenization
  • Feature extraction
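The duration filter above can be sketched as a small predicate; this assumes a Hugging Face `datasets`-style example dict whose "audio" field holds the decoded waveform and its sampling rate (the field names are assumptions, not confirmed by the card):

```python
MAX_SECONDS = 30.0

def is_short_enough(example):
    """Keep only clips of at most MAX_SECONDS of audio."""
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    return duration <= MAX_SECONDS

# With the `datasets` library, the same filter (plus 16 kHz resampling)
# could be applied as:
# ds = ds.cast_column("audio", Audio(sampling_rate=16000)).filter(is_short_enough)
```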

Training Hyperparameters

The following parameters were used during training:

  • Learning rate: 1e-05
  • Training batch size: 64
  • Training steps: 2000
  • Warmup steps: 500
  • Evaluation strategy: steps
  • Evaluation steps: 200
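The hyperparameters above map onto Hugging Face Seq2SeqTrainingArguments roughly as follows. This is a sketch, not the exact training script: the output directory is an assumption, and the batch size is taken as per-device:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-base-kikuyu",  # assumed output path
    per_device_train_batch_size=64,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=2000,
    eval_strategy="steps",
    eval_steps=200,
    predict_with_generate=True,  # generate transcripts during evaluation for WER/CER
)
```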

Training Results

Training Loss   Epoch   WER (%)   CER (%)
0.7982          1.0     55.6971   14.4809
0.4751          2.0     46.5992   13.0708
0.3492          3.0     40.3700   11.11
0.3483          4.0     37.5377   10.2413
0.2764          5.0     36.5935   10.1114

Metrics

The following metrics were used during model evaluation:

  • WER (Word Error Rate): the percentage of words that were incorrectly predicted compared to the reference transcript.

  • CER (Character Error Rate): the percentage of characters that were incorrectly predicted compared to the reference transcript.
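Both metrics reduce to an edit distance divided by the reference length, computed over words for WER and over characters for CER. A minimal sketch (not the evaluation code used for this model):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, keeping one DP row at a time."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution (free on match)
    return dp[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by words in the reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```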

Evaluation Samples

Below are examples of Kikuyu audio transcribed using this model.

Human transcription | Model prediction
Itonga cirĩ arata aingĩ mũno. | Itonga cirĩ arata aingĩ mũno.
Nĩacoketie ciira ũcio igooti-inĩ nĩũndũ wa akĩoho gĩake. | Nĩacoketie cira ũcio igooti-inĩ nĩũndũ wa akĩohwo gĩake.
Mathomo maingĩ intaneti-inĩ. | Mathomo maingĩ intaneti-inĩ

Downstream use

  • Transcription services for Kikuyu content.
  • Development of language learning apps for Kikuyu speakers.
  • Subtitle generation for Kikuyu audio and video content.