## Model Description
This model is a fine-tuned version of openai/whisper-base for Kikuyu automatic speech recognition (ASR), trained on 100 hours of transcribed Kikuyu speech from the MCAA1-MSU/anv_data_ke dataset. The model achieves a Word Error Rate (WER) of 36% and a Character Error Rate (CER) of 10%.
- Developed by: Mary Kariuki
- Model type: Automatic Speech Recognition (ASR)
- Language: Kikuyu
- Finetuned from: openai/whisper-base
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa

model_id = "DeKUT-DSAIL/Tunuh-whisper-base-Kikuyu-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model and processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Preprocess audio: load as 16 kHz mono, as expected by Whisper
audio_path = "path_to_your_audio.wav"
audio, sampling_rate = librosa.load(audio_path, sr=16000, mono=True)
inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

# Generate transcript
predicted_ids = model.generate(input_features, max_length=448)

# Decode output
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
## Training Details
### Preprocessing
The preprocessing steps included:
- Removing corrupted files
- Removing audio longer than 30 seconds
- Removing unnecessary columns
- Resampling audio to 16 kHz
- Tokenization
- Feature extraction
### Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 64
- training_steps: 2000
- warmup_steps: 500
- eval_strategy: steps
- eval_steps: 200
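Assuming training used the Hugging Face `Seq2SeqTrainer`, these settings map onto `Seq2SeqTrainingArguments` roughly as below (`output_dir` and any argument not listed above are illustrative; older `transformers` versions spell `eval_strategy` as `evaluation_strategy`):

```python
from transformers import Seq2SeqTrainingArguments

# output_dir is illustrative; only the hyperparameters listed in the
# card are set explicitly, everything else keeps library defaults.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-base-kikuyu",
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    max_steps=2000,
    warmup_steps=500,
    eval_strategy="steps",
    eval_steps=200,
)
```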
### Training Results
| Training Loss | Epoch | WER (%) | CER (%) |
|---|---|---|---|
| 0.7982 | 1.0 | 55.6971 | 14.4809 |
| 0.4751 | 2.0 | 46.5992 | 13.0708 |
| 0.3492 | 3.0 | 40.3700 | 11.11 |
| 0.3483 | 4.0 | 37.5377 | 10.2413 |
| 0.2764 | 5.0 | 36.5935 | 10.1114 |
### Metrics
The following metrics were used during model evaluation:
- WER (Word Error Rate): measures the percentage of words that were incorrectly predicted compared to the reference transcript.
- CER (Character Error Rate): measures the percentage of characters that were incorrectly predicted compared to the reference transcript.
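Both scores are based on edit distance. A minimal, dependency-free sketch of the computation is below (the reported numbers were presumably produced with a standard evaluation library such as `jiwer` or `evaluate`, not this code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (one rolling DP row)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def wer(reference, hypothesis):
    # Word-level edits divided by the number of reference words
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # Character-level edits divided by the number of reference characters
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("a b c d", "a x c"))  # 2 edits / 4 words -> 0.5
```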
### Evaluation Samples
Below are examples of Kikuyu audio transcribed with this model (audio samples omitted here).
| Human transcription | Model prediction |
|---|---|
| Itonga cirĩ arata aingĩ mũno. | Itonga cirĩ arata aingĩ mũno. |
| Nĩacoketie ciira ũcio igooti-inĩ nĩũndũ wa akĩoho gĩake. | Nĩacoketie cira ũcio igooti-inĩ nĩũndũ wa akĩohwo gĩake. |
| Mathomo maingĩ intaneti-inĩ. | Mathomo maingĩ intaneti-inĩ |
## Downstream Use
- Transcription services for Kikuyu content.
- Development of language learning apps for Kikuyu speakers.
- Subtitle generation for Kikuyu audio and video content.