# asr-whisper-large-v3-helpline
This model is a fine-tuned version of openai/whisper-large-v3 on the Mozilla Common Voice Swahili 23.0 dataset. It represents a significant improvement in Swahili transcription capabilities, leveraging the latest Whisper architecture (v3) which includes additional language tokens and improved architectural features over v2.
## Model Description
This model specifically targets Swahili (Kiswahili) speech recognition. By fine-tuning the massive whisper-large-v3 checkpoint on the Common Voice 23.0 dataset, this model adapts the general multilingual capabilities of Whisper to the specific acoustic and linguistic characteristics of standard Swahili.
Key Highlights:
- Base Architecture: Whisper Large v3 (1.55B parameters)
- Training Data: Mozilla Common Voice 23.0 (Swahili)
- Best Performance: 24.74% WER (achieved at step 6,500)
- Final Performance: 26.64% WER (at step 9,000)
Note: The model name suggests "helpline" usage, but this specific version is trained on Common Voice (read speech). It serves as a strong general-purpose baseline before domain adaptation to telephony/helpline audio.
## Intended Uses & Limitations
### Intended Uses
- General Transcription: Suitable for transcribing high-quality Swahili audio (news, audiobooks, clear recordings).
- Base for Transfer Learning: Excellent starting point for further fine-tuning on specialized domains (e.g., medical, legal, or telephony/helpline data).
- Research: Benchmarking Swahili ASR performance using Large v3 architecture.
### Limitations
- Read Speech Bias: Since Common Voice consists mostly of read text, the model may be less robust to spontaneous conversational speech (interjections, disfluencies) compared to models trained on conversational data.
- Overfitting: Training logs indicate signs of overfitting after step 6,500 (validation loss increased). It is recommended to use the checkpoint from step 6,500 for optimal generalization.
- Hardware: Inference requires significant VRAM (~10GB+ in FP16) due to the model size.
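The ~10 GB figure can be sanity-checked with back-of-envelope arithmetic: the 1.55B parameters alone occupy roughly 3 GB in FP16, and the rest goes to activations, the attention KV cache, and decoding buffers. A rough sketch (an estimate, not a measurement):

```python
# Back-of-envelope VRAM estimate for the model weights alone in FP16.
# Activations, KV cache, and beam-search buffers add several more GB on top.
params = 1.55e9          # parameter count of Whisper Large v3
bytes_per_param = 2      # FP16 uses 2 bytes per parameter
weights_gb = params * bytes_per_param / 1024**3
print(f"weights alone: {weights_gb:.1f} GB")  # ≈ 2.9 GB
```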
## Training and Evaluation Data
Dataset: Mozilla Common Voice 23.0 (Swahili)
Common Voice is a crowdsourced dataset where volunteers record sample sentences.
- Content: Sentences drawn from diverse sources, including news, Wikipedia, and general conversational text.
- Speakers: Wide variety of ages, accents, and genders (crowdsourced).
- Split: Standard Train/Validation/Test splits provided by the Common Voice dataset were used.
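Common Voice clips are distributed as MP3, typically at 48 kHz, while Whisper's feature extractor expects 16 kHz mono audio, so resampling is a required preprocessing step. A minimal illustrative sketch using naive linear interpolation (in practice, prefer the resamplers in librosa or torchaudio):

```python
import numpy as np

def resample_linear(audio: np.ndarray, sr_in: int, sr_out: int = 16000) -> np.ndarray:
    """Naive linear-interpolation resampler, for illustration only."""
    if sr_in == sr_out:
        return audio
    n_out = int(round(len(audio) * sr_out / sr_in))
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio)

# One second of audio at 48 kHz becomes 16,000 samples at 16 kHz
clip = np.zeros(48000, dtype=np.float32)
resampled = resample_linear(clip, 48000)
print(len(resampled))  # 16000
```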
## Training Procedure
The model was fine-tuned using the Hugging Face Seq2SeqTrainer.
### Training Hyperparameters
- Learning Rate: 5e-06 (Conservative LR to preserve pre-trained knowledge)
- Batch Size: 16 (Train) / 16 (Eval)
- Optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- Scheduler: Cosine with Restarts (warmup: 500 steps)
- Total Steps: 9,000 (Stopped)
- Precision: Mixed Precision (FP16/Native AMP)
- Seed: 42
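The learning-rate schedule above (linear warmup into cosine with restarts) can be sketched in plain Python. This mirrors the shape of the `cosine_with_restarts` scheduler in transformers; the number of cycles is an assumption (1), since it is not stated in the card:

```python
import math

def lr_at_step(step, base_lr=5e-6, warmup=500, total=9000, num_cycles=1):
    """Linear warmup followed by cosine decay with hard restarts.

    num_cycles=1 is an assumption; the card only states the warmup,
    total steps, and base learning rate.
    """
    if step < warmup:
        # Linear warmup from 0 to base_lr
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    if progress >= 1.0:
        return 0.0
    # Cosine decay, restarting num_cycles times over the decay phase
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0))))

print(lr_at_step(500))   # peak LR (5e-06) at the end of warmup
print(lr_at_step(9000))  # decayed to 0.0 at the final step
```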
### Training Results & Analysis
| Epoch | Step | Training Loss | Validation Loss | WER | Status |
|---|---|---|---|---|---|
| 0.02 | 500 | 0.5310 | 0.6190 | 41.31% | Warmup |
| 1.00 | 3000 | 0.3316 | 0.3876 | 30.24% | Improving |
| 2.00 | 6000 | 0.1993 | 0.3512 | 25.53% | Strong performance |
| 2.03 | 6500 | 0.1393 | 0.3486 | 24.74% | ✅ Best Checkpoint |
| 2.06 | 7000 | 0.1035 | 0.3477 | 24.86% | Stable |
| 2.08 | 7500 | 0.1905 | 0.4069 🔺 | 28.92% | Overfitting begins |
| 3.01 | 9000 | 0.2348 | 0.3681 | 26.64% | Final Step |
Analysis: The model reached peak performance at step 6,500 with a WER of 24.74% and a validation loss of 0.3486. Subsequent steps (7,500+) show rising validation loss and WER, indicating the model began to overfit the training data. Load the checkpoint from step 6,500 if available, or use the final model with awareness of this slight degradation.
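For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. In practice, libraries such as `jiwer` or the `evaluate` package compute it; a minimal self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("ya" -> "za") out of 3 reference words ≈ 0.333
print(wer("habari ya asubuhi", "habari za asubuhi"))
```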
## Usage
### Using the Transformers Pipeline

```python
from transformers import pipeline

# Load the pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="your-username/asr-whisper-large-v3-helpline",
)

# Transcribe; force the language to Swahili for best results
result = pipe("audio.mp3", generate_kwargs={"language": "swahili"})
print(result["text"])
```
### Using the Low-Level API

```python
import torch
import librosa
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Load model and processor
model_id = "your-username/asr-whisper-large-v3-helpline"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Load audio at 16 kHz, the sampling rate Whisper expects
audio, sr = librosa.load("sample_swahili.wav", sr=16000)

# Extract log-mel features, matching the model's device and dtype
# (casting to torch_dtype avoids a float32/float16 mismatch on GPU)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

# Generate token ids and decode to text
generated_ids = model.generate(input_features, language="swahili")
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```
## Framework Versions
- Transformers: 4.56.2
- PyTorch: 2.8.0+cu128
- Datasets: 2.21.0
- Tokenizers: 0.22.1
## Model Tree
Model tree for openchs/asr-whisper-largev3-finetuned:
- Base model: openai/whisper-large-v3
- Dataset used to train: Mozilla Common Voice Swahili 23.0

## Evaluation Results
- WER on the Common Voice Swahili 23.0 test set: 24.74% (self-reported)