# asr-whisper-large-v3-helpline
This model is a fine-tuned version of openai/whisper-large-v3 on the Mozilla Common Voice Swahili 23.0 dataset. It represents a significant improvement in Swahili transcription capabilities, leveraging the latest Whisper architecture (v3) which includes additional language tokens and improved architectural features over v2.
## Model Description
This model specifically targets Swahili (Kiswahili) speech recognition. By fine-tuning the massive whisper-large-v3 checkpoint on the Common Voice 23.0 dataset, this model adapts the general multilingual capabilities of Whisper to the specific acoustic and linguistic characteristics of standard Swahili.
Key Highlights:
- Base Architecture: Whisper Large v3 (1.55B parameters)
- Training Data: Mozilla Common Voice 23.0 (Swahili)
- Best Performance: 24.74% WER (achieved at step 6,500)
- Final Performance: 26.64% WER (at step 9,000)
Note: The model name suggests "helpline" usage, but this specific version is trained on Common Voice (read speech). It serves as a strong general-purpose baseline before domain adaptation to telephony/helpline audio.
## Intended Uses & Limitations
### Intended Uses
- General Transcription: Suitable for transcribing high-quality Swahili audio (news, audiobooks, clear recordings).
- Base for Transfer Learning: Excellent starting point for further fine-tuning on specialized domains (e.g., medical, legal, or telephony/helpline data).
- Research: Benchmarking Swahili ASR performance using Large v3 architecture.
### Limitations
- Read Speech Bias: Since Common Voice consists mostly of read text, the model may be less robust to spontaneous conversational speech (interjections, disfluencies) compared to models trained on conversational data.
- Overfitting: Training logs indicate signs of overfitting after step 6,500 (validation loss increased). It is recommended to use the checkpoint from step 6,500 for optimal generalization.
- Hardware: Inference requires significant VRAM (~10GB+ in FP16) due to the model size.
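The ~10 GB figure can be sanity-checked with back-of-envelope arithmetic: the 1.55B parameters alone occupy roughly 3 GB in FP16, and the rest goes to activations, the attention KV cache, and decoding buffers. A rough sketch (an estimate, not a measurement):

```python
# Back-of-envelope VRAM estimate for the model weights alone in FP16.
# Activations, KV cache, and beam-search buffers add several more GB on top.
params = 1.55e9          # parameter count of Whisper Large v3
bytes_per_param = 2      # FP16 uses 2 bytes per parameter
weights_gb = params * bytes_per_param / 1024**3
print(f"weights alone: {weights_gb:.1f} GB")  # ≈ 2.9 GB
```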
## Training and Evaluation Data
Dataset: Mozilla Common Voice 23.0 (Swahili)
Common Voice is a crowdsourced dataset where volunteers record sample sentences.
- Content: Sentences drawn from diverse sources, including news, Wikipedia, and general conversational text.
- Speakers: Wide variety of ages, accents, and genders (crowdsourced).
- Split: Standard Train/Validation/Test splits provided by the Common Voice dataset were used.
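Common Voice clips are distributed as MP3, typically at 48 kHz, while Whisper's feature extractor expects 16 kHz mono audio, so resampling is a required preprocessing step. A minimal illustrative sketch using naive linear interpolation (in practice, prefer the resamplers in librosa or torchaudio):

```python
import numpy as np

def resample_linear(audio: np.ndarray, sr_in: int, sr_out: int = 16000) -> np.ndarray:
    """Naive linear-interpolation resampler, for illustration only."""
    if sr_in == sr_out:
        return audio
    n_out = int(round(len(audio) * sr_out / sr_in))
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio)

# One second of audio at 48 kHz becomes 16,000 samples at 16 kHz
clip = np.zeros(48000, dtype=np.float32)
resampled = resample_linear(clip, 48000)
print(len(resampled))  # 16000
```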
## Training Procedure
The model was fine-tuned using the Hugging Face Seq2SeqTrainer.
### Training Hyperparameters
- Learning Rate: 5e-06 (Conservative LR to preserve pre-trained knowledge)
- Batch Size: 16 (Train) / 16 (Eval)
- Optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- Scheduler: Cosine with Restarts (warmup: 500 steps)
- Total Steps: 9,000 (Stopped)
- Precision: Mixed Precision (FP16/Native AMP)
- Seed: 42
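The learning-rate schedule above (linear warmup into cosine with restarts) can be sketched in plain Python. This mirrors the shape of the `cosine_with_restarts` scheduler in transformers; the number of cycles is an assumption (1), since it is not stated in the card:

```python
import math

def lr_at_step(step, base_lr=5e-6, warmup=500, total=9000, num_cycles=1):
    """Linear warmup followed by cosine decay with hard restarts.

    num_cycles=1 is an assumption; the card only states the warmup,
    total steps, and base learning rate.
    """
    if step < warmup:
        # Linear warmup from 0 to base_lr
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    if progress >= 1.0:
        return 0.0
    # Cosine decay, restarting num_cycles times over the decay phase
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0))))

print(lr_at_step(500))   # peak LR (5e-06) at the end of warmup
print(lr_at_step(9000))  # decayed to 0.0 at the final step
```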
### Training Results & Analysis
| Epoch | Step | Training Loss | Validation Loss | WER | Status |
|---|---|---|---|---|---|
| 0.02 | 500 | 0.5310 | 0.6190 | 41.31% | Warmup |
| 1.00 | 3000 | 0.3316 | 0.3876 | 30.24% | Improving |
| 2.00 | 6000 | 0.1993 | 0.3512 | 25.53% | Strong performance |
| 2.03 | 6500 | 0.1393 | 0.3486 | 24.74% | ✅ Best Checkpoint |
| 2.06 | 7000 | 0.1035 | 0.3477 | 24.86% | Stable |
| 2.08 | 7500 | 0.1905 | 0.4069 🔺 | 28.92% | Overfitting begins |
| 3.01 | 9000 | 0.2348 | 0.3681 | 26.64% | Final Step |
Analysis: The model reached peak performance at step 6,500 with a WER of 24.74% and a validation loss of 0.3486. Subsequent steps (7,500+) show rising validation loss and WER, indicating the model began to overfit the training data. Load the checkpoint from step 6,500 if available, or use the final model with awareness of this slight degradation.
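For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. In practice, libraries such as `jiwer` or the `evaluate` package compute it; a minimal self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("ya" -> "za") out of 3 reference words ≈ 0.333
print(wer("habari ya asubuhi", "habari za asubuhi"))
```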
## Usage
### Using the Transformers Pipeline

```python
from transformers import pipeline

# Load the pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="your-username/asr-whisper-large-v3-helpline",
)

# Transcribe; force the language to Swahili for best results
result = pipe("audio.mp3", generate_kwargs={"language": "swahili"})
print(result["text"])
```
### Using the Low-Level API

```python
import torch
import librosa
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Load model and processor
model_id = "your-username/asr-whisper-large-v3-helpline"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Load audio at 16 kHz, the sampling rate Whisper expects
audio, sr = librosa.load("sample_swahili.wav", sr=16000)

# Extract log-mel features, matching the model's device and dtype
# (casting to torch_dtype avoids a float32/float16 mismatch on GPU)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

# Generate token ids and decode to text
generated_ids = model.generate(input_features, language="swahili")
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```
## Framework Versions
- Transformers: 4.56.2
- PyTorch: 2.8.0+cu128
- Datasets: 2.21.0
- Tokenizers: 0.22.1
## Model Tree
Model tree for openchs/asr-whisper-largev3-finetuned:
- Base model: openai/whisper-large-v3
- Dataset used to train: Mozilla Common Voice Swahili 23.0

## Evaluation Results
- WER on the Common Voice Swahili 23.0 test set: 24.74% (self-reported)