asr-whisper-large-v3-helpline

This model is a fine-tuned version of openai/whisper-large-v3 on the Swahili subset of Mozilla Common Voice 23.0. It substantially improves Swahili transcription over the base checkpoint by building on the latest Whisper architecture (v3), which adds language tokens and other architectural refinements over v2.

Model Description

This model specifically targets Swahili (Kiswahili) speech recognition. By fine-tuning the massive whisper-large-v3 checkpoint on the Common Voice 23.0 dataset, this model adapts the general multilingual capabilities of Whisper to the specific acoustic and linguistic characteristics of standard Swahili.

Key Highlights:

  • Base Architecture: Whisper Large v3 (1.55B parameters)
  • Training Data: Mozilla Common Voice 23.0 (Swahili)
  • Best Performance: 24.74% WER (achieved at step 6,500)
  • Final Performance: 26.64% WER (at step 9,000)

Note: The model name suggests "helpline" usage, but this specific version is trained on Common Voice (read speech). It serves as a strong general-purpose baseline before domain adaptation to telephony/helpline audio.

Intended Uses & Limitations

Intended Uses

  • General Transcription: Suitable for transcribing high-quality Swahili audio (news, audiobooks, clear recordings).
  • Base for Transfer Learning: Excellent starting point for further fine-tuning on specialized domains (e.g., medical, legal, or telephony/helpline data).
  • Research: Benchmarking Swahili ASR performance using Large v3 architecture.

Limitations

  • Read Speech Bias: Since Common Voice consists mostly of read text, the model may be less robust to spontaneous conversational speech (interjections, disfluencies) compared to models trained on conversational data.
  • Overfitting: Training logs indicate signs of overfitting after step 6,500 (validation loss increased). It is recommended to use the checkpoint from step 6,500 for optimal generalization.
  • Hardware: Inference requires significant VRAM (roughly 10 GB in FP16) due to the model size.
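
The ~10 GB figure can be sanity-checked with a back-of-envelope calculation: the FP16 weights alone account for roughly 3 GB, with activations, the decoder's KV cache, and framework overhead making up the rest. A minimal sketch, assuming the stated 1.55B parameter count:

```python
# Back-of-envelope VRAM estimate for Whisper large-v3 inference.
# Assumption: 1.55B parameters at 2 bytes per parameter in FP16.
PARAMS = 1.55e9
BYTES_PER_PARAM_FP16 = 2

weights_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1024**3
print(f"FP16 weights alone: ~{weights_gb:.1f} GB")
# Activations, the KV cache, and framework overhead add several more GB,
# which is why ~10 GB of VRAM is a practical working minimum.
```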

Training and Evaluation Data

Dataset: Mozilla Common Voice 23.0 (Swahili)

Common Voice is a crowdsourced dataset where volunteers record sample sentences.

  • Content: Diverse topics, including news, Wikipedia sentences, and everyday conversation.
  • Speakers: Wide variety of ages, accents, and genders (crowdsourced).
  • Split: Standard Train/Validation/Test splits provided by the Common Voice dataset were used.

Training Procedure

The model was fine-tuned using the Hugging Face Seq2SeqTrainer.

Training Hyperparameters

  • Learning Rate: 5e-06 (Conservative LR to preserve pre-trained knowledge)
  • Batch Size: 16 (Train) / 16 (Eval)
  • Optimizer: AdamW (betas=(0.9,0.999), epsilon=1e-08)
  • Scheduler: Cosine with Restarts (Warmup: 500 steps)
  • Total Steps: 9,000 (Stopped)
  • Precision: Mixed Precision (FP16/Native AMP)
  • Seed: 42
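
The hyperparameters above translate roughly into the following Seq2SeqTrainingArguments. This is a sketch, not the actual run configuration: the output directory and the evaluation/save cadence (inferred from the 500-step intervals in the results log) are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the training configuration described above.
# output_dir is a hypothetical path; eval/save cadence is assumed.
training_args = Seq2SeqTrainingArguments(
    output_dir="./asr-whisper-large-v3-helpline",
    learning_rate=5e-6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    max_steps=9000,
    lr_scheduler_type="cosine_with_restarts",
    fp16=True,
    seed=42,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    predict_with_generate=True,
)
```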

Training Results & Analysis

| Epoch | Step | Training Loss | Validation Loss | WER       | Status             |
|-------|------|---------------|-----------------|-----------|--------------------|
| 0.02  | 500  | 0.5310        | 0.6190          | 41.31%    | Warmup             |
| 1.00  | 3000 | 0.3316        | 0.3876          | 30.24%    | Improving          |
| 2.00  | 6000 | 0.1993        | 0.3512          | 25.53%    | Strong performance |
| 2.03  | 6500 | 0.1393        | 0.3486          | 24.74%    | ✅ Best checkpoint  |
| 2.06  | 7000 | 0.1035        | 0.3477          | 24.86%    | Stable             |
| 2.08  | 7500 | 0.1905        | 0.4069          | 28.92% 🔺 | Overfitting begins |
| 3.01  | 9000 | 0.2348        | 0.3681          | 26.64%    | Final step         |

Analysis: The model reached peak performance at step 6,500 with a WER of 24.74% and a validation loss of 0.3486. Subsequent steps (7,500+) show rising validation loss and WER, indicating the model began to overfit the training data. Load the checkpoint from step 6,500 if available, or use the final model with this slight degradation in mind.
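
For reference, the WER values above are word error rates: the word-level edit distance (substitutions + deletions + insertions) between hypothesis and reference, divided by the number of reference words. A minimal pure-Python sketch; libraries such as jiwer or evaluate compute the same quantity:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a three-word reference -> WER of 1/3.
print(wer("habari ya leo", "habari za leo"))
```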

Usage

Using Transformers Pipeline

from transformers import pipeline

# Load the pipeline
pipe = pipeline("automatic-speech-recognition", model="your-username/asr-whisper-large-v3-helpline")

# Transcribe
# Force the language to Swahili for best results
result = pipe("audio.mp3", generate_kwargs={"language": "swahili"})
print(result["text"])

Using Low-Level API

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import librosa

# Load Model
model_id = "your-username/asr-whisper-large-v3-helpline"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Load Audio
audio, sr = librosa.load("sample_swahili.wav", sr=16000)

# Process
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)  # match the model's dtype (FP16 on GPU)

# Generate
generated_ids = model.generate(input_features, language="swahili")
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(transcription)

Framework Versions

  • Transformers: 4.56.2
  • PyTorch: 2.8.0+cu128
  • Datasets: 2.21.0
  • Tokenizers: 0.22.1
