asr-whisper-largev2-v7

This model is a domain-adapted version of openchs/asr-whisper-helpline-sw-v1 fine-tuned on an expanded dataset of real phone call recordings from the Tanzania Child Helpline system, powered by OpenCHS.

Model Description

This ASR model is the third iteration of domain-specific fine-tuning (v7), trained on a significantly expanded corpus compared to v6. It continues the effort to bridge the gap between clean, read speech and real-world telephony audio, benefiting from ~1.9x more training data than its predecessor.

Key Characteristics:

  • Domain: Child helpline phone call transcription (Swahili)
  • Best Checkpoint: Step 6,000
  • Validation WER: 43.73% on real phone call audio
  • Validation Loss: 0.6778
  • Training Dataset: Swahili ASR Dataset v8 (~328.5 hours of augmented telephony speech)

Performance Context: The model achieves 43.73% WER on real telephony audio, representing a 0.29 percentage point improvement over v6 (44.02% WER) while being trained on nearly double the data. The model demonstrates continued scaling benefits with expanded domain-specific training examples.

Training Strategy

Two-Stage Training Pipeline:

  1. Stage 1 - Common Voice 17.0: Initial fine-tuning from Whisper Large v2 (10,000 steps)
  2. Stage 2 (This Model) - Real Phone Calls v8: Domain adaptation on expanded helpline recordings (6,000 steps → 43.73% WER on telephony)

This model represents Stage 2 v7 with enhanced domain-specific optimization through increased training data volume and a higher augmentation factor (5x vs 4x).

Intended Uses & Limitations

Intended Uses

Primary:

  • Transcribing Swahili speech in Tanzania Child Helpline call center environments
  • Real-time or batch processing of telephony audio (8kHz phone quality)
  • Production ASR system for helpline service documentation and analytics

Secondary:

  • General Swahili ASR for telephony/call center applications
  • Research baseline for domain adaptation studies (clean speech → telephony)
  • Transfer learning base for similar low-resource telephony ASR tasks

Key Improvements Over Previous Versions

  • Largest Training Dataset to Date: ~328.5 hours augmented
  • Higher Augmentation Factor: 5x for greater acoustic diversity
  • Improved Accuracy: CER and WER reduction on telephony audio
  • Better Confidence: Reduction in validation loss
  • Enhanced Coverage: More diverse call scenarios and speaker characteristics
  • Telephony Robustness: Optimized for phone bandwidth (8kHz) and call quality variations
  • Dialect Coverage: Trained on authentic Tanzanian Swahili dialects from real conversations
  • Production Ready: Validated on actual helpline audio (not just clean datasets)

Limitations

Domain-Specific Vocabulary:

  • Optimized for child helpline and healthcare-related conversations
  • May underperform on technical, legal, or specialized domains outside training data scope

Dialect Specificity:

  • Best performance on Tanzanian Swahili dialects represented in training data
  • May have reduced accuracy on coastal, northern, or other regional variants not well-represented

Audio Quality Requirements:

  • Designed for telephony (8kHz-16kHz); may need retuning for high-fidelity audio
  • Performance degrades with severe background noise or very poor connections (though trained on augmented noisy data)

Code-Switching:

  • Limited handling of Swahili-English code-switching common in urban Tanzania
  • May struggle with mixed-language utterances

Model Size:

  • Large model (Whisper Large v2 architecture) requires GPU for real-time transcription
  • Consider quantization or distillation for edge deployment
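
The quantization suggestion above can be sketched with PyTorch dynamic quantization. This is shown on a toy module rather than the full Whisper checkpoint (loading the real model requires a large download); the module here is purely illustrative, but the same call applies to the Linear layers of WhisperForConditionalGeneration:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block; the real target would be the
# Linear layers inside WhisperForConditionalGeneration.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
)
model.eval()

# Dynamic quantization converts Linear weights to int8 ahead of time;
# activations are quantized on the fly during inference (CPU only).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    out = quantized(x)
print(out.shape)  # torch.Size([1, 128])
```

Dynamic quantization is the lowest-effort option since it needs no calibration data; distillation (e.g. a Whisper-small student) trades more work for larger speedups.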

Training and Evaluation Data

The model was trained on the Swahili ASR Dataset v8, a private dataset curated specifically for this task with the largest coverage to date across the v5/v6/v7 series.

Data Privacy & Access

Status: 🔒 Private / Internal Use Only

The dataset is not publicly available due to strict privacy and Personally Identifiable Information (PII) concerns. The source audio consists of real calls to the Tanzania Child Helpline. While the model weights are shared, the training data remains confidential to protect the identities of callers, many of whom are minors.

Dataset Volume (Hours & Samples)

The dataset utilizes a 5x augmentation strategy (for training set only) to maximize the utility of the available domain-specific audio.

| Split      | Unique Samples | Original Duration | Augmented Duration | Notes                                       |
|------------|----------------|-------------------|--------------------|---------------------------------------------|
| Training   | 44,651         | ~65.7 hours       | ~328.5 hours       | 1 original + 4 augmented versions per sample |
| Validation | 12,758         | ~18.7 hours       | ~18.7 hours        | Original audio only (no augmentation)        |
| Test       | 6,379          | ~9.4 hours        | ~9.4 hours         | Original audio only (no augmentation)        |
| TOTAL      | 63,788         | ~93.8 hours       | ~356.6 hours       |                                              |

Data Characteristics

  • Source: Real-world phone call audio (not studio recordings)
  • Language: Tanzanian Swahili with natural conversational characteristics
  • Format: Telephony quality (primarily 8kHz, upsampled to 16kHz for Whisper)
  • Content: Domain-relevant vocabulary (child welfare, healthcare, family support)
  • Raw Records: 80,671 annotations; 63,788 retained after deduplication (16,883 duplicates removed)
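
The deduplication step above can be sketched as a simple exact-match filter. The record fields (`audio_path`, `transcript`) and the normalization are hypothetical, since the actual curation pipeline is not published:

```python
def deduplicate(records):
    """Keep the first occurrence of each (audio_path, transcript) pair."""
    seen = set()
    kept = []
    for rec in records:
        # Normalize the transcript so trivial case/whitespace variants collapse
        key = (rec["audio_path"], rec["transcript"].strip().lower())
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

records = [
    {"audio_path": "call_001.wav", "transcript": "Habari yako"},
    {"audio_path": "call_001.wav", "transcript": "habari yako"},  # duplicate
    {"audio_path": "call_002.wav", "transcript": "Nzuri sana"},
]
print(len(deduplicate(records)))  # prints 2
```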

Audio Augmentation Strategy

To make the model robust against the noisy environment of a call center, the training set was expanded from ~65.7 hours to ~328.5 hours using a multi-technique augmentation strategy. The augmentation factor was increased to 5x (from ~4x in v6) to enhance acoustic diversity.

Training samples were augmented with the following techniques:

  1. Colored Noise (50% probability): Background static simulation (White/Pink/Brown noise, 30% zone coverage)
  2. Packet Loss / Crop (40% probability): VoIP connection drop simulation (15% audio cropped from 10%-90% zone)
  3. VTLP - Speaker Variability (50% probability): Vocal tract length perturbation (factor range 0.9–1.1)
  4. Codec Masking (30% probability): Compression artifact simulation (10% audio masked with noise)
  5. Pitch Shift (40% probability): Tone variation (±2 semitones)
  6. Time Stretch (40% probability): Speed variation (0.9x–1.1x)
  7. Volume Variation (70% probability): Loudness variation (±6 dB)

Each technique is applied independently based on its probability weight. Augmented datasets are shuffled after augmentation to improve mini-batch diversity during training.

Note: Validation and Test splits contain only original audio to ensure unbiased evaluation metrics.
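
The independent, probability-weighted application described above can be sketched in NumPy. Only three of the seven techniques are implemented here, with simplified parameters, since the project's actual augmentation code is not published:

```python
import numpy as np

rng = np.random.default_rng(42)

def add_white_noise(audio, snr_db=20.0):
    """Mix in white noise at a target signal-to-noise ratio."""
    signal_power = np.mean(audio ** 2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    return audio + rng.normal(0, np.sqrt(noise_power), audio.shape)

def packet_loss(audio, frac=0.15):
    """Zero out a contiguous chunk to mimic a VoIP dropout (10%-90% zone)."""
    n = int(len(audio) * frac)
    start = rng.integers(int(0.1 * len(audio)), int(0.9 * len(audio)) - n)
    out = audio.copy()
    out[start:start + n] = 0.0
    return out

def volume(audio, max_gain_db=6.0):
    """Random loudness change within +/- max_gain_db."""
    gain_db = rng.uniform(-max_gain_db, max_gain_db)
    return audio * (10 ** (gain_db / 20))

# (technique, application probability) pairs, each rolled independently
PIPELINE = [(add_white_noise, 0.5), (packet_loss, 0.4), (volume, 0.7)]

def augment(audio):
    for fn, p in PIPELINE:
        if rng.random() < p:
            audio = fn(audio)
    return audio

clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone at 16 kHz
noisy = augment(clean)
print(noisy.shape)  # same length as the input: (16000,)
```

Because each technique rolls its own probability, a single sample can receive zero, one, or several augmentations per pass, which is what produces the acoustic diversity across the 4 augmented copies.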

Training Procedure

Training was performed in two phases on an A40 GPU:

  • Phase 1: Steps 1–7,500 (~8 hours wall time)
  • Phase 2: Steps 7,500–10,000, resumed from Phase 1 checkpoint (~3.9 hours wall time)

The best checkpoint (step 6,000) was achieved during Phase 1.

Training Hyperparameters

Optimization:

  • learning_rate: 1e-05
  • lr_scheduler_type: cosine_with_restarts
  • lr_scheduler_warmup_steps: 500
  • optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
  • max_training_steps: 12,000 (best checkpoint at 6,000)
  • seed: 42

Batch Configuration:

  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 1
  • Effective batch size: 16

Memory Optimization:

  • gradient_checkpointing: true
  • mixed_precision_training: Native AMP (FP16)

Evaluation & Checkpointing:

  • evaluation_strategy: steps
  • eval_steps: 500
  • save_steps: 500
  • logging_steps: 50
  • save_total_limit: 3

Best Model Selection:

  • load_best_model_at_end: true
  • metric_for_best_model: "wer"
  • greater_is_better: false
  • early_stopping_patience: 3 evaluations (1,500 steps)

Infrastructure:

  • GPU: A40 (48GB VRAM)
  • Total training time: ~10.9 hours (Phase 1 + Phase 2)
  • Checkpoint size: ~3GB per checkpoint

Training Results

| Training Loss | Epoch | Step  | Validation Loss | WER   | Notes                              |
|---------------|-------|-------|-----------------|-------|------------------------------------|
| 1.095         | 0.046 | 500   | 0.8853          | 57.56 | Initial adaptation                 |
| 0.987         | 0.083 | 1000  | 0.8043          | 52.16 |                                    |
| 0.910         | 0.125 | 1500  | 0.7693          | 49.50 |                                    |
| 0.872         | 0.167 | 2000  | 0.7435          | 57.55 | Eval sample variability            |
| 0.843         | 0.208 | 2500  | 0.7323          | 47.05 |                                    |
| 0.822         | 0.250 | 3000  | 0.7152          | 50.64 | Eval sample variability            |
| 0.795         | 0.292 | 3500  | 0.7000          | 47.31 |                                    |
| 0.756         | 0.333 | 4000  | 0.6966          | 45.03 |                                    |
| 0.718         | 0.375 | 4500  | 0.6927          | 44.41 |                                    |
| 0.692         | 0.417 | 5000  | 0.6873          | 45.80 |                                    |
| 0.668         | 0.458 | 5500  | 0.6813          | 44.39 |                                    |
| 0.663         | 0.500 | 6000  | 0.6778          | 43.73 | Best checkpoint                    |
| 0.650         | 0.542 | 6500  | 0.6750          | 43.85 | No improvement (1/3)               |
| 0.621         | 0.583 | 7000  | 0.6770          | 44.24 | No improvement (2/3)               |
| 0.621         | 0.625 | 7500  | 0.6768          | 45.29 | No improvement (3/3) - Phase 1 ends |
| 0.629         | 0.667 | 8000  | 0.7024          | 45.80 | Phase 2 begins (resumed)           |
| 0.628         | 0.708 | 8500  | 0.7090          | 45.41 |                                    |
| 0.604         | 0.750 | 9000  | 0.7114          | 46.12 |                                    |
| 0.599         | 0.792 | 9500  | 0.7125          | 49.07 |                                    |
| 0.542         | 0.833 | 10000 | 0.7048          | 46.22 | Phase 2 ends                       |

Training Observations:

  • Convergence: Best WER achieved at step 6,000 (43.73%), slightly later than v6's step 4,500
  • WER Spikes: Early training shows evaluation variability (steps 2,000 and 3,000) attributable to stochastic sampling of the 500-sample evaluation subset
  • Phase 2 Degradation: Resumed training from step 7,500 with a new warmup schedule did not improve beyond Phase 1 performance, consistent with the model having already converged
  • Stable Convergence Window: Steps 6,000–7,500 form a stable convergence plateau (WER 43.73%–45.29%) before Phase 2

Final Metrics (Step 6,000 — Best Checkpoint):

  • Training loss: 0.6633
  • Validation loss: 0.6778
  • Validation WER: 43.73%
  • Total training steps: 10,000 (across two phases)

Usage

Quick Start

from transformers import pipeline

# Load the model
pipe = pipeline("automatic-speech-recognition",
                model="rogendo/asr-whisper-largev2-v7")

# Transcribe phone call audio
result = pipe("path/to/phone_call.wav")
print(result["text"])

Advanced Usage with Audio Preprocessing

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("rogendo/asr-whisper-largev2-v7")
model = WhisperForConditionalGeneration.from_pretrained("rogendo/asr-whisper-largev2-v7")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load and preprocess audio (handles telephony audio)
audio, sr = librosa.load("path/to/phone_call.wav", sr=16000, mono=True)

# Process audio
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device)

# Generate transcription with language hint
forced_decoder_ids = processor.get_decoder_prompt_ids(language="sw", task="transcribe")
predicted_ids = model.generate(
    input_features,
    forced_decoder_ids=forced_decoder_ids,
    max_length=448
)

# Decode transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

Production Deployment Recommendations

Audio Requirements:

  • Sample rate: 16kHz (model will work with 8kHz telephony audio upsampled to 16kHz)
  • Format: Mono (single channel)
  • Duration: Optimal <30 seconds per segment for memory efficiency
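
Resampling 8 kHz telephony audio to 16 kHz and splitting it into <30-second segments, as the requirements above describe, can be sketched with SciPy and NumPy. This is one straightforward approach, not the project's actual preprocessing code:

```python
import numpy as np
from scipy.signal import resample_poly

def prepare_segments(audio_8k, max_seconds=30, target_sr=16000):
    """Upsample 8 kHz mono audio to 16 kHz and split into <=30 s chunks."""
    audio_16k = resample_poly(audio_8k, up=2, down=1)  # 8 kHz -> 16 kHz
    chunk = max_seconds * target_sr
    return [audio_16k[i:i + chunk] for i in range(0, len(audio_16k), chunk)]

# 70 seconds of audio at 8 kHz -> three segments (30 s + 30 s + 10 s)
segments = prepare_segments(np.zeros(70 * 8000))
print([len(s) / 16000 for s in segments])  # [30.0, 30.0, 10.0]
```

For a simpler route, the transformers pipeline also accepts `chunk_length_s=30` and handles segmentation internally.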

Inference Optimization:

# Use half-precision for faster inference
model = model.half()  # FP16

# Enable batch processing for multiple files
batch_size = 8
results = pipe(audio_files, batch_size=batch_size)

Real-time Considerations:

  • GPU required for real-time transcription (RTF < 1.0)
  • CPU inference possible but slower (RTF ~3-5x on modern CPUs)
  • Consider model quantization for edge deployment

Evaluation Methodology

Validation Set:

  • 12,758 samples from real phone call recordings
  • Evaluation subset: 500 samples per evaluation step (for speed)
  • Evaluated every 500 training steps
  • Original audio only (no augmentation)

WER Calculation:

  • Standard Word Error Rate: (Substitutions + Deletions + Insertions) / Total Words
  • Normalized text (lowercase, punctuation handling)
  • Swahili-specific text normalization applied
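
The WER formula above can be sketched with a standard word-level edit distance. Normalization here is just lowercasing, a simplification of the Swahili-specific normalization mentioned:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("habari za asubuhi", "habari ya asubuhi"))  # 1 substitution / 3 words
```

In production, a library such as jiwer computes the same metric with configurable normalization.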

Note on WER Variability: Early-training evaluation metrics (steps 2,000 and 3,000) show elevated WER due to stochastic sampling of the 500-sample evaluation subset. Overall trend is downward, converging at step 6,000.

Best Model Selection:

  • Automatic selection based on lowest validation WER
  • Early stopping with patience of 3 evaluations (1,500 steps)
  • Final model: Step 6,000 checkpoint
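
The selection logic above (track the best WER, stop after 3 evaluations without improvement) can be sketched as follows; replaying the Phase 1 WER trace from the training results reproduces the reported best checkpoint at step 6,000 and the stop at step 7,500:

```python
def run_with_early_stopping(eval_wers, patience=3):
    """Return (best_step, best_wer, stop_step) given WERs logged every 500 steps."""
    best_wer = float("inf")
    best_step = None
    bad_evals = 0
    for i, w in enumerate(eval_wers):
        step = (i + 1) * 500
        if w < best_wer:
            best_wer, best_step, bad_evals = w, step, 0  # new best -> reset patience
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, best_wer, step  # patience exhausted
    return best_step, best_wer, len(eval_wers) * 500

# WER trace from the Phase 1 table (steps 500..7500)
wers = [57.56, 52.16, 49.50, 57.55, 47.05, 50.64, 47.31,
        45.03, 44.41, 45.80, 44.39, 43.73, 43.85, 44.24, 45.29]
print(run_with_early_stopping(wers))  # (6000, 43.73, 7500)
```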

Future Work

  • Test set evaluation: Comprehensive evaluation on held-out 6,379-sample test set
  • Code-switching support: Improve Swahili-English mixed utterance handling
  • Model compression: Quantization and distillation for faster inference
  • Streaming ASR: Adapt for real-time streaming transcription
  • Dialect expansion: Include more regional Swahili variants
  • Ablation studies: Analyze impact of individual augmentation techniques on WER

Citation

If you use this model in your research or production systems, please cite:

@misc{openchs-swahili-asr-v7,
  title={Domain-Adapted Swahili ASR for Tanzania Child Helpline Telephony (v7)},
  author={OpenCHS Team},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/rogendo/asr-whisper-largev2-v7}},
  note={Fine-tuned from openchs/asr-whisper-helpline-sw-v1 on expanded phone call data (v8, ~328.5 hours augmented)}
}

Framework Versions

  • Transformers: 4.56.2
  • PyTorch: 2.8.0+cu128
  • Datasets: 2.21.0
  • Tokenizers: 0.22.1

License

Apache 2.0

Acknowledgments


Model Status: Production Ready — Largest training corpus to date for Tanzania Child Helpline telephony transcription
