asr-whisper-largev2-v7
This model is a domain-adapted version of openchs/asr-whisper-helpline-sw-v1 fine-tuned on an expanded dataset of real phone call recordings from the Tanzania Child Helpline system, powered by OpenCHS.
Model Description
This ASR model represents the third iteration of domain-specific fine-tuning (v7), trained on a significantly expanded corpus compared to v6. It continues the effort to bridge the gap between clean, read speech and real-world telephony audio, while benefiting from roughly 1.9x more training data than its predecessor.
Key Characteristics:
- Domain: Child helpline phone call transcription (Swahili)
- Best Checkpoint: Step 6,000
- Validation WER: 43.73% on real phone call audio
- Validation Loss: 0.6778
- Training Dataset: Swahili ASR Dataset v8 (~328.5 hours of augmented telephony speech)
Performance Context: The model achieves 43.73% WER on real telephony audio, representing a 0.29 percentage point improvement over v6 (44.02% WER) while being trained on nearly double the data. The model demonstrates continued scaling benefits with expanded domain-specific training examples.
Training Strategy
Two-Stage Training Pipeline:
- Stage 1 - Common Voice 17.0: Initial fine-tuning from Whisper Large v2 (10,000 steps)
- Stage 2 (This Model) - Real Phone Calls v8: Domain adaptation on expanded helpline recordings (6,000 steps → 43.73% WER on telephony)
This model represents Stage 2 v7 with enhanced domain-specific optimization through increased training data volume and a higher augmentation factor (5x vs 4x).
Intended Uses & Limitations
Intended Uses
Primary:
- Transcribing Swahili speech in Tanzania Child Helpline call center environments
- Real-time or batch processing of telephony audio (8kHz phone quality)
- Production ASR system for helpline service documentation and analytics
Secondary:
- General Swahili ASR for telephony/call center applications
- Research baseline for domain adaptation studies (clean speech → telephony)
- Transfer learning base for similar low-resource telephony ASR tasks
Key Improvements Over Previous Versions
- Largest Training Dataset to Date: ~328.5 hours of augmented audio
- Higher Augmentation Factor: 5x for greater acoustic diversity
- Improved Accuracy: CER and WER reduction on telephony audio
- Better Confidence: reduction in validation loss
- Enhanced Coverage: more diverse call scenarios and speaker characteristics
- Telephony Robustness: optimized for phone bandwidth (8kHz) and call-quality variations
- Dialect Coverage: trained on authentic Tanzanian Swahili dialects from real conversations
- Production Ready: validated on actual helpline audio, not just clean datasets
Limitations
Domain-Specific Vocabulary:
- Optimized for child helpline and healthcare-related conversations
- May underperform on technical, legal, or specialized domains outside training data scope
Dialect Specificity:
- Best performance on Tanzanian Swahili dialects represented in training data
- May have reduced accuracy on coastal, northern, or other regional variants not well-represented
Audio Quality Requirements:
- Designed for telephony (8kHz-16kHz); may need retuning for high-fidelity audio
- Performance degrades with severe background noise or very poor connections (though trained on augmented noisy data)
Code-Switching:
- Limited handling of Swahili-English code-switching common in urban Tanzania
- May struggle with mixed-language utterances
Model Size:
- Large model (Whisper Large v2 architecture) requires GPU for real-time transcription
- Consider quantization or distillation for edge deployment
Training and Evaluation Data
The model was trained on the Swahili ASR Dataset v8, a private dataset curated specifically for this task with the largest coverage to date across the v5/v6/v7 series.
Data Privacy & Access
Status: 🔒 Private / Internal Use Only
The dataset is not publicly available due to strict privacy and Personally Identifiable Information (PII) concerns. The source audio consists of real calls to the Tanzania Child Helpline. While the model weights are shared, the training data remains confidential to protect the identities of callers, many of whom are minors.
Dataset Volume (Hours & Samples)
The dataset utilizes a 5x augmentation strategy (for training set only) to maximize the utility of the available domain-specific audio.
| Split | Unique Samples | Original Duration | Augmented Duration | Notes |
|---|---|---|---|---|
| Training | 44,651 | ~65.7 hours | ~328.5 hours | 1 original + 4 augmented versions per sample |
| Validation | 12,758 | ~18.7 hours | ~18.7 hours | Original audio only (no augmentation) |
| Test | 6,379 | ~9.4 hours | ~9.4 hours | Original audio only (no augmentation) |
| TOTAL | 63,788 | ~93.8 hours | ~356.6 hours | |
Data Characteristics
- Source: Real-world phone call audio (not studio recordings)
- Language: Tanzanian Swahili with natural conversational characteristics
- Format: Telephony quality (primarily 8kHz, upsampled to 16kHz for Whisper)
- Content: Domain-relevant vocabulary (child welfare, healthcare, family support)
- Raw Records: 80,671 annotations; 63,788 retained after deduplication (16,883 duplicates removed)
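The exact deduplication key is not published; a minimal sketch, assuming duplicates share the same audio file and normalized transcript (the field names here are hypothetical):

```python
# Illustrative only: the card does not specify the dedup criteria, so this
# assumes a duplicate is a repeated (audio_path, normalized transcript) pair.
def deduplicate(records):
    seen = set()
    kept = []
    for rec in records:
        key = (rec["audio_path"], rec["transcript"].strip().lower())
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

records = [
    {"audio_path": "call_001.wav", "transcript": "Habari yako"},
    {"audio_path": "call_001.wav", "transcript": "habari yako "},  # duplicate
    {"audio_path": "call_002.wav", "transcript": "Asante sana"},
]
print(len(deduplicate(records)))  # 2
```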
Audio Augmentation Strategy
To make the model robust against the noisy environment of a call center, the training set was expanded from ~65.7 hours to ~328.5 hours using a multi-technique augmentation strategy. The augmentation factor was increased to 5x (from ~4x in v6) to enhance acoustic diversity.
Training samples were augmented with the following techniques:
- Colored Noise (50% probability): Background static simulation (White/Pink/Brown noise, 30% zone coverage)
- Packet Loss / Crop (40% probability): VoIP connection drop simulation (15% audio cropped from 10%-90% zone)
- VTLP - Speaker Variability (50% probability): Vocal tract length perturbation (factor range 0.9–1.1)
- Codec Masking (30% probability): Compression artifact simulation (10% audio masked with noise)
- Pitch Shift (40% probability): Tone variation (±2 semitones)
- Time Stretch (40% probability): Speed variation (0.9x–1.1x)
- Volume Variation (70% probability): Loudness variation (±6 dB)
Each technique is applied independently based on its probability weight. Augmented datasets are shuffled after augmentation to improve mini-batch diversity during training.
Note: Validation and Test splits contain only original audio to ensure unbiased evaluation metrics.
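The independent per-technique gating described above can be sketched as follows. This is not the actual pipeline (which is not published and likely uses a dedicated augmentation library); only two cheap stand-in transforms are shown, with probabilities taken from the list above:

```python
import numpy as np

# Sketch of independent probability gates: each technique fires on its own
# coin flip, so a sample may receive zero, one, or several augmentations.
def augment(audio, rng):
    out = audio.copy()
    if rng.random() < 0.5:  # colored-noise gate (white noise as a stand-in)
        out = out + rng.normal(0.0, 0.005, size=out.shape)
    if rng.random() < 0.7:  # volume variation gate: +/- 6 dB gain
        gain_db = rng.uniform(-6.0, 6.0)
        out = out * (10.0 ** (gain_db / 20.0))
    return out.astype(np.float32)

rng = np.random.default_rng(42)
audio = np.zeros(16000, dtype=np.float32)  # 1 s of audio at 16 kHz
variants = [augment(audio, rng) for _ in range(4)]  # 1 original + 4 augmented = 5x
```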
Training Procedure
Training was performed in two phases on an A40 GPU:
- Phase 1: Steps 1–7,500 (~8 hours wall time)
- Phase 2: Steps 7,500–10,000, resumed from Phase 1 checkpoint (~3.9 hours wall time)
The best checkpoint (step 6,000) was achieved during Phase 1.
Training Hyperparameters
Optimization:
- learning_rate: 1e-05
- lr_scheduler_type: cosine_with_restarts
- lr_scheduler_warmup_steps: 500
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
- max_training_steps: 12,000 (best checkpoint at 6,000)
- seed: 42
Batch Configuration:
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- gradient_accumulation_steps: 1
- Effective batch size: 16
Memory Optimization:
- gradient_checkpointing: true
- mixed_precision_training: Native AMP (FP16)
Evaluation & Checkpointing:
- evaluation_strategy: steps
- eval_steps: 500
- save_steps: 500
- logging_steps: 50
- save_total_limit: 3
Best Model Selection:
- load_best_model_at_end: true
- metric_for_best_model: "wer"
- greater_is_better: false
- early_stopping_patience: 3 evaluations (1,500 steps)
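The early-stopping rule above can be illustrated with a small sketch: stop after three consecutive evaluations with no WER improvement, which is exactly what ends Phase 1 at step 7,500 in the results table below.

```python
# Patience-based early stopping on WER (lower is better).
def find_stop_step(eval_history, patience=3):
    best_wer, bad_evals = float("inf"), 0
    for step, wer in eval_history:
        if wer < best_wer:
            best_wer, bad_evals = wer, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step, best_wer
    return None, best_wer

# Tail of the Phase 1 log from the training results table
history = [(5500, 44.39), (6000, 43.73), (6500, 43.85), (7000, 44.24), (7500, 45.29)]
print(find_stop_step(history))  # (7500, 43.73): stop at 7,500; best WER from step 6,000
```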
Infrastructure:
- GPU: A40 (48GB VRAM)
- Total training time: ~11.9 hours (Phase 1 + Phase 2)
- Checkpoint size: ~3GB per checkpoint
Training Results
| Training Loss | Epoch | Step | Validation Loss | WER | Notes |
|---|---|---|---|---|---|
| 1.095 | 0.046 | 500 | 0.8853 | 57.56 | Initial adaptation |
| 0.987 | 0.083 | 1000 | 0.8043 | 52.16 | |
| 0.910 | 0.125 | 1500 | 0.7693 | 49.50 | |
| 0.872 | 0.167 | 2000 | 0.7435 | 57.55 | Eval sample variability |
| 0.843 | 0.208 | 2500 | 0.7323 | 47.05 | |
| 0.822 | 0.250 | 3000 | 0.7152 | 50.64 | Eval sample variability |
| 0.795 | 0.292 | 3500 | 0.7000 | 47.31 | |
| 0.756 | 0.333 | 4000 | 0.6966 | 45.03 | |
| 0.718 | 0.375 | 4500 | 0.6927 | 44.41 | |
| 0.692 | 0.417 | 5000 | 0.6873 | 45.80 | |
| 0.668 | 0.458 | 5500 | 0.6813 | 44.39 | |
| 0.663 | 0.500 | 6000 | 0.6778 | 43.73 | Best checkpoint |
| 0.650 | 0.542 | 6500 | 0.6750 | 43.85 | No improvement (1/3) |
| 0.621 | 0.583 | 7000 | 0.6770 | 44.24 | No improvement (2/3) |
| 0.621 | 0.625 | 7500 | 0.6768 | 45.29 | No improvement (3/3) - Phase 1 ends |
| 0.629 | 0.667 | 8000 | 0.7024 | 45.80 | Phase 2 begins (resumed) |
| 0.628 | 0.708 | 8500 | 0.7090 | 45.41 | |
| 0.604 | 0.750 | 9000 | 0.7114 | 46.12 | |
| 0.599 | 0.792 | 9500 | 0.7125 | 49.07 | |
| 0.542 | 0.833 | 10000 | 0.7048 | 46.22 | Phase 2 ends |
Training Observations:
- Convergence: Best WER achieved at step 6,000 (43.73%), slightly later than v6's step 4,500
- WER Spikes: Early training shows evaluation variability (steps 2,000 and 3,000) attributable to stochastic sampling of the 500-sample evaluation subset
- Phase 2 Degradation: Resumed training from step 7,500 with a new warmup schedule did not improve beyond Phase 1 performance, consistent with the model having already converged
- Stable Convergence Window: Steps 6,000–7,500 form a stable convergence plateau (WER 43.73%–45.29%) before Phase 2
Final Metrics (Step 6,000 — Best Checkpoint):
- Training loss: 0.6633
- Validation loss: 0.6778
- Validation WER: 43.73%
- Total training steps: 10,000 (across two phases)
Usage
Quick Start
```python
from transformers import pipeline

# Load the model
pipe = pipeline(
    "automatic-speech-recognition",
    model="rogendo/asr-whisper-largev2-v7",
)

# Transcribe phone call audio
result = pipe("path/to/phone_call.wav")
print(result["text"])
```
Advanced Usage with Audio Preprocessing
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("rogendo/asr-whisper-largev2-v7")
model = WhisperForConditionalGeneration.from_pretrained("rogendo/asr-whisper-largev2-v7")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load and preprocess audio (handles telephony audio)
audio, sr = librosa.load("path/to/phone_call.wav", sr=16000, mono=True)

# Process audio
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device)

# Generate transcription with language hint
forced_decoder_ids = processor.get_decoder_prompt_ids(language="sw", task="transcribe")
predicted_ids = model.generate(
    input_features,
    forced_decoder_ids=forced_decoder_ids,
    max_length=448,
)

# Decode transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
Production Deployment Recommendations
Audio Requirements:
- Sample rate: 16kHz (model will work with 8kHz telephony audio upsampled to 16kHz)
- Format: Mono (single channel)
- Duration: Optimal <30 seconds per segment for memory efficiency
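The <30-second segmentation suggested above can be done with a simple fixed-window splitter; this is a minimal sketch (a production system might instead cut at silences using a VAD):

```python
import numpy as np

# Split long mono audio into <= max_seconds windows before inference.
def chunk_audio(audio, sr=16000, max_seconds=30):
    step = max_seconds * sr
    return [audio[i:i + step] for i in range(0, len(audio), step)]

audio = np.zeros(16000 * 75, dtype=np.float32)  # 75 s of audio
chunks = chunk_audio(audio)
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 15.0]
```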
Inference Optimization:
```python
# Use half-precision for faster inference
model = model.half()  # FP16

# Enable batch processing for multiple files
batch_size = 8
results = pipe(audio_files, batch_size=batch_size)
```
Real-time Considerations:
- GPU required for real-time transcription (RTF < 1.0)
- CPU inference possible but slower (RTF ~3-5x on modern CPUs)
- Consider model quantization for edge deployment
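The real-time factor (RTF) used above is simply processing time divided by audio duration; values below 1.0 mean the system keeps up with incoming audio:

```python
# Real-time factor: processing time / audio duration (RTF < 1.0 is real-time).
def real_time_factor(processing_seconds, audio_seconds):
    return processing_seconds / audio_seconds

print(real_time_factor(12.0, 60.0))  # 0.2 -> 5x faster than real time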
Evaluation Methodology
Validation Set:
- 12,758 samples from real phone call recordings
- Evaluation subset: 500 samples per evaluation step (for speed)
- Evaluated every 500 training steps
- Original audio only (no augmentation)
WER Calculation:
- Standard Word Error Rate: (Substitutions + Deletions + Insertions) / Total Words
- Normalized text (lowercase, punctuation handling)
- Swahili-specific text normalization applied
Note on WER Variability: Early-training evaluation metrics (steps 2,000 and 3,000) show elevated WER due to stochastic sampling of the 500-sample evaluation subset. Overall trend is downward, converging at step 6,000.
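A minimal word-level implementation of the WER formula above, via edit distance (real evaluation pipelines typically use a library such as jiwer and apply the Swahili normalization first; the example sentences are illustrative):

```python
# Word-level WER = (substitutions + deletions + insertions) / reference words,
# computed with a standard Levenshtein dynamic program over word tokens.
def wer(reference, hypothesis):
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / len(ref)

print(wer("habari yako leo", "habari zako leo"))  # 0.333... (1 substitution / 3 words)
```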
Best Model Selection:
- Automatic selection based on lowest validation WER
- Early stopping with patience of 3 evaluations (1,500 steps)
- Final model: Step 6,000 checkpoint
Future Work
- Test set evaluation: Comprehensive evaluation on held-out 6,379-sample test set
- Code-switching support: Improve Swahili-English mixed utterance handling
- Model compression: Quantization and distillation for faster inference
- Streaming ASR: Adapt for real-time streaming transcription
- Dialect expansion: Include more regional Swahili variants
- Ablation studies: Analyze impact of individual augmentation techniques on WER
Citation
If you use this model in your research or production systems, please cite:
@misc{openchs-swahili-asr-v7,
title={Domain-Adapted Swahili ASR for Tanzania Child Helpline Telephony (v7)},
author={OpenCHS Team},
year={2026},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/rogendo/asr-whisper-largev2-v7}},
note={Fine-tuned from openchs/asr-whisper-helpline-sw-v1 on expanded phone call data (v8, ~328.5 hours augmented)}
}
Framework Versions
- Transformers: 4.56.2
- PyTorch: 2.8.0+cu128
- Datasets: 2.21.0
- Tokenizers: 0.22.1
License
Apache 2.0
Acknowledgments
- Base model: openchs/asr-whisper-helpline-sw-v1
- Previous version: openchs/asr-whisper-largev2-v6
- Foundation model: OpenAI Whisper Large v2
- Training infrastructure: A40 GPU
- Project: OpenCHS - Open Source Child Helpline System
- Data collection: Tanzania Child Helpline operations team
Model Status: Production Ready — Largest training corpus to date for Tanzania Child Helpline telephony transcription