asr-whisper-largev2-v7

This model is a domain-adapted version of openchs/asr-whisper-helpline-sw-v1 fine-tuned on an expanded dataset of real phone call recordings from the Tanzania Child Helpline system, powered by OpenCHS.

Model Description

This ASR model is the third iteration of domain-specific fine-tuning (v7), trained on a significantly expanded corpus compared to v6. It continues the effort to bridge the gap between clean, read speech and real-world telephony audio, benefiting from ~1.9x more training data than its predecessor.

Key Characteristics:

  • Domain: Child helpline phone call transcription (Swahili)
  • Best Checkpoint: Step 6,000
  • Validation WER: 43.73% on real phone call audio
  • Validation Loss: 0.6778
  • Training Dataset: Swahili ASR Dataset v8 (~328.5 hours of augmented telephony speech)

Performance Context: The model achieves 43.73% WER on real telephony audio, representing a 0.29 percentage point improvement over v6 (44.02% WER) while being trained on nearly double the data. The model demonstrates continued scaling benefits with expanded domain-specific training examples.

Training Strategy

Two-Stage Training Pipeline:

  1. Stage 1 - Common Voice 17.0: Initial fine-tuning from Whisper Large v2 (10,000 steps)
  2. Stage 2 (This Model) - Real Phone Calls v8: Domain adaptation on expanded helpline recordings (6,000 steps → 43.73% WER on telephony)

This model represents Stage 2 v7 with enhanced domain-specific optimization through increased training data volume and a higher augmentation factor (5x vs 4x).

Intended Uses & Limitations

Intended Uses

Primary:

  • Transcribing Swahili speech in Tanzania Child Helpline call center environments
  • Real-time or batch processing of telephony audio (8kHz phone quality)
  • Production ASR system for helpline service documentation and analytics

Secondary:

  • General Swahili ASR for telephony/call center applications
  • Research baseline for domain adaptation studies (clean speech → telephony)
  • Transfer learning base for similar low-resource telephony ASR tasks

Key Improvements Over Previous Versions

  • Largest Training Dataset to Date: ~328.5 hours augmented
  • Higher Augmentation Factor: 5x for greater acoustic diversity
  • Improved Accuracy: CER and WER reduction on telephony audio
  • Better Confidence: Reduction in validation loss
  • Enhanced Coverage: More diverse call scenarios and speaker characteristics
  • Telephony Robustness: Optimized for phone bandwidth (8kHz) and call quality variations
  • Dialect Coverage: Trained on authentic Tanzanian Swahili dialects from real conversations
  • Production Ready: Validated on actual helpline audio (not just clean datasets)

Limitations

Domain-Specific Vocabulary:

  • Optimized for child helpline and healthcare-related conversations
  • May underperform on technical, legal, or specialized domains outside training data scope

Dialect Specificity:

  • Best performance on Tanzanian Swahili dialects represented in training data
  • May have reduced accuracy on coastal, northern, or other regional variants not well-represented

Audio Quality Requirements:

  • Designed for telephony (8kHz-16kHz); may need retuning for high-fidelity audio
  • Performance degrades with severe background noise or very poor connections (though trained on augmented noisy data)

Code-Switching:

  • Limited handling of Swahili-English code-switching common in urban Tanzania
  • May struggle with mixed-language utterances

Model Size:

  • Large model (Whisper Large v2 architecture) requires GPU for real-time transcription
  • Consider quantization or distillation for edge deployment
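
The quantization suggestion above can be sketched with PyTorch dynamic quantization. This is shown on a toy module rather than the full Whisper checkpoint (loading the real model requires a large download); the module here is purely illustrative, but the same call applies to the Linear layers of WhisperForConditionalGeneration:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block; the real target would be the
# Linear layers inside WhisperForConditionalGeneration.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
)
model.eval()

# Dynamic quantization converts Linear weights to int8 ahead of time;
# activations are quantized on the fly during inference (CPU only).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    out = quantized(x)
print(out.shape)  # torch.Size([1, 128])
```

Dynamic quantization is the lowest-effort option since it needs no calibration data; distillation (e.g. a Whisper-small student) trades more work for larger speedups.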

Training and Evaluation Data

The model was trained on the Swahili ASR Dataset v8, a private dataset curated specifically for this task with the largest coverage to date across the v5/v6/v7 series.

Data Privacy & Access

Status: 🔒 Private / Internal Use Only

The dataset is not publicly available due to strict privacy and Personally Identifiable Information (PII) concerns. The source audio consists of real calls to the Tanzania Child Helpline. While the model weights are shared, the training data remains confidential to protect the identities of callers, many of whom are minors.

Dataset Volume (Hours & Samples)

The dataset utilizes a 5x augmentation strategy (for training set only) to maximize the utility of the available domain-specific audio.

| Split      | Unique Samples | Original Duration | Augmented Duration | Notes                                       |
|------------|----------------|-------------------|--------------------|---------------------------------------------|
| Training   | 44,651         | ~65.7 hours       | ~328.5 hours       | 1 original + 4 augmented versions per sample |
| Validation | 12,758         | ~18.7 hours       | ~18.7 hours        | Original audio only (no augmentation)        |
| Test       | 6,379          | ~9.4 hours        | ~9.4 hours         | Original audio only (no augmentation)        |
| TOTAL      | 63,788         | ~93.8 hours       | ~356.6 hours       |                                              |

Data Characteristics

  • Source: Real-world phone call audio (not studio recordings)
  • Language: Tanzanian Swahili with natural conversational characteristics
  • Format: Telephony quality (primarily 8kHz, upsampled to 16kHz for Whisper)
  • Content: Domain-relevant vocabulary (child welfare, healthcare, family support)
  • Raw Records: 80,671 annotations; 63,788 retained after deduplication (16,883 duplicates removed)
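
The deduplication step above can be sketched as a simple exact-match filter. The record fields (`audio_path`, `transcript`) and the normalization are hypothetical, since the actual curation pipeline is not published:

```python
def deduplicate(records):
    """Keep the first occurrence of each (audio_path, transcript) pair."""
    seen = set()
    kept = []
    for rec in records:
        # Normalize the transcript so trivial case/whitespace variants collapse
        key = (rec["audio_path"], rec["transcript"].strip().lower())
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

records = [
    {"audio_path": "call_001.wav", "transcript": "Habari yako"},
    {"audio_path": "call_001.wav", "transcript": "habari yako"},  # duplicate
    {"audio_path": "call_002.wav", "transcript": "Nzuri sana"},
]
print(len(deduplicate(records)))  # prints 2
```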

Audio Augmentation Strategy

To make the model robust against the noisy environment of a call center, the training set was expanded from ~65.7 hours to ~328.5 hours using a multi-technique augmentation strategy. The augmentation factor was increased to 5x (from ~4x in v6) to enhance acoustic diversity.

Training samples were augmented with the following techniques:

  1. Colored Noise (50% probability): Background static simulation (White/Pink/Brown noise, 30% zone coverage)
  2. Packet Loss / Crop (40% probability): VoIP connection drop simulation (15% audio cropped from 10%-90% zone)
  3. VTLP - Speaker Variability (50% probability): Vocal tract length perturbation (factor range 0.9–1.1)
  4. Codec Masking (30% probability): Compression artifact simulation (10% audio masked with noise)
  5. Pitch Shift (40% probability): Tone variation (±2 semitones)
  6. Time Stretch (40% probability): Speed variation (0.9x–1.1x)
  7. Volume Variation (70% probability): Loudness variation (±6 dB)

Each technique is applied independently based on its probability weight. Augmented datasets are shuffled after augmentation to improve mini-batch diversity during training.

Note: Validation and Test splits contain only original audio to ensure unbiased evaluation metrics.
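
The independent, probability-weighted application described above can be sketched in NumPy. Only three of the seven techniques are implemented here, with simplified parameters, since the project's actual augmentation code is not published:

```python
import numpy as np

rng = np.random.default_rng(42)

def add_white_noise(audio, snr_db=20.0):
    """Mix in white noise at a target signal-to-noise ratio."""
    signal_power = np.mean(audio ** 2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    return audio + rng.normal(0, np.sqrt(noise_power), audio.shape)

def packet_loss(audio, frac=0.15):
    """Zero out a contiguous chunk to mimic a VoIP dropout (10%-90% zone)."""
    n = int(len(audio) * frac)
    start = rng.integers(int(0.1 * len(audio)), int(0.9 * len(audio)) - n)
    out = audio.copy()
    out[start:start + n] = 0.0
    return out

def volume(audio, max_gain_db=6.0):
    """Random loudness change within +/- max_gain_db."""
    gain_db = rng.uniform(-max_gain_db, max_gain_db)
    return audio * (10 ** (gain_db / 20))

# (technique, application probability) pairs, each rolled independently
PIPELINE = [(add_white_noise, 0.5), (packet_loss, 0.4), (volume, 0.7)]

def augment(audio):
    for fn, p in PIPELINE:
        if rng.random() < p:
            audio = fn(audio)
    return audio

clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone at 16 kHz
noisy = augment(clean)
print(noisy.shape)  # same length as the input: (16000,)
```

Because each technique rolls its own probability, a single sample can receive zero, one, or several augmentations per pass, which is what produces the acoustic diversity across the 4 augmented copies.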

Training Procedure

Training was performed in two phases on an A40 GPU:

  • Phase 1: Steps 1–7,500 (~8 hours wall time)
  • Phase 2: Steps 7,500–10,000, resumed from Phase 1 checkpoint (~3.9 hours wall time)

The best checkpoint (step 6,000) was achieved during Phase 1.

Training Hyperparameters

Optimization:

  • learning_rate: 1e-05
  • lr_scheduler_type: cosine_with_restarts
  • lr_scheduler_warmup_steps: 500
  • optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
  • max_training_steps: 12,000 (best checkpoint at 6,000)
  • seed: 42

Batch Configuration:

  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 1
  • Effective batch size: 16

Memory Optimization:

  • gradient_checkpointing: true
  • mixed_precision_training: Native AMP (FP16)

Evaluation & Checkpointing:

  • evaluation_strategy: steps
  • eval_steps: 500
  • save_steps: 500
  • logging_steps: 50
  • save_total_limit: 3

Best Model Selection:

  • load_best_model_at_end: true
  • metric_for_best_model: "wer"
  • greater_is_better: false
  • early_stopping_patience: 3 evaluations (1,500 steps)

Infrastructure:

  • GPU: A40 (48GB VRAM)
  • Total training time: ~10.9 hours (Phase 1 + Phase 2)
  • Checkpoint size: ~3GB per checkpoint

Training Results

| Training Loss | Epoch | Step  | Validation Loss | WER   | Notes                              |
|---------------|-------|-------|-----------------|-------|------------------------------------|
| 1.095         | 0.046 | 500   | 0.8853          | 57.56 | Initial adaptation                 |
| 0.987         | 0.083 | 1000  | 0.8043          | 52.16 |                                    |
| 0.910         | 0.125 | 1500  | 0.7693          | 49.50 |                                    |
| 0.872         | 0.167 | 2000  | 0.7435          | 57.55 | Eval sample variability            |
| 0.843         | 0.208 | 2500  | 0.7323          | 47.05 |                                    |
| 0.822         | 0.250 | 3000  | 0.7152          | 50.64 | Eval sample variability            |
| 0.795         | 0.292 | 3500  | 0.7000          | 47.31 |                                    |
| 0.756         | 0.333 | 4000  | 0.6966          | 45.03 |                                    |
| 0.718         | 0.375 | 4500  | 0.6927          | 44.41 |                                    |
| 0.692         | 0.417 | 5000  | 0.6873          | 45.80 |                                    |
| 0.668         | 0.458 | 5500  | 0.6813          | 44.39 |                                    |
| 0.663         | 0.500 | 6000  | 0.6778          | 43.73 | Best checkpoint                    |
| 0.650         | 0.542 | 6500  | 0.6750          | 43.85 | No improvement (1/3)               |
| 0.621         | 0.583 | 7000  | 0.6770          | 44.24 | No improvement (2/3)               |
| 0.621         | 0.625 | 7500  | 0.6768          | 45.29 | No improvement (3/3) - Phase 1 ends |
| 0.629         | 0.667 | 8000  | 0.7024          | 45.80 | Phase 2 begins (resumed)           |
| 0.628         | 0.708 | 8500  | 0.7090          | 45.41 |                                    |
| 0.604         | 0.750 | 9000  | 0.7114          | 46.12 |                                    |
| 0.599         | 0.792 | 9500  | 0.7125          | 49.07 |                                    |
| 0.542         | 0.833 | 10000 | 0.7048          | 46.22 | Phase 2 ends                       |

Training Observations:

  • Convergence: Best WER achieved at step 6,000 (43.73%), slightly later than v6's step 4,500
  • WER Spikes: Early training shows evaluation variability (steps 2,000 and 3,000) attributable to stochastic sampling of the 500-sample evaluation subset
  • Phase 2 Degradation: Resumed training from step 7,500 with a new warmup schedule did not improve beyond Phase 1 performance, consistent with the model having already converged
  • Stable Convergence Window: Steps 6,000–7,500 form a stable convergence plateau (WER 43.73%–45.29%) before Phase 2

Final Metrics (Step 6,000 — Best Checkpoint):

  • Training loss: 0.6633
  • Validation loss: 0.6778
  • Validation WER: 43.73%
  • Total training steps: 10,000 (across two phases)

Usage

Quick Start

from transformers import pipeline

# Load the model
pipe = pipeline("automatic-speech-recognition",
                model="rogendo/asr-whisper-largev2-v7")

# Transcribe phone call audio
result = pipe("path/to/phone_call.wav")
print(result["text"])

Advanced Usage with Audio Preprocessing

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("rogendo/asr-whisper-largev2-v7")
model = WhisperForConditionalGeneration.from_pretrained("rogendo/asr-whisper-largev2-v7")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load and preprocess audio (handles telephony audio)
audio, sr = librosa.load("path/to/phone_call.wav", sr=16000, mono=True)

# Process audio
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device)

# Generate transcription with language hint
forced_decoder_ids = processor.get_decoder_prompt_ids(language="sw", task="transcribe")
predicted_ids = model.generate(
    input_features,
    forced_decoder_ids=forced_decoder_ids,
    max_length=448
)

# Decode transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

Production Deployment Recommendations

Audio Requirements:

  • Sample rate: 16kHz (model will work with 8kHz telephony audio upsampled to 16kHz)
  • Format: Mono (single channel)
  • Duration: Optimal <30 seconds per segment for memory efficiency
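
Resampling 8 kHz telephony audio to 16 kHz and splitting it into <30-second segments, as the requirements above describe, can be sketched with SciPy and NumPy. This is one straightforward approach, not the project's actual preprocessing code:

```python
import numpy as np
from scipy.signal import resample_poly

def prepare_segments(audio_8k, max_seconds=30, target_sr=16000):
    """Upsample 8 kHz mono audio to 16 kHz and split into <=30 s chunks."""
    audio_16k = resample_poly(audio_8k, up=2, down=1)  # 8 kHz -> 16 kHz
    chunk = max_seconds * target_sr
    return [audio_16k[i:i + chunk] for i in range(0, len(audio_16k), chunk)]

# 70 seconds of audio at 8 kHz -> three segments (30 s + 30 s + 10 s)
segments = prepare_segments(np.zeros(70 * 8000))
print([len(s) / 16000 for s in segments])  # [30.0, 30.0, 10.0]
```

For a simpler route, the transformers pipeline also accepts `chunk_length_s=30` and handles segmentation internally.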

Inference Optimization:

# Use half-precision for faster inference
model = model.half()  # FP16

# Enable batch processing for multiple files
batch_size = 8
results = pipe(audio_files, batch_size=batch_size)

Real-time Considerations:

  • GPU required for real-time transcription (RTF < 1.0)
  • CPU inference possible but slower (RTF ~3-5x on modern CPUs)
  • Consider model quantization for edge deployment

Evaluation Methodology

Validation Set:

  • 12,758 samples from real phone call recordings
  • Evaluation subset: 500 samples per evaluation step (for speed)
  • Evaluated every 500 training steps
  • Original audio only (no augmentation)

WER Calculation:

  • Standard Word Error Rate: (Substitutions + Deletions + Insertions) / Total Words
  • Normalized text (lowercase, punctuation handling)
  • Swahili-specific text normalization applied
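
The WER formula above can be sketched with a standard word-level edit distance. Normalization here is just lowercasing, a simplification of the Swahili-specific normalization mentioned:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("habari za asubuhi", "habari ya asubuhi"))  # 1 substitution / 3 words
```

In production, a library such as jiwer computes the same metric with configurable normalization.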

Note on WER Variability: Early-training evaluation metrics (steps 2,000 and 3,000) show elevated WER due to stochastic sampling of the 500-sample evaluation subset. Overall trend is downward, converging at step 6,000.

Best Model Selection:

  • Automatic selection based on lowest validation WER
  • Early stopping with patience of 3 evaluations (1,500 steps)
  • Final model: Step 6,000 checkpoint
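
The selection logic above (track the best WER, stop after 3 evaluations without improvement) can be sketched as follows; replaying the Phase 1 WER trace from the training results reproduces the reported best checkpoint at step 6,000 and the stop at step 7,500:

```python
def run_with_early_stopping(eval_wers, patience=3):
    """Return (best_step, best_wer, stop_step) given WERs logged every 500 steps."""
    best_wer = float("inf")
    best_step = None
    bad_evals = 0
    for i, w in enumerate(eval_wers):
        step = (i + 1) * 500
        if w < best_wer:
            best_wer, best_step, bad_evals = w, step, 0  # new best -> reset patience
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, best_wer, step  # patience exhausted
    return best_step, best_wer, len(eval_wers) * 500

# WER trace from the Phase 1 table (steps 500..7500)
wers = [57.56, 52.16, 49.50, 57.55, 47.05, 50.64, 47.31,
        45.03, 44.41, 45.80, 44.39, 43.73, 43.85, 44.24, 45.29]
print(run_with_early_stopping(wers))  # (6000, 43.73, 7500)
```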

Future Work

  • Test set evaluation: Comprehensive evaluation on held-out 6,379-sample test set
  • Code-switching support: Improve Swahili-English mixed utterance handling
  • Model compression: Quantization and distillation for faster inference
  • Streaming ASR: Adapt for real-time streaming transcription
  • Dialect expansion: Include more regional Swahili variants
  • Ablation studies: Analyze impact of individual augmentation techniques on WER

Citation

If you use this model in your research or production systems, please cite:

@misc{openchs-swahili-asr-v7,
  title={Domain-Adapted Swahili ASR for Tanzania Child Helpline Telephony (v7)},
  author={OpenCHS Team},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/rogendo/asr-whisper-largev2-v7}},
  note={Fine-tuned from openchs/asr-whisper-helpline-sw-v1 on expanded phone call data (v8, ~328.5 hours augmented)}
}

Framework Versions

  • Transformers: 4.56.2
  • PyTorch: 2.8.0+cu128
  • Datasets: 2.21.0
  • Tokenizers: 0.22.1

License

Apache 2.0

Acknowledgments


Model Status: Production Ready — Largest training corpus to date for Tanzania Child Helpline telephony transcription
