# Wav2Vec2 Luganda Dysarthric Speech Recognition (v7)
The first automatic speech recognition (ASR) model built specifically for Luganda-speaking people with dysarthria, a motor speech disorder that affects millions of people across Uganda.
## Model Description

- Base model: `indonesian-nlp/wav2vec2-luganda` (already pretrained on Luganda phonetics)
- Fine-tuned on: `cdli/ugandan_luganda_nonstandard_speech_v1.0`, recordings of real dysarthric Luganda speakers
- Architecture: Wav2Vec2 Large XLSR with a CTC head
- Language: Luganda (`lg`), Uganda
## Performance
| Decoding | WER |
|---|---|
| Greedy (no LM) | 27.87% |
| + KenLM 5-gram | 24.03% |
Baseline comparison:

- Base Wav2Vec2-Luganda (no fine-tuning): ~90%+ WER on dysarthric speech
- This model: 24.03% WER, a roughly 73% relative improvement over that baseline
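The WER figures above are word-level Levenshtein distance divided by the number of reference words. As a minimal illustrative sketch (not the evaluation script used for this model, which would typically rely on a library such as `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("nkwagala nnyo", "nkwagala nyo"))  # 1 substitution / 2 words = 0.5
```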
## Training Data

- Real dysarthric samples: 6,347 (from `cdli/ugandan_luganda_nonstandard_speech_v1.0`)
- Augmented samples: 95,000+ generated using 12 augmentation types, including:
  - Speed perturbation (0.7x, 0.75x, 0.8x)
  - Pitch shifting (±3-4 semitones)
  - Additive noise (10 dB, 20 dB SNR)
  - Vocal jitter (tremor simulation)
  - Amplitude shimmer
  - Formant shifting
  - Room reverb
  - Combined augmentations
- Total training samples: ~102,629
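Augmentations like the first and third in the list above can be approximated with simple signal operations. The following is an illustrative NumPy sketch, not the pipeline used to build this dataset (which would normally use a library such as `librosa` or `audiomentations`):

```python
import numpy as np

def speed_perturb(audio: np.ndarray, factor: float) -> np.ndarray:
    """Resample by linear interpolation; factor < 1 slows speech down."""
    n_out = int(round(len(audio) / factor))
    old_positions = np.linspace(0, len(audio) - 1, num=n_out)
    return np.interp(old_positions, np.arange(len(audio)), audio)

def add_noise(audio: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Add white noise scaled so the signal-to-noise ratio equals snr_db."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

# 1 second of a 440 Hz tone at 16 kHz stands in for a speech clip
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
slow = speed_perturb(audio, 0.8)        # 0.8x speed -> 25% more samples
noisy = add_noise(audio, snr_db=10.0)   # 10 dB SNR additive noise
```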
## How to Use

```python
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("nugwa-mark/wav2vec2-luganda-dysarthric-v7")
model = Wav2Vec2ForCTC.from_pretrained("nugwa-mark/wav2vec2-luganda-dysarthric-v7")

# Load your audio (must be 16 kHz mono)
audio, sr = librosa.load("your_audio.wav", sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy (argmax) decoding over the CTC output
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0])
print(transcription)
```
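Under the hood, `processor.decode` collapses the raw CTC output: consecutive repeated ids are merged and blank tokens are dropped. A standalone sketch of that collapse step (using a hypothetical blank id of 0; the real id comes from the tokenizer's pad token):

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse a CTC id sequence: merge repeats, then drop blanks."""
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# Blanks (0) separate genuine repeats from frame-level duplicates
print(ctc_greedy_collapse([0, 5, 5, 0, 5, 7, 7, 0]))  # [5, 5, 7]
```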
### With KenLM Language Model (recommended)

```python
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
from pyctcdecode import build_ctcdecoder

processor = Wav2Vec2Processor.from_pretrained("nugwa-mark/wav2vec2-luganda-dysarthric-v7")
model = Wav2Vec2ForCTC.from_pretrained("nugwa-mark/wav2vec2-luganda-dysarthric-v7")

# Build the beam-search decoder with KenLM (download the ARPA file separately).
# The label list must match the tokenizer's vocabulary order.
labels = [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
          'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '⁇', '']
decoder = build_ctcdecoder(labels=labels, kenlm_model_path="luganda_5gram.arpa",
                           alpha=0.5, beta=1.0)
decoder._alphabet._labels.extend(['<s>', '</s>'])

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)

audio, sr = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

transcription = processor_with_lm.batch_decode(logits.numpy()).text[0]
print(transcription)
```
## Offline Usage

Both the model and processor can be saved locally and used completely offline:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Save locally (run once while online)
processor.save_pretrained("./luganda-asr-offline")
model.save_pretrained("./luganda-asr-offline")

# Load offline
processor = Wav2Vec2Processor.from_pretrained("./luganda-asr-offline")
model = Wav2Vec2ForCTC.from_pretrained("./luganda-asr-offline")
```
## Intended Use
- Doctor-patient communication aids in Uganda
- Speech therapy progress tracking
- Classroom captioning for students with dysarthria
- WhatsApp voice message transcription
- Smart home voice control
## Limitations
- Trained on a specific set of Ugandan Luganda speakers — may not generalize perfectly to all dialects
- Best performance on mild-to-moderate dysarthria
- Severe dysarthria (intelligibility < 25%) may still show high WER
## Citation

If you use this model, please cite:

```bibtex
@misc{luganda-dysarthric-asr-2026,
  title={Luganda Dysarthric Speech Recognition},
  author={Nugwa Mark},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/nugwa-mark/wav2vec2-luganda-dysarthric-v7}
}
```