Wav2Vec2 Luganda Dysarthric Speech Recognition (v7)

To our knowledge, the first Automatic Speech Recognition (ASR) model built specifically for Luganda speakers with dysarthria, a motor speech disorder affecting millions of people across Uganda.

Model Description

  • Base model: indonesian-nlp/wav2vec2-luganda (already knows Luganda phonetics)
  • Fine-tuned on: cdli/ugandan_luganda_nonstandard_speech_v1.0 — real dysarthric Luganda speakers
  • Architecture: Wav2Vec2 Large XLSR with CTC head
  • Language: Luganda (lg), Uganda

Performance

Decoding          WER
Greedy (no LM)    27.87%
+ KenLM 5-gram    24.03%
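WER here is the standard word error rate: substitutions, deletions, and insertions divided by the number of reference words. A minimal pure-Python implementation for checking transcriptions (an illustrative sketch, not the evaluation script used for these numbers):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (S + D + I) / N via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("omusajja agenda mu kibuga", "omusajja agenda kibuga")` is 0.25: one deleted word out of four reference words.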

Baseline comparison:

  • Base Wav2Vec2-Luganda (no fine-tuning): ~90% WER or higher on dysarthric speech
  • This model: 24.03% WER, a relative WER reduction of roughly 73%

Training Data

  • Real dysarthric samples: 6,347 (from cdli/ugandan_luganda_nonstandard_speech_v1.0)
  • Augmented samples: 95,000+ generated using 12 augmentation types, including:
    • Speed perturbation (0.7x, 0.75x, 0.8x)
    • Pitch shifting (±3-4 semitones)
    • Additive noise (10dB, 20dB SNR)
    • Vocal jitter (tremor simulation)
    • Amplitude shimmer
    • Formant shifting
    • Room reverb
    • Combined augmentations
  • Total training samples: 102,629 (real + augmented)
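Two of the augmentations above, speed perturbation and additive noise at a target SNR, can be sketched in a few lines of NumPy (an illustrative sketch; the actual augmentation pipeline may differ in detail):

```python
import numpy as np

def speed_perturb(signal, rate):
    """Speed perturbation by linear resampling; rate=0.8 yields slower, 25% longer audio."""
    n_out = int(round(len(signal) / rate))
    idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(idx, np.arange(len(signal)), signal)

def add_noise_at_snr(signal, snr_db, rng=None):
    """Add white noise scaled so that the signal-to-noise ratio equals snr_db (in dB)."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal(len(signal))
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose scale so that 10 * log10(p_signal / p_scaled_noise) == snr_db
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise
```

Each augmented clip keeps the original transcript, which is how 6,347 real recordings can be expanded to 95,000+ training samples.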

How to Use

import librosa
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("nugwa-mark/wav2vec2-luganda-dysarthric-v7")
model = Wav2Vec2ForCTC.from_pretrained("nugwa-mark/wav2vec2-luganda-dysarthric-v7")

# Load your audio; librosa resamples to the 16 kHz mono the model expects
audio, sr = librosa.load("your_audio.wav", sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0])
print(transcription)
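The argmax-then-decode step above is greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, then drop blank tokens. `processor.decode` handles this internally; conceptually it reduces to (an illustrative sketch):

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse repeated token ids, then remove CTC blanks."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# Frame-level ids [0, 1, 1, 0, 2, 2, 3, 0] collapse to the token sequence [1, 2, 3]
```

Note that a blank between two identical ids keeps them distinct (e.g. doubled letters), which is why collapsing must happen before blanks are removed.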

With KenLM Language Model (recommended)

import librosa
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
from pyctcdecode import build_ctcdecoder

processor = Wav2Vec2Processor.from_pretrained("nugwa-mark/wav2vec2-luganda-dysarthric-v7")
model = Wav2Vec2ForCTC.from_pretrained("nugwa-mark/wav2vec2-luganda-dysarthric-v7")

# Build decoder with KenLM (download the .arpa file separately).
# The label order must match the model's CTC vocabulary exactly.
labels = [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
          'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '⁇', '']
decoder = build_ctcdecoder(labels=labels, kenlm_model_path="luganda_5gram.arpa", alpha=0.5, beta=1.0)
# Workaround: extend the alphabet so it also covers the tokenizer's <s>/</s> tokens
decoder._alphabet._labels.extend(['<s>', '</s>'])
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)

audio, sr = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
transcription = processor_with_lm.batch_decode(logits.numpy()).text[0]
print(transcription)
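The `alpha` and `beta` arguments control shallow fusion during beam search: `alpha` weights the KenLM log-probability against the acoustic score, and `beta` is a word-insertion bonus that counteracts the LM's preference for short outputs. Conceptually, each beam candidate is ranked by a combined score like this (a simplified sketch; pyctcdecode's internal scoring differs in detail):

```python
def fused_score(acoustic_logp, lm_logp, n_words, alpha=0.5, beta=1.0):
    """Shallow-fusion beam score: acoustic log-prob + LM weight + word-insertion bonus."""
    return acoustic_logp + alpha * lm_logp + beta * n_words

# With alpha=0.5, beta=1.0: fused_score(-10.0, -4.0, 3) == -10.0 + (-2.0) + 3.0 == -9.0
```

Raising `alpha` trusts the language model more, which helps when the acoustic signal is degraded, as with severe dysarthria; raising `beta` favors longer transcriptions.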

Offline Usage

Both the model and processor can be saved locally and used completely offline:

# Save locally
processor.save_pretrained("./luganda-asr-offline")
model.save_pretrained("./luganda-asr-offline")

# Load offline
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
processor = Wav2Vec2Processor.from_pretrained("./luganda-asr-offline")
model = Wav2Vec2ForCTC.from_pretrained("./luganda-asr-offline")

Intended Use

  • Doctor-patient communication aids in Uganda
  • Speech therapy progress tracking
  • Classroom captioning for students with dysarthria
  • WhatsApp voice message transcription
  • Smart home voice control

Limitations

  • Trained on a specific set of Ugandan Luganda speakers; it may not generalize to all dialects or speaker profiles
  • Best performance on mild-to-moderate dysarthria
  • Severe dysarthria (intelligibility below 25%) may still yield high WER

Citation

If you use this model, please cite:

@misc{luganda-dysarthric-asr-2026,
  title={Luganda Dysarthric Speech Recognition},
  author={Nugwa Mark},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/nugwa-mark/wav2vec2-luganda-dysarthric-v7}
}
