Wav2Vec2 Luganda Dysarthric Speech Recognition (v7)

To our knowledge, the first Automatic Speech Recognition (ASR) model built specifically for Luganda speakers with dysarthria, a motor speech disorder affecting millions of people across Uganda.

Model Description

  • Base model: indonesian-nlp/wav2vec2-luganda (already knows Luganda phonetics)
  • Fine-tuned on: cdli/ugandan_luganda_nonstandard_speech_v1.0 — real dysarthric Luganda speakers
  • Architecture: Wav2Vec2 Large XLSR with CTC head
  • Language: Luganda (lg), Uganda

Performance

Decoding          WER
Greedy (no LM)    27.87%
+ KenLM 5-gram    24.03%
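WER here is the standard word error rate: substitutions, deletions, and insertions divided by the number of reference words. A minimal pure-Python implementation for checking transcriptions (an illustrative sketch, not the evaluation script used for these numbers):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (S + D + I) / N via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("omusajja agenda mu kibuga", "omusajja agenda kibuga")` is 0.25: one deleted word out of four reference words.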

Baseline comparison:

  • Base Wav2Vec2-Luganda (no fine-tuning): ~90% WER or higher on dysarthric speech
  • This model: 24.03% WER, a relative WER reduction of roughly 73%

Training Data

  • Real dysarthric samples: 6,347 (from cdli/ugandan_luganda_nonstandard_speech_v1.0)
  • Augmented samples: 95,000+ generated using 12 augmentation types, including:
    • Speed perturbation (0.7x, 0.75x, 0.8x)
    • Pitch shifting (±3-4 semitones)
    • Additive noise (10dB, 20dB SNR)
    • Vocal jitter (tremor simulation)
    • Amplitude shimmer
    • Formant shifting
    • Room reverb
    • Combined augmentations
  • Total training samples: 102,629 (real + augmented)
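Two of the augmentations above, speed perturbation and additive noise at a target SNR, can be sketched in a few lines of NumPy (an illustrative sketch; the actual augmentation pipeline may differ in detail):

```python
import numpy as np

def speed_perturb(signal, rate):
    """Speed perturbation by linear resampling; rate=0.8 yields slower, 25% longer audio."""
    n_out = int(round(len(signal) / rate))
    idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(idx, np.arange(len(signal)), signal)

def add_noise_at_snr(signal, snr_db, rng=None):
    """Add white noise scaled so that the signal-to-noise ratio equals snr_db (in dB)."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal(len(signal))
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose scale so that 10 * log10(p_signal / p_scaled_noise) == snr_db
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise
```

Each augmented clip keeps the original transcript, which is how 6,347 real recordings can be expanded to 95,000+ training samples.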

How to Use

import librosa
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("nugwa-mark/wav2vec2-luganda-dysarthric-v7")
model = Wav2Vec2ForCTC.from_pretrained("nugwa-mark/wav2vec2-luganda-dysarthric-v7")

# Load your audio; librosa resamples to the 16 kHz mono the model expects
audio, sr = librosa.load("your_audio.wav", sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0])
print(transcription)
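The argmax-then-decode step above is greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, then drop blank tokens. `processor.decode` handles this internally; conceptually it reduces to (an illustrative sketch):

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse repeated token ids, then remove CTC blanks."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# Frame-level ids [0, 1, 1, 0, 2, 2, 3, 0] collapse to the token sequence [1, 2, 3]
```

Note that a blank between two identical ids keeps them distinct (e.g. doubled letters), which is why collapsing must happen before blanks are removed.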

With KenLM Language Model (recommended)

import librosa
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
from pyctcdecode import build_ctcdecoder

processor = Wav2Vec2Processor.from_pretrained("nugwa-mark/wav2vec2-luganda-dysarthric-v7")
model = Wav2Vec2ForCTC.from_pretrained("nugwa-mark/wav2vec2-luganda-dysarthric-v7")

# Build decoder with KenLM (download the .arpa file separately).
# The label order must match the model's CTC vocabulary exactly.
labels = [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
          'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '⁇', '']
decoder = build_ctcdecoder(labels=labels, kenlm_model_path="luganda_5gram.arpa", alpha=0.5, beta=1.0)
# Workaround: extend the alphabet so it also covers the tokenizer's <s>/</s> tokens
decoder._alphabet._labels.extend(['<s>', '</s>'])
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)

audio, sr = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
transcription = processor_with_lm.batch_decode(logits.numpy()).text[0]
print(transcription)
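The `alpha` and `beta` arguments control shallow fusion during beam search: `alpha` weights the KenLM log-probability against the acoustic score, and `beta` is a word-insertion bonus that counteracts the LM's preference for short outputs. Conceptually, each beam candidate is ranked by a combined score like this (a simplified sketch; pyctcdecode's internal scoring differs in detail):

```python
def fused_score(acoustic_logp, lm_logp, n_words, alpha=0.5, beta=1.0):
    """Shallow-fusion beam score: acoustic log-prob + LM weight + word-insertion bonus."""
    return acoustic_logp + alpha * lm_logp + beta * n_words

# With alpha=0.5, beta=1.0: fused_score(-10.0, -4.0, 3) == -10.0 + (-2.0) + 3.0 == -9.0
```

Raising `alpha` trusts the language model more, which helps when the acoustic signal is degraded, as with severe dysarthria; raising `beta` favors longer transcriptions.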

Offline Usage

Both the model and processor can be saved locally and used completely offline:

# Save locally
processor.save_pretrained("./luganda-asr-offline")
model.save_pretrained("./luganda-asr-offline")

# Load offline
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
processor = Wav2Vec2Processor.from_pretrained("./luganda-asr-offline")
model = Wav2Vec2ForCTC.from_pretrained("./luganda-asr-offline")

Intended Use

  • Doctor-patient communication aids in Uganda
  • Speech therapy progress tracking
  • Classroom captioning for students with dysarthria
  • WhatsApp voice message transcription
  • Smart home voice control

Limitations

  • Trained on a specific set of Ugandan Luganda speakers; it may not generalize to all dialects or speaker profiles
  • Best performance on mild-to-moderate dysarthria
  • Severe dysarthria (intelligibility below 25%) may still yield high WER

Citation

If you use this model, please cite:

@misc{luganda-dysarthric-asr-2026,
  title={Luganda Dysarthric Speech Recognition},
  author={Nugwa Mark},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/nugwa-mark/wav2vec2-luganda-dysarthric-v7}
}
