# Sauti: Whisper Small fine-tuned on Luganda (WAXAL)
Sauti (Swahili/East African for "voice") is the first publicly released automatic speech recognition model fine-tuned on Google's WAXAL dataset for the Luganda language. It was built by a solo developer in Kampala, Uganda as part of Project Sauti, an initiative to build open voice AI infrastructure for Ugandan languages.
## Try the live demo
You can transcribe Luganda speech directly in your browser, with no Python or setup required:

**[Open the Sauti Luganda Demo](https://huggingface.co/ReyMugoo/whisper-small-luganda-sauti)**
## Model details
This model is a fine-tuned version of openai/whisper-small, trained on the lug_asr (Luganda Automatic Speech Recognition) subset of Google's WAXAL dataset. WAXAL, named after the Wolof word for "speak", is a large-scale open speech dataset for 21 Sub-Saharan African languages, co-developed over three years by Google and partner universities including Makerere University in Uganda.
| Property | Value |
|---|---|
| Base model | openai/whisper-small |
| Language | Luganda (lug) |
| Task | Automatic Speech Recognition |
| Dataset | google/WaxalNLP (lug_asr subset) |
| Training samples | 600 |
| Test samples | 100 |
| Training epochs | 3 |
| Word Error Rate (WER) | 49.4% |
| Hardware | NVIDIA T4 GPU (Google Colab free tier) |
| Training time | ~2.5 hours |
## Performance
The current WER of 49.4% was achieved on a first fine-tuning run using only 600 training samples. This is the baseline; subsequent versions trained on more data should improve it substantially. For context, the best published academic result on Luganda ASR (using ~3,900 hours of data and a Conformer architecture) achieves 21.95% WER. The gap between these numbers is largely explained by training data volume rather than model architecture.
Retraining with 3,000+ samples is expected to bring WER into the 32–40% range. Versions will be published as training progresses.
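WER is the word-level edit distance between the model's hypothesis and the reference transcript (substitutions + insertions + deletions), divided by the number of reference words. A minimal pure-Python sketch of the metric (the sample sentences are made up for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> WER 0.25
print(wer("omusajja agenda mu katale", "omusajja agende mu katale"))
```

In practice, evaluation libraries such as `jiwer` compute the same metric with additional text normalization (lowercasing, punctuation stripping), which can shift the reported number.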
## How to use
You can use this model directly with the Hugging Face `pipeline` API:
```python
from transformers import pipeline
import torch

# Load the model
transcriber = pipeline(
    task="automatic-speech-recognition",
    model="onealleyai/whisper-small-luganda-sauti",
    device=0 if torch.cuda.is_available() else -1,
    chunk_length_s=28,   # split long audio into 28-second chunks
    stride_length_s=5,   # overlap chunks so words are not cut mid-boundary
)

# Transcribe a local audio file
result = transcriber("your_luganda_audio.wav")
print(result["text"])
```
The model expects audio at a 16 kHz sample rate. If your audio is at a different rate, use librosa to resample it:
```python
import librosa
import soundfile as sf

# Load and resample to 16 kHz in one step
audio, sr = librosa.load("your_audio.mp3", sr=16000)
sf.write("resampled.wav", audio, 16000)

result = transcriber("resampled.wav")
```
## Training details
Training was performed using the Hugging Face `Seq2SeqTrainer` with the following configuration:
- Learning rate: 1e-5
- Batch size: 8 (effective batch size 16 with gradient accumulation steps of 2)
- Warmup steps: 50
- Mixed precision: fp16
- Evaluation strategy: every 200 steps
- Best checkpoint selection: lowest WER on test split
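The hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration roughly like the following. This is a sketch, not the exact training script: the output directory, save strategy, and generation flag are assumptions filled in to make the configuration consistent.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-luganda-sauti",  # assumed path
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    warmup_steps=50,
    num_train_epochs=3,
    fp16=True,                       # mixed precision on the T4
    eval_strategy="steps",
    eval_steps=200,                  # evaluate every 200 steps
    save_strategy="steps",           # must match eval strategy for best-model selection
    save_steps=200,
    load_best_model_at_end=True,     # keep the checkpoint with the lowest WER
    metric_for_best_model="wer",
    greater_is_better=False,
    predict_with_generate=True,      # decode with generate() so WER can be computed
)
```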
Data was filtered to remove clips longer than 28 seconds (safely under Whisper's 30-second input window) and transcripts shorter than two words. The remaining examples were preprocessed with `WhisperProcessor`: the feature extractor converts waveforms to log-mel spectrograms, and the tokenizer encodes Luganda transcripts to token IDs.
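The filtering step described above can be sketched as a simple predicate over dataset rows. The column names `audio` and `text` are assumptions here; adjust them to the actual WAXAL schema.

```python
MAX_SECONDS = 28
MIN_WORDS = 2

def keep_example(example: dict) -> bool:
    """Keep clips at most 28 s long whose transcripts have at least two words."""
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    return duration <= MAX_SECONDS and len(example["text"].split()) >= MIN_WORDS

# Example row: a 3-second clip at 16 kHz with a two-word transcript
row = {"audio": {"array": [0.0] * 48000, "sampling_rate": 16000},
       "text": "webale nnyo"}
print(keep_example(row))  # True
```

With the Hugging Face `datasets` library, this predicate would be applied via `dataset.filter(keep_example)`.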
## About the WAXAL dataset
WAXAL was officially released in February 2026 after three years of development. It contains over 11,000 hours of speech from nearly 2 million individual recordings across 21 Sub-Saharan African languages. For Uganda specifically, WAXAL includes Luganda, Acholi, Runyankole, Lusoga, Rukiga, and Masaaba, all co-collected with participation from Makerere University. The dataset is released under the CC-BY-4.0 license.
## Roadmap
This model is actively being improved. Planned updates include retraining on 3,000+ samples to push WER below 35%, fine-tunes for the remaining 5 Ugandan WAXAL languages (Acholi, Runyankole, Lusoga, Rukiga, Masaaba), domain-specific models for agricultural and medical vocabulary, and a public REST API for developer integration.
## About Project Sauti
Project Sauti is an open initiative to build voice AI infrastructure for Ugandan languages, starting with ASR, expanding to domain-specific models, and eventually a full speech pipeline combining transcription, translation, and synthesis. The project is built by a solo developer in Kampala, Godfrey Mugoya, using entirely free and open-source tools, demonstrating that meaningful African language AI can be developed without large institutional resources.
Collaboration, feedback and contributions are welcome. Reach out via Hugging Face or open an issue on this repository.
## Citation
If you use this model in research or applications, please cite:
```bibtex
@misc{onealleyai2026sauti,
  author    = {Mugoya, Godfrey},
  title     = {Sauti: Whisper Small fine-tuned on Luganda WAXAL},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/onealleyai/whisper-small-luganda-sauti}
}
```
## Acknowledgements
This model was trained using Google's WAXAL dataset, co-developed with Makerere University and partner institutions across Africa. The base model is OpenAI's Whisper, released under the MIT license. Training was performed on Google Colab's free T4 GPU tier.