# Sauti: Whisper Small fine-tuned on Luganda (WAXAL)
Sauti (Swahili/East African for "voice") is the first publicly released automatic speech recognition model fine-tuned on Google's WAXAL dataset for the Luganda language. It was built by a solo developer in Kampala, Uganda as part of Project Sauti, an initiative to build open voice AI infrastructure for Ugandan languages.
## Try the live demo
You can transcribe Luganda speech directly in your browser, with no Python or setup required:

**[Open the Sauti Luganda Demo](https://huggingface.co/ReyMugoo/whisper-small-luganda-sauti)**
## Model details
This model is a fine-tuned version of openai/whisper-small, trained on the lug_asr (Luganda Automatic Speech Recognition) subset of Google's WAXAL dataset. WAXAL, named after the Wolof word for "speak", is a large-scale open speech dataset for 21 Sub-Saharan African languages, co-developed over three years by Google and partner universities including Makerere University in Uganda.
| Property | Value |
|---|---|
| Base model | openai/whisper-small |
| Language | Luganda (lug) |
| Task | Automatic Speech Recognition |
| Dataset | google/WaxalNLP (lug_asr subset) |
| Training samples | 600 |
| Test samples | 100 |
| Training epochs | 3 |
| Word Error Rate (WER) | 49.4% |
| Hardware | NVIDIA T4 GPU (Google Colab free tier) |
| Training time | ~2.5 hours |
## Performance
The current WER of 49.4% was achieved on a first fine-tuning run using only 600 training samples. This is the baseline; subsequent versions trained on more data should improve it substantially. For context, the best published academic result on Luganda ASR (using ~3,900 hours of data and a Conformer architecture) achieves 21.95% WER. The gap between these numbers is largely explained by training data volume rather than model architecture.
Retraining with 3,000+ samples is expected to bring WER into the 32–40% range. Versions will be published as training progresses.
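WER is the word-level edit distance between the model's hypothesis and the reference transcript (substitutions + insertions + deletions), divided by the number of reference words. A minimal pure-Python sketch of the metric (the sample sentences are made up for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> WER 0.25
print(wer("omusajja agenda mu katale", "omusajja agende mu katale"))
```

In practice, evaluation libraries such as `jiwer` compute the same metric with additional text normalization (lowercasing, punctuation stripping), which can shift the reported number.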
## How to use
You can use this model directly with the Hugging Face `pipeline` API:
```python
from transformers import pipeline
import torch

# Load the model
transcriber = pipeline(
    task="automatic-speech-recognition",
    model="onealleyai/whisper-small-luganda-sauti",
    device=0 if torch.cuda.is_available() else -1,
    chunk_length_s=28,   # split long audio into 28-second chunks
    stride_length_s=5,   # overlap chunks so words are not cut mid-boundary
)

# Transcribe a local audio file
result = transcriber("your_luganda_audio.wav")
print(result["text"])
```
The model expects audio at a 16 kHz sample rate. If your audio is at a different rate, use librosa to resample it:
```python
import librosa
import soundfile as sf

# Load and resample to 16 kHz in one step
audio, sr = librosa.load("your_audio.mp3", sr=16000)
sf.write("resampled.wav", audio, 16000)

result = transcriber("resampled.wav")
```
## Training details
Training was performed using the Hugging Face `Seq2SeqTrainer` with the following configuration:
- Learning rate: 1e-5
- Batch size: 8 (effective batch size 16 with gradient accumulation steps of 2)
- Warmup steps: 50
- Mixed precision: fp16
- Evaluation strategy: every 200 steps
- Best checkpoint selection: lowest WER on test split
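The hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration roughly like the following. This is a sketch, not the exact training script: the output directory, save strategy, and generation flag are assumptions filled in to make the configuration consistent.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-luganda-sauti",  # assumed path
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    warmup_steps=50,
    num_train_epochs=3,
    fp16=True,                       # mixed precision on the T4
    eval_strategy="steps",
    eval_steps=200,                  # evaluate every 200 steps
    save_strategy="steps",           # must match eval strategy for best-model selection
    save_steps=200,
    load_best_model_at_end=True,     # keep the checkpoint with the lowest WER
    metric_for_best_model="wer",
    greater_is_better=False,
    predict_with_generate=True,      # decode with generate() so WER can be computed
)
```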
Data was filtered to remove clips longer than 28 seconds (safely under Whisper's 30-second input window) and transcripts shorter than two words. The remaining examples were preprocessed with `WhisperProcessor`: the feature extractor converts waveforms to log-mel spectrograms, and the tokenizer encodes Luganda transcripts to token IDs.
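The filtering step described above can be sketched as a simple predicate over dataset rows. The column names `audio` and `text` are assumptions here; adjust them to the actual WAXAL schema.

```python
MAX_SECONDS = 28
MIN_WORDS = 2

def keep_example(example: dict) -> bool:
    """Keep clips at most 28 s long whose transcripts have at least two words."""
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    return duration <= MAX_SECONDS and len(example["text"].split()) >= MIN_WORDS

# Example row: a 3-second clip at 16 kHz with a two-word transcript
row = {"audio": {"array": [0.0] * 48000, "sampling_rate": 16000},
       "text": "webale nnyo"}
print(keep_example(row))  # True
```

With the Hugging Face `datasets` library, this predicate would be applied via `dataset.filter(keep_example)`.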
## About the WAXAL dataset
WAXAL was officially released in February 2026 after three years of development. It contains over 11,000 hours of speech from nearly 2 million individual recordings across 21 Sub-Saharan African languages. For Uganda specifically, WAXAL includes Luganda, Acholi, Runyankole, Lusoga, Rukiga, and Masaaba, all co-collected with participation from Makerere University. The dataset is released under the CC-BY-4.0 license.
## Roadmap
This model is actively being improved. Planned updates include retraining on 3,000+ samples to push WER below 35%, fine-tunes for the remaining 5 Ugandan WAXAL languages (Acholi, Runyankole, Lusoga, Rukiga, Masaaba), domain-specific models for agricultural and medical vocabulary, and a public REST API for developer integration.
## About Project Sauti
Project Sauti is an open initiative to build voice AI infrastructure for Ugandan languages, starting with ASR, expanding to domain-specific models, and eventually a full speech pipeline combining transcription, translation, and synthesis. The project is built by a solo developer in Kampala, Godfrey Mugoya, using entirely free and open-source tools, demonstrating that meaningful African language AI can be developed without large institutional resources.
Collaboration, feedback and contributions are welcome. Reach out via Hugging Face or open an issue on this repository.
## Citation
If you use this model in research or applications, please cite:
```bibtex
@misc{onealleyai2026sauti,
  author    = {Mugoya, Godfrey},
  title     = {Sauti: Whisper Small fine-tuned on Luganda WAXAL},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/onealleyai/whisper-small-luganda-sauti}
}
```
## Acknowledgements
This model was trained using Google's WAXAL dataset, co-developed with Makerere University and partner institutions across Africa. The base model is OpenAI's Whisper, released under the MIT license. Training was performed on Google Colab's free T4 GPU tier.