# IkengaTTS: Igbo Text-to-Speech (F5-TTS Fine-Tune)
The first high-quality neural text-to-speech system for Igbo, fine-tuned from F5-TTS on 112,521 clips (255 hours) of African Voices data.
**Paper:** *Igbo Kwenu: IkengaTTS — a Tone-Preserving Igbo Text-to-Speech via Transfer Learning*
## Key Results
- Matches ground truth intelligibility: ΔWER = −1.3 pp (synthesized speech is slightly more ASR-transcribable than original recordings)
- Speaker similarity: SIM-o = 0.963 (high voice cloning fidelity)
- 20 speaker voices across 3 Igbo dialect regions
- Unadapted F5-TTS fails entirely on Igbo (WER > 100%), confirming fine-tuning is essential
## Model Details
| Property | Value |
|---|---|
| Architecture | F5-TTS (DiT, flow-matching, non-autoregressive) |
| Parameters | 335.8M |
| Base model | SWivid/F5-TTS |
| Training data | African Voices Igbo (112,521 clips, 255 hours, 447 speakers) |
| Training time | ~20 hours on 1× A100 SXM 80GB |
| Vocabulary | 2,604 character tokens (expanded from 2,546 to include Igbo combining marks) |
| Audio format | 24 kHz mono WAV (Vocos vocoder) |
## Igbo-Specific Design
- Tone-aware vocabulary: Combining diacritics (acute U+0301, grave U+0300, dot-below U+0323, macron U+0304) are explicit tokens
- Tone marks stripped at synthesis time: A/B testing showed that toned input produces worse output, since 78% of the training text is untoned. The dotted letters (ụ, ọ, ṅ) are preserved.
- Character-level tokenization: No G2P or phoneme pipeline needed — Igbo's regular orthography allows direct character input
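The synthesis-time stripping described above can be sketched with Unicode normalization: decompose to NFD, drop the three tone diacritics, and recompose. `strip_tones` is an illustrative helper, not the repository's actual preprocessing function.

```python
import unicodedata

# Tone diacritics removed before synthesis (acute, grave, macron).
# Dot-below (U+0323) and dot-above (U+0307) are kept: they are
# phonemic (ụ, ọ, ṅ), not tonal.
TONE_MARKS = {"\u0301", "\u0300", "\u0304"}

def strip_tones(text: str) -> str:
    """Illustrative sketch: drop tone marks, keep all other diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)
```

For example, `strip_tones("asụ́sụ̀")` removes the tone marks but leaves the subdotted vowels intact, yielding `"asụsụ"`.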
## Usage
```python
from f5_tts.infer.utils_infer import (
    infer_process,
    load_model,
    load_vocoder,
    preprocess_ref_audio_text,
)
from f5_tts.model import DiT

# Load vocoder and fine-tuned model
MODEL_CFG = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
vocoder = load_vocoder("vocos", device="cuda")
model = load_model(
    DiT,
    MODEL_CFG,
    "path/to/igbo_tts_f5_wrapped.pt",
    mel_spec_type="vocos",
    vocab_file="path/to/vocab.txt",
    device="cuda",
)

# Synthesize: the model conditions on a reference clip and its transcript
ref_audio, ref_text = preprocess_ref_audio_text("reference.wav", "Reference text here")
audio, sr, _ = infer_process(
    ref_audio=ref_audio,
    ref_text=ref_text,
    gen_text="Igbo bụ asụsụ anyị",
    model_obj=model,
    vocoder=vocoder,
    mel_spec_type="vocos",
    device="cuda",
)
```
## API
A FastAPI server is available at `api/server.py` with endpoints:
- `POST /synthesize` — text → WAV audio
- `POST /clone` — reference audio + text → WAV in cloned voice
- `GET /speakers` — list available voices
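A minimal client for the synthesize endpoint might look like the sketch below; the host/port (`localhost:8000`) and the JSON field name (`text`) are assumptions, not documented parameters of the server.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumed default uvicorn host/port

def synthesize(text: str, out_path: str = "out.wav") -> str:
    """POST text to /synthesize and save the returned WAV (field name assumed)."""
    req = urllib.request.Request(
        f"{API_URL}/synthesize",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path
```

With the server running, `synthesize("Igbo bụ asụsụ anyị")` would write the audio to `out.wav`.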
## Weights
Model weights are not hosted on this repository. See the GitHub repo for access instructions.
Checkpoint files:
- `igbo_tts_f5_wrapped.pt` (643 MB, fp16) — inference checkpoint
- `vocab.txt` (11 KB) — 2,604-token character vocabulary
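Assuming `vocab.txt` follows the one-token-per-line layout common to F5-TTS vocab files (an assumption worth checking against the actual file), the character-to-index mapping can be loaded with a sketch like:

```python
def load_vocab(path: str) -> dict[str, int]:
    """Map each character token to its row index (assumes one token per line)."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}
```

The fine-tuned model expects all 2,604 rows, including the Igbo combining marks appended during vocabulary expansion.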
## What Failed Before This Worked
We tried 4 from-scratch approaches before transfer learning succeeded:
| Attempt | Architecture | Data | Result |
|---|---|---|---|
| Kokoro (AR) | StyleTTS2-like | 934 clips (2.7h) | Exposure-bias collapse → silence |
| VITS (E2E) | VAE + GAN | 112K clips (255h) | Duration compressed to 40-55% |
| FastSpeech2 (25M) | NAR | 934 clips (2.7h) | Overfit, unintelligible |
| FastSpeech2 (1.1M) | NAR | 934 clips (2.7h) | Same — silence |
| F5-TTS fine-tune | Flow-matching | 112K clips (255h) | Success |
**Lesson:** Pre-training learns how to speak; fine-tuning learns how to speak Igbo.
## Evaluation
Evaluated on 987 held-out clips from the African Voices `dev_test` split:
| System | MMS WER | ΔWER | SIM-o | UTMOS |
|---|---|---|---|---|
| Ground truth | 55.6% | — | — | 1.28 |
| F5-TTS fine-tuned | 50.3% | −5.3 pp | 0.963 | 1.68 |
| F5-TTS base (no fine-tuning) | 100.5% | +44.9 pp | 0.958 | 1.55 |
Re-evaluating with a fine-tuned Igbo ASR model (28% WER) confirms the result: ΔWER = −1.3 pp.
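As a sanity check on the table's sign convention, ΔWER is simply the system's WER minus the ground-truth WER, in percentage points (negative means the synthesized speech transcribes better than the recordings):

```python
def delta_wer(system_wer: float, gt_wer: float) -> float:
    """Delta-WER in percentage points; negative favors the TTS system."""
    return round(system_wer - gt_wer, 1)

# MMS ASR numbers from the table above
fine_tuned = delta_wer(50.3, 55.6)   # -5.3 pp
base_model = delta_wer(100.5, 55.6)  # +44.9 pp
```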
## License
This model is released under CC-BY-NC-SA 4.0.
- BY: You must give appropriate credit
- NC: Non-commercial use only (upstream F5-TTS weights are CC-BY-NC)
- SA: Derivatives must use the same license
## Citation
```bibtex
@misc{chimezie2026ikengaTTS,
  title={Igbo Kwenu: IkengaTTS -- a Tone-Preserving Igbo Text-to-Speech via Transfer Learning},
  author={Chimezie, Emmanuel},
  year={2026},
  url={https://github.com/chimezie90/igbotts}
}
```
## Author
Emmanuel Chimezie — Mexkoy Labs