IkengaTTS: Igbo Text-to-Speech (F5-TTS Fine-Tune)

The first high-quality neural text-to-speech system for Igbo, fine-tuned from F5-TTS on 112,521 clips (255 hours) of African Voices data.

Paper: Igbo Kwenu: IkengaTTS — a Tone-Preserving Igbo Text-to-Speech via Transfer Learning

Key Results

  • Matches ground truth intelligibility: ΔWER = −1.3 pp (synthesized speech is slightly more ASR-transcribable than original recordings)
  • Speaker similarity: SIM-o = 0.963 (high voice cloning fidelity)
  • 20 speaker voices across 3 Igbo dialect regions
  • Unadapted F5-TTS fails entirely on Igbo (WER > 100%), confirming fine-tuning is essential

Model Details

| Property | Value |
|---|---|
| Architecture | F5-TTS (DiT, flow matching, non-autoregressive) |
| Parameters | 335.8M |
| Base model | SWivid/F5-TTS |
| Training data | African Voices Igbo (112,521 clips, 255 hours, 447 speakers) |
| Training time | ~20 hours on 1× A100 SXM 80 GB |
| Vocabulary | 2,604 character tokens (expanded from 2,546 to include Igbo combining marks) |
| Audio format | 24 kHz mono WAV (Vocos vocoder) |

Igbo-Specific Design

  • Tone-aware vocabulary: Combining diacritics (acute U+0301, grave U+0300, dot-below U+0323, macron U+0304) are explicit tokens
  • Tone marks stripped at synthesis time: A/B testing showed that toned input produces worse output (the model was trained on text that is 78% untoned). Dotted letters (ụ, ọ, ṅ) are preserved.
  • Character-level tokenization: No G2P or phoneme pipeline needed — Igbo's regular orthography allows direct character input
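The tone-stripping step above can be sketched with standard Unicode normalization: decompose to NFD, drop only the three tone diacritics, and recompose. This is an illustrative sketch, not the repository's actual preprocessing function.

```python
import unicodedata

# Tone diacritics removed before synthesis (acute, grave, macron).
# All other combining marks survive, so the phonemic dots in
# u/o (dot-below U+0323) and n (dot-above U+0307) are preserved.
TONE_MARKS = {"\u0301", "\u0300", "\u0304"}

def strip_tone_marks(text: str) -> str:
    """Remove Igbo tone marks but keep dotted letters intact."""
    decomposed = unicodedata.normalize("NFD", text)   # split base + marks
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)         # recompose the rest

# e.g. strip_tone_marks("Igbo bụ́ asụ̀sụ̀") drops the tones but keeps every ụ
```

Because NFC recomposes the surviving marks, a toned ụ̀ round-trips back to the single code point ụ rather than a bare u.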

Usage

```python
import soundfile as sf  # assumed available for saving WAV output

from f5_tts.infer.utils_infer import (
    infer_process,
    load_model,
    load_vocoder,
    preprocess_ref_audio_text,
)
from f5_tts.model import DiT

# Load the fine-tuned model and the Vocos vocoder
MODEL_CFG = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
vocoder = load_vocoder("vocos", device="cuda")
model = load_model(DiT, MODEL_CFG, "path/to/igbo_tts_f5_wrapped.pt",
                   mel_spec_type="vocos", vocab_file="path/to/vocab.txt", device="cuda")

# Synthesize: F5-TTS clones the voice of the reference clip
ref_audio, ref_text = preprocess_ref_audio_text("reference.wav", "Reference text here")
audio, sr, _ = infer_process(
    ref_audio=ref_audio, ref_text=ref_text,
    gen_text="Igbo bụ asụsụ anyị",
    model_obj=model, vocoder=vocoder,
    mel_spec_type="vocos", device="cuda",
)
sf.write("output.wav", audio, sr)  # 24 kHz mono WAV
```

API

A FastAPI server is available at api/server.py with endpoints:

  • POST /synthesize — text → WAV audio
  • POST /clone — reference audio + text → WAV in cloned voice
  • GET /speakers — list available voices
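A minimal stdlib-only client sketch for the `/synthesize` endpoint is below. The JSON field names (`text`, `speaker`) and the default host/port are assumptions; check `api/server.py` for the actual request schema.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumed default FastAPI host/port

def build_synthesize_request(text: str, speaker: str = "default") -> urllib.request.Request:
    """Build a POST /synthesize request. Field names are assumptions,
    not the documented schema -- see api/server.py."""
    body = json.dumps({"text": text, "speaker": speaker}).encode("utf-8")
    return urllib.request.Request(
        f"{API_URL}/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def synthesize_to_file(text: str, out_path: str = "out.wav") -> None:
    """Send the request and save the returned WAV bytes to disk."""
    with urllib.request.urlopen(build_synthesize_request(text)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())

if __name__ == "__main__":
    synthesize_to_file("Igbo bụ asụsụ anyị")
```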

Weights

Model weights are not hosted on this repository. See the GitHub repo for access instructions.

Checkpoint files:

  • igbo_tts_f5_wrapped.pt (643 MB, fp16) — inference checkpoint
  • vocab.txt (11 KB) — 2,604-token character vocabulary

What Failed Before This Worked

We tried 4 from-scratch approaches before transfer learning succeeded:

| Attempt | Architecture | Data | Result |
|---|---|---|---|
| Kokoro (AR) | StyleTTS2-like | 934 clips (2.7 h) | Exposure-bias collapse → silence |
| VITS (E2E) | VAE + GAN | 112K clips (255 h) | Durations compressed to 40–55% |
| FastSpeech2 (25M) | NAR | 934 clips (2.7 h) | Overfit, unintelligible |
| FastSpeech2 (1.1M) | NAR | 934 clips (2.7 h) | Same result: silence |
| F5-TTS fine-tune | Flow matching | 112K clips (255 h) | Success |

Lesson: Pre-training learns how to speak; fine-tuning learns how to speak Igbo.

Evaluation

Evaluated on 987 held-out clips from African Voices dev_test:

| System | MMS WER | ΔWER | SIM-o | UTMOS |
|---|---|---|---|---|
| Ground truth | 55.6% | n/a | n/a | 1.28 |
| F5-TTS fine-tuned | 50.3% | −5.3 pp | 0.963 | 1.68 |
| F5-TTS base (no fine-tuning) | 100.5% | +44.9 pp | 0.958 | 1.55 |

Re-evaluation with fine-tuned ASR (28% WER) confirms: ΔWER = −1.3 pp.
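The WER column is standard word-level edit distance between ASR transcripts and reference text, and ΔWER is the difference against the ground-truth row in percentage points. A compact sketch of the metric (not the actual evaluation script, which used MMS transcripts):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev_row = list(range(len(hyp) + 1))  # distance from empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev_row[j] + 1,          # deletion
                           cur[j - 1] + 1,           # insertion
                           prev_row[j - 1] + (r != h)))  # substitution
        prev_row = cur
    return prev_row[-1] / max(len(ref), 1)

def delta_wer_pp(system_wer: float, ground_truth_wer: float) -> float:
    """ΔWER in percentage points; negative means more transcribable than GT."""
    return (system_wer - ground_truth_wer) * 100
```

With this convention, `delta_wer_pp(0.503, 0.556)` reproduces the −5.3 pp entry in the table above.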

License

This model is released under CC-BY-NC-SA 4.0.

  • BY: You must give appropriate credit
  • NC: Non-commercial use only (upstream F5-TTS weights are CC-BY-NC)
  • SA: Derivatives must use the same license

Citation

@misc{chimezie2026ikengaTTS,
  title={Igbo Kwenu: IkengaTTS -- a Tone-Preserving Igbo Text-to-Speech via Transfer Learning},
  author={Chimezie, Emmanuel},
  year={2026},
  url={https://github.com/chimezie90/igbotts}
}

Author

Emmanuel Chimezie, Mexkoy Labs
