IkengaTTS: Igbo Text-to-Speech (F5-TTS Fine-Tune)

The first high-quality neural text-to-speech system for Igbo, fine-tuned from F5-TTS on 112,521 clips (255 hours) of African Voices data.

Paper: Igbo Kwenu: IkengaTTS — a Tone-Preserving Igbo Text-to-Speech via Transfer Learning

Key Results

  • Matches ground truth intelligibility: ΔWER = −1.3 pp (synthesized speech is slightly more ASR-transcribable than original recordings)
  • Speaker similarity: SIM-o = 0.963 (high voice cloning fidelity)
  • 20 speaker voices across 3 Igbo dialect regions
  • Unadapted F5-TTS fails entirely on Igbo (WER > 100%), confirming fine-tuning is essential

Model Details

| Property | Value |
|---|---|
| Architecture | F5-TTS (DiT, flow matching, non-autoregressive) |
| Parameters | 335.8M |
| Base model | SWivid/F5-TTS |
| Training data | African Voices Igbo (112,521 clips, 255 hours, 447 speakers) |
| Training time | ~20 hours on 1× A100 SXM 80 GB |
| Vocabulary | 2,604 character tokens (expanded from 2,546 to include Igbo combining marks) |
| Audio format | 24 kHz mono WAV (Vocos vocoder) |

Igbo-Specific Design

  • Tone-aware vocabulary: Combining diacritics (acute U+0301, grave U+0300, dot-below U+0323, macron U+0304) are explicit tokens
  • Tone marks stripped at synthesis time: A/B testing showed that toned input produces worse output (the model was trained on text that is 78% untoned). Dotted letters (ụ, ọ, ṅ) are preserved.
  • Character-level tokenization: No G2P or phoneme pipeline needed — Igbo's regular orthography allows direct character input
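The tone-stripping step above can be sketched with standard Unicode normalization: decompose to NFD, drop only the three tone diacritics, and recompose. This is an illustrative sketch, not the repository's actual preprocessing function.

```python
import unicodedata

# Tone diacritics removed before synthesis (acute, grave, macron).
# All other combining marks survive, so the phonemic dots in
# u/o (dot-below U+0323) and n (dot-above U+0307) are preserved.
TONE_MARKS = {"\u0301", "\u0300", "\u0304"}

def strip_tone_marks(text: str) -> str:
    """Remove Igbo tone marks but keep dotted letters intact."""
    decomposed = unicodedata.normalize("NFD", text)   # split base + marks
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)         # recompose the rest

# e.g. strip_tone_marks("Igbo bụ́ asụ̀sụ̀") drops the tones but keeps every ụ
```

Because NFC recomposes the surviving marks, a toned ụ̀ round-trips back to the single code point ụ rather than a bare u.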

Usage

```python
import soundfile as sf  # assumed available for saving WAV output

from f5_tts.infer.utils_infer import (
    infer_process,
    load_model,
    load_vocoder,
    preprocess_ref_audio_text,
)
from f5_tts.model import DiT

# Load the fine-tuned model and the Vocos vocoder
MODEL_CFG = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
vocoder = load_vocoder("vocos", device="cuda")
model = load_model(DiT, MODEL_CFG, "path/to/igbo_tts_f5_wrapped.pt",
                   mel_spec_type="vocos", vocab_file="path/to/vocab.txt", device="cuda")

# Synthesize: F5-TTS clones the voice of the reference clip
ref_audio, ref_text = preprocess_ref_audio_text("reference.wav", "Reference text here")
audio, sr, _ = infer_process(
    ref_audio=ref_audio, ref_text=ref_text,
    gen_text="Igbo bụ asụsụ anyị",
    model_obj=model, vocoder=vocoder,
    mel_spec_type="vocos", device="cuda",
)
sf.write("output.wav", audio, sr)  # 24 kHz mono WAV
```

API

A FastAPI server is available at api/server.py with endpoints:

  • POST /synthesize — text → WAV audio
  • POST /clone — reference audio + text → WAV in cloned voice
  • GET /speakers — list available voices
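A minimal stdlib-only client sketch for the `/synthesize` endpoint is below. The JSON field names (`text`, `speaker`) and the default host/port are assumptions; check `api/server.py` for the actual request schema.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumed default FastAPI host/port

def build_synthesize_request(text: str, speaker: str = "default") -> urllib.request.Request:
    """Build a POST /synthesize request. Field names are assumptions,
    not the documented schema -- see api/server.py."""
    body = json.dumps({"text": text, "speaker": speaker}).encode("utf-8")
    return urllib.request.Request(
        f"{API_URL}/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def synthesize_to_file(text: str, out_path: str = "out.wav") -> None:
    """Send the request and save the returned WAV bytes to disk."""
    with urllib.request.urlopen(build_synthesize_request(text)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())

if __name__ == "__main__":
    synthesize_to_file("Igbo bụ asụsụ anyị")
```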

Weights

Model weights are not hosted on this repository. See the GitHub repo for access instructions.

Checkpoint files:

  • igbo_tts_f5_wrapped.pt (643 MB, fp16) — inference checkpoint
  • vocab.txt (11 KB) — 2,604-token character vocabulary

What Failed Before This Worked

We tried 4 from-scratch approaches before transfer learning succeeded:

| Attempt | Architecture | Data | Result |
|---|---|---|---|
| Kokoro (AR) | StyleTTS2-like | 934 clips (2.7 h) | Exposure-bias collapse → silence |
| VITS (E2E) | VAE + GAN | 112K clips (255 h) | Durations compressed to 40–55% |
| FastSpeech2 (25M) | NAR | 934 clips (2.7 h) | Overfit, unintelligible |
| FastSpeech2 (1.1M) | NAR | 934 clips (2.7 h) | Same result: silence |
| F5-TTS fine-tune | Flow matching | 112K clips (255 h) | Success |

Lesson: Pre-training learns how to speak; fine-tuning learns how to speak Igbo.

Evaluation

Evaluated on 987 held-out clips from African Voices dev_test:

| System | MMS WER | ΔWER | SIM-o | UTMOS |
|---|---|---|---|---|
| Ground truth | 55.6% | n/a | n/a | 1.28 |
| F5-TTS fine-tuned | 50.3% | −5.3 pp | 0.963 | 1.68 |
| F5-TTS base (no fine-tuning) | 100.5% | +44.9 pp | 0.958 | 1.55 |

Re-evaluation with fine-tuned ASR (28% WER) confirms: ΔWER = −1.3 pp.
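The WER column is standard word-level edit distance between ASR transcripts and reference text, and ΔWER is the difference against the ground-truth row in percentage points. A compact sketch of the metric (not the actual evaluation script, which used MMS transcripts):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev_row = list(range(len(hyp) + 1))  # distance from empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev_row[j] + 1,          # deletion
                           cur[j - 1] + 1,           # insertion
                           prev_row[j - 1] + (r != h)))  # substitution
        prev_row = cur
    return prev_row[-1] / max(len(ref), 1)

def delta_wer_pp(system_wer: float, ground_truth_wer: float) -> float:
    """ΔWER in percentage points; negative means more transcribable than GT."""
    return (system_wer - ground_truth_wer) * 100
```

With this convention, `delta_wer_pp(0.503, 0.556)` reproduces the −5.3 pp entry in the table above.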

License

This model is released under CC-BY-NC-SA 4.0.

  • BY: You must give appropriate credit
  • NC: Non-commercial use only (upstream F5-TTS weights are CC-BY-NC)
  • SA: Derivatives must use the same license

Citation

@misc{chimezie2026ikengaTTS,
  title={Igbo Kwenu: IkengaTTS -- a Tone-Preserving Igbo Text-to-Speech via Transfer Learning},
  author={Chimezie, Emmanuel},
  year={2026},
  url={https://github.com/chimezie90/igbotts}
}

Author

Emmanuel Chimezie, Mexkoy Labs
