# XTTS-v2 Fine-tuned for Wolof 🇸🇳
A fine-tuned version of Coqui XTTS-v2 with improved Wolof language support. The model brings high-quality text-to-speech synthesis to Wolof, one of the most widely spoken languages of Senegal and West Africa.
## Model Description
XTTS-v2 is a multilingual text-to-speech model that supports voice cloning. This fine-tuned version enhances performance on Wolof while retaining capabilities in French and English.
| Property | Value |
|---|---|
| Base Model | coqui/XTTS-v2 |
| Architecture | GPT-2 based encoder + HiFi-GAN decoder |
| Parameters | ~467M |
| Audio Sample Rate | 24,000 Hz |
| Model Size | 1.7 GB |
## Training Details

### Datasets
The model was fine-tuned on a curated multilingual dataset of 10,861 audio samples:
| Dataset | Language | Samples | Description |
|---|---|---|---|
| WaxalNLP/wolof_speech | Wolof | 1,042 | Wolof speech corpus |
| GalsenAI/WaxalNNLP | Wolof | 1,042 | Enhanced Wolof speech |
| google/fleurs (wo) | Wolof | 2,819 | FLEURS Wolof split |
| google/fleurs (fr) | French | 2,000 | FLEURS French (subset) |
| google/fleurs (en) | English | 2,000 | FLEURS English (subset) |
| keithito/lj_speech | English | 3,000 | LJSpeech (subset) |
French and English data was included to prevent catastrophic forgetting of multilingual capabilities during fine-tuning.
### Fine-tuning Technique
- Method: GPT encoder fine-tuning (Stage 1) using `GPTTrainer` from Coqui TTS 0.22.0
- All 898 parameter tensors were trained (full fine-tuning, no freezing)
- Optimizer: AdamW with learning rate 5e-6
- Batch size: 4 (effective batch size 16 with gradient accumulation of 4)
- Epochs: 3 (7,740 total steps)
- Precision: FP32 (mixed precision disabled — FP16 caused NaN losses on A100)
- Hardware: NVIDIA A100-SXM4-40GB
- Training loss: 0.924 → 0.767
- Best eval loss: 3.049
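As a sanity check, the hyperparameters above are mutually consistent. A back-of-envelope sketch (the train/eval split is not stated in this card; a ~95% train split is an assumption that makes the step count line up):

```python
import math

# Hyperparameters from the list above
total_samples = 10_861   # curated multilingual dataset size
batch_size = 4
grad_accum = 4
epochs = 3

# Assumption: roughly 95% of samples go to training (split not stated)
train_samples = round(total_samples * 0.95)   # 10_318

effective_batch = batch_size * grad_accum
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

print(effective_batch)  # 16
print(total_steps)      # 7740, matching the reported step count
```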
### Key Training Decisions
- FP32 over FP16: Mixed precision training produced NaN losses with TTS 0.22.0 on A100 GPUs. FP32 training was stable and produced valid gradients throughout.
- Multilingual data mix: Including French, English, and LJSpeech alongside Wolof prevented the model from losing its multilingual voice cloning ability.
- Low learning rate (5e-6): A conservative learning rate preserved the pre-trained model's strengths while allowing adaptation to Wolof phonology.
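The FP16 failure mode above can be caught early with a simple guard on the loss. A minimal sketch (not taken from the actual training code, which is not shown here):

```python
import math

def check_loss(loss: float, step: int) -> None:
    """Abort as soon as a loss turns NaN/Inf (the FP16 failure mode above)."""
    if not math.isfinite(loss):
        raise RuntimeError(
            f"non-finite loss at step {step}; "
            "disable mixed precision and retrain in FP32"
        )

check_loss(0.924, 0)  # the run's initial training loss: finite, passes
try:
    check_loss(float("nan"), 1)
except RuntimeError as err:
    print(f"caught: {err}")
```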
## Evaluation Results
Comparison against the original XTTS-v2 base model on 9 test sentences (5 Wolof, 2 French, 2 English):
### Overall
| Metric | Fine-tuned | Original | Δ |
|---|---|---|---|
| Speaker Similarity (↑) | 0.8273 | 0.8175 | +0.0099 |
| Wins (speaker match) | 7/9 | 2/9 | — |
### Per-Language Speaker Similarity
| Language | Fine-tuned | Original | Δ | FT Win Rate |
|---|---|---|---|---|
| Wolof 🟢 | 0.8396 | 0.8193 | +0.0203 | 5/5 (100%) |
| French 🟢 | 0.8266 | 0.8187 | +0.0080 | 1/2 |
| English 🔴 | 0.7974 | 0.8116 | -0.0142 | 1/2 |
**Key finding:** the fine-tuned model improves Wolof speaker similarity by +0.02 absolute (winning 5/5 Wolof test sentences) while remaining competitive in French and English.
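In the usual setup, speaker similarity is the cosine similarity between speaker embeddings extracted from the reference audio and the synthesized audio. The embedding model used for this evaluation is not stated here; the metric itself reduces to a cosine, sketched in plain Python:

```python
import math

def speaker_similarity(ref_emb, syn_emb):
    """Cosine similarity between two speaker embedding vectors."""
    dot = sum(r * s for r, s in zip(ref_emb, syn_emb))
    ref_norm = math.sqrt(sum(r * r for r in ref_emb))
    syn_norm = math.sqrt(sum(s * s for s in syn_emb))
    return dot / (ref_norm * syn_norm)

# A synthesized clip whose embedding exactly matches the reference scores 1.0
emb = [0.2, -0.5, 0.8]
print(speaker_similarity(emb, emb))  # 1.0 (up to float rounding)
```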
## Usage

### With Coqui TTS
```python
from TTS.api import TTS
import torch

# Load the base XTTS-v2 model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=False)

# Override with fine-tuned weights
model_path = "path/to/model.pth"
checkpoint = torch.load(model_path, map_location="cpu", weights_only=False)
tts.synthesizer.tts_model.load_state_dict(checkpoint["model"], strict=True)

# Generate Wolof speech
tts.tts_to_file(
    text="Jàmm nga fanaan. Nanga def?",
    speaker_wav="reference_audio.wav",
    language="fr",  # Wolof maps to French in XTTS-v2
    file_path="output.wav",
)
```
### Direct with Xtts
```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load config and initialize the model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path="model.pth", vocab_path="vocab.json")
model.eval()

outputs = model.synthesize(
    text="Jàmm nga fanaan.",
    config=config,
    speaker_wav="reference.wav",
    language="fr",
)
```
## Files

| File | Size | Description |
|---|---|---|
| `model.pth` | 1.7 GB | Fine-tuned model weights (wrapped as `{"model": state_dict}`) |
| `config.json` | 4.3 KB | Model configuration |
| `vocab.json` | 353 KB | Tokenizer vocabulary |
| `speakers_xtts.pth` | 7.4 MB | Speaker embeddings |
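Because `model.pth` wraps its weights as `{"model": state_dict}`, loaders must unwrap one level before calling `load_state_dict` (as the usage examples above do). A stand-in dict mirroring the on-disk layout, with hypothetical parameter names:

```python
# Stand-in checkpoint mirroring the {"model": state_dict} layout; the
# real file is loaded with torch.load, and these key names are invented.
checkpoint = {"model": {"gpt.wte.weight": None, "gpt.ln_f.weight": None}}

state_dict = checkpoint["model"]  # unwrap before load_state_dict(...)
gpt_keys = [k for k in state_dict if k.startswith("gpt.")]
print(len(gpt_keys))  # 2
```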
## Limitations

- Wolof is mapped to French (`language="fr"`) since XTTS-v2 has no native Wolof language token. This works well because Wolof and French share phonological similarities in the Senegalese context.
- English speaker similarity shows a slight regression (-0.0142) compared to the base model.
- The model was fine-tuned for only 3 epochs; longer training or larger Wolof datasets could yield further improvements.
- Voice quality depends on the reference audio provided for cloning.
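Because XTTS-v2 has no native Wolof token, callers must remember to pass `language="fr"` for Wolof text. A tiny hypothetical helper that centralizes the mapping:

```python
# Map input language codes to the XTTS-v2 language tokens this model
# expects; "wo" -> "fr" is the workaround described above.
XTTS_LANG = {"wo": "fr"}

def xtts_language(code: str) -> str:
    """Return the XTTS-v2 language token for an ISO 639-1 code."""
    return XTTS_LANG.get(code, code)

print(xtts_language("wo"))  # fr
print(xtts_language("en"))  # en
```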
## Citation
If you use this model, please cite the original XTTS-v2 and the datasets:
```bibtex
@misc{xtts-v2-wolof,
  title={XTTS-v2 Fine-tuned for Wolof},
  author={Muhamad Ul},
  year={2026},
  url={https://huggingface.co/muhamadul/xtts-v2-wolof}
}

@misc{casanova2024xtts,
  title={XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model},
  author={Casanova, Edresson and others},
  year={2024},
  publisher={Coqui AI}
}

@misc{waxalnlp-wolof,
  title={Wolof Speech Dataset},
  author={WaxalNLP},
  url={https://huggingface.co/datasets/WaxalNLP/wolof_speech}
}

@inproceedings{conneau2023fleurs,
  title={FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author={Conneau, Alexis and others},
  booktitle={IEEE SLT},
  year={2023}
}
```
## Acknowledgments
- Coqui AI for the XTTS-v2 base model and TTS framework
- WaxalNLP for the Wolof speech dataset
- GalsenAI for the corrected Wolof speech dataset derived from the WaxalNLP corpus
- Google FLEURS for multilingual speech data
- Daande, the TTS studio powered by this model, built for Wolof speakers
Daande means "voice" in Pulaar. This project aims to bring modern AI voice technology to West African languages.