# XTTS-v2 Fine-tuned for Wolof 🇸🇳
A fine-tuned version of Coqui XTTS-v2 with improved Wolof language support. The model brings high-quality text-to-speech synthesis to Wolof, one of the most widely spoken languages of Senegal and West Africa.
## Model Description
XTTS-v2 is a multilingual text-to-speech model that supports voice cloning. This fine-tuned version enhances performance on Wolof while retaining capabilities in French and English.
| Property | Value |
|---|---|
| Base Model | coqui/XTTS-v2 |
| Architecture | GPT-2 based encoder + HiFi-GAN decoder |
| Parameters | ~467M |
| Audio Sample Rate | 24,000 Hz |
| Model Size | 1.7 GB |
## Training Details

### Datasets
The model was fine-tuned on a curated multilingual dataset of 10,861 audio samples:
| Dataset | Language | Samples | Description |
|---|---|---|---|
| WaxalNLP/wolof_speech | Wolof | 1,042 | Wolof speech corpus |
| GalsenAI/WaxalNNLP | Wolof | 1,042 | Enhanced Wolof speech |
| google/fleurs (wo) | Wolof | 2,819 | FLEURS Wolof split |
| google/fleurs (fr) | French | 2,000 | FLEURS French (subset) |
| google/fleurs (en) | English | 2,000 | FLEURS English (subset) |
| keithito/lj_speech | English | 3,000 | LJSpeech (subset) |
French and English data was included to prevent catastrophic forgetting of multilingual capabilities during fine-tuning.
### Fine-tuning Technique
- Method: GPT encoder fine-tuning (Stage 1) using `GPTTrainer` from Coqui TTS 0.22.0
- All 898 parameter tensors were trained (full fine-tuning, no freezing)
- Optimizer: AdamW with learning rate 5e-6
- Batch size: 4 (effective batch size 16 with gradient accumulation of 4)
- Epochs: 3 (7,740 total steps)
- Precision: FP32 (mixed precision disabled — FP16 caused NaN losses on A100)
- Hardware: NVIDIA A100-SXM4-40GB
- Training loss: 0.924 → 0.767
- Best eval loss: 3.049
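As a sanity check, the hyperparameters above are mutually consistent. A back-of-envelope sketch (the train/eval split is not stated in this card; a ~95% train split is an assumption that makes the step count line up):

```python
import math

# Hyperparameters from the list above
total_samples = 10_861   # curated multilingual dataset size
batch_size = 4
grad_accum = 4
epochs = 3

# Assumption: roughly 95% of samples go to training (split not stated)
train_samples = round(total_samples * 0.95)   # 10_318

effective_batch = batch_size * grad_accum
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

print(effective_batch)  # 16
print(total_steps)      # 7740, matching the reported step count
```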
### Key Training Decisions
- FP32 over FP16: Mixed precision training produced NaN losses with TTS 0.22.0 on A100 GPUs. FP32 training was stable and produced valid gradients throughout.
- Multilingual data mix: Including French, English, and LJSpeech alongside Wolof prevented the model from losing its multilingual voice cloning ability.
- Low learning rate (5e-6): A conservative learning rate preserved the pre-trained model's strengths while allowing adaptation to Wolof phonology.
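The FP16 failure mode above can be caught early with a simple guard on the loss. A minimal sketch (not taken from the actual training code, which is not shown here):

```python
import math

def check_loss(loss: float, step: int) -> None:
    """Abort as soon as a loss turns NaN/Inf (the FP16 failure mode above)."""
    if not math.isfinite(loss):
        raise RuntimeError(
            f"non-finite loss at step {step}; "
            "disable mixed precision and retrain in FP32"
        )

check_loss(0.924, 0)  # the run's initial training loss: finite, passes
try:
    check_loss(float("nan"), 1)
except RuntimeError as err:
    print(f"caught: {err}")
```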
## Evaluation Results
Comparison against the original XTTS-v2 base model on 9 test sentences (5 Wolof, 2 French, 2 English):
### Overall
| Metric | Fine-tuned | Original | Δ |
|---|---|---|---|
| Speaker Similarity (↑) | 0.8273 | 0.8175 | +0.0099 |
| Wins (speaker match) | 7/9 | 2/9 | — |
### Per-Language Speaker Similarity
| Language | Fine-tuned | Original | Δ | FT Win Rate |
|---|---|---|---|---|
| Wolof 🟢 | 0.8396 | 0.8193 | +0.0203 | 5/5 (100%) |
| French 🟢 | 0.8266 | 0.8187 | +0.0080 | 1/2 |
| English 🔴 | 0.7974 | 0.8116 | -0.0142 | 1/2 |
**Key finding:** the fine-tuned model improves Wolof speaker similarity by +0.02 absolute (winning 5/5 Wolof test sentences) while remaining competitive in French and English.
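In the usual setup, speaker similarity is the cosine similarity between speaker embeddings extracted from the reference audio and the synthesized audio. The embedding model used for this evaluation is not stated here; the metric itself reduces to a cosine, sketched in plain Python:

```python
import math

def speaker_similarity(ref_emb, syn_emb):
    """Cosine similarity between two speaker embedding vectors."""
    dot = sum(r * s for r, s in zip(ref_emb, syn_emb))
    ref_norm = math.sqrt(sum(r * r for r in ref_emb))
    syn_norm = math.sqrt(sum(s * s for s in syn_emb))
    return dot / (ref_norm * syn_norm)

# A synthesized clip whose embedding exactly matches the reference scores 1.0
emb = [0.2, -0.5, 0.8]
print(speaker_similarity(emb, emb))  # 1.0 (up to float rounding)
```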
## Usage

### With Coqui TTS
```python
from TTS.api import TTS
import torch

# Load the base XTTS-v2 model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=False)

# Override with fine-tuned weights
model_path = "path/to/model.pth"
checkpoint = torch.load(model_path, map_location="cpu", weights_only=False)
tts.synthesizer.tts_model.load_state_dict(checkpoint["model"], strict=True)

# Generate Wolof speech
tts.tts_to_file(
    text="Jàmm nga fanaan. Nanga def?",
    speaker_wav="reference_audio.wav",
    language="fr",  # Wolof maps to French in XTTS-v2
    file_path="output.wav",
)
```
### Direct with Xtts
```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load config and initialize the model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path="model.pth", vocab_path="vocab.json")
model.eval()

outputs = model.synthesize(
    text="Jàmm nga fanaan.",
    config=config,
    speaker_wav="reference.wav",
    language="fr",
)
```
## Files

| File | Size | Description |
|---|---|---|
| `model.pth` | 1.7 GB | Fine-tuned model weights (wrapped as `{"model": state_dict}`) |
| `config.json` | 4.3 KB | Model configuration |
| `vocab.json` | 353 KB | Tokenizer vocabulary |
| `speakers_xtts.pth` | 7.4 MB | Speaker embeddings |
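Because `model.pth` wraps its weights as `{"model": state_dict}`, loaders must unwrap one level before calling `load_state_dict` (as the usage examples above do). A stand-in dict mirroring the on-disk layout, with hypothetical parameter names:

```python
# Stand-in checkpoint mirroring the {"model": state_dict} layout; the
# real file is loaded with torch.load, and these key names are invented.
checkpoint = {"model": {"gpt.wte.weight": None, "gpt.ln_f.weight": None}}

state_dict = checkpoint["model"]  # unwrap before load_state_dict(...)
gpt_keys = [k for k in state_dict if k.startswith("gpt.")]
print(len(gpt_keys))  # 2
```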
## Limitations

- Wolof is mapped to French (`language="fr"`) since XTTS-v2 has no native Wolof language token. This works well because Wolof and French share phonological similarities in the Senegalese context.
- English speaker similarity shows a slight regression (-0.0142) compared to the base model.
- The model was fine-tuned for only 3 epochs; longer training or larger Wolof datasets could yield further improvements.
- Voice quality depends on the reference audio provided for cloning.
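Because XTTS-v2 has no native Wolof token, callers must remember to pass `language="fr"` for Wolof text. A tiny hypothetical helper that centralizes the mapping:

```python
# Map input language codes to the XTTS-v2 language tokens this model
# expects; "wo" -> "fr" is the workaround described above.
XTTS_LANG = {"wo": "fr"}

def xtts_language(code: str) -> str:
    """Return the XTTS-v2 language token for an ISO 639-1 code."""
    return XTTS_LANG.get(code, code)

print(xtts_language("wo"))  # fr
print(xtts_language("en"))  # en
```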
## Citation
If you use this model, please cite the original XTTS-v2 and the datasets:
```bibtex
@misc{xtts-v2-wolof,
  title={XTTS-v2 Fine-tuned for Wolof},
  author={Muhamad Ul},
  year={2026},
  url={https://huggingface.co/muhamadul/xtts-v2-wolof}
}

@misc{casanova2024xtts,
  title={XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model},
  author={Casanova, Edresson and others},
  year={2024},
  publisher={Coqui AI}
}

@misc{waxalnlp-wolof,
  title={Wolof Speech Dataset},
  author={WaxalNLP},
  url={https://huggingface.co/datasets/WaxalNLP/wolof_speech}
}

@inproceedings{conneau2023fleurs,
  title={FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author={Conneau, Alexis and others},
  booktitle={IEEE SLT},
  year={2023}
}
```
## Acknowledgments
- Coqui AI for the XTTS-v2 base model and TTS framework
- WaxalNLP for the Wolof speech dataset
- GalsenAI for the corrected Wolof speech dataset derived from the WaxalNLP corpus
- Google FLEURS for multilingual speech data
- Daande, the TTS studio powered by this model, built for Wolof speakers
Daande means "voice" in Pulaar. This project aims to bring modern AI voice technology to West African languages.