XTTS-v2 Fine-tuned for Wolof 🇸🇳

A fine-tuned version of Coqui XTTS-v2 with improved Wolof language support. This model was fine-tuned to bring high-quality text-to-speech synthesis to the Wolof language — one of the most widely spoken languages in Senegal and West Africa.

Model Description

XTTS-v2 is a multilingual text-to-speech model that supports voice cloning. This fine-tuned version enhances performance on Wolof while retaining capabilities in French and English.

| Property | Value |
|---|---|
| Base Model | coqui/XTTS-v2 |
| Architecture | GPT-2-based encoder + HiFi-GAN decoder |
| Parameters | ~467M |
| Audio Sample Rate | 24,000 Hz |
| Model Size | 1.7 GB |

Training Details

Datasets

The model was fine-tuned on a curated multilingual dataset of 10,861 audio samples:

| Dataset | Language | Samples | Description |
|---|---|---|---|
| WaxalNLP/wolof_speech | Wolof | 1,042 | Wolof speech corpus |
| GalsenAI/WaxalNNLP | Wolof | 1,042 | Enhanced Wolof speech |
| google/fleurs (wo) | Wolof | 2,819 | FLEURS Wolof split |
| google/fleurs (fr) | French | 2,000 | FLEURS French (subset) |
| google/fleurs (en) | English | 2,000 | FLEURS English (subset) |
| keithito/lj_speech | English | 3,000 | LJSpeech (subset) |

French and English data was included to prevent catastrophic forgetting of multilingual capabilities during fine-tuning.

Fine-tuning Technique

  • Method: GPT encoder fine-tuning (Stage 1) using GPTTrainer from Coqui TTS 0.22.0
  • All 898 parameter tensors were trained (full fine-tuning, no layer freezing)
  • Optimizer: AdamW with learning rate 5e-6
  • Batch size: 4 (effective batch size 16 with gradient accumulation of 4)
  • Epochs: 3 (7,740 total steps)
  • Precision: FP32 (mixed precision disabled — FP16 caused NaN losses on A100)
  • Hardware: NVIDIA A100-SXM4-40GB
  • Training loss: 0.924 → 0.767
  • Best eval loss: 3.049

Key Training Decisions

  1. FP32 over FP16: Mixed precision training produced NaN losses with TTS 0.22.0 on A100 GPUs. FP32 training was stable and produced valid gradients throughout.
  2. Multilingual data mix: Including French, English, and LJSpeech alongside Wolof prevented the model from losing its multilingual voice cloning ability.
  3. Low learning rate (5e-6): A conservative learning rate preserved the pre-trained model's strengths while allowing adaptation to Wolof phonology.
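The decisions above map onto the trainer configuration roughly as follows. This is a hypothetical sketch, not the author's actual script: the class and field names follow the XTTS fine-tuning recipe bundled with Coqui TTS 0.22.0, paths are placeholders, and `train_samples`/`eval_samples` are assumed to be prepared elsewhere.

```python
# Hypothetical sketch — names follow the public XTTS fine-tuning recipe
# shipped with Coqui TTS 0.22.0; dataset wiring and paths are placeholders.
from trainer import Trainer, TrainerArgs
from TTS.tts.layers.xtts.trainer.gpt_trainer import (
    GPTArgs,
    GPTTrainer,
    GPTTrainerConfig,
)

config = GPTTrainerConfig(
    output_path="runs/xtts-wolof",   # placeholder
    model_args=GPTArgs(),            # tokenizer / base checkpoint paths go here
    epochs=3,
    batch_size=4,                    # micro-batch, as reported above
    mixed_precision=False,           # FP16 produced NaN losses on A100
    optimizer="AdamW",
    lr=5e-6,
)

model = GPTTrainer.init_from_config(config)
trainer = Trainer(
    TrainerArgs(grad_accum_steps=4),  # effective batch size: 4 x 4 = 16
    config,
    output_path=config.output_path,
    model=model,
    train_samples=train_samples,      # loaded via load_tts_samples (not shown)
    eval_samples=eval_samples,
)
```

Note that gradient accumulation lives on `TrainerArgs`, not on the model config, so the optimizer still sees an effective batch of 16 while only 4 samples sit in GPU memory at a time.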

Evaluation Results

Comparison against the original XTTS-v2 base model on 9 test sentences (5 Wolof, 2 French, 2 English):

Overall

| Metric | Fine-tuned | Original | Δ |
|---|---|---|---|
| Speaker Similarity (↑) | 0.8273 | 0.8175 | +0.0099 |
| Wins (speaker match) | 7/9 | 2/9 | |

Per-Language Speaker Similarity

| Language | Fine-tuned | Original | Δ | FT Win Rate |
|---|---|---|---|---|
| Wolof 🟢 | 0.8396 | 0.8193 | +0.0203 | 5/5 (100%) |
| French 🟢 | 0.8266 | 0.8187 | +0.0080 | 1/2 |
| English 🔴 | 0.7974 | 0.8116 | -0.0142 | 1/2 |

Key finding: The fine-tuned model improves speaker similarity for Wolof by +0.02 absolute (about 2.5% relative) while remaining competitive in French and English.
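The card does not name the speaker encoder behind these scores; speaker similarity is conventionally the cosine similarity between speaker embeddings of the reference and the synthesized audio (e.g. from an ECAPA-TDNN verifier). A minimal sketch of the comparison step, with toy embeddings standing in for real encoder outputs:

```python
import numpy as np

def speaker_similarity(emb_a, emb_b):
    """Cosine similarity between two speaker embedding vectors."""
    a = np.asarray(emb_a, dtype=np.float64)
    b = np.asarray(emb_b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim embeddings; real ones come from a speaker encoder
reference = [0.4, 0.1, 0.3, 0.2]
synthesized = [0.38, 0.12, 0.31, 0.19]

print(round(speaker_similarity(reference, synthesized), 4))
print(speaker_similarity(reference, reference))  # identical embeddings score ~1.0
```

Scores near 1.0 mean the cloned voice closely matches the reference speaker; the table's ~0.82 values are typical for zero-shot cloning.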

Usage

With Coqui TTS

```python
from TTS.api import TTS
import torch

# Load the base XTTS-v2 model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=False)

# Override with the fine-tuned weights
model_path = "path/to/model.pth"
checkpoint = torch.load(model_path, map_location="cpu", weights_only=False)
tts.synthesizer.tts_model.load_state_dict(checkpoint["model"], strict=True)

# Generate Wolof speech
tts.tts_to_file(
    text="Jàmm nga fanaan. Nanga def?",
    speaker_wav="reference_audio.wav",
    language="fr",  # Wolof maps to French in XTTS-v2
    file_path="output.wav",
)
```

Direct with the Xtts class

```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the config and weights shipped with this repository
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path="model.pth", vocab_path="vocab.json")
model.eval()

outputs = model.synthesize(
    text="Jàmm nga fanaan.",
    config=config,
    speaker_wav="reference.wav",
    language="fr",
)
```
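`synthesize` returns a dict whose `"wav"` entry holds the waveform at the model's 24 kHz output rate; the snippet above stops before saving it. A stdlib-only sketch of writing such a waveform as 16-bit PCM, with a dummy sine tone standing in for `outputs["wav"]`:

```python
import math
import struct
import wave

SAMPLE_RATE = 24000  # XTTS-v2 output rate, per the model card

# Stand-in waveform; in practice use outputs["wav"] from model.synthesize(...)
samples = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
           for t in range(SAMPLE_RATE)]  # 1 second of a 440 Hz tone

with wave.open("output.wav", "wb") as f:
    f.setnchannels(1)           # mono
    f.setsampwidth(2)           # 16-bit PCM
    f.setframerate(SAMPLE_RATE)
    f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))

print("wrote", len(samples), "frames at", SAMPLE_RATE, "Hz")
```

Libraries like `soundfile` or `torchaudio` do the same in one call, but the key detail is the 24 kHz rate: saving at any other rate will pitch-shift the output.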

Files

| File | Size | Description |
|---|---|---|
| model.pth | 1.7 GB | Fine-tuned model weights (wrapped as `{"model": state_dict}`) |
| config.json | 4.3 KB | Model configuration |
| vocab.json | 353 KB | Tokenizer vocabulary |
| speakers_xtts.pth | 7.4 MB | Speaker embeddings |
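Because the weights are nested under a `"model"` key, a plain `load_state_dict(torch.load("model.pth"))` fails; the checkpoint must be unwrapped first, as in the usage snippets above. A dependency-free sketch of that convention (plain `pickle` with list "weights" stands in for `torch.save`/`torch.load` with tensors; the key name is illustrative):

```python
import io
import pickle

# The card notes model.pth stores weights wrapped as {"model": state_dict};
# simulate that wrapping with a tiny stand-in state dict.
state_dict = {"gpt.layer_0.weight": [[1.0, 0.0], [0.0, 1.0]]}

buf = io.BytesIO()
pickle.dump({"model": state_dict}, buf)  # save with the wrapper

buf.seek(0)
checkpoint = pickle.load(buf)
weights = checkpoint["model"]            # unwrap before load_state_dict(...)
print(sorted(weights))
```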

Limitations

  • Wolof is mapped to French (language="fr") since XTTS-v2 has no native Wolof language token. This works well because Wolof and French share phonological similarities in the Senegalese context.
  • English speaker similarity shows a slight regression (-1.4%) compared to the base model.
  • The model was fine-tuned for 3 epochs only — longer training or larger Wolof datasets could yield further improvements.
  • Voice quality depends on the reference audio provided for cloning.

Citation

If you use this model, please cite the original XTTS-v2 and the datasets:

```bibtex
@misc{xtts-v2-wolof,
  title={XTTS-v2 Fine-tuned for Wolof},
  author={Muhamad Ul},
  year={2026},
  url={https://huggingface.co/muhamadul/xtts-v2-wolof}
}

@misc{casanova2024xtts,
  title={XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model},
  author={Casanova, Edresson and others},
  year={2024},
  publisher={Coqui AI}
}

@misc{waxalnlp-wolof,
  title={Wolof Speech Dataset},
  author={WaxalNLP},
  url={https://huggingface.co/datasets/WaxalNLP/wolof_speech}
}

@inproceedings{conneau2023fleurs,
  title={FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  author={Conneau, Alexis and others},
  booktitle={IEEE SLT},
  year={2023}
}
```

Acknowledgments

  • Coqui AI for the XTTS-v2 base model and TTS framework
  • WaxalNLP for the Wolof speech dataset
  • GalsenAI for the corrected Wolof speech dataset derived from the WaxalNLP corpus
  • Google FLEURS for multilingual speech data
  • Daande — The TTS studio powered by this model, built for Wolof speakers

Daande means "voice" in Pulaar. This project aims to bring modern AI voice technology to West African languages.
