
XTTS v2 Fine-tuned Model

Model Description

This is a fine-tuned version of the XTTS v2 model trained on the LJSpeech dataset. The model supports multilingual text-to-speech synthesis with voice cloning capabilities.

Model Details

  • Base Model: Coqui XTTS v2.0
  • Fine-tuning Dataset: LJSpeech
  • Languages Supported: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (Simplified), Hungarian, Korean, Japanese, Hindi
  • Model Type: Generative Pre-trained Transformer for Text-to-Speech
  • Voice Cloning: Yes (requires speaker reference audio ~10-30 seconds)

Installation

# Clone the maintained fork of Coqui TTS
git clone https://github.com/idiap/coqui-ai-TTS.git
cd coqui-ai-TTS

# Create and activate a Python 3.10 environment
conda create --name xtts python=3.10
conda activate xtts

# Install PyTorch and the package itself
uv pip install torch torchaudio torchcodec --torch-backend=auto
uv pip install -e .
uv pip install transformers==5.0.0

# If automatic backend selection fails, pin torchaudio for CUDA 12.1 instead:
pip install torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu121 --break-system-packages

Quick Start

Basic Usage

from TTS.tts.models.xtts import Xtts
from TTS.tts.configs.xtts_config import XttsConfig
from huggingface_hub import hf_hub_download
import torch
import torchaudio

# Download files from Hugging Face
repo_id = "Abdelrahman2922/XTTS-v2-FT-saudia"
subfolder = "XTTS-v2-FT-LJSpeech"  # model files live in this subfolder of the repo

config_path = hf_hub_download(repo_id=repo_id, filename="config.json", subfolder=subfolder)
checkpoint_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin", subfolder=subfolder)
vocab_path = hf_hub_download(repo_id=repo_id, filename="vocab.json", subfolder=subfolder)

# Load model
config = XttsConfig()
config.load_json(config_path)
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path=checkpoint_path, 
                     vocab_path=vocab_path, use_deepspeed=False)
model.cuda()

# Get speaker latents from reference audio
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference_speaker.wav"]
)

# Generate speech
output = model.inference(
    text="Hello, this is a test of the fine-tuned XTTS model.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.75,
    enable_text_splitting=True
)

# Save output
torchaudio.save("output.wav", torch.tensor(output["wav"]).unsqueeze(0), 24000)
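The `torchaudio.save` call above assumes `output["wav"]` is a sequence of float samples in [-1, 1] at 24 kHz. If torchaudio is unavailable at save time, the same file can be written with only the standard library; this is an illustrative fallback sketch, not part of the model's API:

```python
import wave
import struct

def save_wav_stdlib(samples, path, sample_rate=24000):
    """Write mono float samples in [-1, 1] as 16-bit PCM using only the stdlib."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)        # mono output
        f.setsampwidth(2)        # 16-bit PCM
        f.setframerate(sample_rate)
        for s in samples:
            # Clamp before scaling to avoid integer overflow on out-of-range samples.
            s = max(-1.0, min(1.0, s))
            f.writeframes(struct.pack("<h", int(s * 32767)))

# Example: write one second of silence at the model's 24 kHz output rate.
save_wav_stdlib([0.0] * 24000, "output.wav")
```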

Advanced Usage with Different Languages

# Arabic example
output = model.inference(
    text="السلام عليكم، كيف حالك؟",
    language="ar",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.85,
    enable_text_splitting=True
)

# French example
output = model.inference(
    text="Bonjour, comment allez-vous?",
    language="fr",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.75,
)
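The `language` argument must be one of the codes the model supports; note that Simplified Chinese uses `"zh-cn"`, not `"zh"`. A small lookup table built from the supported-languages list above (the helper itself is an illustrative sketch) catches typos before inference:

```python
# Language codes accepted by XTTS v2, mapped to display names.
XTTS_LANGUAGES = {
    "en": "English", "es": "Spanish", "fr": "French", "de": "German",
    "it": "Italian", "pt": "Portuguese", "pl": "Polish", "tr": "Turkish",
    "ru": "Russian", "nl": "Dutch", "cs": "Czech", "ar": "Arabic",
    "zh-cn": "Chinese (Simplified)", "hu": "Hungarian", "ko": "Korean",
    "ja": "Japanese", "hi": "Hindi",
}

def check_language(code: str) -> str:
    """Return the code if supported, otherwise raise with the valid options."""
    if code not in XTTS_LANGUAGES:
        raise ValueError(
            f"Unsupported language {code!r}; choose one of {sorted(XTTS_LANGUAGES)}"
        )
    return code
```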

Configuration

The config.json file embeds all necessary configuration, including:

  • Model Architecture: GPT-based autoregressive model with 30 layers
  • Audio Parameters:
    • Sample rate: 22050 Hz (input), 24000 Hz (output)
    • Max audio length: 264600 samples (12 seconds at 22050 Hz)
    • Max text length: 200 characters
  • Inference Parameters:
    • Temperature: 0.85
    • Top-k: 50
    • Top-p: 0.85
    • Repetition penalty: 2.0
  • Vocabulary: 6681 text tokens with multilingual support
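Because each inference call is capped at 200 characters, longer texts have to be split. `enable_text_splitting=True` handles this inside the library; the sketch below only illustrates the idea with a simple greedy sentence packer (not the library's actual algorithm):

```python
import re

def split_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as one oversize chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)      # current chunk is full; start a new one
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be synthesized separately and the audio concatenated.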

File Descriptions

  • config.json: Complete training and inference configuration with embedded vocabulary
  • pytorch_model.bin: Fine-tuned model checkpoint with all weights
  • vocab.json: Tokenizer vocabulary for text encoding
  • mel_stats.pth: Mel-spectrogram normalization statistics
  • dvae.pth: Discrete VAE model for audio tokenization

Parameters

Inference Parameters

  • temperature: Controls sampling randomness (0.7-0.9 recommended); lower values give more deterministic output
  • top_k: Restricts sampling to the k most likely tokens (default 50)
  • top_p: Nucleus sampling threshold (default 0.85)
  • repetition_penalty: Penalizes repeated tokens (default 2.0)
  • enable_text_splitting: Automatically splits long texts into sentences for better quality
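For intuition, how temperature, top_k, top_p, and repetition_penalty interact can be shown on a toy logit vector. This is a pure-Python illustration of the standard sampling filters, not the model's actual decoder code:

```python
import math
import random

def filter_and_sample(logits, history, temperature=0.85, top_k=50,
                      top_p=0.85, repetition_penalty=2.0, rng=random):
    """Apply repetition penalty, temperature, top-k and top-p, then sample a token id."""
    logits = list(logits)
    # Repetition penalty: push down the logits of already-emitted tokens.
    for t in set(history):
        logits[t] = logits[t] / repetition_penalty if logits[t] > 0 else logits[t] * repetition_penalty
    # Temperature: dividing logits by T < 1 sharpens the distribution.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]
    # Top-k: keep only the k most likely tokens.
    probs.sort(key=lambda p: p[1], reverse=True)
    probs = probs[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize over the kept tokens and draw a sample.
    total = sum(p for _, p in kept)
    r = rng.random() * total
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```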

Performance

  • Inference Speed: ~2-3x real-time on GPU (NVIDIA A100)
  • Output Quality: 24kHz mono WAV files
  • Memory Requirements: ~8GB GPU memory recommended
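"2-3x real-time" here means audio duration divided by generation wall-clock time. A quick helper (illustrative only) makes the measurement explicit:

```python
def real_time_factor(num_samples: int, sample_rate: int, wall_seconds: float) -> float:
    """Speedup over real time: seconds of audio produced per second of compute."""
    audio_seconds = num_samples / sample_rate
    return audio_seconds / wall_seconds

# e.g. 12 s of 24 kHz audio generated in 5 s of wall time -> 2.4x real time
```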

Limitations

  • Maximum text length: 200 characters per inference call
  • Requires CUDA-capable GPU for real-time performance
  • Best results with speaker reference audio of 10-30 seconds
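The 10-30 second recommendation for reference audio can be checked before computing conditioning latents. The sketch below assumes the reference is a standard WAV file and uses only the stdlib; it is a convenience helper, not part of the library:

```python
import wave

def check_reference_audio(path: str, min_s: float = 10.0, max_s: float = 30.0) -> float:
    """Return the clip duration in seconds, warning if outside the recommended range."""
    with wave.open(path, "rb") as f:
        duration = f.getnframes() / f.getframerate()
    if not (min_s <= duration <= max_s):
        print(f"Warning: {path} is {duration:.1f}s; 10-30s works best for voice cloning.")
    return duration
```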

License

This model is provided under the same license as the original Coqui TTS project.

Attribution

  • Original XTTS model: Coqui AI
  • Fine-tuning: LJSpeech dataset

Citation

@misc{casanova2024xtts,
  title={XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model},
  author={Casanova, Edresson and Davis, Kelly and G{\"o}lge, Eren and others},
  journal={arXiv preprint arXiv:2406.04904},
  year={2024}
}

Support

For issues and questions, please open an issue on the Hugging Face Model Hub.
