This is a fine-tuned version of the XTTS v2 model trained on the LJSpeech dataset. The model supports multilingual text-to-speech synthesis with voice cloning capabilities.
git clone https://github.com/idiap/coqui-ai-TTS.git
conda create --name xtts python=3.10
conda activate xtts
uv pip install torch torchaudio torchcodec --torch-backend=auto
uv pip install -e .
uv pip install transformers==5.0.0
from TTS.tts.models.xtts import Xtts
from TTS.tts.configs.xtts_config import XttsConfig
from huggingface_hub import hf_hub_download
import torch
import torchaudio

# Download the model files from the Hugging Face Hub
repo_id = "Abdelrahman2922/XTTS-v2-FT-LJSpeech"
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
checkpoint_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")
vocab_path = hf_hub_download(repo_id=repo_id, filename="vocab.json")

# Load the model
config = XttsConfig()
config.load_json(config_path)
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path=checkpoint_path,
    vocab_path=vocab_path,
    use_deepspeed=False,
)
model.cuda()

# Compute speaker latents from a reference audio clip (voice cloning)
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference_speaker.wav"]
)

# Generate speech
output = model.inference(
    text="Hello, this is a test of the fine-tuned XTTS model.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.75,
    enable_text_splitting=True,
)

# Save the 24 kHz output waveform
torchaudio.save("output.wav", torch.tensor(output["wav"]).unsqueeze(0), 24000)
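The inference call returns a dict whose "wav" entry is a one-dimensional float waveform sampled at 24 kHz (the sample rate used in the save call above). As a minimal sketch, assuming output["wav"] behaves like a flat sequence of float samples, you can inspect a clip's duration and peak amplitude in pure Python before saving:

```python
def waveform_stats(wav, sample_rate=24000):
    """Return (duration_seconds, peak_amplitude) for a 1-D float waveform."""
    duration = len(wav) / sample_rate
    peak = max(abs(s) for s in wav) if wav else 0.0
    return duration, peak

# Synthetic stand-in for output["wav"]: 1.5 s of silence with one spike.
fake_wav = [0.0] * 36000
fake_wav[100] = 0.8
duration, peak = waveform_stats(fake_wav)
print(duration, peak)  # 1.5 0.8
```

A peak near 1.0 suggests the clip may clip when written to a 16-bit file, so a quick check like this is a cheap sanity test on batch-generated audio.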
# Arabic example
output = model.inference(
    text="السلام عليكم، كيف حالك؟",  # "Peace be upon you, how are you?"
    language="ar",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.85,
    enable_text_splitting=True,
)

# French example
output = model.inference(
    text="Bonjour, comment allez-vous?",  # "Hello, how are you?"
    language="fr",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.75,
)
The config.json file embeds all necessary configuration, including the following generation parameters:

- temperature: Controls randomness (0.7-0.9 recommended); lower values are more deterministic.
- top_k: Limits sampling to the top-k most likely tokens. Default: 50.
- top_p: Nucleus sampling parameter. Default: 0.85.
- repetition_penalty: Penalizes repeated tokens. Default: 2.0.
- enable_text_splitting: Automatically splits long texts into sentences for better quality.

This model is provided under the same license as the original Coqui TTS project.
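To make the sampling knobs above concrete, here is a toy, self-contained sketch (not the library's actual implementation) of how temperature, top_k, and top_p typically filter a next-token distribution before sampling:

```python
import math

def filter_and_renormalize(logits, temperature=0.75, top_k=50, top_p=0.85):
    """Toy next-token filter: temperature-scale the logits, keep only the
    top_k most likely tokens, then the smallest nucleus whose cumulative
    probability reaches top_p, and renormalize what remains.
    Returns {token_index: probability}."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]          # stable softmax
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    ranked = ranked[:top_k]                           # top-k cut
    kept, mass = [], 0.0
    for p, i in ranked:                               # nucleus (top-p) cut
        kept.append((p, i))
        mass += p
        if mass >= top_p:
            break
    return {i: p / mass for p, i in kept}

dist = filter_and_renormalize([2.0, 1.0, 0.1, -1.0],
                              temperature=1.0, top_k=3, top_p=0.9)
```

Lowering temperature sharpens the distribution toward the most likely token, while top_k and top_p prune the unlikely tail; the same intuition applies to the XTTS token decoder, even though its real sampler is more involved.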
@article{casanova2024xtts,
  title={XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model},
  author={Casanova, Edresson and Davis, Kelly and G{\"o}lge, Eren and others},
  journal={arXiv preprint arXiv:2406.04904},
  year={2024}
}
For issues and questions, please open an issue on the Hugging Face Model Hub.