---
language:
- dv
license: other
license_name: coqui-public-model-license
license_link: https://coqui.ai/cpml
tags:
- text-to-speech
- tts
- voice-cloning
- xtts
- dhivehi
- thaana
- maldives
library_name: coqui
pipeline_tag: text-to-speech
base_model: coqui/XTTS-v2
---

# XTTS v2 - Dhivehi (Thaana)

Fine-tuned [XTTS v2.0](https://huggingface.co/coqui/XTTS-v2) for **Dhivehi** (Maldivian, Thaana script) text-to-speech with zero-shot voice cloning.

## Model Details

- **Base model:** XTTS v2.0 (Coqui)
- **Language:** Dhivehi (dv) - Thaana script
- **Architecture:** GPT-2 + DVAE + HiFiGAN vocoder
- **Audio:** 24 kHz output
- **Training step:** 95,366

## Training Data

~59,000 samples (~75+ hours) from multiple Dhivehi speech sources:

- [Serialtechlab/dhivehi-javaabu-speech-parquet](https://huggingface.co/datasets/Serialtechlab/dhivehi-javaabu-speech-parquet) - news/article narration
- [Serialtechlab/dv-presidential-speech](https://huggingface.co/datasets/Serialtechlab/dv-presidential-speech) - presidential addresses
- [Serialtechlab/dhivehi-tts-female-01](https://huggingface.co/datasets/Serialtechlab/dhivehi-tts-female-01) - female speaker
- [alakxender/dv-audio-syn-lg](https://huggingface.co/datasets/alakxender/dv-audio-syn-lg) - synthetic speech (subset)

## Usage

### Install

```bash
pip install coqui-tts
```

### Inference

```python
import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Download all files from this repo into a local directory
config = XttsConfig()
config.load_json("config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    eval=True,
    strict=False,
)
model.cuda()

# Get speaker embedding from a reference WAV (5-15 sec of clean speech)
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"],
    gpt_cond_len=24,
    gpt_cond_chunk_len=4,
)

# Generate speech
out = model.inference(
    text="\u0784\u07a8\u0790\u07b0\u0789\u07a8\ufdf2 \u0783\u07a6\u0781\u07aa\u0789\u07a7\u0782\u07a8 \u0783\u07a6\u0781\u07a9\u0789\u07a8",
    language="dv",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

wav = torch.tensor(out["wav"]).unsqueeze(0)
torchaudio.save("output.wav", wav, 24000)
```

## Files

- `model.pth` - Fine-tuned GPT checkpoint
- `config.json` - Model configuration
- `vocab.json` - Extended BPE vocabulary (base XTTS + Thaana characters)
- `dvae.pth` - Discrete VAE (from base XTTS v2.0)
- `mel_stats.pth` - Mel spectrogram normalization stats (from base XTTS v2.0)

## Limitations

- Voice cloning quality depends on the reference audio (clean, 5-15 seconds recommended)
- Text longer than ~300 characters may be truncated
- Some rare Dhivehi words may be mispronounced
- Model is still being actively trained - newer checkpoints may be uploaded

## License

This model inherits the [Coqui Public Model License](https://coqui.ai/cpml) from the base XTTS v2.0 model.
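## Tip: chunking long text

The limitations above note that text longer than ~300 characters may be truncated. A common workaround is to split the input at sentence boundaries and synthesize each chunk separately. Below is a minimal sketch of such a splitter; the `chunk_text` helper and its punctuation set are assumptions for illustration, not part of the XTTS API.

```python
import re


def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    Splits after Latin sentence punctuation (. ! ?) and the Arabic question
    mark / full stop (U+061F, U+06D4), which may also appear in Dhivehi text.
    A single sentence longer than max_chars is kept whole (and may still be
    truncated by the model).
    """
    sentences = re.split(r"(?<=[.!?\u061f\u06d4])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `model.inference()` in turn and the resulting `out["wav"]` arrays concatenated (e.g. with `torch.cat`) before saving.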