---
language:
- nd
license: cc-by-sa-4.0
library_name: coqui-tts
tags:
- text-to-speech
- tts
- vits
- open-bible
- ndebele
pipeline_tag: text-to-speech
datasets:
- davidguzmanr/open-bible-resources
inference: false
---

# VITS Open Bible — Ndebele

A multispeaker text-to-speech model for **Ndebele**, trained from scratch on the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources) corpus using the [VITS](https://arxiv.org/abs/2106.06103) architecture (end-to-end TTS with adversarial learning, 22,050 Hz output) via the [Coqui TTS](https://github.com/coqui-ai/TTS) framework.

Unlike zero-shot TTS models, VITS is conditioned on speaker embeddings learned during training. A speaker name from the training set must be supplied at inference time.

## Files

| File | Purpose |
|------|---------|
| `model_last.pth` | Trained model weights. |
| `config.json` | Coqui TTS model configuration. |
| `speakers.pth` | Speaker ID → embedding mapping. |

## Intended use

- Multispeaker TTS for Ndebele using one of the training-set speaker voices.
- Research on multilingual TTS, low-resource TTS evaluation, and listening studies on Open Bible–style read-speech.

## How to use

Install Coqui TTS:

```bash
pip install TTS
```

Download the checkpoint and run inference:

```python
import torch
from huggingface_hub import hf_hub_download
from TTS.tts.utils.speakers import SpeakerManager
from TTS.utils.synthesizer import Synthesizer

repo_id = "multilingual-tts/VITS-OpenBible-Ndebele"
ckpt = hf_hub_download(repo_id, "model_last.pth")
config = hf_hub_download(repo_id, "config.json")
speakers = hf_hub_download(repo_id, "speakers.pth")

use_cuda = torch.cuda.is_available()
synthesizer = Synthesizer(
    tts_checkpoint=ckpt,
    tts_config_path=config,
    tts_speakers_file=speakers,
    use_cuda=use_cuda,
)

# Coqui's Synthesizer may not inject the speakers file into the model config
# automatically; restore the SpeakerManager manually when needed.
if synthesizer.tts_model.speaker_manager is None:
    synthesizer.tts_model.speaker_manager = SpeakerManager(
        speaker_id_file_path=speakers
    )

# List available speaker names
print(sorted(synthesizer.tts_model.speaker_manager.speaker_names))

wav = synthesizer.tts(
    text="...",  # text to synthesise in Ndebele
    speaker_name="...",  # one of the speaker names printed above
    split_sentences=True,
)
```

## Training data

- **Source:** `davidguzmanr/open-bible-resources`, config `Ndebele`
- **Size:** approximately 21,474 utterances
- **Speakers:** multispeaker; speaker identity is fixed to one of the training-set voices and selected by name at inference time
- **Sample rate:** 22,050 Hz

## Training procedure

- Architecture: VITS (Conditional Variational Autoencoder + adversarial training).
- Grapheme-level tokenizer, built from the training transcripts.
- Optimizer: AdamW, learning rate 2e-4.
- Training budget: 500,000 optimizer updates on 2 GPUs with mixed precision (bf16).

Audio preprocessing and training are reproducible via the upstream [open-bible-models](https://github.com/davidguzmanr/open-bible-models) repo.

## Evaluation

Evaluated alongside other Open-Bible TTS systems on character/word error rate (via Meta's Omnilingual ASR) and UTMOSv2 naturalness scores. See the [open-bible-models](https://github.com/davidguzmanr/open-bible-models) repository for the evaluation pipeline and the [open-bible-surveys](https://github.com/davidguzmanr/open-bible-surveys) repository for the human-listening survey methodology.
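
For readers unfamiliar with the error-rate metrics mentioned above, the following is a minimal sketch of how character and word error rate are commonly computed from a reference transcript and an ASR transcript of the synthesised audio. It is not the code used in the open-bible-models pipeline. The function names `edit_distance` and `error_rate` are hypothetical, the strings are placeholders rather than Ndebele text, and any text normalisation (casing, punctuation) applied upstream is not reproduced here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference tokens into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference token
                    curr[j - 1] + 1,     # insertion of a hypothesis token
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """Edit distance divided by reference length, over words (WER) or characters (CER)."""
    if unit not in ("word", "char"):
        raise ValueError("unit must be 'word' or 'char'")
    ref = reference.split() if unit == "word" else list(reference)
    hyp = hypothesis.split() if unit == "word" else list(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Placeholder strings (assumption): one substitution plus one deletion over 4 words.
print(error_rate("a b c d", "a x c", unit="word"))  # 0.5
# One substituted character out of 4.
print(error_rate("abcd", "abxd", unit="char"))  # 0.25
```

In this setup the hypothesis would be the ASR transcript of the audio returned by `synthesizer.tts(...)`, and the reference would be the input text. Note that both rates can exceed 1.0 when the hypothesis is much longer than the reference.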