---
language:
- lg
license: cc-by-sa-4.0
library_name: coqui-tts
tags:
- text-to-speech
- tts
- vits
- open-bible
- luganda
pipeline_tag: text-to-speech
datasets:
- davidguzmanr/open-bible-resources
inference: false
---

# VITS Open Bible: Luganda

A multispeaker text-to-speech model for **Luganda**, trained from scratch on
the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources)
corpus using the [VITS](https://arxiv.org/abs/2106.06103) architecture
(end-to-end TTS with adversarial learning, 22,050 Hz output) via the
[Coqui TTS](https://github.com/coqui-ai/TTS) framework.

Unlike zero-shot TTS models, this model cannot synthesise voices it has not
seen: VITS is conditioned on speaker embeddings learned during training, so a
speaker name from the training set must be supplied at inference time.

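The dependence on training-set speakers can be pictured as a table lookup. The following is a minimal sketch, not Coqui's implementation: the speaker names, the 4-dimensional vectors, and the `lookup_speaker_embedding` helper are all hypothetical and only illustrate why an unknown name cannot be resolved.

```python
# Illustrative sketch only: names, sizes, and values are made up.
from typing import Dict, List

# Hypothetical mapping from speaker name to integer ID (fixed at training time).
name_to_id: Dict[str, int] = {"speaker_a": 0, "speaker_b": 1}

# Hypothetical embedding table with one learned row per training speaker.
embedding_table: List[List[float]] = [
    [0.1, -0.3, 0.7, 0.0],
    [0.5, 0.2, -0.1, 0.9],
]


def lookup_speaker_embedding(name: str) -> List[float]:
    """Return the embedding row for a training-set speaker name."""
    if name not in name_to_id:
        known = ", ".join(sorted(name_to_id))
        raise KeyError(f"Unknown speaker {name!r}; expected one of: {known}")
    return embedding_table[name_to_id[name]]


print(lookup_speaker_embedding("speaker_b"))
```

A name outside `name_to_id` has no row in the table, which is why the lookup raises an error instead of producing a voice.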

## Files

| File | Purpose |
|------|---------|
| `model_last.pth` | Trained model weights. |
| `config.json` | Coqui TTS model configuration. |
| `speakers.pth` | Speaker ID → embedding mapping. |

## Intended use

- Multispeaker TTS for Luganda using one of the training-set speaker voices.
- Research on multilingual TTS, low-resource TTS evaluation, and listening
  studies on Open Bible–style read speech.

## How to use

Install Coqui TTS:

```bash
pip install TTS
```

Download the checkpoint and run inference:

```python
import torch
from huggingface_hub import hf_hub_download
from TTS.tts.utils.speakers import SpeakerManager
from TTS.utils.synthesizer import Synthesizer

repo_id = "multilingual-tts/VITS-OpenBible-Luganda"
ckpt = hf_hub_download(repo_id, "model_last.pth")
config = hf_hub_download(repo_id, "config.json")
speakers = hf_hub_download(repo_id, "speakers.pth")

use_cuda = torch.cuda.is_available()
synthesizer = Synthesizer(
    tts_checkpoint=ckpt,
    tts_config_path=config,
    tts_speakers_file=speakers,
    use_cuda=use_cuda,
)

# Coqui's Synthesizer may not inject the speakers file into the model config
# automatically, so restore the SpeakerManager manually when needed.
if synthesizer.tts_model.speaker_manager is None:
    synthesizer.tts_model.speaker_manager = SpeakerManager(
        speaker_id_file_path=speakers
    )

# List available speaker names
print(sorted(synthesizer.tts_model.speaker_manager.speaker_names))

wav = synthesizer.tts(
    text="...",  # text to synthesise in Luganda
    speaker_name="...",  # one of the speaker names printed above
    split_sentences=True,
)
```

## Training data

- **Source:** `davidguzmanr/open-bible-resources`, config `Luganda`
- **Size:** approximately 21,553 utterances
- **Speakers:** multispeaker; speaker identity is fixed to one of the training-set
  voices and selected by name at inference time
- **Sample rate:** 22,050 Hz

## Training procedure

- Architecture: VITS (conditional variational autoencoder with adversarial training).
- Tokenizer: grapheme-level, built from the training transcripts.
- Optimizer: AdamW, learning rate 2e-4.
- Training budget: 500,000 optimizer updates on 2 GPUs with mixed precision
  (bf16).

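A grapheme-level tokenizer of the kind listed above can be sketched as follows. This is a minimal illustration, not the tokenizer used in training: the `build_vocab` and `encode` helpers, the `<pad>`/`<unk>` symbols, the lowercasing step, and the placeholder strings are assumptions made for the example.

```python
from typing import Dict, Iterable, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical special symbols


def build_vocab(transcripts: Iterable[str]) -> Dict[str, int]:
    """Map every character seen in the transcripts to an integer ID."""
    chars = sorted({ch for text in transcripts for ch in text.lower()})
    return {sym: idx for idx, sym in enumerate([PAD, UNK] + chars)}


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Convert text to IDs, falling back to UNK for unseen characters."""
    return [vocab.get(ch, vocab[UNK]) for ch in text.lower()]


# Placeholder strings standing in for training transcripts.
sample_transcripts = ["aba", "ab c"]
vocab = build_vocab(sample_transcripts)
print(vocab)
print(encode("cab z", vocab))
```

Because the vocabulary is derived from the transcripts, any character that never appears in the training text has no ID of its own and maps to the unknown symbol in this sketch.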

Audio preprocessing and training are reproducible via the upstream
[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repo.

## Evaluation

The model was evaluated alongside other Open Bible TTS systems using character and
word error rates (via Meta's Omnilingual ASR) and UTMOSv2 naturalness scores. See the
[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repository
for the evaluation pipeline and the
[open-bible-surveys](https://github.com/davidguzmanr/open-bible-surveys) repository
for the human-listening survey methodology.

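Character and word error rates are conventionally defined as the edit distance between a reference transcript and the ASR transcript of the synthesised audio, divided by the reference length. The sketch below illustrates that definition only; it is not the evaluation pipeline from the repositories above, and the helper names and sample strings are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate relative to the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Placeholder strings; in practice the hypothesis would come from an ASR system.
reference = "one two three"
hypothesis = "one too three"
print(cer(reference, hypothesis), wer(reference, hypothesis))
```

Lower values indicate that the ASR transcript of the synthesised speech is closer to the input text.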