---
language:
  - as
license: cc-by-sa-4.0
library_name: coqui-tts
tags:
  - text-to-speech
  - tts
  - vits
  - open-bible
  - assamese
pipeline_tag: text-to-speech
datasets:
  - davidguzmanr/open-bible-resources
inference: false
---

# VITS Open Bible — Assamese

A multispeaker text-to-speech model for **Assamese**, trained from scratch on
the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources)
corpus using the [VITS](https://arxiv.org/abs/2106.06103) architecture
(end-to-end TTS with adversarial learning, 22,050 Hz output) via the
[Coqui TTS](https://github.com/coqui-ai/TTS) framework.

Unlike zero-shot TTS models, VITS is conditioned on speaker embeddings learned
during training. A speaker name from the training set must be supplied at
inference time.

## Files

| File | Purpose |
|------|---------|
| `model_last.pth` | Trained model weights. |
| `config.json` | Coqui TTS model configuration. |
| `speakers.pth` | Speaker name → ID mapping, used to select the learned speaker embedding. |

## Intended use

- Multispeaker TTS for Assamese using one of the training-set speaker voices.
- Research on multilingual TTS, evaluation of low-resource TTS, and listening
  studies on Open Bible–style read speech.

## How to use

Install Coqui TTS:

```bash
pip install TTS
```

Download the checkpoint and run inference:

```python
import torch
from huggingface_hub import hf_hub_download
from TTS.tts.utils.speakers import SpeakerManager
from TTS.utils.synthesizer import Synthesizer

repo_id  = "multilingual-tts/VITS-OpenBible-Assamese"
ckpt     = hf_hub_download(repo_id, "model_last.pth")
config   = hf_hub_download(repo_id, "config.json")
speakers = hf_hub_download(repo_id, "speakers.pth")

use_cuda = torch.cuda.is_available()
synthesizer = Synthesizer(
    tts_checkpoint=ckpt,
    tts_config_path=config,
    tts_speakers_file=speakers,
    use_cuda=use_cuda,
)

# Coqui's Synthesizer may not inject the speakers file into the model config
# automatically — restore the SpeakerManager manually when needed.
if synthesizer.tts_model.speaker_manager is None:
    synthesizer.tts_model.speaker_manager = SpeakerManager(
        speaker_id_file_path=speakers
    )

# List available speaker names
print(sorted(synthesizer.tts_model.speaker_manager.speaker_names))

wav = synthesizer.tts(
    text="...",          # text to synthesise in Assamese
    speaker_name="...",  # one of the speaker names printed above
    split_sentences=True,
)
```
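
The snippet above leaves the audio in `wav` without writing it to disk. As a minimal sketch, assuming `wav` is a sequence of float samples roughly in the range [-1, 1], the helper below writes it as a 16-bit mono WAV file at the model's 22,050 Hz sample rate using only the standard library. The name `write_wav` is hypothetical and not part of Coqui TTS.

```python
import struct
import wave


def write_wav(samples, path, sample_rate=22050):
    """Write float samples as 16-bit mono PCM (hypothetical helper).

    Assumes samples are floats nominally in [-1, 1]; louder signals are
    scaled down so the peak fits into the int16 range.
    """
    peak = max((abs(float(s)) for s in samples), default=0.0)
    # Scale down only when the peak exceeds 1.0, otherwise keep the level.
    scale = 32767.0 / max(peak, 1.0)
    pcm = [int(float(s) * scale) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)           # mono
        f.setsampwidth(2)           # 2 bytes = 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)


# Usage with the output of the inference snippet:
# write_wav(wav, "output.wav")
```

If the installed Coqui TTS version provides its own helper for saving audio, that may be preferable, since it follows the audio settings in `config.json`.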

## Training data

- **Source:** `davidguzmanr/open-bible-resources`, config `Assamese`
- **Size:** approximately 20,895 utterances
- **Speakers:** multispeaker; speaker identity is fixed to one of the training-set
  voices and selected by name at inference time
- **Sample rate:** 22,050 Hz

## Training procedure

- Architecture: VITS (Conditional Variational Autoencoder + adversarial training).
- Grapheme-level tokenizer, built from the training transcripts.
- Optimizer: AdamW, learning rate 2e-4.
- Training budget: 500,000 optimizer updates on 2 GPUs with mixed precision
  (bf16).
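
To illustrate what a grapheme-level tokenizer built from the transcripts could look like, here is a minimal sketch. It assumes each Unicode code point is one symbol and reserves index 0 for padding. Both choices are assumptions for illustration, and the function names are hypothetical. The character set actually used by the model is the one stored in `config.json`.

```python
def build_vocab(transcripts):
    """Map every character seen in the transcripts to an integer ID."""
    chars = sorted({ch for line in transcripts for ch in line})
    # Index 0 is reserved for padding (an assumed convention).
    symbols = ["<PAD>"] + chars
    return {sym: idx for idx, sym in enumerate(symbols)}


def encode(text, vocab):
    """Convert text to IDs, dropping characters absent from the vocabulary."""
    return [vocab[ch] for ch in text if ch in vocab]


vocab = build_vocab(["ab", "bc"])
ids = encode("cab!", vocab)  # "!" was never seen, so it is skipped
```

One consequence of this design is that characters missing from the training transcripts cannot be synthesised, so input text may need normalising before inference.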

Audio preprocessing and training are reproducible via the upstream
[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repo.

## Evaluation

The model was evaluated alongside other Open Bible TTS systems on character and
word error rate (computed from transcripts produced by Meta's Omnilingual ASR)
and on UTMOSv2 naturalness scores. See the
[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repository
for the evaluation pipeline and the
[open-bible-surveys](https://github.com/davidguzmanr/open-bible-surveys) repository
for the human-listening survey methodology.
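
For readers unfamiliar with the metrics, character and word error rate are both an edit distance between the ASR transcript of the synthesised audio and the reference text, divided by the reference length. The sketch below shows that definition only. It is not the upstream evaluation pipeline, it applies no text normalisation, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalised by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(ref, hyp):
    """Character error rate: tokens are individual characters."""
    return error_rate(list(ref), list(hyp))


def wer(ref, hyp):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())
```

Because insertions count as errors, both rates can exceed 1.0 when the hypothesis is much longer than the reference.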