---
language:
- as
license: cc-by-sa-4.0
library_name: coqui-tts
tags:
- text-to-speech
- tts
- vits
- open-bible
- assamese
pipeline_tag: text-to-speech
datasets:
- davidguzmanr/open-bible-resources
inference: false
---
# VITS Open Bible — Assamese
A multispeaker text-to-speech model for **Assamese**, trained from scratch on
the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources)
corpus using the [VITS](https://arxiv.org/abs/2106.06103) architecture
(end-to-end TTS with adversarial learning, 22,050 Hz output) via the
[Coqui TTS](https://github.com/coqui-ai/TTS) framework.
Unlike zero-shot TTS models, VITS is conditioned on speaker embeddings learned
during training. A speaker name from the training set must be supplied at
inference time.
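Because the speaker must be one of the training-set names, it can help to validate the requested name before synthesis. The following is a minimal sketch under stated assumptions, not part of Coqui TTS: `resolve_speaker` is a hypothetical helper, and the speaker names are placeholders rather than the names shipped with this model.

```python
import difflib


def resolve_speaker(requested, available):
    """Return `requested` if it is a known speaker, else raise with suggestions."""
    if requested in available:
        return requested
    # Offer close matches to help with typos in the speaker name.
    suggestions = difflib.get_close_matches(requested, list(available), n=3, cutoff=0.6)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown speaker {requested!r}.{hint}")


# Placeholder speaker names, for illustration only.
speakers = ["speaker_a", "speaker_b"]
print(resolve_speaker("speaker_a", speakers))
```

In practice, `available` would be the list of names reported by the loaded model's speaker manager.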
## Files
| File | Purpose |
|------|---------|
| `model_last.pth` | Trained model weights. |
| `config.json` | Coqui TTS model configuration. |
| `speakers.pth` | Speaker name → speaker ID mapping, used to select the learned speaker embedding. |
## Intended use
- Multispeaker TTS for Assamese using one of the training-set speaker voices.
- Research on multilingual TTS, low-resource TTS evaluation, and listening
studies on Open Bible–style read speech.
## How to use
Install Coqui TTS:
```bash
pip install TTS
```
Download the checkpoint and run inference:
```python
import torch
from huggingface_hub import hf_hub_download
from TTS.tts.utils.speakers import SpeakerManager
from TTS.utils.synthesizer import Synthesizer

repo_id = "multilingual-tts/VITS-OpenBible-Assamese"

# Fetch the three model files from the Hub (cached locally after the first call).
ckpt = hf_hub_download(repo_id, "model_last.pth")
config = hf_hub_download(repo_id, "config.json")
speakers = hf_hub_download(repo_id, "speakers.pth")

use_cuda = torch.cuda.is_available()
synthesizer = Synthesizer(
    tts_checkpoint=ckpt,
    tts_config_path=config,
    tts_speakers_file=speakers,
    use_cuda=use_cuda,
)

# Coqui's Synthesizer may not inject the speakers file into the model config
# automatically; restore the SpeakerManager manually when needed.
if synthesizer.tts_model.speaker_manager is None:
    synthesizer.tts_model.speaker_manager = SpeakerManager(
        speaker_id_file_path=speakers
    )

# List available speaker names
print(sorted(synthesizer.tts_model.speaker_manager.speaker_names))

wav = synthesizer.tts(
    text="...",          # text to synthesise in Assamese
    speaker_name="...",  # one of the speaker names printed above
    split_sentences=True,
)
```
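The snippet above leaves the audio in memory. Coqui's `Synthesizer` has its own `save_wav` method for writing it out; the sketch below shows the equivalent steps with only the Python standard library. It assumes the returned waveform is a sequence of mono float samples in the range [-1.0, 1.0] at the model's 22,050 Hz sample rate; `write_wav` is a hypothetical helper name.

```python
import struct
import wave


def write_wav(samples, path, sample_rate=22050):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)  # assumed to match the model output rate
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```

With the `wav` value returned by `synthesizer.tts(...)`, a call such as `write_wav(wav, "output.wav")` would produce a file playable in most audio tools.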
## Training data
- **Source:** `davidguzmanr/open-bible-resources`, config `Assamese`
- **Size:** approximately 20,895 utterances
- **Speakers:** multispeaker; speaker identity is fixed to one of the training-set
voices and selected by name at inference time
- **Sample rate:** 22,050 Hz
## Training procedure
- Architecture: VITS (Conditional Variational Autoencoder + adversarial training).
- Grapheme-level tokenizer, built from the training transcripts.
- Optimizer: AdamW, learning rate 2e-4.
- Training budget: 500,000 optimizer updates on 2 GPUs with mixed precision
(bf16).
Audio preprocessing and training are reproducible via the upstream
[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repo.
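To illustrate what a grapheme-level tokenizer built from transcripts involves, here is a minimal sketch. It is not the Coqui TTS tokenizer, which also manages special symbols such as padding and blank tokens. The function names and the NFC normalisation step are assumptions made for this example, and "character" here means a Unicode code point.

```python
import unicodedata


def build_vocab(transcripts):
    """Map every distinct character in the transcripts to an integer ID."""
    chars = set()
    for text in transcripts:
        # NFC normalisation so equivalent Unicode sequences share one entry.
        chars.update(unicodedata.normalize("NFC", text))
    # Sort so that ID assignment is deterministic across runs.
    return {ch: idx for idx, ch in enumerate(sorted(chars))}


def encode(text, vocab):
    """Convert text to a list of IDs, skipping characters not in the vocabulary."""
    normalized = unicodedata.normalize("NFC", text)
    return [vocab[ch] for ch in normalized if ch in vocab]


vocab = build_vocab(["অসম", "অসমীয়া"])
print(len(vocab), encode("অসম", vocab))
```

A practical consequence of this design is that characters absent from the training transcripts have no ID, so input text containing them cannot be represented faithfully.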
## Evaluation
The model was evaluated alongside other Open Bible TTS systems on character and
word error rate (computed from transcripts produced by Meta's Omnilingual ASR)
and on UTMOSv2 naturalness scores. See the
[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repository
for the evaluation pipeline and the
[open-bible-surveys](https://github.com/davidguzmanr/open-bible-surveys) repository
for the human-listening survey methodology.
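For readers unfamiliar with the error-rate metrics, the sketch below shows how character and word error rate are commonly defined: the edit distance between a reference transcript and the ASR output, divided by the reference length. This is a generic illustration, not the project's evaluation code; the function names are hypothetical, and any text normalisation applied by the actual pipeline is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both functions assume a non-empty reference; an empty one would raise `ZeroDivisionError`. Lower values indicate that the ASR system recovered the synthesised text more accurately.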