	---
	language:
	- as
	license: cc-by-sa-4.0
	library_name: coqui-tts
	tags:
	- text-to-speech
	- tts
	- vits
	- open-bible
	- assamese
	pipeline_tag: text-to-speech
	datasets:
	- davidguzmanr/open-bible-resources
	inference: false
	---

	# VITS Open Bible — Assamese

	A multispeaker text-to-speech model for Assamese, trained from scratch on
	the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources)
	corpus using the [VITS](https://arxiv.org/abs/2106.06103) architecture
	(end-to-end TTS with adversarial learning, 22,050 Hz output) via the
	[Coqui TTS](https://github.com/coqui-ai/TTS) framework.

	Unlike zero-shot TTS models, VITS is conditioned on speaker embeddings learned
	during training. A speaker name from the training set must be supplied at
	inference time.
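
The reason a training-set speaker name is required can be illustrated with a toy lookup. The sketch below is not Coqui's implementation: the speaker names, the `speaker_to_id` mapping, and the tiny embedding table are all hypothetical stand-ins for what a multispeaker model learns during training.

```python
# Toy illustration of speaker conditioning via a learned embedding table.
# All names and values here are hypothetical; a real model keeps the table
# inside its weights and uses far more dimensions.

speaker_to_id = {"speaker_a": 0, "speaker_b": 1}

embedding_table = [
    [0.12, -0.40, 0.88],  # row 0: vector learned for "speaker_a"
    [-0.65, 0.31, 0.05],  # row 1: vector learned for "speaker_b"
]


def lookup_speaker_embedding(name):
    """Return the embedding for a known speaker; raise for an unknown one."""
    if name not in speaker_to_id:
        known = ", ".join(sorted(speaker_to_id))
        raise KeyError(f"Unknown speaker {name!r}; known speakers: {known}")
    return embedding_table[speaker_to_id[name]]


print(lookup_speaker_embedding("speaker_a"))
```

A name that was never seen in training has no row in the table, which is why such a model cannot imitate a new voice the way a zero-shot system can.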

	## Files

	| File | Purpose |
	|------|---------|
	| `model_last.pth` | Trained model weights. |
	| `config.json` | Coqui TTS model configuration. |
	| `speakers.pth` | Speaker ID → embedding mapping. |

	## Intended use

	- Multispeaker TTS for Assamese using one of the training-set speaker voices.
	- Research on multilingual TTS, low-resource TTS evaluation, and listening
	studies on Open Bible–style read speech.

	## How to use

	Install Coqui TTS:

	```bash
	pip install TTS
	```

	Download the checkpoint and run inference:

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from TTS.tts.utils.speakers import SpeakerManager
	from TTS.utils.synthesizer import Synthesizer

	# Fetch the checkpoint, config, and speaker mapping from the Hub
	repo_id = "multilingual-tts/VITS-OpenBible-Assamese"
	ckpt = hf_hub_download(repo_id, "model_last.pth")
	config = hf_hub_download(repo_id, "config.json")
	speakers = hf_hub_download(repo_id, "speakers.pth")

	use_cuda = torch.cuda.is_available()
	synthesizer = Synthesizer(
	    tts_checkpoint=ckpt,
	    tts_config_path=config,
	    tts_speakers_file=speakers,
	    use_cuda=use_cuda,
	)

	# Coqui's Synthesizer may not inject the speakers file into the model config
	# automatically, so restore the SpeakerManager manually when needed.
	if synthesizer.tts_model.speaker_manager is None:
	    synthesizer.tts_model.speaker_manager = SpeakerManager(
	        speaker_id_file_path=speakers
	    )

	# List available speaker names
	print(sorted(synthesizer.tts_model.speaker_manager.speaker_names))

	wav = synthesizer.tts(
	    text="...",  # text to synthesise in Assamese
	    speaker_name="...",  # one of the speaker names printed above
	    split_sentences=True,
	)
	```
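
The call above returns the waveform in memory rather than writing a file. As a minimal sketch, and assuming `wav` is a sequence of floats in the range [-1, 1] sampled at 22,050 Hz (the rate stated on this card), the samples can be written to a 16-bit mono WAV file with the standard library alone. The helper name `write_wav_16bit` and the output path are illustrative and not part of Coqui TTS.

```python
import struct
import wave


def write_wav_16bit(samples, path, sample_rate=22050):
    """Write float samples (assumed to lie in [-1, 1]) as 16-bit mono PCM."""
    # Clip to guard against overshoot, then scale to the signed 16-bit range.
    ints = [int(round(max(-1.0, min(1.0, float(s))) * 32767)) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        # WAV stores PCM samples little-endian, hence the explicit "<".
        f.writeframes(struct.pack(f"<{len(ints)}h", *ints))


# Usage, given the `wav` returned by synthesizer.tts(...):
# write_wav_16bit(wav, "output.wav")
```

Coqui's `Synthesizer` also exposes a `save_wav(wav, path)` method, which may be the simpler choice when it is available in the installed version.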

	## Training data

	- Source: `davidguzmanr/open-bible-resources`, config `Assamese`
	- Size: approximately 20,895 utterances
	- Speakers: multispeaker; speaker identity is fixed to one of the training-set
	voices and selected by name at inference time
	- Sample rate: 22,050 Hz

	## Training procedure

	- Architecture: VITS (Conditional Variational Autoencoder + adversarial training).
	- Tokenizer: grapheme-level, built from the training transcripts.
	- Optimizer: AdamW, learning rate 2e-4.
	- Training budget: 500,000 optimizer updates on 2 GPUs with mixed precision
	(bf16).
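
A grapheme-level tokenizer of the kind listed above can be approximated as follows. This is only a sketch of the general idea: it treats each Unicode code point as one symbol, which may differ from how the actual training recipe segments Assamese text, and the function names and the reserved `<pad>` and `<unk>` symbols are assumptions. The vocabulary actually used by this model is presumably the one recorded in `config.json`.

```python
def build_vocab(transcripts):
    """Map every distinct character in the transcripts to an integer ID."""
    symbols = sorted({ch for text in transcripts for ch in text})
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved IDs (an assumption)
    for ch in symbols:
        vocab[ch] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text to a list of IDs; unseen characters map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(ch, unk) for ch in text]


# Illustrative transcripts only; these are not taken from the corpus.
transcripts = ["অসম", "অসমীয়া ভাষা"]
vocab = build_vocab(transcripts)
print(len(vocab), encode("অসম", vocab))
```

Because the vocabulary comes only from the training transcripts, characters absent from them (digits or Latin letters, for instance) may have no symbol of their own at synthesis time.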

	Audio preprocessing and training are reproducible via the upstream
	[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repo.

	## Evaluation

	The model was evaluated alongside other Open Bible TTS systems on character and
	word error rates (computed from transcriptions by Meta's Omnilingual ASR) and
	UTMOSv2 naturalness scores. See the
	[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repository
	for the evaluation pipeline and the
	[open-bible-surveys](https://github.com/davidguzmanr/open-bible-surveys) repository
	for the human-listening survey methodology.
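
Character and word error rates are both edit distances normalized by the length of the reference. The following is a minimal sketch of that computation, not the project's evaluation code (which lives in the repositories linked above). It assumes whitespace tokenization for words and applies no text normalization, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance divided by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())
```

In a TTS setting the reference is the input text and the hypothesis is the ASR transcript of the synthesized audio, so lower values suggest more intelligible speech. Both rates can exceed 1.0 when the hypothesis contains many insertions.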