	---
	language:
	- ne
	license: cc-by-sa-4.0
	library_name: everyvoice
	tags:
	- text-to-speech
	- tts
	- everyvoice
	- fastspeech2
	- open-bible
	- nepali
	pipeline_tag: text-to-speech
	datasets:
	- davidguzmanr/open-bible-resources
	inference: false
	---

	# EveryVoice Open Bible — Nepali

	A multispeaker text-to-speech model for Nepali, trained from scratch on
	the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources)
	corpus using the [EveryVoice](https://github.com/EveryVoiceTTS/EveryVoice) TTS toolkit
	(FastSpeech2 acoustic model + HiFi-GAN vocoder, 22,050 Hz output).

	The model is conditioned on speaker embeddings learned during training. A speaker
	name from the training set must be supplied at inference time.

	## Files

	| File | Purpose |
	|------|---------|
	| `feature_prediction.ckpt` | Trained FastSpeech2 feature-prediction weights. |
	| `vocoder.ckpt` | HiFi-GAN vocoder checkpoint (optional; can be replaced with a universal vocoder). |
	| `config/` | EveryVoice YAML config files (shared data, text, feature-prediction, spec-to-wav). |
	| `filelist.psv` | Pipe-separated training filelist (`basename\|language\|speaker\|characters\|phones`). |
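
The column layout of `filelist.psv` can be read with Python's standard `csv` module. The following is a minimal sketch, not part of EveryVoice: the helper name `read_filelist` is hypothetical, the sample rows are made up, and it assumes the file begins with a header row and that no field contains a literal `|`.

```python
import csv
import io
from collections import Counter


def read_filelist(handle):
    """Parse a pipe-separated filelist into a list of dicts.

    Assumption: the first row is a header naming the columns.
    """
    reader = csv.DictReader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    return list(reader)


# In-memory stand-in for filelist.psv (made-up rows, not real training data).
sample = io.StringIO(
    "basename|language|speaker|characters|phones\n"
    "utt-0001|ne|speaker_a|नमस्ते|\n"
    "utt-0002|ne|speaker_b|धन्यवाद|\n"
    "utt-0003|ne|speaker_a|नमस्ते|\n"
)
rows = read_filelist(sample)

# Count how many utterances each speaker contributes.
utterances_per_speaker = Counter(row["speaker"] for row in rows)
print(utterances_per_speaker)
```

For the real file, open it with `encoding="utf-8"` and `newline=""` and pass the handle to the same helper. If the file turns out to have no header row, the column names can be supplied through the `fieldnames` argument of `csv.DictReader` instead.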

	## Intended use

	- Multispeaker TTS for Nepali using one of the training-set speaker voices.
	- Research on multilingual TTS, low-resource TTS evaluation, and listening
	  studies on Open Bible–style read speech.

	## How to use

	Install EveryVoice:

	```bash
	pip install everyvoice
	```

	Download the checkpoint and run inference:

	```python
	import torch
	from pathlib import Path
	from huggingface_hub import snapshot_download

	from everyvoice.config.type_definitions import DatasetTextRepresentation
	from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.synthesize import (
	    get_global_step,
	    synthesize_helper,
	)
	from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.model import FastSpeech2
	from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.type_definitions import (
	    SynthesizeOutputFormats,
	)
	from everyvoice.model.vocoder.HiFiGAN_iSTFT_lightning.hfgl.utils import (
	    load_hifigan_from_checkpoint,
	)
	from everyvoice.utils.heavy import get_device_from_accelerator

	repo_id = "multilingual-tts/EveryVoice-OpenBible-Nepali"
	local = Path(snapshot_download(repo_id))

	ckpt_path = local / "feature_prediction.ckpt"
	vocoder_path = local / "vocoder.ckpt"

	accelerator = "gpu" if torch.cuda.is_available() else "cpu"
	device = get_device_from_accelerator(accelerator)

	model = FastSpeech2.load_from_checkpoint(str(ckpt_path)).to(device)
	model.eval()
	global_step = get_global_step(ckpt_path)

	vocoder_ckpt = torch.load(str(vocoder_path), map_location=device, weights_only=True)
	vocoder_model, vocoder_config = load_hifigan_from_checkpoint(vocoder_ckpt, device)
	vocoder_global_step = get_global_step(vocoder_path)

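	# List the speakers stored in the checkpoint, then pick one (here, the first).
	print(f"Available speakers: {list(model.speaker2id.keys())}")
	speaker = next(iter(model.speaker2id.keys()))
	# The language code is read from the checkpoint in the same way.
	language = next(iter(model.lang2id.keys()))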

	filelist_data = [
	    {
	        "basename": "sample-0",
	        "characters": "...",  # text to synthesise in Nepali
	        "language": language,
	        "speaker": speaker,
	        "duration_control": 1.0,
	    }
	]

	output_dir = Path("everyvoice_output")
	output_dir.mkdir(exist_ok=True)

	synthesize_helper(
	    model=model,
	    texts=None,
	    style_reference=None,
	    language=None,
	    speaker=None,
	    duration_control=1.0,
	    global_step=global_step,
	    output_type=[SynthesizeOutputFormats.wav],
	    text_representation=DatasetTextRepresentation.characters,
	    accelerator=accelerator,
	    devices="auto",
	    device=device,
	    batch_size=1,
	    num_workers=1,
	    filelist=None,
	    filelist_data=filelist_data,
	    output_dir=output_dir,
	    teacher_forcing_directory=None,
	    vocoder_model=vocoder_model,
	    vocoder_config=vocoder_config,
	    vocoder_global_step=vocoder_global_step,
	)
	# Generated WAVs land in output_dir/wav/
	```
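
To synthesise several sentences in one call, `filelist_data` can hold one entry per sentence. The sketch below builds such a list with the same dict keys used in the example above. The helper name `build_filelist_data` is hypothetical and not part of EveryVoice, and the `"ne"` and `"speaker_a"` values are placeholders; in practice the language and speaker should come from `model.lang2id` and `model.speaker2id`.

```python
def build_filelist_data(texts, language, speaker, duration_control=1.0):
    """Return one synthesis entry per input text.

    Each entry uses the keys from the single-sentence example:
    basename, characters, language, speaker, duration_control.
    """
    return [
        {
            "basename": f"sample-{index}",
            "characters": text,
            "language": language,
            "speaker": speaker,
            "duration_control": duration_control,
        }
        for index, text in enumerate(texts)
    ]


# Placeholder language and speaker values, for illustration only.
entries = build_filelist_data(
    ["नमस्ते", "धन्यवाद"], language="ne", speaker="speaker_a"
)
for entry in entries:
    print(entry["basename"], entry["characters"])
```

The resulting list can be passed as `filelist_data=entries` in place of the single-entry list. Giving every entry a distinct `basename` should keep the generated files from overwriting each other, assuming output names are derived from it.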

	## Training data

	- Source: `davidguzmanr/open-bible-resources`, config `Nepali`
	- Size: approximately 20,423 utterances
	- Speakers: multispeaker; speaker identity is fixed to one of the training-set
	voices and selected by name at inference time
	- Sample rate: 22,050 Hz

	## Training procedure

	- Acoustic model: FastSpeech2 (non-autoregressive, duration-prediction based).
	- Vocoder: HiFi-GAN (iSTFT variant).
	- Character-level tokenizer built from the training transcripts.
	- Trained with the [EveryVoice](https://github.com/EveryVoiceTTS/EveryVoice) toolkit.

	Audio preprocessing and training are reproducible via the upstream
	[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repo.

	## Evaluation

	The model was evaluated alongside other Open Bible TTS systems on character and
	word error rate (computed with Meta's Omnilingual ASR) and on UTMOSv2 naturalness
	scores. See the
	[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repository
	for the evaluation pipeline and the
	[open-bible-surveys](https://github.com/davidguzmanr/open-bible-surveys) repository
	for the human-listening survey methodology.
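
For readers unfamiliar with the error-rate metrics, the sketch below shows how character and word error rate are commonly defined: the Levenshtein edit distance between the reference text and the ASR transcript, divided by the reference length. This is an illustrative implementation under that standard definition, not the code used in the evaluation pipeline, which may normalise text or tokenise differently. All function names here are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalised by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def cer(reference, hypothesis):
    # Compares Unicode code points; for Devanagari, a grapheme-cluster
    # split may be preferable, which this sketch does not attempt.
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    # Words are taken to be whitespace-separated tokens.
    return error_rate(reference.split(), hypothesis.split())


print(cer("kitten", "sitting"))
print(wer("the cat sat", "the cat sat down"))
```

Because the denominator is the reference length, both rates can exceed 1.0 when the transcript contains many insertions.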