---
language:
- ne
license: cc-by-sa-4.0
library_name: everyvoice
tags:
- text-to-speech
- tts
- everyvoice
- fastspeech2
- open-bible
- nepali
pipeline_tag: text-to-speech
datasets:
- davidguzmanr/open-bible-resources
inference: false
---

# EveryVoice Open Bible - Nepali

A multispeaker text-to-speech model for **Nepali**, trained from scratch on the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources) corpus using the [EveryVoice](https://github.com/EveryVoiceTTS/EveryVoice) TTS toolkit (FastSpeech2 acoustic model + HiFi-GAN vocoder, 22,050 Hz output).

The model is conditioned on speaker embeddings learned during training. A speaker name from the training set must be supplied at inference time.

## Files

| File | Purpose |
|------|---------|
| `feature_prediction.ckpt` | Trained FastSpeech2 feature-prediction weights. |
| `vocoder.ckpt` | HiFi-GAN vocoder checkpoint (optional; can be replaced with a universal vocoder). |
| `config/` | EveryVoice YAML config files (shared data, text, feature-prediction, spec-to-wav). |
| `filelist.psv` | Pipe-separated training filelist (`basename\|language\|speaker\|characters\|phones`). |

## Intended use

- Multispeaker TTS for Nepali using one of the training-set speaker voices.
- Research on multilingual TTS, low-resource TTS evaluation, and listening studies on Open Bible–style read-speech.

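Because a training-set speaker name has to be supplied at inference time, it can help to list the speakers recorded in `filelist.psv` first. The following is a minimal sketch that assumes the file starts with a header row using the column names given in the Files table; the sample rows and the helper name `speakers_in_filelist` are made up for illustration and are not part of EveryVoice.

```python
import csv
import io
from collections import Counter


def speakers_in_filelist(lines):
    """Count utterances per speaker in a pipe-separated filelist.

    Assumes the first row is a header containing a `speaker` column.
    """
    # QUOTE_NONE keeps quotation marks in the transcripts as literal text.
    reader = csv.DictReader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    return Counter(row["speaker"] for row in reader)


# In-memory stand-in for filelist.psv (hypothetical rows, not real data).
sample = io.StringIO(
    "basename|language|speaker|characters|phones\n"
    "utt-0001|ne|speaker_a|text one|\n"
    "utt-0002|ne|speaker_b|text two|\n"
    "utt-0003|ne|speaker_a|text three|\n"
)
counts = speakers_in_filelist(sample)
for name, n_utterances in sorted(counts.items()):
    print(f"{name}: {n_utterances} utterances")
```

To run this against the real file, pass an open file handle instead of the `StringIO` object, for example `open(path, encoding="utf-8", newline="")`, where `path` points at `filelist.psv` in the downloaded repository.
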
## How to use

Install EveryVoice:

```bash
pip install everyvoice
```

Download the checkpoint and run inference:

```python
import torch
from pathlib import Path
from huggingface_hub import snapshot_download

from everyvoice.config.type_definitions import DatasetTextRepresentation
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.synthesize import (
    get_global_step,
    synthesize_helper,
)
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.model import FastSpeech2
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.type_definitions import (
    SynthesizeOutputFormats,
)
from everyvoice.model.vocoder.HiFiGAN_iSTFT_lightning.hfgl.utils import (
    load_hifigan_from_checkpoint,
)
from everyvoice.utils.heavy import get_device_from_accelerator

# Download the model repository from the Hugging Face Hub
repo_id = "multilingual-tts/EveryVoice-OpenBible-Nepali"
local = Path(snapshot_download(repo_id))
ckpt_path = local / "feature_prediction.ckpt"
vocoder_path = local / "vocoder.ckpt"

accelerator = "gpu" if torch.cuda.is_available() else "cpu"
device = get_device_from_accelerator(accelerator)

# Load the FastSpeech2 acoustic model
model = FastSpeech2.load_from_checkpoint(str(ckpt_path)).to(device)
model.eval()
global_step = get_global_step(ckpt_path)

# Load the HiFi-GAN vocoder
vocoder_ckpt = torch.load(str(vocoder_path), map_location=device, weights_only=True)
vocoder_model, vocoder_config = load_hifigan_from_checkpoint(vocoder_ckpt, device)
vocoder_global_step = get_global_step(vocoder_path)

# Pick any speaker from the model
speaker = next(iter(model.speaker2id.keys()))
language = next(iter(model.lang2id.keys()))
print(f"Available speakers: {list(model.speaker2id.keys())}")

# One entry per utterance to synthesize
filelist_data = [
    {
        "basename": "sample-0",
        "characters": "...",  # text to synthesise in Nepali
        "language": language,
        "speaker": speaker,
        "duration_control": 1.0,
    }
]

output_dir = Path("everyvoice_output")
output_dir.mkdir(exist_ok=True)

synthesize_helper(
    model=model,
    texts=None,
    style_reference=None,
    language=None,
    speaker=None,
    duration_control=1.0,
    global_step=global_step,
    output_type=[SynthesizeOutputFormats.wav],
    text_representation=DatasetTextRepresentation.characters,
    accelerator=accelerator,
    devices="auto",
    device=device,
    batch_size=1,
    num_workers=1,
    filelist=None,
    filelist_data=filelist_data,
    output_dir=output_dir,
    teacher_forcing_directory=None,
    vocoder_model=vocoder_model,
    vocoder_config=vocoder_config,
    vocoder_global_step=vocoder_global_step,
)
# Generated WAVs land in output_dir/wav/
```

## Training data

- **Source:** `davidguzmanr/open-bible-resources`, config `Nepali`
- **Size:** approximately 20,423 utterances
- **Speakers:** multispeaker; speaker identity is fixed to one of the training-set voices and selected by name at inference time
- **Sample rate:** 22,050 Hz

## Training procedure

- Acoustic model: FastSpeech2 (non-autoregressive, duration-prediction based).
- Vocoder: HiFi-GAN (iSTFT variant).
- Character-level tokenizer built from the training transcripts.
- Trained with the [EveryVoice](https://github.com/EveryVoiceTTS/EveryVoice) toolkit.

Audio preprocessing and training are reproducible via the upstream [open-bible-models](https://github.com/davidguzmanr/open-bible-models) repo.

## Evaluation

The model was evaluated alongside other Open-Bible TTS systems on character/word error rate (via Meta's Omnilingual ASR) and UTMOSv2 naturalness scores. See the [open-bible-models](https://github.com/davidguzmanr/open-bible-models) repository for the evaluation pipeline and the [open-bible-surveys](https://github.com/davidguzmanr/open-bible-surveys) repository for the human-listening survey methodology.
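
For readers unfamiliar with the error-rate metrics named above: character and word error rate are both the edit distance between a reference transcript and an ASR transcript of the synthesized audio, divided by the reference length. The sketch below illustrates that definition only. It is not the evaluation pipeline from the open-bible-models repository, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalised by the number of reference tokens."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference, hypothesis):
    """Character error rate, computed over Unicode code points."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word error rate, computed over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


print(wer("the cat sat", "the dog sat"))
print(cer("abcd", "abed"))
```

Note that `cer` here counts Unicode code points. For Devanagari text a vowel sign or virama is a separate code point from its base consonant, so a real pipeline may normalise the text or count grapheme clusters instead.
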