---
language:
- ne
license: cc-by-sa-4.0
library_name: everyvoice
tags:
- text-to-speech
- tts
- everyvoice
- fastspeech2
- open-bible
- nepali
pipeline_tag: text-to-speech
datasets:
- davidguzmanr/open-bible-resources
inference: false
---
# EveryVoice Open Bible — Nepali
A multispeaker text-to-speech model for **Nepali**, trained from scratch on
the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources)
corpus using the [EveryVoice](https://github.com/EveryVoiceTTS/EveryVoice) TTS toolkit
(FastSpeech2 acoustic model + HiFi-GAN vocoder, 22,050 Hz output).
The model is conditioned on speaker embeddings learned during training. A speaker
name from the training set must be supplied at inference time.
## Files
| File | Purpose |
|------|---------|
| `feature_prediction.ckpt` | Trained FastSpeech2 feature-prediction weights. |
| `vocoder.ckpt` | HiFi-GAN vocoder checkpoint (optional; a universal vocoder can be used instead). |
| `config/` | EveryVoice YAML config files (shared data, text, feature-prediction, spec-to-wav). |
| `filelist.psv` | Pipe-separated training filelist (`basename\|language\|speaker\|characters\|phones`). |
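
As a rough illustration of the filelist layout, the sketch below parses pipe-separated rows with Python's standard `csv` module. It assumes the file starts with a header row naming the five columns listed above; the sample rows, speaker names, and the `ne` language code are made-up placeholders rather than entries from the real `filelist.psv`, and `read_filelist` is a hypothetical helper.

```python
import csv
import io

# Column order taken from the Files table; the rows are invented placeholders,
# and the presence of a header row is an assumption about the real file.
SAMPLE_PSV = (
    "basename|language|speaker|characters|phones\n"
    "utt-0001|ne|speaker_a|placeholder text one|\n"
    "utt-0002|ne|speaker_b|placeholder text two|\n"
)


def read_filelist(handle):
    """Return the rows of a pipe-separated filelist as a list of dicts."""
    # QUOTE_NONE keeps quotation marks in transcripts as literal characters.
    reader = csv.DictReader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_filelist(io.StringIO(SAMPLE_PSV))
speakers = sorted({row["speaker"] for row in rows})
print(speakers)
```

To read the downloaded file instead of the sample string, pass a handle opened with `encoding="utf-8"` and `newline=""`.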
## Intended use
- Multispeaker TTS for Nepali using one of the training-set speaker voices.
- Research on multilingual TTS, low-resource TTS evaluation, and listening
studies on Open Bible–style read speech.
## How to use
Install EveryVoice:
```bash
pip install everyvoice
```
Download the checkpoint and run inference:
```python
import torch
from pathlib import Path

from huggingface_hub import snapshot_download
from everyvoice.config.type_definitions import DatasetTextRepresentation
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.synthesize import (
    get_global_step,
    synthesize_helper,
)
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.model import FastSpeech2
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.type_definitions import (
    SynthesizeOutputFormats,
)
from everyvoice.model.vocoder.HiFiGAN_iSTFT_lightning.hfgl.utils import (
    load_hifigan_from_checkpoint,
)
from everyvoice.utils.heavy import get_device_from_accelerator

# Download the model repository and locate the two checkpoints
repo_id = "multilingual-tts/EveryVoice-OpenBible-Nepali"
local = Path(snapshot_download(repo_id))
ckpt_path = local / "feature_prediction.ckpt"
vocoder_path = local / "vocoder.ckpt"

accelerator = "gpu" if torch.cuda.is_available() else "cpu"
device = get_device_from_accelerator(accelerator)

# Load the FastSpeech2 acoustic model
model = FastSpeech2.load_from_checkpoint(str(ckpt_path)).to(device)
model.eval()
global_step = get_global_step(ckpt_path)

# Load the HiFi-GAN vocoder
vocoder_ckpt = torch.load(str(vocoder_path), map_location=device, weights_only=True)
vocoder_model, vocoder_config = load_hifigan_from_checkpoint(vocoder_ckpt, device)
vocoder_global_step = get_global_step(vocoder_path)

# Pick any speaker from the model (here, the first one listed)
speaker = next(iter(model.speaker2id.keys()))
language = next(iter(model.lang2id.keys()))
print(f"Available speakers: {list(model.speaker2id.keys())}")

filelist_data = [
    {
        "basename": "sample-0",
        "characters": "...",  # text to synthesise in Nepali
        "language": language,
        "speaker": speaker,
        "duration_control": 1.0,
    }
]

output_dir = Path("everyvoice_output")
output_dir.mkdir(exist_ok=True)

synthesize_helper(
    model=model,
    texts=None,
    style_reference=None,
    language=None,
    speaker=None,
    duration_control=1.0,
    global_step=global_step,
    output_type=[SynthesizeOutputFormats.wav],
    text_representation=DatasetTextRepresentation.characters,
    accelerator=accelerator,
    devices="auto",
    device=device,
    batch_size=1,
    num_workers=1,
    filelist=None,
    filelist_data=filelist_data,
    output_dir=output_dir,
    teacher_forcing_directory=None,
    vocoder_model=vocoder_model,
    vocoder_config=vocoder_config,
    vocoder_global_step=vocoder_global_step,
)
# Generated WAVs land in output_dir/wav/
```
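
To synthesise several sentences in one call, `filelist_data` can hold one entry per sentence. The sketch below builds such a list with the same keys as the example above and rejects speaker names that are not in the model. `build_filelist_data` is a hypothetical helper, not part of EveryVoice, and the `speaker2id` dict, speaker names, and `ne` language code are stand-ins for the values read from `model.speaker2id` and `model.lang2id`.

```python
def build_filelist_data(sentences, language, speaker, known_speakers,
                        duration_control=1.0, prefix="sample"):
    """Build one filelist entry per sentence, mirroring the keys used above."""
    if speaker not in known_speakers:
        raise ValueError(
            f"Unknown speaker {speaker!r}; choose one of {sorted(known_speakers)}"
        )
    return [
        {
            "basename": f"{prefix}-{i}",
            "characters": text,
            "language": language,
            "speaker": speaker,
            "duration_control": duration_control,
        }
        for i, text in enumerate(sentences)
    ]


# Stand-in for model.speaker2id; the real names come from the checkpoint.
speaker2id = {"speaker_a": 0, "speaker_b": 1}

entries = build_filelist_data(
    ["<first Nepali sentence>", "<second Nepali sentence>"],
    language="ne",  # assumed code; in practice use a key of model.lang2id
    speaker="speaker_a",
    known_speakers=speaker2id,
)
print([entry["basename"] for entry in entries])
```

The resulting list would be passed as `filelist_data=entries` in place of the single-entry list.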
## Training data
- **Source:** `davidguzmanr/open-bible-resources`, config `Nepali`
- **Size:** approximately 20,423 utterances
- **Speakers:** multiple speakers; synthesis is limited to the training-set
voices, one of which is selected by name at inference time
- **Sample rate:** 22,050 Hz
## Training procedure
- Acoustic model: FastSpeech2 (non-autoregressive, duration-prediction based).
- Vocoder: HiFi-GAN (iSTFT variant).
- Character-level tokenizer built from the training transcripts.
- Trained with the [EveryVoice](https://github.com/EveryVoiceTTS/EveryVoice) toolkit.
Audio preprocessing and training are reproducible via the upstream
[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repo.
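
For intuition about what a character-level tokenizer built from transcripts involves, here is a minimal sketch. It is not EveryVoice's text-processing code: the function names and the `<pad>` and `<unk>` special symbols are hypothetical, and the two sample words are illustrative rather than taken from the training set.

```python
def build_char_vocab(transcripts, specials=("<pad>", "<unk>")):
    """Map every distinct character in the transcripts to an integer id."""
    chars = sorted({ch for text in transcripts for ch in text})
    symbols = list(specials) + chars
    return {sym: idx for idx, sym in enumerate(symbols)}


def encode(text, vocab, unk="<unk>"):
    """Convert text to ids, falling back to the unknown symbol."""
    return [vocab.get(ch, vocab[unk]) for ch in text]


transcripts = ["नमस्ते", "नेपाल"]
vocab = build_char_vocab(transcripts)
ids = encode("नमस्ते", vocab)
print(len(vocab), ids)
```

In this sketch each Unicode code point is one symbol, so Devanagari vowel signs and the virama get their own ids instead of being merged with the consonant they attach to.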
## Evaluation
The model was evaluated alongside other Open Bible TTS systems on character and
word error rate (via Meta's Omnilingual ASR) and on UTMOSv2 naturalness scores. See the
[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repository
for the evaluation pipeline and the
[open-bible-surveys](https://github.com/davidguzmanr/open-bible-surveys) repository
for the human-listening survey methodology.
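
As a reference for the error-rate metrics, the sketch below computes character and word error rate from a reference transcript and an ASR hypothesis using Levenshtein distance. It only illustrates the standard definitions. The actual evaluation pipeline may normalise text or aggregate scores differently, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance divided by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


print(wer("a b c d", "a x c d"), cer("abcd", "abc"))
```

Here `reference` would be the input text given to the TTS model and `hypothesis` the ASR transcript of the synthesised audio.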