Add README for Hiligaynon
README.md
ADDED
|
@@ -0,0 +1,159 @@
|
| 1 |
+
---
|
| 2 |
+
language:
|
| 3 |
+
- hil
|
| 4 |
+
license: cc-by-sa-4.0
|
| 5 |
+
library_name: everyvoice
|
| 6 |
+
tags:
|
| 7 |
+
- text-to-speech
|
| 8 |
+
- tts
|
| 9 |
+
- everyvoice
|
| 10 |
+
- fastspeech2
|
| 11 |
+
- open-bible
|
| 12 |
+
- hiligaynon
|
| 13 |
+
pipeline_tag: text-to-speech
|
| 14 |
+
datasets:
|
| 15 |
+
- davidguzmanr/open-bible-resources
|
| 16 |
+
inference: false
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
# EveryVoice Open Bible — Hiligaynon
|
| 20 |
+
|
| 21 |
+
A multispeaker text-to-speech model for **Hiligaynon**, trained from scratch on
|
| 22 |
+
the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources)
|
| 23 |
+
corpus using the [EveryVoice](https://github.com/EveryVoiceTTS/EveryVoice) TTS toolkit
|
| 24 |
+
(FastSpeech2 acoustic model + HiFi-GAN vocoder, 22,050 Hz output).
|
| 25 |
+
|
| 26 |
+
The model is conditioned on speaker embeddings learned during training. A speaker
|
| 27 |
+
name from the training set must be supplied at inference time.
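Because the speaker name has to match a training-set name exactly, it can help to validate the requested name before synthesis. The sketch below assumes the speaker table is a plain name-to-index mapping, like the `speaker2id` attribute used in the usage example; the helper `resolve_speaker` and the speaker names are hypothetical and not part of EveryVoice.

```python
from typing import Dict, Optional


def resolve_speaker(speaker2id: Dict[str, int], requested: Optional[str] = None) -> str:
    """Return a valid speaker name, defaulting to the first one in the mapping."""
    if not speaker2id:
        raise ValueError("The model exposes no speakers.")
    if requested is None:
        return next(iter(speaker2id))
    if requested not in speaker2id:
        available = ", ".join(sorted(speaker2id))
        raise ValueError(f"Unknown speaker {requested!r}. Available: {available}")
    return requested


# Hypothetical speaker table; real names come from the checkpoint.
speakers = {"speaker_a": 0, "speaker_b": 1}
print(resolve_speaker(speakers))
print(resolve_speaker(speakers, "speaker_b"))
```

Failing early with the list of available names is usually easier to debug than an index error raised deep inside the model.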
|
| 28 |
+
|
| 29 |
+
## Files
|
| 30 |
+
|
| 31 |
+
| File | Purpose |
|
| 32 |
+
|------|---------|
|
| 33 |
+
| `feature_prediction.ckpt` | Trained FastSpeech2 feature-prediction weights. |
|
| 34 |
+
| `vocoder.ckpt` | HiFi-GAN vocoder checkpoint (optional — can be replaced with a universal vocoder). |
|
| 35 |
+
| `config/` | EveryVoice YAML config files (shared data, text, feature-prediction, spec-to-wav). |
|
| 36 |
+
| `filelist.psv` | Pipe-separated training filelist (`basename\|language\|speaker\|characters\|phones`). |
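As a rough illustration of the filelist layout, the sketch below parses pipe-separated rows with the standard `csv` module. The column names come from the table above; whether the real file starts with a header row is an assumption (the parser accepts either case), and the sample row is made up for illustration rather than taken from the dataset.

```python
import csv
import io

FIELDS = ["basename", "language", "speaker", "characters", "phones"]


def read_filelist(handle):
    """Parse a pipe-separated filelist into a list of dicts keyed by FIELDS."""
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        # Skip blank lines and an optional header row.
        if not row or row == FIELDS:
            continue
        rows.append(dict(zip(FIELDS, row)))
    return rows


# Made-up sample standing in for filelist.psv (values are illustrative only).
sample = io.StringIO(
    "basename|language|speaker|characters|phones\n"
    "utt-0001|hil|speaker_a|maayong aga|m a a y o n g a g a\n"
)
entries = read_filelist(sample)
print(entries[0]["speaker"], "->", entries[0]["characters"])
```

For the real file, something like `open(path, newline="", encoding="utf-8")` can be passed as `handle`.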
|
| 37 |
+
|
| 38 |
+
## Intended use
|
| 39 |
+
|
| 40 |
+
- Multispeaker TTS for Hiligaynon using one of the training-set speaker voices.
|
| 41 |
+
- Research on multilingual TTS, low-resource TTS evaluation, and listening
|
| 42 |
+
studies on Open Bible–style read-speech.
|
| 43 |
+
|
| 44 |
+
## How to use
|
| 45 |
+
|
| 46 |
+
Install EveryVoice:
|
| 47 |
+
|
| 48 |
+
```bash
|
| 49 |
+
pip install everyvoice
|
| 50 |
+
```
|
| 51 |
+
|
| 52 |
+
Download the checkpoint and run inference:
|
| 53 |
+
|
| 54 |
+
```python
|
| 55 |
+
import torch
|
| 56 |
+
from pathlib import Path
|
| 57 |
+
from huggingface_hub import snapshot_download
|
| 58 |
+
|
| 59 |
+
from everyvoice.config.type_definitions import DatasetTextRepresentation
|
| 60 |
+
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.synthesize import (
|
| 61 |
+
get_global_step,
|
| 62 |
+
synthesize_helper,
|
| 63 |
+
)
|
| 64 |
+
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.model import FastSpeech2
|
| 65 |
+
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.type_definitions import (
|
| 66 |
+
SynthesizeOutputFormats,
|
| 67 |
+
)
|
| 68 |
+
from everyvoice.model.vocoder.HiFiGAN_iSTFT_lightning.hfgl.utils import (
|
| 69 |
+
load_hifigan_from_checkpoint,
|
| 70 |
+
)
|
| 71 |
+
from everyvoice.utils.heavy import get_device_from_accelerator
|
| 72 |
+
|
| 73 |
+
repo_id = "multilingual-tts/EveryVoice-OpenBible-Hiligaynon"
|
| 74 |
+
local = Path(snapshot_download(repo_id))
|
| 75 |
+
|
| 76 |
+
ckpt_path = local / "feature_prediction.ckpt"
|
| 77 |
+
vocoder_path = local / "vocoder.ckpt"
|
| 78 |
+
|
| 79 |
+
accelerator = "gpu" if torch.cuda.is_available() else "cpu"
|
| 80 |
+
device = get_device_from_accelerator(accelerator)
|
| 81 |
+
|
| 82 |
+
model = FastSpeech2.load_from_checkpoint(str(ckpt_path)).to(device)
|
| 83 |
+
model.eval()
|
| 84 |
+
global_step = get_global_step(ckpt_path)
|
| 85 |
+
|
| 86 |
+
vocoder_ckpt = torch.load(str(vocoder_path), map_location=device, weights_only=True)
|
| 87 |
+
vocoder_model, vocoder_config = load_hifigan_from_checkpoint(vocoder_ckpt, device)
|
| 88 |
+
vocoder_global_step = get_global_step(vocoder_path)
|
| 89 |
+
|
| 90 |
+
# Use the first speaker and language known to the model (any listed speaker works)
|
| 91 |
+
speaker = next(iter(model.speaker2id.keys()))
|
| 92 |
+
language = next(iter(model.lang2id.keys()))
|
| 93 |
+
print(f"Available speakers: {list(model.speaker2id.keys())}")
|
| 94 |
+
|
| 95 |
+
filelist_data = [
|
| 96 |
+
{
|
| 97 |
+
"basename": "sample-0",
|
| 98 |
+
"characters": "...", # text to synthesise in Hiligaynon
|
| 99 |
+
"language": language,
|
| 100 |
+
"speaker": speaker,
|
| 101 |
+
"duration_control": 1.0,
|
| 102 |
+
}
|
| 103 |
+
]
|
| 104 |
+
|
| 105 |
+
output_dir = Path("everyvoice_output")
|
| 106 |
+
output_dir.mkdir(exist_ok=True)
|
| 107 |
+
|
| 108 |
+
synthesize_helper(
|
| 109 |
+
model=model,
|
| 110 |
+
texts=None,
|
| 111 |
+
style_reference=None,
|
| 112 |
+
language=None,
|
| 113 |
+
speaker=None,
|
| 114 |
+
duration_control=1.0,
|
| 115 |
+
global_step=global_step,
|
| 116 |
+
output_type=[SynthesizeOutputFormats.wav],
|
| 117 |
+
text_representation=DatasetTextRepresentation.characters,
|
| 118 |
+
accelerator=accelerator,
|
| 119 |
+
devices="auto",
|
| 120 |
+
device=device,
|
| 121 |
+
batch_size=1,
|
| 122 |
+
num_workers=1,
|
| 123 |
+
filelist=None,
|
| 124 |
+
filelist_data=filelist_data,
|
| 125 |
+
output_dir=output_dir,
|
| 126 |
+
teacher_forcing_directory=None,
|
| 127 |
+
vocoder_model=vocoder_model,
|
| 128 |
+
vocoder_config=vocoder_config,
|
| 129 |
+
vocoder_global_step=vocoder_global_step,
|
| 130 |
+
)
|
| 131 |
+
# Generated WAVs land in output_dir/wav/
|
| 132 |
+
```
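To sanity-check the result, the generated files can be inspected with Python's standard `wave` module. This is only a sketch: it assumes the files are integer PCM WAVs (the `wave` module may reject other encodings) and that they sit in `everyvoice_output/wav/` as noted in the comment above; `wav_info` is a hypothetical helper, not an EveryVoice function.

```python
import wave
from pathlib import Path


def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


# Directory name taken from the example above; adjust if output_dir was changed.
for wav_path in sorted(Path("everyvoice_output/wav").glob("*.wav")):
    rate, seconds = wav_info(wav_path)
    print(f"{wav_path.name}: {rate} Hz, {seconds:.2f} s")
```

A reported rate other than 22,050 Hz would suggest that a vocoder with a different configuration was used.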
|
| 133 |
+
|
| 134 |
+
## Training data
|
| 135 |
+
|
| 136 |
+
- **Source:** `davidguzmanr/open-bible-resources`, config `Hiligaynon`
|
| 137 |
+
- **Size:** approximately 18,573 utterances
|
| 138 |
+
- **Speakers:** multispeaker; speaker identity is fixed to one of the training-set
|
| 139 |
+
voices and selected by name at inference time
|
| 140 |
+
- **Sample rate:** 22,050 Hz
|
| 141 |
+
|
| 142 |
+
## Training procedure
|
| 143 |
+
|
| 144 |
+
- Acoustic model: FastSpeech2 (non-autoregressive, duration-prediction based).
|
| 145 |
+
- Vocoder: HiFi-GAN (iSTFT variant).
|
| 146 |
+
- Character-level tokenizer built from the training transcripts.
|
| 147 |
+
- Trained with the [EveryVoice](https://github.com/EveryVoiceTTS/EveryVoice) toolkit.
|
| 148 |
+
|
| 149 |
+
Audio preprocessing and training are reproducible via the upstream
|
| 150 |
+
[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repo.
|
| 151 |
+
|
| 152 |
+
## Evaluation
|
| 153 |
+
|
| 154 |
+
The model was evaluated alongside other Open Bible TTS systems on character/word error rate
|
| 155 |
+
(via Meta's Omnilingual ASR) and UTMOSv2 naturalness scores. See the
|
| 156 |
+
[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repository
|
| 157 |
+
for the evaluation pipeline and the
|
| 158 |
+
[open-bible-surveys](https://github.com/davidguzmanr/open-bible-surveys) repository
|
| 159 |
+
for the human-listening survey methodology.
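For readers unfamiliar with the error-rate metrics, the sketch below shows the usual definition: the edit distance between the reference text and the ASR transcript of the synthesised audio, divided by the reference length, counted in characters (CER) or words (WER). It is not the open-bible-models pipeline, which may normalise text differently; the function names and toy strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalised by the reference length."""
    if not ref_units:
        raise ValueError("Reference must not be empty.")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character error rate."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word error rate (whitespace tokenisation)."""
    return error_rate(reference.split(), hypothesis.split())


print(cer("abcd", "abed"))                    # one substitution over four characters
print(wer("the cat sat", "the cat sat down")) # one insertion over three words
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, so they are not strictly percentages.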
|