| --- |
| language: |
| - ha |
| license: cc-by-sa-4.0 |
| library_name: everyvoice |
| tags: |
| - text-to-speech |
| - tts |
| - everyvoice |
| - fastspeech2 |
| - open-bible |
| - hausa |
| pipeline_tag: text-to-speech |
| datasets: |
| - davidguzmanr/open-bible-resources |
| inference: false |
| --- |
| |
| # EveryVoice Open Bible — Hausa |
|
|
| A multispeaker text-to-speech model for **Hausa**, trained from scratch on |
| the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources) |
| corpus using the [EveryVoice](https://github.com/EveryVoiceTTS/EveryVoice) TTS toolkit |
| (FastSpeech2 acoustic model + HiFi-GAN vocoder, 22,050 Hz output). |
|
|
| The model is conditioned on speaker embeddings learned during training. A speaker |
| name from the training set must be supplied at inference time. |
|
|
| ## Files |
|
|
| | File | Purpose | |
| |------|---------| |
| | `feature_prediction.ckpt` | Trained FastSpeech2 feature-prediction weights. | |
| | `vocoder.ckpt` | HiFi-GAN vocoder checkpoint (optional; a universal vocoder can be used instead). | |
| | `config/` | EveryVoice YAML config files (shared data, text, feature-prediction, spec-to-wav). | |
| | `filelist.psv` | Pipe-separated training filelist with columns `basename\|language\|speaker\|characters\|phones`. | |
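
As a rough illustration of the `filelist.psv` layout, the sketch below parses pipe-separated rows with Python's standard `csv` module. It assumes the file starts with a header row naming the five columns listed above and that no field contains a literal `|`; neither assumption is confirmed by this card, and the sample row is made up.

```python
import csv
import io

# Hypothetical sample; the real filelist.psv ships with the checkpoint.
SAMPLE = (
    "basename|language|speaker|characters|phones\n"
    "sample-0|ha|speaker_a|sannu|sannu\n"
)


def read_filelist(handle):
    """Return one dict per utterance, keyed by the header row's column names."""
    # QUOTE_NONE keeps quotation marks in transcripts as ordinary characters.
    reader = csv.DictReader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_filelist(io.StringIO(SAMPLE))
speakers = sorted({row["speaker"] for row in rows})
print(speakers)
```

To read the downloaded file instead of the sample, pass a handle opened with `encoding="utf-8"` and `newline=""` to `read_filelist`.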
|
|
| ## Intended use |
|
|
| - Multispeaker TTS for Hausa using one of the training-set speaker voices. |
| - Research on multilingual TTS, low-resource TTS evaluation, and listening |
| studies on Open Bible–style read speech. |
|
|
| ## How to use |
|
|
| Install EveryVoice and the Hugging Face Hub client used to download the files: |
|
|
| ```bash |
| pip install everyvoice huggingface_hub |
| ``` |
|
|
| Download the repository files, load both checkpoints, and run inference: |
|
|
| ```python |
| import torch |
| from pathlib import Path |
| from huggingface_hub import snapshot_download |
| |
| from everyvoice.config.type_definitions import DatasetTextRepresentation |
| from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.synthesize import ( |
| get_global_step, |
| synthesize_helper, |
| ) |
| from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.model import FastSpeech2 |
| from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.type_definitions import ( |
| SynthesizeOutputFormats, |
| ) |
| from everyvoice.model.vocoder.HiFiGAN_iSTFT_lightning.hfgl.utils import ( |
| load_hifigan_from_checkpoint, |
| ) |
| from everyvoice.utils.heavy import get_device_from_accelerator |
| |
| repo_id = "multilingual-tts/EveryVoice-OpenBible-Hausa" |
| local = Path(snapshot_download(repo_id)) |
| |
| ckpt_path = local / "feature_prediction.ckpt" |
| vocoder_path = local / "vocoder.ckpt" |
| |
| accelerator = "gpu" if torch.cuda.is_available() else "cpu" |
| device = get_device_from_accelerator(accelerator) |
| |
| model = FastSpeech2.load_from_checkpoint(str(ckpt_path)).to(device) |
| model.eval() |
| global_step = get_global_step(ckpt_path) |
| |
| vocoder_ckpt = torch.load(str(vocoder_path), map_location=device, weights_only=True) |
| vocoder_model, vocoder_config = load_hifigan_from_checkpoint(vocoder_ckpt, device) |
| vocoder_global_step = get_global_step(vocoder_path) |
| |
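| # List the speakers the model knows, then use the first speaker and language. |
| # Replace `speaker` with any name from the printed list to change the voice. |
| print(f"Available speakers: {list(model.speaker2id.keys())}") |
| speaker = next(iter(model.speaker2id.keys())) |
| language = next(iter(model.lang2id.keys())) |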
| |
| filelist_data = [ |
| { |
| "basename": "sample-0", |
| "characters": "...", # text to synthesise in Hausa |
| "language": language, |
| "speaker": speaker, |
| "duration_control": 1.0, |
| } |
| ] |
| |
| output_dir = Path("everyvoice_output") |
| output_dir.mkdir(exist_ok=True) |
| |
| synthesize_helper( |
| model=model, |
| texts=None, |
| style_reference=None, |
| language=None, |
| speaker=None, |
| duration_control=1.0, |
| global_step=global_step, |
| output_type=[SynthesizeOutputFormats.wav], |
| text_representation=DatasetTextRepresentation.characters, |
| accelerator=accelerator, |
| devices="auto", |
| device=device, |
| batch_size=1, |
| num_workers=1, |
| filelist=None, |
| filelist_data=filelist_data, |
| output_dir=output_dir, |
| teacher_forcing_directory=None, |
| vocoder_model=vocoder_model, |
| vocoder_config=vocoder_config, |
| vocoder_global_step=vocoder_global_step, |
| ) |
| # Generated WAVs land in output_dir/wav/ |
| ``` |
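
To synthesise several sentences in one call, `filelist_data` can hold one dictionary per utterance. The helper below is a hypothetical convenience, not part of EveryVoice; it reuses the keys from the example above and rejects speaker names missing from a `speaker2id`-style mapping. The mapping, language code, and texts in the usage lines are placeholders.

```python
def build_filelist_data(texts, speaker, language, speaker2id, duration_control=1.0):
    """Build one synthesis entry per text, validating the speaker name first."""
    if speaker not in speaker2id:
        known = ", ".join(sorted(speaker2id))
        raise ValueError(f"Unknown speaker {speaker!r}; choose one of: {known}")
    return [
        {
            "basename": f"sample-{index}",
            "characters": text,
            "language": language,
            "speaker": speaker,
            "duration_control": duration_control,
        }
        for index, text in enumerate(texts)
    ]


# Placeholder mapping; with a loaded model, pass model.speaker2id instead.
demo_speaker2id = {"speaker_a": 0, "speaker_b": 1}
entries = build_filelist_data(["...", "..."], "speaker_b", "ha", demo_speaker2id)
print([entry["basename"] for entry in entries])
```

The returned list can be passed as the `filelist_data` argument shown earlier, with the language taken from `model.lang2id` rather than the placeholder used here.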
|
|
| ## Training data |
|
|
| - **Source:** `davidguzmanr/open-bible-resources`, config `Hausa` |
| - **Size:** approximately 22,825 utterances |
| - **Speakers:** multispeaker; speaker identity is fixed to one of the training-set |
| voices and selected by name at inference time |
| - **Sample rate:** 22,050 Hz |
|
|
| ## Training procedure |
|
|
| - Acoustic model: FastSpeech2 (non-autoregressive, duration-prediction based). |
| - Vocoder: HiFi-GAN (iSTFT variant). |
| - Character-level tokenizer built from the training transcripts. |
| - Trained with the [EveryVoice](https://github.com/EveryVoiceTTS/EveryVoice) toolkit. |
|
|
| Audio preprocessing and training are reproducible via the upstream |
| [open-bible-models](https://github.com/davidguzmanr/open-bible-models) repo. |
|
|
| ## Evaluation |
|
|
| The model was evaluated alongside other Open Bible TTS systems using character and |
| word error rate (computed from transcripts produced by Meta's Omnilingual ASR) and |
| UTMOSv2 naturalness scores. See the |
| [open-bible-models](https://github.com/davidguzmanr/open-bible-models) repository |
| for the evaluation pipeline and the |
| [open-bible-surveys](https://github.com/davidguzmanr/open-bible-surveys) repository |
| for the human-listening survey methodology. |
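
For readers unfamiliar with the error-rate metrics, the sketch below computes character and word error rate as edit distance divided by reference length. It is a generic, standard-library illustration, not the evaluation pipeline from the repositories above, which may normalise text differently; the function names are chosen here for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Character edits divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Word edits divided by the number of reference words."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


# One missing character out of five reference characters.
cer = character_error_rate("sannu", "sanu")
wer = word_error_rate("ina kwana", "ina kwana")
print(cer, wer)
```

Here the reference is the input text given to the TTS model and the hypothesis is the ASR transcript of the synthesised audio, so lower values indicate more intelligible speech.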
|
|