multilingual-tts/VITS-OpenBible-Ewe

davidguzmanr committed 5 days ago

Commit b1e7c8f · verified · 1 Parent(s): f46b8d0

Add README for Ewe

Files changed (1)

README.md +4 -3

@@ -20,7 +20,7 @@ inference: false
 A multispeaker text-to-speech model for **Ewe**, trained from scratch on
 the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources)
 corpus using the [VITS](https://arxiv.org/abs/2106.06103) architecture
-(end-to-end TTS with adversarial learning, 22 kHz output) via the
+(end-to-end TTS with adversarial learning, 22,050 Hz output) via the
 [Coqui TTS](https://github.com/coqui-ai/TTS) framework.
 Unlike zero-shot TTS models, VITS is conditioned on speaker embeddings learned
@@ -93,14 +93,15 @@ wav = synthesizer.tts(
 - **Size:** approximately 22,195 utterances
 - **Speakers:** multispeaker; speaker identity is fixed to one of the training-set
   voices and selected by name at inference time
-- **Sample rate:** 22 kHz
+- **Sample rate:** 22,050 Hz
 ## Training procedure
 - Architecture: VITS (Conditional Variational Autoencoder + adversarial training).
 - Grapheme-level tokenizer, built from the training transcripts.
 - Optimizer: AdamW, learning rate 2e-4.
-- Training budget: 250,000 steps.
+- Training budget: 500,000 optimizer updates on 2 GPUs with mixed precision
+  (bf16).
 Audio preprocessing and training are reproducible via the upstream
 [open-bible-models](https://github.com/davidguzmanr/open-bible-models) repo.
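
The change from "22 kHz" to "22,050 Hz" matters to anyone writing the synthesized samples to a file: if the header is stamped with a rounded 22,000 Hz, the same samples play back slightly slower and longer. The sketch below is not part of this commit; it illustrates the effect using only Python's standard `wave` module. The helper names are hypothetical, and mono 16-bit PCM is an assumption the README does not state.

```python
import io
import wave

SAMPLE_RATE_HZ = 22_050   # value stated in the updated README
ROUNDED_RATE_HZ = 22_000  # what "22 kHz" suggests if taken literally


def pcm16_wav_bytes(num_samples: int, sample_rate: int) -> bytes:
    """Build an in-memory mono 16-bit WAV of silence (assumed format)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * num_samples)
    return buffer.getvalue()


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration implied by the header: frame count / declared frame rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Ten seconds' worth of samples at the model's stated output rate.
num_samples = 10 * SAMPLE_RATE_HZ

correct = wav_duration_seconds(pcm16_wav_bytes(num_samples, SAMPLE_RATE_HZ))
mislabeled = wav_duration_seconds(pcm16_wav_bytes(num_samples, ROUNDED_RATE_HZ))

print(f"declared 22,050 Hz: {correct:.4f} s")
print(f"declared 22,000 Hz: {mislabeled:.4f} s")
```

With the rounded rate the file claims to be about 0.23% longer than it should be, which also lowers the pitch by the same ratio on playback.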