Tags: Text-to-Speech, Ewe, coqui-tts, tts, vits, open-bible, ewe
Commit b1e7c8f (verified) · committed by davidguzmanr · 1 Parent(s): f46b8d0

Add README for Ewe

Files changed (1): README.md (+4 -3)
@@ -20,7 +20,7 @@ inference: false
  A multispeaker text-to-speech model for **Ewe**, trained from scratch on
  the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources)
  corpus using the [VITS](https://arxiv.org/abs/2106.06103) architecture
- (end-to-end TTS with adversarial learning, 22 kHz output) via the
+ (end-to-end TTS with adversarial learning, 22,050 Hz output) via the
  [Coqui TTS](https://github.com/coqui-ai/TTS) framework.

  Unlike zero-shot TTS models, VITS is conditioned on speaker embeddings learned
@@ -93,14 +93,15 @@ wav = synthesizer.tts(
  - **Size:** approximately 22,195 utterances
  - **Speakers:** multispeaker; speaker identity is fixed to one of the training-set
  voices and selected by name at inference time
- - **Sample rate:** 22 kHz
+ - **Sample rate:** 22,050 Hz

  ## Training procedure

  - Architecture: VITS (Conditional Variational Autoencoder + adversarial training).
  - Grapheme-level tokenizer, built from the training transcripts.
  - Optimizer: AdamW, learning rate 2e-4.
- - Training budget: 250,000 steps.
+ - Training budget: 500,000 optimizer updates on 2 GPUs with mixed precision
+ (bf16).

  Audio preprocessing and training are reproducible via the upstream
  [open-bible-models](https://github.com/davidguzmanr/open-bible-models) repo.
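
Note on the sample-rate change: "22 kHz" is only a nominal figure, and audio saved or resampled at 22,000 Hz instead of the exact 22,050 Hz plays back slightly off in speed and pitch. The following is a minimal sketch of writing samples with the exact rate, using only the Python standard library. The `write_wav` helper and the sine tone standing in for model output are hypothetical illustrations, not part of this repository or of Coqui TTS.

```python
import io
import math
import struct
import wave

SAMPLE_RATE_HZ = 22_050  # exact rate stated in the updated README


def write_wav(samples, sample_rate=SAMPLE_RATE_HZ):
    """Encode floats in [-1, 1] as 16-bit mono PCM WAV bytes (hypothetical helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)  # stored in the WAV header
        pcm = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wav_file.writeframes(pcm)
    return buf.getvalue()


# Stand-in for synthesized audio: one second of a 440 Hz tone.
tone = [
    0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ)
    for n in range(SAMPLE_RATE_HZ)
]
wav_bytes = write_wav(tone)

# Read the header back: the duration is only correct when the stored
# rate matches the rate the samples were generated at.
with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
    frame_rate = reader.getframerate()
    duration_s = reader.getnframes() / frame_rate
```

With the rate stored correctly, the 22,050 samples read back as exactly one second of audio. Labelling the same samples as 22,000 Hz would stretch the duration and lower the pitch.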