Questions about finetuning in another language
Amazing work on this finetune.
I am a student trying to finetune the model for my own regional language. Could you share how much data you used (total duration of the audio files) and in which language the transcriptions were?
Also, did you use the voice of a single speaker or multiple speakers?
Any insight on this process would help me a lot.
Thank you.
Hello, thanks for reaching out.
We used a single-speaker voice and a mix of synthetic & public datasets in two stages:
(1) Synthetic data to teach the model to map text to Uzbek speech (1 epoch of 50K samples of 10-30 second synthetic speech)
(2) Public data to teach the model to speak like a human (4 epochs, 50K samples each, of 10-30 second natural speech)
You will see signs of success or failure within the first ~10% of the data I used (more or less, depending on how close your target language is to English/Chinese).
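If it helps to picture the schedule, here is a minimal sketch of the two stages using the Hugging Face Trainer. The model and dataset names are placeholders, and it assumes your audio has already been preprocessed into model-ready token sequences ("input_ids"/"labels"); the exact preprocessing depends on which base model you are finetuning, so treat this as a rough outline rather than the actual training script.

```python
# Rough sketch of the two-stage finetuning schedule described above.
# Placeholders/assumptions: a causal-LM-style TTS base model, and datasets
# already tokenized into "input_ids"/"labels" columns the model can consume.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset

BASE_MODEL = "your-base-tts-model"          # placeholder
SYNTHETIC_DS = "your-org/synthetic-speech"  # placeholder, ~50K 10-30s clips
NATURAL_DS = "your-org/natural-speech"      # placeholder, ~50K 10-30s clips

def run_stage(model, dataset, epochs, output_dir):
    """Train one stage on the given dataset and return the updated model."""
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        logging_steps=50,
        save_strategy="epoch",
    )
    trainer = Trainer(model=model, args=args, train_dataset=dataset)
    trainer.train()
    return trainer.model

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Stage 1: synthetic data, 1 epoch -> teach the text-to-speech mapping
synthetic = load_dataset(SYNTHETIC_DS, split="train")
model = run_stage(model, synthetic, epochs=1, output_dir="stage1-synthetic")

# Stage 2: public/natural data, 4 epochs -> teach natural, human-like speech
natural = load_dataset(NATURAL_DS, split="train")
model = run_stage(model, natural, epochs=4, output_dir="stage2-natural")

model.save_pretrained("finetuned-tts")
```

The key point the sketch tries to capture is just the ordering: a short synthetic pass first to establish the text-to-speech mapping for the new language, then several epochs on natural speech so the model picks up human prosody.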