training dataset
Hi!
Just wondering if this model was trained on this dataset? https://huggingface.co/datasets/patriotyk/filatov_24000
Thanks!
Yes, it is.
Are you able to share how many epochs in total? I am looking to do some training on StyleTTS2, and your repository is a good example. Training on such a big dataset as filatov_24000 would probably be too expensive for me, though.
I don't remember; it was done in several steps: first on 2x RTX 3090, then on one A100 on RunPod. You can save resources by removing the encoder from training and then merging with my multilingual model. The encoder requires a lot of memory in the last stages.
I am slightly new in this area, though with a good background.
Right now I am running the first stage of training on an A100 with 200 training samples and 200 epochs, just to test what the end quality would be. The price is quite reasonable so far, but it would not be reasonable with more data. That's why your approach sounds interesting; I simply lack the knowledge to do this, but would love to learn. My wife and son are Ukrainian, so I want to make a tool for them where they can have different voices reading input text that they provide (just for fun, and for learning purposes for me).
I wonder if you could help me with this task? Not for free, of course.
I was planning on taking a deeper course about these things, but sitting with a "teacher" and doing hands-on work would be more beneficial for me.
Oh sorry, no, I don't have enough time for that. For you, it would be easier to just use the multispeaker model; there you can create custom voices.
I see :) It was worth asking anyway. Is it possible for me to fine-tune your multispeaker model with different voices? I.e., run this script: https://github.com/yl4579/StyleTTS2/blob/main/train_finetune.py
Or would I need the .pth file for that, plus more information about how the original config.yml looked during training?
| You can save resources by just removing encoder from training and then merge with my multilingual
Wouldn't that require the .pth file, which is not uploaded in the model repository?
You cannot, because this checkpoint doesn't include the discriminators and aligner. My decision was not to provide the full checkpoint to everyone.
OK, I see. Is this part possible without having the .pth?
| You can save resources by just removing encoder from training and then merge with my multilingual. Encoder on last stages requires a lot of memory.
Yes, you can train the model yourself but without the encoder (that is just the vocoder), and then you can copy the already trained encoder into your weights.
Okay, I see now that I can grab these weights from the .bin file and use them. Just to clarify: by encoder, do you mean text_encoder? Thanks!
Oh, I am sorry: decoder, not encoder.
Do you know if it will work to fine-tune on a different voice if I use your decoder and don't train my own?
Yes, it will work. The decoder just generates sound from mels.
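The "copy the trained decoder into your weights" step could be sketched like this. This is a minimal, hypothetical example: the `decoder.` key prefix is an assumption about how the StyleTTS2 state dict is laid out, and plain dicts stand in for the real checkpoints you would get from `torch.load(..., map_location='cpu')`.

```python
def merge_decoder(my_weights, donor_weights, prefix='decoder.'):
    """Copy every decoder parameter from the donor checkpoint into ours,
    leaving all other modules (text encoder, predictor, ...) untouched."""
    merged = dict(my_weights)
    for key, value in donor_weights.items():
        if key.startswith(prefix):
            merged[key] = value
    return merged

# Toy state dicts standing in for real loaded checkpoints
my_weights = {'decoder.conv.weight': [0.0], 'text_encoder.lstm.weight': [1.0]}
donor_weights = {'decoder.conv.weight': [5.0], 'predictor.lstm.weight': [9.0]}

merged = merge_decoder(my_weights, donor_weights)
```

With real checkpoints the merged dict would then be written back with `torch.save`, and keys that don't start with the prefix (like the donor's predictor above) would simply be ignored.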
I might be missing something here, but does it really work to remove the decoder from training, considering this code? https://github.com/yl4579/StyleTTS2/blob/main/train_first.py#L255-L264
The discriminator loss depends on the output of the decoder, but if the decoder is removed from training, then the discriminator loss cannot be calculated anymore, which makes the discriminator training redundant. Am I misunderstanding something here?
y_rec = model.decoder(en, F0_real, real_norm, s)

# discriminator loss
if epoch >= TMA_epoch:
    optimizer.zero_grad()
    d_loss = dl(wav.detach().unsqueeze(1).float(), y_rec.detach()).mean()
    accelerator.backward(d_loss)
    optimizer.step('msd')
    optimizer.step('mpd')
else:
    d_loss = 0
Discriminators are needed only for decoder training; you can comment out this code.
I think I did all the steps right: I trained without the decoder and later merged everything (except the decoder) into your model. I can hear the person speaking, but there is also this background "cracking" noise. Do you know why that might be?
Could you show me the generated audio? And one sample from the dataset.
Here is the generated file:
Sample from dataset:
utt_022293.wav
A few notes:
The dataset I trained on without the decoder consists of 200 random wavs from the Filatov dataset.
I also trained on the same dataset with the decoder, and with my own trained decoder the quality is actually quite good even with 200 items (bad intonation, no pauses, etc., but the sound quality is good, no cracks). Note that this is single-speaker training for now.
My way of using the decoder from https://huggingface.co/patriotyk/styletts2_ukrainian_single is the following:
import os

import soundfile
from nltk.tokenize import word_tokenize
import phonemizer

# Set the espeak library path through an environment variable
os.environ['PHONEMIZER_ESPEAK_LIBRARY'] = '/opt/homebrew/Cellar/espeak-ng/1.52.0/lib/libespeak-ng.dylib'
global_phonemizer = phonemizer.backend.EspeakBackend(
    language='uk', preserve_punctuation=True, with_stress=True, words_mismatch='ignore'
)

from styletts2_inference.models import StyleTTS2

model = StyleTTS2(
    weights_path='model-without-decoder.bin',
    config_path='small-dataset-with-decoder.yml',
)
model.load_decoder_from_model('patriotyk_styletts2_ukrainian_single.bin')

text = 'Решта окупантів звернула на Вокзальну — центральну вулицю Бучі. Тільки уявіть їхній настрій, коли перед ними відкрилася ця пасторальна картина! Невеличкі котеджі й просторіші будинки шикуються обабіч, перед ними вивищуються голі липи та електро-стовпи, тягнуться газони й жовто-чорні бордюри. Доглянуті сади визирають із-поза зелених парканів, гавкотять собаки, співають птахи…'
ps = global_phonemizer.phonemize([text])
ps = ' '.join(word_tokenize(ps[0]))
tokens = model.tokenizer.encode(ps)
style = model.predict_style_single(tokens)
wav = model(tokens, s_prev=style)
soundfile.write('generated.wav', wav.cpu().numpy(), 24000)
Thx!
First of all, you cannot use the espeak phonemizer, because it is not good and my model was trained on my own phonemizer (https://github.com/patriotyk/ipa-uk.git). But that cannot cause sound as bad as in your example. I think switching to the proper phonemizer (and retraining) can resolve the issue in the second case, where you have bad intonation, no pauses, etc.
For the first case with the frozen decoder, could you update the sigma_data value in your config to the one from the config in your checkpoint folder (config.yml), then regenerate the style and speech?
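For context, sigma_data lives under the diffusion settings in the upstream StyleTTS2 config. The fragment below is roughly what it looks like there; the 0.2 value is the upstream placeholder, not the value from this checkpoint, since it gets re-estimated during training, which is why the two configs can disagree.

```yaml
model_params:
  diffusion:
    dist:
      sigma_data: 0.2            # upstream placeholder; replace with the value
                                 # from the checkpoint's own config.yml
      estimate_sigma_data: true  # re-estimates sigma_data from the data
      mean: -3.0
      std: 1.0
```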
Also, why do you train on the same dataset? What is the reason?
Changing sigma_data did not help :(.
| Also, why do you train on the same dataset? what the reason?
My initial understanding was that if I train on the same dataset your model was trained on (the Filatov dataset), then I can skip the decoder training and training becomes faster. After training is done and the decoder is merged, I can then fine-tune on my custom voice. That's why I started by just trying things out with 200 samples to see how it works. But given your question, maybe I should be able to train on my custom voice from scratch without the decoder, swap in your decoder, and still get good quality?
Regarding the cracking sound in the background, I might be doing something wrong when removing the decoder from training, and I might be removing some things that should still be there. Essentially, I comment out the places where the decoder is used, and if the output of the decoder is used in some other calculations, I use 0 or a zero tensor instead of that output, which affects how the other losses are calculated. The losses without the decoder end up looking bad, but my thinking is that this might be OK: since decoder training is commented out, those loss values might not matter much once the main component is taken out of training. Here is an example from train_second.py showing how I have commented out the lines of code. Is it wrong to do it like that?
F0_fake, N_fake = model.predictor.F0Ntrain(p_en, s_dur)
# y_rec = model.decoder(en, F0_fake, N_fake, s)  # decoder removed

# Some other code ..

if start_ds:
    optimizer.zero_grad()
    # d_loss = dl(wav.detach(), y_rec.detach()).mean()  # decoder removed
    # d_loss.backward()  # decoder removed
    # optimizer.step('msd')  # decoder removed
    # optimizer.step('mpd')  # decoder removed
    d_loss = 0
else:
    d_loss = 0

# generator loss
optimizer.zero_grad()
# loss_mel = stft_loss(y_rec, wav)  # decoder removed
loss_mel = torch.tensor(0.0, device=device)  # decoder removed
if start_ds:
    # loss_gen_all = gl(wav, y_rec).mean()  # decoder removed
    loss_gen_all = torch.tensor(0.0, device=device)  # decoder removed
else:
    loss_gen_all = 0
# loss_lm = wl(wav.detach().squeeze(), y_rec.squeeze()).mean()  # decoder removed
loss_lm = torch.tensor(0.0, device=device)  # decoder removed
You should use train_finetune.py. Also, you don't need to create empty tensors, and you can comment out the writer.add_scalar calls for all the losses you don't use anymore.
It is not so easy to say what you did wrong, because the error might be anywhere.
You can also ask ChatGPT; it may help you.
I will try things out. Thanks a lot for your time!