Fine-tune on own voice (Ukrainian)

#1
by YaroslavOwl218 - opened

Hi! Thank you for your amazing work on the StyleTTS2‑Ukrainian model - it sounds truly impressive.

I wanted to ask:
1. Can I use this model for fine-tuning on my own voice?
2. How do you generate the style.pt file separately? Is it part of the training process or done differently?

I’d really appreciate any guidance or tips.
Thanks again for your great contribution!

  1. I don't provide the full checkpoint for fine-tuning. I have deleted the parts that are not needed for inference.
  2. To generate your voice style in a zero-shot way, you can use the predict_style_multi method. There is an example in the README: https://github.com/patriotyk/styletts2-inference

Hi Serhiy / @patriotyk ,
Thanks again for the amazing StyleTTS2-Ukrainian work - the voice quality in your provided styles is outstanding. I tried to reproduce it with my own ~4-hour cleaned and transcribed voice dataset, using predict_style_multi, but the synthesized voice is noticeably weaker than your examples. (I also tried fine-tuning with multilingual BERT, but in that case slurred speech is generated.)

Could you clarify how you generate the style.pt files with such high quality? In particular:

What exactly is stored in style.pt: is it a raw output of predict_style_multi/predict_style_single from one audio clip, or is it aggregated (e.g., averaged) from multiple reference utterances? Do you apply any post-processing (normalization, filtering, scaling) before saving?

Is the style file created purely in zero-shot inference, or is there a separate fine-tuning / adaptation step involved to produce those “ideal” styles?

What preprocessing do you apply to the reference audio when extracting style? (e.g., silence trimming, loudness normalization, forced alignment between text and audio, sample rate settings, pitch/energy conditioning, phonemization specifics for Ukrainian)

How many and what length of reference utterances give you a stable, high-fidelity style?

Are there any synthesis hyperparameters (style scaling, speed, temperature, etc.) you tweak when using that style vector to get the best quality?

Do you ever fine-tune on a new voice to approach that quality, and if so, what scripts/settings do you use?

Could you share the exact commands or a minimal snippet you use to produce and save style.pt?

I can also share my preprocessing pipeline, audio examples, tokenization code, and what I’ve tried so far if that helps.
Thanks a lot!

Uh, too many questions.
style.pt is just a vector generated by predict_style_multi and saved with torch.save. The audio duration for those styles is about 15-20 seconds. I didn't do any fine-tuning or post-processing; I trained everything from scratch (PL-BERT, aligner, and StyleTTS2) on my data, and the audio samples used to generate the styles are samples from the dataset.
You cannot fine-tune on your dataset using plbert-multilingual, because it was trained with a different phonemizer and different phoneme IDs. The trained PL-BERT is included in that checkpoint, so you shouldn't use any external PL-BERT checkpoints. The fine-tune script included in StyleTTS2 is wrong: it requires an external PL-BERT, but that is actually not needed for fine-tuning.
Also, if you are trying to generate a style from a voice that was not present during training, you should get pretty good audio, but the voice will not be very similar to yours, because the zero-shot capabilities of StyleTTS2 are very weak. Could you show me the sample you have been using to generate the style, and some audio generated with that style?
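For reference, a minimal sketch of the save/load flow described above. The style tensors here are random stand-ins for the output of predict_style_multi (whose exact call signature depends on the styletts2-inference wrapper), and the averaging step is purely illustrative - the author says style.pt is simply the predict_style_multi output saved as-is.

```python
import torch

# Stand-ins for style vectors returned by predict_style_multi; in real use
# these would come from the styletts2-inference model on ~15-20 s reference clips.
style_a = torch.randn(1, 256)
style_b = torch.randn(1, 256)

# Optionally average several reference styles into one (illustrative only).
style = torch.stack([style_a, style_b]).mean(dim=0)

# Save exactly as described in the thread: a plain tensor via torch.save.
torch.save(style, "style.pt")

# At inference time, load it back and pass it as the style vector.
loaded = torch.load("style.pt")
```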

Hi @patriotyk ,

Thank you for the detailed explanation! I'm really grateful!
Here's my exact setup and issue:

My Training Setup:

  • Base model: LibriTTS epoch_2nd_00100.pth
  • Script: train_finetune_accelerate.py (30 epochs)
  • Dataset: ~4 hours Ukrainian voice data (transcribed in Cyrillic) in txt format: line_00001.wav|Я досі розмірковую над твоєю пропозицією.|0
  • PL-BERT: Replaced with multilingual-pl-bert (https://huggingface.co/papercup-ai/multilingual-pl-bert)
  • Frozen components: text_aligner, text_encoder, bert, bert_encoder
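As a side note, the metadata format above (pipe-separated path|transcript|speaker_id, one utterance per line) can be parsed with a few lines of plain Python; the field names below are my own labels, not anything from the training script.

```python
def parse_metadata_line(line: str) -> dict:
    """Split one 'wav|text|speaker' metadata line into named fields."""
    path, text, speaker = line.rstrip("\n").split("|")
    return {"path": path, "text": text, "speaker": int(speaker)}

entry = parse_metadata_line(
    "line_00001.wav|Я досі розмірковую над твоєю пропозицією.|0"
)
```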

Training Results (seemed good):
Sty Loss: 0.04839 <- excellent for voice cloning
LM Loss: 2.18629 <- seemed reasonable for Ukrainian
Gen Loss: 8.96375, Dur Loss: 4.13022
Total Loss: 0.46797

The Problem:
Despite low LM Loss, inference produces complete gibberish instead of Ukrainian words. However, the voice timbre sounds correct (good style extraction).

Comparison Test:
Your HuggingFace model with zero-shot cloning of the same voice works much better (though not perfect quality). - https://huggingface.co/spaces/patriotyk/styletts2-ukrainian/tree/main

My Understanding:
Based on your explanation, I believe the issue is:

  • LibriTTS was trained with English phonemizer
  • I fine-tuned with Ukrainian Cyrillic text but no Ukrainian phonemizer
  • Style extraction works because it's audio-based, but text processing is broken
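A toy illustration of that last point: if the checkpoint's embedding table was built from one phoneme inventory and the fine-tuning text is phonemized against another, the same symbol indexes a different (or missing) row, so the text encoder receives scrambled inputs. Both tables below are made up for illustration.

```python
# Two hypothetical symbol-to-ID tables, as produced by two different phonemizers.
english_table = {"a": 1, "b": 2, "t": 3, "ʃ": 4}
ukrainian_table = {"a": 7, "b": 1, "t": 9, "ʃ": 2}

phonemes = ["b", "a", "t"]
ids_seen_in_training = [english_table[p] for p in phonemes]
ids_fed_at_finetune = [ukrainian_table[p] for p in phonemes]

# The embeddings were learned against the first mapping, so IDs produced by
# the second mapping select unrelated embedding rows -> gibberish audio.
assert ids_seen_in_training != ids_fed_at_finetune
```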

Files attached:

  • line_00001.wav - example part of training dataset
  • result.wav - inference result (Ukrainian text input)

Questions:

  1. Is this phonemizer mismatch the root cause of gibberish output?
  2. Should I retrain from scratch with Ukrainian phonemizer instead of fine-tuning?
  3. Or can I somehow fix the text processing pipeline for the existing model?

The voice cloning part works great (Sty Loss: 0.048), but text-to-speech is completely broken. Any guidance would be hugely appreciated!

Thanks a lot for your time and expertise!
I really appreciate this!

Hm, I thought you were trying to fine-tune this model, but you were using the English LibriTTS checkpoint. That will not work with such a small dataset.

Actually, fine-tuning an English checkpoint on a Ukrainian dataset is a totally bad idea here.

I wanted to clarify further — if we take your model as a base model (given that it already has correctly trained components for the Ukrainian language — phonemizer, aligner, PL-BERT, etc.), can I use it only for additional training/adaptation to my voice (without retraining all modules from scratch)?

Is this technically possible — for example, to unfreeze only the part responsible for style/voice, and leave the rest frozen?
If this is feasible, I would be very grateful for tips on what to change in the pipeline.

Yes, you can freeze the decoder.
Also, please look here: https://huggingface.co/patriotyk/styletts2_ukrainian_single/discussions/1
Maybe there is some useful info for you there.
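A generic PyTorch sketch of what freezing a sub-module looks like, using stand-in modules (the actual module names and structure in the StyleTTS2 training script may differ): parameters with requires_grad set to False stop receiving gradient updates, while the rest of the model keeps training.

```python
import torch.nn as nn

# Stand-in model with a "decoder" and a "style_encoder" sub-module.
model = nn.ModuleDict({
    "decoder": nn.Linear(64, 64),
    "style_encoder": nn.Linear(64, 64),
})

# Freeze the decoder: its parameters are excluded from gradient updates.
for p in model["decoder"].parameters():
    p.requires_grad = False

# Collect the names of the parameters that will still be trained.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

When building the optimizer, pass only the still-trainable parameters (e.g. `filter(lambda p: p.requires_grad, model.parameters())`) so the frozen weights are skipped entirely.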

Hi Serhiy,
Thanks a lot for your detailed answers in the thread - they’ve been very helpful in understanding the pipeline.

I wanted to ask for a bit more detail about your training from scratch setup:

Training process: Which components did you train from scratch before running StyleTTS2?

  • PL-BERT?
  • Text aligner / ASR?
  • Any other modules?

Data and training time:

  • Roughly how many hours of audio (and/or text-audio pairs) did you use for each module?
    How much training time did it take for:
    • PL-BERT and the aligner,
    • The first stage of StyleTTS2 (train_first.py),
    • The second stage (train_second.py)?

This info would really help in planning a full training pipeline. Thanks again in advance!

I trained PL-BERT and the ASR (aligner) from scratch.
I don't remember the training times, but for the aligner I used about 1.3K hours of data, and for StyleTTS2 ~1K hours.
The first stage trains quickly because it allows a bigger batch size. The second stage is also OK, but only before joint training starts; once joint training starts, it requires much more GPU memory.
