Inquiry regarding fine-tuning of nvidia/magpie_tts_multilingual_357m
Hi there,
I noticed that you have fine-tuned the nvidia/magpie_tts_multilingual_357m model. I am very interested in your work and have a few questions:
Fine-tuning guide: Do you have any detailed instructions, scripts, or documentation on how you approached the training process?
Output quality: How would you evaluate the quality of the generated audio, specifically regarding intonation and naturalness?
Practical usage: Based on your experience, is the fine-tuned model robust enough for real-world/production applications?
Thank you for your time and for sharing your work!
Hello, of course I am happy to answer those questions.
- Please refer to my GitHub project, https://github.com/NMikaa/TTS_pipelines: go to the pipelines directory and you will find a folder called magpieTTS with how it was trained and everything I did during training. The idea of this repository is basically to benchmark different open-source TTS architectures while training them on the same set of data. The training/fine-tuning guide is basically language-agnostic: you can swap out the data and it works just as intended :)
- Intonation and naturalness are the trickiest part of the evaluation. I can't evaluate Georgian intonation and naturalness automatically, or at least I haven't researched that part yet; currently I believe that MOS is the most appropriate score for such low-resource languages.
- This model is just a basis for Georgians to fine-tune/continue training on their own data. If you fine-tune it on a much bigger corpus with varied intonations and a lot of punctuation, then yes, it's robust enough; at this point I don't think so, since it sometimes messes up the "?" or "!" because of the Common Voice data. Still, after only 5 (!!!!) hours of training on an A6000 (48 GB VRAM) with 33 hours of Georgian speech, it gave a result that amazed me. The architecture is immaculate and I think it can be adapted to any other language as well; at some point in training it can EASILY be used for real-world applications.
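Since the guide above is meant to be language-agnostic, the main thing you swap out is the training manifest. A minimal sketch of a NeMo-style JSONL manifest, with made-up file names, texts, and durations purely for illustration:

```python
import json

# Hypothetical manifest entries -- the paths, texts, and durations here are
# invented for illustration. NeMo-style TTS manifests are JSONL: one JSON
# object per line.
entries = [
    {"audio_filepath": "wavs/clip_0001.wav", "text": "გამარჯობა", "duration": 2.4},
    {"audio_filepath": "wavs/clip_0002.wav", "text": "მადლობა", "duration": 1.8},
]

with open("train_manifest.json", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Read it back, one record per line:
with open("train_manifest.json", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded), loaded[0]["audio_filepath"])
```

Swapping languages then amounts to regenerating this file from your own audio/transcript pairs.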
Thank you for sharing. I’ve looked through it and it seems fairly easy to implement (although I haven’t tried it yet). However, I have some questions regarding the data:
You mentioned that “No voice cloning: Uses 5 baked speaker embeddings from pretraining. Reference audio cloning was not trained.” For a small dataset, this seems quite reasonable to me. However, in NVIDIA’s paper they mention that the model was trained on 50,000 hours of data. Assigning a speaker ID to each individual speaker in such a massive dataset seems impractical (although I’m not entirely sure how they handled it). At the same time, in their inference code there are only a few to a few dozen speaker IDs, and I don’t see any voice cloning mechanism.
Could you explain how this works?
Also, I noticed that your training dataset contains 12 speakers, but you mentioned that only 5 speaker IDs are used. What is the reason for this?
If I have a large dataset but haven’t assigned speaker IDs to everything, what should I do?
Thank you very much.
OK, so if you try training the model yourself and log everything in wandb, you are going to see that with the "teacher forcing" method the model clones audio very well; even though it is the same audio, the output is purer and more realistic than the reference. I just did not look much into voice cloning, as I was aiming for coherent Georgian. They surely have some "context audio" argument at inference, otherwise this would be impossible.
Regarding your question about the speaker IDs: they probably used some clustering algorithm on top of a speaker-embedding model, either an existing one or their own. One model I came across was ECAPA-TDNN; they could have done something like ECAPA-TDNN embeddings -> k-means into n clusters.
About the 12 speakers: I did not change the pretrained model's 5 baked speakers, that's why it happened. Fine-tuning doesn't add new speakers. Even though our Common Voice Georgian data had 12 speakers, we can only index into 0-4.
Try using automatic speaker assignment for the speaker IDs: research which model extracts speaker embeddings best, so you can implement some clustering method on top of it.
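The embeddings -> k-means idea above can be sketched end to end. In this toy version the "embeddings" are hand-made 2-D vectors standing in for real ECAPA-TDNN outputs (which in practice you would extract with a pretrained model such as SpeechBrain's spkrec-ecapa-voxceleb), and the tiny pure-Python k-means stands in for sklearn.cluster.KMeans:

```python
import math
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Tiny k-means for illustration, standing in for sklearn.cluster.KMeans."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Toy "speaker embeddings": two well-separated groups of utterances.
embeddings = [[0.1, 0.0], [0.0, 0.2], [5.0, 5.1], [5.2, 4.9]]
labels = kmeans(embeddings, k=2)
print(labels)  # utterances 0/1 share one pseudo-speaker ID, 2/3 the other
```

The cluster indices the loop produces would then serve as the automatic speaker IDs in the manifest.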
If the model only has 5 fixed speakers (IDs 0-4), how do you handle speakers in the dataset whose IDs fall outside that range?
I don’t seem to see any special preprocessing step in the code for handling this case.
The "speaker" field in the training manifest is completely ignored in my setup.
NeMo's dataset loader (nemo/collections/tts/data/text_to_speech_dataset.py) only reads the speaker field if a speaker_path JSON file is configured. Our training config (conf/magpietts_georgian.yaml) doesn't set one, so include_speaker = False and every sample defaults to speaker_index = 0 regardless of what's in the manifest.
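A simplified sketch of that behaviour (my illustration of the logic described, not NeMo's actual loader code; the `speaker_path` argument mirrors the config option):

```python
import json

def build_samples(manifest_lines, speaker_path=None):
    """Mimic the described behaviour: the manifest's 'speaker' field is
    only honoured when a speaker-mapping JSON file is configured."""
    include_speaker = speaker_path is not None
    if include_speaker:
        with open(speaker_path, encoding="utf-8") as f:
            speaker_map = json.load(f)
    samples = []
    for line in manifest_lines:
        record = json.loads(line)
        if include_speaker:
            speaker_index = speaker_map[str(record["speaker"])]
        else:
            speaker_index = 0  # every sample collapses to speaker 0
        samples.append({"text": record["text"], "speaker_index": speaker_index})
    return samples

manifest = [
    '{"text": "a", "speaker": 8}',
    '{"text": "b", "speaker": 14}',
]
samples = build_samples(manifest)  # no speaker_path configured
print([s["speaker_index"] for s in samples])  # [0, 0]
```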
The speaker IDs (8, 10, 14, etc.) in my manifest are only used by our own train.py (lines 222-234) to pair each training sample with a reference audio clip from the same speaker; this goes into context_audio_filepath / context_audio_codes_path. The model learns speaker characteristics through its context encoder, not through the speaker index.
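The same-speaker pairing described above could look roughly like this (a sketch under my own simplification, not the actual train.py code; the field names follow the manifest keys mentioned):

```python
import random
from collections import defaultdict

def add_context_audio(samples, seed=0):
    """For each sample, attach a reference clip from the SAME speaker,
    preferring a different utterance when the speaker has more than one."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for s in samples:
        by_speaker[s["speaker"]].append(s["audio_filepath"])
    for s in samples:
        others = [p for p in by_speaker[s["speaker"]] if p != s["audio_filepath"]]
        # Fall back to the sample's own clip if the speaker has only one.
        s["context_audio_filepath"] = rng.choice(others or by_speaker[s["speaker"]])
    return samples

samples = [
    {"audio_filepath": "a1.wav", "speaker": 8},
    {"audio_filepath": "a2.wav", "speaker": 8},
    {"audio_filepath": "b1.wav", "speaker": 10},
]
paired = add_context_audio(samples)
print([s["context_audio_filepath"] for s in paired])  # ['a2.wav', 'a1.wav', 'b1.wav']
```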
The 5 baked speaker embeddings (0-4) are a completely separate, inference-only mechanism: pre-computed vectors that replace the context encoder at inference time. See magpietts.py; I think _prepare_decoder_context was the method that basically "discarded" the context encoder when the embeddings were "baked" inside the model.
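Conceptually, that switch between the two mechanisms looks something like the following (my own simplified sketch, not the actual magpietts.py code; the names and toy vectors are invented):

```python
# Conceptual sketch of "baked embeddings replace the context encoder at
# inference" -- a simplification, NOT the real NeMo/MagpieTTS implementation.
BAKED = {i: [0.1 * i] * 4 for i in range(5)}  # 5 pre-computed speaker vectors (toy values)

def encode_context_audio(audio):
    # Placeholder standing in for the real context encoder.
    return [0.0] * 4

def decoder_context(speaker_index=None, context_audio=None):
    if speaker_index is not None:
        # Inference path: a stored vector replaces the context encoder.
        return BAKED[speaker_index]
    # Training / cloning path: encode the reference audio instead.
    return encode_context_audio(context_audio)

print(decoder_context(speaker_index=2))  # [0.2, 0.2, 0.2, 0.2]
```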
What happened is: we fine-tuned with speaker=0 on all samples, so the model learned Georgian "through" speaker 0's embedding. However, I don't think "fixing" this would benefit model quality at all: the decoder, text encoder, and CTC alignment are shared across all speakers, so there are no speaker-specific layers. When we switch to speaker 1 during inference, it just outputs the audio in speaker 1's voice. Of course, I could have changed the "baked embedding layer".
Now, getting back to your question about the speaker IDs: honestly, that's up to you. When I really dug into this topic just now with the help of your questions, it came down to this: even if you don't label anything with speakers, you will still get 5 good speakers who "know" the language you are training on pretty well. However, if you plan on cloning some speaker from your dataset, then it might be worth labeling 😀
Thank you for the questions and the activity. If you have any more questions, I might go down the rabbit hole even deeper, because this experiment was a matter of a week and I am still learning about this architecture and their code 😀
Thank you for sharing. I will consider testing it in the near future.