Regarding speaker annotation data used to finetune this model.

#3
by RinRin32 - opened

Hi, regarding model behavior, I find that the large model infers with better results than the mini model. Also, using Qwen3-4B Instruct and training a linear projector to skip the T5 model produces somewhat more controllable inference. I'm curious how the data was annotated when finetuning this checkpoint, because male inference is rather hit-or-miss, and elaborate English descriptions through the T5 or Qwen linear projector give very varying results, sometimes insane hallucinations. Feel free to link your dataset if possible 🤝

Owner

We can’t share the dataset publicly due to constraints, but I can describe how it was prepared and what we observed.

For annotation, we followed the official dataspeech repository pipeline exactly: we ran the documented commands in order with the official code, without changing the annotation procedure itself. The LLM used for the annotation step is Mistral-7B-Instruct-v0.2, and we didn't introduce any labeling scheme beyond what the official pipeline produces.
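For readers unfamiliar with dataspeech, the pipeline broadly measures continuous acoustic attributes per clip, maps them onto discrete keyword bins, and then asks the LLM to turn those keywords into a natural-language description. A minimal illustrative sketch of that flow (the bin edges and prompt wording here are made up for illustration, not the actual pipeline's values):

```python
# Hypothetical sketch of the dataspeech-style annotation flow:
# continuous acoustic measures -> discrete keyword bins -> an LLM prompt.
# Bin edges and labels below are illustrative only.

PITCH_BINS = [(0, 120, "low pitch"), (120, 200, "moderate pitch"), (200, 1e9, "high pitch")]
RATE_BINS = [(0, 10, "slow"), (10, 16, "moderate speed"), (16, 1e9, "fast")]

def to_keyword(value, bins):
    """Map a continuous measurement onto its discrete keyword bin."""
    for lo, hi, label in bins:
        if lo <= value < hi:
            return label
    raise ValueError(f"value {value} is outside all bins")

def build_prompt(gender, pitch_hz, phonemes_per_sec):
    """Assemble the kind of prompt the annotation LLM would receive."""
    keywords = [gender,
                to_keyword(pitch_hz, PITCH_BINS),
                to_keyword(phonemes_per_sec, RATE_BINS)]
    return ("Write a one-sentence voice description using these keywords: "
            + ", ".join(keywords))

print(build_prompt("female", 210.0, 12.5))
```

In the real pipeline the description text produced by Mistral-7B-Instruct-v0.2 at this step becomes the conditioning "description" used during TTS training.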

Regarding why “male inference” is hit-or-miss, the dataset is heavily imbalanced: roughly 5% male and 95% female. We think this skew is the main reason male-related generations are inconsistent, because the model sees far fewer male examples during training.

As for instability and occasional hallucination with long, elaborate descriptions, the key issue seems to be tokenization. In the original parler-tts/parler-tts-mini-v1 setup, the default T5 tokenizer doesn't represent Japanese well, and a large portion of Japanese text gets replaced with <unk>. So we swapped the model-side tokenizer for a Japanese-capable one and then continued TTS training. However, we did not pretrain from scratch or do enough additional training/adaptation of the T5 text encoder itself to match the new tokenizer, which may introduce a tokenizer–encoder mismatch. As a result, the longer and more detailed the description, the more unstable the conditioning interpretation may become, increasing output variance and making hallucinations more likely.
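The <unk> problem is easy to sanity-check by measuring what fraction of tokens collapse to the unknown id. A toy illustration with a character-level ASCII-only vocabulary standing in for a tokenizer without Japanese coverage (a real check would load the actual T5 tokenizer via transformers):

```python
def unk_fraction(token_ids, unk_id):
    """Fraction of tokens that collapsed to the <unk> id after tokenization."""
    if not token_ids:
        return 0.0
    return sum(t == unk_id for t in token_ids) / len(token_ids)

# Toy character-level tokenizer with an ASCII-only vocabulary, standing in
# for a tokenizer that lacks Japanese coverage.
VOCAB = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
UNK_ID = len(VOCAB)

def tokenize(text):
    return [VOCAB.get(ch, UNK_ID) for ch in text.lower()]

print(unk_fraction(tokenize("a calm female voice"), UNK_ID))  # -> 0.0 (fully covered)
print(unk_fraction(tokenize("落ち着いた女性の声"), UNK_ID))     # -> 1.0 (every char -> <unk>)
```

When most of the description degenerates to <unk> like this, the encoder effectively loses the conditioning signal, which is why swapping in a Japanese-capable tokenizer was necessary.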

For reference, 2121-8/japanese-parler-tts-mini swaps only the tokenizer on the prompt side, so it probably improves stability as well.
