Model gives same output as audio_prompt

#1
by McHammer - opened

Hi Sebastian,

nice work on kartoffelbox_turbo, really impressive model!

I’ve been experimenting with it and ran into an issue where the generated output sometimes matches the audio_prompt_path content almost 1:1 (same wording and timing), instead of speaking the provided text.

Did you observe similar behavior during development?
If yes, did you find any recommended workarounds (e.g. very short prompts, specific sampling settings, or a preferred prompt style)?

Thanks a lot for sharing the model!

Hey McHammer,

Unfortunately yes, the same question came up in the HF Spaces demo: https://huggingface.co/spaces/SebastianBodza/chatterbox-turbo-demo/discussions/1

The standard chatterbox turbo also does this sometimes when using low-quality reference audio. However, my finetune steered the model more strongly towards that undesired behaviour. The training diverged at some point. I set up a two-stage training, where the second stage used only pairs in which the reference audio differed from the desired output. Unfortunately, training died quite early in the first stage, so the model is not that good.

The cleaner the audio and the closer the temperature is to 1, the lower the rate of repetitions. You could also try combining it with a cleanup step, using a pipeline similar to the Emilia one: UVR-MDX-NET-Inst_HQ_3 ONNX. If I get more time, I could try to train it with CFG, similar to the standard chatterbox model. If it really is a one-to-one copy of the audio tokens, we could maybe create a custom sampler to mitigate that.
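To illustrate the custom-sampler idea, here is a minimal sketch of prompt n-gram blocking: if the last few generated audio tokens match a span of the reference prompt's tokens, the token that would continue the copy gets masked out before sampling. The function name, the n-gram size, and the exact integration point into the decoding loop are all assumptions, not part of the actual chatterbox code.

```python
import numpy as np

def block_prompt_ngrams(logits, generated, prompt_tokens, n=3):
    """Mask tokens that would extend an n-gram copied from the prompt.

    logits        : 1-D array of next-token logits (hypothetical decoder output)
    generated     : audio tokens produced so far in this generation
    prompt_tokens : audio tokens of the reference prompt
    n             : n-gram size to block (assumed hyperparameter)
    """
    if len(generated) < n - 1:
        return logits
    context = tuple(generated[-(n - 1):])
    banned = set()
    # Find every place where the current (n-1)-gram occurs in the prompt;
    # the token following it would continue a verbatim copy.
    for i in range(len(prompt_tokens) - n + 1):
        if tuple(prompt_tokens[i:i + n - 1]) == context:
            banned.add(prompt_tokens[i + n - 1])
    out = logits.copy()
    for t in banned:
        out[t] = -np.inf  # excluded from the softmax / sampling step
    return out
```

This would run once per decoding step, right before temperature sampling; it is deliberately crude (hard masking rather than a soft penalty), so a logit penalty might be the gentler choice in practice.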

Thanks for the answer! :)

As I have seen in https://huggingface.co/ResembleAI/chatterbox-turbo/discussions/22, they removed cfg_weight for computational speed. They also discussed the use of exaggeration, FYI.

Sounds like a good idea!

Have you tried other models for High German, or finetuned any? I am looking for a streamable Standard German model, or even better a streamable model which can be finetuned on different dialects, for my bachelor thesis :)

Maybe, if you want to connect, we can discuss further :) As I have seen, you are limited in computational resources :)

Best regards
