Model Speaks Too Fast and Lacks Pause Control
The audio cloning quality of Voxtral-4B-TTS-2603 is very strong, but there are key issues with speech timing and pacing.
The generated speech does not follow the tempo of the reference voice and tends to come out too fast. It also does not preserve or reflect the natural pauses in the reference audio, so the output feels rushed and less natural.
There is currently no way to control or insert pauses in the generated speech, which makes it difficult to achieve more realistic pacing.
At a minimum, it would be helpful to support manual pause insertion within the input text to improve timing and overall clarity.
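To illustrate the kind of manual pause insertion meant here, below is a rough sketch of a client-side approach: split the input text on a pause marker, synthesize each chunk separately, and join the chunks with silence. Everything in it is an assumption for illustration only. The `[pause:0.5]` marker syntax is made up, `synthesize` is a hypothetical stand-in for whatever call produces audio samples from text, and the 24000 Hz sample rate is a placeholder. None of this is part of the model's actual API, and synthesizing chunks independently may affect prosody across chunk boundaries.

```python
import re

# Hypothetical marker syntax, e.g. "Hello.[pause:0.5] How are you?"
PAUSE_RE = re.compile(r"\[pause:(\d+(?:\.\d+)?)\]")


def split_on_pauses(text):
    """Split text into ("text", str) and ("pause", seconds) segments."""
    segments = []
    pos = 0
    for match in PAUSE_RE.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("pause", float(match.group(1))))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def render(segments, synthesize, sample_rate=24000):
    """Concatenate synthesized chunks, inserting silence for each pause.

    `synthesize` is a hypothetical callable (str -> list of float samples);
    swap in the real TTS call. `sample_rate` is an assumed placeholder.
    """
    audio = []
    for kind, value in segments:
        if kind == "text":
            audio.extend(synthesize(value))
        else:
            # Silence is just zero-valued samples for the requested duration.
            audio.extend([0.0] * int(round(value * sample_rate)))
    return audio
```

For example, `split_on_pauses("Hello.[pause:0.5] How are you?")` yields two text segments with a 0.5 second pause between them. Native support inside the model would still be preferable, since it could keep intonation consistent across the pause.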
Do you have an example prompt + voice?
@wonderboy Please, I need to know: does this model support voice cloning locally? And is there any way to get it working on Windows?