Reference audio / Voice cloning not working?

#17
by Tranquil9661 - opened

The model card benchmark uses a 10s audio reference.
https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/voxtral_tts#voxtral-tts-offline-inference

The example accepts --audio-path, but the vLLM-Omni docs say this is not yet released, and the runtime says the encoder weights are missing from the open-source checkpoint. Is there a newer public checkpoint or companion asset planned?

Mistral AI_ org

Hi @Tranquil9661 ,
The voice cloning feature is not included in the current release, and we don’t yet have a timeline for its availability.

just release the encoder weights bro we can figure out the rest

Mistral AI_ org

While we didn't release the encoder weights in this version, all the details about the encoder are available in the paper: https://arxiv.org/pdf/2603.25551.

An architecture without trained weights is useless, and training a compatible encoder is the hard part.

When you say “you can figure it out from the paper” — you’re technically not wrong, but it’s like saying “the blueprints for a jet engine are public, go build one.” The architecture is the easy part. The weights are what people actually need, and those are exactly what is missing.

A more honest answer would be: we won’t release the weights because voice cloning is our only monetization avenue for this model. There's no point in sending people on wild goose chases.

They'll release the encoder weights under some license that requires you to sacrifice your firstborn child to the French as soon as another lab releases something better 😭

another useless release by some research lab selling shit and releasing useless weights.

What the heck? It's one of the most important features...

Mistral AI_ org

Hi folks,

I understand your disappointment here. This is just our first step in this area, and we’re committed to advancing research and sharing progress with the community. My role here is simply to share information, not to persuade anyone. The only thing I can assure you of is that your feedback is being heard.

Something like this may help with researching the model's cloning capabilities and obtaining the codes for voice cloning without the actual encoder weights being released: https://github.com/MarvinRomson/voxtral-tts-codes-for-audio

I understand that it's not your fault, of course. As a French person and a supporter of Mistral, I'm just disappointed: no weights, no cloning, not available for commercial projects. So I had to choose Qwen TTS, which actually fits my needs, instead of this model, which could have been a wonderful demonstration of the capabilities of what seems to be a solid model.

I hope you change your mind. This strategy, IMHO, is just bad.
