Reference audio / Voice cloning not working?

#17

by Tranquil9661 - opened 25 days ago

The model card benchmark uses a 10 s audio reference.
https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/voxtral_tts#voxtral-tts-offline-inference

The example accepts --audio-path, but the vLLM-Omni docs say it is not yet released, and the runtime says the encoder weights are missing from the open-source checkpoint. Is there a newer public checkpoint or companion asset planned?

y123456y78

Mistral AI_ org 25 days ago

Hi @Tranquil9661 ,
The voice cloning feature is not included in the current release, and we don’t yet have a timeline for its availability.

evewashere

25 days ago

The voice cloning feature is not included in the current release, and we don’t yet have a timeline for its availability.

just release the encoder weights bro we can figure out the rest

y123456y78

Mistral AI_ org 25 days ago

While we didn't release the encoder weights in this version, all the details about the encoder are available in the paper: https://arxiv.org/pdf/2603.25551.

aimeri

24 days ago

Architecture without trained weights is useless, and training a compatible encoder is the hard part.

When you say “you can figure it out from the paper” — you’re technically not wrong, but it’s like saying “the blueprints for a jet engine are public, go build one.” The architecture is the easy part. The weights are what people actually need, and those are exactly what is missing.

A more honest answer would be: we won’t release the weights because voice cloning is our only monetization avenue for this model. No point in sending people on wild goose chases.

evewashere

24 days ago

They'll release the encoder weights under some license that requires you to sacrifice your firstborn child to the French as soon as another lab releases something better 😭

marduk191

24 days ago

another useless release by some research lab selling shit and releasing useless weights.

KebalBaguette

23 days ago

Hi @Tranquil9661 ,
The voice cloning feature is not included in the current release, and we don’t yet have a timeline for its availability.

What the heck? It's one of the most important features...

y123456y78

Mistral AI_ org 22 days ago

Hi folks,

I understand your disappointment here. This is just our first step in this area, and we’re committed to advancing research and sharing progress with the community. My role here is simply to share information, not to persuade anyone. The only thing I can assure is that your feedback is being heard.

RSmirnov

19 days ago

•

edited 19 days ago

Something like this may help with researching the model's cloning capabilities and getting the codes for voice cloning without the actual encoder weights being released: https://github.com/MarvinRomson/voxtral-tts-codes-for-audio

KebalBaguette

19 days ago

•

edited 19 days ago

Hi folks,

I understand your disappointment here. This is just our first step in this area, and we’re committed to advancing research and sharing progress with the community. My role here is simply to share information, not to persuade anyone. The only thing I can assure is that your feedback is being heard.

I understand that it's not your fault, of course. As a French person and a supporter of Mistral, I'm just disappointed: no weights, no cloning, not available for commercial projects. So I had to choose Qwen TTS, which actually fits my needs, instead of this model, which could have been a wonderful demonstration of the capabilities inside what seems to be a solid model.

Hope you change your mind; this strategy, imho, is just bad.
