Inconsistent speaker voice in Sesame CSM 1B

#48

by Jernik - opened May 21, 2025

May 21, 2025

Hi guys, I tried generating output using only speaker ID 0, but I'm not getting a consistent voice: it seems to change from one generation to the next. Did you experience the same issue? Were you able to get the same voice every time, or did it vary for you too?

Combatti

Jul 3, 2025

Same. 🙏

MatthewEvolving

10 days ago

You set the voice by creating two files that are read in every time you generate tokens, for example voice.wav and voice.txt. Keep the recording under 10 seconds; the model will then base the voice on that sample. The text in voice.txt must match exactly what is spoken in the WAV file.
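
For anyone landing here later, here is a minimal sketch of the loading step described above. It only reads the two files and checks the under-10-seconds rule using the Python standard library. The names `VoicePrompt` and `load_voice_prompt` are made up for illustration and are not part of CSM. The closing comment about `Segment` and `context=` reflects my reading of the CSM repo's README example and should be treated as an assumption to verify against your checkout.

```python
import wave
from dataclasses import dataclass
from pathlib import Path

# Limit suggested in the reply above; not an official CSM constant.
MAX_PROMPT_SECONDS = 10.0


@dataclass
class VoicePrompt:
    """Hypothetical container for a reference voice sample and its transcript."""
    transcript: str
    wav_path: Path
    duration_s: float
    sample_rate: int


def load_voice_prompt(wav_path, txt_path, max_seconds=MAX_PROMPT_SECONDS):
    """Read voice.wav / voice.txt and validate them before every generation.

    Uses the stdlib `wave` module, so the file must be uncompressed PCM WAV.
    """
    wav_path = Path(wav_path)
    transcript = Path(txt_path).read_text(encoding="utf-8").strip()
    if not transcript:
        raise ValueError("transcript file is empty; it must match the spoken audio")

    with wave.open(str(wav_path), "rb") as wf:
        sample_rate = wf.getframerate()
        duration_s = wf.getnframes() / sample_rate

    if duration_s >= max_seconds:
        raise ValueError(
            f"voice sample is {duration_s:.1f}s; keep it under {max_seconds:.0f}s"
        )

    return VoicePrompt(transcript, wav_path, duration_s, sample_rate)


# Assumed usage with CSM (not executed here, verify against the repo):
# the loaded audio and transcript are wrapped in a Segment with the same
# speaker ID you generate with, and passed as context=[segment] on every
# generate() call so each generation is conditioned on the same sample.
```

Reusing the same prompt on every call is the point: without any context, speaker ID 0 alone does not pin down a voice, which matches what the original post describes.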

Combatti

9 days ago

Thanks! We moved on from CSM a while ago. It was too wonky for our needs, so we ended up implementing our own TTS engine. 🙏
