Multi-speaker podcast generation: truncation, silence gaps, and volume inconsistency

#24

by ericgreen - opened 23 days ago

Environment

Model: voxtral-mini-tts-2603 (via Mistral API)
API endpoint: /v1/audio/speech
Response format: mp3
Use case: Multi-speaker podcast generation (open-notebooklm; 3 speakers, 38 dialogue clips concatenated)

Issues
I'm using Voxtral TTS to generate a multi-speaker podcast by synthesizing each dialogue turn individually (one API call per turn) and then concatenating the clips. I've encountered three issues:

1. Speech truncation at the end of utterances
Some speakers' speech is cut off before they finish their sentence. The final words or syllables are abruptly truncated, making the dialogue sound incomplete. This happens inconsistently: not every clip is affected, but it's noticeable enough to degrade the listening experience.

2. Unnatural silence/padding at clip boundaries
Each generated audio clip contains noticeable silence at the beginning and/or end. When clips are concatenated sequentially, these silences accumulate and create unnaturally long pauses between speaker turns. The gaps feel much longer than what you'd hear in a real conversation.

3. Volume fluctuation across different speakers/voice_ids
When switching between different voice_ids (different speakers), the output volume varies significantly. Some clips are noticeably louder or quieter than others, even though all other parameters remain the same. This creates jarring volume jumps in the final podcast.
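
To make the silence and volume issues easier to quantify, here is a minimal client-side sketch. It is not part of the reproduction script and does nothing for the truncation issue, since cut-off speech cannot be recovered after the fact. It assumes each clip has already been decoded from mp3 into a list of mono float samples in [-1.0, 1.0]; decoding is not shown. The function names, the 0.01 silence threshold, the 0.15 s of padding kept, and the -20 dBFS target are all illustrative choices, and plain RMS is only a rough stand-in for perceived loudness.

```python
import math


def rms_dbfs(samples):
    """RMS level in dBFS for float samples in [-1.0, 1.0]; -inf for silence."""
    if not samples:
        return float("-inf")
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_sq) if mean_sq > 0 else float("-inf")


def edge_silence(samples, threshold=0.01):
    """Return (leading, trailing) counts of samples with |s| < threshold.

    An all-silent clip is reported as (len(samples), 0).
    """
    n = len(samples)
    lead = next((i for i, s in enumerate(samples) if abs(s) >= threshold), n)
    if lead == n:
        return n, 0
    trail = next(i for i, s in enumerate(reversed(samples)) if abs(s) >= threshold)
    return lead, trail


def trim_and_level(samples, sample_rate, target_dbfs=-20.0, keep_s=0.15,
                   threshold=0.01):
    """Trim edge silence down to keep_s seconds and scale to target_dbfs RMS."""
    keep_n = round(keep_s * sample_rate)
    lead, trail = edge_silence(samples, threshold)
    start = max(0, lead - keep_n)
    end = len(samples) - max(0, trail - keep_n)
    body = samples[start:end]

    level = rms_dbfs(body)
    if level == float("-inf"):
        return body  # nothing to scale in a silent clip

    gain = 10 ** ((target_dbfs - level) / 20)
    # Hard-clip to the valid range in case the gain pushes peaks past full scale.
    return [max(-1.0, min(1.0, s * gain)) for s in body]
```

Running `edge_silence` and `rms_dbfs` over every clip, grouped by voice_id, would give concrete per-clip padding and per-speaker level figures for a report like this one. `trim_and_level` could serve as a stopgap before concatenation.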

y123456y78

Mistral AI_ org 22 days ago

Hi @ericgreen ,

When you say you're using the model via the Mistral API, do you mean via Mistral AI Studio?

Could you share the text prompts and voice_ids for which you encountered these issues? I can take a look at them.

ericgreen

22 days ago

Hi @y123456y78 , thanks for looking into this!

Yes, I'm using it via the Mistral API (https://api.mistral.ai/v1/audio/speech), model: voxtral-mini-tts-2603, response format: mp3.

I've prepared a full reproduction script that contains:

50 dialogue turns from a real podcast episode with 3 different voice_ids
Exact API call code (httpx POST → base64 decode)
Can be run directly with MISTRAL_API_KEY to reproduce the issues

The script is quite long (~90 lines + transcript data), and HuggingFace doesn't support file attachments here. Could you share an email address so I can send the reproduction script directly?
