Multi-speaker podcast generation: truncation, silence gaps, and volume inconsistency
Environment
- Model: voxtral-mini-tts-2603 (via Mistral API)
- API endpoint: /v1/audio/speech
- Response format: mp3
- Use case: Multi-speaker podcast generation (open-notebooklm; 3 speakers, 38 dialogue clips concatenated)
Issues
I'm using Voxtral TTS to generate a multi-speaker podcast by synthesizing each dialogue turn individually (one API call per turn) and then concatenating the clips. I've encountered three issues:
Speech truncation at the end of utterances
Some speakers' speech is cut off before they finish their sentence. The final words or syllables are abruptly truncated, making the dialogue sound incomplete. This happens inconsistently: not every clip is affected, but it's noticeable enough to degrade the listening experience.
Unnatural silence/padding at clip boundaries
Each generated audio clip contains noticeable silence at the beginning and/or end. When clips are concatenated sequentially, these silences accumulate and create unnaturally long pauses between speaker turns. The gaps feel much longer than what you'd hear in a real conversation.
Volume fluctuation across different speakers/voice_ids
When switching between different voice_ids (different speakers), the output volume varies significantly. Some clips are noticeably louder or quieter than others, even though all other parameters remain the same. This creates jarring volume jumps in the final podcast.
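For reference, the silence and volume issues can be partly worked around client-side before concatenation. Below is a minimal sketch of that kind of post-processing, not anything Voxtral or the Mistral API provides. It assumes each clip has already been decoded from mp3 into a list of mono float samples in [-1.0, 1.0] (the decoding step is out of scope here), and every function name, threshold, and target level is a hypothetical choice for illustration. The `ends_abruptly` check is only a rough heuristic for flagging possibly truncated clips; it cannot restore the missing audio.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def trim_silence(samples, threshold=0.01, keep=0):
    """Drop leading/trailing samples quieter than `threshold`.

    `keep` samples of the original padding are retained on each side.
    The threshold is an assumed value and would need tuning on real clips.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    start = max(loud[0] - keep, 0)
    end = min(loud[-1] + 1 + keep, len(samples))
    return list(samples[start:end])


def match_rms(samples, target_rms=0.1, peak_limit=1.0):
    """Scale a clip toward `target_rms` without letting the peak exceed `peak_limit`."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    peak = max(abs(s) for s in samples)
    gain = min(target_rms / level, peak_limit / peak)
    return [s * gain for s in samples]


def ends_abruptly(samples, tail_len, threshold=0.01):
    """Heuristic: True if the last `tail_len` (> 0) samples are still loud,
    i.e. the clip does not decay to near-silence before it ends."""
    return rms(samples[-tail_len:]) >= threshold


def join_clips(clips, gap_samples):
    """Trim and level each clip, then join them with a fixed silent gap."""
    out = []
    for i, clip in enumerate(clips):
        if i:
            out.extend([0.0] * gap_samples)
        out.extend(match_rms(trim_silence(clip)))
    return out
```

With this approach the pause between turns is set by `gap_samples` rather than by whatever padding each clip happens to contain. Simple RMS matching only approximates perceived loudness, so a proper loudness measure may give better results across very different voices.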
Hi @ericgreen,
When you say you're using the model via the Mistral API, do you mean via Mistral AI Studio?
Can you share the text prompts and voice_ids with which you encountered these issues? I can take a look at them.
Hi @y123456y78, thanks for looking into this!
Yes, I'm using it via the Mistral API (https://api.mistral.ai/v1/audio/speech), model: voxtral-mini-tts-2603, response format: mp3.
I've prepared a full reproduction script that contains:
- 50 dialogue turns from a real podcast episode with 3 different voice_ids
- Exact API call code (httpx POST, then base64 decode)
- Can be run directly with MISTRAL_API_KEY set to reproduce the issues
The script is quite long (~90 lines + transcript data), and HuggingFace doesn't support file attachments here. Could you share an email address so I can send the reproduction script directly?