Multi-speaker podcast generation: truncation, silence gaps, and volume inconsistency

#24
by ericgreen - opened

Environment

  1. Model: voxtral-mini-tts-2603 (via Mistral API)
  2. API endpoint: /v1/audio/speech
  3. Response format: mp3
  4. Use case: Multi-speaker podcast generation (open-notebooklm; 3 speakers, 38 dialogue clips concatenated)

Issues
I'm using Voxtral TTS to generate a multi-speaker podcast by synthesizing each dialogue turn individually (one API call per turn) and then concatenating the clips. I've encountered three issues:

  1. Speech truncation at the end of utterances
    Some speakers' speech is cut off before they finish their sentence. The final words or syllables are abruptly truncated, making the dialogue sound incomplete. This happens inconsistently: not every clip is affected, but it's noticeable enough to degrade the listening experience.

  2. Unnatural silence/padding at clip boundaries
    Each generated audio clip contains noticeable silence at the beginning and/or end. When clips are concatenated sequentially, these silences accumulate and create unnaturally long pauses between speaker turns. The gaps feel much longer than what you'd hear in a real conversation.

  3. Volume fluctuation across different speakers/voice_ids
    When switching between different voice_ids (different speakers), the output volume varies significantly. Some clips are noticeably louder or quieter than others, even though all other parameters remain the same. This creates jarring volume jumps in the final podcast.
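
As a stopgap on the client side, issues 2 and 3 can be reduced by post-processing each clip before concatenation, and issue 1 can at least be flagged. The following is a minimal sketch, not part of the original pipeline. It assumes each mp3 clip has already been decoded to mono float samples in [-1, 1] (the decoding step is not shown), and every function name, threshold, and default below is a hypothetical choice for illustration. It cannot recover truncated speech; it only marks clips that end without any trailing quiet so they can be regenerated.

```python
import math


def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below `threshold`."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return samples[loud[0]:loud[-1] + 1]


def rms(samples):
    """Root-mean-square level of a clip (0.0 for an empty clip)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_rms(samples, target_rms=0.1):
    """Scale a clip so its RMS level equals `target_rms`, clipping to [-1, 1]."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)
    gain = target_rms / current
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


def ends_abruptly(samples, sample_rate, tail_seconds=0.02, threshold=0.01):
    """Heuristic only: True if the last `tail_seconds` still contain loud samples."""
    n = max(1, int(sample_rate * tail_seconds))
    return any(abs(s) >= threshold for s in samples[-n:])


def join_clips(clips, sample_rate, gap_seconds=0.25, target_rms=0.1):
    """Trim and level each clip, then join them with a fixed gap between turns."""
    gap = [0.0] * int(sample_rate * gap_seconds)
    out = []
    for index, clip in enumerate(clips):
        if index:
            out.extend(gap)
        out.extend(match_rms(trim_silence(clip), target_rms))
    return out
```

Calling `ends_abruptly` on the raw clip (before trimming) and re-requesting any flagged turn is one possible way to work around the truncation until the root cause is known. RMS matching is a rough proxy for perceived loudness; a proper loudness measure would likely give more even results across voices.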

Mistral AI_ org

Hi @ericgreen ,

When you say you are using the model via the Mistral API, do you mean via Mistral AI Studio?

Can you share the text prompts and voice_ids with which you encountered those issues? I can take a look at them.

Hi @y123456y78 , thanks for looking into this!

Yes, I'm using it via the Mistral API (https://api.mistral.ai/v1/audio/speech), model: voxtral-mini-tts-2603, response format: mp3.

I've prepared a full reproduction script that contains:

  • 50 dialogue turns from a real podcast episode with 3 different voice_ids
  • Exact API call code (httpx POST → base64 decode)
  • Can be run directly with MISTRAL_API_KEY to reproduce the issues
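
For readers following along, the call pattern described above (httpx POST, then base64 decode) might look roughly like the sketch below. This is not the reproduction script itself. The endpoint, model name, `voice_id`, and mp3 response format come from this thread; the `input` request field, the `audio_data` response field, and all function names are assumptions made for illustration and should be checked against the official API reference.

```python
import base64

API_URL = "https://api.mistral.ai/v1/audio/speech"  # endpoint quoted in this thread
MODEL = "voxtral-mini-tts-2603"                      # model quoted in this thread


def build_payload(text, voice_id):
    """Build the JSON body for one dialogue turn.

    Assumption: the text is sent in a field named "input".
    """
    return {
        "model": MODEL,
        "input": text,
        "voice_id": voice_id,
        "response_format": "mp3",
    }


def decode_audio(body, field="audio_data"):
    """Decode base64 audio from the parsed JSON response.

    Assumption: the audio is returned under a key named "audio_data".
    """
    return base64.b64decode(body[field])


def synthesize(text, voice_id, api_key):
    """Send one request and return the raw mp3 bytes for that turn."""
    import httpx  # third-party dependency, imported lazily

    response = httpx.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_payload(text, voice_id),
        timeout=60.0,
    )
    response.raise_for_status()
    return decode_audio(response.json())
```

One request per dialogue turn, as here, means every clip is generated independently of its neighbours in the conversation.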

The script is quite long (~90 lines + transcript data), and HuggingFace doesn't support file attachments here. Could you share an email address so I can send the reproduction script directly?
