
Garbled initial audio frames in autoregressive generation — model behavior or implementation issue?

#20
by shreyask - opened

Hi Mistral team. Fantastic model, especially on multilingual quality.

We've been working on the mlx-audio port of mistralai/Voxtral-4B-TTS-2603 and consistently see an artifact at the start of generation: the first few decoded audio frames are often garbled before normal speech begins.

We've observed this across multiple implementations, including our MLX port and other public Voxtral TTS projects, and there are similar user reports.

Observation

Right after the prompt-to-audio transition, the model often emits a short prefix of repeated semantic codes before settling into more diverse, speech-like codes. Common prefixes we observed are repeated 10 or 855. These frames decode to a short noise burst or unintelligible audio, typically around 200-600 ms.
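For scale, a back-of-envelope sketch relating the 200-600 ms noise burst to a frame count. The 80 ms-per-frame figure is our assumption about the codec frame rate, not a confirmed spec for this model:

```python
# Assumption: the codec emits ~12.5 frames/s, i.e. 80 ms per frame
# (not confirmed for this model). Under that assumption, the observed
# 200-600 ms artifact window maps to roughly 3-7 frames, which is
# consistent with the warmup spans we list below.
ASSUMED_FRAME_MS = 80

def frames_to_ms(n_frames: int) -> int:
    """Duration covered by n_frames under the assumed frame rate."""
    return n_frames * ASSUMED_FRAME_MS

print(frames_to_ms(3))  # 240 ms
print(frames_to_ms(7))  # 560 ms
```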

Using mlx-community/Voxtral-4B-TTS-2603-mlx-bf16, we captured sequences like:

| Prompt | Voice | First 12 semantic codes | Suspected warmup frames |
|---|---|---|---|
| "Hello, how are you today?" | casual_female | 855, 855, 10, 4904, 1062, 44, 7128, 6164, 10, 4007, 4007, 10 | 0-2 |
| "The quick brown fox..." | casual_male | 10, 10, 10, 10, 6081, 6081, 6081, 10, 3936, 44, 1429, 44 | 0-4 |
| "Welcome to the machine learning..." | cheerful_female | 10, 44, 44, 3230, 1496, 1496, 6164, 855, 4007, 2538, 2538, 7575 | 0-1 |
| "I can speak multiple sentences..." | neutral_male | 10, 10, 10, 10, 10, 10, 2309, 2309, 2309, 2309, 2309, 2309 | 0-6 |

Anecdotally, this seems more pronounced with quantized variants and varies somewhat across runs.

Current workaround

As a temporary mitigation, we trim a leading run of identical semantic codes before codec decode. This removes the most obvious constant-prefix artifacts, but it is only a heuristic and can clip legitimate onset frames in some cases.

```python
def _trim_warmup_frames(all_codes: list) -> list:
    """Trim a leading run of identical semantic codes before codec decode."""
    if len(all_codes) <= 2:
        return all_codes

    first_code = all_codes[0][0, 0, 0].item()
    # No repeated prefix: nothing to trim.
    if all_codes[1][0, 0, 0].item() != first_code:
        return all_codes

    # Trim up to 30 leading frames that repeat the first code.
    for i in range(2, min(len(all_codes), 30)):
        if all_codes[i][0, 0, 0].item() != first_code:
            return all_codes[i:]

    # Run longer than the cap (or the whole sequence): leave it untouched
    # rather than risk discarding real content.
    return all_codes
```

Questions

  1. Is this expected behavior around the text-to-audio transition?
  2. Does your internal/reference inference pipeline do anything special for the first generated audio frames?
  3. Is there a recommended bootstrap procedure here, such as a specific transition token pattern, discarding a fixed number of initial frames, or a small post-processing step before codec decode?

Any guidance would be appreciated.

@y123456y78 and the team, any guidance would be greatly appreciated.

Curious whether a small warmup (generate a few frames, discard them, then start) is the "right" way to handle the initial artifacts.
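To make that concrete, the warmup idea is just a thin wrapper that consumes and discards a fixed number of initial frames before streaming the rest. A minimal sketch; `n_discard` is a made-up knob, not a parameter of any real pipeline:

```python
from typing import Iterable, Iterator

def drop_warmup_frames(frames: Iterable, n_discard: int = 3) -> Iterator:
    """Yield frames from a generation stream, skipping the first n_discard."""
    for i, frame in enumerate(frames):
        if i >= n_discard:
            yield frame

# With the first captured sequence above, n_discard=3 drops the
# repeated 855/10 prefix:
out = list(drop_warmup_frames([855, 855, 10, 4904, 1062], n_discard=3))
print(out)  # [4904, 1062]
```

The downside versus run-length trimming is that a fixed cutoff can clip real onset frames when the artifact is shorter than `n_discard`, which is why we are asking whether a recommended value or pattern exists.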
