
Garbled initial audio frames in autoregressive generation — model behavior or implementation issue?

#20
by shreyask - opened

Hi Mistral team. Fantastic model, especially on multilingual quality.

We've been working on the mlx-audio port of mistralai/Voxtral-4B-TTS-2603 and consistently see an artifact at the start of generation: the first few decoded audio frames are often garbled before normal speech begins.

We've observed this across multiple implementations, including our MLX port and other public Voxtral TTS projects, and there are similar user reports.

Observation

Right after the prompt-to-audio transition, the model often emits a short prefix of repeated semantic codes before settling into more diverse, speech-like codes. Common prefixes we observed are repeated 10 or 855. These frames decode to a short noise burst or unintelligible audio, typically around 200-600 ms.
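For scale, a back-of-envelope sketch relating the 200-600 ms noise burst to a frame count. The 80 ms-per-frame figure is our assumption about the codec frame rate, not a confirmed spec for this model:

```python
# Assumption: the codec emits ~12.5 frames/s, i.e. 80 ms per frame
# (not confirmed for this model). Under that assumption, the observed
# 200-600 ms artifact window maps to roughly 3-7 frames, which is
# consistent with the warmup spans we list below.
ASSUMED_FRAME_MS = 80

def frames_to_ms(n_frames: int) -> int:
    """Duration covered by n_frames under the assumed frame rate."""
    return n_frames * ASSUMED_FRAME_MS

print(frames_to_ms(3))  # 240 ms
print(frames_to_ms(7))  # 560 ms
```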

Using mlx-community/Voxtral-4B-TTS-2603-mlx-bf16, we captured sequences like:

| Prompt | Voice | First 12 semantic codes | Suspected warmup frames |
|---|---|---|---|
| "Hello, how are you today?" | casual_female | 855, 855, 10, 4904, 1062, 44, 7128, 6164, 10, 4007, 4007, 10 | 0-2 |
| "The quick brown fox..." | casual_male | 10, 10, 10, 10, 6081, 6081, 6081, 10, 3936, 44, 1429, 44 | 0-4 |
| "Welcome to the machine learning..." | cheerful_female | 10, 44, 44, 3230, 1496, 1496, 6164, 855, 4007, 2538, 2538, 7575 | 0-1 |
| "I can speak multiple sentences..." | neutral_male | 10, 10, 10, 10, 10, 10, 2309, 2309, 2309, 2309, 2309, 2309 | 0-6 |

Anecdotally, this seems more pronounced with quantized variants and varies somewhat across runs.

Current workaround

As a temporary mitigation, we trim a leading run of identical semantic codes before codec decode. This removes the most obvious constant-prefix artifacts, but it is only a heuristic and can clip legitimate onset frames in some cases.

```python
def _trim_warmup_frames(all_codes: list) -> list:
    """Trim a leading run of identical semantic codes before codec decode."""
    if len(all_codes) <= 2:
        return all_codes

    first_code = all_codes[0][0, 0, 0].item()
    # No repeated prefix: nothing to trim.
    if all_codes[1][0, 0, 0].item() != first_code:
        return all_codes

    # Trim up to 30 leading frames that repeat the first code.
    for i in range(2, min(len(all_codes), 30)):
        if all_codes[i][0, 0, 0].item() != first_code:
            return all_codes[i:]

    # Run longer than the cap (or the whole sequence): leave it untouched
    # rather than risk discarding real content.
    return all_codes
```

Questions

  1. Is this expected behavior around the text-to-audio transition?
  2. Does your internal/reference inference pipeline do anything special for the first generated audio frames?
  3. Is there a recommended bootstrap procedure here, such as a specific transition token pattern, discarding a fixed number of initial frames, or a small post-processing step before codec decode?

Any guidance would be appreciated.

@y123456y78 and the team, any guidance would be greatly appreciated.

Curious whether a small warmup (generate a few frames, discard them, then start) is the "right" way to handle the initial artifacts.
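To make that concrete, the warmup idea is just a thin wrapper that consumes and discards a fixed number of initial frames before streaming the rest. A minimal sketch; `n_discard` is a made-up knob, not a parameter of any real pipeline:

```python
from typing import Iterable, Iterator

def drop_warmup_frames(frames: Iterable, n_discard: int = 3) -> Iterator:
    """Yield frames from a generation stream, skipping the first n_discard."""
    for i, frame in enumerate(frames):
        if i >= n_discard:
            yield frame

# With the first captured sequence above, n_discard=3 drops the
# repeated 855/10 prefix:
out = list(drop_warmup_frames([855, 855, 10, 4904, 1062], n_discard=3))
print(out)  # [4904, 1062]
```

The downside versus run-length trimming is that a fixed cutoff can clip real onset frames when the artifact is shorter than `n_discard`, which is why we are asking whether a recommended value or pattern exists.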
