Garbled initial audio frames in autoregressive generation — model behavior or implementation issue?
Hi Mistral team. Fantastic model, especially on multilingual quality.
We've been working on the mlx-audio port of mistralai/Voxtral-4B-TTS-2603 and consistently see an artifact at the start of generation: the first few decoded audio frames are often garbled before normal speech begins.
We've observed this across multiple implementations, including our MLX port and other public Voxtral TTS projects, and there are matching user reports elsewhere.
## Observation
Right after the prompt-to-audio transition, the model often emits a short prefix of repeated semantic codes before settling into more diverse, speech-like codes. Common prefixes we observed are repeated 10 or 855. These frames decode to a short noise burst or unintelligible audio, typically around 200-600 ms.
Using mlx-community/Voxtral-4B-TTS-2603-mlx-bf16, we captured sequences like:
| Prompt | Voice | First 12 semantic codes | Suspected warmup frames |
|---|---|---|---|
| "Hello, how are you today?" | casual_female | 855, 855, 10, 4904, 1062, 44, 7128, 6164, 10, 4007, 4007, 10 | 0-2 |
| "The quick brown fox..." | casual_male | 10, 10, 10, 10, 6081, 6081, 6081, 10, 3936, 44, 1429, 44 | 0-4 |
| "Welcome to the machine learning..." | cheerful_female | 10, 44, 44, 3230, 1496, 1496, 6164, 855, 4007, 2538, 2538, 7575 | 0-1 |
| "I can speak multiple sentences..." | neutral_male | 10, 10, 10, 10, 10, 10, 2309, 2309, 2309, 2309, 2309, 2309 | 0-6 |
Anecdotally, this seems more pronounced with quantized variants and varies somewhat across runs.
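To quantify the artifact, we measure the length of the constant prefix and convert it to an approximate duration. A minimal sketch (the `12.5` Hz frame rate is an assumption for illustration, not a confirmed property of the Voxtral codec; substitute the actual frames-per-second):

```python
def leading_constant_run(codes: list[int]) -> int:
    """Length of the run of identical codes at the start of a sequence."""
    if not codes:
        return 0
    n = 1
    while n < len(codes) and codes[n] == codes[0]:
        n += 1
    return n

def run_duration_ms(codes: list[int], frame_rate_hz: float = 12.5) -> float:
    """Estimated duration of the constant prefix in milliseconds.

    frame_rate_hz is a placeholder; use the codec's real frame rate.
    """
    return leading_constant_run(codes) * 1000.0 / frame_rate_hz

# "casual_male" row from the table above: 4-frame constant prefix
codes = [10, 10, 10, 10, 6081, 6081, 6081, 10, 3936, 44, 1429, 44]
```

At a hypothetical 12.5 Hz that prefix would be 320 ms, which is in the 200-600 ms range we hear.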
## Current workaround
As a temporary mitigation, we trim a leading run of identical semantic codes before codec decode. This removes the most obvious constant-prefix artifacts, but it is only a heuristic and can clip legitimate onset frames in some cases.
```python
def _trim_warmup_frames(all_codes: list) -> list:
    """Trim a leading run of identical semantic codes before codec decode."""
    if len(all_codes) <= 2:
        return all_codes
    first_code = all_codes[0][0, 0, 0].item()
    # A single leading frame is not treated as a warmup run.
    if all_codes[1][0, 0, 0].item() != first_code:
        return all_codes
    # Drop frames up to the first code that breaks the run (capped at 30
    # frames so a pathological all-constant sequence is returned untouched).
    for i in range(2, min(len(all_codes), 30)):
        if all_codes[i][0, 0, 0].item() != first_code:
            return all_codes[i:]
    return all_codes
```
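For reference, here is the same heuristic exercised on the scalar sequences from the table above (frame tensors replaced with plain ints for clarity). It also shows where the heuristic under-trims relative to our suspected warmup ranges:

```python
def trim_leading_run(codes: list[int]) -> list[int]:
    """Scalar version of _trim_warmup_frames, for illustration only."""
    if len(codes) <= 2:
        return codes
    if codes[1] != codes[0]:
        return codes  # no run of length >= 2, keep everything
    for i in range(2, min(len(codes), 30)):
        if codes[i] != codes[0]:
            return codes[i:]
    return codes

# Rows from the table above
casual_male = [10, 10, 10, 10, 6081, 6081, 6081, 10, 3936, 44, 1429, 44]
cheerful_female = [10, 44, 44, 3230, 1496, 1496, 6164, 855, 4007, 2538, 2538, 7575]

# casual_male: trims frames 0-3 (suspected warmup was 0-4)
# cheerful_female: untouched, because frame 1 already breaks the run,
# even though we suspect frames 0-1 are warmup
```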
## Questions
- Is this expected behavior around the text-to-audio transition?
- Does your internal/reference inference pipeline do anything special for the first generated audio frames?
- Is there a recommended bootstrap procedure here, such as a specific transition token pattern, discarding a fixed number of initial frames, or a small post-processing step before codec decode?
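For context, if a fixed discard count is the recommended approach, we would replace the heuristic with something like the sketch below (`N_WARMUP_FRAMES` is a placeholder; we'd use whatever value you suggest):

```python
N_WARMUP_FRAMES = 4  # placeholder, not an official value

def drop_fixed_warmup(all_codes: list, n: int = N_WARMUP_FRAMES) -> list:
    """Unconditionally discard the first n generated frames before codec decode."""
    # Keep everything if the clip is shorter than the discard window,
    # so very short generations are not emptied out.
    return all_codes[n:] if len(all_codes) > n else all_codes
```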
In particular, is a small warmup pass (generate a few frames, discard them, then start decoding) the intended way to handle these initial artifacts?

Any guidance would be appreciated.