---
library_name: vllm
language:
- en
- fr
- es
- pt
- it
- nl
- de
- ar
- hi
license: cc-by-nc-4.0
inference: false
base_model:
- mistralai/Ministral-3-3B-Base-2512
extra_gated_description: >-
  If you want to learn more about how we process your personal data, please read
  our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
tags:
- mistral-common
pipeline_tag: text-to-speech
---

# Voxtral 4B TTS 2603 

Voxtral TTS is a frontier, open-weights text-to-speech model that is fast, instantly adaptable, and produces lifelike speech for voice agents. The model is released with BF16 weights and a set of reference voices. These voices are licensed under CC BY-NC 4.0, which is the license that the model inherits.

For more details, see our:
- [🔊 Demo](https://console.mistral.ai/build/audio/text-to-speech)
- [✍️ Blog post](https://mistral.ai/news/voxtral-tts)
- [🔬 Research Paper](https://arxiv.org/abs/2603.25551)


## Key Features

Voxtral TTS delivers enterprise-grade text-to-speech for production voice agents, with the following capabilities:

- **Realistic, expressive speech** with natural prosody and emotional range across 9 major languages, with support for diverse dialects  
- **Text-to-Speech generation** with 20 preset voices and easy adaptation to new voices  
- **Multilingual support**: English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, and Hindi  
- **Very low latency** with fast time-to-first-audio, plus streaming and batch inference support  
- **24 kHz audio output** in WAV, PCM, FLAC, MP3, AAC, and Opus formats  
- **Production-ready performance** for high-throughput, real-time voice agent workflows

> [!Tip]
> For voice customization, visit our [AI Studio](https://console.mistral.ai/build/audio/text-to-speech).

### Use Cases

- Customer support and call center infrastructure.
- Financial services, e.g. banking KYC voice agents.
- Manufacturing and industrial operations.
- Public services and government.
- Compliance and risk.
- Supply chain and logistics.
- Automotive and in-vehicle systems.
- Sales and marketing.
- Real-time translation.

> [!Warning]
> **Responsible Use**: You are responsible for complying with applicable laws and for avoiding misuse of this model.

## Benchmark Results

  - Measured using [vllm_omni/examples/offline_inference/voxtral_tts/end2end.py](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/voxtral_tts).
  - Input: 500-character text with a 10-second audio reference.
  - Hardware: single NVIDIA H200.
  - vllm version: v0.18.0.

*Note*: The RTF reported by `end2end.py` uses an inverted formula (higher = better). The table below converts it back to the standard RTF convention, generation time divided by audio duration (lower = better).

  | Concurrency | Latency | RTF   | Throughput (char/s/GPU) |
  |:-----------:|:-------:|:-----:|:-----------------------:|
  | 1           | 70 ms   | 0.103 | 119.14                  |
  | 16          | 331 ms  | 0.237 | 879.11                  |
  | 32          | 552 ms  | 0.302 | 1430.78                 |
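Standard RTF divides generation time by the duration of the audio produced, so a value below 1 means faster-than-real-time synthesis. A minimal sketch of the two conventions and the conversion between them (the function names are illustrative, not part of the benchmark script):

```python
def standard_rtf(generation_time_s: float, audio_duration_s: float) -> float:
    """Standard real-time factor: compute time per second of audio (lower = better)."""
    return generation_time_s / audio_duration_s


def invert_rtf(rtf: float) -> float:
    """Convert between the inverted convention (higher = better) and the standard one."""
    return 1.0 / rtf


# e.g. 3 s of compute for 10 s of audio -> RTF 0.3, i.e. faster than real time
print(standard_rtf(3.0, 10.0))  # 0.3
```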


## Usage

The model can be deployed with the following libraries:
- [`vllm-omni (recommended)`](https://github.com/vllm-project/vllm-omni): See [here](#vllm-omni-recommended)

### vLLM Omni (recommended)

> [!Tip]
> We've worked hand-in-hand with the vLLM-Omni team to have production-grade support for Voxtral 4B TTS 2603 with vLLM-Omni.
> Special thanks goes out to Han Gao, Hongsheng Liu, Roger Wang, and Yueqian Lin from the vLLM-Omni team.


**Installation**

Make sure to install [vllm](https://github.com/vllm-project/vllm) from the latest (>= 0.18.0) PyPI package.
See [here](https://docs.vllm.ai/en/latest/getting_started/installation/) for a full installation guide.

```sh
uv pip install -U vllm
```

Next, install [`vllm-omni`](https://github.com/vllm-project/vllm-omni) (version >= 0.18.0).

```sh
uv pip install vllm-omni --upgrade  # make sure to have >= 0.18.0
```

Alternatively, you can use the ready-to-go Docker image on [Docker Hub](https://hub.docker.com/layers/vllm/vllm-omni/v0.18.0/images/sha256-d855c9f3e06b1126e8a082229e5d2fef217e43c98d03569f8b9e50fa5c2d0a61).


Installing `vllm >= 0.18.0` should automatically install `mistral_common >= 1.10.0`, which you can verify by running:

```sh
python3 -c "import mistral_common; print(mistral_common.__version__)" # should print >= 1.10.0
```

#### Serve

Thanks to its size and the BF16 format of its weights, `Voxtral-4B-TTS-2603` can run on a single GPU with >= 16 GB of memory.

```bash
vllm serve mistralai/Voxtral-4B-TTS-2603 --omni
```

#### Client

```py
import io
import httpx
import soundfile as sf
 
BASE_URL = "http://<your-server-url>:8000/v1"
 
payload = {
    "input": "Paris is a beautiful city!",
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "response_format": "wav",
    "voice": "casual_male",
}
 
response = httpx.post(f"{BASE_URL}/audio/speech", json=payload, timeout=120.0)
response.raise_for_status()
 
audio_array, sr = sf.read(io.BytesIO(response.content), dtype="float32")
print(f"Got audio: {len(audio_array)} samples at {sr} Hz")

# you can play the audio with a library like `sounddevice.play` for example
```
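If you want to persist the result instead of playing it back, the raw response body is a complete WAV byte stream (given `"response_format": "wav"`) and can be written straight to disk. A minimal sketch, assuming `response` from the snippet above; the filename is arbitrary:

```python
def save_audio(content: bytes, path: str = "output.wav") -> str:
    """Write the raw audio bytes returned by the /audio/speech endpoint to disk."""
    with open(path, "wb") as f:
        f.write(content)
    return path

# e.g. save_audio(response.content, "paris.wav")
```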

#### Demo

A Gradio demo script ships with the vllm-omni repository. To run it against your server:

```sh
git clone https://github.com/vllm-project/vllm-omni.git && \
cd vllm-omni && \
uv pip install gradio==5.50 && \
python examples/online_serving/voxtral_tts/gradio_demo.py \
  --host <your-server-url> \
  --port 8000
```

Alternatively, you can try it out live here ➡️ [**HF Space**](https://huggingface.co/spaces/mistralai/voxtral-tts-demo).

## License

The voice references provided with this model, e.g. from the EARS, CML-TTS, IndicVoices-R, and Arabic Natural Audio datasets, are licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). The model therefore inherits the same license.

*You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.*