---
library_name: vllm
language:
- en
- fr
- es
- pt
- it
- nl
- de
- ar
- hi
license: cc-by-nc-4.0
inference: false
base_model:
- mistralai/Ministral-3-3B-Base-2512
extra_gated_description: >-
If you want to learn more about how we process your personal data, please read
our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
tags:
- mistral-common
pipeline_tag: text-to-speech
---
# Voxtral 4B TTS 2603
Voxtral TTS is a frontier, open-weights text-to-speech model that’s fast, instantly adaptable, and produces lifelike speech for voice agents. The model is released with BF16 weights and a set of reference voices. These voices are licensed under CC BY-NC 4.0, which is the license the model inherits.
For more details, see our:
- [🔊 Demo](https://console.mistral.ai/build/audio/text-to-speech)
- [✍️ Blog post](https://mistral.ai/news/voxtral-tts)
- [🔬 Research Paper](https://arxiv.org/abs/2603.25551)
## Key Features
Voxtral TTS delivers enterprise-grade text-to-speech for production voice agents, with the following capabilities:
- **Realistic, expressive speech** with natural prosody and emotional range across 9 major languages, with support for diverse dialects
- **Text-to-Speech generation** with 20 preset voices and easy adaptation to new voices
- **Multilingual support**: English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, and Hindi
- **Very low latency** with fast time-to-first-audio, plus streaming and batch inference support
- **24 kHz audio output** in WAV, PCM, FLAC, MP3, AAC, and Opus formats
- **Production-ready performance** for high-throughput, real-time voice agent workflows
> [!Tip]
> For voice customization, visit our [AI Studio](https://console.mistral.ai/build/audio/text-to-speech).
### Use Cases
- Customer support and call center infrastructure.
- Financial services, with a video demo of banking KYC voice agents.
- Manufacturing and industrial operations.
- Public services and government.
- Compliance and risk.
- Supply chain and logistics.
- Automotive and in-vehicle systems.
- Sales and marketing.
- Real-time translation.
> [!Warning]
> Responsible Use -
> You are responsible for complying with applicable laws and avoiding misuse.
## Benchmark Results
- Measured using [vllm_omni/examples/offline_inference/voxtral_tts/end2end.py](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/voxtral_tts).
- Input: 500-character text with a 10-second audio reference.
- Hardware: single NVIDIA H200.
- vllm version: v0.18.0.
*Note*: The RTF reported by `end2end.py` uses an inverted formula (higher = better). The table below converts it back to the standard RTF convention (lower = better).
| Concurrency | Latency | RTF | Throughput (char/s/GPU) |
|:-----------:|:-------:|:-----:|:-----------------------:|
| 1 | 70 ms | 0.103 | 119.14 |
| 16 | 331 ms | 0.237 | 879.11 |
| 32 | 552 ms | 0.302 | 1430.78 |
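The conversion between the two conventions is a simple reciprocal. A minimal sketch (the helper name is ours, not part of the benchmark script):

```python
def to_standard_rtf(inverted_rtf: float) -> float:
    """Convert the inverted RTF from end2end.py (audio duration / processing
    time, higher = better) to the standard convention (processing time /
    audio duration, lower = better)."""
    return 1.0 / inverted_rtf

# e.g. an inverted RTF of ~9.7 corresponds to a standard RTF of ~0.103
print(round(to_standard_rtf(9.7), 3))
```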
## Usage
The model can be deployed with the following libraries:
- [`vllm-omni (recommended)`](https://github.com/vllm-project/vllm-omni): See [here](#vllm-omni-recommended)
### vLLM Omni (recommended)
> [!Tip]
> We've worked hand-in-hand with the vLLM-Omni team to provide production-grade support for Voxtral 4B TTS 2603 in vLLM-Omni.
> Special thanks go to Han Gao, Hongsheng Liu, Roger Wang, and Yueqian Lin from the vLLM-Omni team.
**Installation**
Make sure to install [vllm](https://github.com/vllm-project/vllm) from the latest PyPI package (>= 0.18.0).
See [here](https://docs.vllm.ai/en/latest/getting_started/installation/) for a full installation guide.
```sh
uv pip install -U vllm
```
Next, install [`vllm-omni`](https://github.com/vllm-project/vllm-omni) with `vllm-omni >= 0.18.0`.
```sh
uv pip install vllm-omni --upgrade # make sure to have >= 0.18.0
```
Alternatively, you can use a ready-to-go Docker image from [Docker Hub](https://hub.docker.com/layers/vllm/vllm-omni/v0.18.0/images/sha256-d855c9f3e06b1126e8a082229e5d2fef217e43c98d03569f8b9e50fa5c2d0a61).
Installing `vllm >= 0.18.0` should automatically install `mistral_common >= 1.10.0` which you can verify by running:
```sh
python3 -c "import mistral_common; print(mistral_common.__version__)" # should print >= 1.10.0
```
#### Serve
Due to its size and the BF16 weight format, `Voxtral-4B-TTS-2603` can run on a single GPU with >= 16 GB of memory.
```bash
vllm serve mistralai/Voxtral-4B-TTS-2603 --omni
```
#### Client
```py
import io

import httpx
import soundfile as sf

BASE_URL = "http://<your-server-url>:8000/v1"

payload = {
    "input": "Paris is a beautiful city!",
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "response_format": "wav",
    "voice": "casual_male",
}

# Request speech synthesis and decode the returned WAV bytes.
response = httpx.post(f"{BASE_URL}/audio/speech", json=payload, timeout=120.0)
response.raise_for_status()

audio_array, sr = sf.read(io.BytesIO(response.content), dtype="float32")
print(f"Got audio: {len(audio_array)} samples at {sr} Hz")
# You can play the audio with a library like `sounddevice` (e.g. `sounddevice.play`).
```
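Since the model outputs 24 kHz audio, the clip duration follows directly from the sample count the snippet prints. A small sketch (the helper name is ours):

```python
def duration_seconds(num_samples: int, sample_rate: int = 24_000) -> float:
    # Duration of a mono clip: number of samples divided by samples per second.
    return num_samples / sample_rate

print(duration_seconds(48_000))  # 2.0 seconds at 24 kHz
```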
#### Demo
A Gradio demo is available in the `vllm-omni` repository. To run it:
```sh
git clone https://github.com/vllm-project/vllm-omni.git && \
cd vllm-omni && \
uv pip install gradio==5.50 && \
python examples/online_serving/voxtral_tts/gradio_demo.py \
--host <your-server-url> \
--port 8000
```
Alternatively you can also try it out live here ➡️ [**HF Space**](https://huggingface.co/spaces/mistralai/voxtral-tts-demo).
## License
The voice references provided with this model are licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/), e.g. from the EARS, CML-TTS, IndicVoices-R, and Arabic Natural Audio datasets. The model therefore inherits the same license.
*You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.* |