LittleDesignSolution patrickvonplaten committed on
Commit 69f1e2d · 0 Parent(s):

Duplicate from mistralai/Voxtral-4B-TTS-2603

Co-authored-by: Patrick von Platen <patrickvonplaten@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,37 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tekken.json filter=lfs diff=lfs merge=lfs -text
+ Voxtral_TTS.pdf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,169 @@
+ ---
+ library_name: vllm
+ language:
+ - en
+ - fr
+ - es
+ - pt
+ - it
+ - nl
+ - de
+ - ar
+ - hi
+ license: cc-by-nc-4.0
+ inference: false
+ base_model:
+ - mistralai/Ministral-3-3B-Base-2512
+ extra_gated_description: >-
+   If you want to learn more about how we process your personal data, please read
+   our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
+ tags:
+ - mistral-common
+ pipeline_tag: text-to-speech
+ ---
+
+ # Voxtral 4B TTS 2603
+
+ Voxtral TTS is a frontier, open-weights text-to-speech model that is fast and instantly adaptable, and that produces lifelike speech for voice agents. The model is released with BF16 weights and a set of reference voices. These voices are licensed under CC BY-NC 4.0, which is the license that the model inherits.
+
+ For more details, see our:
+ - [🔊 Demo](https://console.mistral.ai/build/audio/text-to-speech)
+ - [✍️ Blog post](https://mistral.ai/news/voxtral-tts)
+ - [🔬 Research Paper](https://arxiv.org/abs/2603.25551)
+
+
+ ## Key Features
+
+ Voxtral TTS delivers enterprise-grade text-to-speech for production voice agents, with the following capabilities:
+
+ - **Realistic, expressive speech** with natural prosody and emotional range across 9 major languages, with support for diverse dialects
+ - **Text-to-Speech generation** with 20 preset voices and easy adaptation to new voices
+ - **Multilingual support**: English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, and Hindi
+ - **Very low latency** with fast time-to-first-audio, plus streaming and batch inference support
+ - **24 kHz audio output** in WAV, PCM, FLAC, MP3, AAC, and Opus formats
+ - **Production-ready performance** for high-throughput, real-time voice agent workflows
+
+ > [!Tip]
+ > For voice customization, visit our [AI Studio](https://console.mistral.ai/build/audio/text-to-speech).
+
+ ### Use Cases
+
+ - Customer support and call center infrastructure.
+ - Financial services _(with a video demo on banking KYC voice agents)_.
+ - Manufacturing and industrial operations.
+ - Public services and government.
+ - Compliance and risk.
+ - Supply chain and logistics.
+ - Automotive and in-vehicle systems.
+ - Sales and marketing.
+ - Real-time translation.
+
+ > [!Warning]
+ > Responsible Use:
+ > You are responsible for complying with applicable laws and avoiding misuse.
+
+ ## Benchmark Results
+
+ - Measured using [vllm_omni/examples/offline_inference/voxtral_tts/end2end.py](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/voxtral_tts).
+ - Input: 500-character text with a 10-second audio reference.
+ - Hardware: single NVIDIA H200.
+ - vLLM version: v0.18.0.
+
+ *Note*: The RTF in `end2end.py` uses an inverted formula (higher = better). The table below converts it back to the standard RTF convention (lower = better).
+
+ | Concurrency | Latency | RTF | Throughput (char/s/GPU) |
+ |:-----------:|:-------:|:-----:|:-----------------------:|
+ | 1 | 70 ms | 0.103 | 119.14 |
+ | 16 | 331 ms | 0.237 | 879.11 |
+ | 32 | 552 ms | 0.302 | 1430.78 |
+
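The conversion mentioned in the note can be sketched as follows. This is a minimal illustration, assuming the standard definition RTF = processing time / audio duration and assuming the "inverted" value is simply its reciprocal (audio duration / processing time). The function names are hypothetical and are not taken from `end2end.py`.

```python
def standard_rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Standard real-time factor: lower is better, below 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def from_inverted_rtf(inverted_rtf: float) -> float:
    """Convert an inverted RTF (assumed: audio time / processing time) to the standard one."""
    if inverted_rtf <= 0:
        raise ValueError("inverted_rtf must be positive")
    return 1.0 / inverted_rtf
```

Under this assumption, the RTF of 0.103 at concurrency 1 corresponds to an inverted value of about 9.7, i.e. roughly 9.7 seconds of audio produced per second of processing.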
+
+ ## Usage
+
+ The model can be deployed with the following library:
+ - [`vllm-omni (recommended)`](https://github.com/vllm-project/vllm-omni): See [here](#vllm-omni-recommended)
+
+ ### vLLM Omni (recommended)
+
+ > [!Tip]
+ > We've worked hand-in-hand with the vLLM-Omni team to have production-grade support for Voxtral 4B TTS 2603 with vLLM-Omni.
+ > Special thanks go out to Han Gao, Hongsheng Liu, Roger Wang, and Yueqian Lin from the vLLM-Omni team.
+
+
+ **Installation**
+
+ Make sure to install [vllm](https://github.com/vllm-project/vllm) from the latest (>= 0.18.0) PyPI package.
+ See [here](https://docs.vllm.ai/en/latest/getting_started/installation/) for a full installation guide.
+
+ ```
+ uv pip install -U vllm
+ ```
+
+ Next, install [`vllm-omni`](https://github.com/vllm-project/vllm-omni) at version `>= 0.18.0`.
+
+ ```
+ uv pip install vllm-omni --upgrade # make sure to have >= 0.18.0
+ ```
+
+ Alternatively, you can use a ready-to-go Docker image from [Docker Hub](https://hub.docker.com/layers/vllm/vllm-omni/v0.18.0/images/sha256-d855c9f3e06b1126e8a082229e5d2fef217e43c98d03569f8b9e50fa5c2d0a61).
+
+
+ Installing `vllm >= 0.18.0` should automatically install `mistral_common >= 1.10.0`, which you can verify by running:
+
+ ```sh
+ python3 -c "import mistral_common; print(mistral_common.__version__)" # should print >= 1.10.0
+ ```
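When checking the version in a script rather than by eye, note that comparing version strings lexicographically gives the wrong answer (`"1.9.3" > "1.10.0"` as strings). A minimal sketch of a numeric comparison, assuming plain `X.Y.Z` release versions without pre-release suffixes; `version_tuple` and `version_at_least` are hypothetical helpers, not part of `mistral_common`:

```python
def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '1.10.0' into (1, 10, 0); assumes purely numeric dot-separated parts."""
    return tuple(int(part) for part in version.split("."))


def version_at_least(installed: str, required: str) -> bool:
    """Compare two versions numerically, component by component."""
    return version_tuple(installed) >= version_tuple(required)
```

It could be called as `version_at_least(mistral_common.__version__, "1.10.0")` once the package is importable.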
+
+ #### Serve
+
+ Given its size and the BF16 format of the weights, `Voxtral-4B-TTS-2603` can run on a single GPU with >= 16 GB of memory.
+
+ ```bash
+ vllm serve mistralai/Voxtral-4B-TTS-2603 --omni
+ ```
+
+ #### Client
+
+ ```py
+ import io
+ import httpx
+ import soundfile as sf
+
+ BASE_URL = "http://<your-server-url>:8000/v1"
+
+ payload = {
+     "input": "Paris is a beautiful city!",
+     "model": "mistralai/Voxtral-4B-TTS-2603",
+     "response_format": "wav",
+     "voice": "casual_male",
+ }
+
+ response = httpx.post(f"{BASE_URL}/audio/speech", json=payload, timeout=120.0)
+ response.raise_for_status()
+
+ audio_array, sr = sf.read(io.BytesIO(response.content), dtype="float32")
+ print(f"Got audio: {len(audio_array)} samples at {sr} Hz")
+
+ # you can play the audio with a library like `sounddevice.play` for example
+ ```
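The request body above can be built by a small helper that validates its arguments before anything is sent. This is a minimal sketch: the voice names are the 20 presets listed under `voice` in `params.json`, and the formats are those named under Key Features. `build_speech_payload` is a hypothetical helper, and the lowercase format strings other than `"wav"` are an assumption about what the server accepts.

```python
# Preset voices as listed in params.json.
PRESET_VOICES = {
    "casual_female", "casual_male", "cheerful_female", "neutral_female",
    "neutral_male", "pt_male", "pt_female", "nl_male", "nl_female",
    "it_male", "it_female", "fr_male", "fr_female", "es_male", "es_female",
    "de_male", "de_female", "ar_male", "hi_male", "hi_female",
}

# Assumed spellings of the output formats named in the README.
ASSUMED_FORMATS = {"wav", "pcm", "flac", "mp3", "aac", "opus"}


def build_speech_payload(
    text: str,
    voice: str = "casual_male",
    response_format: str = "wav",
    model: str = "mistralai/Voxtral-4B-TTS-2603",
) -> dict:
    """Return a JSON-serializable body for POST {BASE_URL}/audio/speech."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if voice not in PRESET_VOICES:
        raise ValueError(f"unknown preset voice: {voice!r}")
    if response_format not in ASSUMED_FORMATS:
        raise ValueError(f"unsupported response format: {response_format!r}")
    return {
        "input": text,
        "model": model,
        "response_format": response_format,
        "voice": voice,
    }
```

The returned dictionary can be passed as `json=` to `httpx.post` exactly as in the client example.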
+
+ #### Demo
+
+ To run the Gradio demo:
+
+ ```sh
+ git clone https://github.com/vllm-project/vllm-omni.git && \
+ cd vllm-omni && \
+ uv pip install gradio==5.50 && \
+ python examples/online_serving/voxtral_tts/gradio_demo.py \
+   --host <your-server-url> \
+   --port 8000
+ ```
+
+ Alternatively, you can try it out live here ➡️ [**HF Space**](https://huggingface.co/spaces/mistralai/voxtral-tts-demo).
+
+ ## License
+
+ The voice references provided with this model, e.g. from the EARS, CML-TTS, IndicVoices-R and Arabic Natural Audio datasets, are licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). Thus, this model inherits the same license.
+
+ *You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.*
consolidated.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:66c4fd998db10e1a6d9cc5baa10e6264bf10701ec22ccdc0822c7dcc45dbe55b
+ size 8004752248
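The `size` field of this pointer gives a quick sanity check on the model name. A rough sketch, assuming the file holds almost exclusively BF16 tensors (2 bytes per parameter) and ignoring the small safetensors header; `approx_param_count` is a hypothetical helper:

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def approx_param_count(size_bytes: int, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Estimate the parameter count from the checkpoint size (header overhead ignored)."""
    return size_bytes / bytes_per_param


size_bytes = 8_004_752_248  # "size" from the LFS pointer above
params = approx_param_count(size_bytes)
print(f"~{params / 1e9:.2f}B parameters")  # prints "~4.00B parameters"
```

Under these assumptions the weights come to about 4.0 billion parameters, which matches the "4B" in the model name, and the roughly 8 GB of weights takes up about half of the 16 GB of GPU memory the README asks for.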
params.json ADDED
@@ -0,0 +1,130 @@
+ {
+   "dim": 3072,
+   "n_layers": 26,
+   "head_dim": 128,
+   "hidden_dim": 9216,
+   "n_heads": 32,
+   "n_kv_heads": 8,
+   "fp8_matmul": false,
+   "use_biases": false,
+   "causal": true,
+   "rope_theta": 1000000.0,
+   "norm_eps": 1e-05,
+   "init": "NO_INIT",
+   "dropout": 0.0,
+   "vocab_size": 131072,
+   "model_parallel": 1,
+   "is_sequence_parallel": false,
+   "context_parallel": 1,
+   "tied_embeddings": true,
+   "shard_on_vocab_dim": false,
+   "model_pipelining": 1,
+   "virtual_model_pipelining": 1,
+   "fused_rms_norm": true,
+   "checkpoint": false,
+   "use_cache": false,
+   "max_concurrent_tokens": 65536,
+   "learnable_sinks": false,
+   "rms_norm": "PRE",
+   "cust_bwd": false,
+   "recompute_w1_every": 0,
+   "recompute_w3_every": 0,
+   "recompute_attn_every": 0,
+   "freeze_nonembedding": false,
+   "fsdp2": true,
+   "dp_replicate_size": 1,
+   "zero2": true,
+   "fsdp_optimize_backward_concat_if_pp": true,
+   "attention_type": "FLASH_ATTN_3",
+   "multimodal": {
+     "bos_token_id": 1,
+     "audio_model_args": {
+       "semantic_codebook_size": 8192,
+       "acoustic_codebook_size": 21,
+       "n_acoustic_codebook": 36,
+       "audio_encoding_args": {
+         "codebook_pattern": "parallel",
+         "interleave_audio_tokens_per_segment": 8192,
+         "interleave_text_tokens_per_segment": 8192,
+         "single_trailing_segment": false,
+         "num_codebooks": 37,
+         "sampling_rate": 24000,
+         "frame_rate": 12.5
+       },
+       "audio_token_id": 24,
+       "begin_audio_token_id": 25,
+       "input_embedding_concat_type": "sum",
+       "acoustic_transformer_args": {
+         "input_dim": 3072,
+         "dim": 3072,
+         "n_layers": 3,
+         "head_dim": 128,
+         "hidden_dim": 9216,
+         "n_heads": 32,
+         "n_kv_heads": 8,
+         "use_biases": false,
+         "rope_theta": 10000.0,
+         "sigma": 1e-05,
+         "sigma_max": 1.0
+       },
+       "p_uncond": 0.0,
+       "text_feature_bugged": false,
+       "condition_dropped_token_id": 42
+     },
+     "audio_tokenizer_args": {
+       "channels": 1,
+       "sampling_rate": 24000,
+       "pretransform_patch_size": 240,
+       "patch_proj_kernel_size": 7,
+       "semantic_codebook_size": 8192,
+       "semantic_dim": 256,
+       "acoustic_codebook_size": 21,
+       "acoustic_dim": 36,
+       "conv_weight_norm": true,
+       "causal": true,
+       "attn_sliding_window_size": 16,
+       "half_attn_window_upon_downsampling": true,
+       "dim": 1024,
+       "hidden_dim": 4096,
+       "head_dim": 128,
+       "n_heads": 8,
+       "n_kv_heads": 8,
+       "qk_norm_eps": 1e-06,
+       "qk_norm": true,
+       "use_biases": false,
+       "norm_eps": 0.01,
+       "layer_scale": true,
+       "layer_scale_init": 0.01,
+       "decoder_transformer_lengths_str": "2,2,2,2",
+       "decoder_convs_kernels_str": "3,4,4,4",
+       "decoder_convs_strides_str": "1,2,2,2",
+       "voice": {
+         "casual_female": 0,
+         "casual_male": 1,
+         "cheerful_female": 2,
+         "neutral_female": 3,
+         "neutral_male": 4,
+         "pt_male": 5,
+         "pt_female": 6,
+         "nl_male": 7,
+         "nl_female": 8,
+         "it_male": 9,
+         "it_female": 10,
+         "fr_male": 11,
+         "fr_female": 12,
+         "es_male": 13,
+         "es_female": 14,
+         "de_male": 15,
+         "de_female": 16,
+         "ar_male": 17,
+         "hi_male": 18,
+         "hi_female": 19
+       }
+     }
+   },
+   "torch_compile_swiglu_noncust_bwd": false,
+   "override_parameters_str": "",
+   "max_seq_len": 65536,
+   "model_type": "voxtral_tts",
+   "max_position_embeddings": 128000
+ }
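A few derived quantities follow directly from these values. The sketch below is plain arithmetic on fields from `params.json`. Two readings are assumptions rather than anything the file states: that `num_codebooks` (37) is one semantic codebook plus the 36 acoustic ones, and that the patch size multiplied by the decoder strides gives the number of samples per frame.

```python
import math

# Values copied from params.json.
sampling_rate = 24_000            # Hz
frame_rate = 12.5                 # audio frames per second
num_codebooks = 37
n_acoustic_codebook = 36
pretransform_patch_size = 240
decoder_convs_strides_str = "1,2,2,2"
n_heads, n_kv_heads, head_dim, dim = 32, 8, 128, 3072

samples_per_frame = sampling_rate / frame_rate        # 1920.0 samples per frame
codes_per_second = frame_rate * num_codebooks         # 462.5 codes per second of audio
semantic_codebooks = num_codebooks - n_acoustic_codebook  # 1 (assumed interpretation)

# Assumed interpretation: patching and strided convolutions compose multiplicatively.
total_stride = math.prod(int(s) for s in decoder_convs_strides_str.split(","))  # 8
samples_from_strides = pretransform_patch_size * total_stride                   # 1920

# Grouped-query attention: several query heads share each key/value head.
queries_per_kv_head = n_heads // n_kv_heads           # 4
attention_width = n_heads * head_dim                  # 4096, wider than dim (3072)
```

The two routes to the samples-per-frame figure agree (1920), which supports the stride interpretation but does not prove it.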
tekken.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:587989c9f56676b35e7d16d6fc61461301e402d908392a8ce16f0349f61b56d7
+ size 14894731
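Each of these three-line stubs is a Git LFS pointer: the spec version, the SHA-256 digest of the real file, and its size in bytes. A minimal sketch of how downloaded bytes could be checked against such a pointer, using only the standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names:

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse 'key value' lines of a Git LFS pointer into a small dictionary."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Check that the bytes have the size and SHA-256 digest the pointer declares."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )
```

For multi-gigabyte files such as `consolidated.safetensors`, hashing in chunks with `hashlib.sha256().update(...)` would avoid loading the whole file into memory.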
voice_embedding/ar_male.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f44603f6433cbb4b2abc7f496a382632171118557a175cb385df168a0dc20464
+ size 413253
voice_embedding/casual_female.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:780637984644064ee22e60b3152e0cd43fa64b2dcd39d9cab6cd2c62f2ce0342
+ size 1316421
voice_embedding/casual_male.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7a056c9156ad0058e9d1368363bf3a25a9fcd8fe53e211ffac97de0bbffb3504
+ size 904773
voice_embedding/cheerful_female.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:75fe69c8fcb5a0883a3d0bc1215b28f28cc0586aff5732eeebd2b254e8288253
+ size 812613
voice_embedding/de_female.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:282fc191fda496de2ebf2c809acb44056dde6fbe2f1cb99e85e67985bc6f6619
+ size 904773
voice_embedding/de_male.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bd75d9fd3ffb9df0481668ce8781287a58f552e2388c5bbc0efdd4ebff0421bf
+ size 1003077
voice_embedding/es_female.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:90e01ad34f231cc881987c3b1c0728853fd9b904e52c296a07c71a132949d8a6
+ size 849477
voice_embedding/es_male.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ec116d8f4a102291bae3d9156d7c3222d9e1056020bf5894a7504bfc09640fdf
+ size 1279557
voice_embedding/fr_female.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:82628d963670f919aa302f9c8a7336c745418a145934edb211810b07d9c8b852
+ size 597573
voice_embedding/fr_male.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:73395073472be3fb586b487705ac4ebf35f99db664f56400137e8bfcfe4cd8a8
+ size 597573
voice_embedding/hi_female.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aa7718cdd6f65735226bcc701379fdec64f36d0207ca79fc4c61b445ca7bde82
+ size 529989
voice_embedding/hi_male.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c3cde36ab9a336f67fd33b46435cdf645cff9e10117f13bcbcb67b44b80a11b0
+ size 579141
voice_embedding/it_female.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:29e1714bdb3ce0726e590ce1862fbe953c168ba51a05bc7daa8cb35cddc312b4
+ size 1058373
voice_embedding/it_male.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b98ba2253e2a0b872e20d33d29cab32263cc81062c01e3f5a8696de89e6f47b1
+ size 1033797
voice_embedding/neutral_female.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2a03f4008614da7b1505a360a6b0d58d94dd72b0b0f49bf216e39de5eb733c61
+ size 1340997
voice_embedding/neutral_male.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:439df812990e6e4bcc6010ca12f12df90916e862bc1e1b56036d6433b892834e
+ size 1039941
voice_embedding/nl_female.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b1bad34c22e0563f05c1f13c1db96680778c297aea6a5c0bb202950648b796b6
+ size 898629
voice_embedding/nl_male.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:43fd2de89dc08503f37ae3107273eeb3f2a6195d705ff58d2228b3b5642ff7de
+ size 849477
voice_embedding/pt_female.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:82f1006b2cd69118cba67085daa1795d9dab90b9bc70e1392e77f82cb616c9ce
+ size 1076805
voice_embedding/pt_male.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7b30dca6c5d16c7b10a1c09c53e971c1bb1fab65692d7244876fbdc4ad52ba18
+ size 886341