Commit ·
69f1e2d
0
Parent(s):
Duplicate from mistralai/Voxtral-4B-TTS-2603
Co-authored-by: Patrick von Platen <patrickvonplaten@users.noreply.huggingface.co>
- .gitattributes +37 -0
- README.md +169 -0
- consolidated.safetensors +3 -0
- params.json +130 -0
- tekken.json +3 -0
- voice_embedding/ar_male.pt +3 -0
- voice_embedding/casual_female.pt +3 -0
- voice_embedding/casual_male.pt +3 -0
- voice_embedding/cheerful_female.pt +3 -0
- voice_embedding/de_female.pt +3 -0
- voice_embedding/de_male.pt +3 -0
- voice_embedding/es_female.pt +3 -0
- voice_embedding/es_male.pt +3 -0
- voice_embedding/fr_female.pt +3 -0
- voice_embedding/fr_male.pt +3 -0
- voice_embedding/hi_female.pt +3 -0
- voice_embedding/hi_male.pt +3 -0
- voice_embedding/it_female.pt +3 -0
- voice_embedding/it_male.pt +3 -0
- voice_embedding/neutral_female.pt +3 -0
- voice_embedding/neutral_male.pt +3 -0
- voice_embedding/nl_female.pt +3 -0
- voice_embedding/nl_male.pt +3 -0
- voice_embedding/pt_female.pt +3 -0
- voice_embedding/pt_male.pt +3 -0
.gitattributes
ADDED
|
@@ -0,0 +1,37 @@
| 1 |
+
*.7z filter=lfs diff=lfs merge=lfs -text
|
| 2 |
+
*.arrow filter=lfs diff=lfs merge=lfs -text
|
| 3 |
+
*.bin filter=lfs diff=lfs merge=lfs -text
|
| 4 |
+
*.bz2 filter=lfs diff=lfs merge=lfs -text
|
| 5 |
+
*.ckpt filter=lfs diff=lfs merge=lfs -text
|
| 6 |
+
*.ftz filter=lfs diff=lfs merge=lfs -text
|
| 7 |
+
*.gz filter=lfs diff=lfs merge=lfs -text
|
| 8 |
+
*.h5 filter=lfs diff=lfs merge=lfs -text
|
| 9 |
+
*.joblib filter=lfs diff=lfs merge=lfs -text
|
| 10 |
+
*.lfs.* filter=lfs diff=lfs merge=lfs -text
|
| 11 |
+
*.mlmodel filter=lfs diff=lfs merge=lfs -text
|
| 12 |
+
*.model filter=lfs diff=lfs merge=lfs -text
|
| 13 |
+
*.msgpack filter=lfs diff=lfs merge=lfs -text
|
| 14 |
+
*.npy filter=lfs diff=lfs merge=lfs -text
|
| 15 |
+
*.npz filter=lfs diff=lfs merge=lfs -text
|
| 16 |
+
*.onnx filter=lfs diff=lfs merge=lfs -text
|
| 17 |
+
*.ot filter=lfs diff=lfs merge=lfs -text
|
| 18 |
+
*.parquet filter=lfs diff=lfs merge=lfs -text
|
| 19 |
+
*.pb filter=lfs diff=lfs merge=lfs -text
|
| 20 |
+
*.pickle filter=lfs diff=lfs merge=lfs -text
|
| 21 |
+
*.pkl filter=lfs diff=lfs merge=lfs -text
|
| 22 |
+
*.pt filter=lfs diff=lfs merge=lfs -text
|
| 23 |
+
*.pth filter=lfs diff=lfs merge=lfs -text
|
| 24 |
+
*.rar filter=lfs diff=lfs merge=lfs -text
|
| 25 |
+
*.safetensors filter=lfs diff=lfs merge=lfs -text
|
| 26 |
+
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
| 27 |
+
*.tar.* filter=lfs diff=lfs merge=lfs -text
|
| 28 |
+
*.tar filter=lfs diff=lfs merge=lfs -text
|
| 29 |
+
*.tflite filter=lfs diff=lfs merge=lfs -text
|
| 30 |
+
*.tgz filter=lfs diff=lfs merge=lfs -text
|
| 31 |
+
*.wasm filter=lfs diff=lfs merge=lfs -text
|
| 32 |
+
*.xz filter=lfs diff=lfs merge=lfs -text
|
| 33 |
+
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
+
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
+
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
tekken.json filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
Voxtral_TTS.pdf filter=lfs diff=lfs merge=lfs -text
|
README.md
ADDED
|
@@ -0,0 +1,169 @@
| 1 |
+
---
|
| 2 |
+
library_name: vllm
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
+
- fr
|
| 6 |
+
- es
|
| 7 |
+
- pt
|
| 8 |
+
- it
|
| 9 |
+
- nl
|
| 10 |
+
- de
|
| 11 |
+
- ar
|
| 12 |
+
- hi
|
| 13 |
+
license: cc-by-nc-4.0
|
| 14 |
+
inference: false
|
| 15 |
+
base_model:
|
| 16 |
+
- mistralai/Ministral-3-3B-Base-2512
|
| 17 |
+
extra_gated_description: >-
|
| 18 |
+
If you want to learn more about how we process your personal data, please read
|
| 19 |
+
our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
|
| 20 |
+
tags:
|
| 21 |
+
- mistral-common
|
| 22 |
+
pipeline_tag: text-to-speech
|
| 23 |
+
---
|
| 24 |
+
|
| 25 |
+
# Voxtral 4B TTS 2603
|
| 26 |
+
|
| 27 |
+
Voxtral TTS is a frontier, open-weights text-to-speech model that’s fast, instantly adaptable, and produces lifelike speech for voice agents. The model is released with BF16 weights and a set of reference voices. These voices are licensed under CC BY-NC 4.0, and the model inherits that license.
|
| 28 |
+
|
| 29 |
+
For more details, see our:
|
| 30 |
+
- [🔊 Demo](https://console.mistral.ai/build/audio/text-to-speech)
|
| 31 |
+
- [✍️ Blog post](https://mistral.ai/news/voxtral-tts)
|
| 32 |
+
- [🔬 Research Paper](https://arxiv.org/abs/2603.25551)
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
## Key Features
|
| 36 |
+
|
| 37 |
+
Voxtral TTS delivers enterprise-grade text-to-speech for production voice agents, with the following capabilities:
|
| 38 |
+
|
| 39 |
+
- **Realistic, expressive speech** with natural prosody and emotional range across 9 major languages, with support for diverse dialects
|
| 40 |
+
- **Text-to-Speech generation** with 20 preset voices and easy adaptation to new voices
|
| 41 |
+
- **Multilingual support**: English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, and Hindi
|
| 42 |
+
- **Very low latency** with fast time-to-first-audio, plus streaming and batch inference support
|
| 43 |
+
- **24 kHz audio output** in WAV, PCM, FLAC, MP3, AAC, and Opus formats
|
| 44 |
+
- **Production-ready performance** for high-throughput, real-time voice agent workflows
|
| 45 |
+
|
| 46 |
+
> [!Tip]
|
| 47 |
+
> For voice customization, visit our [AI Studio](https://console.mistral.ai/build/audio/text-to-speech).
|
| 48 |
+
|
| 49 |
+
### Use Cases
|
| 50 |
+
|
| 51 |
+
- Customer support and call center infrastructure.
|
| 52 |
+
- Financial services _(with a video demo on banking KYC voice agents)_.
|
| 53 |
+
- Manufacturing and industrial operations.
|
| 54 |
+
- Public services and government.
|
| 55 |
+
- Compliance and risk.
|
| 56 |
+
- Supply chain and logistics.
|
| 57 |
+
- Automotive and in-vehicle systems.
|
| 58 |
+
- Sales and marketing.
|
| 59 |
+
- Real-time translation.
|
| 60 |
+
|
| 61 |
+
> [!Warning]
|
| 62 |
+
> Responsible Use -
|
| 63 |
+
> You are responsible for complying with applicable laws and avoiding misuse.
|
| 64 |
+
|
| 65 |
+
## Benchmark Results
|
| 66 |
+
|
| 67 |
+
- Measured using [vllm_omni/examples/offline_inference/voxtral_tts/end2end.py](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/voxtral_tts).
|
| 68 |
+
- Input: 500-character text with a 10-second audio reference.
|
| 69 |
+
- Hardware: single NVIDIA H200.
|
| 70 |
+
- vLLM version: v0.18.0.
|
| 71 |
+
|
| 72 |
+
*Note*: The RTF (real-time factor) reported by `end2end.py` uses an inverted formula (higher = better). The table below converts it back to the standard RTF convention (lower = better).
|
| 73 |
+
|
| 74 |
+
| Concurrency | Latency | RTF | Throughput (char/s/GPU) |
|
| 75 |
+
|:-----------:|:-------:|:-----:|:-----------------------:|
|
| 76 |
+
| 1 | 70 ms | 0.103 | 119.14 |
|
| 77 |
+
| 16 | 331 ms | 0.237 | 879.11 |
|
| 78 |
+
| 32 | 552 ms | 0.302 | 1430.78 |
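To make the RTF convention in the note above concrete, here is a minimal sketch. It assumes the usual definition of real-time factor (processing time divided by the duration of the generated audio) and assumes that the inverted value reported by `end2end.py` is simply its reciprocal. The function names are illustrative and are not part of vLLM-Omni.

```python
def standard_rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Standard real-time factor: lower is better, below 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def from_inverted_rtf(inverted_rtf: float) -> float:
    """Convert an inverted RTF (assumed to be audio time / processing time) to the standard one."""
    if inverted_rtf <= 0:
        raise ValueError("inverted_rtf must be positive")
    return 1.0 / inverted_rtf


# Reading the table: an RTF of 0.103 means that generating ten seconds
# of audio takes about 0.103 * 10 seconds of processing time.
seconds_for_ten_seconds_of_audio = 0.103 * 10.0
```

Under these assumptions, every RTF in the table is below 1, so audio is produced faster than it plays back at all three concurrency levels.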
|
| 79 |
+
|
| 80 |
+
|
| 81 |
+
## Usage
|
| 82 |
+
|
| 83 |
+
The model can be deployed with the following library:
|
| 84 |
+
- [`vllm-omni`](https://github.com/vllm-project/vllm-omni) (recommended): see [here](#vllm-omni-recommended)
|
| 85 |
+
|
| 86 |
+
### vLLM Omni (recommended)
|
| 87 |
+
|
| 88 |
+
> [!Tip]
|
| 89 |
+
> We've worked hand-in-hand with the vLLM-Omni team to have production-grade support for Voxtral 4B TTS 2603 with vLLM-Omni.
|
| 90 |
+
> Special thanks go out to Han Gao, Hongsheng Liu, Roger Wang, and Yueqian Lin from the vLLM-Omni team.
|
| 91 |
+
|
| 92 |
+
|
| 93 |
+
**Installation**
|
| 94 |
+
|
| 95 |
+
Make sure to install [vllm](https://github.com/vllm-project/vllm) from the latest PyPI package (>= 0.18.0).
|
| 96 |
+
See [here](https://docs.vllm.ai/en/latest/getting_started/installation/) for a full installation guide.
|
| 97 |
+
|
| 98 |
+
```
|
| 99 |
+
uv pip install -U vllm
|
| 100 |
+
```
|
| 101 |
+
|
| 102 |
+
Next, install [`vllm-omni`](https://github.com/vllm-project/vllm-omni), version 0.18.0 or later.
|
| 103 |
+
|
| 104 |
+
```
|
| 105 |
+
uv pip install vllm-omni --upgrade # make sure to have >= 0.18.0
|
| 106 |
+
```
|
| 107 |
+
|
| 108 |
+
Alternatively, you can use a ready-to-go Docker image from [Docker Hub](https://hub.docker.com/layers/vllm/vllm-omni/v0.18.0/images/sha256-d855c9f3e06b1126e8a082229e5d2fef217e43c98d03569f8b9e50fa5c2d0a61).
|
| 109 |
+
|
| 110 |
+
|
| 111 |
+
Installing `vllm >= 0.18.0` should automatically install `mistral_common >= 1.10.0`, which you can verify by running:
|
| 112 |
+
|
| 113 |
+
```sh
|
| 114 |
+
python3 -c "import mistral_common; print(mistral_common.__version__)" # should print >= 1.10.0
|
| 115 |
+
```
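The one-liner above only prints the version. If you want a script to check the minimum itself, a minimal sketch using the standard library could look like the following. The helper names are hypothetical, the parser is deliberately simple (it ignores pre-release suffixes such as `rc1`), and it assumes the distribution is registered under the name `mistral_common` as written in this README.

```python
import re
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '1.10' compares equal to '1.10.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two dotted version strings numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


def check_installed(distribution: str, minimum: str) -> bool:
    """Return True only if the distribution is installed and new enough."""
    try:
        return meets_minimum(metadata.version(distribution), minimum)
    except metadata.PackageNotFoundError:
        return False
```

For example, `check_installed("mistral_common", "1.10.0")` returns `False` when the package is missing or older than 1.10.0. The numeric comparison matters here because a plain string comparison would rank `1.9.0` above `1.10.0`.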
|
| 116 |
+
|
| 117 |
+
#### Serve
|
| 118 |
+
|
| 119 |
+
Given its size and the BF16 format of its weights, `Voxtral-4B-TTS-2603` can run on a single GPU with >= 16 GB of memory.
|
| 120 |
+
|
| 121 |
+
```bash
|
| 122 |
+
vllm serve mistralai/Voxtral-4B-TTS-2603 --omni
|
| 123 |
+
```
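As a rough sanity check on the memory requirement, the weight footprint can be estimated from the size of `consolidated.safetensors` recorded in this commit (8004752248 bytes). This sketch assumes that all tensors are stored in BF16 at 2 bytes per value and ignores the small safetensors header, so treat the parameter count as an approximation.

```python
CHECKPOINT_BYTES = 8_004_752_248  # size of consolidated.safetensors in this commit
BYTES_PER_BF16 = 2                # bfloat16 stores each value in 2 bytes


def approx_param_count(checkpoint_bytes: int, bytes_per_value: int = BYTES_PER_BF16) -> int:
    """Slight overestimate: treats the whole file as tensor data."""
    return checkpoint_bytes // bytes_per_value


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


params = approx_param_count(CHECKPOINT_BYTES)  # about 4.0 billion values
weights_gib = bytes_to_gib(CHECKPOINT_BYTES)   # a little under 7.5 GiB
```

The weights alone therefore occupy roughly half of a 16 GB GPU. The remainder is presumably needed for the KV cache, activations, and runtime overhead, none of which this estimate covers.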
|
| 124 |
+
|
| 125 |
+
#### Client
|
| 126 |
+
|
| 127 |
+
```py
|
| 128 |
+
import io
|
| 129 |
+
import httpx
|
| 130 |
+
import soundfile as sf
|
| 131 |
+
|
| 132 |
+
BASE_URL = "http://<your-server-url>:8000/v1"
|
| 133 |
+
|
| 134 |
+
payload = {
|
| 135 |
+
    "input": "Paris is a beautiful city!",
|
| 136 |
+
    "model": "mistralai/Voxtral-4B-TTS-2603",
|
| 137 |
+
    "response_format": "wav",
|
| 138 |
+
    "voice": "casual_male",
|
| 139 |
+
}
|
| 140 |
+
|
| 141 |
+
response = httpx.post(f"{BASE_URL}/audio/speech", json=payload, timeout=120.0)
|
| 142 |
+
response.raise_for_status()
|
| 143 |
+
|
| 144 |
+
audio_array, sr = sf.read(io.BytesIO(response.content), dtype="float32")
|
| 145 |
+
print(f"Got audio: {len(audio_array)} samples at {sr} Hz")
|
| 146 |
+
|
| 147 |
+
# You can play the audio with a library such as `sounddevice`, e.g. `sounddevice.play(audio_array, sr)`.
|
| 148 |
+
```
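When sending many requests, it can help to validate the voice name before calling the server. The sketch below builds the same payload as the client above and checks the voice against the 20 preset names listed in this commit's `params.json`. The helper name is hypothetical. Only the `"wav"` format string appears in this README; the other spellings are assumed lowercase forms of the formats listed under Key Features.

```python
PRESET_VOICES = (
    "casual_female", "casual_male", "cheerful_female", "neutral_female",
    "neutral_male", "pt_male", "pt_female", "nl_male", "nl_female",
    "it_male", "it_female", "fr_male", "fr_female", "es_male", "es_female",
    "de_male", "de_female", "ar_male", "hi_male", "hi_female",
)

# Only "wav" is confirmed by the README client; the rest are assumptions.
ASSUMED_FORMATS = ("wav", "pcm", "flac", "mp3", "aac", "opus")

MODEL_ID = "mistralai/Voxtral-4B-TTS-2603"


def build_speech_payload(text: str, voice: str = "casual_male",
                         response_format: str = "wav") -> dict:
    """Return a JSON-serializable request body for the /audio/speech endpoint."""
    if not text.strip():
        raise ValueError("input text must not be empty")
    if voice not in PRESET_VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {PRESET_VOICES}")
    if response_format not in ASSUMED_FORMATS:
        raise ValueError(f"unsupported response_format {response_format!r}")
    return {
        "input": text,
        "model": MODEL_ID,
        "response_format": response_format,
        "voice": voice,
    }
```

The returned dictionary can be passed as `json=` to the `httpx.post` call in the client. Failing early on a misspelled voice avoids a round trip to the server.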
|
| 149 |
+
|
| 150 |
+
#### Demo
|
| 151 |
+
|
| 152 |
+
To run the Gradio demo against your server:
|
| 153 |
+
|
| 154 |
+
```sh
|
| 155 |
+
git clone https://github.com/vllm-project/vllm-omni.git && \
|
| 156 |
+
cd vllm-omni && \
|
| 157 |
+
uv pip install gradio==5.50 && \
|
| 158 |
+
python examples/online_serving/voxtral_tts/gradio_demo.py \
|
| 159 |
+
--host <your-server-url> \
|
| 160 |
+
--port 8000
|
| 161 |
+
```
|
| 162 |
+
|
| 163 |
+
Alternatively, you can try it out live here ➡️ [**HF Space**](https://huggingface.co/spaces/mistralai/voxtral-tts-demo).
|
| 164 |
+
|
| 165 |
+
## License
|
| 166 |
+
|
| 167 |
+
The voice references provided with this model, drawn from datasets such as EARS, CML-TTS, IndicVoices-R, and Arabic Natural Audio, are licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). The model therefore inherits the same license.
|
| 168 |
+
|
| 169 |
+
*You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.*
|
consolidated.safetensors
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:66c4fd998db10e1a6d9cc5baa10e6264bf10701ec22ccdc0822c7dcc45dbe55b
|
| 3 |
+
size 8004752248
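The three added lines above are a Git LFS pointer, not the weights themselves: they record the spec version, a SHA-256 object ID, and the size in bytes of the real file. A minimal sketch of reading such a pointer follows. The function name and the returned field names are illustrative and are not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<hash algorithm>:<hex digest>'.
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:66c4fd998db10e1a6d9cc5baa10e6264bf10701ec22ccdc0822c7dcc45dbe55b
size 8004752248
"""

info = parse_lfs_pointer(POINTER)
```

After downloading the real file, its SHA-256 digest and byte size can be compared against `info["digest"]` and `info["size_bytes"]` to confirm the download is complete. The same layout applies to `tekken.json` and every `voice_embedding/*.pt` entry in this commit.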
|
params.json
ADDED
|
@@ -0,0 +1,130 @@
| 1 |
+
{
|
| 2 |
+
"dim": 3072,
|
| 3 |
+
"n_layers": 26,
|
| 4 |
+
"head_dim": 128,
|
| 5 |
+
"hidden_dim": 9216,
|
| 6 |
+
"n_heads": 32,
|
| 7 |
+
"n_kv_heads": 8,
|
| 8 |
+
"fp8_matmul": false,
|
| 9 |
+
"use_biases": false,
|
| 10 |
+
"causal": true,
|
| 11 |
+
"rope_theta": 1000000.0,
|
| 12 |
+
"norm_eps": 1e-05,
|
| 13 |
+
"init": "NO_INIT",
|
| 14 |
+
"dropout": 0.0,
|
| 15 |
+
"vocab_size": 131072,
|
| 16 |
+
"model_parallel": 1,
|
| 17 |
+
"is_sequence_parallel": false,
|
| 18 |
+
"context_parallel": 1,
|
| 19 |
+
"tied_embeddings": true,
|
| 20 |
+
"shard_on_vocab_dim": false,
|
| 21 |
+
"model_pipelining": 1,
|
| 22 |
+
"virtual_model_pipelining": 1,
|
| 23 |
+
"fused_rms_norm": true,
|
| 24 |
+
"checkpoint": false,
|
| 25 |
+
"use_cache": false,
|
| 26 |
+
"max_concurrent_tokens": 65536,
|
| 27 |
+
"learnable_sinks": false,
|
| 28 |
+
"rms_norm": "PRE",
|
| 29 |
+
"cust_bwd": false,
|
| 30 |
+
"recompute_w1_every": 0,
|
| 31 |
+
"recompute_w3_every": 0,
|
| 32 |
+
"recompute_attn_every": 0,
|
| 33 |
+
"freeze_nonembedding": false,
|
| 34 |
+
"fsdp2": true,
|
| 35 |
+
"dp_replicate_size": 1,
|
| 36 |
+
"zero2": true,
|
| 37 |
+
"fsdp_optimize_backward_concat_if_pp": true,
|
| 38 |
+
"attention_type": "FLASH_ATTN_3",
|
| 39 |
+
"multimodal": {
|
| 40 |
+
"bos_token_id": 1,
|
| 41 |
+
"audio_model_args": {
|
| 42 |
+
"semantic_codebook_size": 8192,
|
| 43 |
+
"acoustic_codebook_size": 21,
|
| 44 |
+
"n_acoustic_codebook": 36,
|
| 45 |
+
"audio_encoding_args": {
|
| 46 |
+
"codebook_pattern": "parallel",
|
| 47 |
+
"interleave_audio_tokens_per_segment": 8192,
|
| 48 |
+
"interleave_text_tokens_per_segment": 8192,
|
| 49 |
+
"single_trailing_segment": false,
|
| 50 |
+
"num_codebooks": 37,
|
| 51 |
+
"sampling_rate": 24000,
|
| 52 |
+
"frame_rate": 12.5
|
| 53 |
+
},
|
| 54 |
+
"audio_token_id": 24,
|
| 55 |
+
"begin_audio_token_id": 25,
|
| 56 |
+
"input_embedding_concat_type": "sum",
|
| 57 |
+
"acoustic_transformer_args": {
|
| 58 |
+
"input_dim": 3072,
|
| 59 |
+
"dim": 3072,
|
| 60 |
+
"n_layers": 3,
|
| 61 |
+
"head_dim": 128,
|
| 62 |
+
"hidden_dim": 9216,
|
| 63 |
+
"n_heads": 32,
|
| 64 |
+
"n_kv_heads": 8,
|
| 65 |
+
"use_biases": false,
|
| 66 |
+
"rope_theta": 10000.0,
|
| 67 |
+
"sigma": 1e-05,
|
| 68 |
+
"sigma_max": 1.0
|
| 69 |
+
},
|
| 70 |
+
"p_uncond": 0.0,
|
| 71 |
+
"text_feature_bugged": false,
|
| 72 |
+
"condition_dropped_token_id": 42
|
| 73 |
+
},
|
| 74 |
+
"audio_tokenizer_args": {
|
| 75 |
+
"channels": 1,
|
| 76 |
+
"sampling_rate": 24000,
|
| 77 |
+
"pretransform_patch_size": 240,
|
| 78 |
+
"patch_proj_kernel_size": 7,
|
| 79 |
+
"semantic_codebook_size": 8192,
|
| 80 |
+
"semantic_dim": 256,
|
| 81 |
+
"acoustic_codebook_size": 21,
|
| 82 |
+
"acoustic_dim": 36,
|
| 83 |
+
"conv_weight_norm": true,
|
| 84 |
+
"causal": true,
|
| 85 |
+
"attn_sliding_window_size": 16,
|
| 86 |
+
"half_attn_window_upon_downsampling": true,
|
| 87 |
+
"dim": 1024,
|
| 88 |
+
"hidden_dim": 4096,
|
| 89 |
+
"head_dim": 128,
|
| 90 |
+
"n_heads": 8,
|
| 91 |
+
"n_kv_heads": 8,
|
| 92 |
+
"qk_norm_eps": 1e-06,
|
| 93 |
+
"qk_norm": true,
|
| 94 |
+
"use_biases": false,
|
| 95 |
+
"norm_eps": 0.01,
|
| 96 |
+
"layer_scale": true,
|
| 97 |
+
"layer_scale_init": 0.01,
|
| 98 |
+
"decoder_transformer_lengths_str": "2,2,2,2",
|
| 99 |
+
"decoder_convs_kernels_str": "3,4,4,4",
|
| 100 |
+
"decoder_convs_strides_str": "1,2,2,2",
|
| 101 |
+
"voice": {
|
| 102 |
+
"casual_female": 0,
|
| 103 |
+
"casual_male": 1,
|
| 104 |
+
"cheerful_female": 2,
|
| 105 |
+
"neutral_female": 3,
|
| 106 |
+
"neutral_male": 4,
|
| 107 |
+
"pt_male": 5,
|
| 108 |
+
"pt_female": 6,
|
| 109 |
+
"nl_male": 7,
|
| 110 |
+
"nl_female": 8,
|
| 111 |
+
"it_male": 9,
|
| 112 |
+
"it_female": 10,
|
| 113 |
+
"fr_male": 11,
|
| 114 |
+
"fr_female": 12,
|
| 115 |
+
"es_male": 13,
|
| 116 |
+
"es_female": 14,
|
| 117 |
+
"de_male": 15,
|
| 118 |
+
"de_female": 16,
|
| 119 |
+
"ar_male": 17,
|
| 120 |
+
"hi_male": 18,
|
| 121 |
+
"hi_female": 19
|
| 122 |
+
}
|
| 123 |
+
}
|
| 124 |
+
},
|
| 125 |
+
"torch_compile_swiglu_noncust_bwd": false,
|
| 126 |
+
"override_parameters_str": "",
|
| 127 |
+
"max_seq_len": 65536,
|
| 128 |
+
"model_type": "voxtral_tts",
|
| 129 |
+
"max_position_embeddings": 128000
|
| 130 |
+
}
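A few of the audio settings in `params.json` can be cross-checked with simple arithmetic. The sketch below copies values from the file. The interpretation is an assumption rather than something the commit states: that `num_codebooks` is one semantic codebook plus the acoustic ones, and that the patch size multiplied by the decoder strides gives the number of samples per frame.

```python
SAMPLING_RATE_HZ = 24_000    # audio_encoding_args.sampling_rate
FRAME_RATE_HZ = 12.5         # audio_encoding_args.frame_rate
N_ACOUSTIC_CODEBOOKS = 36    # audio_model_args.n_acoustic_codebook
NUM_CODEBOOKS = 37           # audio_encoding_args.num_codebooks
PATCH_SIZE = 240             # audio_tokenizer_args.pretransform_patch_size
DECODER_STRIDES = "1,2,2,2"  # audio_tokenizer_args.decoder_convs_strides_str

# Audio samples covered by one frame of codes.
samples_per_frame = SAMPLING_RATE_HZ / FRAME_RATE_HZ

# Product of the decoder convolution strides.
stride_product = 1
for stride in DECODER_STRIDES.split(","):
    stride_product *= int(stride)

# Under the assumed reading, this should equal samples_per_frame.
patch_times_stride = PATCH_SIZE * stride_product

# Codes emitted per second of audio across all codebooks.
codes_per_second = NUM_CODEBOOKS * FRAME_RATE_HZ
```

The figures agree with that reading: 24000 / 12.5 and 240 × 8 both give 1920 samples per frame, and 37 is 36 + 1. This is a consistency check on the configuration, not a description of the model's implementation.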
|
tekken.json
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:587989c9f56676b35e7d16d6fc61461301e402d908392a8ce16f0349f61b56d7
|
| 3 |
+
size 14894731
|
voice_embedding/ar_male.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f44603f6433cbb4b2abc7f496a382632171118557a175cb385df168a0dc20464
|
| 3 |
+
size 413253
|
voice_embedding/casual_female.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:780637984644064ee22e60b3152e0cd43fa64b2dcd39d9cab6cd2c62f2ce0342
|
| 3 |
+
size 1316421
|
voice_embedding/casual_male.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:7a056c9156ad0058e9d1368363bf3a25a9fcd8fe53e211ffac97de0bbffb3504
|
| 3 |
+
size 904773
|
voice_embedding/cheerful_female.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:75fe69c8fcb5a0883a3d0bc1215b28f28cc0586aff5732eeebd2b254e8288253
|
| 3 |
+
size 812613
|
voice_embedding/de_female.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:282fc191fda496de2ebf2c809acb44056dde6fbe2f1cb99e85e67985bc6f6619
|
| 3 |
+
size 904773
|
voice_embedding/de_male.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:bd75d9fd3ffb9df0481668ce8781287a58f552e2388c5bbc0efdd4ebff0421bf
|
| 3 |
+
size 1003077
|
voice_embedding/es_female.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:90e01ad34f231cc881987c3b1c0728853fd9b904e52c296a07c71a132949d8a6
|
| 3 |
+
size 849477
|
voice_embedding/es_male.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ec116d8f4a102291bae3d9156d7c3222d9e1056020bf5894a7504bfc09640fdf
|
| 3 |
+
size 1279557
|
voice_embedding/fr_female.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:82628d963670f919aa302f9c8a7336c745418a145934edb211810b07d9c8b852
|
| 3 |
+
size 597573
|
voice_embedding/fr_male.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:73395073472be3fb586b487705ac4ebf35f99db664f56400137e8bfcfe4cd8a8
|
| 3 |
+
size 597573
|
voice_embedding/hi_female.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:aa7718cdd6f65735226bcc701379fdec64f36d0207ca79fc4c61b445ca7bde82
|
| 3 |
+
size 529989
|
voice_embedding/hi_male.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:c3cde36ab9a336f67fd33b46435cdf645cff9e10117f13bcbcb67b44b80a11b0
|
| 3 |
+
size 579141
|
voice_embedding/it_female.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:29e1714bdb3ce0726e590ce1862fbe953c168ba51a05bc7daa8cb35cddc312b4
|
| 3 |
+
size 1058373
|
voice_embedding/it_male.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b98ba2253e2a0b872e20d33d29cab32263cc81062c01e3f5a8696de89e6f47b1
|
| 3 |
+
size 1033797
|
voice_embedding/neutral_female.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:2a03f4008614da7b1505a360a6b0d58d94dd72b0b0f49bf216e39de5eb733c61
|
| 3 |
+
size 1340997
|
voice_embedding/neutral_male.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:439df812990e6e4bcc6010ca12f12df90916e862bc1e1b56036d6433b892834e
|
| 3 |
+
size 1039941
|
voice_embedding/nl_female.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b1bad34c22e0563f05c1f13c1db96680778c297aea6a5c0bb202950648b796b6
|
| 3 |
+
size 898629
|
voice_embedding/nl_male.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:43fd2de89dc08503f37ae3107273eeb3f2a6195d705ff58d2228b3b5642ff7de
|
| 3 |
+
size 849477
|
voice_embedding/pt_female.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:82f1006b2cd69118cba67085daa1795d9dab90b9bc70e1392e77f82cb616c9ce
|
| 3 |
+
size 1076805
|
voice_embedding/pt_male.pt
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:7b30dca6c5d16c7b10a1c09c53e971c1bb1fab65692d7244876fbdc4ad52ba18
|
| 3 |
+
size 886341
|