rajkr committed on
Commit cf199d1 · verified
1 Parent(s): 8502fd7

Upload README.md

Files changed (1)
  1. README.md +191 -0
README.md ADDED
@@ -0,0 +1,191 @@
---
tags:
- f5-tts
- text-to-speech
- voice-cloning
- flow-matching
- zero-shot-tts
license: cc-by-nc-4.0
datasets:
- mythicinfinity/libritts_r
- amphion/Emilia-Dataset
base_model: SWivid/F5-TTS
pipeline_tag: text-to-speech
language:
- en
- zh
---

# 🎙️ Voice Clone Model (F5-TTS Based)

A **zero-shot voice cloning** model based on the **F5-TTS** architecture (Flow Matching + Diffusion Transformer).

## Model Description

This repo provides a complete voice cloning pipeline built on **F5-TTS v1 Base** (335M parameters), one of the strongest open-source neural TTS models at the time of writing. It clones a voice from just **3-10 seconds** of reference audio.

### Architecture

| Component | Details |
|-----------|---------|
| **Type** | Conditional Flow Matching (CFM) with Diffusion Transformer (DiT) |
| **Params** | 335M |
| **Backbone** | DiT (dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4) |
| **Vocoder** | Vocos (24kHz, 100 mel channels) |
| **Training** | Trained on 95K hours of multilingual speech (Emilia EN+ZH) |
| **Inference** | Zero-shot voice cloning with 3-10s reference audio |
| **RTF** | ~0.15 (6.7x real-time capable) |

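The RTF (real-time factor) row can be read as synthesis time divided by the duration of the generated audio. A minimal sketch of that arithmetic follows; the helper names are illustrative and not part of `f5-tts`, and the actual speed depends on hardware and on the number of sampling steps.

```python
def synthesis_seconds(audio_seconds: float, rtf: float = 0.15) -> float:
    """Estimated wall-clock time needed to synthesize `audio_seconds` of speech."""
    return audio_seconds * rtf


def realtime_speedup(rtf: float = 0.15) -> float:
    """How many times faster than real time a given RTF corresponds to."""
    return 1.0 / rtf


if __name__ == "__main__":
    print(f"10 s of audio: ~{synthesis_seconds(10.0):.1f} s to generate")
    print(f"speedup: ~{realtime_speedup():.1f}x real time")
```

With the quoted RTF of 0.15, ten seconds of speech would take roughly 1.5 seconds to generate under these assumptions.
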
## Quick Start

### 1. Install

```bash
pip install f5-tts
```

### 2. Clone a Voice (Python API)

```python
import soundfile as sf
from f5_tts.api import F5TTS

# Load model
tts = F5TTS()

# Clone a voice from reference audio
wav, sr, _ = tts.infer(
    ref_file="reference_speaker.wav",  # 3-10 seconds of target voice
    ref_text="The exact transcript of the reference audio.",
    gen_text="This is the text you want to synthesize in the cloned voice!",
)

sf.write("output_cloned.wav", wav, sr)
```

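Since the reference clip is expected to be 3-10 seconds long, it can help to check its duration before running inference. The sketch below uses only the standard-library `wave` module, so it handles PCM WAV files only; the function names and the default thresholds are illustrative assumptions taken from the range quoted in this README.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from its header."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_good_reference(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """True if the clip length falls inside the recommended reference range."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

A call such as `is_good_reference("reference_speaker.wav")` could then guard the `tts.infer(...)` call shown earlier.
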
### 3. Full Inference Control

```python
import soundfile as sf
from cached_path import cached_path  # installed as a dependency of f5-tts
from f5_tts.model import DiT
from f5_tts.infer.utils_infer import (
    load_model,
    load_vocoder,
    preprocess_ref_audio_text,
    infer_process,
)

# Load model and vocoder. load_model takes the backbone config as a dict and a
# local checkpoint path (argument details may differ between f5-tts versions).
model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
ckpt_path = str(cached_path("hf://SWivid/F5-TTS/F5TTS_v1_Base/model_1250000.safetensors"))
model = load_model(DiT, model_cfg, ckpt_path)  # uses the default vocab file
vocoder = load_vocoder("vocos")

# Preprocess reference audio
ref_audio, ref_text = preprocess_ref_audio_text("my_voice.wav", "I am recording this sample.")

# Generate cloned speech
wave, sr, _ = infer_process(
    ref_audio,
    ref_text,
    "Hello, this sounds exactly like me!",
    model,
    vocoder,
    nfe_step=32,              # Higher = better quality, slower
    speed=1.0,
    sway_sampling_coef=-1.0,  # F5-TTS Sway Sampling for best quality
)

sf.write("cloned_output.wav", wave, sr)
```

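The `sway_sampling_coef` argument controls how the flow-matching time steps are spaced. In the F5-TTS paper, uniformly spaced points `u` in [0, 1] are remapped as `t = u + s * (cos(pi/2 * u) - 1 + u)`, and a negative `s` places more steps near the start of the flow. The standalone sketch below illustrates that mapping in pure Python; the function name is hypothetical and this is not the library's own code path.

```python
import math


def sway_timesteps(nfe_step: int = 32, s: float = -1.0) -> list[float]:
    """Remap nfe_step + 1 uniform time points with the sway sampling function."""
    uniform = [i / nfe_step for i in range(nfe_step + 1)]
    return [u + s * (math.cos(math.pi / 2 * u) - 1 + u) for u in uniform]


if __name__ == "__main__":
    # Compare the uniform grid (s=0) with the sway-sampled grid (s=-1).
    print([round(t, 3) for t in sway_timesteps(nfe_step=8, s=0.0)])
    print([round(t, 3) for t in sway_timesteps(nfe_step=8, s=-1.0)])
```

With `s = -1.0` the midpoint `u = 0.5` maps to about 0.29, so more than half of the steps are spent in the first third of the trajectory.
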
## Fine-Tuning Your Own Voice

The repo includes a complete fine-tuning pipeline to adapt the model to a specific speaker:

### Option A: Python Script

```bash
# Download this repo's training script,
# then run it with your custom dataset.

# 1. Prepare your data in this structure:
# my_voice/
# ├── metadata.csv   # format: audio_path|text
# └── wavs/
#     ├── clip001.wav
#     └── clip002.wav

# 2. Use the provided training script
python train_voice_clone.py --dataset my_voice --epochs 20 --lr 1e-5
```

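One way to produce a `metadata.csv` in the `audio_path|text` layout is a small helper that pairs each clip with its transcript. This is a minimal sketch that assumes the transcripts are already available as a Python dict keyed by file name; the function name, the dict input, and the header row are illustrative assumptions and are not taken from this repo's training script.

```python
from pathlib import Path


def write_metadata(dataset_dir: str, transcripts: dict[str, str]) -> Path:
    """Write metadata.csv with one `wavs/<clip>|<text>` line per existing clip."""
    root = Path(dataset_dir)
    lines = ["audio_path|text"]  # header row (assumed format)
    for name in sorted(transcripts):
        if not (root / "wavs" / name).is_file():
            continue  # skip transcripts that have no matching audio file
        # Replace the delimiter inside the text so each line has exactly one "|"
        text = transcripts[name].strip().replace("|", " ")
        lines.append(f"wavs/{name}|{text}")
    out = root / "metadata.csv"
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

For example, `write_metadata("my_voice", {"clip001.wav": "Hello there."})` would write a header line plus one entry, provided `my_voice/wavs/clip001.wav` exists.
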
### Option B: CLI Fine-Tuning (Official)

```bash
pip install f5-tts

# Prepare dataset (add --pretrain only when preparing data for pretraining; omit it for fine-tuning)
python -m f5_tts.train.datasets.prepare_csv_wavs \
    /path/to/my_voice \
    /path/to/prepared_data/MyVoice_custom

# Fine-tune (--tokenizer_path should point to the vocab.txt written by the step above)
python -m f5_tts.train.finetune_cli \
    --exp_name F5TTS_v1_Base \
    --dataset_name MyVoice \
    --tokenizer custom \
    --tokenizer_path data/MyVoice_custom/vocab.txt \
    --finetune \
    --pretrain hf://SWivid/F5-TTS/F5TTS_v1_Base/model_1250000.safetensors \
    --learning_rate 1e-5 \
    --batch_size_per_gpu 38400 \
    --batch_size_type frame \
    --max_samples 64 \
    --epochs 20 \
    --num_warmup_updates 300 \
    --save_per_updates 500 \
    --grad_accumulation_steps 2 \
    --logger tensorboard
```

## Training Details

This model is fine-tuned from the pretrained **F5-TTS v1 Base** checkpoint (`model_1250000.safetensors`) with the following setup:

- **Dataset**: `mythicinfinity/libritts_r` (clean-100 split), ~100h of clean English speech
- **Learning rate**: 1e-5 (conservative, to limit catastrophic forgetting)
- **Epochs**: 10
- **Batch size**: 19,200 frames/GPU (frame-based dynamic batching)
- **Gradient accumulation**: 2 steps
- **Hardware**: NVIDIA A100 80GB

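Frame-based batch sizes are easier to reason about once converted to seconds of audio. The rough conversion below assumes the 24 kHz sample rate from the architecture table, a mel hop length of 256 samples, and a single GPU. The hop length and GPU count are not stated in this README, so treat both as assumptions.

```python
SAMPLE_RATE = 24_000  # Hz, from the vocoder row in the architecture table
HOP_LENGTH = 256      # samples per mel frame (assumed; not stated in this README)


def frames_to_seconds(frames: int) -> float:
    """Convert a number of mel frames to seconds of audio."""
    return frames * HOP_LENGTH / SAMPLE_RATE


def effective_frames(frames_per_gpu: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Frames that contribute to one optimizer step."""
    return frames_per_gpu * grad_accum * num_gpus


if __name__ == "__main__":
    per_step = effective_frames(19_200, grad_accum=2)
    print(per_step, "frames per optimizer step")
    print(f"~{frames_to_seconds(per_step):.1f} s of audio per optimizer step")
```

Under these assumptions, 19,200 frames per GPU with 2 accumulation steps corresponds to 38,400 frames, or roughly 410 seconds of audio, per optimizer step.
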
## Performance

| Metric | Value |
|--------|-------|
| **WER** (test-clean) | ~1.87% |
| **Speaker Similarity** | SIM-o ~0.66 |
| **Real-Time Factor** | 0.15 (6.7x faster than real-time) |
| **Minimum Reference** | 3 seconds |
| **Languages** | English + Chinese (pretrained), adaptable to others |

## References

- [F5-TTS Paper](https://arxiv.org/abs/2410.06885): *F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching*
- [Official Repo](https://github.com/SWivid/F5-TTS)
- [Original Model](https://huggingface.co/SWivid/F5-TTS)

## Citation

```bibtex
@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and Niu, Zhikang and Ma, Ziyang and Deng, Keqi and Wang, Chunhui and Zhao, Jian and Yu, Kai and Chen, Xie},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}
```

## License

This model follows the [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/) license (non-commercial use).