F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Paper: arXiv:2410.06885
A production-ready zero-shot voice cloning model based on the state-of-the-art F5-TTS architecture (Flow Matching + Diffusion Transformer).
| File | Description |
|---|---|
| `README.md` | This documentation |
| `config.json` | Model configuration and hyperparameters |
| `train_voice_clone.py` | Fine-tuning script – adapt it to your own voice data |
| `inference_voice_clone.py` | Local inference script – zero-shot voice cloning CLI |
| `voice_clone_f5tts.ipynb` | Jupyter notebook – ready for Colab / Kaggle |
Try it instantly at `rajkr-voice-clone-f5tts-demo.hf.space`, or open the notebook directly:

- **Colab**: open `voice_clone_f5tts.ipynb` (copy it to your Drive first)
- **Kaggle**: download `voice_clone_f5tts.ipynb` from this repo → upload it to Kaggle → enable the T4 GPU

Or follow the quick steps below:
```python
# 1. Enable GPU: Runtime → Change runtime type → GPU

# 2. Install
!pip install -q f5-tts soundfile

# 3. Download model (~1.3 GB)
from huggingface_hub import snapshot_download
snapshot_download("SWivid/F5-TTS", local_dir="./f5tts_model",
                  allow_patterns=["F5TTS_v1_Base/*"])

# 4. Clone a voice
from f5_tts.api import F5TTS
tts = F5TTS(ckpt_file="./f5tts_model/F5TTS_v1_Base/model_1250000.safetensors",
            vocab_file="./f5tts_model/F5TTS_v1_Base/vocab.txt")
wav, sr, _ = tts.infer(
    ref_file="/content/my_voice.wav",  # Upload your audio first
    ref_text="Exact transcript of your audio.",
    gen_text="Say this in the cloned voice!",
    nfe_step=32,
)

# 5. Save the result
import soundfile as sf
sf.write("output.wav", wav, sr)
```
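`tts.infer` already accepts fairly long `gen_text`, but for multi-paragraph scripts you may want explicit control over chunk boundaries. A minimal regex-based sentence splitter (an illustrative helper of my own, not part of the F5-TTS API):

```python
import re

def split_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence-ending punctuation, merging short pieces
    so each chunk stays under max_chars."""
    pieces = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + 1 + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `tts.infer()` separately and the resulting waveforms concatenated.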
```bash
pip install f5-tts soundfile

python -c "
from f5_tts.api import F5TTS
import soundfile as sf

tts = F5TTS()  # Auto-downloads the model on first run
wav, sr, _ = tts.infer(
    ref_file='my_voice.wav',
    ref_text='Hello, this is my voice.',
    gen_text='Hello from my local machine!',
)
sf.write('output.wav', wav, sr)
"
```
This repo provides a complete voice cloning pipeline built on F5-TTS v1 Base (335M parameters), a state-of-the-art open-source neural TTS model. It can clone a voice from just 3-10 seconds of reference audio.
| Component | Details |
|---|---|
| Type | Conditional Flow Matching (CFM) with Diffusion Transformer (DiT) |
| Params | 335M |
| Backbone | DiT (dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4) |
| Vocoder | Vocos (24kHz, 100 mel channels) |
| Training data | 95K hours of multilingual speech (Emilia EN+ZH) |
| Inference | Zero-shot voice cloning with 3-10s reference audio |
| RTF | ~0.15 (6.7x real-time capable) |
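Since cloning quality depends on the 3-10 s reference window noted above, it is worth validating clips before inference. A quick check using only the standard library (an illustrative helper, not part of F5-TTS; it assumes plain PCM wav input):

```python
import wave

def check_reference(path: str, min_s: float = 3.0, max_s: float = 10.0) -> float:
    """Return the clip duration in seconds; raise if outside the 3-10 s window."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    if not (min_s <= duration <= max_s):
        raise ValueError(
            f"Reference is {duration:.1f}s; F5-TTS works best with {min_s}-{max_s}s clips"
        )
    return duration
```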
```bash
# 1. Prepare your data:
# my_voice/
# ├── metadata.csv        # format: audio_path|text
# └── wavs/
#     ├── clip001.wav
#     └── clip002.wav

# 2. Run training
python train_voice_clone.py \
    --hf_dataset mythicinfinity/libritts_r \
    --hf_config clean \
    --hf_split train.clean.100 \
    --epochs 20 \
    --lr 1e-5
```
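The `metadata.csv` layout above (`audio_path|text`, one line per clip) can be generated with a short stdlib script; the folder layout matches the tree above, while the transcript mapping here is hypothetical:

```python
from pathlib import Path

def write_metadata(root: str, transcripts: dict[str, str]) -> Path:
    """Write metadata.csv mapping each wav under root/wavs to its transcript."""
    root_path = Path(root)
    lines = []
    for wav in sorted((root_path / "wavs").glob("*.wav")):
        text = transcripts.get(wav.stem)
        if text:  # skip clips without a transcript
            lines.append(f"wavs/{wav.name}|{text}")
    out = root_path / "metadata.csv"
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```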
```bash
pip install f5-tts

# Prepare dataset
python -m f5_tts.train.datasets.prepare_csv_wavs \
    /path/to/my_voice \
    /path/to/prepared_data/MyVoice_custom

# Fine-tune
python -m f5_tts.train.finetune_cli \
    --exp_name F5TTS_v1_Base \
    --dataset_name MyVoice \
    --tokenizer custom \
    --finetune \
    --learning_rate 1e-5 \
    --batch_size_per_gpu 38400 \
    --batch_size_type frame \
    --max_samples 64 \
    --epochs 20 \
    --num_warmup_updates 300 \
    --grad_accumulation_steps 2 \
    --logger tensorboard
```
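With `--batch_size_type frame`, `--batch_size_per_gpu 38400` counts mel frames rather than samples. Assuming the model's 24 kHz output and a mel hop length of 256 (the hop length is my assumption from the F5-TTS config), the frame budget translates into audio seconds per batch as:

```python
SAMPLE_RATE = 24_000   # model output rate (Vocos vocoder)
HOP_LENGTH = 256       # mel hop length (assumed from the F5-TTS config)

def frames_to_seconds(frames: int) -> float:
    """Convert a per-GPU frame budget into seconds of audio per batch."""
    return frames * HOP_LENGTH / SAMPLE_RATE

print(frames_to_seconds(38400))  # 409.6 seconds of audio per GPU per step
```

So at the default setting each GPU sees roughly 6.8 minutes of audio per optimizer step (before gradient accumulation).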
| Metric | Value |
|---|---|
| WER (test-clean) | ~1.87% |
| Speaker Similarity | SIM-o ~0.66 |
| Real-Time Factor | 0.15 (6.7x faster than real-time) |
| Minimum Reference | 3 seconds |
| Languages | English + Chinese (pretrained), adaptable to others |
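The WER figure above is word-level edit distance divided by reference length. A minimal implementation for sanity-checking your own fine-tunes (not the paper's exact evaluation pipeline, which uses a specific ASR model and text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, single-row dynamic programming.
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            row.append(min(prev_row[j] + 1,          # deletion
                           row[j - 1] + 1,           # insertion
                           prev_row[j - 1] + cost))  # substitution or match
        prev_row = row
    return prev_row[-1] / max(len(ref), 1)
```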
```bibtex
@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and Niu, Zhikang and Ma, Ziyang and Deng, Keqi and Wang, Chunhui and Zhao, Jian and Yu, Kai and Chen, Xie},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}
```
**License:** CC-BY-NC-4.0 (non-commercial use)

**Base model:** SWivid/F5-TTS