+---
+tags:
+- f5-tts
+- text-to-speech
+- voice-cloning
+- flow-matching
+- zero-shot-tts
+license: cc-by-nc-4.0
+datasets:
+- mythicinfinity/libritts_r
+- amphion/Emilia-Dataset
+base_model: SWivid/F5-TTS
+pipeline_tag: text-to-speech
+language:
+- en
+- zh
+---
+# 🎙️ Voice Clone Model (F5-TTS Based)
+A **zero-shot voice cloning** model based on the **F5-TTS** architecture (Flow Matching + Diffusion Transformer).
+## Model Description
+This repo provides a voice cloning pipeline built on **F5-TTS v1 Base** (335M parameters), an open-source neural TTS model. It clones a voice from **3-10 seconds** of reference audio plus a transcript of that audio.
+### Architecture
+| Component | Details |
+|-----------|---------|
+| **Type** | Conditional Flow Matching (CFM) with Diffusion Transformer (DiT) |
+| **Params** | 335M |
+| **Backbone** | DiT (dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4) |
+| **Vocoder** | Vocos (24kHz, 100 mel channels) |
+| **Training** | Trained on 95K hours of multilingual speech (Emilia EN+ZH) |
+| **Inference** | Zero-shot voice cloning with 3-10s reference audio |
+| **RTF** | ~0.15 (6.7x real-time capable) |
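To make the RTF row concrete: real-time factor is synthesis time divided by the duration of the audio produced, so an RTF of 0.15 means each second of speech takes roughly 0.15 s to generate. The sketch below only illustrates that arithmetic; the function names are hypothetical and not part of `f5_tts`.

```python
def speedup_from_rtf(rtf: float) -> float:
    """Return how many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock synthesis time for a clip of the given length."""
    return audio_seconds * rtf


if __name__ == "__main__":
    print(round(speedup_from_rtf(0.15), 1))        # speed-up implied by RTF 0.15
    print(round(synthesis_seconds(10.0, 0.15), 2)) # seconds needed for a 10 s clip
```

Actual throughput depends on hardware, `nfe_step`, and batch size, so treat the table value as a reference point rather than a guarantee.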
+## Quick Start
+### 1. Install
+```bash
+pip install f5-tts
+```
+### 2. Clone a Voice (Python API)
+```python
+import soundfile as sf
+from f5_tts.api import F5TTS
+# Load the model with its default settings
+tts = F5TTS()
+# Clone a voice from reference audio
+wav, sr, _ = tts.infer(
+    ref_file="reference_speaker.wav",       # 3-10 seconds of target voice
+    ref_text="The exact transcript of the reference audio.",
+    gen_text="This is the text you want to synthesize in the cloned voice!",
+)
+# Save the generated waveform at the sample rate returned by infer()
+sf.write("output_cloned.wav", wav, sr)
+```
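Since the quality of a clone depends on the reference clip staying inside the 3-10 second range, it can help to check the clip before calling `infer`. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the `wave` module only reads uncompressed PCM WAV files.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_usable_reference(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """Check that a clip falls inside the 3-10 s range recommended in this card."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

A call such as `is_usable_reference("reference_speaker.wav")` returns `False` for clips that are too short or too long, which is a cue to trim or re-record before cloning.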
+### 3. Full Inference Control
+```python
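+import soundfile as sf
+from cached_path import cached_path
+from f5_tts.model import DiT
+from f5_tts.infer.utils_infer import (
+    load_model,
+    load_vocoder,
+    preprocess_ref_audio_text,
+    infer_process,
+)
+# load_model takes a config dict and a local checkpoint path (not a model name and repo id).
+# The config mirrors the Backbone row of the Architecture table.
+model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
+ckpt_path = str(cached_path("hf://SWivid/F5-TTS/F5TTS_v1_Base/model_1250000.safetensors"))
+# Leaving vocab_file unset uses the vocabulary bundled with the package
+model = load_model(DiT, model_cfg, ckpt_path)
+vocoder = load_vocoder("vocos")
+# Preprocess reference audio
+ref_audio, ref_text = preprocess_ref_audio_text("my_voice.wav", "I am recording this sample.")
+# Generate cloned speech
+wave, sr, _ = infer_process(
+    ref_audio,
+    ref_text,
+    "Hello, this sounds exactly like me!",
+    model,
+    vocoder,
+    nfe_step=32,          # Higher = better quality, slower
+    speed=1.0,
+    sway_sampling_coef=-1.0,  # F5-TTS Sway Sampling for best quality
+)
+sf.write("cloned_output.wav", wave, sr)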
+```
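The `nfe_step` and `sway_sampling_coef` arguments together decide where the ODE solver evaluates the flow. As described in the F5-TTS paper, Sway Sampling warps uniformly spaced times `u` in [0, 1] with `t = u + s * (cos(pi/2 * u) - 1 + u)`; a negative `s` places more steps early in the trajectory. The sketch below reproduces that schedule in plain Python for illustration. The function name is hypothetical, and the library applies the same warp to tensors internally.

```python
import math
from typing import List, Optional


def sway_timesteps(nfe_step: int, sway_sampling_coef: Optional[float] = -1.0) -> List[float]:
    """Return nfe_step + 1 flow times in [0, 1], optionally warped by Sway Sampling."""
    if nfe_step < 1:
        raise ValueError("nfe_step must be at least 1")
    uniform = [i / nfe_step for i in range(nfe_step + 1)]
    if sway_sampling_coef is None:
        return uniform  # plain uniform schedule
    s = sway_sampling_coef
    return [u + s * (math.cos(math.pi / 2 * u) - 1 + u) for u in uniform]


if __name__ == "__main__":
    steps = sway_timesteps(32, -1.0)
    # With s = -1 the halfway step sits well before t = 0.5
    print(round(steps[16], 3))
```

With `s = -1` the warp reduces to `t = 1 - cos(pi/2 * u)`, so the schedule still starts at 0 and ends at 1 but spends more evaluations near the start.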
+## Fine-Tuning Your Own Voice
+The repo includes a complete fine-tuning pipeline to adapt the model to a specific speaker:
+### Option A: Python Script
+```bash
+# Download train_voice_clone.py from this repo, then run it on your own dataset.
+# 1. Prepare your data in this structure:
+# my_voice/
+#   ├── metadata.csv      # format: audio_path|text
+#   └── wavs/
+#       ├── clip001.wav
+#       └── clip002.wav
+# 2. Use the provided training script
+python train_voice_clone.py --dataset my_voice --epochs 20 --lr 1e-5
+```
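Before starting a training run, it may be worth checking that `metadata.csv` matches the layout above. The sketch below assumes each non-empty line is `audio_path|text`, that audio paths are relative to the dataset directory, and that there is no header row (a header line would be reported as a missing file). The function name is hypothetical and not part of the training script.

```python
from pathlib import Path
from typing import List


def validate_metadata(dataset_dir: str) -> List[str]:
    """Return a list of problems found in <dataset_dir>/metadata.csv (empty if none)."""
    root = Path(dataset_dir)
    problems: List[str] = []
    lines = (root / "metadata.csv").read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        audio_path, sep, text = line.partition("|")
        if not sep:
            problems.append(f"line {lineno}: missing '|' separator")
            continue
        if not text.strip():
            problems.append(f"line {lineno}: empty transcript")
        if not (root / audio_path.strip()).is_file():
            problems.append(f"line {lineno}: audio file not found: {audio_path.strip()}")
    return problems
```

Running `validate_metadata("my_voice")` and printing the result gives a quick list of rows to fix before any GPU time is spent.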
+### Option B: CLI Fine-Tuning (Official)
+```bash
+pip install f5-tts
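+# Prepare dataset. The output directory is named <dataset_name>_<tokenizer>
+# so that it matches --dataset_name MyVoice and --tokenizer custom below.
+# Add --pretrain to this command only when training from scratch; omit it for fine-tuning.
+python -m f5_tts.train.datasets.prepare_csv_wavs \
+    /path/to/my_voice \
+    data/MyVoice_custom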
+# Fine-tune. --pretrain takes a local checkpoint file; when it is omitted,
+# --finetune should start from the default F5TTS_v1_Base checkpoint (model_1250000.safetensors).
+python -m f5_tts.train.finetune_cli \
+    --exp_name F5TTS_v1_Base \
+    --dataset_name MyVoice \
+    --tokenizer custom \
+    --tokenizer_path data/MyVoice_custom/vocab.txt \
+    --finetune \
+    --learning_rate 1e-5 \
+    --batch_size_per_gpu 38400 \
+    --batch_size_type frame \
+    --max_samples 64 \
+    --epochs 20 \
+    --num_warmup_updates 300 \
+    --save_per_updates 500 \
+    --grad_accumulation_steps 2 \
+    --logger tensorboard
+```
+## Training Details
+This model is fine-tuned from the pretrained **F5-TTS v1 Base** checkpoint (`model_1250000.safetensors`) with the following setup:
+- **Dataset**: `mythicinfinity/libritts_r` (clean-100 split) — ~100h of clean English speech
+- **Learning rate**: 1e-5 (conservative, prevents catastrophic forgetting)
+- **Epochs**: 10
+- **Batch size**: 19,200 frames/GPU (frame-based dynamic batching)
+- **Gradient accumulation**: 2 steps
+- **Hardware**: NVIDIA A100 80GB
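Frame-based batching is easier to reason about when converted to seconds of audio. The sketch below does that conversion for the settings listed above. The 24 kHz sample rate comes from the Vocoder row of the Architecture table; the hop length of 256 samples per mel frame is an assumption that this card does not state, so adjust it if your configuration differs.

```python
SAMPLE_RATE = 24_000  # Hz, from the Vocoder row of the Architecture table
HOP_LENGTH = 256      # samples per mel frame; assumed, not stated in this card


def frames_to_seconds(frames: int, hop_length: int = HOP_LENGTH,
                      sample_rate: int = SAMPLE_RATE) -> float:
    """Convert a number of mel frames to seconds of audio."""
    return frames * hop_length / sample_rate


if __name__ == "__main__":
    per_gpu_batch = frames_to_seconds(19_200)  # audio seconds in one GPU batch
    per_update = per_gpu_batch * 2             # 2 gradient accumulation steps, single GPU
    print(round(per_gpu_batch, 1), round(per_update, 1))
```

Under these assumptions one optimizer update sees a few minutes of audio, which gives a rough way to estimate how many updates an epoch over a dataset of known length will take.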
+## Performance
+| Metric | Value |
+|--------|-------|
+| **WER** (test-clean) | ~1.87% |
+| **Speaker Similarity** | SIM-o ~0.66 |
+| **Real-Time Factor** | 0.15 (6.7x faster than real-time) |
+| **Minimum Reference** | 3 seconds |
+| **Languages** | English + Chinese (pretrained), adaptable to others |
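For readers unfamiliar with the WER row: word error rate is the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. The sketch below shows only the metric itself, with simple lowercasing and whitespace tokenization. It is not the evaluation pipeline behind the table, which would also need an ASR model to transcribe the generated audio and a text normalization step.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions) over reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over six reference words
    print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))
```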
+## References
+- [F5-TTS Paper](https://arxiv.org/abs/2410.06885) — *F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching*
+- [Official Repo](https://github.com/SWivid/F5-TTS)
+- [Original Model](https://huggingface.co/SWivid/F5-TTS)
+## Citation
+```bibtex
+@article{chen2024f5tts,
+  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
+  author={Chen, Yushen and Niu, Zhikang and Ma, Ziyang and Deng, Keqi and Wang, Chunhui and Zhao, Jian and Yu, Kai and Chen, Xie},
+  journal={arXiv preprint arXiv:2410.06885},
+  year={2024}
+}
+```
+## License
+This model is released under the [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/) license, which permits non-commercial use only.