---
tags:
- f5-tts
- text-to-speech
- voice-cloning
- flow-matching
- zero-shot-tts
license: cc-by-nc-4.0
datasets:
- mythicinfinity/libritts_r
- amphion/Emilia-Dataset
base_model: SWivid/F5-TTS
pipeline_tag: text-to-speech
language:
- en
- zh
---

# 🎙️ Voice Clone Model (F5-TTS Based)

A production-ready **zero-shot voice cloning** model based on the state-of-the-art **F5-TTS** architecture (Flow Matching + Diffusion Transformer).

## 📦 Files in This Repo

| File | Description |
|------|-------------|
| `README.md` | This documentation |
| `config.json` | Model configuration and hyperparameters |
| `train_voice_clone.py` | **Fine-tuning script** — adapt to your own voice data |
| `inference_voice_clone.py` | **Local inference script** — zero-shot voice cloning CLI |
| `voice_clone_f5tts.ipynb` | **📓 Jupyter Notebook** — ready for Colab / Kaggle |

## 🚀 Quick Start Options

### Option 1: Hugging Face Space (No Setup)

Try it instantly at [rajkr-voice-clone-f5tts-demo.hf.space](https://rajkr-voice-clone-f5tts-demo.hf.space).

### Option 2: Google Colab / Kaggle (Free GPU)

Open the notebook directly:

- **Colab**: [Open in Colab](https://colab.research.google.com/github/rajkr/voice-clone-f5tts/blob/main/voice_clone_f5tts.ipynb) *(upload `voice_clone_f5tts.ipynb` to your Drive first)*
- **Kaggle**: Download `voice_clone_f5tts.ipynb` from this repo → upload to Kaggle → enable a T4 GPU

Or follow the quick steps below:

```python
# 1. Enable GPU: Runtime → Change runtime type → GPU

# 2. Install dependencies
!pip install -q f5-tts soundfile

# 3. Download the model (~1.3 GB)
from huggingface_hub import snapshot_download
snapshot_download("SWivid/F5-TTS", local_dir="./f5tts_model",
                  allow_patterns=["F5TTS_v1_Base/*"])

# 4. Clone a voice
from f5_tts.api import F5TTS
tts = F5TTS(
    ckpt_file="./f5tts_model/F5TTS_v1_Base/model_1250000.safetensors",
    vocab_file="./f5tts_model/F5TTS_v1_Base/vocab.txt",
)
wav, sr, _ = tts.infer(
    ref_file="/content/my_voice.wav",  # Upload your audio first
    ref_text="Exact transcript of your audio.",
    gen_text="Say this in the cloned voice!",
    nfe_step=32,
)

import soundfile as sf
sf.write("output.wav", wav, sr)
```

### Option 3: Local Machine (GPU Recommended)

```bash
pip install f5-tts soundfile
python -c "
from f5_tts.api import F5TTS
import soundfile as sf

tts = F5TTS()  # Auto-downloads the model on first run
wav, sr, _ = tts.infer(
    ref_file='my_voice.wav',
    ref_text='Hello, this is my voice.',
    gen_text='Hello from my local machine!',
)
sf.write('output.wav', wav, sr)
"
```

## Model Description

This repo provides a complete voice cloning pipeline using **F5-TTS v1 Base** (335M parameters), a state-of-the-art open-source neural TTS model. It can clone a voice from just **3-10 seconds** of reference audio.

### Architecture

| Component | Details |
|-----------|---------|
| **Type** | Conditional Flow Matching (CFM) with a Diffusion Transformer (DiT) |
| **Params** | 335M |
| **Backbone** | DiT (dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4) |
| **Vocoder** | Vocos (24 kHz, 100 mel channels) |
| **Training data** | ~95K hours of multilingual speech (Emilia EN+ZH) |
| **Inference** | Zero-shot voice cloning from 3-10 s of reference audio |
| **RTF** | ~0.15 (about 6.7x faster than real time) |

## Fine-Tuning Your Own Voice

### Option A: Python Script

```bash
# 1. Prepare your data:
#    my_voice/
#    ├── metadata.csv   # format: audio_path|text
#    └── wavs/
#        ├── clip001.wav
#        └── clip002.wav

# 2. Run training
python train_voice_clone.py \
    --hf_dataset mythicinfinity/libritts_r \
    --hf_config clean \
    --hf_split train.clean.100 \
    --epochs 20 \
    --lr 1e-5
```

### Option B: CLI Fine-Tuning (Official F5-TTS)

```bash
pip install f5-tts

# Prepare dataset
python -m f5_tts.train.datasets.prepare_csv_wavs \
    /path/to/my_voice \
    /path/to/prepared_data/MyVoice_custom

# Fine-tune
python -m f5_tts.train.finetune_cli \
    --exp_name F5TTS_v1_Base \
    --dataset_name MyVoice \
    --tokenizer custom \
    --finetune \
    --learning_rate 1e-5 \
    --batch_size_per_gpu 38400 \
    --batch_size_type frame \
    --max_samples 64 \
    --epochs 20 \
    --num_warmup_updates 300 \
    --grad_accumulation_steps 2 \
    --logger tensorboard
```

## Performance

| Metric | Value |
|--------|-------|
| **WER** (LibriSpeech test-clean) | ~1.87% |
| **Speaker Similarity** | SIM-o ~0.66 |
| **Real-Time Factor** | ~0.15 (6.7x faster than real time) |
| **Minimum Reference** | 3 seconds |
| **Languages** | English + Chinese (pretrained), adaptable to others |

## References

- [F5-TTS Paper](https://arxiv.org/abs/2410.06885) — *F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching*
- [Official Repo](https://github.com/SWivid/F5-TTS)
- [Original Model](https://huggingface.co/SWivid/F5-TTS)

## Citation

```bibtex
@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and Niu, Zhikang and Ma, Ziyang and Deng, Keqi and Wang, Chunhui and Zhao, Jian and Yu, Kai and Chen, Xie},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}
```

## License

CC-BY-NC-4.0 (non-commercial use only).
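
## Appendix: Generating `metadata.csv`

Fine-tuning Option A expects a `metadata.csv` with one `audio_path|text` pair per line. For more than a handful of clips it is easier to generate this file than to write it by hand. A minimal sketch, assuming you already have transcripts in memory — the `write_metadata` helper and the example transcripts are illustrative, not part of this repo:

```python
from pathlib import Path

def write_metadata(root: str, transcripts: dict) -> Path:
    """Write an F5-TTS style metadata.csv: one 'audio_path|text' line per clip."""
    root_dir = Path(root)
    (root_dir / "wavs").mkdir(parents=True, exist_ok=True)
    # Paths are relative to the dataset root, matching the layout above
    lines = [f"wavs/{name}|{text}" for name, text in sorted(transcripts.items())]
    out = root_dir / "metadata.csv"
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out

# Hypothetical transcripts, keyed by wav file name
transcripts = {
    "clip001.wav": "Hello, this is my voice.",
    "clip002.wav": "A second short sample sentence.",
}
print(write_metadata("my_voice", transcripts).read_text())
```

Keep transcripts verbatim (including punctuation): mismatches between audio and text degrade fine-tuning quality more than anything else.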