| --- |
| tags: |
| - f5-tts |
| - text-to-speech |
| - voice-cloning |
| - flow-matching |
| - zero-shot-tts |
| license: cc-by-nc-4.0 |
| datasets: |
| - mythicinfinity/libritts_r |
| - amphion/Emilia-Dataset |
| base_model: SWivid/F5-TTS |
| pipeline_tag: text-to-speech |
| language: |
| - en |
| - zh |
| --- |
| |
| # Voice Clone Model (F5-TTS Based) |
|
|
| A production-ready **zero-shot voice cloning** model based on the state-of-the-art **F5-TTS** architecture (Flow Matching + Diffusion Transformer). |
|
|
| ## Files in This Repo |
|
|
| | File | Description | |
| |------|-------------| |
| | `README.md` | This documentation | |
| | `config.json` | Model configuration and hyperparameters | |
| | `train_voice_clone.py` | **Fine-tuning script**: adapt to your own voice data | |
| | `inference_voice_clone.py` | **Local inference script**: zero-shot voice cloning CLI | |
| | `voice_clone_f5tts.ipynb` | **Jupyter Notebook**: ready for Colab / Kaggle | |
|
|
| ## Quick Start Options |
|
|
| ### Option 1: Hugging Face Space (No setup) |
| Try instantly at [rajkr-voice-clone-f5tts-demo.hf.space](https://rajkr-voice-clone-f5tts-demo.hf.space) |
|
|
| ### Option 2: Google Colab / Kaggle (Free GPU) |
| Open the notebook directly: |
| - **Colab**: [Open in Colab](https://colab.research.google.com/github/rajkr/voice-clone-f5tts/blob/main/voice_clone_f5tts.ipynb) *(upload `voice_clone_f5tts.ipynb` to your Drive first)* |
| - **Kaggle**: Download `voice_clone_f5tts.ipynb` from this repo → Upload to Kaggle → Enable GPU T4 |
|
|
| Or follow the quick steps below: |
|
|
| ```python |
| # 1. Enable GPU: Runtime → Change runtime type → GPU |
| # 2. Install |
| !pip install -q f5-tts soundfile |
| |
| # 3. Download model (~1.3GB) |
| from huggingface_hub import snapshot_download |
| snapshot_download("SWivid/F5-TTS", local_dir="./f5tts_model", allow_patterns=["F5TTS_v1_Base/*"]) |
| |
| # 4. Clone a voice |
| from f5_tts.api import F5TTS |
| tts = F5TTS(ckpt_file="./f5tts_model/F5TTS_v1_Base/model_1250000.safetensors", |
| vocab_file="./f5tts_model/F5TTS_v1_Base/vocab.txt") |
| |
| wav, sr, _ = tts.infer( |
| ref_file="/content/my_voice.wav", # Upload your audio first |
| ref_text="Exact transcript of your audio.", |
| gen_text="Say this in the cloned voice!", |
| nfe_step=32, |
| ) |
| |
| import soundfile as sf |
| sf.write("output.wav", wav, sr) |
| ``` |
|
|
| ### Option 3: Local Machine (GPU recommended) |
|
|
| ```bash |
| pip install f5-tts soundfile |
| |
| python -c " |
| from f5_tts.api import F5TTS |
| import soundfile as sf |
| |
| tts = F5TTS() # Auto-downloads model on first run |
| wav, sr, _ = tts.infer( |
| ref_file='my_voice.wav', |
| ref_text='Hello, this is my voice.', |
| gen_text='Hello from my local machine!', |
| ) |
| sf.write('output.wav', wav, sr) |
| " |
| ``` |
|
|
| ## Model Description |
|
|
| This repo provides a complete voice cloning pipeline built on **F5-TTS v1 Base** (335M parameters), a strong open-source neural TTS model. It clones a voice zero-shot from **3-10 seconds** of reference audio plus an exact transcript of that audio. |
|
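Because the 3-10 second window matters for cloning quality, it can help to check a reference clip before running inference. The following is a minimal sketch using only the Python standard library; the helper names `wav_duration_seconds` and `is_usable_reference` are hypothetical and not part of F5-TTS, and the check assumes an uncompressed PCM WAV file.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_usable_reference(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """True when the clip length falls inside the suggested 3-10 s window."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

For example, `is_usable_reference("my_voice.wav")` would return `False` for a one-second clip, which is a hint to record a longer sample before calling `tts.infer`.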
|
| ### Architecture |
|
|
| | Component | Details | |
| |-----------|---------| |
| | **Type** | Conditional Flow Matching (CFM) with Diffusion Transformer (DiT) | |
| | **Params** | 335M | |
| | **Backbone** | DiT (dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4) | |
| | **Vocoder** | Vocos (24kHz, 100 mel channels) | |
| | **Training** | Trained on 95K hours of multilingual speech (Emilia EN+ZH) | |
| | **Inference** | Zero-shot voice cloning with 3-10s reference audio | |
| | **RTF** | ~0.15 (6.7x real-time capable) | |
| |
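The RTF row can be read as a simple ratio: synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The sketch below works through that arithmetic; the function names and the 1.5 s / 10 s timing are illustrative assumptions, not measurements from this repo.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speed_over_real_time(rtf: float) -> float:
    """Inverse of RTF: how many seconds of audio are produced per second."""
    return 1.0 / rtf


# Hypothetical timing: 1.5 s of compute to synthesize 10 s of speech.
rtf = real_time_factor(1.5, 10.0)
print(f"RTF = {rtf:.2f}, speed = {speed_over_real_time(rtf):.1f}x real time")
# prints "RTF = 0.15, speed = 6.7x real time"
```

Actual RTF depends on the GPU, the number of function evaluations (`nfe_step`), and the length of the text, so it is worth measuring on your own hardware.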
| ## Fine-Tuning Your Own Voice |
| |
| ### Option A: Python Script |
| |
| ```bash |
| # 1. Prepare your data: |
| # my_voice/ |
| # ├── metadata.csv    # format: audio_path|text |
| # └── wavs/ |
| #     ├── clip001.wav |
| #     └── clip002.wav |
| |
| # 2. Run training (this example command pulls LibriTTS-R from the Hugging Face Hub) |
| python train_voice_clone.py \ |
| --hf_dataset mythicinfinity/libritts_r \ |
| --hf_config clean \ |
| --hf_split train.clean.100 \ |
| --epochs 20 \ |
| --lr 1e-5 |
| ``` |
| |
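Before training, it can be useful to confirm that every row of `metadata.csv` follows the `audio_path|text` format described above. This is a minimal sketch, not part of `train_voice_clone.py`; the function name `parse_metadata` is hypothetical, and it assumes the file has no header row (skip the first line before calling it if yours has one).

```python
from typing import Iterable, List, Tuple


def parse_metadata(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'audio_path|text' rows, raising ValueError on malformed lines."""
    rows = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first pipe only, so a transcript may itself contain "|".
        path, sep, text = line.partition("|")
        if not sep or not path.strip() or not text.strip():
            raise ValueError(f"line {lineno}: expected 'audio_path|text', got {line!r}")
        rows.append((path.strip(), text.strip()))
    return rows
```

A typical use would be `parse_metadata(open("my_voice/metadata.csv", encoding="utf-8"))`, followed by a check that each returned path exists on disk.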
| ### Option B: CLI Fine-Tuning (Official F5-TTS) |
|
|
| ```bash |
| pip install f5-tts |
| |
| # Prepare dataset |
| python -m f5_tts.train.datasets.prepare_csv_wavs \ |
| /path/to/my_voice \ |
| /path/to/prepared_data/MyVoice_custom |
| |
| # Fine-tune |
| python -m f5_tts.train.finetune_cli \ |
| --exp_name F5TTS_v1_Base \ |
| --dataset_name MyVoice \ |
| --tokenizer custom \ |
| --tokenizer_path /path/to/prepared_data/MyVoice_custom/vocab.txt \ |
| --finetune \ |
| --learning_rate 1e-5 \ |
| --batch_size_per_gpu 38400 \ |
| --batch_size_type frame \ |
| --max_samples 64 \ |
| --epochs 20 \ |
| --num_warmup_updates 300 \ |
| --grad_accumulation_steps 2 \ |
| --logger tensorboard |
| ``` |
|
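With `--batch_size_type frame`, the batch size counts mel-spectrogram frames rather than utterances, so it helps to translate it into seconds of audio. The sketch below does that conversion; the 24 kHz sample rate matches the vocoder listed in the architecture table, while the hop length of 256 samples is an assumption based on the usual F5-TTS mel settings and should be checked against your config.

```python
SAMPLE_RATE_HZ = 24_000  # 24 kHz, as listed for the Vocos vocoder
HOP_LENGTH = 256         # assumption: mel hop length in samples


def frames_to_seconds(frames: int,
                      sample_rate: int = SAMPLE_RATE_HZ,
                      hop_length: int = HOP_LENGTH) -> float:
    """Convert a count of mel frames into seconds of audio."""
    return frames * hop_length / sample_rate


per_gpu = frames_to_seconds(38_400)   # --batch_size_per_gpu 38400
effective = per_gpu * 2               # --grad_accumulation_steps 2
print(f"{per_gpu:.1f} s per GPU step, {effective:.1f} s per optimizer update")
# prints "409.6 s per GPU step, 819.2 s per optimizer update"
```

Under these assumptions, each GPU step covers up to roughly seven minutes of audio, capped at 64 utterances by `--max_samples`. If you run out of memory, lowering `--batch_size_per_gpu` and raising `--grad_accumulation_steps` keeps the effective batch similar.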
|
| ## Performance |
|
|
| | Metric | Value | |
| |--------|-------| |
| | **WER** (test-clean) | ~1.87% | |
| | **Speaker Similarity** | SIM-o ~0.66 | |
| | **Real-Time Factor** | ~0.15 (about 6.7x real-time speed) | |
| | **Minimum Reference** | 3 seconds | |
| | **Languages** | English + Chinese (pretrained), adaptable to others | |
|
|
| ## References |
|
|
| - [F5-TTS Paper](https://arxiv.org/abs/2410.06885): *F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching* |
| - [Official Repo](https://github.com/SWivid/F5-TTS) |
| - [Original Model](https://huggingface.co/SWivid/F5-TTS) |
|
|
| ## Citation |
|
|
| ```bibtex |
| @article{chen2024f5tts, |
| title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching}, |
| author={Chen, Yushen and Niu, Zhikang and Ma, Ziyang and Deng, Keqi and Wang, Chunhui and Zhao, Jian and Yu, Kai and Chen, Xie}, |
| journal={arXiv preprint arXiv:2410.06885}, |
| year={2024} |
| } |
| ``` |
|
|
| ## License |
|
|
| CC-BY-NC-4.0 (non-commercial use) |
|
|