---
tags:
- f5-tts
- text-to-speech
- voice-cloning
- flow-matching
- zero-shot-tts
license: cc-by-nc-4.0
datasets:
- mythicinfinity/libritts_r
- amphion/Emilia-Dataset
base_model: SWivid/F5-TTS
pipeline_tag: text-to-speech
language:
- en
- zh
---
# πŸŽ™οΈ Voice Clone Model (F5-TTS Based)
A production-ready **zero-shot voice cloning** model based on the state-of-the-art **F5-TTS** architecture (Flow Matching + Diffusion Transformer).
## 📦 Files in This Repo
| File | Description |
|------|-------------|
| `README.md` | This documentation |
| `config.json` | Model configuration and hyperparameters |
| `train_voice_clone.py` | **Fine-tuning script** β€” adapt to your own voice data |
| `inference_voice_clone.py` | **Local inference script** β€” zero-shot voice cloning CLI |
| `voice_clone_f5tts.ipynb` | **📓 Jupyter Notebook** — ready for Colab / Kaggle |
## 🚀 Quick Start Options
### Option 1: Hugging Face Space (No setup)
Try instantly at [rajkr-voice-clone-f5tts-demo.hf.space](https://rajkr-voice-clone-f5tts-demo.hf.space)
### Option 2: Google Colab / Kaggle (Free GPU)
Open the notebook directly:
- **Colab**: [Open in Colab](https://colab.research.google.com/github/rajkr/voice-clone-f5tts/blob/main/voice_clone_f5tts.ipynb) *(upload `voice_clone_f5tts.ipynb` to your Drive first)*
- **Kaggle**: Download `voice_clone_f5tts.ipynb` from this repo → upload it to Kaggle → enable the T4 GPU accelerator
Or follow the quick steps below:
```python
# 1. Enable GPU: Runtime → Change runtime type → GPU

# 2. Install
!pip install -q f5-tts soundfile

# 3. Download model (~1.3GB)
from huggingface_hub import snapshot_download
snapshot_download("SWivid/F5-TTS", local_dir="./f5tts_model", allow_patterns=["F5TTS_v1_Base/*"])

# 4. Clone a voice
from f5_tts.api import F5TTS
tts = F5TTS(ckpt_file="./f5tts_model/F5TTS_v1_Base/model_1250000.safetensors",
            vocab_file="./f5tts_model/F5TTS_v1_Base/vocab.txt")
wav, sr, _ = tts.infer(
    ref_file="/content/my_voice.wav",  # Upload your audio first
    ref_text="Exact transcript of your audio.",
    gen_text="Say this in the cloned voice!",
    nfe_step=32,
)

import soundfile as sf
sf.write("output.wav", wav, sr)
```
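Long scripts often sound better when synthesized sentence by sentence and concatenated afterwards. A minimal chunker sketch in plain Python (no F5-TTS dependency; the 200-character budget and the function name are illustrative assumptions, not documented F5-TTS limits):

```python
import re

def chunk_text(text, max_chars=200):
    """Split text into sentence groups of at most max_chars characters each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would overflow the budget
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed as `gen_text` to `tts.infer(...)` and the resulting waveforms concatenated before writing the final file with `soundfile`.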
### Option 3: Local Machine (GPU recommended)
```bash
pip install f5-tts soundfile
python -c "
from f5_tts.api import F5TTS
import soundfile as sf
tts = F5TTS() # Auto-downloads model on first run
wav, sr, _ = tts.infer(
    ref_file='my_voice.wav',
    ref_text='Hello, this is my voice.',
    gen_text='Hello from my local machine!',
)
sf.write('output.wav', wav, sr)
"
```
## Model Description
This repo provides a complete voice cloning pipeline using **F5-TTS v1 Base** (335M parameters), one of the strongest open-source neural TTS models available. It can clone a voice from just **3-10 seconds** of reference audio.
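Since cloning quality depends on the reference clip falling in that 3-10 second window, it can be worth validating the file before inference. A stdlib-only sketch (the helper name and bounds are illustrative, not part of the F5-TTS API; assumes a PCM WAV file):

```python
import wave

def reference_clip_ok(path, min_s=3.0, max_s=10.0):
    """Return (ok, duration_s) for a PCM WAV reference clip."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / float(w.getframerate())
    return min_s <= duration <= max_s, duration
```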
### Architecture
| Component | Details |
|-----------|---------|
| **Type** | Conditional Flow Matching (CFM) with Diffusion Transformer (DiT) |
| **Params** | 335M |
| **Backbone** | DiT (dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4) |
| **Vocoder** | Vocos (24kHz, 100 mel channels) |
| **Training** | Trained on 95K hours of multilingual speech (Emilia EN+ZH) |
| **Inference** | Zero-shot voice cloning with 3-10s reference audio |
| **RTF** | ~0.15 (6.7x real-time capable) |
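The RTF row reads as wall-clock synthesis time divided by the duration of the generated audio; a quick sanity check of the arithmetic (plain Python, illustrative numbers):

```python
def real_time_factor(synthesis_s, audio_s):
    """RTF = wall-clock synthesis time / duration of generated audio."""
    return synthesis_s / audio_s

# An RTF of 0.15 means 10 s of speech takes 1.5 s to generate,
# i.e. roughly 1 / 0.15 ≈ 6.7x faster than real time.
rtf = real_time_factor(1.5, 10.0)
speedup = 1.0 / rtf
```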
## Fine-Tuning Your Own Voice
### Option A: Python Script
```bash
# 1. Prepare your data:
#    my_voice/
#    ├── metadata.csv        # format: audio_path|text
#    └── wavs/
#        ├── clip001.wav
#        └── clip002.wav
# 2. Run training
python train_voice_clone.py \
--hf_dataset mythicinfinity/libritts_r \
--hf_config clean \
--hf_split train.clean.100 \
--epochs 20 \
--lr 1e-5
```
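To produce the `metadata.csv` layout above from a folder of WAVs with matching transcripts, a small stdlib helper can be used (a sketch; the `clip001.wav` ↔ `clip001.txt` pairing convention is an assumption about your data, not something `train_voice_clone.py` requires):

```python
from pathlib import Path

def build_metadata(voice_dir):
    """Write metadata.csv (audio_path|text) from wavs/*.wav plus sibling .txt transcripts."""
    voice_dir = Path(voice_dir)
    lines = []
    for wav in sorted((voice_dir / "wavs").glob("*.wav")):
        transcript = wav.with_suffix(".txt")
        if transcript.exists():
            text = transcript.read_text(encoding="utf-8").strip()
            lines.append(f"wavs/{wav.name}|{text}")
    (voice_dir / "metadata.csv").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```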
### Option B: CLI Fine-Tuning (Official F5-TTS)
```bash
pip install f5-tts
# Prepare dataset
python -m f5_tts.train.datasets.prepare_csv_wavs \
/path/to/my_voice \
/path/to/prepared_data/MyVoice_custom
# Fine-tune
python -m f5_tts.train.finetune_cli \
--exp_name F5TTS_v1_Base \
--dataset_name MyVoice \
--tokenizer custom \
--finetune \
--learning_rate 1e-5 \
--batch_size_per_gpu 38400 \
--batch_size_type frame \
--max_samples 64 \
--epochs 20 \
--num_warmup_updates 300 \
--grad_accumulation_steps 2 \
--logger tensorboard
```
## Performance
| Metric | Value |
|--------|-------|
| **WER** (test-clean) | ~1.87% |
| **Speaker Similarity** | SIM-o ~0.66 |
| **Real-Time Factor** | 0.15 (6.7x faster than real-time) |
| **Minimum Reference** | 3 seconds |
| **Languages** | English + Chinese (pretrained), adaptable to others |
## References
- [F5-TTS Paper](https://arxiv.org/abs/2410.06885) β€” *F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching*
- [Official Repo](https://github.com/SWivid/F5-TTS)
- [Original Model](https://huggingface.co/SWivid/F5-TTS)
## Citation
```bibtex
@article{chen2024f5tts,
title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
author={Chen, Yushen and Niu, Zhikang and Ma, Ziyang and Deng, Keqi and Wang, Chunhui and Zhao, Jian and Yu, Kai and Chen, Xie},
journal={arXiv preprint arXiv:2410.06885},
year={2024}
}
```
## License
CC-BY-NC-4.0 — free for non-commercial use with attribution.