F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Paper: arXiv:2410.06885
A production-ready zero-shot voice cloning model based on the state-of-the-art F5-TTS architecture (Flow Matching + Diffusion Transformer).
| File | Description |
|---|---|
| `README.md` | This documentation |
| `config.json` | Model configuration and hyperparameters |
| `train_voice_clone.py` | Fine-tuning script – adapt it to your own voice data |
| `inference_voice_clone.py` | Local inference script – zero-shot voice cloning CLI |
| `voice_clone_f5tts.ipynb` | Jupyter notebook – ready for Colab / Kaggle |
Try it instantly at `rajkr-voice-clone-f5tts-demo.hf.space`, or open the notebook directly:

- **Colab**: open `voice_clone_f5tts.ipynb` (copy it to your Drive first)
- **Kaggle**: download `voice_clone_f5tts.ipynb` from this repo → upload it to Kaggle → enable the T4 GPU

Or follow the quick steps below:
```python
# 1. Enable GPU: Runtime → Change runtime type → GPU

# 2. Install
!pip install -q f5-tts soundfile

# 3. Download model (~1.3 GB)
from huggingface_hub import snapshot_download
snapshot_download("SWivid/F5-TTS", local_dir="./f5tts_model",
                  allow_patterns=["F5TTS_v1_Base/*"])

# 4. Clone a voice
from f5_tts.api import F5TTS
tts = F5TTS(ckpt_file="./f5tts_model/F5TTS_v1_Base/model_1250000.safetensors",
            vocab_file="./f5tts_model/F5TTS_v1_Base/vocab.txt")
wav, sr, _ = tts.infer(
    ref_file="/content/my_voice.wav",  # Upload your audio first
    ref_text="Exact transcript of your audio.",
    gen_text="Say this in the cloned voice!",
    nfe_step=32,
)

# 5. Save the result
import soundfile as sf
sf.write("output.wav", wav, sr)
```
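`tts.infer` already accepts fairly long `gen_text`, but for multi-paragraph scripts you may want explicit control over chunk boundaries. A minimal regex-based sentence splitter (an illustrative helper of my own, not part of the F5-TTS API):

```python
import re

def split_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence-ending punctuation, merging short pieces
    so each chunk stays under max_chars."""
    pieces = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + 1 + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `tts.infer()` separately and the resulting waveforms concatenated.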
```bash
pip install f5-tts soundfile

python -c "
from f5_tts.api import F5TTS
import soundfile as sf

tts = F5TTS()  # Auto-downloads the model on first run
wav, sr, _ = tts.infer(
    ref_file='my_voice.wav',
    ref_text='Hello, this is my voice.',
    gen_text='Hello from my local machine!',
)
sf.write('output.wav', wav, sr)
"
```
This repo provides a complete voice cloning pipeline built on F5-TTS v1 Base (335M parameters), a state-of-the-art open-source neural TTS model. It can clone a voice from just 3-10 seconds of reference audio.
| Component | Details |
|---|---|
| Type | Conditional Flow Matching (CFM) with Diffusion Transformer (DiT) |
| Params | 335M |
| Backbone | DiT (dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4) |
| Vocoder | Vocos (24kHz, 100 mel channels) |
| Training data | 95K hours of multilingual speech (Emilia EN+ZH) |
| Inference | Zero-shot voice cloning with 3-10s reference audio |
| RTF | ~0.15 (6.7x real-time capable) |
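Since cloning quality depends on the 3-10 s reference window noted above, it is worth validating clips before inference. A quick check using only the standard library (an illustrative helper, not part of F5-TTS; it assumes plain PCM wav input):

```python
import wave

def check_reference(path: str, min_s: float = 3.0, max_s: float = 10.0) -> float:
    """Return the clip duration in seconds; raise if outside the 3-10 s window."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    if not (min_s <= duration <= max_s):
        raise ValueError(
            f"Reference is {duration:.1f}s; F5-TTS works best with {min_s}-{max_s}s clips"
        )
    return duration
```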
```bash
# 1. Prepare your data:
# my_voice/
# ├── metadata.csv        # format: audio_path|text
# └── wavs/
#     ├── clip001.wav
#     └── clip002.wav

# 2. Run training
python train_voice_clone.py \
    --hf_dataset mythicinfinity/libritts_r \
    --hf_config clean \
    --hf_split train.clean.100 \
    --epochs 20 \
    --lr 1e-5
```
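The `metadata.csv` layout above (`audio_path|text`, one line per clip) can be generated with a short stdlib script; the folder layout matches the tree above, while the transcript mapping here is hypothetical:

```python
from pathlib import Path

def write_metadata(root: str, transcripts: dict[str, str]) -> Path:
    """Write metadata.csv mapping each wav under root/wavs to its transcript."""
    root_path = Path(root)
    lines = []
    for wav in sorted((root_path / "wavs").glob("*.wav")):
        text = transcripts.get(wav.stem)
        if text:  # skip clips without a transcript
            lines.append(f"wavs/{wav.name}|{text}")
    out = root_path / "metadata.csv"
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```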
```bash
pip install f5-tts

# Prepare dataset
python -m f5_tts.train.datasets.prepare_csv_wavs \
    /path/to/my_voice \
    /path/to/prepared_data/MyVoice_custom

# Fine-tune
python -m f5_tts.train.finetune_cli \
    --exp_name F5TTS_v1_Base \
    --dataset_name MyVoice \
    --tokenizer custom \
    --finetune \
    --learning_rate 1e-5 \
    --batch_size_per_gpu 38400 \
    --batch_size_type frame \
    --max_samples 64 \
    --epochs 20 \
    --num_warmup_updates 300 \
    --grad_accumulation_steps 2 \
    --logger tensorboard
```
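With `--batch_size_type frame`, `--batch_size_per_gpu 38400` counts mel frames rather than samples. Assuming the model's 24 kHz output and a mel hop length of 256 (the hop length is my assumption from the F5-TTS config), the frame budget translates into audio seconds per batch as:

```python
SAMPLE_RATE = 24_000   # model output rate (Vocos vocoder)
HOP_LENGTH = 256       # mel hop length (assumed from the F5-TTS config)

def frames_to_seconds(frames: int) -> float:
    """Convert a per-GPU frame budget into seconds of audio per batch."""
    return frames * HOP_LENGTH / SAMPLE_RATE

print(frames_to_seconds(38400))  # 409.6 seconds of audio per GPU per step
```

So at the default setting each GPU sees roughly 6.8 minutes of audio per optimizer step (before gradient accumulation).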
| Metric | Value |
|---|---|
| WER (test-clean) | ~1.87% |
| Speaker Similarity | SIM-o ~0.66 |
| Real-Time Factor | 0.15 (6.7x faster than real-time) |
| Minimum Reference | 3 seconds |
| Languages | English + Chinese (pretrained), adaptable to others |
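The WER figure above is word-level edit distance divided by reference length. A minimal implementation for sanity-checking your own fine-tunes (not the paper's exact evaluation pipeline, which uses a specific ASR model and text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, single-row dynamic programming.
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            row.append(min(prev_row[j] + 1,          # deletion
                           row[j - 1] + 1,           # insertion
                           prev_row[j - 1] + cost))  # substitution or match
        prev_row = row
    return prev_row[-1] / max(len(ref), 1)
```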
```bibtex
@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and Niu, Zhikang and Ma, Ziyang and Deng, Keqi and Wang, Chunhui and Zhao, Jian and Yu, Kai and Chen, Xie},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}
```
**License:** CC-BY-NC-4.0 (non-commercial use)

**Base model:** SWivid/F5-TTS