Text-to-Speech
F5-TTS
English
Chinese
flow_matching_dit
voice-cloning
flow-matching
zero-shot-tts
rajkr committed on
Commit 0df1682 · verified · 1 Parent(s): 9684de2

Upload voice_clone_f5tts.ipynb

Files changed (1)
  1. voice_clone_f5tts.ipynb +103 -61
voice_clone_f5tts.ipynb CHANGED
@@ -21,12 +21,25 @@
21
  "cell_type": "markdown",
22
  "metadata": {},
23
  "source": [
24
- "## Step 1: Enable GPU\n",
 
 
25
  "\n",
26
  "- **Colab**: Runtime → Change runtime type → GPU (T4)\n",
27
  "- **Kaggle**: Settings → Accelerator → GPU T4\n",
28
  "\n",
29
- "Then run the cell below to verify GPU is available:"
 
 
 
 
 
 
 
 
 
 
 
30
  ]
31
  },
32
  {
@@ -35,19 +48,22 @@
35
  "metadata": {},
36
  "outputs": [],
37
  "source": [
 
38
  "import torch\n",
39
  "print(f\"PyTorch version: {torch.__version__}\")\n",
40
  "print(f\"CUDA available: {torch.cuda.is_available()}\")\n",
41
  "if torch.cuda.is_available():\n",
42
  " print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
43
- " print(f\"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB\")"
 
 
44
  ]
45
  },
46
  {
47
  "cell_type": "markdown",
48
  "metadata": {},
49
  "source": [
50
- "## Step 2: Install Dependencies"
51
  ]
52
  },
53
  {
@@ -56,17 +72,21 @@
56
  "metadata": {},
57
  "outputs": [],
58
  "source": [
 
 
 
 
59
  "!pip install -q f5-tts soundfile\n",
60
- "print(\"Installation complete!\")"
61
  ]
62
  },
63
  {
64
  "cell_type": "markdown",
65
  "metadata": {},
66
  "source": [
67
- "## Step 3: Download F5-TTS Model (~1.3 GB)\n",
68
  "\n",
69
- "This downloads the pretrained checkpoint and vocab file. On Colab, this persists in `/content` but not across sessions. On Kaggle, use `/kaggle/working` for persistence."
70
  ]
71
  },
72
  {
@@ -79,7 +99,7 @@
79
  "import os\n",
80
  "\n",
81
  "# Model cache directory\n",
82
- "MODEL_DIR = \"./f5tts_model\" # or \"/kaggle/working/f5tts_model\" on Kaggle\n",
83
  "\n",
84
  "# Download only the v1 Base checkpoint and vocab\n",
85
  "snapshot_download(\n",
@@ -90,7 +110,8 @@
90
  ")\n",
91
  "\n",
92
  "# Verify files\n",
93
- "for f in os.listdir(f\"{MODEL_DIR}/F5TTS_v1_Base\"):\n",
 
94
  " print(f\" {f}\")"
95
  ]
96
  },
@@ -98,13 +119,14 @@
98
  "cell_type": "markdown",
99
  "metadata": {},
100
  "source": [
101
- "## Step 4: Upload Reference Audio\n",
102
  "\n",
103
- "Upload a 3-10 second audio clip of the voice you want to clone.\n",
104
  "\n",
105
- "**Colab**: Click the folder icon (📁) on the left → Upload to `/content/`\n",
 
106
  "\n",
107
- "**Kaggle**: Use the Input panel or upload via the code cell below:"
108
  ]
109
  },
110
  {
@@ -113,32 +135,54 @@
113
  "metadata": {},
114
  "outputs": [],
115
  "source": [
116
- "# For Colab: use the file upload widget\n",
117
- "# For Kaggle: upload via the Input panel\n",
118
  "\n",
119
- "# Set your reference audio path here:\n",
120
- "REF_AUDIO_PATH = \"/content/my_voice.wav\" # Change this to your uploaded file path\n",
121
  "\n",
122
- "# Verify file exists\n",
123
- "import os\n",
124
- "if os.path.exists(REF_AUDIO_PATH):\n",
125
- " print(f\"✅ Reference audio found: {REF_AUDIO_PATH}\")\n",
126
- " # Get duration\n",
127
- " import soundfile as sf\n",
128
- " info = sf.info(REF_AUDIO_PATH)\n",
129
- " print(f\" Duration: {info.duration:.2f}s, Sample rate: {info.samplerate}Hz\")\n",
130
- "else:\n",
131
- " print(f\"❌ File not found: {REF_AUDIO_PATH}\")\n",
132
- " print(\"Please upload your reference audio first!\")"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
133
  ]
134
  },
135
  {
136
  "cell_type": "markdown",
137
  "metadata": {},
138
  "source": [
139
- "## Step 5: Define Reference Transcript\n",
140
  "\n",
141
- "Type the **exact** words spoken in your reference audio. Accuracy matters for best cloning quality."
142
  ]
143
  },
144
  {
@@ -147,17 +191,18 @@
147
  "metadata": {},
148
  "outputs": [],
149
  "source": [
150
- "# Set your reference transcript here:\n",
151
- "REF_TEXT = \"Hello, this is my voice sample for cloning.\" # <-- CHANGE THIS to match your audio\n",
152
- "\n",
153
- "print(f\"Reference text: {REF_TEXT}\")"
 
154
  ]
155
  },
156
  {
157
  "cell_type": "markdown",
158
  "metadata": {},
159
  "source": [
160
- "## Step 6: Load the Model"
161
  ]
162
  },
163
  {
@@ -166,6 +211,9 @@
166
  "metadata": {},
167
  "outputs": [],
168
  "source": [
 
 
 
169
  "from f5_tts.api import F5TTS\n",
170
  "import torch\n",
171
  "\n",
@@ -181,20 +229,15 @@
181
  " device=device,\n",
182
  ")\n",
183
  "\n",
184
- "print(\"✅ Model loaded successfully!\")"
 
185
  ]
186
  },
187
  {
188
  "cell_type": "markdown",
189
  "metadata": {},
190
  "source": [
191
- "## Step 7: Clone a Voice! 🎙️\n",
192
- "\n",
193
- "Type what you want the cloned voice to say, then run the cell.\n",
194
- "\n",
195
- "**Settings:**\n",
196
- "- `nfe_step`: Number of function evaluation steps (16=fast, 32=good, 64=best)\n",
197
- "- `speed`: Speech rate (0.5=slow, 1.0=normal, 1.5=fast)"
198
  ]
199
  },
200
  {
@@ -203,16 +246,13 @@
203
  "metadata": {},
204
  "outputs": [],
205
  "source": [
206
- "# What you want the cloned voice to say:\n",
207
- "GEN_TEXT = \"Hello! This is my cloned voice speaking. Amazing how just a few seconds of audio can create such a realistic copy.\"\n",
208
- "\n",
209
- "# Quality settings\n",
210
- "NFE_STEP = 32 # 16=fast, 32=standard, 64=best quality\n",
211
- "SPEED = 1.0 # Speech rate\n",
212
- "\n",
213
- "print(f\"Generating: {GEN_TEXT}\")\n",
214
- "print(f\"NFE step: {NFE_STEP}, Speed: {SPEED}\")\n",
215
- "print(\"\\nGenerating... (this takes 10-30s on T4 GPU)\\n\")\n",
216
  "\n",
217
  "# Run inference\n",
218
  "wav, sr, _ = tts.infer(\n",
@@ -224,8 +264,7 @@
224
  ")\n",
225
  "\n",
226
  "# Save output\n",
227
- "import soundfile as sf\n",
228
- "OUTPUT_PATH = \"/content/output_cloned.wav\"\n",
229
  "sf.write(OUTPUT_PATH, wav, sr)\n",
230
  "\n",
231
  "print(f\"✅ Done! Saved to: {OUTPUT_PATH}\")\n",
@@ -236,7 +275,7 @@
236
  "cell_type": "markdown",
237
  "metadata": {},
238
  "source": [
239
- "## Step 8: Listen to the Result"
240
  ]
241
  },
242
  {
@@ -276,6 +315,7 @@
276
  "]\n",
277
  "\n",
278
  "for i, sentence in enumerate(sentences):\n",
 
279
  " wav, sr, _ = tts.infer(\n",
280
  " ref_file=REF_AUDIO_PATH,\n",
281
  " ref_text=REF_TEXT,\n",
@@ -285,7 +325,7 @@
285
  " )\n",
286
  " out_path = f\"/content/output_{i+1}.wav\"\n",
287
  " sf.write(out_path, wav, sr)\n",
288
- " print(f\"✅ Saved: {out_path} ({len(wav)/sr:.2f}s)\")\n",
289
  "\n",
290
  "print(\"\\nAll done! Listen below:\")\n",
291
  "for i in range(len(sentences)):\n",
@@ -300,11 +340,13 @@
300
  "\n",
301
  "| Issue | Solution |\n",
302
  "|-------|----------|\n",
303
- "| \"TorchCodec is required\" | `!pip install torchcodec` or restart runtime after install |\n",
304
- "| Out of memory (OOM) | Reduce `nfe_step` to 16, or restart runtime |\n",
305
- "| Audio sounds wrong | Double-check `REF_TEXT` matches reference audio exactly |\n",
 
306
  "| Slow on first run | Model downloads ~1.3GB on first use; subsequent runs are faster |\n",
307
- "| Want to fine-tune | See [rajkr/voice-clone-f5tts](https://huggingface.co/rajkr/voice-clone-f5tts) for training code |\n",
 
308
  "\n",
309
  "---\n",
310
  "\n",
 
21
  "cell_type": "markdown",
22
  "metadata": {},
23
  "source": [
24
+ "## ⚠️ Important Setup Steps\n",
25
+ "\n",
26
+ "### Step 0a: Enable GPU\n",
27
  "\n",
28
  "- **Colab**: Runtime → Change runtime type → GPU (T4)\n",
29
  "- **Kaggle**: Settings → Accelerator → GPU T4\n",
30
  "\n",
31
+ "### Step 0b: Upload Your Reference Audio\n",
32
+ "\n",
33
+ "You MUST upload a 3-10 second audio clip **before** running inference.\n",
34
+ "\n",
35
+ "- **Colab**: Click the 📁 folder icon on the left → Upload `my_voice.wav` to `/content/`\n",
36
+ "- **Kaggle**: Use the Input panel or upload to `/kaggle/working/`\n",
37
+ "\n",
38
+ "**Tips for best results:**\n",
39
+ "- Use clear speech with no background noise\n",
40
+ "- Duration: 3-10 seconds\n",
41
+ "- Format: `.wav` or `.mp3` (wav preferred)\n",
42
+ "- Know the **exact transcript** of what is said in the audio"
43
  ]
44
  },
45
  {
 
48
  "metadata": {},
49
  "outputs": [],
50
  "source": [
51
+ "# Verify GPU is available\n",
52
  "import torch\n",
53
  "print(f\"PyTorch version: {torch.__version__}\")\n",
54
  "print(f\"CUDA available: {torch.cuda.is_available()}\")\n",
55
  "if torch.cuda.is_available():\n",
56
  " print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
57
+ " print(f\"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB\")\n",
58
+ "else:\n",
59
+ " print(\"⚠️ WARNING: No GPU detected! Enable GPU in Runtime settings for faster inference.\")"
60
  ]
61
  },
62
  {
63
  "cell_type": "markdown",
64
  "metadata": {},
65
  "source": [
66
+ "## Step 1: Install Dependencies"
67
  ]
68
  },
69
  {
 
72
  "metadata": {},
73
  "outputs": [],
74
  "source": [
75
+ "# Fix PYTHONHASHSEED issue and install\n",
76
+ "import os\n",
77
+ "os.environ['PYTHONHASHSEED'] = 'random'\n",
78
+ "\n",
79
  "!pip install -q f5-tts soundfile\n",
80
+ "print(\"✅ Installation complete!\")"
81
  ]
82
  },
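A note on the `PYTHONHASHSEED` assignment in the install cell: Python reads this variable once, at interpreter startup, so setting it inside a running kernel can only affect processes started afterwards (such as the `!pip` call), not the kernel itself. The sketch below illustrates that behaviour with a hypothetical helper, `child_hash_seed`, which is not part of the notebook and uses only the standard library.

```python
import os
import subprocess
import sys


def child_hash_seed(value: str) -> str:
    """Return PYTHONHASHSEED as seen by a freshly started child interpreter.

    Hypothetical helper for illustration only; not part of the notebook.
    """
    # Copy the current environment and override the one variable.
    env = dict(os.environ, PYTHONHASHSEED=value)
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ.get('PYTHONHASHSEED'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if the child fails to start cleanly
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # The child sees the value it was launched with.
    print(child_hash_seed("random"))
```

If the goal is to change hashing in the notebook kernel itself, the variable would have to be set before the kernel starts; the assignment in the cell is still useful for subprocesses launched from it.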
83
  {
84
  "cell_type": "markdown",
85
  "metadata": {},
86
  "source": [
87
+ "## Step 2: Download F5-TTS Model (~1.3 GB)\n",
88
  "\n",
89
+ "This downloads the pretrained checkpoint and vocab file. On Colab, the download persists only for the current session."
90
  ]
91
  },
92
  {
 
99
  "import os\n",
100
  "\n",
101
  "# Model cache directory\n",
102
+ "MODEL_DIR = \"./f5tts_model\"\n",
103
  "\n",
104
  "# Download only the v1 Base checkpoint and vocab\n",
105
  "snapshot_download(\n",
 
110
  ")\n",
111
  "\n",
112
  "# Verify files\n",
113
+ "print(\"Downloaded files:\")\n",
114
+ "for f in sorted(os.listdir(f\"{MODEL_DIR}/F5TTS_v1_Base\")):\n",
115
  " print(f\" {f}\")"
116
  ]
117
  },
 
119
  "cell_type": "markdown",
120
  "metadata": {},
121
  "source": [
122
+ "## Step 3: Set Your Reference Audio Path\n",
123
  "\n",
124
+ "**Make sure you uploaded your audio file first!**\n",
125
  "\n",
126
+ "- Colab default: `/content/my_voice.wav`\n",
127
+ "- Kaggle default: `/kaggle/working/my_voice.wav`\n",
128
  "\n",
129
+ "Change the path below to match your uploaded file."
130
  ]
131
  },
132
  {
 
135
  "metadata": {},
136
  "outputs": [],
137
  "source": [
138
+ "# === CONFIGURE THESE ===\n",
 
139
  "\n",
140
+ "# Path to your uploaded reference audio file\n",
141
+ "REF_AUDIO_PATH = \"/content/my_voice.wav\" # <-- CHANGE THIS to your file path\n",
142
  "\n",
143
+ "# Exact transcript of what is spoken in the reference audio\n",
144
+ "REF_TEXT = \"Hello, this is my voice sample for cloning.\" # <-- CHANGE THIS to match your audio\n",
145
+ "\n",
146
+ "# Text you want the cloned voice to say\n",
147
+ "GEN_TEXT = \"Hello! This is my cloned voice speaking. Amazing how just a few seconds of audio can create such a realistic copy.\"\n",
148
+ "\n",
149
+ "# Quality settings\n",
150
+ "NFE_STEP = 32 # 16=fast, 32=good quality, 64=best (slower)\n",
151
+ "SPEED = 1.0 # Speech rate (0.5=slow, 1.0=normal, 1.5=fast)\n",
152
+ "\n",
153
+ "# =======================\n",
154
+ "\n",
155
+ "# Verify audio file exists\n",
156
+ "import soundfile as sf\n",
157
+ "if not os.path.exists(REF_AUDIO_PATH):\n",
158
+ " print(f\"❌ ERROR: File not found: {REF_AUDIO_PATH}\")\n",
159
+ " print(\"\\nPlease upload your reference audio first!\")\n",
160
+ " print(\"\\nColab: Click the 📁 folder icon (left sidebar) → Upload your .wav file,\")\n",
161
+ " print(\"  or run: from google.colab import files; uploaded = files.upload()\")\n",
162
+ " print(\"Kaggle: Use the Input panel, or upload to /kaggle/working/\")\n",
163
+ " raise FileNotFoundError(f\"Audio file not found: {REF_AUDIO_PATH}\")\n",
164
+ "\n",
165
+ "# Show audio info\n",
166
+ "info = sf.info(REF_AUDIO_PATH)\n",
167
+ "print(f\"✅ Audio found: {REF_AUDIO_PATH}\")\n",
168
+ "print(f\" Duration: {info.duration:.2f}s, Sample rate: {info.samplerate}Hz, Channels: {info.channels}\")\n",
169
+ "\n",
170
+ "if info.duration < 1.0:\n",
171
+ " print(\"⚠️ WARNING: Audio is very short. Use 3-10 seconds for best results.\")\n",
172
+ "elif info.duration > 30.0:\n",
173
+ " print(\"⚠️ WARNING: Audio is very long. Only the first ~10s will be used effectively.\")\n",
174
+ "\n",
175
+ "print(f\"\\nReference text: {REF_TEXT}\")\n",
176
+ "print(f\"Generate text: {GEN_TEXT}\")"
177
  ]
178
  },
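The 3-10 second guidance for the reference clip can also be checked as a separate, reusable step. The following is a minimal sketch that uses the standard-library `wave` module rather than `soundfile`, so it handles only uncompressed `.wav` files. The helper names and the default bounds of 3.0 and 10.0 seconds are assumptions taken from the notebook's recommendation, not limits enforced by F5-TTS.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_reference_duration(duration: float, low: float = 3.0, high: float = 10.0) -> str:
    """Classify a clip length against the recommended range.

    Hypothetical helper: the bounds mirror the notebook's advice only.
    """
    if duration < low:
        return "too_short"
    if duration > high:
        return "too_long"
    return "ok"


def write_silent_wav(path: str, seconds: float, rate: int = 16000) -> None:
    """Write a mono 16-bit WAV of silence; used here only to exercise the helpers."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))
```

In the notebook, `check_reference_duration(wav_duration_seconds(REF_AUDIO_PATH))` could run right after the file-exists check; for `.mp3` input the `soundfile`-based `info.duration` from the cell would still be needed.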
179
  {
180
  "cell_type": "markdown",
181
  "metadata": {},
182
  "source": [
183
+ "### (Optional) Upload Audio via Code Cell\n",
184
  "\n",
185
+ "If you prefer to upload via code instead of the file panel, run this cell:"
186
  ]
187
  },
188
  {
 
191
  "metadata": {},
192
  "outputs": [],
193
  "source": [
194
+ "# Only run this if you haven't uploaded via the file panel\n",
195
+ "# from google.colab import files\n",
196
+ "# uploaded = files.upload() # Select your .wav file\n",
197
+ "# REF_AUDIO_PATH = list(uploaded.keys())[0] # Auto-detect uploaded file\n",
198
+ "# print(f\"Uploaded: {REF_AUDIO_PATH}\")"
199
  ]
200
  },
201
  {
202
  "cell_type": "markdown",
203
  "metadata": {},
204
  "source": [
205
+ "## Step 4: Load the Model"
206
  ]
207
  },
208
  {
 
211
  "metadata": {},
212
  "outputs": [],
213
  "source": [
214
+ "import os\n",
215
+ "os.environ['PYTHONHASHSEED'] = 'random' # Fix hash seed issue\n",
216
+ "\n",
217
  "from f5_tts.api import F5TTS\n",
218
  "import torch\n",
219
  "\n",
 
229
  " device=device,\n",
230
  ")\n",
231
  "\n",
232
+ "print(\"✅ Model loaded successfully!\")\n",
233
+ "print(f\" Ready for inference on {device}\")"
234
  ]
235
  },
236
  {
237
  "cell_type": "markdown",
238
  "metadata": {},
239
  "source": [
240
+ "## Step 5: Clone the Voice! 🎙️"
 
 
 
 
 
 
241
  ]
242
  },
243
  {
 
246
  "metadata": {},
247
  "outputs": [],
248
  "source": [
249
+ "print(\"Generating with:\")\n",
250
+ "print(f\" Reference: {REF_AUDIO_PATH}\")\n",
251
+ "print(f\" Ref text: {REF_TEXT}\")\n",
252
+ "print(f\" Gen text: {GEN_TEXT}\")\n",
253
+ "print(f\" NFE step: {NFE_STEP} (quality/speed tradeoff)\")\n",
254
+ "print(f\" Speed: {SPEED}\")\n",
255
+ "print(f\"\\nGenerating... (this takes 10-30s on T4 GPU with nfe={NFE_STEP})\\n\")\n",
 
 
 
256
  "\n",
257
  "# Run inference\n",
258
  "wav, sr, _ = tts.infer(\n",
 
264
  ")\n",
265
  "\n",
266
  "# Save output\n",
267
+ "OUTPUT_PATH = \"/content/output_cloned.wav\" # Change to /kaggle/working/ on Kaggle\n",
 
268
  "sf.write(OUTPUT_PATH, wav, sr)\n",
269
  "\n",
270
  "print(f\"✅ Done! Saved to: {OUTPUT_PATH}\")\n",
 
275
  "cell_type": "markdown",
276
  "metadata": {},
277
  "source": [
278
+ "## Step 6: Listen to the Result"
279
  ]
280
  },
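Output paths in this notebook are hardcoded to `/content/`, the Colab working directory, while the comment on `OUTPUT_PATH` points Kaggle users to `/kaggle/working/`. One possible way to avoid editing each path by hand is to pick the first writable directory from a list of candidates. `pick_output_dir` and `output_path` below are hypothetical helpers, not part of the notebook or of F5-TTS.

```python
import os


def pick_output_dir(candidates=("/content", "/kaggle/working"), fallback="."):
    """Return the first candidate that exists and is writable, else the fallback.

    Hypothetical helper; the candidate list is an assumption based on the
    two platforms the notebook mentions.
    """
    for directory in candidates:
        if os.path.isdir(directory) and os.access(directory, os.W_OK):
            return directory
    return fallback


def output_path(index: int, out_dir: str) -> str:
    """Build the numbered output file name used by the batch loop (1-based)."""
    return os.path.join(out_dir, f"output_{index + 1}.wav")


if __name__ == "__main__":
    out_dir = pick_output_dir()
    print(output_path(0, out_dir))
```

With this in place, the single-file cell could use `os.path.join(pick_output_dir(), "output_cloned.wav")` and the batch loop could call `output_path(i, out_dir)` instead of the literal `/content/` strings.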
281
  {
 
315
  "]\n",
316
  "\n",
317
  "for i, sentence in enumerate(sentences):\n",
318
+ " print(f\"\\n[{i+1}/{len(sentences)}] Generating: {sentence[:50]}...\")\n",
319
  " wav, sr, _ = tts.infer(\n",
320
  " ref_file=REF_AUDIO_PATH,\n",
321
  " ref_text=REF_TEXT,\n",
 
325
  " )\n",
326
  " out_path = f\"/content/output_{i+1}.wav\"\n",
327
  " sf.write(out_path, wav, sr)\n",
328
+ " print(f\" ✅ Saved: {out_path} ({len(wav)/sr:.2f}s)\")\n",
329
  "\n",
330
  "print(\"\\nAll done! Listen below:\")\n",
331
  "for i in range(len(sentences)):\n",
 
340
  "\n",
341
  "| Issue | Solution |\n",
342
  "|-------|----------|\n",
343
+ "| `PYTHONHASHSEED` error | Fixed in this notebook: `os.environ['PYTHONHASHSEED'] = 'random'` is set before imports |\n",
344
+ "| `FileNotFoundError: my_voice.wav` | Upload your audio file first! See Step 0b |\n",
345
+ "| Out of memory (OOM) | Reduce `NFE_STEP` to 16, or restart runtime |\n",
346
+ "| Audio sounds wrong / robotic | Double-check `REF_TEXT` matches reference audio **exactly** |\n",
347
  "| Slow on first run | Model downloads ~1.3GB on first use; subsequent runs are faster |\n",
348
+ "| `TorchCodec is required` | Run `!pip install -q torchcodec` |\n",
349
+ "| Voice doesn't sound like reference | Use longer reference (8-10s), clearer audio, exact transcript |\n",
350
  "\n",
351
  "---\n",
352
  "\n",