rajkr/voice-clone-f5tts

@@ -21,12 +21,25 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Step 1: Enable GPU\n",
     "\n",
     "- **Colab**: Runtime → Change runtime type → GPU (T4)\n",
     "- **Kaggle**: Settings → Accelerator → GPU T4\n",
     "\n",
-    "Then run the cell below to verify GPU is available:"
    ]
   },
   {
@@ -35,19 +48,22 @@
    "metadata": {},
    "outputs": [],
    "source": [
     "import torch\n",
     "print(f\"PyTorch version: {torch.__version__}\")\n",
     "print(f\"CUDA available: {torch.cuda.is_available()}\")\n",
     "if torch.cuda.is_available():\n",
     "    print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
-    "    print(f\"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Step 2: Install Dependencies"
    ]
   },
   {
@@ -56,17 +72,21 @@
    "metadata": {},
    "outputs": [],
    "source": [
     "!pip install -q f5-tts soundfile\n",
-    "print(\"Installation complete!\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Step 3: Download F5-TTS Model (~1.3 GB)\n",
     "\n",
-    "This downloads the pretrained checkpoint and vocab file. On Colab, this persists in `/content` but not across sessions. On Kaggle, use `/kaggle/working` for persistence."
    ]
   },
   {
@@ -79,7 +99,7 @@
     "import os\n",
     "\n",
     "# Model cache directory\n",
-    "MODEL_DIR = \"./f5tts_model\"  # or \"/kaggle/working/f5tts_model\" on Kaggle\n",
     "\n",
     "# Download only the v1 Base checkpoint and vocab\n",
     "snapshot_download(\n",
@@ -90,7 +110,8 @@
     ")\n",
     "\n",
     "# Verify files\n",
-    "for f in os.listdir(f\"{MODEL_DIR}/F5TTS_v1_Base\"):\n",
     "    print(f\"  {f}\")"
    ]
   },
@@ -98,13 +119,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Step 4: Upload Reference Audio\n",
     "\n",
-    "Upload a 3-10 second audio clip of the voice you want to clone.\n",
     "\n",
-    "**Colab**: Click the folder icon (📁) on the left → Upload to `/content/`\n",
     "\n",
-    "**Kaggle**: Use the Input panel or upload via the code cell below:"
    ]
   },
   {
@@ -113,32 +135,54 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# For Colab: use the file upload widget\n",
-    "# For Kaggle: upload via the Input panel\n",
     "\n",
-    "# Set your reference audio path here:\n",
-    "REF_AUDIO_PATH = \"/content/my_voice.wav\"  # Change this to your uploaded file path\n",
     "\n",
-    "# Verify file exists\n",
-    "import os\n",
-    "if os.path.exists(REF_AUDIO_PATH):\n",
-    "    print(f\"✅ Reference audio found: {REF_AUDIO_PATH}\")\n",
-    "    # Get duration\n",
-    "    import soundfile as sf\n",
-    "    info = sf.info(REF_AUDIO_PATH)\n",
-    "    print(f\"   Duration: {info.duration:.2f}s, Sample rate: {info.samplerate}Hz\")\n",
-    "else:\n",
-    "    print(f\"❌ File not found: {REF_AUDIO_PATH}\")\n",
-    "    print(\"Please upload your reference audio first!\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Step 5: Define Reference Transcript\n",
     "\n",
-    "Type the **exact** words spoken in your reference audio. Accuracy matters for best cloning quality."
    ]
   },
   {
@@ -147,17 +191,18 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Set your reference transcript here:\n",
-    "REF_TEXT = \"Hello, this is my voice sample for cloning.\"  # <-- CHANGE THIS to match your audio\n",
-    "\n",
-    "print(f\"Reference text: {REF_TEXT}\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Step 6: Load the Model"
    ]
   },
   {
@@ -166,6 +211,9 @@
    "metadata": {},
    "outputs": [],
    "source": [
     "from f5_tts.api import F5TTS\n",
     "import torch\n",
     "\n",
@@ -181,20 +229,15 @@
     "    device=device,\n",
     ")\n",
     "\n",
-    "print(\"✅ Model loaded successfully!\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Step 7: Clone a Voice! 🎙️\n",
-    "\n",
-    "Type what you want the cloned voice to say, then run the cell.\n",
-    "\n",
-    "**Settings:**\n",
-    "- `nfe_step`: Number of function evaluation steps (16=fast, 32=good, 64=best)\n",
-    "- `speed`: Speech rate (0.5=slow, 1.0=normal, 1.5=fast)"
    ]
   },
   {
@@ -203,16 +246,13 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# What you want the cloned voice to say:\n",
-    "GEN_TEXT = \"Hello! This is my cloned voice speaking. Amazing how just a few seconds of audio can create such a realistic copy.\"\n",
-    "\n",
-    "# Quality settings\n",
-    "NFE_STEP = 32      # 16=fast, 32=standard, 64=best quality\n",
-    "SPEED = 1.0        # Speech rate\n",
-    "\n",
-    "print(f\"Generating: {GEN_TEXT}\")\n",
-    "print(f\"NFE step: {NFE_STEP}, Speed: {SPEED}\")\n",
-    "print(\"\\nGenerating... (this takes 10-30s on T4 GPU)\\n\")\n",
     "\n",
     "# Run inference\n",
     "wav, sr, _ = tts.infer(\n",
@@ -224,8 +264,7 @@
     ")\n",
     "\n",
     "# Save output\n",
-    "import soundfile as sf\n",
-    "OUTPUT_PATH = \"/content/output_cloned.wav\"\n",
     "sf.write(OUTPUT_PATH, wav, sr)\n",
     "\n",
     "print(f\"✅ Done! Saved to: {OUTPUT_PATH}\")\n",
@@ -236,7 +275,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Step 8: Listen to the Result"
    ]
   },
   {
@@ -276,6 +315,7 @@
     "]\n",
     "\n",
     "for i, sentence in enumerate(sentences):\n",
     "    wav, sr, _ = tts.infer(\n",
     "        ref_file=REF_AUDIO_PATH,\n",
     "        ref_text=REF_TEXT,\n",
@@ -285,7 +325,7 @@
     "    )\n",
     "    out_path = f\"/content/output_{i+1}.wav\"\n",
     "    sf.write(out_path, wav, sr)\n",
-    "    print(f\"✅ Saved: {out_path} ({len(wav)/sr:.2f}s)\")\n",
     "\n",
     "print(\"\\nAll done! Listen below:\")\n",
     "for i in range(len(sentences)):\n",
@@ -300,11 +340,13 @@
     "\n",
     "| Issue | Solution |\n",
     "|-------|----------|\n",
-    "| \"TorchCodec is required\" | `!pip install torchcodec` or restart runtime after install |\n",
-    "| Out of memory (OOM) | Reduce `nfe_step` to 16, or restart runtime |\n",
-    "| Audio sounds wrong | Double-check `REF_TEXT` matches reference audio exactly |\n",
     "| Slow on first run | Model downloads ~1.3GB on first use — subsequent runs are faster |\n",
-    "| Want to fine-tune | See [rajkr/voice-clone-f5tts](https://huggingface.co/rajkr/voice-clone-f5tts) for training code |\n",
     "\n",
     "---\n",
     "\n",

    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## ⚠️ Important Setup Steps\n",
+    "\n",
+    "### Step 0a: Enable GPU\n",
     "\n",
     "- **Colab**: Runtime → Change runtime type → GPU (T4)\n",
     "- **Kaggle**: Settings → Accelerator → GPU T4\n",
     "\n",
+    "### Step 0b: Upload Your Reference Audio\n",
+    "\n",
+    "You MUST upload a 3-10 second audio clip **before** running inference.\n",
+    "\n",
+    "- **Colab**: Click the 📁 folder icon on the left → Upload `my_voice.wav` to `/content/`\n",
+    "- **Kaggle**: Use the Input panel or upload to `/kaggle/working/`\n",
+    "\n",
+    "**Tips for best results:**\n",
+    "- Use clear speech with no background noise\n",
+    "- Duration: 3-10 seconds\n",
+    "- Format: `.wav` or `.mp3` (wav preferred)\n",
+    "- Know the **exact transcript** of what is said in the audio"
    ]
   },
   {
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Verify GPU is available\n",
     "import torch\n",
     "print(f\"PyTorch version: {torch.__version__}\")\n",
     "print(f\"CUDA available: {torch.cuda.is_available()}\")\n",
     "if torch.cuda.is_available():\n",
     "    print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
+    "    print(f\"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB\")\n",
+    "else:\n",
+    "    print(\"⚠️ WARNING: No GPU detected! Enable GPU in Runtime settings for faster inference.\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Step 1: Install Dependencies"
    ]
   },
   {
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Set PYTHONHASHSEED before installing so child processes (such as the !pip call below) inherit a valid value\n",
+    "import os\n",
+    "os.environ['PYTHONHASHSEED'] = 'random'\n",
+    "\n",
     "!pip install -q f5-tts soundfile\n",
+    "print(\"✅ Installation complete!\")"
    ]
   },
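A note on the `PYTHONHASHSEED` line in the install cell: Python reads this variable once, at interpreter startup, so assigning it through `os.environ` does not change hashing in the kernel that is already running. What the assignment does do is change the environment inherited by processes started afterwards. The sketch below illustrates that with a plain child interpreter; the variable name `child_value` is only for illustration, and it is an assumption that the error seen on Colab/Kaggle comes from a child process rejecting an invalid inherited value.

```python
import os
import subprocess
import sys

# Same assignment as in the notebook cell.
os.environ["PYTHONHASHSEED"] = "random"

# Start a fresh interpreter and ask it what value it inherited.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('PYTHONHASHSEED'))"],
    capture_output=True,
    text=True,
    check=True,
)
child_value = child.stdout.strip()
print(f"Child process sees PYTHONHASHSEED={child_value}")
```

If the child starts and reports `random`, any later subprocess launched from the notebook should inherit the same valid setting.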
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Step 2: Download F5-TTS Model (~1.3 GB)\n",
     "\n",
+    "This downloads the pretrained checkpoint and vocab file. On Colab, the download persists for the current session only."
    ]
   },
   {
     "import os\n",
     "\n",
     "# Model cache directory\n",
+    "MODEL_DIR = \"./f5tts_model\"\n",
     "\n",
     "# Download only the v1 Base checkpoint and vocab\n",
     "snapshot_download(\n",
     ")\n",
     "\n",
     "# Verify files\n",
+    "print(\"Downloaded files:\")\n",
+    "for f in sorted(os.listdir(f\"{MODEL_DIR}/F5TTS_v1_Base\")):\n",
     "    print(f\"  {f}\")"
    ]
   },
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Step 3: Set Your Reference Audio Path\n",
     "\n",
+    "**Make sure you uploaded your audio file first!**\n",
     "\n",
+    "- Colab default: `/content/my_voice.wav`\n",
+    "- Kaggle default: `/kaggle/working/my_voice.wav`\n",
     "\n",
+    "Change the path below to match your uploaded file."
    ]
   },
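Since the default paths differ between Colab and Kaggle, one option is to detect the working directory instead of editing it by hand. The following is a minimal sketch, not part of the notebook: `default_work_dir` is a hypothetical helper, and it assumes that the presence of `/kaggle/working` or `/content` is a good enough signal for which platform is running.

```python
import os

def default_work_dir(candidates=("/kaggle/working", "/content")) -> str:
    """Return the first existing candidate directory, else the current directory."""
    for candidate in candidates:
        if os.path.isdir(candidate):
            return candidate
    # Neither platform directory exists (e.g. a local run): fall back to cwd.
    return os.getcwd()

work_dir = default_work_dir()
ref_audio_path = os.path.join(work_dir, "my_voice.wav")
print(f"Looking for reference audio at: {ref_audio_path}")
```

The resulting path could then be assigned to `REF_AUDIO_PATH`, and the same directory reused for output files.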
   {
    "metadata": {},
    "outputs": [],
    "source": [
+    "# === CONFIGURE THESE ===\n",
     "\n",
+    "# Path to your uploaded reference audio file\n",
+    "REF_AUDIO_PATH = \"/content/my_voice.wav\"  # <-- CHANGE THIS to your file path\n",
     "\n",
+    "# Exact transcript of what is spoken in the reference audio\n",
+    "REF_TEXT = \"Hello, this is my voice sample for cloning.\"  # <-- CHANGE THIS to match your audio\n",
+    "\n",
+    "# Text you want the cloned voice to say\n",
+    "GEN_TEXT = \"Hello! This is my cloned voice speaking. Amazing how just a few seconds of audio can create such a realistic copy.\"\n",
+    "\n",
+    "# Quality settings\n",
+    "NFE_STEP = 32       # 16=fast, 32=good quality, 64=best (slower)\n",
+    "SPEED = 1.0         # Speech rate (0.5=slow, 1.0=normal, 1.5=fast)\n",
+    "\n",
+    "# =======================\n",
+    "\n",
+    "# Verify audio file exists\n",
+    "import soundfile as sf\n",
+    "if not os.path.exists(REF_AUDIO_PATH):\n",
+    "    print(f\"❌ ERROR: File not found: {REF_AUDIO_PATH}\")\n",
+    "    print(\"\\nPlease upload your reference audio first!\")\n",
+    "    print(\"\\nColab: Click the 📁 folder icon (left sidebar) → Upload your .wav file,\")\n",
+    "    print(\"       or run: from google.colab import files; uploaded = files.upload()\")\n",
+    "    print(\"Kaggle: Use the Input panel or upload to /kaggle/working/\")\n",
+    "    raise FileNotFoundError(f\"Audio file not found: {REF_AUDIO_PATH}\")\n",
+    "\n",
+    "# Show audio info\n",
+    "info = sf.info(REF_AUDIO_PATH)\n",
+    "print(f\"✅ Audio found: {REF_AUDIO_PATH}\")\n",
+    "print(f\"   Duration: {info.duration:.2f}s, Sample rate: {info.samplerate}Hz, Channels: {info.channels}\")\n",
+    "\n",
+    "if info.duration < 3.0:\n",
+    "    print(\"⚠️ WARNING: Audio is shorter than 3 seconds. Use 3-10 seconds for best results.\")\n",
+    "elif info.duration > 10.0:\n",
+    "    print(\"⚠️ WARNING: Audio is longer than 10 seconds. Use 3-10 seconds for best results.\")\n",
+    "\n",
+    "print(f\"\\nReference text: {REF_TEXT}\")\n",
+    "print(f\"Generate text:  {GEN_TEXT}\")"
    ]
   },
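The duration check in the cell above relies on `soundfile`. For PCM `.wav` files the same check can be sketched with only the standard library, which may help when validating a clip before the dependencies are installed. The helper names `wav_duration_seconds` and `check_reference` are hypothetical, the 3 to 10 second bounds come from the notebook's own guidance, and the demo file is synthetic silence rather than real speech.

```python
import os
import tempfile
import wave

def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed as frames / sample rate."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def check_reference(path: str, min_s: float = 3.0, max_s: float = 10.0) -> str:
    """Classify a reference clip as 'too_short', 'too_long', or 'ok'."""
    duration = wav_duration_seconds(path)
    if duration < min_s:
        return "too_short"
    if duration > max_s:
        return "too_long"
    return "ok"

# Demo: write 5 seconds of 16 kHz mono 16-bit silence and check it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000 * 5)

demo_status = check_reference(demo_path)
print(f"{demo_path}: {wav_duration_seconds(demo_path):.2f}s -> {demo_status}")
```

Note that the `wave` module only reads uncompressed PCM WAV, so `.mp3` references would still need `soundfile` or another decoder.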
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "### (Optional) Upload Audio via Code Cell\n",
     "\n",
+    "On Colab, if you prefer to upload via code instead of the file panel, uncomment the lines in this cell and run it:"
    ]
   },
   {
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Only run this if you haven't uploaded via the file panel\n",
+    "# from google.colab import files\n",
+    "# uploaded = files.upload()  # Select your .wav file\n",
+    "# REF_AUDIO_PATH = list(uploaded.keys())[0]  # Auto-detect uploaded file\n",
+    "# print(f\"Uploaded: {REF_AUDIO_PATH}\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Step 4: Load the Model"
    ]
   },
   {
    "metadata": {},
    "outputs": [],
    "source": [
+    "import os\n",
+    "os.environ['PYTHONHASHSEED'] = 'random'  # Fix hash seed issue\n",
+    "\n",
     "from f5_tts.api import F5TTS\n",
     "import torch\n",
     "\n",
     "    device=device,\n",
     ")\n",
     "\n",
+    "print(\"✅ Model loaded successfully!\")\n",
+    "print(f\"   Ready for inference on {device}\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Step 5: Clone the Voice! 🎙️"
    ]
   },
   {
    "metadata": {},
    "outputs": [],
    "source": [
+    "print(\"Generating with:\")\n",
+    "print(f\"  Reference: {REF_AUDIO_PATH}\")\n",
+    "print(f\"  Ref text:  {REF_TEXT}\")\n",
+    "print(f\"  Gen text:  {GEN_TEXT}\")\n",
+    "print(f\"  NFE step:  {NFE_STEP} (quality/speed tradeoff)\")\n",
+    "print(f\"  Speed:     {SPEED}\")\n",
+    "print(f\"\\nGenerating... (this takes 10-30s on T4 GPU with nfe={NFE_STEP})\\n\")\n",
     "\n",
     "# Run inference\n",
     "wav, sr, _ = tts.infer(\n",
     ")\n",
     "\n",
     "# Save output\n",
+    "OUTPUT_PATH = \"/content/output_cloned.wav\"  # Change to /kaggle/working/ on Kaggle\n",
     "sf.write(OUTPUT_PATH, wav, sr)\n",
     "\n",
     "print(f\"✅ Done! Saved to: {OUTPUT_PATH}\")\n",
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Step 6: Listen to the Result"
    ]
   },
   {
     "]\n",
     "\n",
     "for i, sentence in enumerate(sentences):\n",
+    "    print(f\"\\n[{i+1}/{len(sentences)}] Generating: {sentence[:50]}...\")\n",
     "    wav, sr, _ = tts.infer(\n",
     "        ref_file=REF_AUDIO_PATH,\n",
     "        ref_text=REF_TEXT,\n",
     "    )\n",
     "    out_path = f\"/content/output_{i+1}.wav\"\n",
     "    sf.write(out_path, wav, sr)\n",
+    "    print(f\"   ✅ Saved: {out_path} ({len(wav)/sr:.2f}s)\")\n",
     "\n",
     "print(\"\\nAll done! Listen below:\")\n",
     "for i in range(len(sentences)):\n",
     "\n",
     "| Issue | Solution |\n",
     "|-------|----------|\n",
+    "| `PYTHONHASHSEED` error | Handled in this notebook: `os.environ['PYTHONHASHSEED'] = 'random'` is set before the install and model-loading imports |\n",
+    "| `FileNotFoundError: my_voice.wav` | Upload your audio file first! See Step 0b |\n",
+    "| Out of memory (OOM) | Reduce `NFE_STEP` to 16, or restart runtime |\n",
+    "| Audio sounds wrong / robotic | Double-check `REF_TEXT` matches reference audio **exactly** |\n",
     "| Slow on first run | Model downloads ~1.3GB on first use — subsequent runs are faster |\n",
+    "| `TorchCodec is required` | Run `!pip install -q torchcodec` |\n",
+    "| Voice doesn't sound like reference | Use longer reference (8-10s), clearer audio, exact transcript |\n",
     "\n",
     "---\n",
     "\n",