ibcplateformes Claude Opus 4.6 committed on
Commit fea49f2 · Parent(s): 55b9bab
Replace RVC with Seed-VC for zero-shot voice conversion
RVC required fine-tuning (250-500 epochs), which is incompatible with ZeroGPU's 60 s limit
and resulted in poor quality. Seed-VC uses a diffusion transformer plus in-context learning
for zero-shot conversion with just 3-30 s of reference audio.
- Rewrite inference.py: Seed-VC pipeline (Whisper + CAMPPlus + RMVPE + BigVGAN)
- Simplify training.py: just save reference audio (no neural network training)
- Simplify setup.py: clone Seed-VC repo instead of Applio
- Update storage.py: handle audio reference files
- Simplify app.py UI: remove epochs slider, 3-30 sec upload
- Update requirements.txt: remove RVC deps, add Seed-VC deps
- Update README.md: reflect new architecture
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
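The commit message says a 3-30 second reference clip is enough for zero-shot conversion. A minimal sketch of how that window could be checked before a clip is saved is shown below. It is not taken from the commit: `wav_duration_seconds` and `is_valid_reference` are hypothetical helpers, and the sketch assumes an uncompressed PCM WAV input (MP3 would need a separate decoder).

```python
import wave

MIN_SECONDS = 3.0   # lower bound quoted in the commit message
MAX_SECONDS = 30.0  # upper bound quoted in the commit message


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def is_valid_reference(path):
    """Hypothetical check: True if the clip lasts between 3 and 30 seconds."""
    duration = wav_duration_seconds(path)
    return MIN_SECONDS <= duration <= MAX_SECONDS
```

Rejecting out-of-range clips at upload time would surface the problem in the UI before any GPU time is spent.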
- README.md +22 -100
- app.py +112 -172
- pipeline/inference.py +286 -269
- pipeline/setup.py +31 -114
- pipeline/storage.py +73 -70
- pipeline/training.py +70 -489
- requirements.txt +10 -18
README.md
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
---
|
| 2 |
-
title: Clone Vocal
|
| 3 |
emoji: "\U0001F3A4"
|
| 4 |
colorFrom: purple
|
| 5 |
colorTo: blue
|
|
@@ -10,126 +10,48 @@ app_file: app.py
|
|
| 10 |
pinned: false
|
| 11 |
license: mit
|
| 12 |
tags:
|
| 13 |
-
-
|
| 14 |
- voice-cloning
|
| 15 |
- demucs
|
| 16 |
- audio
|
| 17 |
- music
|
|
|
|
| 18 |
---
|
| 19 |
|
| 20 |
-
# Clone Vocal
|
| 21 |
|
| 22 |
-
Outil web de **clonage vocal** basé sur **
|
| 23 |
|
| 24 |
## Fonctionnalités
|
| 25 |
|
| 26 |
-
1. **
|
| 27 |
2. **Séparation audio** : Séparation automatique voix/instruments via Demucs (Meta AI)
|
| 28 |
-
3. **Conversion vocale** : Remplacement de la voix originale par
|
| 29 |
-
4. **Mixage final** : Remixage automatique
|
| 30 |
5. **Export** : Téléchargement du résultat en WAV 44.1kHz 16-bit
|
| 31 |
|
| 32 |
## Comment utiliser
|
| 33 |
|
| 34 |
-
### Étape 1 :
|
| 35 |
-
1.
|
| 36 |
-
2. Uploadez un
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
3. Donnez un nom à votre modèle (ex: `ma_voix`)
|
| 40 |
-
4. Choisissez le nombre d'époques (20 par défaut, suffisant pour un bon résultat)
|
| 41 |
-
5. Cliquez sur **"Lancer l'entraînement"**
|
| 42 |
-
6. Attendez la fin de l'entraînement (~3-5 minutes)
|
| 43 |
|
| 44 |
### Étape 2 : Convertir un morceau
|
| 45 |
-
1.
|
| 46 |
-
2. Sélectionnez votre
|
| 47 |
-
3. Uploadez le morceau
|
| 48 |
-
4. Ajustez les paramètres si besoin
|
| 49 |
-
|
| 50 |
-
- **Taux d'index** : fidélité au timbre (0.75 par défaut)
|
| 51 |
-
- **Volumes** : équilibre voix/instruments
|
| 52 |
-
5. Cliquez sur **"Convertir et mixer"**
|
| 53 |
-
6. Écoutez l'aperçu et téléchargez le résultat
|
| 54 |
-
|
| 55 |
-
### Étape 3 : Gérer vos modèles
|
| 56 |
-
- L'onglet **"Mes modèles"** permet de voir, supprimer, ou importer des modèles externes
|
| 57 |
-
|
| 58 |
-
## Déploiement
|
| 59 |
-
|
| 60 |
-
### Prérequis
|
| 61 |
-
- Un compte [HuggingFace](https://huggingface.co)
|
| 62 |
-
- Un compte [GitHub](https://github.com)
|
| 63 |
-
|
| 64 |
-
### Étapes de déploiement
|
| 65 |
-
|
| 66 |
-
#### 1. Créer un dataset repo sur HuggingFace (pour stocker les modèles)
|
| 67 |
-
1. Allez sur https://huggingface.co/new-dataset
|
| 68 |
-
2. Nom : `rvc-voice-models`
|
| 69 |
-
3. Visibilité : **Privé**
|
| 70 |
-
4. Cliquez **Create**
|
| 71 |
-
|
| 72 |
-
#### 2. Créer un token HuggingFace
|
| 73 |
-
1. Allez sur https://huggingface.co/settings/tokens
|
| 74 |
-
2. Cliquez **Create new token**
|
| 75 |
-
3. Nom : `rvc-voice-cloner`
|
| 76 |
-
4. Permissions : **Write**
|
| 77 |
-
5. Copiez le token
|
| 78 |
-
|
| 79 |
-
#### 3. Créer le repo GitHub
|
| 80 |
-
```bash
|
| 81 |
-
cd rvc-voice-cloner
|
| 82 |
-
git init
|
| 83 |
-
git add .
|
| 84 |
-
git commit -m "Initial commit: Clone Vocal RVC"
|
| 85 |
-
git remote add origin https://github.com/diamesene02/rvc-voice-cloner.git
|
| 86 |
-
git push -u origin main
|
| 87 |
-
```
|
| 88 |
-
|
| 89 |
-
#### 4. Créer le HuggingFace Space
|
| 90 |
-
1. Allez sur https://huggingface.co/new-space
|
| 91 |
-
2. Nom : `clone-vocal-rvc`
|
| 92 |
-
3. SDK : **Gradio**
|
| 93 |
-
4. Hardware : **ZeroGPU** (gratuit pour les espaces publics)
|
| 94 |
-
5. Cliquez **Create Space**
|
| 95 |
-
|
| 96 |
-
#### 5. Configurer les secrets du Space
|
| 97 |
-
Dans les **Settings** du Space :
|
| 98 |
-
- Ajoutez `HF_TOKEN` : votre token HuggingFace (étape 2)
|
| 99 |
-
- Ajoutez `HF_MODELS_REPO` : `votre-username/rvc-voice-models`
|
| 100 |
-
|
| 101 |
-
#### 6. Déployer le code
|
| 102 |
-
```bash
|
| 103 |
-
# Ajouter le remote HuggingFace
|
| 104 |
-
git remote add hf https://huggingface.co/spaces/votre-username/clone-vocal-rvc
|
| 105 |
-
|
| 106 |
-
# Pousser le code
|
| 107 |
-
git push hf main
|
| 108 |
-
```
|
| 109 |
-
|
| 110 |
-
#### 7. Accéder à l'outil
|
| 111 |
-
Votre outil est accessible à :
|
| 112 |
-
```
|
| 113 |
-
https://huggingface.co/spaces/votre-username/clone-vocal-rvc
|
| 114 |
-
```
|
| 115 |
|
| 116 |
## Architecture technique
|
| 117 |
|
| 118 |
-
- **
|
| 119 |
-
- **Demucs** (Meta AI) : Séparation des sources audio
|
| 120 |
- **Gradio** : Interface web
|
| 121 |
-
- **ZeroGPU** : GPU
|
| 122 |
-
- **Applio** : Backend RVC (cloné automatiquement au démarrage)
|
| 123 |
-
|
| 124 |
-
## Limitations
|
| 125 |
-
|
| 126 |
-
- **Quota GPU** : ~5 minutes de GPU gratuit par jour (ZeroGPU)
|
| 127 |
-
- L'entraînement consomme ~3-4 min
|
| 128 |
-
- La conversion consomme ~1-2 min
|
| 129 |
-
- Pour plus de GPU : upgrade vers HuggingFace PRO ($9/mois, 25 min/jour)
|
| 130 |
-
- Les modèles sont stockés sur HuggingFace Hub (persistance entre redémarrages)
|
| 131 |
-
- Premier lancement plus lent (téléchargement des modèles pré-entraînés)
|
| 132 |
|
| 133 |
## Licence
|
| 134 |
|
| 135 |
-
MIT
|
|
|
|
| 1 |
---
|
| 2 |
+
title: Clone Vocal
|
| 3 |
emoji: "\U0001F3A4"
|
| 4 |
colorFrom: purple
|
| 5 |
colorTo: blue
|
|
|
|
| 10 |
pinned: false
|
| 11 |
license: mit
|
| 12 |
tags:
|
| 13 |
+
- seed-vc
|
| 14 |
- voice-cloning
|
| 15 |
- demucs
|
| 16 |
- audio
|
| 17 |
- music
|
| 18 |
+
- zero-shot
|
| 19 |
---
|
| 20 |
|
| 21 |
+
# Clone Vocal
|
| 22 |
|
| 23 |
+
Outil web de **clonage vocal zero-shot** basé sur **Seed-VC** (Diffusion Transformer), accessible depuis votre navigateur.
|
| 24 |
|
| 25 |
## Fonctionnalités
|
| 26 |
|
| 27 |
+
1. **Référence vocale** : Uploadez un court extrait de votre voix (3-30 sec) — pas d'entraînement nécessaire
|
| 28 |
2. **Séparation audio** : Séparation automatique voix/instruments via Demucs (Meta AI)
|
| 29 |
+
3. **Conversion vocale** : Remplacement de la voix originale par la vôtre (Seed-VC zero-shot)
|
| 30 |
+
4. **Mixage final** : Remixage automatique voix convertie + instruments originaux
|
| 31 |
5. **Export** : Téléchargement du résultat en WAV 44.1kHz 16-bit
|
| 32 |
|
| 33 |
## Comment utiliser
|
| 34 |
|
| 35 |
+
### Étape 1 : Enregistrer votre référence vocale
|
| 36 |
+
1. Onglet **"Ma voix"**
|
| 37 |
+
2. Uploadez un extrait de votre voix (WAV ou MP3, 3 à 30 secondes)
|
| 38 |
+
3. Donnez un nom (ex: `ma_voix`)
|
| 39 |
+
4. Cliquez **"Sauvegarder"**
|
| 40 |
|
| 41 |
### Étape 2 : Convertir un morceau
|
| 42 |
+
1. Onglet **"Convertir un morceau"**
|
| 43 |
+
2. Sélectionnez votre profil vocal
|
| 44 |
+
3. Uploadez le morceau à convertir
|
| 45 |
+
4. Ajustez les paramètres si besoin (transposition, qualité, volumes)
|
| 46 |
+
5. Cliquez **"Convertir et mixer"**
|
| 47 |
|
| 48 |
## Architecture technique
|
| 49 |
|
| 50 |
+
- **Seed-VC** : Voice conversion zero-shot par diffusion transformer + in-context learning
|
| 51 |
+
- **Demucs** (Meta AI) : Séparation des sources audio
|
| 52 |
- **Gradio** : Interface web
|
| 53 |
+
- **ZeroGPU** : GPU sur HuggingFace Spaces
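The components listed above form a staged flow: separation, conversion, then mixing. The sketch below shows one way to wire those stages together. It is not the Space's actual code: `run_pipeline` is a hypothetical name and the stage functions are passed in as plain callables. The progress values (0.10, 0.40, 0.85, 1.0) mirror the ones reported in `app.py`.

```python
def run_pipeline(song_path, reference_path, separate, convert, mix,
                 report=lambda value, desc: None):
    """Hypothetical orchestration of the separation, conversion and mixing stages.

    separate(song_path)             -> (vocals, instruments)
    convert(vocals, reference_path) -> converted vocals
    mix(converted, instruments)     -> final track
    report(value, desc)             -> optional progress callback
    """
    report(0.10, "separation")
    vocals, instruments = separate(song_path)

    report(0.40, "conversion")
    converted = convert(vocals, reference_path)

    report(0.85, "mixing")
    final = mix(converted, instruments)

    report(1.0, "done")
    return vocals, converted, final
```

Passing the stages as parameters keeps the flow testable with stand-in functions, without loading Demucs or Seed-VC.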
|
| 54 |
|
| 55 |
## Licence
|
| 56 |
|
| 57 |
+
MIT — Basé sur [Seed-VC](https://github.com/Plachtaa/seed-vc) (GPL v3) et [Demucs](https://github.com/facebookresearch/demucs) (MIT)
|
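The README describes a final mix of converted vocals with the original instruments, exported as 44.1 kHz 16-bit WAV. The diff does not show `pipeline/mixing.py`, so the following is only an illustrative sketch using the standard library. `mix_tracks` and `write_wav_16bit` are hypothetical names, and the tracks are assumed to be mono lists of floats in [-1, 1].

```python
import struct
import wave

SAMPLE_RATE = 44100  # export rate stated in the README


def mix_tracks(vocals, instruments, vocal_volume=1.0, instrumental_volume=1.0):
    """Sum two mono float tracks with per-track gain, then clip to [-1, 1]."""
    length = max(len(vocals), len(instruments))
    mixed = []
    for i in range(length):
        v = vocals[i] if i < len(vocals) else 0.0            # pad shorter track
        s = instruments[i] if i < len(instruments) else 0.0  # with silence
        sample = v * vocal_volume + s * instrumental_volume
        mixed.append(max(-1.0, min(1.0, sample)))
    return mixed


def write_wav_16bit(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples as little-endian signed 16-bit PCM."""
    frames = b"".join(struct.pack("<h", int(round(s * 32767))) for s in samples)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
```

Clipping after the sum is the simplest safeguard against overflow. A real mixer might normalize or apply a limiter instead.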
app.py
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
"""
|
| 2 |
-
Clone Vocal
|
| 3 |
-
Interface Gradio en
|
| 4 |
"""
|
| 5 |
|
| 6 |
import os
|
|
@@ -11,8 +11,7 @@ import shutil
|
|
| 11 |
|
| 12 |
import gradio as gr
|
| 13 |
|
| 14 |
-
#
|
| 15 |
-
# Bug: gradio_client/utils.py get_type() crashes when schema is a bool instead of dict
|
| 16 |
try:
|
| 17 |
import gradio_client.utils as _gc_utils
|
| 18 |
|
|
@@ -43,46 +42,38 @@ except Exception:
|
|
| 43 |
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
| 44 |
logger = logging.getLogger(__name__)
|
| 45 |
|
| 46 |
-
#
|
| 47 |
-
|
| 48 |
logger.info("Initialisation de l'application...")
|
| 49 |
|
| 50 |
-
from pipeline.setup import
|
| 51 |
-
from pipeline.storage import init_storage, list_models, download_model, delete_model
|
| 52 |
|
| 53 |
-
# Setup Applio (clone + download pretrained models)
|
| 54 |
try:
|
| 55 |
-
|
| 56 |
except Exception as e:
|
| 57 |
-
logger.error(
|
| 58 |
|
| 59 |
# Initialize model storage
|
| 60 |
HF_MODELS_REPO = os.environ.get("HF_MODELS_REPO", "")
|
| 61 |
if HF_MODELS_REPO:
|
| 62 |
init_storage(HF_MODELS_REPO)
|
| 63 |
-
logger.info(
|
| 64 |
-
else:
|
| 65 |
-
logger.warning(
|
| 66 |
-
"Variable HF_MODELS_REPO non définie. Les modèles seront stockés localement uniquement. "
|
| 67 |
-
"Pour la persistance, ajoutez HF_MODELS_REPO=votre-user/rvc-voice-models dans les secrets du Space."
|
| 68 |
-
)
|
| 69 |
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
from pipeline.training import full_training_pipeline, extract_features
|
| 73 |
from pipeline.separation import separate_audio
|
| 74 |
from pipeline.inference import convert_voice
|
| 75 |
|
| 76 |
|
| 77 |
-
#
|
| 78 |
|
| 79 |
-
def train_voice_model(audio_file, model_name,
|
| 80 |
-
"""Handler
|
| 81 |
if audio_file is None:
|
| 82 |
return "Erreur : Veuillez uploader un fichier audio.", None
|
| 83 |
|
| 84 |
if not model_name or not model_name.strip():
|
| 85 |
-
return "Erreur : Veuillez entrer un nom pour le
|
| 86 |
|
| 87 |
model_name = model_name.strip().replace(" ", "_")
|
| 88 |
|
|
@@ -90,39 +81,31 @@ def train_voice_model(audio_file, model_name, epochs, progress=gr.Progress()):
|
|
| 90 |
progress(value, desc=desc)
|
| 91 |
|
| 92 |
try:
|
| 93 |
-
progress(0.0, desc="
|
| 94 |
-
|
| 95 |
-
pth_path, index_path = full_training_pipeline(
|
| 96 |
audio_path=audio_file,
|
| 97 |
model_name=model_name,
|
| 98 |
-
epochs=int(epochs),
|
| 99 |
-
sample_rate=40000,
|
| 100 |
-
batch_size=8,
|
| 101 |
progress_callback=progress_callback,
|
| 102 |
)
|
| 103 |
|
| 104 |
-
|
| 105 |
-
result_msg += f"Fichier : {os.path.basename(pth_path)}\n"
|
| 106 |
-
if index_path:
|
| 107 |
-
result_msg += f"Index : {os.path.basename(index_path)}"
|
| 108 |
-
|
| 109 |
-
return result_msg, pth_path
|
| 110 |
|
| 111 |
except Exception as e:
|
| 112 |
import traceback
|
| 113 |
tb = traceback.format_exc()
|
| 114 |
-
logger.error(
|
| 115 |
-
|
| 116 |
-
|
|
|
|
| 117 |
|
| 118 |
|
| 119 |
-
#
|
| 120 |
|
| 121 |
def get_model_choices():
|
| 122 |
"""Get list of trained model names for dropdown."""
|
| 123 |
models = list_models()
|
| 124 |
if not models:
|
| 125 |
-
return ["(aucun
|
| 126 |
return models
|
| 127 |
|
| 128 |
|
|
@@ -130,7 +113,7 @@ def convert_song(
|
|
| 130 |
model_choice,
|
| 131 |
song_file,
|
| 132 |
pitch,
|
| 133 |
-
|
| 134 |
vocal_volume,
|
| 135 |
instrumental_volume,
|
| 136 |
progress=gr.Progress(),
|
|
@@ -139,35 +122,39 @@ def convert_song(
|
|
| 139 |
if song_file is None:
|
| 140 |
return "Erreur : Veuillez uploader un fichier audio.", None, None, None
|
| 141 |
|
| 142 |
-
if model_choice == "(aucun
|
| 143 |
-
return "Erreur : Veuillez d'abord
|
| 144 |
|
| 145 |
from pipeline.mixing import mix_audio
|
| 146 |
|
| 147 |
try:
|
| 148 |
-
# Step 1: Download model
|
| 149 |
-
progress(0.05, desc="Chargement du
|
| 150 |
-
pth_path,
|
| 151 |
if not pth_path:
|
| 152 |
-
return
|
| 153 |
|
| 154 |
# Step 2: Separate vocals from instruments
|
| 155 |
-
progress(0.10, desc="
|
| 156 |
vocals_path, instruments_path = separate_audio(song_file)
|
| 157 |
|
| 158 |
-
progress(0.
|
| 159 |
|
| 160 |
-
# Step 3: Convert vocals with
|
| 161 |
converted_path = convert_voice(
|
| 162 |
audio_path=vocals_path,
|
| 163 |
-
|
| 164 |
-
index_path=index_path,
|
| 165 |
pitch=int(pitch),
|
| 166 |
-
|
| 167 |
-
index_rate=
|
| 168 |
)
|
| 169 |
|
| 170 |
-
progress(0.
|
| 171 |
|
| 172 |
# Step 4: Mix converted vocals with instruments
|
| 173 |
final_path = mix_audio(
|
|
@@ -177,119 +164,94 @@ def convert_song(
|
|
| 177 |
instrumental_volume=float(instrumental_volume),
|
| 178 |
)
|
| 179 |
|
| 180 |
-
progress(1.0, desc="
|
| 181 |
|
| 182 |
return (
|
| 183 |
-
"Conversion
|
| 184 |
-
vocals_path,
|
| 185 |
-
converted_path,
|
| 186 |
-
final_path,
|
| 187 |
)
|
| 188 |
|
| 189 |
except Exception as e:
|
| 190 |
import traceback
|
| 191 |
tb = traceback.format_exc()
|
| 192 |
-
logger.error(
|
| 193 |
-
return
|
|
|
|
|
|
|
| 194 |
|
| 195 |
|
| 196 |
-
#
|
| 197 |
|
| 198 |
def refresh_models():
|
| 199 |
"""Refresh the model list as HTML."""
|
| 200 |
models = list_models()
|
| 201 |
if not models:
|
| 202 |
-
return "<p style='color:gray;'>Aucun
|
| 203 |
-
rows = "".join(
|
| 204 |
-
|
| 205 |
|
| 206 |
|
| 207 |
def delete_selected_model(model_name_to_delete):
|
| 208 |
"""Delete a model."""
|
| 209 |
-
if not model_name_to_delete or model_name_to_delete == "(aucun
|
| 210 |
-
return "Veuillez
|
| 211 |
try:
|
| 212 |
delete_model(model_name_to_delete)
|
| 213 |
-
return
|
| 214 |
except Exception as e:
|
| 215 |
-
return
|
| 216 |
|
| 217 |
|
| 218 |
-
|
| 219 |
-
"""Upload an external .pth model."""
|
| 220 |
-
if pth_file is None:
|
| 221 |
-
return "Veuillez sélectionner un fichier .pth", refresh_models()
|
| 222 |
-
|
| 223 |
-
if not model_name or not model_name.strip():
|
| 224 |
-
return "Veuillez entrer un nom pour le modèle.", refresh_models()
|
| 225 |
-
|
| 226 |
-
model_name = model_name.strip().replace(" ", "_")
|
| 227 |
-
|
| 228 |
-
from pipeline.storage import LOCAL_MODELS_DIR, upload_model
|
| 229 |
-
|
| 230 |
-
local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
|
| 231 |
-
os.makedirs(local_dir, exist_ok=True)
|
| 232 |
-
|
| 233 |
-
local_pth = os.path.join(local_dir, f"{model_name}.pth")
|
| 234 |
-
shutil.copy2(pth_file, local_pth)
|
| 235 |
-
|
| 236 |
-
try:
|
| 237 |
-
upload_model(model_name, local_pth)
|
| 238 |
-
except Exception:
|
| 239 |
-
pass # Non-critical
|
| 240 |
-
|
| 241 |
-
return f"Modèle '{model_name}' importé avec succès.", refresh_models()
|
| 242 |
-
|
| 243 |
-
|
| 244 |
-
# ── Build Gradio UI ──────────────────────────────────────────────────────────
|
| 245 |
|
| 246 |
DESCRIPTION = """
|
| 247 |
-
# Clone Vocal
|
| 248 |
|
| 249 |
-
Outil de clonage vocal
|
| 250 |
|
| 251 |
**Comment utiliser :**
|
| 252 |
-
1. **Onglet "
|
| 253 |
-
2. **Onglet "Convertir"** : Uploadez un morceau de musique, l'outil remplace la voix par la
|
| 254 |
-
3. **Onglet "
|
| 255 |
|
| 256 |
-
> **
|
| 257 |
-
> L'entraînement consomme ~3-4 min de GPU, la conversion ~1-2 min.
|
| 258 |
"""
|
| 259 |
|
| 260 |
with gr.Blocks(
|
| 261 |
-
title="Clone Vocal
|
| 262 |
theme=gr.themes.Soft(),
|
| 263 |
) as app:
|
| 264 |
|
| 265 |
gr.Markdown(DESCRIPTION)
|
| 266 |
|
| 267 |
with gr.Tabs():
|
| 268 |
-
#
|
| 269 |
-
with gr.TabItem("
|
| 270 |
-
gr.Markdown("###
|
| 271 |
|
| 272 |
with gr.Row():
|
| 273 |
with gr.Column(scale=2):
|
| 274 |
train_audio = gr.Audio(
|
| 275 |
-
label="
|
| 276 |
type="filepath",
|
| 277 |
sources=["upload"],
|
| 278 |
)
|
| 279 |
train_model_name = gr.Textbox(
|
| 280 |
-
label="Nom du
|
| 281 |
placeholder="ex: ma_voix",
|
| 282 |
max_lines=1,
|
| 283 |
)
|
| 284 |
-
train_epochs = gr.Slider(
|
| 285 |
-
minimum=5,
|
| 286 |
-
maximum=50,
|
| 287 |
-
value=20,
|
| 288 |
-
step=5,
|
| 289 |
-
label="Nombre d'époques (plus = meilleure qualité, ~3-5 min avec GPU)",
|
| 290 |
-
)
|
| 291 |
train_btn = gr.Button(
|
| 292 |
-
"
|
| 293 |
variant="primary",
|
| 294 |
size="lg",
|
| 295 |
)
|
|
@@ -298,59 +260,59 @@ with gr.Blocks(
|
|
| 298 |
train_status = gr.Textbox(
|
| 299 |
label="Statut",
|
| 300 |
interactive=False,
|
| 301 |
-
lines=
|
| 302 |
)
|
| 303 |
train_download = gr.File(
|
| 304 |
-
label="
|
| 305 |
interactive=False,
|
| 306 |
)
|
| 307 |
|
| 308 |
gr.Markdown(
|
| 309 |
"**Conseils :**\n"
|
| 310 |
"- Utilisez un enregistrement propre (pas de bruit de fond, pas de musique)\n"
|
| 311 |
-
"- Parlez ou chantez naturellement pendant 3
|
| 312 |
-
"-
|
| 313 |
-
"-
|
| 314 |
)
|
| 315 |
|
| 316 |
train_btn.click(
|
| 317 |
fn=train_voice_model,
|
| 318 |
-
inputs=[train_audio, train_model_name
|
| 319 |
outputs=[train_status, train_download],
|
| 320 |
)
|
| 321 |
|
| 322 |
-
#
|
| 323 |
with gr.TabItem("Convertir un morceau"):
|
| 324 |
-
gr.Markdown("### Remplacer la voix d'un morceau par la
|
| 325 |
|
| 326 |
with gr.Row():
|
| 327 |
with gr.Column(scale=2):
|
| 328 |
convert_model = gr.Dropdown(
|
| 329 |
choices=get_model_choices(),
|
| 330 |
-
label="
|
| 331 |
interactive=True,
|
| 332 |
)
|
| 333 |
-
refresh_btn = gr.Button("
|
| 334 |
convert_audio = gr.Audio(
|
| 335 |
-
label="Morceau
|
| 336 |
type="filepath",
|
| 337 |
sources=["upload"],
|
| 338 |
)
|
| 339 |
|
| 340 |
-
with gr.Accordion("
|
| 341 |
convert_pitch = gr.Slider(
|
| 342 |
-
minimum=-
|
| 343 |
-
maximum=
|
| 344 |
value=0,
|
| 345 |
step=1,
|
| 346 |
-
label="Transposition (demi-tons)
|
| 347 |
)
|
| 348 |
-
|
| 349 |
-
minimum=
|
| 350 |
-
maximum=
|
| 351 |
-
value=
|
| 352 |
-
step=
|
| 353 |
-
label="
|
| 354 |
)
|
| 355 |
convert_vocal_vol = gr.Slider(
|
| 356 |
minimum=0.0,
|
|
@@ -379,16 +341,16 @@ with gr.Blocks(
|
|
| 379 |
interactive=False,
|
| 380 |
lines=3,
|
| 381 |
)
|
| 382 |
-
gr.Markdown("**
|
| 383 |
preview_vocals = gr.Audio(
|
| 384 |
-
label="Voix originale (
|
| 385 |
interactive=False,
|
| 386 |
)
|
| 387 |
preview_converted = gr.Audio(
|
| 388 |
label="Voix convertie",
|
| 389 |
interactive=False,
|
| 390 |
)
|
| 391 |
-
gr.Markdown("**
|
| 392 |
final_output = gr.Audio(
|
| 393 |
label="Morceau final (voix + instruments)",
|
| 394 |
interactive=False,
|
|
@@ -405,49 +367,33 @@ with gr.Blocks(
|
|
| 405 |
convert_model,
|
| 406 |
convert_audio,
|
| 407 |
convert_pitch,
|
| 408 |
-
|
| 409 |
convert_vocal_vol,
|
| 410 |
convert_inst_vol,
|
| 411 |
],
|
| 412 |
outputs=[convert_status, preview_vocals, preview_converted, final_output],
|
| 413 |
)
|
| 414 |
|
| 415 |
-
#
|
| 416 |
-
with gr.TabItem("Mes
|
| 417 |
-
gr.Markdown("###
|
| 418 |
|
| 419 |
models_table = gr.HTML(
|
| 420 |
value=refresh_models(),
|
| 421 |
-
label="
|
| 422 |
)
|
| 423 |
|
| 424 |
with gr.Row():
|
| 425 |
-
models_refresh_btn = gr.Button("
|
| 426 |
models_delete_name = gr.Dropdown(
|
| 427 |
choices=get_model_choices(),
|
| 428 |
-
label="
|
| 429 |
interactive=True,
|
| 430 |
)
|
| 431 |
models_delete_btn = gr.Button("Supprimer", variant="stop", size="sm")
|
| 432 |
|
| 433 |
models_delete_status = gr.Textbox(label="Statut", interactive=False)
|
| 434 |
|
| 435 |
-
gr.Markdown("---")
|
| 436 |
-
gr.Markdown("### Importer un modèle externe")
|
| 437 |
-
|
| 438 |
-
with gr.Row():
|
| 439 |
-
upload_pth = gr.File(
|
| 440 |
-
label="Fichier .pth du modèle",
|
| 441 |
-
file_types=[".pth"],
|
| 442 |
-
)
|
| 443 |
-
upload_name = gr.Textbox(
|
| 444 |
-
label="Nom du modèle",
|
| 445 |
-
placeholder="ex: voix_importee",
|
| 446 |
-
)
|
| 447 |
-
upload_btn = gr.Button("Importer", size="sm")
|
| 448 |
-
|
| 449 |
-
upload_status = gr.Textbox(label="Statut", interactive=False)
|
| 450 |
-
|
| 451 |
models_refresh_btn.click(
|
| 452 |
fn=refresh_models,
|
| 453 |
outputs=[models_table],
|
|
@@ -463,12 +409,6 @@ with gr.Blocks(
|
|
| 463 |
outputs=[models_delete_status, models_table],
|
| 464 |
)
|
| 465 |
|
| 466 |
-
upload_btn.click(
|
| 467 |
-
fn=upload_external_model,
|
| 468 |
-
inputs=[upload_pth, upload_name],
|
| 469 |
-
outputs=[upload_status, models_table],
|
| 470 |
-
)
|
| 471 |
-
|
| 472 |
|
| 473 |
if __name__ == "__main__":
|
| 474 |
app.launch(server_name="0.0.0.0")
|
|
|
|
| 1 |
"""
|
| 2 |
+
Clone Vocal - Outil web de clonage vocal base sur Seed-VC (zero-shot).
|
| 3 |
+
Interface Gradio en francais, deploye sur HuggingFace Spaces avec ZeroGPU.
|
| 4 |
"""
|
| 5 |
|
| 6 |
import os
|
|
|
|
| 11 |
|
| 12 |
import gradio as gr
|
| 13 |
|
| 14 |
+
# Monkey-patch gradio_client to fix "argument of type 'bool' is not iterable"
|
|
|
|
| 15 |
try:
|
| 16 |
import gradio_client.utils as _gc_utils
|
| 17 |
|
|
|
|
| 42 |
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
| 43 |
logger = logging.getLogger(__name__)
|
| 44 |
|
| 45 |
+
# Startup: clone Seed-VC
|
|
|
|
| 46 |
logger.info("Initialisation de l'application...")
|
| 47 |
|
| 48 |
+
from pipeline.setup import setup_seed_vc
|
| 49 |
+
from pipeline.storage import init_storage, list_models, download_model, delete_model, get_reference_path
|
| 50 |
|
|
|
|
| 51 |
try:
|
| 52 |
+
setup_seed_vc()
|
| 53 |
except Exception as e:
|
| 54 |
+
logger.error("Erreur lors du setup: {}".format(e))
|
| 55 |
|
| 56 |
# Initialize model storage
|
| 57 |
HF_MODELS_REPO = os.environ.get("HF_MODELS_REPO", "")
|
| 58 |
if HF_MODELS_REPO:
|
| 59 |
init_storage(HF_MODELS_REPO)
|
| 60 |
+
logger.info("Stockage HuggingFace configure: {}".format(HF_MODELS_REPO))
|
| 61 |
|
| 62 |
+
# Import GPU-decorated functions for ZeroGPU detection
|
| 63 |
+
from pipeline.training import save_voice_reference, _gpu_warmup
|
|
|
|
| 64 |
from pipeline.separation import separate_audio
|
| 65 |
from pipeline.inference import convert_voice
|
| 66 |
|
| 67 |
|
| 68 |
+
# -- Training Tab --
|
| 69 |
|
| 70 |
+
def train_voice_model(audio_file, model_name, progress=gr.Progress()):
|
| 71 |
+
"""Handler: save voice reference."""
|
| 72 |
if audio_file is None:
|
| 73 |
return "Erreur : Veuillez uploader un fichier audio.", None
|
| 74 |
|
| 75 |
if not model_name or not model_name.strip():
|
| 76 |
+
return "Erreur : Veuillez entrer un nom pour le modele.", None
|
| 77 |
|
| 78 |
model_name = model_name.strip().replace(" ", "_")
|
| 79 |
|
|
|
|
| 81 |
progress(value, desc=desc)
|
| 82 |
|
| 83 |
try:
|
| 84 |
+
progress(0.0, desc="Demarrage...")
|
| 85 |
+
pth_path, ref_path = save_voice_reference(
|
|
|
|
| 86 |
audio_path=audio_file,
|
| 87 |
model_name=model_name,
|
|
|
|
|
|
|
|
|
|
| 88 |
progress_callback=progress_callback,
|
| 89 |
)
|
| 90 |
|
| 91 |
+
return "Reference vocale '{}' sauvegardee avec succes !".format(model_name), ref_path
|
| 92 |
|
| 93 |
except Exception as e:
|
| 94 |
import traceback
|
| 95 |
tb = traceback.format_exc()
|
| 96 |
+
logger.error("Erreur training: {}".format(tb))
|
| 97 |
+
return "Erreur : {}: {}\n\nDetails:\n{}".format(
|
| 98 |
+
type(e).__name__, str(e), tb[-500:]
|
| 99 |
+
), None
|
| 100 |
|
| 101 |
|
| 102 |
+
# -- Conversion Tab --
|
| 103 |
|
| 104 |
def get_model_choices():
|
| 105 |
"""Get list of trained model names for dropdown."""
|
| 106 |
models = list_models()
|
| 107 |
if not models:
|
| 108 |
+
return ["(aucun modele)"]
|
| 109 |
return models
|
| 110 |
|
| 111 |
|
|
|
|
| 113 |
model_choice,
|
| 114 |
song_file,
|
| 115 |
pitch,
|
| 116 |
+
diffusion_steps,
|
| 117 |
vocal_volume,
|
| 118 |
instrumental_volume,
|
| 119 |
progress=gr.Progress(),
|
|
|
|
| 122 |
if song_file is None:
|
| 123 |
return "Erreur : Veuillez uploader un fichier audio.", None, None, None
|
| 124 |
|
| 125 |
+
if model_choice == "(aucun modele)" or not model_choice:
|
| 126 |
+
return "Erreur : Veuillez d'abord enregistrer une reference vocale.", None, None, None
|
| 127 |
|
| 128 |
from pipeline.mixing import mix_audio
|
| 129 |
|
| 130 |
try:
|
| 131 |
+
# Step 1: Download model / find reference audio
|
| 132 |
+
progress(0.05, desc="Chargement du modele...")
|
| 133 |
+
pth_path, ref_or_index = download_model(model_choice)
|
| 134 |
if not pth_path:
|
| 135 |
+
return "Erreur : Modele '{}' introuvable.".format(model_choice), None, None, None
|
| 136 |
+
|
| 137 |
+
# Find the reference audio path
|
| 138 |
+
reference_path = get_reference_path(model_choice)
|
| 139 |
+
if not reference_path:
|
| 140 |
+
return "Erreur : Audio de reference introuvable pour '{}'.".format(model_choice), None, None, None
|
| 141 |
|
| 142 |
# Step 2: Separate vocals from instruments
|
| 143 |
+
progress(0.10, desc="Separation des pistes (Demucs)...")
|
| 144 |
vocals_path, instruments_path = separate_audio(song_file)
|
| 145 |
|
| 146 |
+
progress(0.40, desc="Conversion vocale (Seed-VC)...")
|
| 147 |
|
| 148 |
+
# Step 3: Convert vocals with Seed-VC
|
| 149 |
converted_path = convert_voice(
|
| 150 |
audio_path=vocals_path,
|
| 151 |
+
reference_path=reference_path,
|
|
|
|
| 152 |
pitch=int(pitch),
|
| 153 |
+
diffusion_steps=int(diffusion_steps),
|
| 154 |
+
index_rate=0.7,
|
| 155 |
)
|
| 156 |
|
| 157 |
+
progress(0.85, desc="Mixage final...")
|
| 158 |
|
| 159 |
# Step 4: Mix converted vocals with instruments
|
| 160 |
final_path = mix_audio(
|
|
|
|
| 164 |
instrumental_volume=float(instrumental_volume),
|
| 165 |
)
|
| 166 |
|
| 167 |
+
progress(1.0, desc="Termine !")
|
| 168 |
|
| 169 |
return (
|
| 170 |
+
"Conversion terminee avec succes !",
|
| 171 |
+
vocals_path,
|
| 172 |
+
converted_path,
|
| 173 |
+
final_path,
|
| 174 |
)
|
| 175 |
|
| 176 |
except Exception as e:
|
| 177 |
import traceback
|
| 178 |
tb = traceback.format_exc()
|
| 179 |
+
logger.error("Erreur conversion: {}".format(tb))
|
| 180 |
+
return "Erreur : {}: {}\n\nDetails:\n{}".format(
|
| 181 |
+
type(e).__name__, str(e), tb[-500:]
|
| 182 |
+
), None, None, None
|
| 183 |
|
| 184 |
|
| 185 |
+
# -- Models Tab --
|
| 186 |
|
| 187 |
def refresh_models():
|
| 188 |
"""Refresh the model list as HTML."""
|
| 189 |
models = list_models()
|
| 190 |
if not models:
|
| 191 |
+
return "<p style='color:gray;'>Aucun modele enregistre</p>"
|
| 192 |
+
rows = "".join(
|
| 193 |
+
"<tr><td>{}</td><td>Disponible</td></tr>".format(m) for m in models
|
| 194 |
+
)
|
| 195 |
+
return (
|
| 196 |
+
"<table style='width:100%;border-collapse:collapse;'>"
|
| 197 |
+
"<tr><th style='text-align:left;border-bottom:1px solid #555;padding:8px;'>Nom</th>"
|
| 198 |
+
"<th style='text-align:left;border-bottom:1px solid #555;padding:8px;'>Statut</th></tr>"
|
| 199 |
+
"{}</table>".format(rows)
|
| 200 |
+
)
|
| 201 |
|
| 202 |
|
| 203 |
def delete_selected_model(model_name_to_delete):
|
| 204 |
"""Delete a model."""
|
| 205 |
+
if not model_name_to_delete or model_name_to_delete == "(aucun modele)":
|
| 206 |
+
return "Veuillez selectionner un modele a supprimer.", refresh_models()
|
| 207 |
try:
|
| 208 |
delete_model(model_name_to_delete)
|
| 209 |
+
return "Modele '{}' supprime.".format(model_name_to_delete), refresh_models()
|
| 210 |
except Exception as e:
|
| 211 |
+
return "Erreur : {}".format(e), refresh_models()
|
| 212 |
|
| 213 |
|
| 214 |
+
# -- Build Gradio UI --
|
| 215 |
|
| 216 |
DESCRIPTION = """
|
| 217 |
+
# Clone Vocal
|
| 218 |
|
| 219 |
+
Outil de clonage vocal **zero-shot** base sur **Seed-VC** (Diffusion Transformer).
|
| 220 |
|
| 221 |
**Comment utiliser :**
|
| 222 |
+
1. **Onglet "Ma voix"** : Uploadez un court extrait de votre voix (3-30 sec) pour creer votre profil vocal
|
| 223 |
+
2. **Onglet "Convertir"** : Uploadez un morceau de musique, l'outil remplace la voix par la votre
|
| 224 |
+
3. **Onglet "Modeles"** : Gerez vos profils vocaux
|
| 225 |
|
| 226 |
+
> **Zero-shot** : pas d'entrainement necessaire ! Juste 3-30 secondes de votre voix suffisent.
|
|
|
|
| 227 |
"""
|
| 228 |
|
| 229 |
with gr.Blocks(
|
| 230 |
+
title="Clone Vocal",
|
| 231 |
theme=gr.themes.Soft(),
|
| 232 |
) as app:
|
| 233 |
|
| 234 |
gr.Markdown(DESCRIPTION)
|
| 235 |
|
| 236 |
with gr.Tabs():
|
| 237 |
+
# Tab 1: Voice Reference
|
| 238 |
+
with gr.TabItem("Ma voix"):
|
| 239 |
+
gr.Markdown("### Enregistrer votre reference vocale")
|
| 240 |
|
| 241 |
with gr.Row():
|
| 242 |
with gr.Column(scale=2):
|
| 243 |
train_audio = gr.Audio(
|
| 244 |
+
label="Extrait de votre voix (WAV ou MP3, 3-30 secondes)",
|
| 245 |
type="filepath",
|
| 246 |
sources=["upload"],
|
| 247 |
)
|
| 248 |
train_model_name = gr.Textbox(
|
| 249 |
+
label="Nom du profil",
|
| 250 |
placeholder="ex: ma_voix",
|
| 251 |
max_lines=1,
|
| 252 |
)
|
| 253 |
train_btn = gr.Button(
|
| 254 |
+
"Sauvegarder",
|
| 255 |
variant="primary",
|
| 256 |
size="lg",
|
| 257 |
)
|
|
|
|
| 260 |
train_status = gr.Textbox(
|
| 261 |
label="Statut",
|
| 262 |
interactive=False,
|
| 263 |
+
lines=3,
|
| 264 |
)
|
| 265 |
train_download = gr.File(
|
| 266 |
+
label="Fichier de reference",
|
| 267 |
interactive=False,
|
| 268 |
)
|
| 269 |
|
| 270 |
gr.Markdown(
|
| 271 |
"**Conseils :**\n"
|
| 272 |
"- Utilisez un enregistrement propre (pas de bruit de fond, pas de musique)\n"
|
| 273 |
+
"- Parlez ou chantez naturellement pendant 3 a 30 secondes\n"
|
| 274 |
+
"- Plus l'extrait est long et varie, meilleur sera le resultat\n"
|
| 275 |
+
"- Format WAV ou MP3 accepte"
|
| 276 |
)
|
| 277 |
|
| 278 |
train_btn.click(
|
| 279 |
fn=train_voice_model,
|
| 280 |
+
inputs=[train_audio, train_model_name],
|
| 281 |
outputs=[train_status, train_download],
|
| 282 |
)
|
| 283 |
|
| 284 |
+
# Tab 2: Conversion
|
| 285 |
with gr.TabItem("Convertir un morceau"):
|
| 286 |
+
gr.Markdown("### Remplacer la voix d'un morceau par la votre")
|
| 287 |
|
| 288 |
with gr.Row():
|
| 289 |
with gr.Column(scale=2):
|
| 290 |
convert_model = gr.Dropdown(
|
| 291 |
choices=get_model_choices(),
|
| 292 |
+
label="Profil vocal",
|
| 293 |
interactive=True,
|
| 294 |
)
|
| 295 |
+
refresh_btn = gr.Button("Rafraichir la liste", size="sm")
|
| 296 |
convert_audio = gr.Audio(
|
| 297 |
+
label="Morceau a convertir (WAV ou MP3)",
|
| 298 |
type="filepath",
|
| 299 |
sources=["upload"],
|
| 300 |
)
|
| 301 |
|
| 302 |
+
with gr.Accordion("Parametres avances", open=False):
|
| 303 |
convert_pitch = gr.Slider(
|
| 304 |
+
minimum=-24,
|
| 305 |
+
maximum=24,
|
| 306 |
value=0,
|
| 307 |
step=1,
|
| 308 |
+
label="Transposition (demi-tons)",
|
| 309 |
)
|
| 310 |
+
convert_diffusion = gr.Slider(
|
| 311 |
+
minimum=5,
|
| 312 |
+
maximum=50,
|
| 313 |
+
value=25,
|
| 314 |
+
step=5,
|
| 315 |
+
label="Qualite (plus haut = meilleure qualite, plus lent)",
|
| 316 |
)
|
| 317 |
convert_vocal_vol = gr.Slider(
|
| 318 |
minimum=0.0,
|
|
|
|
| 341 |
interactive=False,
|
| 342 |
lines=3,
|
| 343 |
)
|
| 344 |
+
gr.Markdown("**Apercu des pistes :**")
|
| 345 |
preview_vocals = gr.Audio(
|
| 346 |
+
label="Voix originale (separee)",
|
| 347 |
interactive=False,
|
| 348 |
)
|
| 349 |
preview_converted = gr.Audio(
|
| 350 |
label="Voix convertie",
|
| 351 |
interactive=False,
|
| 352 |
)
|
| 353 |
+
gr.Markdown("**Resultat final :**")
|
| 354 |
final_output = gr.Audio(
|
| 355 |
label="Morceau final (voix + instruments)",
|
| 356 |
interactive=False,
|
|
|
|
| 367 |
convert_model,
|
| 368 |
convert_audio,
|
| 369 |
convert_pitch,
|
| 370 |
+
convert_diffusion,
|
| 371 |
convert_vocal_vol,
|
| 372 |
convert_inst_vol,
|
| 373 |
],
|
| 374 |
outputs=[convert_status, preview_vocals, preview_converted, final_output],
|
| 375 |
)
|
| 376 |
|
| 377 |
+
# Tab 3: Models
|
| 378 |
+
with gr.TabItem("Mes modeles"):
|
| 379 |
+
gr.Markdown("### Gerer vos profils vocaux")
|
| 380 |
|
| 381 |
models_table = gr.HTML(
|
| 382 |
value=refresh_models(),
|
| 383 |
+
label="Modeles enregistres",
|
| 384 |
)
|
| 385 |
|
| 386 |
with gr.Row():
|
| 387 |
+
models_refresh_btn = gr.Button("Rafraichir", size="sm")
|
| 388 |
models_delete_name = gr.Dropdown(
|
| 389 |
choices=get_model_choices(),
|
| 390 |
+
label="Modele a supprimer",
|
| 391 |
interactive=True,
|
| 392 |
)
|
| 393 |
models_delete_btn = gr.Button("Supprimer", variant="stop", size="sm")
|
| 394 |
|
| 395 |
models_delete_status = gr.Textbox(label="Statut", interactive=False)
|
| 396 |
|
| 397 |
models_refresh_btn.click(
|
| 398 |
fn=refresh_models,
|
| 399 |
outputs=[models_table],
|
|
|
|
| 409 |
outputs=[models_delete_status, models_table],
|
| 410 |
)
|
| 411 |
|
| 412 |
|
| 413 |
if __name__ == "__main__":
|
| 414 |
app.launch(server_name="0.0.0.0")
|
pipeline/inference.py
CHANGED
|
@@ -1,16 +1,17 @@
|
|
| 1 |
"""
|
| 2 |
-
Voice conversion module
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
not from fine-tuning the generator.
|
| 6 |
"""
|
| 7 |
|
| 8 |
import os
|
| 9 |
import sys
|
| 10 |
import logging
|
|
|
|
| 11 |
import numpy as np
|
| 12 |
import torch
|
| 13 |
-
import
|
|
|
|
| 14 |
|
| 15 |
logger = logging.getLogger(__name__)
|
| 16 |
|
|
@@ -24,228 +25,165 @@ except ImportError:
|
|
| 24 |
return fn
|
| 25 |
return decorator
|
| 26 |
|
| 27 |
-
from pipeline.setup import
|
| 28 |
|
| 29 |
OUTPUT_DIR = "/tmp/rvc_output"
|
| 30 |
|
| 31 |
-
#
|
| 32 |
-
|
| 33 |
-
_cached_generator = None
|
| 34 |
-
_cached_rmvpe = None
|
| 35 |
|
| 36 |
|
| 37 |
-
def
|
| 38 |
-
"""Load
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
return _cached_hubert.to(device)
|
| 42 |
|
| 43 |
-
|
| 44 |
-
from rvc.lib.utils import load_embedding
|
| 45 |
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
logger.info("Loaded ContentVec HuBERT model.")
|
| 51 |
-
return model
|
| 52 |
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
return _cached_generator.to(device)
|
| 59 |
-
|
| 60 |
-
ensure_applio_path()
|
| 61 |
-
from rvc.lib.algorithm.synthesizers import Synthesizer
|
| 62 |
-
|
| 63 |
-
sr_prefix = str(sample_rate)[:2]
|
| 64 |
-
model_path = os.path.join(
|
| 65 |
-
APPLIO_DIR, "rvc", "models", "pretraineds", "hifi-gan",
|
| 66 |
-
"f0G{}k.pth".format(sr_prefix),
|
| 67 |
-
)
|
| 68 |
-
|
| 69 |
-
if not os.path.exists(model_path):
|
| 70 |
-
raise RuntimeError("Pretrained generator not found: {}".format(model_path))
|
| 71 |
-
|
| 72 |
-
cpt = torch.load(model_path, map_location="cpu", weights_only=False)
|
| 73 |
-
|
| 74 |
-
# Training checkpoint has "model" key, inference format has "weight" key
|
| 75 |
-
weights = cpt.get("weight", cpt.get("model", cpt))
|
| 76 |
-
|
| 77 |
-
# Read config from Applio config files
|
| 78 |
-
import json
|
| 79 |
-
config_path = os.path.join(APPLIO_DIR, "configs", "v2", "{}k.json".format(sr_prefix))
|
| 80 |
-
if os.path.exists(config_path):
|
| 81 |
-
with open(config_path) as f:
|
| 82 |
-
cfg = json.load(f)
|
| 83 |
-
config_args = [
|
| 84 |
-
cfg["data"]["filter_length"] // 2 + 1,
|
| 85 |
-
cfg["train"]["segment_size"] // cfg["data"]["hop_length"],
|
| 86 |
-
cfg["model"]["inter_channels"],
|
| 87 |
-
cfg["model"]["hidden_channels"],
|
| 88 |
-
cfg["model"]["filter_channels"],
|
| 89 |
-
cfg["model"]["n_heads"],
|
| 90 |
-
cfg["model"]["n_layers"],
|
| 91 |
-
cfg["model"]["kernel_size"],
|
| 92 |
-
cfg["model"]["p_dropout"],
|
| 93 |
-
cfg["model"]["resblock"],
|
| 94 |
-
cfg["model"]["resblock_kernel_sizes"],
|
| 95 |
-
cfg["model"]["resblock_dilation_sizes"],
|
| 96 |
-
cfg["model"]["upsample_rates"],
|
| 97 |
-
cfg["model"]["upsample_initial_channel"],
|
| 98 |
-
cfg["model"]["upsample_kernel_sizes"],
|
| 99 |
-
cfg["model"]["spk_embed_dim"],
|
| 100 |
-
cfg["model"]["gin_channels"],
|
| 101 |
-
cfg["data"]["sampling_rate"],
|
| 102 |
-
]
|
| 103 |
-
logger.info("Loaded generator config from Applio.")
|
| 104 |
-
else:
|
| 105 |
-
# Fallback: standard RVC v2 40k config
|
| 106 |
-
config_args = [
|
| 107 |
-
1025, 32, 192, 192, 768, 2, 6, 3, 0, "1",
|
| 108 |
-
[3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
|
| 109 |
-
[10, 10, 2, 2], 512, [16, 16, 4, 4], 109, 256, 40000,
|
| 110 |
-
]
|
| 111 |
-
|
| 112 |
-
net_g = Synthesizer(*config_args, use_f0=True)
|
| 113 |
-
net_g.load_state_dict(weights, strict=False)
|
| 114 |
-
net_g.requires_grad_(False)
|
| 115 |
-
net_g.to(device)
|
| 116 |
-
_cached_generator = net_g
|
| 117 |
-
logger.info("Loaded pretrained RVC generator.")
|
| 118 |
-
return net_g
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
def _extract_f0(audio_np, sr, device):
|
| 122 |
-
"""Extract F0 using RMVPE. Returns f0 numpy array."""
|
| 123 |
-
global _cached_rmvpe
|
| 124 |
-
|
| 125 |
-
ensure_applio_path()
|
| 126 |
-
|
| 127 |
-
rmvpe_path = os.path.join(
|
| 128 |
-
APPLIO_DIR, "rvc", "models", "predictors", "rmvpe.pt"
|
| 129 |
-
)
|
| 130 |
-
|
| 131 |
-
if os.path.exists(rmvpe_path):
|
| 132 |
-
try:
|
| 133 |
-
from rvc.lib.predictors.RMVPE import RMVPE0Predictor
|
| 134 |
-
|
| 135 |
-
if _cached_rmvpe is None:
|
| 136 |
-
_cached_rmvpe = RMVPE0Predictor(rmvpe_path, device=device)
|
| 137 |
-
logger.info("Loaded RMVPE predictor.")
|
| 138 |
-
|
| 139 |
-
f0 = _cached_rmvpe.infer_from_audio(audio_np, sample_rate=sr, thred=0.03)
|
| 140 |
-
return f0
|
| 141 |
-
except Exception as e:
|
| 142 |
-
logger.warning("RMVPE failed ({}), using torchcrepe fallback.".format(e))
|
| 143 |
-
|
| 144 |
-
# Fallback: torchcrepe
|
| 145 |
-
import torchcrepe
|
| 146 |
-
import librosa
|
| 147 |
-
|
| 148 |
-
audio_16k = librosa.resample(audio_np, orig_sr=sr, target_sr=16000) if sr != 16000 else audio_np
|
| 149 |
-
audio_t = torch.from_numpy(audio_16k).float().unsqueeze(0).to(device)
|
| 150 |
-
|
| 151 |
-
f0 = torchcrepe.predict(
|
| 152 |
-
audio_t, 16000, hop_length=160,
|
| 153 |
-
fmin=50, fmax=1100, model="full", device=device,
|
| 154 |
)
|
| 155 |
-
return f0[0].cpu().numpy()
|
| 156 |
|
|
|
|
|
|
|
| 157 |
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
f0_mel = 1127.0 * np.log(1.0 + f0 / 700.0)
|
| 161 |
-
f0_mel_min = 1127.0 * np.log(1.0 + 1.0 / 700.0)
|
| 162 |
-
f0_mel_max = 1127.0 * np.log(1.0 + 1100.0 / 700.0)
|
| 163 |
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
)
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
| 191 |
else:
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
dim = feats.shape[2]
|
| 195 |
-
big_npy = np.zeros((index.ntotal, dim), dtype=np.float32)
|
| 196 |
-
try:
|
| 197 |
-
for i in range(index.ntotal):
|
| 198 |
-
big_npy[i] = index.reconstruct(i)
|
| 199 |
-
except RuntimeError:
|
| 200 |
-
logger.warning("Cannot reconstruct vectors from index, skipping retrieval.")
|
| 201 |
-
return feats
|
| 202 |
-
|
| 203 |
-
npy = feats[0].cpu().numpy().astype(np.float32)
|
| 204 |
-
|
| 205 |
-
# Search k=8 nearest neighbors for each frame
|
| 206 |
-
score, ix = index.search(npy, k=8)
|
| 207 |
-
|
| 208 |
-
# Weight by inverse square distance
|
| 209 |
-
weight = np.square(1.0 / (score + 1e-6))
|
| 210 |
-
weight /= weight.sum(axis=1, keepdims=True)
|
| 211 |
-
|
| 212 |
-
# Weighted combination of nearest neighbor embeddings
|
| 213 |
-
retrieved = np.sum(big_npy[ix] * np.expand_dims(weight, axis=2), axis=1)
|
| 214 |
-
|
| 215 |
-
# Blend retrieved (target voice) with source features
|
| 216 |
-
retrieved_t = torch.from_numpy(retrieved).unsqueeze(0).to(device).float()
|
| 217 |
-
blended = index_rate * retrieved_t + (1.0 - index_rate) * feats
|
| 218 |
-
|
| 219 |
-
logger.info(
|
| 220 |
-
"FAISS retrieval done: {} vectors, index_rate={}".format(
|
| 221 |
-
index.ntotal, index_rate
|
| 222 |
-
)
|
| 223 |
-
)
|
| 224 |
-
return blended
|
| 225 |
|
|
|
|
|
|
|
|
|
|
| 226 |
|
| 227 |
-
|
| 228 |
def convert_voice(
|
| 229 |
audio_path,
|
| 230 |
-
|
| 231 |
index_path=None,
|
| 232 |
pitch=0,
|
| 233 |
f0_method="rmvpe",
|
| 234 |
-
index_rate=0.
|
| 235 |
protect=0.33,
|
| 236 |
volume_envelope=1.0,
|
| 237 |
output_format="WAV",
|
|
|
|
| 238 |
):
|
| 239 |
"""
|
| 240 |
-
Convert voice using
|
| 241 |
-
|
| 242 |
-
|
| 243 |
-
|
| 244 |
-
|
|
|
|
|
|
|
| 245 |
|
| 246 |
Returns path to converted audio file.
|
| 247 |
"""
|
| 248 |
-
import librosa
|
| 249 |
import soundfile as sf
|
| 250 |
|
| 251 |
os.makedirs(OUTPUT_DIR, exist_ok=True)
|
|
@@ -253,92 +191,171 @@ def convert_voice(
|
|
| 253 |
output_path = os.path.join(OUTPUT_DIR, "{}_converted.wav".format(base_name))
|
| 254 |
|
| 255 |
device = "cuda" if torch.cuda.is_available() else "cpu"
|
| 256 |
-
logger.info("Converting voice on {}
|
| 257 |
-
logger.info("
|
| 258 |
-
|
| 259 |
-
|
| 260 |
-
|
| 261 |
-
|
| 262 |
-
|
| 263 |
-
|
| 264 |
-
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
|
| 268 |
-
|
| 269 |
-
|
| 270 |
-
|
| 271 |
-
|
| 272 |
-
|
| 273 |
-
|
| 274 |
|
| 275 |
-
#
|
| 276 |
-
|
| 277 |
-
|
| 278 |
-
).permute(0, 2, 1) # (1, T_100hz, 768)
|
| 279 |
|
| 280 |
-
|
| 281 |
-
|
| 282 |
|
| 283 |
-
# -
|
| 284 |
-
|
| 285 |
-
|
| 286 |
-
feats = _faiss_retrieval(feats, index_path, big_npy_path, index_rate, device)
|
| 287 |
|
| 288 |
-
|
| 289 |
-
|
| 290 |
-
feats = protect * feats0 + (1.0 - protect) * feats
|
| 291 |
|
| 292 |
-
|
| 293 |
-
f0 = _extract_f0(audio_16k, 16000, device)
|
| 294 |
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
|
| 298 |
-
|
| 299 |
-
|
| 300 |
-
|
| 301 |
-
# ---- Step 4: Match lengths ----
|
| 302 |
-
# Target: 100Hz frame rate = 16000 / 160 = 100 frames/sec
|
| 303 |
-
p_len = len(audio_16k) // 160
|
| 304 |
-
p_len = min(p_len, feats.shape[1])
|
| 305 |
-
|
| 306 |
-
# Interpolate F0 to match p_len if needed
|
| 307 |
-
if len(f0) != p_len:
|
| 308 |
-
f0 = np.interp(
|
| 309 |
-
np.linspace(0, len(f0) - 1, p_len),
|
| 310 |
-
np.arange(len(f0)),
|
| 311 |
-
f0,
|
| 312 |
)
|
| 313 |
|
| 314 |
-
|
| 315 |
-
|
| 316 |
-
|
| 317 |
-
# Quantize F0 and convert to tensors
|
| 318 |
-
f0_coarse = _quantize_f0(f0)
|
| 319 |
-
pitch_t = torch.tensor(f0_coarse, device=device).unsqueeze(0).long()
|
| 320 |
-
pitchf_t = torch.tensor(f0, device=device).unsqueeze(0).float()
|
| 321 |
-
p_len_t = torch.tensor([p_len], device=device).long()
|
| 322 |
-
sid = torch.tensor([0], device=device).long()
|
| 323 |
|
| 324 |
-
#
|
| 325 |
-
|
| 326 |
-
|
| 327 |
-
|
| 328 |
-
|
| 329 |
-
|
| 330 |
|
| 331 |
-
# ---- Step 6: Post-processing ----
|
| 332 |
# Normalize
|
| 333 |
audio_max = np.abs(audio_out).max()
|
| 334 |
if audio_max > 0.01:
|
| 335 |
audio_out = audio_out / audio_max * 0.95
|
| 336 |
|
| 337 |
-
# Resample
|
| 338 |
-
|
| 339 |
-
|
| 340 |
-
# Save as WAV 16-bit
|
| 341 |
-
sf.write(output_path, audio_44k, 44100, subtype="PCM_16")
|
| 342 |
|
| 343 |
-
|
|
|
|
|
|
|
|
|
|
| 344 |
return output_path
|
|
|
|
| 1 |
"""
|
| 2 |
+
Voice conversion module using Seed-VC (zero-shot diffusion transformer).
|
| 3 |
+
No training needed - just reference audio + source audio.
|
| 4 |
+
Uses the singing voice conversion model with F0 conditioning.
|
|
|
|
| 5 |
"""
|
| 6 |
|
| 7 |
import os
|
| 8 |
import sys
|
| 9 |
import logging
|
| 10 |
+
import argparse
|
| 11 |
import numpy as np
|
| 12 |
import torch
|
| 13 |
+
import torchaudio
|
| 14 |
+
import librosa
|
| 15 |
|
| 16 |
logger = logging.getLogger(__name__)
|
| 17 |
|
|
|
|
| 25 |
return fn
|
| 26 |
return decorator
|
| 27 |
|
| 28 |
+
from pipeline.setup import SEED_VC_DIR, ensure_seed_vc_path
|
| 29 |
|
| 30 |
OUTPUT_DIR = "/tmp/rvc_output"
|
| 31 |
|
| 32 |
+
# Cached models (loaded once, reused across calls)
|
| 33 |
+
_model_cache = {}
|
|
|
|
|
|
|
| 34 |
|
| 35 |
|
| 36 |
+
def _load_seed_vc_models(device):
|
| 37 |
+
"""Load Seed-VC singing voice conversion models."""
|
| 38 |
+
if "model" in _model_cache:
|
| 39 |
+
return _model_cache
|
|
|
|
| 40 |
|
| 41 |
+
ensure_seed_vc_path()
|
|
|
|
| 42 |
|
| 43 |
+
# Import Seed-VC's model loading utilities
|
| 44 |
+
from modules.commons import recursive_munch, build_model, load_checkpoint
|
| 45 |
+
from hf_utils import load_custom_model_from_hf
|
| 46 |
+
import yaml
|
|
|
|
|
|
|
| 47 |
|
| 48 |
+
# Load the singing model (F0-conditioned, whisper-base, 44kHz, BigVGAN)
|
| 49 |
+
dit_checkpoint_path, dit_config_path = load_custom_model_from_hf(
|
| 50 |
+
"Plachta/Seed-VC",
|
| 51 |
+
"DiT_seed_v2_uvit_whisper_base_f0_44k_bigvgan_pruned_ft_ema_v2.pth",
|
| 52 |
+
"config_dit_mel_seed_uvit_whisper_base_f0_44k.yml",
|
| 53 |
)
|
|
|
|
| 54 |
|
| 55 |
+
with open(dit_config_path, "r") as f:
|
| 56 |
+
config = yaml.safe_load(f)
|
| 57 |
|
| 58 |
+
model_params = recursive_munch(config["model_params"])
|
| 59 |
+
model = build_model(model_params, stage="DiT")
|
|
|
|
|
|
|
|
|
|
| 60 |
|
| 61 |
+
# Load checkpoint
|
| 62 |
+
model, _, _, _ = load_checkpoint(
|
| 63 |
+
model, None, dit_checkpoint_path,
|
| 64 |
+
load_only_params=True, ignore_modules=[], is_distributed=False,
|
| 65 |
)
|
| 66 |
+
for key in model:
|
| 67 |
+
model[key].eval()
|
| 68 |
+
model[key].to(device)
|
| 69 |
+
|
| 70 |
+
# FP16 for efficiency
|
| 71 |
+
for key in model:
|
| 72 |
+
if hasattr(model[key], "half"):
|
| 73 |
+
model[key] = model[key].half()
|
| 74 |
+
|
| 75 |
+
# Load speech tokenizer (Whisper)
|
| 76 |
+
from modules.speech_tokenizers.whisper.whisper_enc import WhisperSpeechEncoder
|
| 77 |
+
speech_tokenizer_type = config.get("model_params", {}).get(
|
| 78 |
+
"speech_tokenizer", {}
|
| 79 |
+
).get("type", "whisper")
|
| 80 |
+
|
| 81 |
+
whisper_name = model_params.speech_tokenizer.get("name", "whisper-small")
|
| 82 |
+
whisper_model = WhisperSpeechEncoder.load_model(whisper_name).to(device).eval()
|
| 83 |
+
if hasattr(whisper_model, "half"):
|
| 84 |
+
whisper_model = whisper_model.half()
|
| 85 |
+
|
| 86 |
+
def semantic_fn(waves_16k):
|
| 87 |
+
wav = waves_16k.to(device).half()
|
| 88 |
+
if wav.dim() == 1:
|
| 89 |
+
wav = wav.unsqueeze(0)
|
| 90 |
+
with torch.no_grad():
|
| 91 |
+
return whisper_model.extract_features(wav)
|
| 92 |
+
|
| 93 |
+
# Load vocoder (BigVGAN)
|
| 94 |
+
vocoder_type = config.get("model_params", {}).get("vocoder", {}).get("type", "bigvgan")
|
| 95 |
+
if vocoder_type == "bigvgan":
|
| 96 |
+
from modules.bigvgan import bigvgan
|
| 97 |
+
vocoder_path = os.path.join(SEED_VC_DIR, "modules", "bigvgan")
|
| 98 |
+
vocoder = bigvgan.BigVGAN.from_pretrained(
|
| 99 |
+
"nvidia/bigvgan_v2_44khz_128band_512x", use_cuda_kernel=False
|
| 100 |
+
)
|
| 101 |
+
vocoder = vocoder.to(device).eval()
|
| 102 |
+
if hasattr(vocoder, "half"):
|
| 103 |
+
vocoder = vocoder.half()
|
| 104 |
|
| 105 |
+
def vocoder_fn(mel):
|
| 106 |
+
with torch.no_grad():
|
| 107 |
+
return vocoder(mel.half())
|
| 108 |
else:
|
| 109 |
+
from modules.vocoder import load_vocoder
|
| 110 |
+
vocoder = load_vocoder(vocoder_type, config).to(device).eval()
|
| 111 |
|
| 112 |
+
def vocoder_fn(mel):
|
| 113 |
+
with torch.no_grad():
|
| 114 |
+
return vocoder(mel)
|
| 115 |
|
| 116 |
+
# Load CAMPPlus speaker embedding model
|
| 117 |
+
from modules.campplus.DTDNN import CAMPPlus
|
| 118 |
+
campplus_ckpt_path = load_custom_model_from_hf(
|
| 119 |
+
"funasr/campplus", "campplus_cn_common.bin", config_filename=None
|
| 120 |
+
)
|
| 121 |
+
if isinstance(campplus_ckpt_path, tuple):
|
| 122 |
+
campplus_ckpt_path = campplus_ckpt_path[0]
|
| 123 |
+
campplus_model = CAMPPlus(feat_dim=80, embedding_size=192)
|
| 124 |
+
campplus_model.load_state_dict(torch.load(campplus_ckpt_path, map_location="cpu"))
|
| 125 |
+
campplus_model = campplus_model.to(device).eval().half()
|
| 126 |
+
|
| 127 |
+
# Load F0 extractor (RMVPE)
|
| 128 |
+
from modules.rmvpe import RMVPE
|
| 129 |
+
|
| 130 |
+
rmvpe_path = load_custom_model_from_hf("lj1995/VoiceConversionWebUI", "rmvpe.pt", config_filename=None)
|
| 131 |
+
if isinstance(rmvpe_path, tuple):
|
| 132 |
+
rmvpe_path = rmvpe_path[0]
|
| 133 |
+
f0_extractor = RMVPE(rmvpe_path, is_half=True, device=device)
|
| 134 |
+
|
| 135 |
+
def f0_fn(wav, thred=0.03):
|
| 136 |
+
return f0_extractor.infer_from_audio(wav, thred=thred)
|
| 137 |
+
|
| 138 |
+
# Mel spectrogram config
|
| 139 |
+
from modules.commons import build_mel_fn
|
| 140 |
+
mel_fn_args = config["preprocess_params"]["spect_params"]
|
| 141 |
+
to_mel = build_mel_fn(mel_fn_args)
|
| 142 |
+
sr = config["preprocess_params"]["sr"]
|
| 143 |
+
hop_length = mel_fn_args["hop_length"]
|
| 144 |
+
|
| 145 |
+
_model_cache.update({
|
| 146 |
+
"model": model,
|
| 147 |
+
"semantic_fn": semantic_fn,
|
| 148 |
+
"vocoder_fn": vocoder_fn,
|
| 149 |
+
"campplus_model": campplus_model,
|
| 150 |
+
"f0_fn": f0_fn,
|
| 151 |
+
"to_mel": to_mel,
|
| 152 |
+
"sr": sr,
|
| 153 |
+
"hop_length": hop_length,
|
| 154 |
+
"device": device,
|
| 155 |
+
"max_context_window": model_params.DiT.max_context_window,
|
| 156 |
+
"overlap_frame_len": 16,
|
| 157 |
+
})
|
| 158 |
+
|
| 159 |
+
logger.info(f"Seed-VC models loaded (sr={sr}, hop={hop_length})")
|
| 160 |
+
return _model_cache
|
| 161 |
+
|
| 162 |
+
|
| 163 |
+
@spaces.GPU(duration=120)
|
| 164 |
def convert_voice(
|
| 165 |
audio_path,
|
| 166 |
+
reference_path,
|
| 167 |
index_path=None,
|
| 168 |
pitch=0,
|
| 169 |
f0_method="rmvpe",
|
| 170 |
+
index_rate=0.7,
|
| 171 |
protect=0.33,
|
| 172 |
volume_envelope=1.0,
|
| 173 |
output_format="WAV",
|
| 174 |
+
diffusion_steps=25,
|
| 175 |
):
|
| 176 |
"""
|
| 177 |
+
Convert voice using Seed-VC zero-shot singing voice conversion.
|
| 178 |
+
|
| 179 |
+
Args:
|
| 180 |
+
audio_path: Path to source vocals (separated by Demucs)
|
| 181 |
+
reference_path: Path to reference voice audio (3-30 sec)
|
| 182 |
+
pitch: Semitone shift (-24 to +24)
|
| 183 |
+
diffusion_steps: Quality vs speed trade-off (10=fast, 30=quality)
|
| 184 |
|
| 185 |
Returns path to converted audio file.
|
| 186 |
"""
|
|
|
|
| 187 |
import soundfile as sf
|
| 188 |
|
| 189 |
os.makedirs(OUTPUT_DIR, exist_ok=True)
|
|
|
|
| 191 |
output_path = os.path.join(OUTPUT_DIR, "{}_converted.wav".format(base_name))
|
| 192 |
|
| 193 |
device = "cuda" if torch.cuda.is_available() else "cpu"
|
| 194 |
+
logger.info("Converting voice with Seed-VC on {}".format(device))
|
| 195 |
+
logger.info("Source: {}, Reference: {}, Pitch: {}".format(audio_path, reference_path, pitch))
|
| 196 |
+
|
| 197 |
+
# Load models
|
| 198 |
+
cache = _load_seed_vc_models(device)
|
| 199 |
+
model = cache["model"]
|
| 200 |
+
semantic_fn = cache["semantic_fn"]
|
| 201 |
+
vocoder_fn = cache["vocoder_fn"]
|
| 202 |
+
campplus_model = cache["campplus_model"]
|
| 203 |
+
f0_fn = cache["f0_fn"]
|
| 204 |
+
to_mel = cache["to_mel"]
|
| 205 |
+
sr = cache["sr"]
|
| 206 |
+
hop_length = cache["hop_length"]
|
| 207 |
+
max_context_window = cache["max_context_window"]
|
| 208 |
+
overlap_frame_len = cache["overlap_frame_len"]
|
| 209 |
+
|
| 210 |
+
# Load source audio
|
| 211 |
+
source_audio = librosa.load(audio_path, sr=sr)[0]
|
| 212 |
+
source_audio = torch.tensor(source_audio).unsqueeze(0).float().to(device)
|
| 213 |
+
|
| 214 |
+
# Load reference audio
|
| 215 |
+
ref_audio = librosa.load(reference_path, sr=sr)[0]
|
| 216 |
+
# Limit reference to 30 seconds
|
| 217 |
+
max_ref_samples = 30 * sr
|
| 218 |
+
if len(ref_audio) > max_ref_samples:
|
| 219 |
+
ref_audio = ref_audio[:max_ref_samples]
|
| 220 |
+
ref_audio = torch.tensor(ref_audio).unsqueeze(0).float().to(device)
|
| 221 |
+
|
| 222 |
+
# Resample to 16kHz for speech tokenizer
|
| 223 |
+
source_16k = torchaudio.functional.resample(source_audio, sr, 16000)
|
| 224 |
+
ref_16k = torchaudio.functional.resample(ref_audio, sr, 16000)
|
| 225 |
+
|
| 226 |
+
# Extract semantic tokens
|
| 227 |
+
S_alt = semantic_fn(source_16k[0])
|
| 228 |
+
S_ori = semantic_fn(ref_16k[0])
|
| 229 |
+
|
| 230 |
+
# Extract mel spectrograms
|
| 231 |
+
mel_source = to_mel(source_audio.to(device))
|
| 232 |
+
mel_ref = to_mel(ref_audio.to(device))
|
| 233 |
+
target_lengths = torch.LongTensor([mel_ref.size(2)]).to(device)
|
| 234 |
+
|
| 235 |
+
# Extract speaker embedding from reference
|
| 236 |
+
feat_ref = torchaudio.compliance.kaldi.fbank(
|
| 237 |
+
ref_16k[0].unsqueeze(0) if ref_16k.dim() == 2 else ref_16k,
|
| 238 |
+
num_mel_bins=80, sample_frequency=16000,
|
| 239 |
+
dither=0, window_type="hamming",
|
| 240 |
+
)
|
| 241 |
+
feat_ref = feat_ref - feat_ref.mean(dim=0, keepdim=True)
|
| 242 |
+
style_ref = campplus_model(feat_ref.unsqueeze(0).half().to(device))
|
| 243 |
|
| 244 |
+
# Extract F0 for singing
|
| 245 |
+
F0_ori = f0_fn(ref_16k[0].cpu().numpy(), thred=0.03)
|
| 246 |
+
F0_alt = f0_fn(source_16k[0].cpu().numpy(), thred=0.03)
|
|
|
|
| 247 |
|
| 248 |
+
F0_ori = torch.tensor(F0_ori).to(device).float()
|
| 249 |
+
F0_alt = torch.tensor(F0_alt).to(device).float()
|
| 250 |
|
| 251 |
+
# Auto-adjust F0 to match reference pitch range
|
| 252 |
+
voiced_ori = F0_ori > 1
|
| 253 |
+
voiced_alt = F0_alt > 1
|
|
|
|
| 254 |
|
| 255 |
+
log_f0_alt = torch.zeros_like(F0_alt)
|
| 256 |
+
log_f0_alt[voiced_alt] = torch.log(F0_alt[voiced_alt])
|
|
|
|
| 257 |
|
| 258 |
+
shifted_log_f0_alt = log_f0_alt.clone()
|
|
|
|
| 259 |
|
| 260 |
+
if voiced_ori.any() and voiced_alt.any():
|
| 261 |
+
median_log_f0_ori = torch.log(F0_ori[voiced_ori]).median()
|
| 262 |
+
median_log_f0_alt = log_f0_alt[voiced_alt].median()
|
| 263 |
+
shifted_log_f0_alt[voiced_alt] = (
|
| 264 |
+
log_f0_alt[voiced_alt] - median_log_f0_alt + median_log_f0_ori
|
| 265 |
)
|
| 266 |
|
| 267 |
+
shifted_f0_alt = torch.zeros_like(F0_alt)
|
| 268 |
+
shifted_f0_alt[voiced_alt] = torch.exp(shifted_log_f0_alt[voiced_alt])
|
| 269 |
|
| 270 |
+
# Apply semitone pitch shift
|
| 271 |
+
if pitch != 0:
|
| 272 |
+
factor = 2.0 ** (pitch / 12.0)
|
| 273 |
+
shifted_f0_alt[voiced_alt] = shifted_f0_alt[voiced_alt] * factor
|
| 274 |
+
|
| 275 |
+
# Process in chunks with crossfading
|
| 276 |
+
cond = model["DiT"].prepare_concat(S_alt, mel_source)
|
| 277 |
+
|
| 278 |
+
# Prepare F0 conditioning
|
| 279 |
+
max_source_window = max_context_window - mel_ref.size(2)
|
| 280 |
+
overlap_wave_len = overlap_frame_len * hop_length
|
| 281 |
+
|
| 282 |
+
# Interpolate F0 to match mel frames
|
| 283 |
+
n_mel_frames = cond.size(1)
|
| 284 |
+
if len(shifted_f0_alt) != n_mel_frames:
|
| 285 |
+
shifted_f0_alt_interp = torch.nn.functional.interpolate(
|
| 286 |
+
shifted_f0_alt.unsqueeze(0).unsqueeze(0),
|
| 287 |
+
size=n_mel_frames, mode="nearest",
|
| 288 |
+
).squeeze()
|
| 289 |
+
else:
|
| 290 |
+
shifted_f0_alt_interp = shifted_f0_alt
|
| 291 |
+
|
| 292 |
+
# Generate in chunks
|
| 293 |
+
generated_wave_chunks = []
|
| 294 |
+
processed_frames = 0
|
| 295 |
+
|
| 296 |
+
while processed_frames < cond.size(1):
|
| 297 |
+
chunk_end = min(processed_frames + max_source_window, cond.size(1))
|
| 298 |
+
chunk_cond = cond[:, processed_frames:chunk_end]
|
| 299 |
+
chunk_f0 = shifted_f0_alt_interp[processed_frames:chunk_end]
|
| 300 |
+
|
| 301 |
+
# Concatenate reference mel with source chunk
|
| 302 |
+
cat_condition = torch.cat([mel_ref, chunk_cond], dim=2)
|
| 303 |
+
cat_f0 = torch.cat([
|
| 304 |
+
torch.zeros(mel_ref.size(2)).to(device),
|
| 305 |
+
chunk_f0,
|
| 306 |
+
])
|
| 307 |
+
|
| 308 |
+
with torch.no_grad():
|
| 309 |
+
vc_target = model["cfm"].inference(
|
| 310 |
+
cat_condition.half(),
|
| 311 |
+
torch.LongTensor([cat_condition.size(2)]).to(device),
|
| 312 |
+
mel_ref.half(),
|
| 313 |
+
style_ref,
|
| 314 |
+
cat_f0.unsqueeze(0).half(),
|
| 315 |
+
diffusion_steps,
|
| 316 |
+
inference_cfg_rate=index_rate,
|
| 317 |
+
)
|
| 318 |
+
vc_target = vc_target[:, :, mel_ref.size(2):]
|
| 319 |
+
|
| 320 |
+
# Vocoder
|
| 321 |
+
vc_wave = vocoder_fn(vc_target.float())
|
| 322 |
+
|
| 323 |
+
if generated_wave_chunks:
|
| 324 |
+
# Crossfade with previous chunk
|
| 325 |
+
prev = generated_wave_chunks[-1]
|
| 326 |
+
if overlap_wave_len > 0 and len(prev) >= overlap_wave_len:
|
| 327 |
+
cross_len = min(overlap_wave_len, vc_wave.size(-1))
|
| 328 |
+
fade_out = np.linspace(1, 0, cross_len)
|
| 329 |
+
fade_in = np.linspace(0, 1, cross_len)
|
| 330 |
+
prev_np = prev  # chunks are stored as NumPy arrays, so the blend below edits the previous chunk in place
|
| 331 |
+
new_np = vc_wave[0].cpu().float().numpy()
|
| 332 |
+
prev_np[-cross_len:] = (
|
| 333 |
+
prev_np[-cross_len:] * fade_out + new_np[:cross_len] * fade_in
|
| 334 |
+
)
|
| 335 |
+
generated_wave_chunks.append(new_np[cross_len:])
|
| 336 |
+
else:
|
| 337 |
+
generated_wave_chunks.append(vc_wave[0].cpu().float().numpy())
|
| 338 |
+
else:
|
| 339 |
+
generated_wave_chunks.append(vc_wave[0].cpu().float().numpy())
|
| 340 |
+
|
| 341 |
+
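processed_frames = chunk_end if chunk_end >= cond.size(1) else chunk_end - overlap_frame_len  # stop after the last chunk; stepping back here would repeat it forever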
|
| 342 |
+
if processed_frames < 0:
|
| 343 |
+
processed_frames = chunk_end
|
| 344 |
+
|
| 345 |
+
# Concatenate all chunks
|
| 346 |
+
audio_out = np.concatenate(generated_wave_chunks)
|
| 347 |
|
|
|
|
| 348 |
# Normalize
|
| 349 |
audio_max = np.abs(audio_out).max()
|
| 350 |
if audio_max > 0.01:
|
| 351 |
audio_out = audio_out / audio_max * 0.95
|
| 352 |
|
| 353 |
+
# Resample to 44.1kHz if needed and save
|
| 354 |
+
if sr != 44100:
|
| 355 |
+
audio_out = librosa.resample(audio_out, orig_sr=sr, target_sr=44100)
|
|
|
|
|
|
|
| 356 |
|
| 357 |
+
sf.write(output_path, audio_out, 44100, subtype="PCM_16")
|
| 358 |
+
logger.info("Conversion complete: {} ({:.1f}s)".format(
|
| 359 |
+
output_path, len(audio_out) / 44100
|
| 360 |
+
))
|
| 361 |
return output_path
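The F0 handling in `convert_voice` above can be illustrated in isolation. The following is a minimal sketch in plain Python rather than the torch code from the commit: the function name `align_f0` and its parameters are hypothetical, and it assumes F0 contours are lists of frequencies in Hz where values at or below 1 mark unvoiced frames. It shifts the source contour so that its median log-F0 matches the reference, then applies the semitone transposition.

```python
import math
from statistics import median


def align_f0(source_f0, reference_f0, semitones=0, voiced_threshold=1.0):
    """Match the source's median pitch to the reference, then transpose.

    Hypothetical helper for illustration only. Unvoiced frames
    (F0 <= voiced_threshold) are returned as 0.0.
    """
    src_voiced = [f for f in source_f0 if f > voiced_threshold]
    ref_voiced = [f for f in reference_f0 if f > voiced_threshold]

    # Offset between the two median pitches in the log domain.
    # If either signal has no voiced frames, leave the pitch untouched.
    if src_voiced and ref_voiced:
        offset = (median(math.log(f) for f in ref_voiced)
                  - median(math.log(f) for f in src_voiced))
    else:
        offset = 0.0

    # One semitone corresponds to a frequency ratio of 2 ** (1 / 12).
    factor = 2.0 ** (semitones / 12.0)

    return [
        math.exp(math.log(f) + offset) * factor if f > voiced_threshold else 0.0
        for f in source_f0
    ]


if __name__ == "__main__":
    source = [0.0, 100.0, 200.0, 400.0]
    reference = [220.0, 440.0, 880.0]
    print(align_f0(source, reference))
    print(align_f0(source, reference, semitones=12))
```

Working in the log domain keeps musical intervals intact: every voiced frame is multiplied by the same ratio, so the melody is preserved while the register moves to that of the reference voice. Note that `statistics.median` averages the two middle values for even-length inputs, whereas `torch.median` returns the lower one, so results may differ slightly from the torch version.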
|
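The chunk stitching at the end of the conversion loop in `pipeline/inference.py` relies on a linear crossfade over the overlapping samples. The sketch below shows that idea on plain Python lists, under simplifying assumptions: `crossfade_concat` is a hypothetical helper, not a function from the commit or from Seed-VC, and it blends into the accumulated output rather than into the previous chunk only.

```python
def crossfade_concat(chunks, overlap):
    """Join audio chunks, blending `overlap` samples at each boundary.

    Hypothetical helper for illustration only. The tail of the audio
    collected so far fades out while the head of the next chunk fades in.
    """
    if not chunks:
        return []

    out = list(chunks[0])
    for chunk in chunks[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap, len(out), len(chunk))
        start = len(out) - n
        for i in range(n):
            # Fade-in weight rises linearly from 0 to 1 across the overlap.
            w = i / (n - 1) if n > 1 else 0.0
            out[start + i] = out[start + i] * (1.0 - w) + chunk[i] * w
        # The blended head of the chunk is already in `out`; append the rest.
        out.extend(chunk[n:])
    return out


if __name__ == "__main__":
    loud = [1.0] * 4
    silent = [0.0] * 4
    print(crossfade_concat([loud, silent], overlap=3))
```

Because the two weights always sum to 1, a signal that is identical on both sides of the boundary passes through unchanged, which is what avoids audible clicks at chunk edges. Each boundary shortens the output by `overlap` samples, so the generator has to produce overlapping chunks for the total duration to stay correct.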
pipeline/setup.py
CHANGED
|
@@ -1,5 +1,6 @@
|
|
| 1 |
"""
|
| 2 |
-
Setup module:
|
|
|
|
| 3 |
"""
|
| 4 |
|
| 5 |
import os
|
|
@@ -9,134 +10,50 @@ import logging
|
|
| 9 |
|
| 10 |
logger = logging.getLogger(__name__)
|
| 11 |
|
| 12 |
-
|
| 13 |
-
|
| 14 |
|
| 15 |
-
# Pretrained model URLs from HuggingFace
|
| 16 |
-
HF_BASE_URL = "https://huggingface.co/IAHispano/Applio/resolve/main/Resources"
|
| 17 |
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
# RMVPE pitch extractor
|
| 23 |
-
"rvc/models/predictors/rmvpe.pt": "predictors/rmvpe.pt",
|
| 24 |
-
# ContentVec embedder
|
| 25 |
-
"rvc/models/embedders/contentvec/pytorch_model.bin": "embedders/contentvec/pytorch_model.bin",
|
| 26 |
-
"rvc/models/embedders/contentvec/config.json": "embedders/contentvec/config.json",
|
| 27 |
-
}
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
def clone_applio():
|
| 31 |
-
"""Clone Applio repository if not already present."""
|
| 32 |
-
if os.path.exists(os.path.join(APPLIO_DIR, "core.py")):
|
| 33 |
-
logger.info("Applio already cloned.")
|
| 34 |
return True
|
| 35 |
|
| 36 |
-
logger.info("Cloning
|
| 37 |
try:
|
| 38 |
subprocess.run(
|
| 39 |
-
["git", "clone", "--depth", "1",
|
| 40 |
-
check=True,
|
| 41 |
-
capture_output=True,
|
| 42 |
-
text=True,
|
| 43 |
)
|
| 44 |
-
logger.info("
|
| 45 |
return True
|
| 46 |
except subprocess.CalledProcessError as e:
|
| 47 |
-
logger.error(f"Failed to clone
|
| 48 |
return False
|
| 49 |
|
| 50 |
|
| 51 |
-
def
|
| 52 |
-
"""
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
return True
|
| 56 |
|
| 57 |
-
os.makedirs(os.path.dirname(full_path), exist_ok=True)
|
| 58 |
-
url = f"{HF_BASE_URL}/{remote_path}"
|
| 59 |
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
|
|
|
|
|
|
| 63 |
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
except Exception as e:
|
| 72 |
-
logger.
|
| 73 |
-
return False
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
def create_mute_files():
|
| 77 |
-
"""Create mute audio files needed for training filelist generation."""
|
| 78 |
-
import numpy as np
|
| 79 |
-
from scipy.io import wavfile
|
| 80 |
-
|
| 81 |
-
sample_rate = 40000
|
| 82 |
-
mute_dir = os.path.join(APPLIO_DIR, "logs", "mute")
|
| 83 |
-
|
| 84 |
-
for subdir in ["sliced_audios", "sliced_audios_16k", "f0", "f0_voiced", "extracted"]:
|
| 85 |
-
os.makedirs(os.path.join(mute_dir, subdir), exist_ok=True)
|
| 86 |
|
| 87 |
-
|
| 88 |
-
duration_samples = int(sample_rate * 0.4)
|
| 89 |
-
mute_audio = np.zeros(duration_samples, dtype=np.float32)
|
| 90 |
-
|
| 91 |
-
wavfile.write(
|
| 92 |
-
os.path.join(mute_dir, "sliced_audios", f"mute{sample_rate}.wav"),
|
| 93 |
-
sample_rate,
|
| 94 |
-
mute_audio,
|
| 95 |
-
)
|
| 96 |
-
wavfile.write(
|
| 97 |
-
os.path.join(mute_dir, "sliced_audios_16k", f"mute{16000}.wav"),
|
| 98 |
-
16000,
|
| 99 |
-
np.zeros(int(16000 * 0.4), dtype=np.float32),
|
| 100 |
-
)
|
| 101 |
-
|
| 102 |
-
# Create mute feature files
|
| 103 |
-
mute_f0 = np.zeros(int(16000 * 0.4 / 160), dtype=np.float32)
|
| 104 |
-
np.save(os.path.join(mute_dir, "f0", "mute.wav.npy"), mute_f0)
|
| 105 |
-
np.save(os.path.join(mute_dir, "f0_voiced", "mute.wav.npy"), mute_f0)
|
| 106 |
-
|
| 107 |
-
# Create mute embedding (768-dim contentvec)
|
| 108 |
-
mute_embed = np.zeros((int(16000 * 0.4 / 320), 768), dtype=np.float32)
|
| 109 |
-
np.save(os.path.join(mute_dir, "extracted", "mute.npy"), mute_embed)
|
| 110 |
-
|
| 111 |
-
logger.info("Mute files created.")
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
def setup_applio():
|
| 115 |
-
"""Full setup: clone + download models + create mute files."""
|
| 116 |
-
if not clone_applio():
|
| 117 |
-
raise RuntimeError("Failed to clone Applio")
|
| 118 |
-
|
| 119 |
-
# Add Applio to Python path
|
| 120 |
-
if APPLIO_DIR not in sys.path:
|
| 121 |
-
sys.path.insert(0, APPLIO_DIR)
|
| 122 |
-
|
| 123 |
-
# Download required models
|
| 124 |
-
all_ok = True
|
| 125 |
-
for local_path, remote_path in REQUIRED_MODELS.items():
|
| 126 |
-
if not download_pretrained(local_path, remote_path):
|
| 127 |
-
all_ok = False
|
| 128 |
-
|
| 129 |
-
if not all_ok:
|
| 130 |
-
logger.warning("Some models failed to download. Training may not work.")
|
| 131 |
-
|
| 132 |
-
# Create mute files for training
|
| 133 |
-
create_mute_files()
|
| 134 |
-
|
| 135 |
-
logger.info("Applio setup complete.")
|
| 136 |
return True
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
def ensure_applio_path():
|
| 140 |
-
"""Ensure Applio is on the Python path."""
|
| 141 |
-
if APPLIO_DIR not in sys.path:
|
| 142 |
-
sys.path.insert(0, APPLIO_DIR)
|
|
|
|
| 1 |
"""
|
| 2 |
+
Setup module: clone Seed-VC repo at startup.
|
| 3 |
+
Seed-VC downloads its own pretrained models from HuggingFace on first use.
|
| 4 |
"""
|
| 5 |
|
| 6 |
import os
|
|
|
|
| 10 |
|
| 11 |
logger = logging.getLogger(__name__)
|
| 12 |
|
| 13 |
+
SEED_VC_DIR = "/tmp/seed-vc"
|
| 14 |
+
SEED_VC_REPO = "https://github.com/Plachtaa/seed-vc.git"
|
| 15 |
|
| 16 |
|
| 17 |
+
def clone_seed_vc():
|
| 18 |
+
"""Clone Seed-VC repository if not already present."""
|
| 19 |
+
if os.path.exists(os.path.join(SEED_VC_DIR, "inference.py")):
|
| 20 |
+
logger.info("Seed-VC already cloned.")
|
| 21 |
return True
|
| 22 |
|
| 23 |
+
logger.info("Cloning Seed-VC repository...")
|
| 24 |
try:
|
| 25 |
subprocess.run(
|
| 26 |
+
["git", "clone", "--depth", "1", SEED_VC_REPO, SEED_VC_DIR],
|
| 27 |
+
check=True, capture_output=True, text=True,
|
| 28 |
)
|
| 29 |
+
logger.info("Seed-VC cloned successfully.")
|
| 30 |
return True
|
| 31 |
except subprocess.CalledProcessError as e:
|
| 32 |
+
logger.error(f"Failed to clone Seed-VC: {e.stderr}")
|
| 33 |
return False
|
| 34 |
|
| 35 |
|
| 36 |
+
def ensure_seed_vc_path():
|
| 37 |
+
"""Ensure Seed-VC is on the Python path."""
|
| 38 |
+
if SEED_VC_DIR not in sys.path:
|
| 39 |
+
sys.path.insert(0, SEED_VC_DIR)
|
|
|
|
| 40 |
|
| 41 |
|
| 42 |
+
def setup_seed_vc():
|
| 43 |
+
"""Full setup: clone repo + add to path."""
|
| 44 |
+
if not clone_seed_vc():
|
| 45 |
+
raise RuntimeError("Failed to clone Seed-VC")
|
| 46 |
+
ensure_seed_vc_path()
|
| 47 |
|
| 48 |
+
# Install Seed-VC dependencies that might not be in our requirements.txt
|
| 49 |
+
try:
|
| 50 |
+
subprocess.run(
|
| 51 |
+
[sys.executable, "-m", "pip", "install", "-q",
|
| 52 |
+
"descript-audio-codec", "vocos", "bigvgan"],
|
| 53 |
+
capture_output=True, text=True, timeout=120,
|
| 54 |
+
)
|
| 55 |
except Exception as e:
|
| 56 |
+
logger.warning(f"Some Seed-VC deps may be missing: {e}")
|
| 57 |
|
| 58 |
+
logger.info("Seed-VC setup complete.")
|
| 59 |
return True
|
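A note on the dependency install in `setup_seed_vc()`: `subprocess.run` raises `CalledProcessError` on a non-zero exit only when `check=True` is passed, so a failed `pip install` can otherwise go unnoticed. Below is a minimal sketch of a wrapper that reports the exit status. The helper name `run_checked` is hypothetical and not part of this commit.

```python
import subprocess
import sys


def run_checked(cmd, timeout=120):
    """Run a command and return (ok, stderr_text).

    Hypothetical helper: ok is True only when the process exits with code 0.
    Timeouts and launch errors are reported as failures instead of raising.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError) as e:
        return False, str(e)
    return proc.returncode == 0, proc.stderr


if __name__ == "__main__":
    # Example: a child process that fails on purpose.
    ok, err = run_checked([sys.executable, "-c", "import sys; sys.exit(1)"])
    print("ok:", ok)
```

With such a helper, `setup_seed_vc()` could log a warning whenever `ok` is false. Whether a failed install should abort startup is a design choice the commit leaves open.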
pipeline/storage.py
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
"""
|
| 2 |
-
Model storage module: persist
|
| 3 |
"""
|
| 4 |
|
| 5 |
import os
|
|
@@ -8,64 +8,63 @@ from datetime import datetime
|
|
| 8 |
|
| 9 |
logger = logging.getLogger(__name__)
|
| 10 |
|
| 11 |
-
# Will be set from environment or app config
|
| 12 |
MODELS_REPO_ID = None
|
| 13 |
LOCAL_MODELS_DIR = "/tmp/rvc_models"
|
| 14 |
|
| 15 |
|
| 16 |
-
def init_storage(repo_id
|
| 17 |
"""Initialize storage with the HF dataset repo ID."""
|
| 18 |
global MODELS_REPO_ID
|
| 19 |
MODELS_REPO_ID = repo_id
|
| 20 |
os.makedirs(LOCAL_MODELS_DIR, exist_ok=True)
|
| 21 |
-
logger.info(
|
| 22 |
|
| 23 |
|
| 24 |
-
def upload_model(model_name
|
| 25 |
-
"""Upload
|
| 26 |
if not MODELS_REPO_ID:
|
| 27 |
logger.warning("No HF repo configured. Model saved locally only.")
|
| 28 |
return False
|
| 29 |
|
| 30 |
try:
|
| 31 |
from huggingface_hub import HfApi
|
| 32 |
-
|
| 33 |
api = HfApi()
|
| 34 |
|
| 35 |
-
# Upload .pth
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
|
|
|
| 43 |
|
| 44 |
-
# Upload
|
| 45 |
-
if
|
|
|
|
| 46 |
api.upload_file(
|
| 47 |
-
path_or_fileobj=
|
| 48 |
-
path_in_repo=
|
| 49 |
repo_id=MODELS_REPO_ID,
|
| 50 |
repo_type="dataset",
|
| 51 |
)
|
| 52 |
-
logger.info(
|
| 53 |
|
| 54 |
-
# Upload
|
| 55 |
-
if
|
| 56 |
api.upload_file(
|
| 57 |
-
path_or_fileobj=
|
| 58 |
-
path_in_repo=
|
| 59 |
repo_id=MODELS_REPO_ID,
|
| 60 |
repo_type="dataset",
|
| 61 |
)
|
| 62 |
-
logger.info(f"Uploaded {model_name}_big_npy.npy to HF")
|
| 63 |
|
| 64 |
# Upload metadata
|
| 65 |
metadata = {
|
| 66 |
"name": model_name,
|
| 67 |
"created": datetime.now().isoformat(),
|
| 68 |
-
"
|
| 69 |
}
|
| 70 |
import json
|
| 71 |
import tempfile
|
|
@@ -77,7 +76,7 @@ def upload_model(model_name: str, pth_path: str, index_path: str = None, big_npy
|
|
| 77 |
try:
|
| 78 |
api.upload_file(
|
| 79 |
path_or_fileobj=meta_path,
|
| 80 |
-
path_in_repo=
|
| 81 |
repo_id=MODELS_REPO_ID,
|
| 82 |
repo_type="dataset",
|
| 83 |
)
|
|
@@ -86,14 +85,13 @@ def upload_model(model_name: str, pth_path: str, index_path: str = None, big_npy
|
|
| 86 |
|
| 87 |
return True
|
| 88 |
except Exception as e:
|
| 89 |
-
logger.error(
|
| 90 |
return False
|
| 91 |
|
| 92 |
|
| 93 |
-
def download_model(model_name
|
| 94 |
-
"""Download model from HF dataset repo. Returns (pth_path,
|
| 95 |
if not MODELS_REPO_ID:
|
| 96 |
-
# Try local
|
| 97 |
return _get_local_model(model_name)
|
| 98 |
|
| 99 |
try:
|
|
@@ -105,58 +103,69 @@ def download_model(model_name: str):
|
|
| 105 |
pth_path = hf_hub_download(
|
| 106 |
repo_id=MODELS_REPO_ID,
|
| 107 |
repo_type="dataset",
|
| 108 |
-
filename=
|
| 109 |
local_dir=local_dir,
|
| 110 |
)
|
| 111 |
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
index_path = hf_hub_download(
|
| 115 |
-
repo_id=MODELS_REPO_ID,
|
| 116 |
-
repo_type="dataset",
|
| 117 |
-
filename=f"models/{model_name}/{model_name}.index",
|
| 118 |
-
local_dir=local_dir,
|
| 119 |
-
)
|
| 120 |
-
except Exception:
|
| 121 |
-
pass # Index file is optional
|
| 122 |
-
|
| 123 |
-
# Download big_npy embeddings (for FAISS retrieval)
|
| 124 |
try:
|
| 125 |
-
hf_hub_download(
|
| 126 |
repo_id=MODELS_REPO_ID,
|
| 127 |
repo_type="dataset",
|
| 128 |
-
filename=
|
| 129 |
local_dir=local_dir,
|
| 130 |
)
|
| 131 |
except Exception:
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
except Exception as e:
|
| 136 |
-
logger.error(
|
| 137 |
return _get_local_model(model_name)
|
| 138 |
|
| 139 |
|
| 140 |
-
def _get_local_model(model_name
|
| 141 |
"""Get model from local storage."""
|
| 142 |
local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
|
| 143 |
-
pth_path = os.path.join(local_dir,
|
| 144 |
-
|
| 145 |
|
| 146 |
if os.path.exists(pth_path):
|
| 147 |
-
return pth_path,
|
| 148 |
return None, None
|
| 149 |
|
| 150 |
| 151 |
def list_models():
|
| 152 |
-
"""List all available models
|
| 153 |
models = set()
|
| 154 |
|
| 155 |
-
# Check HF repo
|
| 156 |
if MODELS_REPO_ID:
|
| 157 |
try:
|
| 158 |
from huggingface_hub import HfApi
|
| 159 |
-
|
| 160 |
api = HfApi()
|
| 161 |
files = api.list_repo_files(MODELS_REPO_ID, repo_type="dataset")
|
| 162 |
for f in files:
|
|
@@ -165,43 +174,37 @@ def list_models():
|
|
| 165 |
if len(parts) >= 3:
|
| 166 |
models.add(parts[1])
|
| 167 |
except Exception as e:
|
| 168 |
-
logger.error(
|
| 169 |
|
| 170 |
-
# Check local models
|
| 171 |
if os.path.exists(LOCAL_MODELS_DIR):
|
| 172 |
for name in os.listdir(LOCAL_MODELS_DIR):
|
| 173 |
model_dir = os.path.join(LOCAL_MODELS_DIR, name)
|
| 174 |
if os.path.isdir(model_dir):
|
| 175 |
-
pth = os.path.join(model_dir,
|
| 176 |
if os.path.exists(pth):
|
| 177 |
models.add(name)
|
| 178 |
|
| 179 |
return sorted(models)
|
| 180 |
|
| 181 |
|
| 182 |
-
def delete_model(model_name
|
| 183 |
"""Delete a model from HF repo and local storage."""
|
| 184 |
-
# Delete from HF
|
| 185 |
if MODELS_REPO_ID:
|
| 186 |
try:
|
| 187 |
from huggingface_hub import HfApi
|
| 188 |
-
|
| 189 |
api = HfApi()
|
| 190 |
-
# Delete the entire model folder
|
| 191 |
files = api.list_repo_files(MODELS_REPO_ID, repo_type="dataset")
|
| 192 |
for f in files:
|
| 193 |
-
if f.startswith(
|
| 194 |
api.delete_file(f, MODELS_REPO_ID, repo_type="dataset")
|
| 195 |
-
logger.info(
|
| 196 |
except Exception as e:
|
| 197 |
-
logger.error(
|
| 198 |
|
| 199 |
-
# Delete local
|
| 200 |
import shutil
|
| 201 |
-
|
| 202 |
local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
|
| 203 |
if os.path.exists(local_dir):
|
| 204 |
shutil.rmtree(local_dir)
|
| 205 |
-
logger.info(
|
| 206 |
|
| 207 |
return True
|
|
|
|
| 1 |
"""
|
| 2 |
+
Model storage module: persist voice reference files to HuggingFace Dataset repo.
|
| 3 |
"""
|
| 4 |
|
| 5 |
import os
|
|
|
|
| 8 |
|
| 9 |
logger = logging.getLogger(__name__)
|
| 10 |
|
|
|
|
| 11 |
MODELS_REPO_ID = None
|
| 12 |
LOCAL_MODELS_DIR = "/tmp/rvc_models"
|
| 13 |
|
| 14 |
|
| 15 |
+
def init_storage(repo_id):
|
| 16 |
"""Initialize storage with the HF dataset repo ID."""
|
| 17 |
global MODELS_REPO_ID
|
| 18 |
MODELS_REPO_ID = repo_id
|
| 19 |
os.makedirs(LOCAL_MODELS_DIR, exist_ok=True)
|
| 20 |
+
logger.info("Storage initialized with repo: {}".format(repo_id))
|
| 21 |
|
| 22 |
|
| 23 |
+
def upload_model(model_name, pth_path, index_path=None, big_npy_path=None, reference_path=None):
|
| 24 |
+
"""Upload model files to HF dataset repo."""
|
| 25 |
if not MODELS_REPO_ID:
|
| 26 |
logger.warning("No HF repo configured. Model saved locally only.")
|
| 27 |
return False
|
| 28 |
|
| 29 |
try:
|
| 30 |
from huggingface_hub import HfApi
|
|
|
|
| 31 |
api = HfApi()
|
| 32 |
|
| 33 |
+
# Upload .pth marker
|
| 34 |
+
if pth_path and os.path.exists(pth_path):
|
| 35 |
+
api.upload_file(
|
| 36 |
+
path_or_fileobj=pth_path,
|
| 37 |
+
path_in_repo="models/{}/{}.pth".format(model_name, model_name),
|
| 38 |
+
repo_id=MODELS_REPO_ID,
|
| 39 |
+
repo_type="dataset",
|
| 40 |
+
)
|
| 41 |
+
logger.info("Uploaded {}.pth to HF".format(model_name))
|
| 42 |
|
| 43 |
+
# Upload reference audio
|
| 44 |
+
if reference_path and os.path.exists(reference_path):
|
| 45 |
+
ref_filename = os.path.basename(reference_path)
|
| 46 |
api.upload_file(
|
| 47 |
+
path_or_fileobj=reference_path,
|
| 48 |
+
path_in_repo="models/{}/{}".format(model_name, ref_filename),
|
| 49 |
repo_id=MODELS_REPO_ID,
|
| 50 |
repo_type="dataset",
|
| 51 |
)
|
| 52 |
+
logger.info("Uploaded {} to HF".format(ref_filename))
|
| 53 |
|
| 54 |
+
# Upload .index file if exists (backward compat)
|
| 55 |
+
if index_path and os.path.exists(index_path):
|
| 56 |
api.upload_file(
|
| 57 |
+
path_or_fileobj=index_path,
|
| 58 |
+
path_in_repo="models/{}/{}.index".format(model_name, model_name),
|
| 59 |
repo_id=MODELS_REPO_ID,
|
| 60 |
repo_type="dataset",
|
| 61 |
)
|
|
|
|
| 62 |
|
| 63 |
# Upload metadata
|
| 64 |
metadata = {
|
| 65 |
"name": model_name,
|
| 66 |
"created": datetime.now().isoformat(),
|
| 67 |
+
"engine": "seed-vc",
|
| 68 |
}
|
| 69 |
import json
|
| 70 |
import tempfile
|
|
|
|
| 76 |
try:
|
| 77 |
api.upload_file(
|
| 78 |
path_or_fileobj=meta_path,
|
| 79 |
+
path_in_repo="models/{}/metadata.json".format(model_name),
|
| 80 |
repo_id=MODELS_REPO_ID,
|
| 81 |
repo_type="dataset",
|
| 82 |
)
|
|
|
|
| 85 |
|
| 86 |
return True
|
| 87 |
except Exception as e:
|
| 88 |
+
logger.error("Failed to upload model: {}".format(e))
|
| 89 |
return False
|
| 90 |
|
| 91 |
|
| 92 |
+
def download_model(model_name):
|
| 93 |
+
"""Download model from HF dataset repo. Returns (pth_path, reference_path)."""
|
| 94 |
if not MODELS_REPO_ID:
|
|
|
|
| 95 |
return _get_local_model(model_name)
|
| 96 |
|
| 97 |
try:
|
|
|
|
| 103 |
pth_path = hf_hub_download(
|
| 104 |
repo_id=MODELS_REPO_ID,
|
| 105 |
repo_type="dataset",
|
| 106 |
+
filename="models/{}/{}.pth".format(model_name, model_name),
|
| 107 |
local_dir=local_dir,
|
| 108 |
)
|
| 109 |
|
| 110 |
+
# Try to download reference audio
|
| 111 |
+
ref_path = None
|
| 112 |
try:
|
| 113 |
+
ref_path = hf_hub_download(
|
| 114 |
repo_id=MODELS_REPO_ID,
|
| 115 |
repo_type="dataset",
|
| 116 |
+
filename="models/{}/{}_ref.wav".format(model_name, model_name),
|
| 117 |
local_dir=local_dir,
|
| 118 |
)
|
| 119 |
except Exception:
|
| 120 |
+
# Try .index for backward compat with old RVC models
|
| 121 |
+
try:
|
| 122 |
+
ref_path = hf_hub_download(
|
| 123 |
+
repo_id=MODELS_REPO_ID,
|
| 124 |
+
repo_type="dataset",
|
| 125 |
+
filename="models/{}/{}.index".format(model_name, model_name),
|
| 126 |
+
local_dir=local_dir,
|
| 127 |
+
)
|
| 128 |
+
except Exception:
|
| 129 |
+
pass
|
| 130 |
+
|
| 131 |
+
return pth_path, ref_path
|
| 132 |
except Exception as e:
|
| 133 |
+
logger.error("Failed to download model from HF: {}".format(e))
|
| 134 |
return _get_local_model(model_name)
|
| 135 |
|
| 136 |
|
| 137 |
+
def _get_local_model(model_name):
|
| 138 |
"""Get model from local storage."""
|
| 139 |
local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
|
| 140 |
+
pth_path = os.path.join(local_dir, "{}.pth".format(model_name))
|
| 141 |
+
ref_path = os.path.join(local_dir, "{}_ref.wav".format(model_name))
|
| 142 |
|
| 143 |
if os.path.exists(pth_path):
|
| 144 |
+
return pth_path, ref_path if os.path.exists(ref_path) else None
|
| 145 |
return None, None
|
| 146 |
|
| 147 |
|
| 148 |
+
def get_reference_path(model_name):
|
| 149 |
+
"""Get the reference audio path for a model."""
|
| 150 |
+
local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
|
| 151 |
+
ref_path = os.path.join(local_dir, "{}_ref.wav".format(model_name))
|
| 152 |
+
if os.path.exists(ref_path):
|
| 153 |
+
return ref_path
|
| 154 |
+
# Search in subdirectories (HF download structure)
|
| 155 |
+
for root, dirs, files in os.walk(local_dir):
|
| 156 |
+
for f in files:
|
| 157 |
+
if f.endswith("_ref.wav"):
|
| 158 |
+
return os.path.join(root, f)
|
| 159 |
+
return None
|
| 160 |
+
|
| 161 |
+
|
| 162 |
def list_models():
|
| 163 |
+
"""List all available models."""
|
| 164 |
models = set()
|
| 165 |
|
|
|
|
| 166 |
if MODELS_REPO_ID:
|
| 167 |
try:
|
| 168 |
from huggingface_hub import HfApi
|
|
|
|
| 169 |
api = HfApi()
|
| 170 |
files = api.list_repo_files(MODELS_REPO_ID, repo_type="dataset")
|
| 171 |
for f in files:
|
|
|
|
| 174 |
if len(parts) >= 3:
|
| 175 |
models.add(parts[1])
|
| 176 |
except Exception as e:
|
| 177 |
+
logger.error("Failed to list models from HF: {}".format(e))
|
| 178 |
|
|
|
|
| 179 |
if os.path.exists(LOCAL_MODELS_DIR):
|
| 180 |
for name in os.listdir(LOCAL_MODELS_DIR):
|
| 181 |
model_dir = os.path.join(LOCAL_MODELS_DIR, name)
|
| 182 |
if os.path.isdir(model_dir):
|
| 183 |
+
pth = os.path.join(model_dir, "{}.pth".format(name))
|
| 184 |
if os.path.exists(pth):
|
| 185 |
models.add(name)
|
| 186 |
|
| 187 |
return sorted(models)
|
| 188 |
|
| 189 |
|
| 190 |
+
def delete_model(model_name):
|
| 191 |
"""Delete a model from HF repo and local storage."""
|
|
|
|
| 192 |
if MODELS_REPO_ID:
|
| 193 |
try:
|
| 194 |
from huggingface_hub import HfApi
|
|
|
|
| 195 |
api = HfApi()
|
|
|
|
| 196 |
files = api.list_repo_files(MODELS_REPO_ID, repo_type="dataset")
|
| 197 |
for f in files:
|
| 198 |
+
if f.startswith("models/{}/".format(model_name)):
|
| 199 |
api.delete_file(f, MODELS_REPO_ID, repo_type="dataset")
|
| 200 |
+
logger.info("Deleted {} from HF repo".format(model_name))
|
| 201 |
except Exception as e:
|
| 202 |
+
logger.error("Failed to delete from HF: {}".format(e))
|
| 203 |
|
|
|
|
| 204 |
import shutil
|
|
|
|
| 205 |
local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
|
| 206 |
if os.path.exists(local_dir):
|
| 207 |
shutil.rmtree(local_dir)
|
| 208 |
+
logger.info("Deleted {} from local storage".format(model_name))
|
| 209 |
|
| 210 |
return True
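The storage module keeps every voice under `models/<model_name>/` in the dataset repo. Note that `upload_model()` stores the reference audio under its original basename, while `download_model()` asks for `<model_name>_ref.wav`, so the two agree only if the caller names the file that way. Below is a minimal sketch of shared path helpers. The names `repo_path`, `reference_repo_path`, and `model_names_from_files` are hypothetical (none appear in the commit), and the sketch assumes that the hidden lines of `list_models()` split each repo path on `/`.

```python
def repo_path(model_name, filename):
    """Build the in-repo path for a file that belongs to a model."""
    return "models/{}/{}".format(model_name, filename)


def reference_repo_path(model_name):
    """Canonical in-repo path of the reference audio (assumed naming)."""
    return repo_path(model_name, "{}_ref.wav".format(model_name))


def model_names_from_files(files):
    """Extract sorted, unique model names from a list of repo file paths.

    Only paths shaped like models/<name>/<file> are counted.
    """
    names = set()
    for f in files:
        parts = f.split("/")
        if len(parts) >= 3 and parts[0] == "models":
            names.add(parts[1])
    return sorted(names)


if __name__ == "__main__":
    print(reference_repo_path("alice"))
    print(model_names_from_files(["models/alice/alice.pth", "README.md"]))
```

Using one helper on both the upload and download side would keep the two paths in sync. That is a suggestion, not what the commit does.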
|
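`download_model()` looks for the reference audio first and falls back to the legacy `.index` file through nested `try`/`except` blocks. The same ordering can be sketched as a flat loop. Both `first_available` and the `fetch` callable are hypothetical stand-ins. In the real module, `fetch` would wrap `hf_hub_download`.

```python
def first_available(fetch, candidates):
    """Return the result of fetch() for the first candidate that succeeds.

    fetch is any callable that raises on failure. Candidates are tried in
    order, and None is returned if every attempt fails.
    """
    for name in candidates:
        try:
            return fetch(name)
        except Exception:
            continue
    return None


if __name__ == "__main__":
    available = {"models/alice/alice.index": "/tmp/alice.index"}

    def fake_fetch(name):
        # Stand-in for a download call: raises KeyError when the file is absent.
        return available[name]

    path = first_available(
        fake_fetch,
        ["models/alice/alice_ref.wav", "models/alice/alice.index"],
    )
    print(path)
```

A flat list of candidates makes it straightforward to add or reorder fallbacks. As in the original code, all download errors are swallowed.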
pipeline/training.py
CHANGED
|
@@ -1,18 +1,12 @@
|
|
| 1 |
"""
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
ensuring ZeroGPU's GPU allocation is visible to the training code.
|
| 6 |
"""
|
| 7 |
|
| 8 |
import os
|
| 9 |
-
import sys
|
| 10 |
-
import runpy
|
| 11 |
-
import subprocess
|
| 12 |
import logging
|
| 13 |
import shutil
|
| 14 |
-
import time
|
| 15 |
-
import glob
|
| 16 |
|
| 17 |
logger = logging.getLogger(__name__)
|
| 18 |
|
|
@@ -27,518 +21,105 @@ except ImportError:
|
|
| 27 |
return decorator
|
| 28 |
|
| 29 |
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
# Prevent "context has already been set" errors from Applio/torch
|
| 35 |
-
# by neutralizing mp.set_start_method calls
|
| 36 |
-
import multiprocessing as _mp
|
| 37 |
-
_orig_set_start_method = _mp.set_start_method
|
| 38 |
-
|
| 39 |
-
def _safe_set_start_method(method=None, force=False):
|
| 40 |
-
try:
|
| 41 |
-
_orig_set_start_method(method, force=True)
|
| 42 |
-
except RuntimeError:
|
| 43 |
-
pass
|
| 44 |
-
|
| 45 |
-
_mp.set_start_method = _safe_set_start_method
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
def _setup_applio_env():
|
| 49 |
-
"""Ensure Applio is on sys.path."""
|
| 50 |
-
if APPLIO_DIR not in sys.path:
|
| 51 |
-
sys.path.insert(0, APPLIO_DIR)
|
| 52 |
-
train_dir = os.path.join(APPLIO_DIR, "rvc", "train")
|
| 53 |
-
if train_dir not in sys.path:
|
| 54 |
-
sys.path.insert(0, train_dir)
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
def preprocess(model_name: str, audio_path: str, sample_rate: int = 40000):
|
| 58 |
-
"""
|
| 59 |
-
Preprocess audio: load, normalize, slice into segments, save at target SR and 16kHz.
|
| 60 |
-
Custom implementation (no Applio subprocess dependency).
|
| 61 |
-
"""
|
| 62 |
-
import numpy as np
|
| 63 |
-
import librosa
|
| 64 |
-
import soundfile as sf
|
| 65 |
-
|
| 66 |
-
exp_dir = os.path.join(LOGS_DIR, model_name)
|
| 67 |
-
sliced_dir = os.path.join(exp_dir, "sliced_audios")
|
| 68 |
-
sliced_16k_dir = os.path.join(exp_dir, "sliced_audios_16k")
|
| 69 |
-
os.makedirs(sliced_dir, exist_ok=True)
|
| 70 |
-
os.makedirs(sliced_16k_dir, exist_ok=True)
|
| 71 |
-
|
| 72 |
-
logger.info(f"Preprocessing {audio_path} for model {model_name}...")
|
| 73 |
-
|
| 74 |
-
# Load audio at target sample rate
|
| 75 |
-
audio, sr = librosa.load(audio_path, sr=sample_rate, mono=True)
|
| 76 |
-
logger.info(f"Loaded audio: {len(audio)} samples, {len(audio)/sr:.1f}s at {sr}Hz")
|
| 77 |
-
|
| 78 |
-
if len(audio) < sr * 1:
|
| 79 |
-
raise RuntimeError("Audio too short (< 1 second).")
|
| 80 |
-
|
| 81 |
-
# Normalize
|
| 82 |
-
peak = np.abs(audio).max()
|
| 83 |
-
if peak > 0:
|
| 84 |
-
audio = audio / peak * 0.95
|
| 85 |
-
|
| 86 |
-
# Also load at 16kHz
|
| 87 |
-
audio_16k, _ = librosa.load(audio_path, sr=16000, mono=True)
|
| 88 |
-
peak_16k = np.abs(audio_16k).max()
|
| 89 |
-
if peak_16k > 0:
|
| 90 |
-
audio_16k = audio_16k / peak_16k * 0.95
|
| 91 |
-
|
| 92 |
-
# Slice into segments of ~3.5 seconds with 0.5s overlap
|
| 93 |
-
segment_len = int(3.5 * sr)
|
| 94 |
-
hop = int(3.0 * sr) # 3.5 - 0.5 overlap
|
| 95 |
-
segment_len_16k = int(3.5 * 16000)
|
| 96 |
-
hop_16k = int(3.0 * 16000)
|
| 97 |
-
|
| 98 |
-
MAX_SLICES = 40 # Balance quality vs GPU time (60s ZeroGPU limit)
|
| 99 |
-
|
| 100 |
-
n_slices = 0
|
| 101 |
-
idx = 0
|
| 102 |
-
|
| 103 |
-
while idx < len(audio) and n_slices < MAX_SLICES:
|
| 104 |
-
# Slice at target sample rate
|
| 105 |
-
end = min(idx + segment_len, len(audio))
|
| 106 |
-
segment = audio[idx:end]
|
| 107 |
-
|
| 108 |
-
# Skip very short segments (< 0.5s)
|
| 109 |
-
if len(segment) < int(0.5 * sr):
|
| 110 |
-
idx += hop
|
| 111 |
-
continue
|
| 112 |
-
|
| 113 |
-
# Skip silent segments
|
| 114 |
-
if np.abs(segment).max() < 0.01:
|
| 115 |
-
idx += hop
|
| 116 |
-
continue
|
| 117 |
-
|
| 118 |
-
# Compute corresponding 16k positions
|
| 119 |
-
ratio = 16000 / sr
|
| 120 |
-
idx_16k = int(idx * ratio)
|
| 121 |
-
end_16k = int(end * ratio)
|
| 122 |
-
segment_16k = audio_16k[idx_16k:min(end_16k, len(audio_16k))]
|
| 123 |
-
|
| 124 |
-
# Save slices
|
| 125 |
-
fname = f"{model_name}_{n_slices:04d}.wav"
|
| 126 |
-
sf.write(os.path.join(sliced_dir, fname), segment, sr)
|
| 127 |
-
sf.write(os.path.join(sliced_16k_dir, fname), segment_16k, 16000)
|
| 128 |
-
|
| 129 |
-
n_slices += 1
|
| 130 |
-
idx += hop
|
| 131 |
-
|
| 132 |
-
logger.info(f"Preprocessing complete: {n_slices} slices created.")
|
| 133 |
-
|
| 134 |
-
if n_slices == 0:
|
| 135 |
-
raise RuntimeError("Preprocessing produced no audio slices. The audio may be silent.")
|
| 136 |
-
|
| 137 |
-
return n_slices
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
@spaces.GPU(duration=60)
|
| 141 |
-
def extract_features(model_name: str, sample_rate: int = 40000, f0_method: str = "rmvpe"):
|
| 142 |
-
"""
|
| 143 |
-
Extract F0 pitch and HuBERT embeddings.
|
| 144 |
-
Runs IN-PROCESS to access ZeroGPU's GPU allocation.
|
| 145 |
-
"""
|
| 146 |
import torch
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
_setup_applio_env()
|
| 150 |
-
old_cwd = os.getcwd()
|
| 151 |
-
os.chdir(APPLIO_DIR)
|
| 152 |
-
|
| 153 |
-
try:
|
| 154 |
-
exp_dir = os.path.join(LOGS_DIR, model_name)
|
| 155 |
-
wav_path = os.path.join(exp_dir, "sliced_audios_16k")
|
| 156 |
-
|
| 157 |
-
os.makedirs(os.path.join(exp_dir, "f0"), exist_ok=True)
|
| 158 |
-
os.makedirs(os.path.join(exp_dir, "f0_voiced"), exist_ok=True)
|
| 159 |
-
os.makedirs(os.path.join(exp_dir, "extracted"), exist_ok=True)
|
| 160 |
-
|
| 161 |
-
files = []
|
| 162 |
-
for wav_file in sorted(glob.glob(os.path.join(wav_path, "*.wav"))):
|
| 163 |
-
file_name = os.path.basename(wav_file)
|
| 164 |
-
files.append([
|
| 165 |
-
wav_file,
|
| 166 |
-
os.path.join(exp_dir, "f0", file_name + ".npy"),
|
| 167 |
-
os.path.join(exp_dir, "f0_voiced", file_name + ".npy"),
|
| 168 |
-
os.path.join(exp_dir, "extracted", file_name.replace("wav", "npy")),
|
| 169 |
-
])
|
| 170 |
|
| 171 |
-
if not files:
|
| 172 |
-
raise RuntimeError("No preprocessed audio files found for feature extraction.")
|
| 173 |
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
from rvc.train.extract.extract import FeatureInput
|
| 179 |
-
fe = FeatureInput(f0_method=f0_method, device=device)
|
| 180 |
-
for file_info in files:
|
| 181 |
-
fe.process_file(file_info)
|
| 182 |
-
|
| 183 |
-
# HuBERT embedding extraction
|
| 184 |
-
logger.info(f"Extracting embeddings on {device}...")
|
| 185 |
-
from rvc.lib.utils import load_audio_16k, load_embedding
|
| 186 |
-
emb_model = load_embedding("contentvec", None).to(device).float()
|
| 187 |
-
|
| 188 |
-
for file_info in files:
|
| 189 |
-
wav_file_path, _, _, out_file_path = file_info
|
| 190 |
-
if os.path.exists(out_file_path):
|
| 191 |
-
continue
|
| 192 |
-
feats = torch.from_numpy(load_audio_16k(wav_file_path)).to(device).float()
|
| 193 |
-
feats = feats.view(1, -1)
|
| 194 |
-
with torch.no_grad():
|
| 195 |
-
emb_result = emb_model(feats)["last_hidden_state"]
|
| 196 |
-
feats_out = emb_result.squeeze(0).float().cpu().numpy()
|
| 197 |
-
if not np.isnan(feats_out).any():
|
| 198 |
-
np.save(out_file_path, feats_out, allow_pickle=False)
|
| 199 |
-
|
| 200 |
-
# Save embedder model info
|
| 201 |
-
import json
|
| 202 |
-
model_info_path = os.path.join(exp_dir, "model_info.json")
|
| 203 |
-
model_info = {}
|
| 204 |
-
if os.path.exists(model_info_path):
|
| 205 |
-
with open(model_info_path, "r") as f:
|
| 206 |
-
model_info = json.load(f)
|
| 207 |
-
model_info["embedder_model"] = "contentvec"
|
| 208 |
-
with open(model_info_path, "w") as f:
|
| 209 |
-
json.dump(model_info, f, indent=4)
|
| 210 |
-
|
| 211 |
-
# Generate config and filelist
|
| 212 |
-
from rvc.train.extract.preparing_files import generate_config, generate_filelist
|
| 213 |
-
generate_config(sample_rate, exp_dir)
|
| 214 |
-
generate_filelist(exp_dir, sample_rate, include_mutes=2)
|
| 215 |
-
|
| 216 |
-
# Verify output
|
| 217 |
-
if len(os.listdir(os.path.join(exp_dir, "extracted"))) == 0:
|
| 218 |
-
raise RuntimeError("Feature extraction produced no embeddings.")
|
| 219 |
-
if len(os.listdir(os.path.join(exp_dir, "f0"))) == 0:
|
| 220 |
-
raise RuntimeError("F0 extraction produced no pitch files.")
|
| 221 |
-
|
| 222 |
-
logger.info("Feature extraction complete.")
|
| 223 |
-
return True
|
| 224 |
-
finally:
|
| 225 |
-
os.chdir(old_cwd)
|
| 226 |
-
|
| 227 |
-
|
| 228 |
-
@spaces.GPU(duration=60)
|
| 229 |
-
def train_model(
|
| 230 |
-
model_name: str,
|
| 231 |
-
sample_rate: int = 40000,
|
| 232 |
-
total_epochs: int = 20,
|
| 233 |
-
batch_size: int = 8,
|
| 234 |
):
|
| 235 |
"""
|
| 236 |
-
|
| 237 |
-
spawning child processes (which can't access ZeroGPU's GPU).
|
| 238 |
-
Max 300s (5 min) on ZeroGPU.
|
| 239 |
-
"""
|
| 240 |
-
import torch.multiprocessing as mp
|
| 241 |
-
import json
|
| 242 |
-
|
| 243 |
-
_setup_applio_env()
|
| 244 |
-
|
| 245 |
-
# Ensure assets/config.json exists (Applio reads precision from it)
|
| 246 |
-
assets_dir = os.path.join(APPLIO_DIR, "assets")
|
| 247 |
-
os.makedirs(assets_dir, exist_ok=True)
|
| 248 |
-
config_json = os.path.join(assets_dir, "config.json")
|
| 249 |
-
if not os.path.exists(config_json):
|
| 250 |
-
with open(config_json, "w") as f:
|
| 251 |
-
json.dump({"precision": "fp32"}, f)
|
| 252 |
-
|
| 253 |
-
# Select pretrained models
|
| 254 |
-
sr_prefix = str(sample_rate)[:2]
|
| 255 |
-
pg = os.path.join(APPLIO_DIR, "rvc", "models", "pretraineds", "hifi-gan", f"f0G{sr_prefix}k.pth")
|
| 256 |
-
pd = os.path.join(APPLIO_DIR, "rvc", "models", "pretraineds", "hifi-gan", f"f0D{sr_prefix}k.pth")
|
| 257 |
-
|
| 258 |
-
if not os.path.exists(pg) or not os.path.exists(pd):
|
| 259 |
-
logger.warning("Pretrained models not found, training from scratch.")
|
| 260 |
-
pg, pd = "", ""
|
| 261 |
|
| 262 |
-
|
| 263 |
-
|
| 264 |
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
|
| 268 |
-
|
| 269 |
-
self.args = args
|
| 270 |
-
self.kwargs = kwargs or {}
|
| 271 |
-
self.pid = os.getpid()
|
| 272 |
|
| 273 |
-
|
| 274 |
-
|
| 275 |
-
self.target(*self.args, **self.kwargs)
|
| 276 |
-
|
| 277 |
-
def join(self):
|
| 278 |
-
pass
|
| 279 |
-
|
| 280 |
-
train_script = os.path.join(APPLIO_DIR, "rvc", "train", "train.py")
|
| 281 |
-
|
| 282 |
-
argv_args = [
|
| 283 |
-
model_name,
|
| 284 |
-
str(total_epochs), str(total_epochs),
|
| 285 |
-
pg, pd,
|
| 286 |
-
"0", str(batch_size), str(sample_rate),
|
| 287 |
-
"True", "True", "False", "False", "50", "False", "HiFi-GAN", "False",
|
| 288 |
-
]
|
| 289 |
-
|
| 290 |
-
logger.info(f"Training {model_name} for {total_epochs} epochs (in-process)...")
|
| 291 |
-
start_time = time.time()
|
| 292 |
-
|
| 293 |
-
old_argv = sys.argv
|
| 294 |
-
old_cwd = os.getcwd()
|
| 295 |
-
|
| 296 |
-
mp.Process = InlineProcess
|
| 297 |
-
try:
|
| 298 |
-
os.chdir(APPLIO_DIR)
|
| 299 |
-
sys.argv = [train_script] + argv_args
|
| 300 |
-
runpy.run_path(train_script, run_name="__main__")
|
| 301 |
-
except SystemExit as e:
|
| 302 |
-
if e.code not in (0, None):
|
| 303 |
-
raise RuntimeError(f"Training exited with code {e.code}")
|
| 304 |
-
finally:
|
| 305 |
-
mp.Process = OrigProcess
|
| 306 |
-
sys.argv = old_argv
|
| 307 |
-
os.chdir(old_cwd)
|
| 308 |
-
|
| 309 |
-
elapsed = time.time() - start_time
|
| 310 |
-
logger.info(f"Training completed in {elapsed:.1f}s")
|
| 311 |
-
return True
|
| 312 |
-
|
| 313 |
-
|
| 314 |
-
def build_index(model_name: str):
|
| 315 |
-
"""Build FAISS index from extracted embeddings."""
|
| 316 |
-
import numpy as np
|
| 317 |
-
|
| 318 |
-
try:
|
| 319 |
-
import faiss
|
| 320 |
-
except ImportError:
|
| 321 |
-
logger.warning("faiss not available, skipping index building.")
|
| 322 |
-
return None
|
| 323 |
-
|
| 324 |
-
exp_dir = os.path.join(LOGS_DIR, model_name)
|
| 325 |
-
extracted_dir = os.path.join(exp_dir, "extracted")
|
| 326 |
-
|
| 327 |
-
if not os.path.exists(extracted_dir):
|
| 328 |
-
logger.warning("No extracted features found for index building.")
|
| 329 |
-
return None
|
| 330 |
-
|
| 331 |
-
# Load all embeddings
|
| 332 |
-
embeddings = []
|
| 333 |
-
for npy_file in sorted(glob.glob(os.path.join(extracted_dir, "*.npy"))):
|
| 334 |
-
try:
|
| 335 |
-
emb = np.load(npy_file)
|
| 336 |
-
if emb.ndim == 2:
|
| 337 |
-
embeddings.append(emb)
|
| 338 |
-
except Exception as e:
|
| 339 |
-
logger.warning(f"Failed to load {npy_file}: {e}")
|
| 340 |
-
|
| 341 |
-
if not embeddings:
|
| 342 |
-
logger.warning("No valid embeddings found for index.")
|
| 343 |
-
return None
|
| 344 |
-
|
| 345 |
-
all_emb = np.concatenate(embeddings, axis=0).astype(np.float32)
|
| 346 |
-
logger.info(f"Building FAISS index from {all_emb.shape[0]} vectors ({all_emb.shape[1]}D)...")
|
| 347 |
-
|
| 348 |
-
# Build IVF index for fast retrieval
|
| 349 |
-
dim = all_emb.shape[1]
|
| 350 |
-
n_vectors = all_emb.shape[0]
|
| 351 |
-
|
| 352 |
-
if n_vectors < 40:
|
| 353 |
-
# Too few vectors for IVF, use flat index
|
| 354 |
-
index = faiss.IndexFlatL2(dim)
|
| 355 |
-
else:
|
| 356 |
-
n_clusters = min(int(np.sqrt(n_vectors)), n_vectors // 4)
|
| 357 |
-
n_clusters = max(n_clusters, 1)
|
| 358 |
-
quantizer = faiss.IndexFlatL2(dim)
|
| 359 |
-
index = faiss.IndexIVFFlat(quantizer, dim, n_clusters)
|
| 360 |
-
index.train(all_emb)
|
| 361 |
-
|
| 362 |
-
index.add(all_emb)
|
| 363 |
-
|
| 364 |
-
index_path = os.path.join(exp_dir, f"{model_name}.index")
|
| 365 |
-
faiss.write_index(index, index_path)
|
| 366 |
-
|
| 367 |
-
# Save raw embeddings for FAISS retrieval at inference time
|
| 368 |
-
big_npy_path = os.path.join(exp_dir, f"{model_name}_big_npy.npy")
|
| 369 |
-
np.save(big_npy_path, all_emb)
|
| 370 |
-
|
| 371 |
-
logger.info(f"FAISS index built: {index_path} ({n_vectors} vectors)")
|
| 372 |
-
return index_path, big_npy_path
|
| 373 |
-
|
| 374 |
-
|
| 375 |
-
def find_trained_model(model_name: str):
|
| 376 |
-
"""Find the trained .pth model file."""
|
| 377 |
-
exp_dir = os.path.join(LOGS_DIR, model_name)
|
| 378 |
-
|
| 379 |
-
if os.path.exists(exp_dir):
|
| 380 |
-
exact = os.path.join(exp_dir, f"{model_name}.pth")
|
| 381 |
-
if os.path.exists(exact):
|
| 382 |
-
return exact
|
| 383 |
-
|
| 384 |
-
for f in sorted(os.listdir(exp_dir), reverse=True):
|
| 385 |
-
if f.endswith(".pth") and f.startswith(model_name):
|
| 386 |
-
return os.path.join(exp_dir, f)
|
| 387 |
-
|
| 388 |
-
return None
|
| 389 |
-
|
| 390 |
-
|
| 391 |
-
def find_pretrained_model(sample_rate: int = 40000):
|
| 392 |
-
"""Find the pre-trained RVC generator model."""
|
| 393 |
-
sr_prefix = str(sample_rate)[:2]
|
| 394 |
-
pg = os.path.join(APPLIO_DIR, "rvc", "models", "pretraineds", "hifi-gan", f"f0G{sr_prefix}k.pth")
|
| 395 |
-
if os.path.exists(pg):
|
| 396 |
-
return pg
|
| 397 |
-
return None
|
| 398 |
-
|
| 399 |
-
|
| 400 |
-
def _convert_to_inference_model(checkpoint_path, output_path, sample_rate=40000):
|
| 401 |
-
"""
|
| 402 |
-
Convert a pretrained training checkpoint to RVC inference format.
|
| 403 |
-
Training checkpoints have keys: model, optimizer, iteration, learning_rate
|
| 404 |
-
Inference models need keys: weight, config, info, sr, f0, version
|
| 405 |
"""
|
| 406 |
-
import
|
| 407 |
-
import
|
| 408 |
-
|
| 409 |
-
checkpoint = torch.load(checkpoint_path, map_location="cpu")
|
| 410 |
-
|
| 411 |
-
# Extract generator weights
|
| 412 |
-
if "model" in checkpoint:
|
| 413 |
-
state_dict = checkpoint["model"]
|
| 414 |
-
elif "state_dict" in checkpoint:
|
| 415 |
-
state_dict = checkpoint["state_dict"]
|
| 416 |
-
else:
|
| 417 |
-
state_dict = checkpoint
|
| 418 |
|
| 419 |
-
|
| 420 |
-
weight = {}
|
| 421 |
-
for k, v in state_dict.items():
|
| 422 |
-
new_key = k.replace("module.", "")
|
| 423 |
-
weight[new_key] = v.half()
|
| 424 |
|
| 425 |
-
|
| 426 |
-
|
| 427 |
-
config_path = os.path.join(APPLIO_DIR, "configs", "v2", f"{sr_label}.json")
|
| 428 |
|
| 429 |
-
|
| 430 |
-
|
| 431 |
-
cfg = json.load(f)
|
| 432 |
-
config = [
|
| 433 |
-
cfg["data"]["filter_length"] // 2 + 1,
|
| 434 |
-
cfg["train"]["segment_size"] // cfg["data"]["hop_length"],
|
| 435 |
-
cfg["model"]["inter_channels"],
|
| 436 |
-
cfg["model"]["hidden_channels"],
|
| 437 |
-
cfg["model"]["filter_channels"],
|
| 438 |
-
cfg["model"]["n_heads"],
|
| 439 |
-
cfg["model"]["n_layers"],
|
| 440 |
-
cfg["model"]["kernel_size"],
|
| 441 |
-
cfg["model"]["p_dropout"],
|
| 442 |
-
cfg["model"]["resblock"],
|
| 443 |
-
cfg["model"]["resblock_kernel_sizes"],
|
| 444 |
-
cfg["model"]["resblock_dilation_sizes"],
|
| 445 |
-
cfg["model"]["upsample_rates"],
|
| 446 |
-
cfg["model"]["upsample_initial_channel"],
|
| 447 |
-
cfg["model"]["upsample_kernel_sizes"],
|
| 448 |
-
cfg["model"]["spk_embed_dim"],
|
| 449 |
-
cfg["model"]["gin_channels"],
|
| 450 |
-
cfg["data"]["sampling_rate"],
|
| 451 |
-
]
|
| 452 |
-
else:
|
| 453 |
-
# Fallback: standard RVC v2 40k config
|
| 454 |
-
config = [
|
| 455 |
-
1025, 32, 192, 192, 768, 2, 6, 3, 0, "1",
|
| 456 |
-
[3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
|
| 457 |
-
[10, 10, 2, 2], 512, [16, 16, 4, 4], 109, 256, 40000,
|
| 458 |
-
]
|
| 459 |
|
| 460 |
-
|
| 461 |
-
|
| 462 |
-
"config": config,
|
| 463 |
-
"info": f"v2_{sr_label}",
|
| 464 |
-
"sr": sr_label,
|
| 465 |
-
"f0": 1,
|
| 466 |
-
"version": "v2",
|
| 467 |
-
}
|
| 468 |
|
| 469 |
-
|
| 470 |
-
|
| 471 |
-
|
|
|
|
| 472 |
|
| 473 |
-
|
| 474 |
-
|
| 475 |
-
|
| 476 |
-
|
| 477 |
-
|
| 478 |
-
sample_rate: int = 40000,
|
| 479 |
-
batch_size: int = 4,
|
| 480 |
-
progress_callback=None,
|
| 481 |
-
):
|
| 482 |
-
"""
|
| 483 |
-
Run the voice model creation pipeline.
|
| 484 |
-
On CPU: skips heavy HiFi-GAN training, uses pre-trained model + FAISS index.
|
| 485 |
-
Returns (pth_path, index_path) on success.
|
| 486 |
-
"""
|
| 487 |
-
import torch
|
| 488 |
-
from pipeline.storage import upload_model, LOCAL_MODELS_DIR
|
| 489 |
-
|
| 490 |
-
has_gpu = torch.cuda.is_available()
|
| 491 |
|
| 492 |
if progress_callback:
|
| 493 |
-
progress_callback(0.
|
| 494 |
-
|
| 495 |
-
n_slices = preprocess(model_name, audio_path, sample_rate)
|
| 496 |
-
|
| 497 |
-
if progress_callback:
|
| 498 |
-
progress_callback(0.20, f"{n_slices} segments created. Extracting vocal features...")
|
| 499 |
-
|
| 500 |
-
extract_features(model_name, sample_rate)
|
| 501 |
|
| 502 |
-
|
| 503 |
-
|
|
|
|
|
|
|
| 504 |
|
| 505 |
-
#
|
| 506 |
-
|
| 507 |
-
if
|
| 508 |
-
|
| 509 |
-
index_path, big_npy_path = result
|
| 510 |
|
| 511 |
-
# The user's "model" is the FAISS index + embeddings.
|
| 512 |
-
# The pretrained generator is shared by all models (loaded at inference time).
|
| 513 |
-
# Voice identity comes from FAISS retrieval, not generator fine-tuning.
|
| 514 |
if progress_callback:
|
| 515 |
-
progress_callback(0.
|
| 516 |
|
| 517 |
# Save to local models directory
|
| 518 |
local_model_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
|
| 519 |
os.makedirs(local_model_dir, exist_ok=True)
|
| 520 |
|
| 521 |
-
|
| 522 |
-
|
| 523 |
-
shutil.copy2(index_path, local_index)
|
| 524 |
|
| 525 |
-
#
|
| 526 |
-
|
| 527 |
-
|
| 528 |
-
|
| 529 |
-
|
| 530 |
-
|
| 531 |
-
|
|
|
|
|
|
|
| 532 |
|
| 533 |
if progress_callback:
|
| 534 |
-
progress_callback(0.
|
| 535 |
|
|
|
|
| 536 |
try:
|
| 537 |
-
upload_model(model_name,
|
| 538 |
except Exception as e:
|
| 539 |
-
logger.warning(
|
| 540 |
|
| 541 |
if progress_callback:
|
| 542 |
-
progress_callback(1.0, "
|
|
|
|
|
|
|
|
|
|
| 543 |
|
| 544 |
-
return
|
|
|
|
| 1 |
"""
|
| 2 |
+
Voice model creation: save a reference audio clip for Seed-VC zero-shot conversion.
|
| 3 |
+
No neural network training needed - Seed-VC uses in-context learning from
|
| 4 |
+
reference audio at inference time.
|
|
|
|
| 5 |
"""
|
| 6 |
|
| 7 |
import os
|
| 8 |
import logging
|
| 9 |
import shutil
|
|
|
|
|
|
|
| 10 |
|
| 11 |
logger = logging.getLogger(__name__)
|
| 12 |
|
|
|
|
| 21 |
return decorator
|
| 22 |
|
| 23 |
|
| 24 |
+
# Dummy GPU-decorated function so ZeroGPU detects a GPU function at startup
|
| 25 |
+
@spaces.GPU(duration=10)
|
| 26 |
+
def _gpu_warmup():
|
| 27 |
+
"""Minimal GPU function for ZeroGPU detection."""
|
| 28 |
import torch
|
| 29 |
+
return torch.cuda.is_available() if hasattr(torch.cuda, "is_available") else False
|
| 30 |
|
|
|
|
|
|
|
| 31 |
|
| 32 |
+
def save_voice_reference(
|
| 33 |
+
audio_path,
|
| 34 |
+
model_name,
|
| 35 |
+
progress_callback=None,
|
| 36 |
):
|
| 37 |
"""
|
| 38 |
+
Save a voice reference audio clip as the user's 'voice model'.
|
| 39 |
|
| 40 |
+
With Seed-VC, no training is needed. The reference audio (3-30 seconds)
|
| 41 |
+
is used directly at inference time for zero-shot voice conversion.
|
| 42 |
|
| 43 |
+
Args:
|
| 44 |
+
audio_path: Path to the uploaded voice recording
|
| 45 |
+
model_name: Name for the voice model
|
| 46 |
+
progress_callback: Optional callback for progress updates
|
|
|
|
|
|
|
|
|
|
| 47 |
|
| 48 |
+
Returns:
|
| 49 |
+
(marker_path, reference_path) - paths to the .pth marker file and the saved reference audio
|
| 50 |
"""
|
| 51 |
+
import librosa
|
| 52 |
+
import soundfile as sf
|
| 53 |
+
import numpy as np
|
| 54 |
|
| 55 |
+
from pipeline.storage import LOCAL_MODELS_DIR, upload_model
|
| 56 |
|
| 57 |
+
if progress_callback:
|
| 58 |
+
progress_callback(0.1, "Loading audio...")
|
|
|
|
| 59 |
|
| 60 |
+
# Load and preprocess the reference audio
|
| 61 |
+
audio, sr = librosa.load(audio_path, sr=44100, mono=True)
|
| 62 |
|
| 63 |
+
duration = len(audio) / sr
|
| 64 |
+
logger.info("Reference audio: {:.1f}s at {}Hz".format(duration, sr))
|
| 65 |
|
| 66 |
+
if duration < 2.0:
|
| 67 |
+
raise RuntimeError(
|
| 68 |
+
"Audio too short ({:.1f}s). A minimum of 3 seconds is recommended.".format(duration)
|
| 69 |
+
)
|
| 70 |
|
| 71 |
+
# Limit to 30 seconds (Seed-VC max reference length)
|
| 72 |
+
max_samples = 30 * sr
|
| 73 |
+
if len(audio) > max_samples:
|
| 74 |
+
audio = audio[:max_samples]
|
| 75 |
+
logger.info("Trimmed reference to 30s (Seed-VC max).")
|
| 76 |
|
| 77 |
if progress_callback:
|
| 78 |
+
progress_callback(0.3, "Normalizing and cleaning...")
|
| 79 |
|
| 80 |
+
# Normalize audio
|
| 81 |
+
peak = np.abs(audio).max()
|
| 82 |
+
if peak > 0:
|
| 83 |
+
audio = audio / peak * 0.95
|
| 84 |
|
| 85 |
+
# Trim silence from start and end
|
| 86 |
+
audio_trimmed, _ = librosa.effects.trim(audio, top_db=25)
|
| 87 |
+
if len(audio_trimmed) > sr * 2:
|
| 88 |
+
audio = audio_trimmed
|
|
|
|
| 89 |
|
|
|
|
|
|
|
|
|
|
| 90 |
if progress_callback:
|
| 91 |
+
progress_callback(0.6, "Saving voice reference...")
|
| 92 |
|
| 93 |
# Save to local models directory
|
| 94 |
local_model_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
|
| 95 |
os.makedirs(local_model_dir, exist_ok=True)
|
| 96 |
|
| 97 |
+
reference_path = os.path.join(local_model_dir, "{}_ref.wav".format(model_name))
|
| 98 |
+
sf.write(reference_path, audio, 44100, subtype="PCM_16")
|
|
|
|
| 99 |
|
| 100 |
+
# Also save a .pth marker for compatibility with storage/listing
|
| 101 |
+
import torch
|
| 102 |
+
marker_path = os.path.join(local_model_dir, "{}.pth".format(model_name))
|
| 103 |
+
torch.save({
|
| 104 |
+
"type": "seed_vc_reference",
|
| 105 |
+
"reference_audio": "{}_ref.wav".format(model_name),
|
| 106 |
+
"duration": len(audio) / sr,
|
| 107 |
+
"sample_rate": 44100,
|
| 108 |
+
}, marker_path)
|
| 109 |
|
| 110 |
if progress_callback:
|
| 111 |
+
progress_callback(0.8, "Uploading to HuggingFace...")
|
| 112 |
|
| 113 |
+
# Upload to HF
|
| 114 |
try:
|
| 115 |
+
upload_model(model_name, marker_path, reference_path=reference_path)
|
| 116 |
except Exception as e:
|
| 117 |
+
logger.warning("Failed to upload to HF (non-critical): {}".format(e))
|
| 118 |
|
| 119 |
if progress_callback:
|
| 120 |
+
progress_callback(1.0, "Voice reference saved!")
|
| 121 |
+
|
| 122 |
+
final_duration = len(audio) / sr
|
| 123 |
+
logger.info("Voice reference saved: {} ({:.1f}s)".format(reference_path, final_duration))
|
| 124 |
|
| 125 |
+
return marker_path, reference_path
|
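The length check, 30-second cap, and peak normalization in save_voice_reference can be tried in isolation. The following is a minimal sketch using NumPy only; the function name prepare_reference and its keyword parameters are hypothetical and are not part of the commit. It omits the librosa silence trimming and all file I/O.

```python
import numpy as np


def prepare_reference(
    audio: np.ndarray,
    sr: int,
    min_seconds: float = 2.0,   # same threshold as the duration check in the diff
    max_seconds: float = 30.0,  # same cap as the 30 s limit in the diff
    target_peak: float = 0.95,  # same normalization target as in the diff
) -> np.ndarray:
    """Hypothetical helper: validate, cap, and peak-normalize a mono clip."""
    duration = len(audio) / sr
    if duration < min_seconds:
        raise RuntimeError("Audio too short ({:.1f}s).".format(duration))

    # Keep at most max_seconds of audio, counted from the start of the clip.
    max_samples = int(max_seconds * sr)
    audio = audio[:max_samples]

    # Scale so that the loudest sample sits at target_peak; skip pure silence.
    peak = float(np.abs(audio).max())
    if peak > 0:
        audio = audio / peak * target_peak
    return audio


if __name__ == "__main__":
    sr = 44100
    t = np.arange(40 * sr) / sr                   # 40 s test tone
    clip = 0.5 * np.sin(2 * np.pi * 220.0 * t)
    ref = prepare_reference(clip, sr)
    print(len(ref) / sr, float(np.abs(ref).max()))
```

Because the cap is applied before normalization, the peak is measured on the retained 30 seconds only, which matches the order of operations in the diff.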
requirements.txt
CHANGED
|
@@ -13,31 +13,23 @@ soundfile==0.12.1
|
|
| 13 |
scipy>=1.11.0
|
| 14 |
numpy<2.0
|
| 15 |
soxr
|
| 16 |
-
noisereduce
|
| 17 |
ffmpeg-python>=0.2.0
|
| 18 |
pedalboard
|
| 19 |
|
| 20 |
-
# RVC dependencies
|
| 21 |
-
faiss-cpu==1.9.0.post1
|
| 22 |
-
torchcrepe
|
| 23 |
-
torchfcpe
|
| 24 |
-
einops
|
| 25 |
-
transformers==4.44.2
|
| 26 |
-
|
| 27 |
# Demucs (stem separation)
|
| 28 |
demucs
|
| 29 |
|
| 30 |
-
#
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
# ML utilities
|
| 34 |
-
tqdm
|
| 35 |
pyyaml
|
| 36 |
requests
|
|
|
|
| 37 |
numba
|
| 38 |
|
| 39 |
-
#
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
|
|
|
|
|
| 13 |
scipy>=1.11.0
|
| 14 |
numpy<2.0
|
| 15 |
soxr
|
|
|
|
| 16 |
ffmpeg-python>=0.2.0
|
| 17 |
pedalboard
|
| 18 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
# Demucs (stem separation)
|
| 20 |
demucs
|
| 21 |
|
| 22 |
+
# Seed-VC dependencies
|
| 23 |
+
einops
|
| 24 |
+
transformers>=4.40.0
|
|
|
|
|
|
|
| 25 |
pyyaml
|
| 26 |
requests
|
| 27 |
+
tqdm
|
| 28 |
numba
|
| 29 |
|
| 30 |
+
# Vocoder
|
| 31 |
+
bigvgan
|
| 32 |
+
|
| 33 |
+
# Audio codec & vocos (used by Seed-VC)
|
| 34 |
+
descript-audio-codec
|
| 35 |
+
vocos
|