ibcplateformes Claude Opus 4.6 committed on
Commit fea49f2 · 1 Parent(s): 55b9bab

Replace RVC with Seed-VC for zero-shot voice conversion


RVC requires fine-tuning (250-500 epochs), which does not fit within ZeroGPU's
60 s limit and resulted in poor quality. Seed-VC uses a diffusion transformer with
in-context learning for zero-shot conversion from just 3-30 s of reference audio.

- Rewrite inference.py: Seed-VC pipeline (Whisper + CAMPPlus + RMVPE + BigVGAN)
- Simplify training.py: just save reference audio (no neural network training)
- Simplify setup.py: clone Seed-VC repo instead of Applio
- Update storage.py: handle audio reference files
- Simplify app.py UI: remove epochs slider, 3-30 sec upload
- Update requirements.txt: remove RVC deps, add Seed-VC deps
- Update README.md: reflect new architecture

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
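
The "just save reference audio" step described for training.py can be pictured roughly as follows. This is a minimal sketch, not the code from this commit: the names `save_reference` and `wav_duration_seconds`, the `reference.wav` file name, and the WAV-only duration check are assumptions made for illustration (the real module also accepts MP3 and may validate differently).

```python
import os
import shutil
import wave

MIN_SECONDS = 3.0   # lower bound quoted in the commit message
MAX_SECONDS = 30.0  # upper bound quoted in the commit message


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def save_reference(audio_path, profile_name, profiles_dir):
    """Copy a short reference clip to profiles_dir/profile_name/reference.wav.

    Hypothetical helper: no neural network is trained here. The stored clip
    is what a zero-shot converter would later be conditioned on.
    """
    duration = wav_duration_seconds(audio_path)
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        raise ValueError(
            "Reference must be 3-30 s long, got {:.1f} s".format(duration)
        )
    target_dir = os.path.join(profiles_dir, profile_name)
    os.makedirs(target_dir, exist_ok=True)
    target_path = os.path.join(target_dir, "reference.wav")
    shutil.copy2(audio_path, target_path)
    return target_path
```

Because nothing is trained, the step is bounded by file I/O rather than GPU time, which is why it fits comfortably inside a short ZeroGPU window.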

Files changed (7)
  1. README.md +22 -100
  2. app.py +112 -172
  3. pipeline/inference.py +286 -269
  4. pipeline/setup.py +31 -114
  5. pipeline/storage.py +73 -70
  6. pipeline/training.py +70 -489
  7. requirements.txt +10 -18
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: Clone Vocal RVC
3
  emoji: "\U0001F3A4"
4
  colorFrom: purple
5
  colorTo: blue
@@ -10,126 +10,48 @@ app_file: app.py
10
  pinned: false
11
  license: mit
12
  tags:
13
- - rvc
14
  - voice-cloning
15
  - demucs
16
  - audio
17
  - music
 
18
  ---
19
 
20
- # Clone Vocal RVC
21
 
22
- Outil web de **clonage vocal** basé sur **RVC v2** (Retrieval-based Voice Conversion), accessible depuis votre navigateur.
23
 
24
  ## Fonctionnalités
25
 
26
- 1. **Entraînement vocal** : Uploadez un enregistrement de votre voix (3-5 min) pour créer un modèle vocal personnalisé
27
  2. **Séparation audio** : Séparation automatique voix/instruments via Demucs (Meta AI)
28
- 3. **Conversion vocale** : Remplacement de la voix originale par votre voix clonée
29
- 4. **Mixage final** : Remixage automatique de votre voix convertie + les instruments originaux
30
  5. **Export** : Téléchargement du résultat en WAV 44.1kHz 16-bit
31
 
32
  ## Comment utiliser
33
 
34
- ### Étape 1 : Entraîner votre modèle vocal
35
- 1. Allez dans l'onglet **"Entraîner ma voix"**
36
- 2. Uploadez un enregistrement de votre voix (WAV ou MP3, 3-5 minutes)
37
- - Parlez ou chantez naturellement
38
- - Évitez le bruit de fond
39
- 3. Donnez un nom à votre modèle (ex: `ma_voix`)
40
- 4. Choisissez le nombre d'époques (20 par défaut, suffisant pour un bon résultat)
41
- 5. Cliquez sur **"Lancer l'entraînement"**
42
- 6. Attendez la fin de l'entraînement (~3-5 minutes)
43
 
44
  ### Étape 2 : Convertir un morceau
45
- 1. Allez dans l'onglet **"Convertir un morceau"**
46
- 2. Sélectionnez votre modèle vocal dans la liste
47
- 3. Uploadez le morceau de musique à convertir (WAV ou MP3)
48
- 4. Ajustez les paramètres si besoin :
49
- - **Transposition** : +/- demi-tons si votre voix est plus grave/aiguë
50
- - **Taux d'index** : fidélité au timbre (0.75 par défaut)
51
- - **Volumes** : équilibre voix/instruments
52
- 5. Cliquez sur **"Convertir et mixer"**
53
- 6. Écoutez l'aperçu et téléchargez le résultat
54
-
55
- ### Étape 3 : Gérer vos modèles
56
- - L'onglet **"Mes modèles"** permet de voir, supprimer, ou importer des modèles externes
57
-
58
- ## Déploiement
59
-
60
- ### Prérequis
61
- - Un compte [HuggingFace](https://huggingface.co)
62
- - Un compte [GitHub](https://github.com)
63
-
64
- ### Étapes de déploiement
65
-
66
- #### 1. Créer un dataset repo sur HuggingFace (pour stocker les modèles)
67
- 1. Allez sur https://huggingface.co/new-dataset
68
- 2. Nom : `rvc-voice-models`
69
- 3. Visibilité : **Privé**
70
- 4. Cliquez **Create**
71
-
72
- #### 2. Créer un token HuggingFace
73
- 1. Allez sur https://huggingface.co/settings/tokens
74
- 2. Cliquez **Create new token**
75
- 3. Nom : `rvc-voice-cloner`
76
- 4. Permissions : **Write**
77
- 5. Copiez le token
78
-
79
- #### 3. Créer le repo GitHub
80
- ```bash
81
- cd rvc-voice-cloner
82
- git init
83
- git add .
84
- git commit -m "Initial commit: Clone Vocal RVC"
85
- git remote add origin https://github.com/diamesene02/rvc-voice-cloner.git
86
- git push -u origin main
87
- ```
88
-
89
- #### 4. Créer le HuggingFace Space
90
- 1. Allez sur https://huggingface.co/new-space
91
- 2. Nom : `clone-vocal-rvc`
92
- 3. SDK : **Gradio**
93
- 4. Hardware : **ZeroGPU** (gratuit pour les espaces publics)
94
- 5. Cliquez **Create Space**
95
-
96
- #### 5. Configurer les secrets du Space
97
- Dans les **Settings** du Space :
98
- - Ajoutez `HF_TOKEN` : votre token HuggingFace (étape 2)
99
- - Ajoutez `HF_MODELS_REPO` : `votre-username/rvc-voice-models`
100
-
101
- #### 6. Déployer le code
102
- ```bash
103
- # Ajouter le remote HuggingFace
104
- git remote add hf https://huggingface.co/spaces/votre-username/clone-vocal-rvc
105
-
106
- # Pousser le code
107
- git push hf main
108
- ```
109
-
110
- #### 7. Accéder à l'outil
111
- Votre outil est accessible à :
112
- ```
113
- https://huggingface.co/spaces/votre-username/clone-vocal-rvc
114
- ```
115
 
116
  ## Architecture technique
117
 
118
- - **RVC v2** : Retrieval-based Voice Conversion avec HiFi-GAN
119
- - **Demucs** (Meta AI) : Séparation des sources audio (voix/instruments)
120
  - **Gradio** : Interface web
121
- - **ZeroGPU** : GPU H200 gratuit sur HuggingFace Spaces
122
- - **Applio** : Backend RVC (cloné automatiquement au démarrage)
123
-
124
- ## Limitations
125
-
126
- - **Quota GPU** : ~5 minutes de GPU gratuit par jour (ZeroGPU)
127
- - L'entraînement consomme ~3-4 min
128
- - La conversion consomme ~1-2 min
129
- - Pour plus de GPU : upgrade vers HuggingFace PRO ($9/mois, 25 min/jour)
130
- - Les modèles sont stockés sur HuggingFace Hub (persistance entre redémarrages)
131
- - Premier lancement plus lent (téléchargement des modèles pré-entraînés)
132
 
133
  ## Licence
134
 
135
- MIT - Basé sur [Applio](https://github.com/IAHispano/Applio) (MIT) et [Demucs](https://github.com/facebookresearch/demucs) (MIT)
 
1
  ---
2
+ title: Clone Vocal
3
  emoji: "\U0001F3A4"
4
  colorFrom: purple
5
  colorTo: blue
 
10
  pinned: false
11
  license: mit
12
  tags:
13
+ - seed-vc
14
  - voice-cloning
15
  - demucs
16
  - audio
17
  - music
18
+ - zero-shot
19
  ---
20
 
21
+ # Clone Vocal
22
 
23
+ Outil web de **clonage vocal zero-shot** basé sur **Seed-VC** (Diffusion Transformer), accessible depuis votre navigateur.
24
 
25
  ## Fonctionnalités
26
 
27
+ 1. **Référence vocale** : Uploadez un court extrait de votre voix (3-30 sec), pas d'entraînement nécessaire
28
  2. **Séparation audio** : Séparation automatique voix/instruments via Demucs (Meta AI)
29
+ 3. **Conversion vocale** : Remplacement de la voix originale par la vôtre (Seed-VC zero-shot)
30
+ 4. **Mixage final** : Remixage automatique voix convertie + instruments originaux
31
  5. **Export** : Téléchargement du résultat en WAV 44.1kHz 16-bit
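
The last two features listed above (final mix, then export as 44.1 kHz 16-bit WAV) can be sketched as follows. This is only an illustration using the standard library: `mix_tracks` and `write_wav_16bit` are hypothetical names, the tracks are assumed to be mono lists of floats in [-1.0, 1.0], and the project's own `pipeline/mixing.py` is not shown in this diff and may work differently.

```python
import array
import sys
import wave


def mix_tracks(vocals, instruments, vocal_volume=1.0, instrumental_volume=1.0):
    """Mix two mono float tracks sample by sample, applying one gain per track."""
    length = max(len(vocals), len(instruments))
    mixed = []
    for i in range(length):
        v = vocals[i] if i < len(vocals) else 0.0            # pad the shorter track
        m = instruments[i] if i < len(instruments) else 0.0
        sample = v * vocal_volume + m * instrumental_volume
        mixed.append(max(-1.0, min(1.0, sample)))             # hard clip to [-1, 1]
    return mixed


def write_wav_16bit(path, samples, sample_rate=44100):
    """Write mono float samples as 16-bit PCM WAV."""
    pcm = array.array("h", (int(round(s * 32767)) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
```

Hard clipping is the simplest way to keep the sum in range; a real mixer would more likely normalize or apply a limiter to avoid audible distortion.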
32
 
33
  ## Comment utiliser
34
 
35
+ ### Étape 1 : Enregistrer votre référence vocale
36
+ 1. Onglet **"Ma voix"**
37
+ 2. Uploadez un extrait de votre voix (WAV ou MP3, 3 à 30 secondes)
38
+ 3. Donnez un nom (ex: `ma_voix`)
39
+ 4. Cliquez **"Sauvegarder"**
 
 
 
 
40
 
41
  ### Étape 2 : Convertir un morceau
42
+ 1. Onglet **"Convertir un morceau"**
43
+ 2. Sélectionnez votre profil vocal
44
+ 3. Uploadez le morceau à convertir
45
+ 4. Ajustez les paramètres si besoin (transposition, qualité, volumes)
46
+ 5. Cliquez **"Convertir et mixer"**
47
 
48
  ## Architecture technique
49
 
50
+ - **Seed-VC** : Voice conversion zero-shot par diffusion transformer + in-context learning
51
+ - **Demucs** (Meta AI) : Séparation des sources audio
52
  - **Gradio** : Interface web
53
+ - **ZeroGPU** : GPU sur HuggingFace Spaces
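
On ZeroGPU, GPU-bound functions are marked with the `spaces.GPU` decorator, and the pipeline modules in this commit keep a fallback for environments where the `spaces` package cannot be imported (only the tail of that fallback is visible in the diff). A minimal sketch of the pattern, assuming a hypothetical alias `gpu` and a stand-in function `convert_stub`:

```python
try:
    import spaces  # provided on Hugging Face Spaces
    gpu = spaces.GPU
except ImportError:
    def gpu(func=None, **_options):
        """No-op stand-in usable both as @gpu and as @gpu(duration=60)."""
        if func is None:
            # Called with options: return a decorator that leaves the function as-is.
            return lambda f: f
        # Called directly on a function: return it unchanged.
        return func


@gpu(duration=60)
def convert_stub(x):
    """Placeholder for a GPU-bound step such as voice conversion."""
    return x * 2
```

With this shape the same module imports cleanly on a laptop and on a Space; only the decorator's effect differs.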
 
 
 
 
 
 
 
 
 
 
54
 
55
  ## Licence
56
 
57
+ MIT - Basé sur [Seed-VC](https://github.com/Plachtaa/seed-vc) (GPL v3) et [Demucs](https://github.com/facebookresearch/demucs) (MIT)
app.py CHANGED
@@ -1,6 +1,6 @@
1
  """
2
- Clone Vocal RVC - Outil web de clonage vocal basé sur RVC v2
3
- Interface Gradio en français, déployé sur HuggingFace Spaces avec ZeroGPU.
4
  """
5
 
6
  import os
@@ -11,8 +11,7 @@ import shutil
11
 
12
  import gradio as gr
13
 
14
- # ── Monkey-patch gradio_client to fix "argument of type 'bool' is not iterable" ──
15
- # Bug: gradio_client/utils.py get_type() crashes when schema is a bool instead of dict
16
  try:
17
  import gradio_client.utils as _gc_utils
18
 
@@ -43,46 +42,38 @@ except Exception:
43
  logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
44
  logger = logging.getLogger(__name__)
45
 
46
- # ── Startup: clone Applio + download models ──────────────────────────────────
47
-
48
  logger.info("Initialisation de l'application...")
49
 
50
- from pipeline.setup import setup_applio, APPLIO_DIR
51
- from pipeline.storage import init_storage, list_models, download_model, delete_model
52
 
53
- # Setup Applio (clone + download pretrained models)
54
  try:
55
- setup_applio()
56
  except Exception as e:
57
- logger.error(f"Erreur lors du setup: {e}")
58
 
59
  # Initialize model storage
60
  HF_MODELS_REPO = os.environ.get("HF_MODELS_REPO", "")
61
  if HF_MODELS_REPO:
62
  init_storage(HF_MODELS_REPO)
63
- logger.info(f"Stockage HuggingFace configuré: {HF_MODELS_REPO}")
64
- else:
65
- logger.warning(
66
- "Variable HF_MODELS_REPO non définie. Les modèles seront stockés localement uniquement. "
67
- "Pour la persistance, ajoutez HF_MODELS_REPO=votre-user/rvc-voice-models dans les secrets du Space."
68
- )
69
 
70
-
71
- # ── Import GPU-decorated functions at top level for ZeroGPU detection ───────
72
- from pipeline.training import full_training_pipeline, extract_features
73
  from pipeline.separation import separate_audio
74
  from pipeline.inference import convert_voice
75
 
76
 
77
- # ── Training Tab ─────────────────────────────────────────────────────────────
78
 
79
- def train_voice_model(audio_file, model_name, epochs, progress=gr.Progress()):
80
- """Handler for voice model training."""
81
  if audio_file is None:
82
  return "Erreur : Veuillez uploader un fichier audio.", None
83
 
84
  if not model_name or not model_name.strip():
85
- return "Erreur : Veuillez entrer un nom pour le modèle.", None
86
 
87
  model_name = model_name.strip().replace(" ", "_")
88
 
@@ -90,39 +81,31 @@ def train_voice_model(audio_file, model_name, epochs, progress=gr.Progress()):
90
  progress(value, desc=desc)
91
 
92
  try:
93
- progress(0.0, desc="Démarrage de l'entraînement...")
94
-
95
- pth_path, index_path = full_training_pipeline(
96
  audio_path=audio_file,
97
  model_name=model_name,
98
- epochs=int(epochs),
99
- sample_rate=40000,
100
- batch_size=8,
101
  progress_callback=progress_callback,
102
  )
103
 
104
- result_msg = f"Modèle '{model_name}' entraîné avec succès !\n"
105
- result_msg += f"Fichier : {os.path.basename(pth_path)}\n"
106
- if index_path:
107
- result_msg += f"Index : {os.path.basename(index_path)}"
108
-
109
- return result_msg, pth_path
110
 
111
  except Exception as e:
112
  import traceback
113
  tb = traceback.format_exc()
114
- logger.error(f"Erreur training: {tb}")
115
- # Show last 500 chars of traceback for debugging
116
- return f"Erreur lors de l'entraînement : {type(e).__name__}: {str(e)}\n\nDétails:\n{tb[-500:]}", None
 
117
 
118
 
119
- # ── Conversion Tab ───────────────────────────────────────────────────────────
120
 
121
  def get_model_choices():
122
  """Get list of trained model names for dropdown."""
123
  models = list_models()
124
  if not models:
125
- return ["(aucun modèle entraîné)"]
126
  return models
127
 
128
 
@@ -130,7 +113,7 @@ def convert_song(
130
  model_choice,
131
  song_file,
132
  pitch,
133
- index_rate,
134
  vocal_volume,
135
  instrumental_volume,
136
  progress=gr.Progress(),
@@ -139,35 +122,39 @@ def convert_song(
139
  if song_file is None:
140
  return "Erreur : Veuillez uploader un fichier audio.", None, None, None
141
 
142
- if model_choice == "(aucun modèle entraîné)" or not model_choice:
143
- return "Erreur : Veuillez d'abord entraîner un modèle vocal.", None, None, None
144
 
145
  from pipeline.mixing import mix_audio
146
 
147
  try:
148
- # Step 1: Download model
149
- progress(0.05, desc="Chargement du modèle...")
150
- pth_path, index_path = download_model(model_choice)
151
  if not pth_path:
152
- return f"Erreur : Modèle '{model_choice}' introuvable.", None, None, None
 
 
 
 
 
153
 
154
  # Step 2: Separate vocals from instruments
155
- progress(0.10, desc="Séparation des pistes (Demucs)...")
156
  vocals_path, instruments_path = separate_audio(song_file)
157
 
158
- progress(0.50, desc="Conversion vocale (RVC)...")
159
 
160
- # Step 3: Convert vocals with RVC
161
  converted_path = convert_voice(
162
  audio_path=vocals_path,
163
- model_path=pth_path,
164
- index_path=index_path,
165
  pitch=int(pitch),
166
- f0_method="rmvpe",
167
- index_rate=float(index_rate),
168
  )
169
 
170
- progress(0.80, desc="Mixage final...")
171
 
172
  # Step 4: Mix converted vocals with instruments
173
  final_path = mix_audio(
@@ -177,119 +164,94 @@ def convert_song(
177
  instrumental_volume=float(instrumental_volume),
178
  )
179
 
180
- progress(1.0, desc="Terminé !")
181
 
182
  return (
183
- "Conversion terminée avec succès !",
184
- vocals_path, # Preview vocals séparées
185
- converted_path, # Preview vocals converties
186
- final_path, # Résultat final
187
  )
188
 
189
  except Exception as e:
190
  import traceback
191
  tb = traceback.format_exc()
192
- logger.error(f"Erreur conversion: {tb}")
193
- return f"Erreur lors de la conversion : {type(e).__name__}: {str(e)}\n\nDétails:\n{tb[-500:]}", None, None, None
 
 
194
 
195
 
196
- # ── Models Tab ───────────────────────────────────────────────────────────────
197
 
198
  def refresh_models():
199
  """Refresh the model list as HTML."""
200
  models = list_models()
201
  if not models:
202
- return "<p style='color:gray;'>Aucun modèle entraîné</p>"
203
- rows = "".join(f"<tr><td>{m}</td><td>Disponible</td></tr>" for m in models)
204
- return f"<table style='width:100%;border-collapse:collapse;'><tr><th style='text-align:left;border-bottom:1px solid #555;padding:8px;'>Nom</th><th style='text-align:left;border-bottom:1px solid #555;padding:8px;'>Statut</th></tr>{rows}</table>"
 
 
 
 
 
 
 
205
 
206
 
207
  def delete_selected_model(model_name_to_delete):
208
  """Delete a model."""
209
- if not model_name_to_delete or model_name_to_delete == "(aucun modèle entraîné)":
210
- return "Veuillez sélectionner un modèle à supprimer.", refresh_models()
211
  try:
212
  delete_model(model_name_to_delete)
213
- return f"Modèle '{model_name_to_delete}' supprimé.", refresh_models()
214
  except Exception as e:
215
- return f"Erreur : {e}", refresh_models()
216
 
217
 
218
- def upload_external_model(pth_file, model_name):
219
- """Upload an external .pth model."""
220
- if pth_file is None:
221
- return "Veuillez sélectionner un fichier .pth", refresh_models()
222
-
223
- if not model_name or not model_name.strip():
224
- return "Veuillez entrer un nom pour le modèle.", refresh_models()
225
-
226
- model_name = model_name.strip().replace(" ", "_")
227
-
228
- from pipeline.storage import LOCAL_MODELS_DIR, upload_model
229
-
230
- local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
231
- os.makedirs(local_dir, exist_ok=True)
232
-
233
- local_pth = os.path.join(local_dir, f"{model_name}.pth")
234
- shutil.copy2(pth_file, local_pth)
235
-
236
- try:
237
- upload_model(model_name, local_pth)
238
- except Exception:
239
- pass # Non-critical
240
-
241
- return f"Modèle '{model_name}' importé avec succès.", refresh_models()
242
-
243
-
244
- # ── Build Gradio UI ──────────────────────────────────────────────────────────
245
 
246
  DESCRIPTION = """
247
- # Clone Vocal RVC
248
 
249
- Outil de clonage vocal basé sur **RVC v2** (Retrieval-based Voice Conversion).
250
 
251
  **Comment utiliser :**
252
- 1. **Onglet "Entraîner"** : Uploadez un enregistrement de votre voix (3-5 min) pour créer votre modèle vocal
253
- 2. **Onglet "Convertir"** : Uploadez un morceau de musique, l'outil remplace la voix par la vôtre
254
- 3. **Onglet "Modèles"** : Gérez vos modèles vocaux entraînés
255
 
256
- > **Note** : Cet outil utilise ZeroGPU. Le quota GPU gratuit est limité (~5 min/jour).
257
- > L'entraînement consomme ~3-4 min de GPU, la conversion ~1-2 min.
258
  """
259
 
260
  with gr.Blocks(
261
- title="Clone Vocal RVC",
262
  theme=gr.themes.Soft(),
263
  ) as app:
264
 
265
  gr.Markdown(DESCRIPTION)
266
 
267
  with gr.Tabs():
268
- # ── Tab 1: Training ──
269
- with gr.TabItem("Entraîner ma voix"):
270
- gr.Markdown("### Créer un modèle vocal à partir de votre voix")
271
 
272
  with gr.Row():
273
  with gr.Column(scale=2):
274
  train_audio = gr.Audio(
275
- label="Enregistrement vocal (WAV ou MP3, 3-5 minutes)",
276
  type="filepath",
277
  sources=["upload"],
278
  )
279
  train_model_name = gr.Textbox(
280
- label="Nom du modèle",
281
  placeholder="ex: ma_voix",
282
  max_lines=1,
283
  )
284
- train_epochs = gr.Slider(
285
- minimum=5,
286
- maximum=50,
287
- value=20,
288
- step=5,
289
- label="Nombre d'époques (plus = meilleure qualité, ~3-5 min avec GPU)",
290
- )
291
  train_btn = gr.Button(
292
- "Lancer l'entraînement",
293
  variant="primary",
294
  size="lg",
295
  )
@@ -298,59 +260,59 @@ with gr.Blocks(
298
  train_status = gr.Textbox(
299
  label="Statut",
300
  interactive=False,
301
- lines=5,
302
  )
303
  train_download = gr.File(
304
- label="Télécharger le modèle",
305
  interactive=False,
306
  )
307
 
308
  gr.Markdown(
309
  "**Conseils :**\n"
310
  "- Utilisez un enregistrement propre (pas de bruit de fond, pas de musique)\n"
311
- "- Parlez ou chantez naturellement pendant 3-5 minutes\n"
312
- "- Format WAV ou MP3 accepté\n"
313
- "- 15-25 époques suffisent pour un bon résultat"
314
  )
315
 
316
  train_btn.click(
317
  fn=train_voice_model,
318
- inputs=[train_audio, train_model_name, train_epochs],
319
  outputs=[train_status, train_download],
320
  )
321
 
322
- # ── Tab 2: Conversion ──
323
  with gr.TabItem("Convertir un morceau"):
324
- gr.Markdown("### Remplacer la voix d'un morceau par la vôtre")
325
 
326
  with gr.Row():
327
  with gr.Column(scale=2):
328
  convert_model = gr.Dropdown(
329
  choices=get_model_choices(),
330
- label="Modèle vocal",
331
  interactive=True,
332
  )
333
- refresh_btn = gr.Button("Rafraîchir la liste", size="sm")
334
  convert_audio = gr.Audio(
335
- label="Morceau à convertir (WAV ou MP3)",
336
  type="filepath",
337
  sources=["upload"],
338
  )
339
 
340
- with gr.Accordion("Paramètres avancés", open=False):
341
  convert_pitch = gr.Slider(
342
- minimum=-12,
343
- maximum=12,
344
  value=0,
345
  step=1,
346
- label="Transposition (demi-tons) — ajustez si votre voix est plus grave/aiguë",
347
  )
348
- convert_index_rate = gr.Slider(
349
- minimum=0.0,
350
- maximum=1.0,
351
- value=0.75,
352
- step=0.05,
353
- label="Taux d'index (plus haut = plus fidèle au timbre original)",
354
  )
355
  convert_vocal_vol = gr.Slider(
356
  minimum=0.0,
@@ -379,16 +341,16 @@ with gr.Blocks(
379
  interactive=False,
380
  lines=3,
381
  )
382
- gr.Markdown("**Aperçu des pistes :**")
383
  preview_vocals = gr.Audio(
384
- label="Voix originale (séparée)",
385
  interactive=False,
386
  )
387
  preview_converted = gr.Audio(
388
  label="Voix convertie",
389
  interactive=False,
390
  )
391
- gr.Markdown("**Résultat final :**")
392
  final_output = gr.Audio(
393
  label="Morceau final (voix + instruments)",
394
  interactive=False,
@@ -405,49 +367,33 @@ with gr.Blocks(
405
  convert_model,
406
  convert_audio,
407
  convert_pitch,
408
- convert_index_rate,
409
  convert_vocal_vol,
410
  convert_inst_vol,
411
  ],
412
  outputs=[convert_status, preview_vocals, preview_converted, final_output],
413
  )
414
 
415
- # ── Tab 3: Models ──
416
- with gr.TabItem("Mes modèles"):
417
- gr.Markdown("### Gérer vos modèles vocaux")
418
 
419
  models_table = gr.HTML(
420
  value=refresh_models(),
421
- label="Modèles entraînés",
422
  )
423
 
424
  with gr.Row():
425
- models_refresh_btn = gr.Button("Rafraîchir", size="sm")
426
  models_delete_name = gr.Dropdown(
427
  choices=get_model_choices(),
428
- label="Modèle à supprimer",
429
  interactive=True,
430
  )
431
  models_delete_btn = gr.Button("Supprimer", variant="stop", size="sm")
432
 
433
  models_delete_status = gr.Textbox(label="Statut", interactive=False)
434
 
435
- gr.Markdown("---")
436
- gr.Markdown("### Importer un modèle externe")
437
-
438
- with gr.Row():
439
- upload_pth = gr.File(
440
- label="Fichier .pth du modèle",
441
- file_types=[".pth"],
442
- )
443
- upload_name = gr.Textbox(
444
- label="Nom du modèle",
445
- placeholder="ex: voix_importee",
446
- )
447
- upload_btn = gr.Button("Importer", size="sm")
448
-
449
- upload_status = gr.Textbox(label="Statut", interactive=False)
450
-
451
  models_refresh_btn.click(
452
  fn=refresh_models,
453
  outputs=[models_table],
@@ -463,12 +409,6 @@ with gr.Blocks(
463
  outputs=[models_delete_status, models_table],
464
  )
465
 
466
- upload_btn.click(
467
- fn=upload_external_model,
468
- inputs=[upload_pth, upload_name],
469
- outputs=[upload_status, models_table],
470
- )
471
-
472
 
473
  if __name__ == "__main__":
474
  app.launch(server_name="0.0.0.0")
 
1
  """
2
+ Clone Vocal - Outil web de clonage vocal base sur Seed-VC (zero-shot).
3
+ Interface Gradio en francais, deploye sur HuggingFace Spaces avec ZeroGPU.
4
  """
5
 
6
  import os
 
11
 
12
  import gradio as gr
13
 
14
+ # Monkey-patch gradio_client to fix "argument of type 'bool' is not iterable"
 
15
  try:
16
  import gradio_client.utils as _gc_utils
17
 
 
42
  logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
43
  logger = logging.getLogger(__name__)
44
 
45
+ # Startup: clone Seed-VC
 
46
  logger.info("Initialisation de l'application...")
47
 
48
+ from pipeline.setup import setup_seed_vc
49
+ from pipeline.storage import init_storage, list_models, download_model, delete_model, get_reference_path
50
 
 
51
  try:
52
+ setup_seed_vc()
53
  except Exception as e:
54
+ logger.error("Erreur lors du setup: {}".format(e))
55
 
56
  # Initialize model storage
57
  HF_MODELS_REPO = os.environ.get("HF_MODELS_REPO", "")
58
  if HF_MODELS_REPO:
59
  init_storage(HF_MODELS_REPO)
60
+ logger.info("Stockage HuggingFace configure: {}".format(HF_MODELS_REPO))
 
 
 
 
 
61
 
62
+ # Import GPU-decorated functions for ZeroGPU detection
63
+ from pipeline.training import save_voice_reference, _gpu_warmup
 
64
  from pipeline.separation import separate_audio
65
  from pipeline.inference import convert_voice
66
 
67
 
68
+ # -- Training Tab --
69
 
70
+ def train_voice_model(audio_file, model_name, progress=gr.Progress()):
71
+ """Handler: save voice reference."""
72
  if audio_file is None:
73
  return "Erreur : Veuillez uploader un fichier audio.", None
74
 
75
  if not model_name or not model_name.strip():
76
+ return "Erreur : Veuillez entrer un nom pour le modele.", None
77
 
78
  model_name = model_name.strip().replace(" ", "_")
79
 
 
81
  progress(value, desc=desc)
82
 
83
  try:
84
+ progress(0.0, desc="Demarrage...")
85
+ pth_path, ref_path = save_voice_reference(
 
86
  audio_path=audio_file,
87
  model_name=model_name,
 
 
 
88
  progress_callback=progress_callback,
89
  )
90
 
91
+ return "Reference vocale '{}' sauvegardee avec succes !".format(model_name), ref_path
 
 
 
 
 
92
 
93
  except Exception as e:
94
  import traceback
95
  tb = traceback.format_exc()
96
+ logger.error("Erreur training: {}".format(tb))
97
+ return "Erreur : {}: {}\n\nDetails:\n{}".format(
98
+ type(e).__name__, str(e), tb[-500:]
99
+ ), None
100
 
101
 
102
+ # -- Conversion Tab --
103
 
104
  def get_model_choices():
105
  """Get list of trained model names for dropdown."""
106
  models = list_models()
107
  if not models:
108
+ return ["(aucun modele)"]
109
  return models
110
 
111
 
 
113
  model_choice,
114
  song_file,
115
  pitch,
116
+ diffusion_steps,
117
  vocal_volume,
118
  instrumental_volume,
119
  progress=gr.Progress(),
 
122
  if song_file is None:
123
  return "Erreur : Veuillez uploader un fichier audio.", None, None, None
124
 
125
+ if model_choice == "(aucun modele)" or not model_choice:
126
+ return "Erreur : Veuillez d'abord enregistrer une reference vocale.", None, None, None
127
 
128
  from pipeline.mixing import mix_audio
129
 
130
  try:
131
+ # Step 1: Download model / find reference audio
132
+ progress(0.05, desc="Chargement du modele...")
133
+ pth_path, ref_or_index = download_model(model_choice)
134
  if not pth_path:
135
+ return "Erreur : Modele '{}' introuvable.".format(model_choice), None, None, None
136
+
137
+ # Find the reference audio path
138
+ reference_path = get_reference_path(model_choice)
139
+ if not reference_path:
140
+ return "Erreur : Audio de reference introuvable pour '{}'.".format(model_choice), None, None, None
141
 
142
  # Step 2: Separate vocals from instruments
143
+ progress(0.10, desc="Separation des pistes (Demucs)...")
144
  vocals_path, instruments_path = separate_audio(song_file)
145
 
146
+ progress(0.40, desc="Conversion vocale (Seed-VC)...")
147
 
148
+ # Step 3: Convert vocals with Seed-VC
149
  converted_path = convert_voice(
150
  audio_path=vocals_path,
151
+ reference_path=reference_path,
 
152
  pitch=int(pitch),
153
+ diffusion_steps=int(diffusion_steps),
154
+ index_rate=0.7,
155
  )
156
 
157
+ progress(0.85, desc="Mixage final...")
158
 
159
  # Step 4: Mix converted vocals with instruments
160
  final_path = mix_audio(
 
164
  instrumental_volume=float(instrumental_volume),
165
  )
166
 
167
+ progress(1.0, desc="Termine !")
168
 
169
  return (
170
+ "Conversion terminee avec succes !",
171
+ vocals_path,
172
+ converted_path,
173
+ final_path,
174
  )
175
 
176
  except Exception as e:
177
  import traceback
178
  tb = traceback.format_exc()
179
+ logger.error("Erreur conversion: {}".format(tb))
180
+ return "Erreur : {}: {}\n\nDetails:\n{}".format(
181
+ type(e).__name__, str(e), tb[-500:]
182
+ ), None, None, None
183
 
184
 
185
+ # -- Models Tab --
186
 
187
  def refresh_models():
188
  """Refresh the model list as HTML."""
189
  models = list_models()
190
  if not models:
191
+ return "<p style='color:gray;'>Aucun modele enregistre</p>"
192
+ rows = "".join(
193
+ "<tr><td>{}</td><td>Disponible</td></tr>".format(m) for m in models
194
+ )
195
+ return (
196
+ "<table style='width:100%;border-collapse:collapse;'>"
197
+ "<tr><th style='text-align:left;border-bottom:1px solid #555;padding:8px;'>Nom</th>"
198
+ "<th style='text-align:left;border-bottom:1px solid #555;padding:8px;'>Statut</th></tr>"
199
+ "{}</table>".format(rows)
200
+ )
201
 
202
 
203
  def delete_selected_model(model_name_to_delete):
204
  """Delete a model."""
205
+ if not model_name_to_delete or model_name_to_delete == "(aucun modele)":
206
+ return "Veuillez selectionner un modele a supprimer.", refresh_models()
207
  try:
208
  delete_model(model_name_to_delete)
209
+ return "Modele '{}' supprime.".format(model_name_to_delete), refresh_models()
210
  except Exception as e:
211
+ return "Erreur : {}".format(e), refresh_models()
212
 
213
 
214
+ # -- Build Gradio UI --
215
 
216
  DESCRIPTION = """
217
+ # Clone Vocal
218
 
219
+ Outil de clonage vocal **zero-shot** base sur **Seed-VC** (Diffusion Transformer).
220
 
221
  **Comment utiliser :**
222
+ 1. **Onglet "Ma voix"** : Uploadez un court extrait de votre voix (3-30 sec) pour creer votre profil vocal
223
+ 2. **Onglet "Convertir"** : Uploadez un morceau de musique, l'outil remplace la voix par la votre
224
+ 3. **Onglet "Modeles"** : Gerez vos profils vocaux
225
 
226
+ > **Zero-shot** : pas d'entrainement necessaire ! Juste 3-30 secondes de votre voix suffisent.
 
227
  """
228
 
229
  with gr.Blocks(
230
+ title="Clone Vocal",
231
  theme=gr.themes.Soft(),
232
  ) as app:
233
 
234
  gr.Markdown(DESCRIPTION)
235
 
236
  with gr.Tabs():
237
+ # Tab 1: Voice Reference
238
+ with gr.TabItem("Ma voix"):
239
+ gr.Markdown("### Enregistrer votre reference vocale")
240
 
241
  with gr.Row():
242
  with gr.Column(scale=2):
243
  train_audio = gr.Audio(
244
+ label="Extrait de votre voix (WAV ou MP3, 3-30 secondes)",
245
  type="filepath",
246
  sources=["upload"],
247
  )
248
  train_model_name = gr.Textbox(
249
+ label="Nom du profil",
250
  placeholder="ex: ma_voix",
251
  max_lines=1,
252
  )
 
 
 
 
 
 
 
253
  train_btn = gr.Button(
254
+ "Sauvegarder",
255
  variant="primary",
256
  size="lg",
257
  )
 
260
  train_status = gr.Textbox(
261
  label="Statut",
262
  interactive=False,
263
+ lines=3,
264
  )
265
  train_download = gr.File(
266
+ label="Fichier de reference",
267
  interactive=False,
268
  )
269
 
270
  gr.Markdown(
271
  "**Conseils :**\n"
272
  "- Utilisez un enregistrement propre (pas de bruit de fond, pas de musique)\n"
273
+ "- Parlez ou chantez naturellement pendant 3 a 30 secondes\n"
274
+ "- Plus l'extrait est long et varie, meilleur sera le resultat\n"
275
+ "- Format WAV ou MP3 accepte"
276
  )
277
 
278
  train_btn.click(
279
  fn=train_voice_model,
280
+ inputs=[train_audio, train_model_name],
281
  outputs=[train_status, train_download],
282
  )
283
 
284
+ # Tab 2: Conversion
285
  with gr.TabItem("Convertir un morceau"):
286
+ gr.Markdown("### Remplacer la voix d'un morceau par la votre")
287
 
288
  with gr.Row():
289
  with gr.Column(scale=2):
290
  convert_model = gr.Dropdown(
291
  choices=get_model_choices(),
292
+ label="Profil vocal",
293
  interactive=True,
294
  )
295
+ refresh_btn = gr.Button("Rafraichir la liste", size="sm")
296
  convert_audio = gr.Audio(
297
+ label="Morceau a convertir (WAV ou MP3)",
298
  type="filepath",
299
  sources=["upload"],
300
  )
301
 
302
+ with gr.Accordion("Parametres avances", open=False):
303
  convert_pitch = gr.Slider(
304
+ minimum=-24,
305
+ maximum=24,
306
  value=0,
307
  step=1,
308
+ label="Transposition (demi-tons)",
309
  )
310
+ convert_diffusion = gr.Slider(
311
+ minimum=5,
312
+ maximum=50,
313
+ value=25,
314
+ step=5,
315
+ label="Qualite (plus haut = meilleure qualite, plus lent)",
316
  )
317
  convert_vocal_vol = gr.Slider(
318
  minimum=0.0,
 
341
  interactive=False,
342
  lines=3,
343
  )
344
+ gr.Markdown("**Apercu des pistes :**")
345
  preview_vocals = gr.Audio(
346
+ label="Voix originale (separee)",
347
  interactive=False,
348
  )
349
  preview_converted = gr.Audio(
350
  label="Voix convertie",
351
  interactive=False,
352
  )
353
+ gr.Markdown("**Resultat final :**")
354
  final_output = gr.Audio(
355
  label="Morceau final (voix + instruments)",
356
  interactive=False,
 
367
  convert_model,
368
  convert_audio,
369
  convert_pitch,
370
+ convert_diffusion,
371
  convert_vocal_vol,
372
  convert_inst_vol,
373
  ],
374
  outputs=[convert_status, preview_vocals, preview_converted, final_output],
375
  )
376
 
377
+ # Tab 3: Models
378
+ with gr.TabItem("Mes modeles"):
379
+ gr.Markdown("### Gerer vos profils vocaux")
380
 
381
  models_table = gr.HTML(
382
  value=refresh_models(),
383
+ label="Modeles enregistres",
384
  )
385
 
386
  with gr.Row():
387
+ models_refresh_btn = gr.Button("Rafraichir", size="sm")
388
  models_delete_name = gr.Dropdown(
389
  choices=get_model_choices(),
390
+ label="Modele a supprimer",
391
  interactive=True,
392
  )
393
  models_delete_btn = gr.Button("Supprimer", variant="stop", size="sm")
394
 
395
  models_delete_status = gr.Textbox(label="Statut", interactive=False)
396
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
397
  models_refresh_btn.click(
398
  fn=refresh_models,
399
  outputs=[models_table],
 
409
  outputs=[models_delete_status, models_table],
410
  )
411
 
412
 
413
  if __name__ == "__main__":
414
  app.launch(server_name="0.0.0.0")
pipeline/inference.py CHANGED
@@ -1,16 +1,17 @@
1
  """
2
- Voice conversion module: manual RVC v2 inference pipeline.
3
- Uses HuBERT feature extraction + FAISS retrieval + pretrained generator.
4
- The voice identity comes from the FAISS index (target voice embeddings),
5
- not from fine-tuning the generator.
6
  """
7
 
8
  import os
9
  import sys
10
  import logging
 
11
  import numpy as np
12
  import torch
13
- import torch.nn.functional as F
 
14
 
15
  logger = logging.getLogger(__name__)
16
 
@@ -24,228 +25,165 @@ except ImportError:
24
  return fn
25
  return decorator
26
 
27
- from pipeline.setup import APPLIO_DIR, ensure_applio_path
28
 
29
  OUTPUT_DIR = "/tmp/rvc_output"
30
 
31
- # Cache loaded models to avoid reloading on every call
32
- _cached_hubert = None
33
- _cached_generator = None
34
- _cached_rmvpe = None
35
 
36
 
37
- def _load_hubert(device):
38
- """Load ContentVec HuBERT model for feature extraction."""
39
- global _cached_hubert
40
- if _cached_hubert is not None:
41
- return _cached_hubert.to(device)
42
 
43
- ensure_applio_path()
44
- from rvc.lib.utils import load_embedding
45
 
46
- model = load_embedding("contentvec", None)
47
- model = model.to(device).float()
48
- model.requires_grad_(False)
49
- _cached_hubert = model
50
- logger.info("Loaded ContentVec HuBERT model.")
51
- return model
52
 
53
-
54
- def _load_generator(device, sample_rate=40000):
55
- """Load pretrained RVC v2 generator (Synthesizer)."""
56
- global _cached_generator
57
- if _cached_generator is not None:
58
- return _cached_generator.to(device)
59
-
60
- ensure_applio_path()
61
- from rvc.lib.algorithm.synthesizers import Synthesizer
62
-
63
- sr_prefix = str(sample_rate)[:2]
64
- model_path = os.path.join(
65
- APPLIO_DIR, "rvc", "models", "pretraineds", "hifi-gan",
66
- "f0G{}k.pth".format(sr_prefix),
67
- )
68
-
69
- if not os.path.exists(model_path):
70
- raise RuntimeError("Pretrained generator not found: {}".format(model_path))
71
-
72
- cpt = torch.load(model_path, map_location="cpu", weights_only=False)
73
-
74
- # Training checkpoint has "model" key, inference format has "weight" key
75
- weights = cpt.get("weight", cpt.get("model", cpt))
76
-
77
- # Read config from Applio config files
78
- import json
79
- config_path = os.path.join(APPLIO_DIR, "configs", "v2", "{}k.json".format(sr_prefix))
80
- if os.path.exists(config_path):
81
- with open(config_path) as f:
82
- cfg = json.load(f)
83
- config_args = [
84
- cfg["data"]["filter_length"] // 2 + 1,
85
- cfg["train"]["segment_size"] // cfg["data"]["hop_length"],
86
- cfg["model"]["inter_channels"],
87
- cfg["model"]["hidden_channels"],
88
- cfg["model"]["filter_channels"],
89
- cfg["model"]["n_heads"],
90
- cfg["model"]["n_layers"],
91
- cfg["model"]["kernel_size"],
92
- cfg["model"]["p_dropout"],
93
- cfg["model"]["resblock"],
94
- cfg["model"]["resblock_kernel_sizes"],
95
- cfg["model"]["resblock_dilation_sizes"],
96
- cfg["model"]["upsample_rates"],
97
- cfg["model"]["upsample_initial_channel"],
98
- cfg["model"]["upsample_kernel_sizes"],
99
- cfg["model"]["spk_embed_dim"],
100
- cfg["model"]["gin_channels"],
101
- cfg["data"]["sampling_rate"],
102
- ]
103
- logger.info("Loaded generator config from Applio.")
104
- else:
105
- # Fallback: standard RVC v2 40k config
106
- config_args = [
107
- 1025, 32, 192, 192, 768, 2, 6, 3, 0, "1",
108
- [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
109
- [10, 10, 2, 2], 512, [16, 16, 4, 4], 109, 256, 40000,
110
- ]
111
-
112
- net_g = Synthesizer(*config_args, use_f0=True)
113
- net_g.load_state_dict(weights, strict=False)
114
- net_g.requires_grad_(False)
115
- net_g.to(device)
116
- _cached_generator = net_g
117
- logger.info("Loaded pretrained RVC generator.")
118
- return net_g
119
-
120
-
121
- def _extract_f0(audio_np, sr, device):
122
- """Extract F0 using RMVPE. Returns f0 numpy array."""
123
- global _cached_rmvpe
124
-
125
- ensure_applio_path()
126
-
127
- rmvpe_path = os.path.join(
128
- APPLIO_DIR, "rvc", "models", "predictors", "rmvpe.pt"
129
- )
130
-
131
- if os.path.exists(rmvpe_path):
132
- try:
133
- from rvc.lib.predictors.RMVPE import RMVPE0Predictor
134
-
135
- if _cached_rmvpe is None:
136
- _cached_rmvpe = RMVPE0Predictor(rmvpe_path, device=device)
137
- logger.info("Loaded RMVPE predictor.")
138
-
139
- f0 = _cached_rmvpe.infer_from_audio(audio_np, sample_rate=sr, thred=0.03)
140
- return f0
141
- except Exception as e:
142
- logger.warning("RMVPE failed ({}), using torchcrepe fallback.".format(e))
143
-
144
- # Fallback: torchcrepe
145
- import torchcrepe
146
- import librosa
147
-
148
- audio_16k = librosa.resample(audio_np, orig_sr=sr, target_sr=16000) if sr != 16000 else audio_np
149
- audio_t = torch.from_numpy(audio_16k).float().unsqueeze(0).to(device)
150
-
151
- f0 = torchcrepe.predict(
152
- audio_t, 16000, hop_length=160,
153
- fmin=50, fmax=1100, model="full", device=device,
154
  )
155
- return f0[0].cpu().numpy()
156
 
157
 
158
- def _quantize_f0(f0):
159
- """Quantize F0 to mel-scale buckets (1-255). 0 = unvoiced."""
160
- f0_mel = 1127.0 * np.log(1.0 + f0 / 700.0)
161
- f0_mel_min = 1127.0 * np.log(1.0 + 1.0 / 700.0)
162
- f0_mel_max = 1127.0 * np.log(1.0 + 1100.0 / 700.0)
163
 
164
- f0_coarse = np.copy(f0_mel)
165
- voiced = f0_coarse > 0
166
- f0_coarse[voiced] = (
167
- (f0_coarse[voiced] - f0_mel_min) * 254.0 / (f0_mel_max - f0_mel_min) + 1.0
168
  )
169
- f0_coarse = np.clip(f0_coarse, 0, 255).astype(np.int64)
170
- f0_coarse[~voiced] = 0
171
- return f0_coarse
172
-
173
-
174
- def _faiss_retrieval(feats, index_path, big_npy_path, index_rate, device):
175
- """
176
- Retrieve target voice features from FAISS index and blend with source.
177
- This is the core of retrieval-based voice conversion: the voice identity
178
- comes from replacing source embeddings with target voice embeddings.
179
- """
180
- import faiss
181
-
182
- index = faiss.read_index(index_path)
183
-
184
- if index.ntotal == 0:
185
- logger.warning("FAISS index is empty, skipping retrieval.")
186
- return feats
187
 
188
- # Load precomputed embeddings array
189
- if big_npy_path and os.path.exists(big_npy_path):
190
- big_npy = np.load(big_npy_path)
191
  else:
192
- # Reconstruct from index (works for IndexFlatL2)
193
- logger.info("No big_npy file found, reconstructing from index...")
194
- dim = feats.shape[2]
195
- big_npy = np.zeros((index.ntotal, dim), dtype=np.float32)
196
- try:
197
- for i in range(index.ntotal):
198
- big_npy[i] = index.reconstruct(i)
199
- except RuntimeError:
200
- logger.warning("Cannot reconstruct vectors from index, skipping retrieval.")
201
- return feats
202
-
203
- npy = feats[0].cpu().numpy().astype(np.float32)
204
-
205
- # Search k=8 nearest neighbors for each frame
206
- score, ix = index.search(npy, k=8)
207
-
208
- # Weight by inverse square distance
209
- weight = np.square(1.0 / (score + 1e-6))
210
- weight /= weight.sum(axis=1, keepdims=True)
211
-
212
- # Weighted combination of nearest neighbor embeddings
213
- retrieved = np.sum(big_npy[ix] * np.expand_dims(weight, axis=2), axis=1)
214
-
215
- # Blend retrieved (target voice) with source features
216
- retrieved_t = torch.from_numpy(retrieved).unsqueeze(0).to(device).float()
217
- blended = index_rate * retrieved_t + (1.0 - index_rate) * feats
218
-
219
- logger.info(
220
- "FAISS retrieval done: {} vectors, index_rate={}".format(
221
- index.ntotal, index_rate
222
- )
223
- )
224
- return blended
225
 
226
 
227
- @spaces.GPU(duration=60)
228
  def convert_voice(
229
  audio_path,
230
- model_path,
231
  index_path=None,
232
  pitch=0,
233
  f0_method="rmvpe",
234
- index_rate=0.75,
235
  protect=0.33,
236
  volume_envelope=1.0,
237
  output_format="WAV",
 
238
  ):
239
  """
240
- Convert voice using the full RVC v2 pipeline:
241
- 1. Extract HuBERT features from source audio
242
- 2. Retrieve target voice features from FAISS index
243
- 3. Extract F0 pitch and apply shift
244
- 4. Run pretrained generator to synthesize converted audio
245
 
246
  Returns path to converted audio file.
247
  """
248
- import librosa
249
  import soundfile as sf
250
 
251
  os.makedirs(OUTPUT_DIR, exist_ok=True)
@@ -253,92 +191,171 @@ def convert_voice(
253
  output_path = os.path.join(OUTPUT_DIR, "{}_converted.wav".format(base_name))
254
 
255
  device = "cuda" if torch.cuda.is_available() else "cpu"
256
- logger.info("Converting voice on {}: {}".format(device, audio_path))
257
- logger.info("Index: {}, Pitch: {}, Index rate: {}".format(index_path, pitch, index_rate))
258
-
259
- ensure_applio_path()
260
-
261
- # Load source audio at 16kHz for HuBERT and F0
262
- audio_16k, _ = librosa.load(audio_path, sr=16000, mono=True)
263
- logger.info("Source audio: {:.1f}s".format(len(audio_16k) / 16000))
264
-
265
- if len(audio_16k) < 16000 * 0.5:
266
- raise RuntimeError("Audio source trop court pour la conversion (< 0.5s).")
267
-
268
- # ---- Step 1: Extract HuBERT features ----
269
- hubert = _load_hubert(device)
270
-
271
- feats_input = torch.from_numpy(audio_16k).float().view(1, -1).to(device)
272
- with torch.no_grad():
273
- feats = hubert(feats_input)["last_hidden_state"] # (1, T_50hz, 768)
274
 
275
- # Upsample 2x to match F0 frame rate (50Hz -> 100Hz)
276
- feats = F.interpolate(
277
- feats.permute(0, 2, 1), scale_factor=2
278
- ).permute(0, 2, 1) # (1, T_100hz, 768)
279
 
280
- # Keep a copy for protect blending
281
- feats0 = feats.clone()
282
 
283
- # ---- Step 2: FAISS retrieval ----
284
- if index_path and os.path.exists(index_path):
285
- big_npy_path = index_path.replace(".index", "_big_npy.npy")
286
- feats = _faiss_retrieval(feats, index_path, big_npy_path, index_rate, device)
287
 
288
- # Apply protect: blend original features for consonants/unvoiced parts
289
- if protect < 0.5 and feats0 is not None:
290
- feats = protect * feats0 + (1.0 - protect) * feats
291
 
292
- # ---- Step 3: Extract F0 ----
293
- f0 = _extract_f0(audio_16k, 16000, device)
294
 
295
- # Apply pitch shift (in semitones)
296
- if pitch != 0:
297
- f0 = f0.copy()
298
- voiced = f0 > 0
299
- f0[voiced] *= 2.0 ** (pitch / 12.0)
300
-
301
- # ---- Step 4: Match lengths ----
302
- # Target: 100Hz frame rate = 16000 / 160 = 100 frames/sec
303
- p_len = len(audio_16k) // 160
304
- p_len = min(p_len, feats.shape[1])
305
-
306
- # Interpolate F0 to match p_len if needed
307
- if len(f0) != p_len:
308
- f0 = np.interp(
309
- np.linspace(0, len(f0) - 1, p_len),
310
- np.arange(len(f0)),
311
- f0,
312
  )
313
 
314
- # Trim features to p_len
315
- feats = feats[:, :p_len, :]
316
-
317
- # Quantize F0 and convert to tensors
318
- f0_coarse = _quantize_f0(f0)
319
- pitch_t = torch.tensor(f0_coarse, device=device).unsqueeze(0).long()
320
- pitchf_t = torch.tensor(f0, device=device).unsqueeze(0).float()
321
- p_len_t = torch.tensor([p_len], device=device).long()
322
- sid = torch.tensor([0], device=device).long()
323
 
324
- # ---- Step 5: Generator inference ----
325
- net_g = _load_generator(device, sample_rate=40000)
326
-
327
- with torch.no_grad():
328
- result = net_g.infer(feats.float(), p_len_t, pitch_t, pitchf_t, sid)
329
- audio_out = result[0][0, 0].data.cpu().float().numpy()
330
 
331
- # ---- Step 6: Post-processing ----
332
  # Normalize
333
  audio_max = np.abs(audio_out).max()
334
  if audio_max > 0.01:
335
  audio_out = audio_out / audio_max * 0.95
336
 
337
- # Resample 40kHz -> 44.1kHz for standard output
338
- audio_44k = librosa.resample(audio_out, orig_sr=40000, target_sr=44100)
339
-
340
- # Save as WAV 16-bit
341
- sf.write(output_path, audio_44k, 44100, subtype="PCM_16")
342
 
343
- logger.info("Conversion complete: {} ({:.1f}s)".format(output_path, len(audio_44k) / 44100))
344
  return output_path
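
Both versions of `convert_voice` in this diff shift pitch in semitones on voiced frames only, and the Seed-VC version additionally moves the source's median log-F0 onto the reference's before applying the shift. A minimal pure-Python sketch of that arithmetic follows. The function names `align_median_f0` and `shift_semitones` and the constant `VOICED_THRESHOLD_HZ` are hypothetical; the commit itself works on torch tensors, and `statistics.median_low` is used here on the assumption that it matches the lower-middle behaviour of `torch.median` for even-length input.

```python
import math
import statistics

VOICED_THRESHOLD_HZ = 1.0  # frames at or below this are treated as unvoiced (F0 > 1 in the diff)


def align_median_f0(source_f0, reference_f0):
    """Shift the source contour so its median log-F0 equals the reference's."""
    src_logs = [math.log(f) for f in source_f0 if f > VOICED_THRESHOLD_HZ]
    ref_logs = [math.log(f) for f in reference_f0 if f > VOICED_THRESHOLD_HZ]
    if not src_logs or not ref_logs:
        # Nothing to align against: keep voiced frames, zero the unvoiced ones.
        return [f if f > VOICED_THRESHOLD_HZ else 0.0 for f in source_f0]
    offset = statistics.median_low(ref_logs) - statistics.median_low(src_logs)
    return [
        math.exp(math.log(f) + offset) if f > VOICED_THRESHOLD_HZ else 0.0
        for f in source_f0
    ]


def shift_semitones(f0, semitones):
    """Scale voiced frames by 2^(semitones / 12); unvoiced frames stay at 0."""
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0 else 0.0 for f in f0]


if __name__ == "__main__":
    source = [0.0, 100.0, 200.0, 400.0]      # Hz, 0.0 marks an unvoiced frame
    reference = [220.0, 440.0, 880.0, 0.0]
    aligned = align_median_f0(source, reference)
    print(shift_semitones(aligned, 12))       # one octave up after alignment
```

Working in the log domain makes the alignment a constant offset, so musical intervals in the source melody are preserved while the overall register moves toward the reference voice.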
 
1
  """
2
+ Voice conversion module using Seed-VC (zero-shot diffusion transformer).
3
+ No training needed - just reference audio + source audio.
4
+ Uses the singing voice conversion model with F0 conditioning.
 
5
  """
6
 
7
  import os
8
  import sys
9
  import logging
10
+ import argparse
11
  import numpy as np
12
  import torch
13
+ import torchaudio
14
+ import librosa
15
 
16
  logger = logging.getLogger(__name__)
17
 
 
25
  return fn
26
  return decorator
27
 
28
+ from pipeline.setup import SEED_VC_DIR, ensure_seed_vc_path
29
 
30
  OUTPUT_DIR = "/tmp/rvc_output"
31
 
32
+ # Cached models (loaded once, reused across calls)
33
+ _model_cache = {}
34
 
35
 
36
+ def _load_seed_vc_models(device):
37
+ """Load Seed-VC singing voice conversion models."""
38
+ if "model" in _model_cache:
39
+ return _model_cache
 
40
 
41
+ ensure_seed_vc_path()
 
42
 
43
+ # Import Seed-VC's model loading utilities
44
+ from modules.commons import recursive_munch, build_model, load_checkpoint
45
+ from hf_utils import load_custom_model_from_hf
46
+ import yaml
47
 
48
+ # Load the singing model (F0-conditioned, whisper-base, 44kHz, BigVGAN)
49
+ dit_checkpoint_path, dit_config_path = load_custom_model_from_hf(
50
+ "Plachta/Seed-VC",
51
+ "DiT_seed_v2_uvit_whisper_base_f0_44k_bigvgan_pruned_ft_ema_v2.pth",
52
+ "config_dit_mel_seed_uvit_whisper_base_f0_44k.yml",
53
  )
 
54
 
55
+ with open(dit_config_path, "r") as f:
56
+ config = yaml.safe_load(f)
57
 
58
+ model_params = recursive_munch(config["model_params"])
59
+ model = build_model(model_params, stage="DiT")
60
 
61
+ # Load checkpoint
62
+ model, _, _, _ = load_checkpoint(
63
+ model, None, dit_checkpoint_path,
64
+ load_only_params=True, ignore_modules=[], is_distributed=False,
65
  )
66
+ for key in model:
67
+ model[key].eval()
68
+ model[key].to(device)
69
+
70
+ # FP16 for efficiency
71
+ for key in model:
72
+ if hasattr(model[key], "half"):
73
+ model[key] = model[key].half()
74
+
75
+ # Load speech tokenizer (Whisper)
76
+ from modules.speech_tokenizers.whisper.whisper_enc import WhisperSpeechEncoder
77
+ speech_tokenizer_type = config.get("model_params", {}).get(
78
+ "speech_tokenizer", {}
79
+ ).get("type", "whisper")
80
+
81
+ whisper_name = model_params.speech_tokenizer.get("name", "whisper-small")
82
+ whisper_model = WhisperSpeechEncoder.load_model(whisper_name).to(device).eval()
83
+ if hasattr(whisper_model, "half"):
84
+ whisper_model = whisper_model.half()
85
+
86
+ def semantic_fn(waves_16k):
87
+ wav = waves_16k.to(device).half()
88
+ if wav.dim() == 1:
89
+ wav = wav.unsqueeze(0)
90
+ with torch.no_grad():
91
+ return whisper_model.extract_features(wav)
92
+
93
+ # Load vocoder (BigVGAN)
94
+ vocoder_type = config.get("model_params", {}).get("vocoder", {}).get("type", "bigvgan")
95
+ if vocoder_type == "bigvgan":
96
+ from modules.bigvgan import bigvgan
97
+ vocoder_path = os.path.join(SEED_VC_DIR, "modules", "bigvgan")
98
+ vocoder = bigvgan.BigVGAN.from_pretrained(
99
+ "nvidia/bigvgan_v2_44khz_128band_512x", use_cuda_kernel=False
100
+ )
101
+ vocoder = vocoder.to(device).eval()
102
+ if hasattr(vocoder, "half"):
103
+ vocoder = vocoder.half()
104
 
105
+ def vocoder_fn(mel):
106
+ with torch.no_grad():
107
+ return vocoder(mel.half())
108
  else:
109
+ from modules.vocoder import load_vocoder
110
+ vocoder = load_vocoder(vocoder_type, config).to(device).eval()
111
 
112
+ def vocoder_fn(mel):
113
+ with torch.no_grad():
114
+ return vocoder(mel)
115
 
116
+ # Load CAMPPlus speaker embedding model
117
+ from modules.campplus.DTDNN import CAMPPlus
118
+ campplus_ckpt_path = load_custom_model_from_hf(
119
+ "funasr/campplus", "campplus_cn_common.bin", config_filename=None
120
+ )
121
+ if isinstance(campplus_ckpt_path, tuple):
122
+ campplus_ckpt_path = campplus_ckpt_path[0]
123
+ campplus_model = CAMPPlus(feat_dim=80, embedding_size=192)
124
+ campplus_model.load_state_dict(torch.load(campplus_ckpt_path, map_location="cpu"))
125
+ campplus_model = campplus_model.to(device).eval().half()
126
+
127
+ # Load F0 extractor (RMVPE)
128
+ from modules.rmvpe import RMVPE
129
+
130
+ rmvpe_path = load_custom_model_from_hf("lj1995/VoiceConversionWebUI", "rmvpe.pt", config_filename=None)
131
+ if isinstance(rmvpe_path, tuple):
132
+ rmvpe_path = rmvpe_path[0]
133
+ f0_extractor = RMVPE(rmvpe_path, is_half=True, device=device)
134
+
135
+ def f0_fn(wav, thred=0.03):
136
+ return f0_extractor.infer_from_audio(wav, thred=thred)
137
+
138
+ # Mel spectrogram config
139
+ from modules.commons import build_mel_fn
140
+ mel_fn_args = config["preprocess_params"]["spect_params"]
141
+ to_mel = build_mel_fn(mel_fn_args)
142
+ sr = config["preprocess_params"]["sr"]
143
+ hop_length = mel_fn_args["hop_length"]
144
+
145
+ _model_cache.update({
146
+ "model": model,
147
+ "semantic_fn": semantic_fn,
148
+ "vocoder_fn": vocoder_fn,
149
+ "campplus_model": campplus_model,
150
+ "f0_fn": f0_fn,
151
+ "to_mel": to_mel,
152
+ "sr": sr,
153
+ "hop_length": hop_length,
154
+ "device": device,
155
+ "max_context_window": model_params.DiT.max_context_window,
156
+ "overlap_frame_len": 16,
157
+ })
158
+
159
+ logger.info(f"Seed-VC models loaded (sr={sr}, hop={hop_length})")
160
+ return _model_cache
161
+
162
+
163
+ @spaces.GPU(duration=120)
164
  def convert_voice(
165
  audio_path,
166
+ reference_path,
167
  index_path=None,
168
  pitch=0,
169
  f0_method="rmvpe",
170
+ index_rate=0.7,
171
  protect=0.33,
172
  volume_envelope=1.0,
173
  output_format="WAV",
174
+ diffusion_steps=25,
175
  ):
176
  """
177
+ Convert voice using Seed-VC zero-shot singing voice conversion.
178
+
179
+ Args:
180
+ audio_path: Path to source vocals (separated by Demucs)
181
+ reference_path: Path to reference voice audio (3-30 sec)
182
+ pitch: Semitone shift (-24 to +24)
183
+ diffusion_steps: Quality vs speed trade-off (10=fast, 30=quality)
184
 
185
  Returns path to converted audio file.
186
  """
 
187
  import soundfile as sf
188
 
189
  os.makedirs(OUTPUT_DIR, exist_ok=True)
 
191
  output_path = os.path.join(OUTPUT_DIR, "{}_converted.wav".format(base_name))
192
 
193
  device = "cuda" if torch.cuda.is_available() else "cpu"
194
+ logger.info("Converting voice with Seed-VC on {}".format(device))
195
+ logger.info("Source: {}, Reference: {}, Pitch: {}".format(audio_path, reference_path, pitch))
196
+
197
+ # Load models
198
+ cache = _load_seed_vc_models(device)
199
+ model = cache["model"]
200
+ semantic_fn = cache["semantic_fn"]
201
+ vocoder_fn = cache["vocoder_fn"]
202
+ campplus_model = cache["campplus_model"]
203
+ f0_fn = cache["f0_fn"]
204
+ to_mel = cache["to_mel"]
205
+ sr = cache["sr"]
206
+ hop_length = cache["hop_length"]
207
+ max_context_window = cache["max_context_window"]
208
+ overlap_frame_len = cache["overlap_frame_len"]
209
+
210
+ # Load source audio
211
+ source_audio = librosa.load(audio_path, sr=sr)[0]
212
+ source_audio = torch.tensor(source_audio).unsqueeze(0).float().to(device)
213
+
214
+ # Load reference audio
215
+ ref_audio = librosa.load(reference_path, sr=sr)[0]
216
+ # Limit reference to 30 seconds
217
+ max_ref_samples = 30 * sr
218
+ if len(ref_audio) > max_ref_samples:
219
+ ref_audio = ref_audio[:max_ref_samples]
220
+ ref_audio = torch.tensor(ref_audio).unsqueeze(0).float().to(device)
221
+
222
+ # Resample to 16kHz for speech tokenizer
223
+ source_16k = torchaudio.functional.resample(source_audio, sr, 16000)
224
+ ref_16k = torchaudio.functional.resample(ref_audio, sr, 16000)
225
+
226
+ # Extract semantic tokens
227
+ S_alt = semantic_fn(source_16k[0])
228
+ S_ori = semantic_fn(ref_16k[0])
229
+
230
+ # Extract mel spectrograms
231
+ mel_source = to_mel(source_audio.to(device))
232
+ mel_ref = to_mel(ref_audio.to(device))
233
+ target_lengths = torch.LongTensor([mel_ref.size(2)]).to(device)
234
+
235
+ # Extract speaker embedding from reference
236
+ feat_ref = torchaudio.compliance.kaldi.fbank(
237
+ ref_16k[0].unsqueeze(0) if ref_16k.dim() == 2 else ref_16k,
238
+ num_mel_bins=80, sample_frequency=16000,
239
+ dither=0, window_type="hamming",
240
+ )
241
+ feat_ref = feat_ref - feat_ref.mean(dim=0, keepdim=True)
242
+ style_ref = campplus_model(feat_ref.unsqueeze(0).half().to(device))
243
 
244
+ # Extract F0 for singing
245
+ F0_ori = f0_fn(ref_16k[0].cpu().numpy(), thred=0.03)
246
+ F0_alt = f0_fn(source_16k[0].cpu().numpy(), thred=0.03)
 
247
 
248
+ F0_ori = torch.tensor(F0_ori).to(device).float()
249
+ F0_alt = torch.tensor(F0_alt).to(device).float()
250
 
251
+ # Auto-adjust F0 to match reference pitch range
252
+ voiced_ori = F0_ori > 1
253
+ voiced_alt = F0_alt > 1
 
254
 
255
+ log_f0_alt = torch.zeros_like(F0_alt)
256
+ log_f0_alt[voiced_alt] = torch.log(F0_alt[voiced_alt])
 
257
 
258
+ shifted_log_f0_alt = log_f0_alt.clone()
 
259
 
260
+ if voiced_ori.any() and voiced_alt.any():
261
+ median_log_f0_ori = torch.log(F0_ori[voiced_ori]).median()
262
+ median_log_f0_alt = log_f0_alt[voiced_alt].median()
263
+ shifted_log_f0_alt[voiced_alt] = (
264
+ log_f0_alt[voiced_alt] - median_log_f0_alt + median_log_f0_ori
265
  )
266
 
267
+ shifted_f0_alt = torch.zeros_like(F0_alt)
268
+ shifted_f0_alt[voiced_alt] = torch.exp(shifted_log_f0_alt[voiced_alt])
269
 
270
+ # Apply semitone pitch shift
271
+ if pitch != 0:
272
+ factor = 2.0 ** (pitch / 12.0)
273
+ shifted_f0_alt[voiced_alt] = shifted_f0_alt[voiced_alt] * factor
274
+
275
+ # Process in chunks with crossfading
276
+ cond = model["DiT"].prepare_concat(S_alt, mel_source)
277
+
278
+ # Prepare F0 conditioning
279
+ max_source_window = max_context_window - mel_ref.size(2)
280
+ overlap_wave_len = overlap_frame_len * hop_length
281
+
282
+ # Interpolate F0 to match mel frames
283
+ n_mel_frames = cond.size(1)
284
+ if len(shifted_f0_alt) != n_mel_frames:
285
+ shifted_f0_alt_interp = torch.nn.functional.interpolate(
286
+ shifted_f0_alt.unsqueeze(0).unsqueeze(0),
287
+ size=n_mel_frames, mode="nearest",
288
+ ).squeeze()
289
+ else:
290
+ shifted_f0_alt_interp = shifted_f0_alt
291
+
292
+ # Generate in chunks
293
+ generated_wave_chunks = []
294
+ processed_frames = 0
295
+
296
+ while processed_frames < cond.size(1):
297
+ chunk_end = min(processed_frames + max_source_window, cond.size(1))
298
+ chunk_cond = cond[:, processed_frames:chunk_end]
299
+ chunk_f0 = shifted_f0_alt_interp[processed_frames:chunk_end]
300
+
301
+ # Concatenate reference mel with source chunk
302
+ cat_condition = torch.cat([mel_ref, chunk_cond], dim=2)
303
+ cat_f0 = torch.cat([
304
+ torch.zeros(mel_ref.size(2)).to(device),
305
+ chunk_f0,
306
+ ])
307
+
308
+ with torch.no_grad():
309
+ vc_target = model["cfm"].inference(
310
+ cat_condition.half(),
311
+ torch.LongTensor([cat_condition.size(2)]).to(device),
312
+ mel_ref.half(),
313
+ style_ref,
314
+ cat_f0.unsqueeze(0).half(),
315
+ diffusion_steps,
316
+ inference_cfg_rate=index_rate,
317
+ )
318
+ vc_target = vc_target[:, :, mel_ref.size(2):]
319
+
320
+ # Vocoder
321
+ vc_wave = vocoder_fn(vc_target.float())
322
+
323
+ if generated_wave_chunks:
324
+ # Crossfade with previous chunk
325
+ prev = generated_wave_chunks[-1]
326
+ if overlap_wave_len > 0 and len(prev) >= overlap_wave_len:
327
+ cross_len = min(overlap_wave_len, vc_wave.size(-1))
328
+ fade_out = np.linspace(1, 0, cross_len)
329
+ fade_in = np.linspace(0, 1, cross_len)
330
+ prev_np = prev  # chunks are stored as NumPy arrays, so no conversion is needed
331
+ new_np = vc_wave[0].cpu().float().numpy()
332
+ prev_np[-cross_len:] = (
333
+ prev_np[-cross_len:] * fade_out + new_np[:cross_len] * fade_in
334
+ )
335
+ generated_wave_chunks.append(new_np[cross_len:])
336
+ else:
337
+ generated_wave_chunks.append(vc_wave[0].cpu().float().numpy())
338
+ else:
339
+ generated_wave_chunks.append(vc_wave[0].cpu().float().numpy())
340
+
341
+ if chunk_end >= cond.size(1):
342
+ break  # final chunk done; stepping back by the overlap here would repeat it forever
343
+ processed_frames = max(chunk_end - overlap_frame_len, processed_frames + 1)
344
+
345
+ # Concatenate all chunks
346
+ audio_out = np.concatenate(generated_wave_chunks)
347
 
 
348
  # Normalize
349
  audio_max = np.abs(audio_out).max()
350
  if audio_max > 0.01:
351
  audio_out = audio_out / audio_max * 0.95
352
 
353
+ # Resample to 44.1kHz if needed and save
354
+ if sr != 44100:
355
+ audio_out = librosa.resample(audio_out, orig_sr=sr, target_sr=44100)
356
 
357
+ sf.write(output_path, audio_out, 44100, subtype="PCM_16")
358
+ logger.info("Conversion complete: {} ({:.1f}s)".format(
359
+ output_path, len(audio_out) / 44100
360
+ ))
361
  return output_path
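
The chunked generation loop above splits the source into overlapping windows and blends neighbouring outputs with linear fades. A minimal sketch of that pattern on plain Python lists follows, with an explicit stopping condition for the final window. `chunk_bounds` and `crossfade_concat` are hypothetical helper names, not part of Seed-VC or this commit, and the sketch overlaps in the same unit it chunks in, whereas the commit chunks in mel frames and crossfades in waveform samples (`overlap_frame_len * hop_length`).

```python
def chunk_bounds(total_frames, window, overlap):
    """Return (start, end) ranges covering [0, total_frames).

    Consecutive ranges share `overlap` frames. The loop stops as soon as a
    range reaches the end, so it terminates whenever window > overlap.
    """
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    bounds = []
    start = 0
    while start < total_frames:
        end = min(start + window, total_frames)
        bounds.append((start, end))
        if end >= total_frames:
            break  # final chunk: do not step back by the overlap again
        start = end - overlap
    return bounds


def crossfade_concat(chunks, overlap):
    """Join chunks whose neighbours share `overlap` samples using linear fades."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        n = min(overlap, len(out), len(chunk))
        for i in range(n):
            # Same ramp as np.linspace(0, 1, n): 0 at the first shared sample.
            fade_in = i / (n - 1) if n > 1 else 0.0
            tail = len(out) - n + i
            out[tail] = out[tail] * (1.0 - fade_in) + chunk[i] * fade_in
        out.extend(chunk[n:])
    return out


if __name__ == "__main__":
    signal = [float(i) for i in range(10)]
    bounds = chunk_bounds(len(signal), window=4, overlap=1)
    pieces = [signal[s:e] for s, e in bounds]
    print(bounds)
    print(crossfade_concat(pieces, overlap=1))
```

Because the shared region is blended rather than appended twice, the joined output has the same length as the input, which keeps the converted vocal aligned with the instrumental track during the final mix.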
pipeline/setup.py CHANGED
@@ -1,5 +1,6 @@
1
  """
2
- Setup module: clones Applio at startup and downloads pretrained models.
 
3
  """
4
 
5
  import os
@@ -9,134 +10,50 @@ import logging
9
 
10
  logger = logging.getLogger(__name__)
11
 
12
- APPLIO_DIR = "/tmp/Applio"
13
- APPLIO_REPO = "https://github.com/IAHispano/Applio.git"
14
 
15
- # Pretrained model URLs from HuggingFace
16
- HF_BASE_URL = "https://huggingface.co/IAHispano/Applio/resolve/main/Resources"
17
 
18
- REQUIRED_MODELS = {
19
- # Pretrained v2 (HiFi-GAN) for 40k sample rate
20
- "rvc/models/pretraineds/hifi-gan/f0G40k.pth": "pretrained_v2/f0G40k.pth",
21
- "rvc/models/pretraineds/hifi-gan/f0D40k.pth": "pretrained_v2/f0D40k.pth",
22
- # RMVPE pitch extractor
23
- "rvc/models/predictors/rmvpe.pt": "predictors/rmvpe.pt",
24
- # ContentVec embedder
25
- "rvc/models/embedders/contentvec/pytorch_model.bin": "embedders/contentvec/pytorch_model.bin",
26
- "rvc/models/embedders/contentvec/config.json": "embedders/contentvec/config.json",
27
- }
28
-
29
-
30
- def clone_applio():
31
- """Clone Applio repository if not already present."""
32
- if os.path.exists(os.path.join(APPLIO_DIR, "core.py")):
33
- logger.info("Applio already cloned.")
34
  return True
35
 
36
- logger.info("Cloning Applio repository...")
37
  try:
38
  subprocess.run(
39
- ["git", "clone", "--depth", "1", APPLIO_REPO, APPLIO_DIR],
40
- check=True,
41
- capture_output=True,
42
- text=True,
43
  )
44
- logger.info("Applio cloned successfully.")
45
  return True
46
  except subprocess.CalledProcessError as e:
47
- logger.error(f"Failed to clone Applio: {e.stderr}")
48
  return False
49
 
50
 
51
- def download_pretrained(local_path, remote_path):
52
- """Download a single pretrained model file if not present."""
53
- full_path = os.path.join(APPLIO_DIR, local_path)
54
- if os.path.exists(full_path):
55
- return True
56
 
57
- os.makedirs(os.path.dirname(full_path), exist_ok=True)
58
- url = f"{HF_BASE_URL}/{remote_path}"
59
 
60
- logger.info(f"Downloading {remote_path}...")
61
- try:
62
- import requests
63
 
64
- response = requests.get(url, stream=True, timeout=(10, 120))
65
- response.raise_for_status()
66
- with open(full_path, "wb") as f:
67
- for chunk in response.iter_content(chunk_size=8192):
68
- f.write(chunk)
69
- logger.info(f"Downloaded {remote_path}")
70
- return True
71
  except Exception as e:
72
- logger.error(f"Failed to download {remote_path}: {e}")
73
- return False
74
-
75
-
76
- def create_mute_files():
77
- """Create mute audio files needed for training filelist generation."""
78
- import numpy as np
79
- from scipy.io import wavfile
80
-
81
- sample_rate = 40000
82
- mute_dir = os.path.join(APPLIO_DIR, "logs", "mute")
83
-
84
- for subdir in ["sliced_audios", "sliced_audios_16k", "f0", "f0_voiced", "extracted"]:
85
- os.makedirs(os.path.join(mute_dir, subdir), exist_ok=True)
86
 
87
- # Create mute wav files
88
- duration_samples = int(sample_rate * 0.4)
89
- mute_audio = np.zeros(duration_samples, dtype=np.float32)
90
-
91
- wavfile.write(
92
- os.path.join(mute_dir, "sliced_audios", f"mute{sample_rate}.wav"),
93
- sample_rate,
94
- mute_audio,
95
- )
96
- wavfile.write(
97
- os.path.join(mute_dir, "sliced_audios_16k", f"mute{16000}.wav"),
98
- 16000,
99
- np.zeros(int(16000 * 0.4), dtype=np.float32),
100
- )
101
-
102
- # Create mute feature files
103
- mute_f0 = np.zeros(int(16000 * 0.4 / 160), dtype=np.float32)
104
- np.save(os.path.join(mute_dir, "f0", "mute.wav.npy"), mute_f0)
105
- np.save(os.path.join(mute_dir, "f0_voiced", "mute.wav.npy"), mute_f0)
106
-
107
- # Create mute embedding (768-dim contentvec)
108
- mute_embed = np.zeros((int(16000 * 0.4 / 320), 768), dtype=np.float32)
109
- np.save(os.path.join(mute_dir, "extracted", "mute.npy"), mute_embed)
110
-
111
- logger.info("Mute files created.")
112
-
113
-
114
- def setup_applio():
115
- """Full setup: clone + download models + create mute files."""
116
- if not clone_applio():
117
- raise RuntimeError("Failed to clone Applio")
118
-
119
- # Add Applio to Python path
120
- if APPLIO_DIR not in sys.path:
121
- sys.path.insert(0, APPLIO_DIR)
122
-
123
- # Download required models
124
- all_ok = True
125
- for local_path, remote_path in REQUIRED_MODELS.items():
126
- if not download_pretrained(local_path, remote_path):
127
- all_ok = False
128
-
129
- if not all_ok:
130
- logger.warning("Some models failed to download. Training may not work.")
131
-
132
- # Create mute files for training
133
- create_mute_files()
134
-
135
- logger.info("Applio setup complete.")
136
  return True
137
-
138
-
139
- def ensure_applio_path():
140
- """Ensure Applio is on the Python path."""
141
- if APPLIO_DIR not in sys.path:
142
- sys.path.insert(0, APPLIO_DIR)
 
1
  """
2
+ Setup module: clone Seed-VC repo at startup.
3
+ Seed-VC downloads its own pretrained models from HuggingFace on first use.
4
  """
5
 
6
  import os
 
10
 
11
  logger = logging.getLogger(__name__)
12
 
13
+ SEED_VC_DIR = "/tmp/seed-vc"
14
+ SEED_VC_REPO = "https://github.com/Plachtaa/seed-vc.git"
15
 
16
 
17
+ def clone_seed_vc():
18
+ """Clone Seed-VC repository if not already present."""
19
+ if os.path.exists(os.path.join(SEED_VC_DIR, "inference.py")):
20
+ logger.info("Seed-VC already cloned.")
21
  return True
22
 
23
+ logger.info("Cloning Seed-VC repository...")
24
  try:
25
  subprocess.run(
26
+ ["git", "clone", "--depth", "1", SEED_VC_REPO, SEED_VC_DIR],
27
+ check=True, capture_output=True, text=True,
28
  )
29
+ logger.info("Seed-VC cloned successfully.")
30
  return True
31
  except subprocess.CalledProcessError as e:
32
+ logger.error(f"Failed to clone Seed-VC: {e.stderr}")
33
  return False
34
 
35
 
36
+ def ensure_seed_vc_path():
37
+ """Ensure Seed-VC is on the Python path."""
38
+ if SEED_VC_DIR not in sys.path:
39
+ sys.path.insert(0, SEED_VC_DIR)
 
40
 
41
 
42
+ def setup_seed_vc():
43
+ """Full setup: clone repo + add to path."""
44
+ if not clone_seed_vc():
45
+ raise RuntimeError("Failed to clone Seed-VC")
46
+ ensure_seed_vc_path()
47
 
48
+ # Install Seed-VC dependencies that might not be in our requirements.txt
49
+ try:
50
+ subprocess.run(
51
+ [sys.executable, "-m", "pip", "install", "-q",
52
+ "descript-audio-codec", "vocos", "bigvgan"],
53
+ capture_output=True, text=True, timeout=120,
54
+ )
55
  except Exception as e:
56
+ logger.warning(f"Some Seed-VC deps may be missing: {e}")
57
 
58
+ logger.info("Seed-VC setup complete.")
59
  return True
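
One detail of the dependency install in `setup_seed_vc` is worth spelling out: `subprocess.run` only raises `CalledProcessError` when `check=True` is passed, so a failed `pip install` returns normally with a non-zero `returncode` and the `except` branch is reached only for errors such as a timeout. A minimal sketch of reporting both cases follows. `run_quiet` is a hypothetical helper, not part of this commit, and the commands in the usage section are harmless placeholders rather than the real install.

```python
import subprocess
import sys


def run_quiet(cmd, timeout=120):
    """Run a command and return (ok, error_text) without raising on failure."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The process could not start, or it exceeded the timeout.
        return False, str(exc)
    if result.returncode != 0:
        # The process ran but reported failure; surface its stderr.
        return False, result.stderr.strip()
    return True, ""


if __name__ == "__main__":
    # Placeholder commands standing in for the pip call.
    print(run_quiet([sys.executable, "-c", "print('ok')"]))
    print(run_quiet([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

Returning a flag instead of raising keeps startup non-fatal, which matches the commit's choice to log a warning and continue when optional dependencies cannot be installed.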
 
pipeline/storage.py CHANGED
@@ -1,5 +1,5 @@
1
  """
2
- Model storage module: persist trained RVC models to HuggingFace Dataset repo.
3
  """
4
 
5
  import os
@@ -8,64 +8,63 @@ from datetime import datetime
8
 
9
  logger = logging.getLogger(__name__)
10
 
11
- # Will be set from environment or app config
12
  MODELS_REPO_ID = None
13
  LOCAL_MODELS_DIR = "/tmp/rvc_models"
14
 
15
 
16
- def init_storage(repo_id: str):
17
  """Initialize storage with the HF dataset repo ID."""
18
  global MODELS_REPO_ID
19
  MODELS_REPO_ID = repo_id
20
  os.makedirs(LOCAL_MODELS_DIR, exist_ok=True)
21
- logger.info(f"Storage initialized with repo: {repo_id}")
22
 
23
 
24
- def upload_model(model_name: str, pth_path: str, index_path: str = None, big_npy_path: str = None):
25
- """Upload trained model files to HF dataset repo."""
26
  if not MODELS_REPO_ID:
27
  logger.warning("No HF repo configured. Model saved locally only.")
28
  return False
29
 
30
  try:
31
  from huggingface_hub import HfApi
32
-
33
  api = HfApi()
34
 
35
- # Upload .pth file
36
- api.upload_file(
37
- path_or_fileobj=pth_path,
38
- path_in_repo=f"models/{model_name}/{model_name}.pth",
39
- repo_id=MODELS_REPO_ID,
40
- repo_type="dataset",
41
- )
42
- logger.info(f"Uploaded {model_name}.pth to HF")
 
43
 
44
- # Upload .index file if exists
45
- if index_path and os.path.exists(index_path):
 
46
  api.upload_file(
47
- path_or_fileobj=index_path,
48
- path_in_repo=f"models/{model_name}/{model_name}.index",
49
  repo_id=MODELS_REPO_ID,
50
  repo_type="dataset",
51
  )
52
- logger.info(f"Uploaded {model_name}.index to HF")
53
 
54
- # Upload big_npy embeddings if exists
55
- if big_npy_path and os.path.exists(big_npy_path):
56
  api.upload_file(
57
- path_or_fileobj=big_npy_path,
58
- path_in_repo=f"models/{model_name}/{model_name}_big_npy.npy",
59
  repo_id=MODELS_REPO_ID,
60
  repo_type="dataset",
61
  )
62
- logger.info(f"Uploaded {model_name}_big_npy.npy to HF")
63
 
64
  # Upload metadata
65
  metadata = {
66
  "name": model_name,
67
  "created": datetime.now().isoformat(),
68
- "sample_rate": 40000,
69
  }
70
  import json
71
  import tempfile
@@ -77,7 +76,7 @@ def upload_model(model_name: str, pth_path: str, index_path: str = None, big_npy
77
  try:
78
  api.upload_file(
79
  path_or_fileobj=meta_path,
80
- path_in_repo=f"models/{model_name}/metadata.json",
81
  repo_id=MODELS_REPO_ID,
82
  repo_type="dataset",
83
  )
@@ -86,14 +85,13 @@ def upload_model(model_name: str, pth_path: str, index_path: str = None, big_npy
86
 
87
  return True
88
  except Exception as e:
89
- logger.error(f"Failed to upload model: {e}")
90
  return False
91
 
92
 
93
- def download_model(model_name: str):
94
- """Download model from HF dataset repo. Returns (pth_path, index_path)."""
95
  if not MODELS_REPO_ID:
96
- # Try local
97
  return _get_local_model(model_name)
98
 
99
  try:
@@ -105,58 +103,69 @@ def download_model(model_name: str):
105
  pth_path = hf_hub_download(
106
  repo_id=MODELS_REPO_ID,
107
  repo_type="dataset",
108
- filename=f"models/{model_name}/{model_name}.pth",
109
  local_dir=local_dir,
110
  )
111
 
112
- index_path = None
113
- try:
114
- index_path = hf_hub_download(
115
- repo_id=MODELS_REPO_ID,
116
- repo_type="dataset",
117
- filename=f"models/{model_name}/{model_name}.index",
118
- local_dir=local_dir,
119
- )
120
- except Exception:
121
- pass # Index file is optional
122
-
123
- # Download big_npy embeddings (for FAISS retrieval)
124
  try:
125
- hf_hub_download(
126
  repo_id=MODELS_REPO_ID,
127
  repo_type="dataset",
128
- filename=f"models/{model_name}/{model_name}_big_npy.npy",
129
  local_dir=local_dir,
130
  )
131
  except Exception:
132
- pass # Will reconstruct from index if missing
133
-
134
- return pth_path, index_path
135
  except Exception as e:
136
- logger.error(f"Failed to download model from HF: {e}")
137
  return _get_local_model(model_name)
138
 
139
 
140
- def _get_local_model(model_name: str):
141
  """Get model from local storage."""
142
  local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
143
- pth_path = os.path.join(local_dir, f"{model_name}.pth")
144
- index_path = os.path.join(local_dir, f"{model_name}.index")
145
 
146
  if os.path.exists(pth_path):
147
- return pth_path, index_path if os.path.exists(index_path) else None
148
  return None, None
149
 
150
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
151
  def list_models():
152
- """List all available models (from HF repo + local)."""
153
  models = set()
154
 
155
- # Check HF repo
156
  if MODELS_REPO_ID:
157
  try:
158
  from huggingface_hub import HfApi
159
-
160
  api = HfApi()
161
  files = api.list_repo_files(MODELS_REPO_ID, repo_type="dataset")
162
  for f in files:
@@ -165,43 +174,37 @@ def list_models():
165
  if len(parts) >= 3:
166
  models.add(parts[1])
167
  except Exception as e:
168
- logger.error(f"Failed to list models from HF: {e}")
169
 
170
- # Check local models
171
  if os.path.exists(LOCAL_MODELS_DIR):
172
  for name in os.listdir(LOCAL_MODELS_DIR):
173
  model_dir = os.path.join(LOCAL_MODELS_DIR, name)
174
  if os.path.isdir(model_dir):
175
- pth = os.path.join(model_dir, f"{name}.pth")
176
  if os.path.exists(pth):
177
  models.add(name)
178
 
179
  return sorted(models)
180
 
181
 
182
- def delete_model(model_name: str):
183
  """Delete a model from HF repo and local storage."""
184
- # Delete from HF
185
  if MODELS_REPO_ID:
186
  try:
187
  from huggingface_hub import HfApi
188
-
189
  api = HfApi()
190
- # Delete the entire model folder
191
  files = api.list_repo_files(MODELS_REPO_ID, repo_type="dataset")
192
  for f in files:
193
- if f.startswith(f"models/{model_name}/"):
194
  api.delete_file(f, MODELS_REPO_ID, repo_type="dataset")
195
- logger.info(f"Deleted {model_name} from HF repo")
196
  except Exception as e:
197
- logger.error(f"Failed to delete from HF: {e}")
198
 
199
- # Delete local
200
  import shutil
201
-
202
  local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
203
  if os.path.exists(local_dir):
204
  shutil.rmtree(local_dir)
205
- logger.info(f"Deleted {model_name} from local storage")
206
 
207
  return True
 
1
  """
2
+ Model storage module: persist voice reference files to HuggingFace Dataset repo.
3
  """
4
 
5
  import os
 
8
 
9
  logger = logging.getLogger(__name__)
10
 
 
11
  MODELS_REPO_ID = None
12
  LOCAL_MODELS_DIR = "/tmp/rvc_models"
13
 
14
 
15
+ def init_storage(repo_id):
16
  """Initialize storage with the HF dataset repo ID."""
17
  global MODELS_REPO_ID
18
  MODELS_REPO_ID = repo_id
19
  os.makedirs(LOCAL_MODELS_DIR, exist_ok=True)
20
+ logger.info("Storage initialized with repo: {}".format(repo_id))
21
 
22
 
23
+ def upload_model(model_name, pth_path, index_path=None, big_npy_path=None, reference_path=None):
24
+ """Upload model files to HF dataset repo."""
25
  if not MODELS_REPO_ID:
26
  logger.warning("No HF repo configured. Model saved locally only.")
27
  return False
28
 
29
  try:
30
  from huggingface_hub import HfApi
 
31
  api = HfApi()
32
 
33
+ # Upload .pth marker
34
+ if pth_path and os.path.exists(pth_path):
35
+ api.upload_file(
36
+ path_or_fileobj=pth_path,
37
+ path_in_repo="models/{}/{}.pth".format(model_name, model_name),
38
+ repo_id=MODELS_REPO_ID,
39
+ repo_type="dataset",
40
+ )
41
+ logger.info("Uploaded {}.pth to HF".format(model_name))
42
 
43
+ # Upload reference audio
44
+ if reference_path and os.path.exists(reference_path):
45
+ ref_filename = os.path.basename(reference_path)
46
  api.upload_file(
47
+ path_or_fileobj=reference_path,
48
+ path_in_repo="models/{}/{}".format(model_name, ref_filename),
49
  repo_id=MODELS_REPO_ID,
50
  repo_type="dataset",
51
  )
52
+ logger.info("Uploaded {} to HF".format(ref_filename))
53
 
54
+ # Upload .index file if exists (backward compat)
55
+ if index_path and os.path.exists(index_path):
56
  api.upload_file(
57
+ path_or_fileobj=index_path,
58
+ path_in_repo="models/{}/{}.index".format(model_name, model_name),
59
  repo_id=MODELS_REPO_ID,
60
  repo_type="dataset",
61
  )
 
62
 
63
  # Upload metadata
64
  metadata = {
65
  "name": model_name,
66
  "created": datetime.now().isoformat(),
67
+ "engine": "seed-vc",
68
  }
69
  import json
70
  import tempfile
 
76
  try:
77
  api.upload_file(
78
  path_or_fileobj=meta_path,
79
+ path_in_repo="models/{}/metadata.json".format(model_name),
80
  repo_id=MODELS_REPO_ID,
81
  repo_type="dataset",
82
  )
 
85
 
86
  return True
87
  except Exception as e:
88
+ logger.error("Failed to upload model: {}".format(e))
89
  return False
90
 
91
 
92
+ def download_model(model_name):
93
+ """Download model from HF dataset repo. Returns (pth_path, reference_path)."""
94
  if not MODELS_REPO_ID:
 
95
  return _get_local_model(model_name)
96
 
97
  try:
 
103
  pth_path = hf_hub_download(
104
  repo_id=MODELS_REPO_ID,
105
  repo_type="dataset",
106
+ filename="models/{}/{}.pth".format(model_name, model_name),
107
  local_dir=local_dir,
108
  )
109
 
110
+ # Try to download reference audio
111
+ ref_path = None
112
  try:
113
+ ref_path = hf_hub_download(
114
  repo_id=MODELS_REPO_ID,
115
  repo_type="dataset",
116
+ filename="models/{}/{}_ref.wav".format(model_name, model_name),
117
  local_dir=local_dir,
118
  )
119
  except Exception:
120
+ # Try .index for backward compat with old RVC models
121
+ try:
122
+ ref_path = hf_hub_download(
123
+ repo_id=MODELS_REPO_ID,
124
+ repo_type="dataset",
125
+ filename="models/{}/{}.index".format(model_name, model_name),
126
+ local_dir=local_dir,
127
+ )
128
+ except Exception:
129
+ pass
130
+
131
+ return pth_path, ref_path
132
  except Exception as e:
133
+ logger.error("Failed to download model from HF: {}".format(e))
134
  return _get_local_model(model_name)
135
 
136
 
137
+ def _get_local_model(model_name):
138
  """Get model from local storage."""
139
  local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
140
+ pth_path = os.path.join(local_dir, "{}.pth".format(model_name))
141
+ ref_path = os.path.join(local_dir, "{}_ref.wav".format(model_name))
142
 
143
  if os.path.exists(pth_path):
144
+ return pth_path, ref_path if os.path.exists(ref_path) else None
145
  return None, None
146
 
147
 
148
+ def get_reference_path(model_name):
149
+ """Get the reference audio path for a model."""
150
+ local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
151
+ ref_path = os.path.join(local_dir, "{}_ref.wav".format(model_name))
152
+ if os.path.exists(ref_path):
153
+ return ref_path
154
+ # Search in subdirectories (HF download structure)
155
+ for root, dirs, files in os.walk(local_dir):
156
+ for f in files:
157
+ if f.endswith("_ref.wav"):
158
+ return os.path.join(root, f)
159
+ return None
160
+
161
+
162
  def list_models():
163
+ """List all available models."""
164
  models = set()
165
 
 
166
  if MODELS_REPO_ID:
167
  try:
168
  from huggingface_hub import HfApi
 
169
  api = HfApi()
170
  files = api.list_repo_files(MODELS_REPO_ID, repo_type="dataset")
171
  for f in files:
 
174
  if len(parts) >= 3:
175
  models.add(parts[1])
176
  except Exception as e:
177
+ logger.error("Failed to list models from HF: {}".format(e))
178
 
 
179
  if os.path.exists(LOCAL_MODELS_DIR):
180
  for name in os.listdir(LOCAL_MODELS_DIR):
181
  model_dir = os.path.join(LOCAL_MODELS_DIR, name)
182
  if os.path.isdir(model_dir):
183
+ pth = os.path.join(model_dir, "{}.pth".format(name))
184
  if os.path.exists(pth):
185
  models.add(name)
186
 
187
  return sorted(models)
188
 
189
 
190
+ def delete_model(model_name):
191
  """Delete a model from HF repo and local storage."""
 
192
  if MODELS_REPO_ID:
193
  try:
194
  from huggingface_hub import HfApi
 
195
  api = HfApi()
 
196
  files = api.list_repo_files(MODELS_REPO_ID, repo_type="dataset")
197
  for f in files:
198
+ if f.startswith("models/{}/".format(model_name)):
199
  api.delete_file(f, MODELS_REPO_ID, repo_type="dataset")
200
+ logger.info("Deleted {} from HF repo".format(model_name))
201
  except Exception as e:
202
+ logger.error("Failed to delete from HF: {}".format(e))
203
 
 
204
  import shutil
 
205
  local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
206
  if os.path.exists(local_dir):
207
  shutil.rmtree(local_dir)
208
+ logger.info("Deleted {} from local storage".format(model_name))
209
 
210
  return True
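Review note on `get_reference_path`: the function first checks the flat location `<local_dir>/<name>_ref.wav` and then walks subdirectories, which the source comment attributes to the HF download structure. The sketch below reproduces that two-step lookup in isolation so it can be tried without a Hub repo. The function name `find_reference` and the nested `models/<name>/` layout used in the demo are assumptions for illustration; the nested layout is inferred from the `path_in_repo` values in this diff.

```python
import os
import tempfile


def find_reference(local_dir, model_name):
    """Return the path of a '*_ref.wav' file for a model, or None.

    Hypothetical stand-in for get_reference_path: checks the flat location
    first, then falls back to a recursive search under local_dir.
    """
    direct = os.path.join(local_dir, "{}_ref.wav".format(model_name))
    if os.path.exists(direct):
        return direct

    # Fallback: a download may have recreated the repo folder structure.
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            if name.endswith("_ref.wav"):
                return os.path.join(root, name)
    return None


# Demonstration with a temporary directory standing in for the model folder.
with tempfile.TemporaryDirectory() as tmp:
    missing = find_reference(tmp, "alice")

    nested_dir = os.path.join(tmp, "models", "alice")
    os.makedirs(nested_dir)
    nested_file = os.path.join(nested_dir, "alice_ref.wav")
    open(nested_file, "wb").close()

    found = find_reference(tmp, "alice")
    found_matches = found == nested_file
```

Note that the fallback returns the first file ending in `_ref.wav`, so it relies on each model having its own directory.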
pipeline/training.py CHANGED
@@ -1,18 +1,12 @@
1
  """
2
- Training pipeline: wraps Applio's preprocess, extract, and train steps.
3
- All GPU-intensive operations run IN-PROCESS under @spaces.GPU decorators.
4
- Uses runpy.run_path to execute Applio scripts in the current process,
5
- ensuring ZeroGPU's GPU allocation is visible to the training code.
6
  """
7
 
8
  import os
9
- import sys
10
- import runpy
11
- import subprocess
12
  import logging
13
  import shutil
14
- import time
15
- import glob
16
 
17
  logger = logging.getLogger(__name__)
18
 
@@ -27,518 +21,105 @@ except ImportError:
27
  return decorator
28
 
29
 
30
- from pipeline.setup import APPLIO_DIR
31
-
32
- LOGS_DIR = os.path.join(APPLIO_DIR, "logs")
33
-
34
- # Prevent "context has already been set" errors from Applio/torch
35
- # by neutralizing mp.set_start_method calls
36
- import multiprocessing as _mp
37
- _orig_set_start_method = _mp.set_start_method
38
-
39
- def _safe_set_start_method(method=None, force=False):
40
- try:
41
- _orig_set_start_method(method, force=True)
42
- except RuntimeError:
43
- pass
44
-
45
- _mp.set_start_method = _safe_set_start_method
46
-
47
-
48
- def _setup_applio_env():
49
- """Ensure Applio is on sys.path."""
50
- if APPLIO_DIR not in sys.path:
51
- sys.path.insert(0, APPLIO_DIR)
52
- train_dir = os.path.join(APPLIO_DIR, "rvc", "train")
53
- if train_dir not in sys.path:
54
- sys.path.insert(0, train_dir)
55
-
56
-
57
- def preprocess(model_name: str, audio_path: str, sample_rate: int = 40000):
58
- """
59
- Preprocess audio: load, normalize, slice into segments, save at target SR and 16kHz.
60
- Custom implementation (no Applio subprocess dependency).
61
- """
62
- import numpy as np
63
- import librosa
64
- import soundfile as sf
65
-
66
- exp_dir = os.path.join(LOGS_DIR, model_name)
67
- sliced_dir = os.path.join(exp_dir, "sliced_audios")
68
- sliced_16k_dir = os.path.join(exp_dir, "sliced_audios_16k")
69
- os.makedirs(sliced_dir, exist_ok=True)
70
- os.makedirs(sliced_16k_dir, exist_ok=True)
71
-
72
- logger.info(f"Preprocessing {audio_path} for model {model_name}...")
73
-
74
- # Load audio at target sample rate
75
- audio, sr = librosa.load(audio_path, sr=sample_rate, mono=True)
76
- logger.info(f"Loaded audio: {len(audio)} samples, {len(audio)/sr:.1f}s at {sr}Hz")
77
-
78
- if len(audio) < sr * 1:
79
- raise RuntimeError("Audio trop court (< 1 seconde).")
80
-
81
- # Normalize
82
- peak = np.abs(audio).max()
83
- if peak > 0:
84
- audio = audio / peak * 0.95
85
-
86
- # Also load at 16kHz
87
- audio_16k, _ = librosa.load(audio_path, sr=16000, mono=True)
88
- peak_16k = np.abs(audio_16k).max()
89
- if peak_16k > 0:
90
- audio_16k = audio_16k / peak_16k * 0.95
91
-
92
- # Slice into segments of ~3.5 seconds with 0.3s overlap
93
- segment_len = int(3.5 * sr)
94
- hop = int(3.0 * sr) # 3.5 - 0.5 overlap
95
- segment_len_16k = int(3.5 * 16000)
96
- hop_16k = int(3.0 * 16000)
97
-
98
- MAX_SLICES = 40 # Balance quality vs GPU time (60s ZeroGPU limit)
99
-
100
- n_slices = 0
101
- idx = 0
102
-
103
- while idx < len(audio) and n_slices < MAX_SLICES:
104
- # Slice at target sample rate
105
- end = min(idx + segment_len, len(audio))
106
- segment = audio[idx:end]
107
-
108
- # Skip very short segments (< 0.5s)
109
- if len(segment) < int(0.5 * sr):
110
- idx += hop
111
- continue
112
-
113
- # Skip silent segments
114
- if np.abs(segment).max() < 0.01:
115
- idx += hop
116
- continue
117
-
118
- # Compute corresponding 16k positions
119
- ratio = 16000 / sr
120
- idx_16k = int(idx * ratio)
121
- end_16k = int(end * ratio)
122
- segment_16k = audio_16k[idx_16k:min(end_16k, len(audio_16k))]
123
-
124
- # Save slices
125
- fname = f"{model_name}_{n_slices:04d}.wav"
126
- sf.write(os.path.join(sliced_dir, fname), segment, sr)
127
- sf.write(os.path.join(sliced_16k_dir, fname), segment_16k, 16000)
128
-
129
- n_slices += 1
130
- idx += hop
131
-
132
- logger.info(f"Preprocessing complete: {n_slices} slices created.")
133
-
134
- if n_slices == 0:
135
- raise RuntimeError("Preprocessing produced no audio slices. L'audio est peut-être silencieux.")
136
-
137
- return n_slices
138
-
139
-
140
- @spaces.GPU(duration=60)
141
- def extract_features(model_name: str, sample_rate: int = 40000, f0_method: str = "rmvpe"):
142
- """
143
- Extract F0 pitch and HuBERT embeddings.
144
- Runs IN-PROCESS to access ZeroGPU's GPU allocation.
145
- """
146
  import torch
147
- import numpy as np
148
-
149
- _setup_applio_env()
150
- old_cwd = os.getcwd()
151
- os.chdir(APPLIO_DIR)
152
-
153
- try:
154
- exp_dir = os.path.join(LOGS_DIR, model_name)
155
- wav_path = os.path.join(exp_dir, "sliced_audios_16k")
156
-
157
- os.makedirs(os.path.join(exp_dir, "f0"), exist_ok=True)
158
- os.makedirs(os.path.join(exp_dir, "f0_voiced"), exist_ok=True)
159
- os.makedirs(os.path.join(exp_dir, "extracted"), exist_ok=True)
160
-
161
- files = []
162
- for wav_file in sorted(glob.glob(os.path.join(wav_path, "*.wav"))):
163
- file_name = os.path.basename(wav_file)
164
- files.append([
165
- wav_file,
166
- os.path.join(exp_dir, "f0", file_name + ".npy"),
167
- os.path.join(exp_dir, "f0_voiced", file_name + ".npy"),
168
- os.path.join(exp_dir, "extracted", file_name.replace("wav", "npy")),
169
- ])
170
 
171
- if not files:
172
- raise RuntimeError("No preprocessed audio files found for feature extraction.")
173
 
174
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
175
-
176
- # F0 extraction
177
- logger.info(f"Extracting F0 with {f0_method} on {device}...")
178
- from rvc.train.extract.extract import FeatureInput
179
- fe = FeatureInput(f0_method=f0_method, device=device)
180
- for file_info in files:
181
- fe.process_file(file_info)
182
-
183
- # HuBERT embedding extraction
184
- logger.info(f"Extracting embeddings on {device}...")
185
- from rvc.lib.utils import load_audio_16k, load_embedding
186
- emb_model = load_embedding("contentvec", None).to(device).float()
187
-
188
- for file_info in files:
189
- wav_file_path, _, _, out_file_path = file_info
190
- if os.path.exists(out_file_path):
191
- continue
192
- feats = torch.from_numpy(load_audio_16k(wav_file_path)).to(device).float()
193
- feats = feats.view(1, -1)
194
- with torch.no_grad():
195
- emb_result = emb_model(feats)["last_hidden_state"]
196
- feats_out = emb_result.squeeze(0).float().cpu().numpy()
197
- if not np.isnan(feats_out).any():
198
- np.save(out_file_path, feats_out, allow_pickle=False)
199
-
200
- # Save embedder model info
201
- import json
202
- model_info_path = os.path.join(exp_dir, "model_info.json")
203
- model_info = {}
204
- if os.path.exists(model_info_path):
205
- with open(model_info_path, "r") as f:
206
- model_info = json.load(f)
207
- model_info["embedder_model"] = "contentvec"
208
- with open(model_info_path, "w") as f:
209
- json.dump(model_info, f, indent=4)
210
-
211
- # Generate config and filelist
212
- from rvc.train.extract.preparing_files import generate_config, generate_filelist
213
- generate_config(sample_rate, exp_dir)
214
- generate_filelist(exp_dir, sample_rate, include_mutes=2)
215
-
216
- # Verify output
217
- if len(os.listdir(os.path.join(exp_dir, "extracted"))) == 0:
218
- raise RuntimeError("Feature extraction produced no embeddings.")
219
- if len(os.listdir(os.path.join(exp_dir, "f0"))) == 0:
220
- raise RuntimeError("F0 extraction produced no pitch files.")
221
-
222
- logger.info("Feature extraction complete.")
223
- return True
224
- finally:
225
- os.chdir(old_cwd)
226
-
227
-
228
- @spaces.GPU(duration=60)
229
- def train_model(
230
- model_name: str,
231
- sample_rate: int = 40000,
232
- total_epochs: int = 20,
233
- batch_size: int = 8,
234
  ):
235
  """
236
- Train RVC v2 model. Runs IN-PROCESS with mp.Process patched to avoid
237
- spawning child processes (which can't access ZeroGPU's GPU).
238
- Max 300s (5 min) on ZeroGPU.
239
- """
240
- import torch.multiprocessing as mp
241
- import json
242
-
243
- _setup_applio_env()
244
-
245
- # Ensure assets/config.json exists (Applio reads precision from it)
246
- assets_dir = os.path.join(APPLIO_DIR, "assets")
247
- os.makedirs(assets_dir, exist_ok=True)
248
- config_json = os.path.join(assets_dir, "config.json")
249
- if not os.path.exists(config_json):
250
- with open(config_json, "w") as f:
251
- json.dump({"precision": "fp32"}, f)
252
-
253
- # Select pretrained models
254
- sr_prefix = str(sample_rate)[:2]
255
- pg = os.path.join(APPLIO_DIR, "rvc", "models", "pretraineds", "hifi-gan", f"f0G{sr_prefix}k.pth")
256
- pd = os.path.join(APPLIO_DIR, "rvc", "models", "pretraineds", "hifi-gan", f"f0D{sr_prefix}k.pth")
257
-
258
- if not os.path.exists(pg) or not os.path.exists(pd):
259
- logger.warning("Pretrained models not found, training from scratch.")
260
- pg, pd = "", ""
261
 
262
- # Patch mp.Process to run inline (single GPU only)
263
- OrigProcess = mp.Process
264
 
265
- class InlineProcess:
266
- """Runs target function inline instead of spawning a new process."""
267
- def __init__(self, target=None, args=(), kwargs=None, **kw):
268
- self.target = target
269
- self.args = args
270
- self.kwargs = kwargs or {}
271
- self.pid = os.getpid()
272
 
273
- def start(self):
274
- if self.target:
275
- self.target(*self.args, **self.kwargs)
276
-
277
- def join(self):
278
- pass
279
-
280
- train_script = os.path.join(APPLIO_DIR, "rvc", "train", "train.py")
281
-
282
- argv_args = [
283
- model_name,
284
- str(total_epochs), str(total_epochs),
285
- pg, pd,
286
- "0", str(batch_size), str(sample_rate),
287
- "True", "True", "False", "False", "50", "False", "HiFi-GAN", "False",
288
- ]
289
-
290
- logger.info(f"Training {model_name} for {total_epochs} epochs (in-process)...")
291
- start_time = time.time()
292
-
293
- old_argv = sys.argv
294
- old_cwd = os.getcwd()
295
-
296
- mp.Process = InlineProcess
297
- try:
298
- os.chdir(APPLIO_DIR)
299
- sys.argv = [train_script] + argv_args
300
- runpy.run_path(train_script, run_name="__main__")
301
- except SystemExit as e:
302
- if e.code not in (0, None):
303
- raise RuntimeError(f"Training exited with code {e.code}")
304
- finally:
305
- mp.Process = OrigProcess
306
- sys.argv = old_argv
307
- os.chdir(old_cwd)
308
-
309
- elapsed = time.time() - start_time
310
- logger.info(f"Training completed in {elapsed:.1f}s")
311
- return True
312
-
313
-
314
- def build_index(model_name: str):
315
- """Build FAISS index from extracted embeddings."""
316
- import numpy as np
317
-
318
- try:
319
- import faiss
320
- except ImportError:
321
- logger.warning("faiss not available, skipping index building.")
322
- return None
323
-
324
- exp_dir = os.path.join(LOGS_DIR, model_name)
325
- extracted_dir = os.path.join(exp_dir, "extracted")
326
-
327
- if not os.path.exists(extracted_dir):
328
- logger.warning("No extracted features found for index building.")
329
- return None
330
-
331
- # Load all embeddings
332
- embeddings = []
333
- for npy_file in sorted(glob.glob(os.path.join(extracted_dir, "*.npy"))):
334
- try:
335
- emb = np.load(npy_file)
336
- if emb.ndim == 2:
337
- embeddings.append(emb)
338
- except Exception as e:
339
- logger.warning(f"Failed to load {npy_file}: {e}")
340
-
341
- if not embeddings:
342
- logger.warning("No valid embeddings found for index.")
343
- return None
344
-
345
- all_emb = np.concatenate(embeddings, axis=0).astype(np.float32)
346
- logger.info(f"Building FAISS index from {all_emb.shape[0]} vectors ({all_emb.shape[1]}D)...")
347
-
348
- # Build IVF index for fast retrieval
349
- dim = all_emb.shape[1]
350
- n_vectors = all_emb.shape[0]
351
-
352
- if n_vectors < 40:
353
- # Too few vectors for IVF, use flat index
354
- index = faiss.IndexFlatL2(dim)
355
- else:
356
- n_clusters = min(int(np.sqrt(n_vectors)), n_vectors // 4)
357
- n_clusters = max(n_clusters, 1)
358
- quantizer = faiss.IndexFlatL2(dim)
359
- index = faiss.IndexIVFFlat(quantizer, dim, n_clusters)
360
- index.train(all_emb)
361
-
362
- index.add(all_emb)
363
-
364
- index_path = os.path.join(exp_dir, f"{model_name}.index")
365
- faiss.write_index(index, index_path)
366
-
367
- # Save raw embeddings for FAISS retrieval at inference time
368
- big_npy_path = os.path.join(exp_dir, f"{model_name}_big_npy.npy")
369
- np.save(big_npy_path, all_emb)
370
-
371
- logger.info(f"FAISS index built: {index_path} ({n_vectors} vectors)")
372
- return index_path, big_npy_path
373
-
374
-
375
- def find_trained_model(model_name: str):
376
- """Find the trained .pth model file."""
377
- exp_dir = os.path.join(LOGS_DIR, model_name)
378
-
379
- if os.path.exists(exp_dir):
380
- exact = os.path.join(exp_dir, f"{model_name}.pth")
381
- if os.path.exists(exact):
382
- return exact
383
-
384
- for f in sorted(os.listdir(exp_dir), reverse=True):
385
- if f.endswith(".pth") and f.startswith(model_name):
386
- return os.path.join(exp_dir, f)
387
-
388
- return None
389
-
390
-
391
- def find_pretrained_model(sample_rate: int = 40000):
392
- """Find the pre-trained RVC generator model."""
393
- sr_prefix = str(sample_rate)[:2]
394
- pg = os.path.join(APPLIO_DIR, "rvc", "models", "pretraineds", "hifi-gan", f"f0G{sr_prefix}k.pth")
395
- if os.path.exists(pg):
396
- return pg
397
- return None
398
-
399
-
400
- def _convert_to_inference_model(checkpoint_path, output_path, sample_rate=40000):
401
- """
402
- Convert a pretrained training checkpoint to RVC inference format.
403
- Training checkpoints have keys: model, optimizer, iteration, learning_rate
404
- Inference models need keys: weight, config, info, sr, f0, version
405
  """
406
- import torch
407
- import json
408
-
409
- checkpoint = torch.load(checkpoint_path, map_location="cpu")
410
-
411
- # Extract generator weights
412
- if "model" in checkpoint:
413
- state_dict = checkpoint["model"]
414
- elif "state_dict" in checkpoint:
415
- state_dict = checkpoint["state_dict"]
416
- else:
417
- state_dict = checkpoint
418
 
419
- # Remove "module." prefix if present (from DataParallel)
420
- weight = {}
421
- for k, v in state_dict.items():
422
- new_key = k.replace("module.", "")
423
- weight[new_key] = v.half()
424
 
425
- # Read config from Applio config file
426
- sr_label = "40k" if sample_rate == 40000 else "48k"
427
- config_path = os.path.join(APPLIO_DIR, "configs", "v2", f"{sr_label}.json")
428
 
429
- if os.path.exists(config_path):
430
- with open(config_path) as f:
431
- cfg = json.load(f)
432
- config = [
433
- cfg["data"]["filter_length"] // 2 + 1,
434
- cfg["train"]["segment_size"] // cfg["data"]["hop_length"],
435
- cfg["model"]["inter_channels"],
436
- cfg["model"]["hidden_channels"],
437
- cfg["model"]["filter_channels"],
438
- cfg["model"]["n_heads"],
439
- cfg["model"]["n_layers"],
440
- cfg["model"]["kernel_size"],
441
- cfg["model"]["p_dropout"],
442
- cfg["model"]["resblock"],
443
- cfg["model"]["resblock_kernel_sizes"],
444
- cfg["model"]["resblock_dilation_sizes"],
445
- cfg["model"]["upsample_rates"],
446
- cfg["model"]["upsample_initial_channel"],
447
- cfg["model"]["upsample_kernel_sizes"],
448
- cfg["model"]["spk_embed_dim"],
449
- cfg["model"]["gin_channels"],
450
- cfg["data"]["sampling_rate"],
451
- ]
452
- else:
453
- # Fallback: standard RVC v2 40k config
454
- config = [
455
- 1025, 32, 192, 192, 768, 2, 6, 3, 0, "1",
456
- [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
457
- [10, 10, 2, 2], 512, [16, 16, 4, 4], 109, 256, 40000,
458
- ]
459
 
460
- inference_model = {
461
- "weight": weight,
462
- "config": config,
463
- "info": f"v2_{sr_label}",
464
- "sr": sr_label,
465
- "f0": 1,
466
- "version": "v2",
467
- }
468
 
469
- torch.save(inference_model, output_path)
470
- logger.info(f"Converted checkpoint to inference format: {output_path}")
471
- return output_path
 
472
 
473
-
474
- def full_training_pipeline(
475
- audio_path: str,
476
- model_name: str,
477
- epochs: int = 10,
478
- sample_rate: int = 40000,
479
- batch_size: int = 4,
480
- progress_callback=None,
481
- ):
482
- """
483
- Run the voice model creation pipeline.
484
- On CPU: skips heavy HiFi-GAN training, uses pre-trained model + FAISS index.
485
- Returns (pth_path, index_path) on success.
486
- """
487
- import torch
488
- from pipeline.storage import upload_model, LOCAL_MODELS_DIR
489
-
490
- has_gpu = torch.cuda.is_available()
491
 
492
  if progress_callback:
493
- progress_callback(0.05, "Découpage de l'audio...")
494
-
495
- n_slices = preprocess(model_name, audio_path, sample_rate)
496
-
497
- if progress_callback:
498
- progress_callback(0.20, f"{n_slices} segments créés. Extraction des caractéristiques vocales...")
499
-
500
- extract_features(model_name, sample_rate)
501
 
502
- if progress_callback:
503
- progress_callback(0.60, "Caractéristiques extraites. Construction de l'index vocal...")
 
 
504
 
505
- # Build FAISS index (fast, CPU-friendly)
506
- result = build_index(model_name)
507
- if result is None:
508
- raise RuntimeError("Impossible de construire l'index FAISS. Pas d'embeddings extraits.")
509
- index_path, big_npy_path = result
510
 
511
- # The user's "model" is the FAISS index + embeddings.
512
- # The pretrained generator is shared by all models (loaded at inference time).
513
- # Voice identity comes from FAISS retrieval, not generator fine-tuning.
514
  if progress_callback:
515
- progress_callback(0.75, "Finalisation du modèle vocal...")
516
 
517
  # Save to local models directory
518
  local_model_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
519
  os.makedirs(local_model_dir, exist_ok=True)
520
 
521
- # Save FAISS index
522
- local_index = os.path.join(local_model_dir, f"{model_name}.index")
523
- shutil.copy2(index_path, local_index)
524
 
525
- # Save big_npy embeddings (needed for FAISS retrieval at inference)
526
- local_big_npy = os.path.join(local_model_dir, f"{model_name}_big_npy.npy")
527
- shutil.copy2(big_npy_path, local_big_npy)
528
-
529
- # Create a minimal model marker file (no actual model weights needed)
530
- local_pth = os.path.join(local_model_dir, f"{model_name}.pth")
531
- torch.save({"type": "faiss_voice_model", "sample_rate": sample_rate}, local_pth)
 
 
532
 
533
  if progress_callback:
534
- progress_callback(0.90, "Sauvegarde du modèle...")
535
 
 
536
  try:
537
- upload_model(model_name, local_pth, local_index, local_big_npy)
538
  except Exception as e:
539
- logger.warning(f"Failed to upload to HF (non-critical): {e}")
540
 
541
  if progress_callback:
542
- progress_callback(1.0, "Modèle vocal créé !")
 
 
 
543
 
544
- return local_pth, local_index
 
1
  """
2
+ Voice model creation: save a reference audio clip for Seed-VC zero-shot conversion.
3
+ No neural network training needed - Seed-VC uses in-context learning from
4
+ reference audio at inference time.
 
5
  """
6
 
7
  import os
 
 
 
8
  import logging
9
  import shutil
 
 
10
 
11
  logger = logging.getLogger(__name__)
12
 
 
21
  return decorator
22
 
23
 
24
+ # Dummy GPU-decorated function so ZeroGPU detects a GPU function at startup
25
+ @spaces.GPU(duration=10)
26
+ def _gpu_warmup():
27
+ """Minimal GPU function for ZeroGPU detection."""
28
  import torch
29
+ return torch.cuda.is_available()
30
 
 
 
31
 
32
+ def save_voice_reference(
33
+ audio_path,
34
+ model_name,
35
+ progress_callback=None,
36
  ):
37
  """
38
+ Save a voice reference audio clip as the user's 'voice model'.
39
 
40
+ With Seed-VC, no training is needed. The reference audio (3-30 seconds)
41
+ is used directly at inference time for zero-shot voice conversion.
42
 
43
+ Args:
44
+ audio_path: Path to the uploaded voice recording
45
+ model_name: Name for the voice model
46
+ progress_callback: Optional callback for progress updates
 
 
 
47
 
48
+ Returns:
49
+ (reference_path, None) - path to saved reference audio
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
50
  """
+     import librosa
+     import soundfile as sf
+     import numpy as np

+     from pipeline.storage import LOCAL_MODELS_DIR, upload_model

+     if progress_callback:
+         progress_callback(0.1, "Chargement de l'audio...")

+     # Load and preprocess the reference audio
+     audio, sr = librosa.load(audio_path, sr=44100, mono=True)

+     duration = len(audio) / sr
+     logger.info("Reference audio: {:.1f}s at {}Hz".format(duration, sr))

+     if duration < 2.0:
+         raise RuntimeError(
+             "Audio trop court ({:.1f}s). Minimum 3 secondes recommande.".format(duration)
+         )

+     # Limit to 30 seconds (Seed-VC max reference length)
+     max_samples = 30 * sr
+     if len(audio) > max_samples:
+         audio = audio[:max_samples]
+         logger.info("Trimmed reference to 30s (Seed-VC max).")

      if progress_callback:
+         progress_callback(0.3, "Normalisation et nettoyage...")

+     # Normalize audio
+     peak = np.abs(audio).max()
+     if peak > 0:
+         audio = audio / peak * 0.95

+     # Trim silence from start and end
+     audio_trimmed, _ = librosa.effects.trim(audio, top_db=25)
+     if len(audio_trimmed) > sr * 2:
+         audio = audio_trimmed

      if progress_callback:
+         progress_callback(0.6, "Sauvegarde de la reference vocale...")

      # Save to local models directory
      local_model_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
      os.makedirs(local_model_dir, exist_ok=True)

+     reference_path = os.path.join(local_model_dir, "{}_ref.wav".format(model_name))
+     sf.write(reference_path, audio, 44100, subtype="PCM_16")

+     # Also save a .pth marker for compatibility with storage/listing
+     import torch
+     marker_path = os.path.join(local_model_dir, "{}.pth".format(model_name))
+     torch.save({
+         "type": "seed_vc_reference",
+         "reference_audio": "{}_ref.wav".format(model_name),
+         "duration": len(audio) / sr,
+         "sample_rate": 44100,
+     }, marker_path)

      if progress_callback:
+         progress_callback(0.8, "Upload vers HuggingFace...")

+     # Upload to HF
      try:
+         upload_model(model_name, marker_path, reference_path=reference_path)
      except Exception as e:
+         logger.warning("Failed to upload to HF (non-critical): {}".format(e))

      if progress_callback:
+         progress_callback(1.0, "Reference vocale sauvegardee !")
+
+     final_duration = len(audio) / sr
+     logger.info("Voice reference saved: {} ({:.1f}s)".format(reference_path, final_duration))

+     return marker_path, reference_path
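
Review note: the length cap and peak normalization in `save_voice_reference` can be exercised without librosa or an audio file. The sketch below is a minimal NumPy-only illustration of those two steps, assuming a mono float signal. The helper name `prepare_reference` and its parameters are hypothetical and not part of this commit, and the silence trimming done with `librosa.effects.trim` is deliberately left out.

```python
import numpy as np


def prepare_reference(audio, sr, max_seconds=30.0, min_seconds=2.0, target_peak=0.95):
    """Cap the length and peak-normalize a mono signal (hypothetical helper)."""
    audio = np.asarray(audio, dtype=np.float32)

    # Reject clips that are too short to serve as a reference.
    duration = len(audio) / sr
    if duration < min_seconds:
        raise ValueError("audio too short: {:.1f}s".format(duration))

    # Keep at most max_seconds of audio, counted from the start.
    max_samples = int(max_seconds * sr)
    if len(audio) > max_samples:
        audio = audio[:max_samples]

    # Scale so the loudest sample sits at target_peak; leave pure silence untouched.
    peak = float(np.abs(audio).max())
    if peak > 0:
        audio = audio / peak * target_peak
    return audio


if __name__ == "__main__":
    sr = 44100
    t = np.arange(40 * sr) / sr                    # 40 s test tone
    tone = 0.2 * np.sin(2 * np.pi * 220.0 * t)
    ref = prepare_reference(tone, sr)
    print(len(ref) / sr, float(np.abs(ref).max()))
```

Raising `ValueError` here is a choice made for the sketch; the committed code raises `RuntimeError` with a French message instead.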
requirements.txt CHANGED
@@ -13,31 +13,23 @@ soundfile==0.12.1
  scipy>=1.11.0
  numpy<2.0
  soxr
- noisereduce
  ffmpeg-python>=0.2.0
  pedalboard

- # RVC dependencies
- faiss-cpu==1.9.0.post1
- torchcrepe
- torchfcpe
- einops
- transformers==4.44.2
-
  # Demucs (stem separation)
  demucs

- # Pitch extraction
- praat-parselmouth
-
- # ML utilities
- tqdm
+ # Seed-VC dependencies
+ einops
+ transformers>=4.40.0
  pyyaml
  requests
+ tqdm
  numba

- # Misc
- tensorboard
- tensorboardX
- stftpitchshift
- wget
+ # Vocoder
+ bigvgan
+
+ # Audio codec & vocos (used by Seed-VC)
+ descript-audio-codec
+ vocos