rvc

ibcplateformes Claude Opus 4.6 committed on Mar 31

Commit fea49f2 (1 parent: 55b9bab)

Replace RVC with Seed-VC for zero-shot voice conversion

RVC required fine-tuning (250-500 epochs) incompatible with ZeroGPU's 60s limit,
resulting in poor quality. Seed-VC uses diffusion transformer + in-context learning
for zero-shot conversion with just 3-30 sec of reference audio.

- Rewrite inference.py: Seed-VC pipeline (Whisper + CAMPPlus + RMVPE + BigVGAN)
- Simplify training.py: just save reference audio (no neural network training)
- Simplify setup.py: clone Seed-VC repo instead of Applio
- Update storage.py: handle audio reference files
- Simplify app.py UI: remove epochs slider, 3-30 sec upload
- Update requirements.txt: remove RVC deps, add Seed-VC deps
- Update README.md: reflect new architecture

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
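
The simplified training step described in the commit message (saving a 3-30 second reference clip instead of training a network) could look roughly like the sketch below. This is not the code from `pipeline/training.py`, whose body is not shown on this page. `wav_duration_seconds`, `store_reference` and the `references/` directory layout are hypothetical names chosen for illustration, and only WAV input is handled because the sketch relies on the standard-library `wave` module.

```python
import shutil
import wave
from pathlib import Path

MIN_SECONDS = 3.0   # lower bound quoted in the commit message
MAX_SECONDS = 30.0  # upper bound quoted in the commit message


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def store_reference(audio_path, profile_name, root="references"):
    """Hypothetical helper: check the clip length, then copy it to root/profile_name/."""
    duration = wav_duration_seconds(audio_path)
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        raise ValueError(
            "reference must last {}-{} s, got {:.1f} s".format(
                int(MIN_SECONDS), int(MAX_SECONDS), duration
            )
        )
    target_dir = Path(root) / profile_name
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "reference.wav"
    shutil.copy2(str(audio_path), str(target))  # no GPU work, just a file copy
    return str(target)
```

Because nothing here touches the GPU, a step of this kind would not count against the ZeroGPU time limit mentioned above. Whether the real implementation enforces the 3-30 second range or merely recommends it is not visible in this diff.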

Files changed (7)

README.md +22 -100
app.py +112 -172
pipeline/inference.py +286 -269
pipeline/setup.py +31 -114
pipeline/storage.py +73 -70
pipeline/training.py +70 -489
requirements.txt +10 -18

README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title: Clone Vocal RVC
 emoji: "\U0001F3A4"
 colorFrom: purple
 colorTo: blue
@@ -10,126 +10,48 @@ app_file: app.py
 pinned: false
 license: mit
 tags:
-  - rvc
   - voice-cloning
   - demucs
   - audio
   - music
 ---
-# Clone Vocal RVC
-Outil web de **clonage vocal** basé sur **RVC v2** (Retrieval-based Voice Conversion), accessible depuis votre navigateur.
 ## Fonctionnalités
-1. **Entraînement vocal** : Uploadez un enregistrement de votre voix (3-5 min) pour créer un modèle vocal personnalisé
 2. **Séparation audio** : Séparation automatique voix/instruments via Demucs (Meta AI)
-3. **Conversion vocale** : Remplacement de la voix originale par votre voix clonée
-4. **Mixage final** : Remixage automatique de votre voix convertie + les instruments originaux
 5. **Export** : Téléchargement du résultat en WAV 44.1kHz 16-bit
 ## Comment utiliser
-### Étape 1 : Entraîner votre modèle vocal
-1. Allez dans l'onglet **"Entraîner ma voix"**
-2. Uploadez un enregistrement de votre voix (WAV ou MP3, 3-5 minutes)
-   - Parlez ou chantez naturellement
-   - Évitez le bruit de fond
-3. Donnez un nom à votre modèle (ex: `ma_voix`)
-4. Choisissez le nombre d'époques (20 par défaut, suffisant pour un bon résultat)
-5. Cliquez sur **"Lancer l'entraînement"**
-6. Attendez la fin de l'entraînement (~3-5 minutes)
 ### Étape 2 : Convertir un morceau
-1. Allez dans l'onglet **"Convertir un morceau"**
-2. Sélectionnez votre modèle vocal dans la liste
-3. Uploadez le morceau de musique à convertir (WAV ou MP3)
-4. Ajustez les paramètres si besoin :
-   - **Transposition** : +/- demi-tons si votre voix est plus grave/aiguë
-   - **Taux d'index** : fidélité au timbre (0.75 par défaut)
-   - **Volumes** : équilibre voix/instruments
-5. Cliquez sur **"Convertir et mixer"**
-6. Écoutez l'aperçu et téléchargez le résultat
-### Étape 3 : Gérer vos modèles
-- L'onglet **"Mes modèles"** permet de voir, supprimer, ou importer des modèles externes
-## Déploiement
-### Prérequis
-- Un compte [HuggingFace](https://huggingface.co)
-- Un compte [GitHub](https://github.com)
-### Étapes de déploiement
-#### 1. Créer un dataset repo sur HuggingFace (pour stocker les modèles)
-1. Allez sur https://huggingface.co/new-dataset
-2. Nom : `rvc-voice-models`
-3. Visibilité : **Privé**
-4. Cliquez **Create**
-#### 2. Créer un token HuggingFace
-1. Allez sur https://huggingface.co/settings/tokens
-2. Cliquez **Create new token**
-3. Nom : `rvc-voice-cloner`
-4. Permissions : **Write**
-5. Copiez le token
-#### 3. Créer le repo GitHub
-```bash
-cd rvc-voice-cloner
-git init
-git add .
-git commit -m "Initial commit: Clone Vocal RVC"
-git remote add origin https://github.com/diamesene02/rvc-voice-cloner.git
-git push -u origin main
-```
-#### 4. Créer le HuggingFace Space
-1. Allez sur https://huggingface.co/new-space
-2. Nom : `clone-vocal-rvc`
-3. SDK : **Gradio**
-4. Hardware : **ZeroGPU** (gratuit pour les espaces publics)
-5. Cliquez **Create Space**
-#### 5. Configurer les secrets du Space
-Dans les **Settings** du Space :
-- Ajoutez `HF_TOKEN` : votre token HuggingFace (étape 2)
-- Ajoutez `HF_MODELS_REPO` : `votre-username/rvc-voice-models`
-#### 6. Déployer le code
-```bash
-# Ajouter le remote HuggingFace
-git remote add hf https://huggingface.co/spaces/votre-username/clone-vocal-rvc
-# Pousser le code
-git push hf main
-```
-#### 7. Accéder à l'outil
-Votre outil est accessible à :
-```
-https://huggingface.co/spaces/votre-username/clone-vocal-rvc
-```
 ## Architecture technique
-- **RVC v2** : Retrieval-based Voice Conversion avec HiFi-GAN
-- **Demucs** (Meta AI) : Séparation des sources audio (voix/instruments)
 - **Gradio** : Interface web
-- **ZeroGPU** : GPU H200 gratuit sur HuggingFace Spaces
-- **Applio** : Backend RVC (cloné automatiquement au démarrage)
-## Limitations
-- **Quota GPU** : ~5 minutes de GPU gratuit par jour (ZeroGPU)
-  - L'entraînement consomme ~3-4 min
-  - La conversion consomme ~1-2 min
-  - Pour plus de GPU : upgrade vers HuggingFace PRO ($9/mois, 25 min/jour)
-- Les modèles sont stockés sur HuggingFace Hub (persistance entre redémarrages)
-- Premier lancement plus lent (téléchargement des modèles pré-entraînés)
 ## Licence
-MIT - Basé sur [Applio](https://github.com/IAHispano/Applio) (MIT) et [Demucs](https://github.com/facebookresearch/demucs) (MIT)

 ---
+title: Clone Vocal
 emoji: "\U0001F3A4"
 colorFrom: purple
 colorTo: blue
 pinned: false
 license: mit
 tags:
+  - seed-vc
   - voice-cloning
   - demucs
   - audio
   - music
+  - zero-shot
 ---
+# Clone Vocal
+Outil web de **clonage vocal zero-shot** basé sur **Seed-VC** (Diffusion Transformer), accessible depuis votre navigateur.
 ## Fonctionnalités
+1. **Référence vocale** : Uploadez un court extrait de votre voix (3-30 sec) — pas d'entraînement nécessaire
 2. **Séparation audio** : Séparation automatique voix/instruments via Demucs (Meta AI)
+3. **Conversion vocale** : Remplacement de la voix originale par la vôtre (Seed-VC zero-shot)
+4. **Mixage final** : Remixage automatique voix convertie + instruments originaux
 5. **Export** : Téléchargement du résultat en WAV 44.1kHz 16-bit
 ## Comment utiliser
+### Étape 1 : Enregistrer votre référence vocale
+1. Onglet **"Ma voix"**
+2. Uploadez un extrait de votre voix (WAV ou MP3, 3 à 30 secondes)
+3. Donnez un nom (ex: `ma_voix`)
+4. Cliquez **"Sauvegarder"**
 ### Étape 2 : Convertir un morceau
+1. Onglet **"Convertir un morceau"**
+2. Sélectionnez votre profil vocal
+3. Uploadez le morceau à convertir
+4. Ajustez les paramètres si besoin (transposition, qualité, volumes)
+5. Cliquez **"Convertir et mixer"**
 ## Architecture technique
+- **Seed-VC** : Voice conversion zero-shot par diffusion transformer + in-context learning
+- **Demucs** (Meta AI) : Séparation des sources audio
 - **Gradio** : Interface web
+- **ZeroGPU** : GPU sur HuggingFace Spaces
 ## Licence
+MIT — Basé sur [Seed-VC](https://github.com/Plachtaa/seed-vc) (GPL v3) et [Demucs](https://github.com/facebookresearch/demucs) (MIT)
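
The conversion step in the README exposes a transposition in semitones and separate volumes for the voice and the instruments. As a rough illustration of what those two settings mean numerically, the sketch below converts a semitone offset to a frequency ratio (assuming the usual twelve-tone equal temperament) and mixes two tracks with per-track gains before quantizing to 16-bit samples. `semitones_to_ratio` and `mix_tracks` are hypothetical helpers; the project's own `mix_audio` and the way Seed-VC applies the pitch shift are not shown in this diff and may differ.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in twelve-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def mix_tracks(vocals, instruments, vocal_volume=1.0, instrumental_volume=1.0):
    """Sum two float tracks (samples in [-1, 1]) with gains and return 16-bit PCM integers."""
    length = max(len(vocals), len(instruments))
    mixed = []
    for i in range(length):
        # Pad the shorter track with silence.
        v = vocals[i] if i < len(vocals) else 0.0
        s = instruments[i] if i < len(instruments) else 0.0
        sample = vocal_volume * v + instrumental_volume * s
        sample = max(-1.0, min(1.0, sample))  # hard clip to avoid integer overflow
        mixed.append(int(round(sample * 32767)))
    return mixed
```

With this convention a slider value of +12 doubles the fundamental frequency and -12 halves it, so the ±24 range in the new UI spans two octaves in each direction.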

app.py CHANGED

@@ -1,6 +1,6 @@
 """
-Clone Vocal RVC - Outil web de clonage vocal basé sur RVC v2
-Interface Gradio en français, déployé sur HuggingFace Spaces avec ZeroGPU.
 """
 import os
@@ -11,8 +11,7 @@ import shutil
 import gradio as gr
-# ── Monkey-patch gradio_client to fix "argument of type 'bool' is not iterable" ──
-# Bug: gradio_client/utils.py get_type() crashes when schema is a bool instead of dict
 try:
     import gradio_client.utils as _gc_utils
@@ -43,46 +42,38 @@ except Exception:
 logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
 logger = logging.getLogger(__name__)
-# ── Startup: clone Applio + download models ──────────────────────────────────
 logger.info("Initialisation de l'application...")
-from pipeline.setup import setup_applio, APPLIO_DIR
-from pipeline.storage import init_storage, list_models, download_model, delete_model
-# Setup Applio (clone + download pretrained models)
 try:
-    setup_applio()
 except Exception as e:
-    logger.error(f"Erreur lors du setup: {e}")
 # Initialize model storage
 HF_MODELS_REPO = os.environ.get("HF_MODELS_REPO", "")
 if HF_MODELS_REPO:
     init_storage(HF_MODELS_REPO)
-    logger.info(f"Stockage HuggingFace configuré: {HF_MODELS_REPO}")
-else:
-    logger.warning(
-        "Variable HF_MODELS_REPO non définie. Les modèles seront stockés localement uniquement. "
-        "Pour la persistance, ajoutez HF_MODELS_REPO=votre-user/rvc-voice-models dans les secrets du Space."
-    )
-# ── Import GPU-decorated functions at top level for ZeroGPU detection ───────
-from pipeline.training import full_training_pipeline, extract_features
 from pipeline.separation import separate_audio
 from pipeline.inference import convert_voice
-# ── Training Tab ─────────────────────────────────────────────────────────────
-def train_voice_model(audio_file, model_name, epochs, progress=gr.Progress()):
-    """Handler for voice model training."""
     if audio_file is None:
         return "Erreur : Veuillez uploader un fichier audio.", None
     if not model_name or not model_name.strip():
-        return "Erreur : Veuillez entrer un nom pour le modèle.", None
     model_name = model_name.strip().replace(" ", "_")
@@ -90,39 +81,31 @@ def train_voice_model(audio_file, model_name, epochs, progress=gr.Progress()):
         progress(value, desc=desc)
     try:
-        progress(0.0, desc="Démarrage de l'entraînement...")
-        pth_path, index_path = full_training_pipeline(
             audio_path=audio_file,
             model_name=model_name,
-            epochs=int(epochs),
-            sample_rate=40000,
-            batch_size=8,
             progress_callback=progress_callback,
         )
-        result_msg = f"Modèle '{model_name}' entraîné avec succès !\n"
-        result_msg += f"Fichier : {os.path.basename(pth_path)}\n"
-        if index_path:
-            result_msg += f"Index : {os.path.basename(index_path)}"
-        return result_msg, pth_path
     except Exception as e:
         import traceback
         tb = traceback.format_exc()
-        logger.error(f"Erreur training: {tb}")
-        # Show last 500 chars of traceback for debugging
-        return f"Erreur lors de l'entraînement : {type(e).__name__}: {str(e)}\n\nDétails:\n{tb[-500:]}", None
-# ── Conversion Tab ───────────────────────────────────────────────────────────
 def get_model_choices():
     """Get list of trained model names for dropdown."""
     models = list_models()
     if not models:
-        return ["(aucun modèle entraîné)"]
     return models
@@ -130,7 +113,7 @@ def convert_song(
     model_choice,
     song_file,
     pitch,
-    index_rate,
     vocal_volume,
     instrumental_volume,
     progress=gr.Progress(),
@@ -139,35 +122,39 @@ def convert_song(
     if song_file is None:
         return "Erreur : Veuillez uploader un fichier audio.", None, None, None
-    if model_choice == "(aucun modèle entraîné)" or not model_choice:
-        return "Erreur : Veuillez d'abord entraîner un modèle vocal.", None, None, None
     from pipeline.mixing import mix_audio
     try:
-        # Step 1: Download model
-        progress(0.05, desc="Chargement du modèle...")
-        pth_path, index_path = download_model(model_choice)
         if not pth_path:
-            return f"Erreur : Modèle '{model_choice}' introuvable.", None, None, None
         # Step 2: Separate vocals from instruments
-        progress(0.10, desc="Séparation des pistes (Demucs)...")
         vocals_path, instruments_path = separate_audio(song_file)
-        progress(0.50, desc="Conversion vocale (RVC)...")
-        # Step 3: Convert vocals with RVC
         converted_path = convert_voice(
             audio_path=vocals_path,
-            model_path=pth_path,
-            index_path=index_path,
             pitch=int(pitch),
-            f0_method="rmvpe",
-            index_rate=float(index_rate),
         )
-        progress(0.80, desc="Mixage final...")
         # Step 4: Mix converted vocals with instruments
         final_path = mix_audio(
@@ -177,119 +164,94 @@ def convert_song(
             instrumental_volume=float(instrumental_volume),
         )
-        progress(1.0, desc="Terminé !")
         return (
-            "Conversion terminée avec succès !",
-            vocals_path,      # Preview vocals séparées
-            converted_path,   # Preview vocals converties
-            final_path,       # Résultat final
         )
     except Exception as e:
         import traceback
         tb = traceback.format_exc()
-        logger.error(f"Erreur conversion: {tb}")
-        return f"Erreur lors de la conversion : {type(e).__name__}: {str(e)}\n\nDétails:\n{tb[-500:]}", None, None, None
-# ── Models Tab ───────────────────────────────────────────────────────────────
 def refresh_models():
     """Refresh the model list as HTML."""
     models = list_models()
     if not models:
-        return "<p style='color:gray;'>Aucun modèle entraîné</p>"
-    rows = "".join(f"<tr><td>{m}</td><td>Disponible</td></tr>" for m in models)
-    return f"<table style='width:100%;border-collapse:collapse;'><tr><th style='text-align:left;border-bottom:1px solid #555;padding:8px;'>Nom</th><th style='text-align:left;border-bottom:1px solid #555;padding:8px;'>Statut</th></tr>{rows}</table>"
 def delete_selected_model(model_name_to_delete):
     """Delete a model."""
-    if not model_name_to_delete or model_name_to_delete == "(aucun modèle entraîné)":
-        return "Veuillez sélectionner un modèle à supprimer.", refresh_models()
     try:
         delete_model(model_name_to_delete)
-        return f"Modèle '{model_name_to_delete}' supprimé.", refresh_models()
     except Exception as e:
-        return f"Erreur : {e}", refresh_models()
-def upload_external_model(pth_file, model_name):
-    """Upload an external .pth model."""
-    if pth_file is None:
-        return "Veuillez sélectionner un fichier .pth", refresh_models()
-    if not model_name or not model_name.strip():
-        return "Veuillez entrer un nom pour le modèle.", refresh_models()
-    model_name = model_name.strip().replace(" ", "_")
-    from pipeline.storage import LOCAL_MODELS_DIR, upload_model
-    local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
-    os.makedirs(local_dir, exist_ok=True)
-    local_pth = os.path.join(local_dir, f"{model_name}.pth")
-    shutil.copy2(pth_file, local_pth)
-    try:
-        upload_model(model_name, local_pth)
-    except Exception:
-        pass  # Non-critical
-    return f"Modèle '{model_name}' importé avec succès.", refresh_models()
-# ── Build Gradio UI ──────────────────────────────────────────────────────────
 DESCRIPTION = """
-# Clone Vocal RVC
-Outil de clonage vocal basé sur **RVC v2** (Retrieval-based Voice Conversion).
 **Comment utiliser :**
-1. **Onglet "Entraîner"** : Uploadez un enregistrement de votre voix (3-5 min) pour créer votre modèle vocal
-2. **Onglet "Convertir"** : Uploadez un morceau de musique, l'outil remplace la voix par la vôtre
-3. **Onglet "Modèles"** : Gérez vos modèles vocaux entraînés
-> **Note** : Cet outil utilise ZeroGPU. Le quota GPU gratuit est limité (~5 min/jour).
-> L'entraînement consomme ~3-4 min de GPU, la conversion ~1-2 min.
 """
 with gr.Blocks(
-    title="Clone Vocal RVC",
     theme=gr.themes.Soft(),
 ) as app:
     gr.Markdown(DESCRIPTION)
     with gr.Tabs():
-        # ── Tab 1: Training ──
-        with gr.TabItem("Entraîner ma voix"):
-            gr.Markdown("### Créer un modèle vocal à partir de votre voix")
             with gr.Row():
                 with gr.Column(scale=2):
                     train_audio = gr.Audio(
-                        label="Enregistrement vocal (WAV ou MP3, 3-5 minutes)",
                         type="filepath",
                         sources=["upload"],
                     )
                     train_model_name = gr.Textbox(
-                        label="Nom du modèle",
                         placeholder="ex: ma_voix",
                         max_lines=1,
                     )
-                    train_epochs = gr.Slider(
-                        minimum=5,
-                        maximum=50,
-                        value=20,
-                        step=5,
-                        label="Nombre d'époques (plus = meilleure qualité, ~3-5 min avec GPU)",
-                    )
                     train_btn = gr.Button(
-                        "Lancer l'entraînement",
                         variant="primary",
                         size="lg",
                     )
@@ -298,59 +260,59 @@ with gr.Blocks(
                     train_status = gr.Textbox(
                         label="Statut",
                         interactive=False,
-                        lines=5,
                     )
                     train_download = gr.File(
-                        label="Télécharger le modèle",
                         interactive=False,
                     )
             gr.Markdown(
                 "**Conseils :**\n"
                 "- Utilisez un enregistrement propre (pas de bruit de fond, pas de musique)\n"
-                "- Parlez ou chantez naturellement pendant 3-5 minutes\n"
-                "- Format WAV ou MP3 accepté\n"
-                "- 15-25 époques suffisent pour un bon résultat"
             )
             train_btn.click(
                 fn=train_voice_model,
-                inputs=[train_audio, train_model_name, train_epochs],
                 outputs=[train_status, train_download],
             )
-        # ── Tab 2: Conversion ──
         with gr.TabItem("Convertir un morceau"):
-            gr.Markdown("### Remplacer la voix d'un morceau par la vôtre")
             with gr.Row():
                 with gr.Column(scale=2):
                     convert_model = gr.Dropdown(
                         choices=get_model_choices(),
-                        label="Modèle vocal",
                         interactive=True,
                     )
-                    refresh_btn = gr.Button("Rafraîchir la liste", size="sm")
                     convert_audio = gr.Audio(
-                        label="Morceau à convertir (WAV ou MP3)",
                         type="filepath",
                         sources=["upload"],
                     )
-                    with gr.Accordion("Paramètres avancés", open=False):
                         convert_pitch = gr.Slider(
-                            minimum=-12,
-                            maximum=12,
                             value=0,
                             step=1,
-                            label="Transposition (demi-tons) — ajustez si votre voix est plus grave/aiguë",
                         )
-                        convert_index_rate = gr.Slider(
-                            minimum=0.0,
-                            maximum=1.0,
-                            value=0.75,
-                            step=0.05,
-                            label="Taux d'index (plus haut = plus fidèle au timbre original)",
                         )
                         convert_vocal_vol = gr.Slider(
                             minimum=0.0,
@@ -379,16 +341,16 @@ with gr.Blocks(
                         interactive=False,
                         lines=3,
                     )
-                    gr.Markdown("**Aperçu des pistes :**")
                     preview_vocals = gr.Audio(
-                        label="Voix originale (séparée)",
                         interactive=False,
                     )
                     preview_converted = gr.Audio(
                         label="Voix convertie",
                         interactive=False,
                     )
-                    gr.Markdown("**Résultat final :**")
                     final_output = gr.Audio(
                         label="Morceau final (voix + instruments)",
                         interactive=False,
@@ -405,49 +367,33 @@ with gr.Blocks(
                     convert_model,
                     convert_audio,
                     convert_pitch,
-                    convert_index_rate,
                     convert_vocal_vol,
                     convert_inst_vol,
                 ],
                 outputs=[convert_status, preview_vocals, preview_converted, final_output],
             )
-        # ── Tab 3: Models ──
-        with gr.TabItem("Mes modèles"):
-            gr.Markdown("### Gérer vos modèles vocaux")
             models_table = gr.HTML(
                 value=refresh_models(),
-                label="Modèles entraînés",
             )
             with gr.Row():
-                models_refresh_btn = gr.Button("Rafraîchir", size="sm")
                 models_delete_name = gr.Dropdown(
                     choices=get_model_choices(),
-                    label="Modèle à supprimer",
                     interactive=True,
                 )
                 models_delete_btn = gr.Button("Supprimer", variant="stop", size="sm")
             models_delete_status = gr.Textbox(label="Statut", interactive=False)
-            gr.Markdown("---")
-            gr.Markdown("### Importer un modèle externe")
-            with gr.Row():
-                upload_pth = gr.File(
-                    label="Fichier .pth du modèle",
-                    file_types=[".pth"],
-                )
-                upload_name = gr.Textbox(
-                    label="Nom du modèle",
-                    placeholder="ex: voix_importee",
-                )
-                upload_btn = gr.Button("Importer", size="sm")
-            upload_status = gr.Textbox(label="Statut", interactive=False)
             models_refresh_btn.click(
                 fn=refresh_models,
                 outputs=[models_table],
@@ -463,12 +409,6 @@ with gr.Blocks(
                 outputs=[models_delete_status, models_table],
             )
-            upload_btn.click(
-                fn=upload_external_model,
-                inputs=[upload_pth, upload_name],
-                outputs=[upload_status, models_table],
-            )
 if __name__ == "__main__":
     app.launch(server_name="0.0.0.0")

 """
+Clone Vocal - Outil web de clonage vocal base sur Seed-VC (zero-shot).
+Interface Gradio en francais, deploye sur HuggingFace Spaces avec ZeroGPU.
 """
 import os
 import gradio as gr
+# Monkey-patch gradio_client to fix "argument of type 'bool' is not iterable"
 try:
     import gradio_client.utils as _gc_utils
 logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
 logger = logging.getLogger(__name__)
+# Startup: clone Seed-VC
 logger.info("Initialisation de l'application...")
+from pipeline.setup import setup_seed_vc
+from pipeline.storage import init_storage, list_models, download_model, delete_model, get_reference_path
 try:
+    setup_seed_vc()
 except Exception as e:
+    logger.error("Erreur lors du setup: {}".format(e))
 # Initialize model storage
 HF_MODELS_REPO = os.environ.get("HF_MODELS_REPO", "")
 if HF_MODELS_REPO:
     init_storage(HF_MODELS_REPO)
+    logger.info("Stockage HuggingFace configure: {}".format(HF_MODELS_REPO))
+# Import GPU-decorated functions for ZeroGPU detection
+from pipeline.training import save_voice_reference, _gpu_warmup
 from pipeline.separation import separate_audio
 from pipeline.inference import convert_voice
+# -- Training Tab --
+def train_voice_model(audio_file, model_name, progress=gr.Progress()):
+    """Handler: save voice reference."""
     if audio_file is None:
         return "Erreur : Veuillez uploader un fichier audio.", None
     if not model_name or not model_name.strip():
+        return "Erreur : Veuillez entrer un nom pour le modele.", None
     model_name = model_name.strip().replace(" ", "_")
         progress(value, desc=desc)
     try:
+        progress(0.0, desc="Demarrage...")
+        pth_path, ref_path = save_voice_reference(
             audio_path=audio_file,
             model_name=model_name,
             progress_callback=progress_callback,
         )
+        return "Reference vocale '{}' sauvegardee avec succes !".format(model_name), ref_path
     except Exception as e:
         import traceback
         tb = traceback.format_exc()
+        logger.error("Erreur training: {}".format(tb))
+        return "Erreur : {}: {}\n\nDetails:\n{}".format(
+            type(e).__name__, str(e), tb[-500:]
+        ), None
+# -- Conversion Tab --
 def get_model_choices():
     """Get list of trained model names for dropdown."""
     models = list_models()
     if not models:
+        return ["(aucun modele)"]
     return models
     model_choice,
     song_file,
     pitch,
+    diffusion_steps,
     vocal_volume,
     instrumental_volume,
     progress=gr.Progress(),
     if song_file is None:
         return "Erreur : Veuillez uploader un fichier audio.", None, None, None
+    if model_choice == "(aucun modele)" or not model_choice:
+        return "Erreur : Veuillez d'abord enregistrer une reference vocale.", None, None, None
     from pipeline.mixing import mix_audio
     try:
+        # Step 1: Download model / find reference audio
+        progress(0.05, desc="Chargement du modele...")
+        pth_path, ref_or_index = download_model(model_choice)
         if not pth_path:
+            return "Erreur : Modele '{}' introuvable.".format(model_choice), None, None, None
+        # Find the reference audio path
+        reference_path = get_reference_path(model_choice)
+        if not reference_path:
+            return "Erreur : Audio de reference introuvable pour '{}'.".format(model_choice), None, None, None
         # Step 2: Separate vocals from instruments
+        progress(0.10, desc="Separation des pistes (Demucs)...")
         vocals_path, instruments_path = separate_audio(song_file)
+        progress(0.40, desc="Conversion vocale (Seed-VC)...")
+        # Step 3: Convert vocals with Seed-VC
         converted_path = convert_voice(
             audio_path=vocals_path,
+            reference_path=reference_path,
             pitch=int(pitch),
+            diffusion_steps=int(diffusion_steps),
+            index_rate=0.7,
         )
+        progress(0.85, desc="Mixage final...")
         # Step 4: Mix converted vocals with instruments
         final_path = mix_audio(
             instrumental_volume=float(instrumental_volume),
         )
+        progress(1.0, desc="Termine !")
         return (
+            "Conversion terminee avec succes !",
+            vocals_path,
+            converted_path,
+            final_path,
         )
     except Exception as e:
         import traceback
         tb = traceback.format_exc()
+        logger.error("Erreur conversion: {}".format(tb))
+        return "Erreur : {}: {}\n\nDetails:\n{}".format(
+            type(e).__name__, str(e), tb[-500:]
+        ), None, None, None
+# -- Models Tab --
 def refresh_models():
     """Refresh the model list as HTML."""
     models = list_models()
     if not models:
+        return "<p style='color:gray;'>Aucun modele enregistre</p>"
+    rows = "".join(
+        "<tr><td>{}</td><td>Disponible</td></tr>".format(m) for m in models
+    )
+    return (
+        "<table style='width:100%;border-collapse:collapse;'>"
+        "<tr><th style='text-align:left;border-bottom:1px solid #555;padding:8px;'>Nom</th>"
+        "<th style='text-align:left;border-bottom:1px solid #555;padding:8px;'>Statut</th></tr>"
+        "{}</table>".format(rows)
+    )
 def delete_selected_model(model_name_to_delete):
     """Delete a model."""
+    if not model_name_to_delete or model_name_to_delete == "(aucun modele)":
+        return "Veuillez selectionner un modele a supprimer.", refresh_models()
     try:
         delete_model(model_name_to_delete)
+        return "Modele '{}' supprime.".format(model_name_to_delete), refresh_models()
     except Exception as e:
+        return "Erreur : {}".format(e), refresh_models()
+# -- Build Gradio UI --
 DESCRIPTION = """
+# Clone Vocal
+Outil de clonage vocal **zero-shot** base sur **Seed-VC** (Diffusion Transformer).
 **Comment utiliser :**
+1. **Onglet "Ma voix"** : Uploadez un court extrait de votre voix (3-30 sec) pour creer votre profil vocal
+2. **Onglet "Convertir"** : Uploadez un morceau de musique, l'outil remplace la voix par la votre
+3. **Onglet "Modeles"** : Gerez vos profils vocaux
+> **Zero-shot** : pas d'entrainement necessaire ! Juste 3-30 secondes de votre voix suffisent.
 """
 with gr.Blocks(
+    title="Clone Vocal",
     theme=gr.themes.Soft(),
 ) as app:
     gr.Markdown(DESCRIPTION)
     with gr.Tabs():
+        # Tab 1: Voice Reference
+        with gr.TabItem("Ma voix"):
+            gr.Markdown("### Enregistrer votre reference vocale")
             with gr.Row():
                 with gr.Column(scale=2):
                     train_audio = gr.Audio(
+                        label="Extrait de votre voix (WAV ou MP3, 3-30 secondes)",
                         type="filepath",
                         sources=["upload"],
                     )
                     train_model_name = gr.Textbox(
+                        label="Nom du profil",
                         placeholder="ex: ma_voix",
                         max_lines=1,
                     )
                     train_btn = gr.Button(
+                        "Sauvegarder",
                         variant="primary",
                         size="lg",
                     )
                     train_status = gr.Textbox(
                         label="Statut",
                         interactive=False,
+                        lines=3,
                     )
                     train_download = gr.File(
+                        label="Fichier de reference",
                         interactive=False,
                     )
             gr.Markdown(
                 "**Conseils :**\n"
                 "- Utilisez un enregistrement propre (pas de bruit de fond, pas de musique)\n"
+                "- Parlez ou chantez naturellement pendant 3 a 30 secondes\n"
+                "- Plus l'extrait est long et varie, meilleur sera le resultat\n"
+                "- Format WAV ou MP3 accepte"
             )
             train_btn.click(
                 fn=train_voice_model,
+                inputs=[train_audio, train_model_name],
                 outputs=[train_status, train_download],
             )
+        # Tab 2: Conversion
         with gr.TabItem("Convertir un morceau"):
+            gr.Markdown("### Remplacer la voix d'un morceau par la votre")
             with gr.Row():
                 with gr.Column(scale=2):
                     convert_model = gr.Dropdown(
                         choices=get_model_choices(),
+                        label="Profil vocal",
                         interactive=True,
                     )
+                    refresh_btn = gr.Button("Rafraichir la liste", size="sm")
                     convert_audio = gr.Audio(
+                        label="Morceau a convertir (WAV ou MP3)",
                         type="filepath",
                         sources=["upload"],
                     )
+                    with gr.Accordion("Parametres avances", open=False):
                         convert_pitch = gr.Slider(
+                            minimum=-24,
+                            maximum=24,
                             value=0,
                             step=1,
+                            label="Transposition (demi-tons)",
                         )
+                        convert_diffusion = gr.Slider(
+                            minimum=5,
+                            maximum=50,
+                            value=25,
+                            step=5,
+                            label="Qualite (plus haut = meilleure qualite, plus lent)",
                         )
                         convert_vocal_vol = gr.Slider(
                             minimum=0.0,
                         interactive=False,
                         lines=3,
                     )
+                    gr.Markdown("**Apercu des pistes :**")
                     preview_vocals = gr.Audio(
+                        label="Voix originale (separee)",
                         interactive=False,
                     )
                     preview_converted = gr.Audio(
                         label="Voix convertie",
                         interactive=False,
                     )
+                    gr.Markdown("**Resultat final :**")
                     final_output = gr.Audio(
                         label="Morceau final (voix + instruments)",
                         interactive=False,
                     convert_model,
                     convert_audio,
                     convert_pitch,
+                    convert_diffusion,
                     convert_vocal_vol,
                     convert_inst_vol,
                 ],
                 outputs=[convert_status, preview_vocals, preview_converted, final_output],
             )
+        # Tab 3: Models
+        with gr.TabItem("Mes modeles"):
+            gr.Markdown("### Gerer vos profils vocaux")
             models_table = gr.HTML(
                 value=refresh_models(),
+                label="Modeles enregistres",
             )
             with gr.Row():
+                models_refresh_btn = gr.Button("Rafraichir", size="sm")
                 models_delete_name = gr.Dropdown(
                     choices=get_model_choices(),
+                    label="Modele a supprimer",
                     interactive=True,
                 )
                 models_delete_btn = gr.Button("Supprimer", variant="stop", size="sm")
             models_delete_status = gr.Textbox(label="Statut", interactive=False)
             models_refresh_btn.click(
                 fn=refresh_models,
                 outputs=[models_table],
                 outputs=[models_delete_status, models_table],
             )
 if __name__ == "__main__":
     app.launch(server_name="0.0.0.0")

pipeline/inference.py CHANGED Viewed

@@ -1,16 +1,17 @@
 """
-Voice conversion module: manual RVC v2 inference pipeline.
-Uses HuBERT feature extraction + FAISS retrieval + pretrained generator.
-The voice identity comes from the FAISS index (target voice embeddings),
-not from fine-tuning the generator.
 """
 import os
 import sys
 import logging
 import numpy as np
 import torch
-import torch.nn.functional as F
 logger = logging.getLogger(__name__)
@@ -24,228 +25,165 @@ except ImportError:
                 return fn
             return decorator
-from pipeline.setup import APPLIO_DIR, ensure_applio_path
 OUTPUT_DIR = "/tmp/rvc_output"
-# Cache loaded models to avoid reloading on every call
-_cached_hubert = None
-_cached_generator = None
-_cached_rmvpe = None
-def _load_hubert(device):
-    """Load ContentVec HuBERT model for feature extraction."""
-    global _cached_hubert
-    if _cached_hubert is not None:
-        return _cached_hubert.to(device)
-    ensure_applio_path()
-    from rvc.lib.utils import load_embedding
-    model = load_embedding("contentvec", None)
-    model = model.to(device).float()
-    model.requires_grad_(False)
-    _cached_hubert = model
-    logger.info("Loaded ContentVec HuBERT model.")
-    return model
-def _load_generator(device, sample_rate=40000):
-    """Load pretrained RVC v2 generator (Synthesizer)."""
-    global _cached_generator
-    if _cached_generator is not None:
-        return _cached_generator.to(device)
-    ensure_applio_path()
-    from rvc.lib.algorithm.synthesizers import Synthesizer
-    sr_prefix = str(sample_rate)[:2]
-    model_path = os.path.join(
-        APPLIO_DIR, "rvc", "models", "pretraineds", "hifi-gan",
-        "f0G{}k.pth".format(sr_prefix),
-    )
-    if not os.path.exists(model_path):
-        raise RuntimeError("Pretrained generator not found: {}".format(model_path))
-    cpt = torch.load(model_path, map_location="cpu", weights_only=False)
-    # Training checkpoint has "model" key, inference format has "weight" key
-    weights = cpt.get("weight", cpt.get("model", cpt))
-    # Read config from Applio config files
-    import json
-    config_path = os.path.join(APPLIO_DIR, "configs", "v2", "{}k.json".format(sr_prefix))
-    if os.path.exists(config_path):
-        with open(config_path) as f:
-            cfg = json.load(f)
-        config_args = [
-            cfg["data"]["filter_length"] // 2 + 1,
-            cfg["train"]["segment_size"] // cfg["data"]["hop_length"],
-            cfg["model"]["inter_channels"],
-            cfg["model"]["hidden_channels"],
-            cfg["model"]["filter_channels"],
-            cfg["model"]["n_heads"],
-            cfg["model"]["n_layers"],
-            cfg["model"]["kernel_size"],
-            cfg["model"]["p_dropout"],
-            cfg["model"]["resblock"],
-            cfg["model"]["resblock_kernel_sizes"],
-            cfg["model"]["resblock_dilation_sizes"],
-            cfg["model"]["upsample_rates"],
-            cfg["model"]["upsample_initial_channel"],
-            cfg["model"]["upsample_kernel_sizes"],
-            cfg["model"]["spk_embed_dim"],
-            cfg["model"]["gin_channels"],
-            cfg["data"]["sampling_rate"],
-        ]
-        logger.info("Loaded generator config from Applio.")
-    else:
-        # Fallback: standard RVC v2 40k config
-        config_args = [
-            1025, 32, 192, 192, 768, 2, 6, 3, 0, "1",
-            [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
-            [10, 10, 2, 2], 512, [16, 16, 4, 4], 109, 256, 40000,
-        ]
-    net_g = Synthesizer(*config_args, use_f0=True)
-    net_g.load_state_dict(weights, strict=False)
-    net_g.requires_grad_(False)
-    net_g.to(device)
-    _cached_generator = net_g
-    logger.info("Loaded pretrained RVC generator.")
-    return net_g
-def _extract_f0(audio_np, sr, device):
-    """Extract F0 using RMVPE. Returns f0 numpy array."""
-    global _cached_rmvpe
-    ensure_applio_path()
-    rmvpe_path = os.path.join(
-        APPLIO_DIR, "rvc", "models", "predictors", "rmvpe.pt"
-    )
-    if os.path.exists(rmvpe_path):
-        try:
-            from rvc.lib.predictors.RMVPE import RMVPE0Predictor
-            if _cached_rmvpe is None:
-                _cached_rmvpe = RMVPE0Predictor(rmvpe_path, device=device)
-                logger.info("Loaded RMVPE predictor.")
-            f0 = _cached_rmvpe.infer_from_audio(audio_np, sample_rate=sr, thred=0.03)
-            return f0
-        except Exception as e:
-            logger.warning("RMVPE failed ({}), using torchcrepe fallback.".format(e))
-    # Fallback: torchcrepe
-    import torchcrepe
-    import librosa
-    audio_16k = librosa.resample(audio_np, orig_sr=sr, target_sr=16000) if sr != 16000 else audio_np
-    audio_t = torch.from_numpy(audio_16k).float().unsqueeze(0).to(device)
-    f0 = torchcrepe.predict(
-        audio_t, 16000, hop_length=160,
-        fmin=50, fmax=1100, model="full", device=device,
     )
-    return f0[0].cpu().numpy()
-def _quantize_f0(f0):
-    """Quantize F0 to mel-scale buckets (1-255). 0 = unvoiced."""
-    f0_mel = 1127.0 * np.log(1.0 + f0 / 700.0)
-    f0_mel_min = 1127.0 * np.log(1.0 + 1.0 / 700.0)
-    f0_mel_max = 1127.0 * np.log(1.0 + 1100.0 / 700.0)
-    f0_coarse = np.copy(f0_mel)
-    voiced = f0_coarse > 0
-    f0_coarse[voiced] = (
-        (f0_coarse[voiced] - f0_mel_min) * 254.0 / (f0_mel_max - f0_mel_min) + 1.0
     )
-    f0_coarse = np.clip(f0_coarse, 0, 255).astype(np.int64)
-    f0_coarse[~voiced] = 0
-    return f0_coarse
-def _faiss_retrieval(feats, index_path, big_npy_path, index_rate, device):
-    """
-    Retrieve target voice features from FAISS index and blend with source.
-    This is the core of retrieval-based voice conversion: the voice identity
-    comes from replacing source embeddings with target voice embeddings.
-    """
-    import faiss
-    index = faiss.read_index(index_path)
-    if index.ntotal == 0:
-        logger.warning("FAISS index is empty, skipping retrieval.")
-        return feats
-    # Load precomputed embeddings array
-    if big_npy_path and os.path.exists(big_npy_path):
-        big_npy = np.load(big_npy_path)
     else:
-        # Reconstruct from index (works for IndexFlatL2)
-        logger.info("No big_npy file found, reconstructing from index...")
-        dim = feats.shape[2]
-        big_npy = np.zeros((index.ntotal, dim), dtype=np.float32)
-        try:
-            for i in range(index.ntotal):
-                big_npy[i] = index.reconstruct(i)
-        except RuntimeError:
-            logger.warning("Cannot reconstruct vectors from index, skipping retrieval.")
-            return feats
-    npy = feats[0].cpu().numpy().astype(np.float32)
-    # Search k=8 nearest neighbors for each frame
-    score, ix = index.search(npy, k=8)
-    # Weight by inverse square distance
-    weight = np.square(1.0 / (score + 1e-6))
-    weight /= weight.sum(axis=1, keepdims=True)
-    # Weighted combination of nearest neighbor embeddings
-    retrieved = np.sum(big_npy[ix] * np.expand_dims(weight, axis=2), axis=1)
-    # Blend retrieved (target voice) with source features
-    retrieved_t = torch.from_numpy(retrieved).unsqueeze(0).to(device).float()
-    blended = index_rate * retrieved_t + (1.0 - index_rate) * feats
-    logger.info(
-        "FAISS retrieval done: {} vectors, index_rate={}".format(
-            index.ntotal, index_rate
-        )
-    )
-    return blended
-@spaces.GPU(duration=60)
 def convert_voice(
     audio_path,
-    model_path,
     index_path=None,
     pitch=0,
     f0_method="rmvpe",
-    index_rate=0.75,
     protect=0.33,
     volume_envelope=1.0,
     output_format="WAV",
 ):
     """
-    Convert voice using the full RVC v2 pipeline:
-    1. Extract HuBERT features from source audio
-    2. Retrieve target voice features from FAISS index
-    3. Extract F0 pitch and apply shift
-    4. Run pretrained generator to synthesize converted audio
     Returns path to converted audio file.
     """
-    import librosa
     import soundfile as sf
     os.makedirs(OUTPUT_DIR, exist_ok=True)
@@ -253,92 +191,171 @@ def convert_voice(
     output_path = os.path.join(OUTPUT_DIR, "{}_converted.wav".format(base_name))
     device = "cuda" if torch.cuda.is_available() else "cpu"
-    logger.info("Converting voice on {}: {}".format(device, audio_path))
-    logger.info("Index: {}, Pitch: {}, Index rate: {}".format(index_path, pitch, index_rate))
-    ensure_applio_path()
-    # Load source audio at 16kHz for HuBERT and F0
-    audio_16k, _ = librosa.load(audio_path, sr=16000, mono=True)
-    logger.info("Source audio: {:.1f}s".format(len(audio_16k) / 16000))
-    if len(audio_16k) < 16000 * 0.5:
-        raise RuntimeError("Audio source trop court pour la conversion (< 0.5s).")
-    # ---- Step 1: Extract HuBERT features ----
-    hubert = _load_hubert(device)
-    feats_input = torch.from_numpy(audio_16k).float().view(1, -1).to(device)
-    with torch.no_grad():
-        feats = hubert(feats_input)["last_hidden_state"]  # (1, T_50hz, 768)
-    # Upsample 2x to match F0 frame rate (50Hz -> 100Hz)
-    feats = F.interpolate(
-        feats.permute(0, 2, 1), scale_factor=2
-    ).permute(0, 2, 1)  # (1, T_100hz, 768)
-    # Keep a copy for protect blending
-    feats0 = feats.clone()
-    # ---- Step 2: FAISS retrieval ----
-    if index_path and os.path.exists(index_path):
-        big_npy_path = index_path.replace(".index", "_big_npy.npy")
-        feats = _faiss_retrieval(feats, index_path, big_npy_path, index_rate, device)
-    # Apply protect: blend original features for consonants/unvoiced parts
-    if protect < 0.5 and feats0 is not None:
-        feats = protect * feats0 + (1.0 - protect) * feats
-    # ---- Step 3: Extract F0 ----
-    f0 = _extract_f0(audio_16k, 16000, device)
-    # Apply pitch shift (in semitones)
-    if pitch != 0:
-        f0 = f0.copy()
-        voiced = f0 > 0
-        f0[voiced] *= 2.0 ** (pitch / 12.0)
-    # ---- Step 4: Match lengths ----
-    # Target: 100Hz frame rate = 16000 / 160 = 100 frames/sec
-    p_len = len(audio_16k) // 160
-    p_len = min(p_len, feats.shape[1])
-    # Interpolate F0 to match p_len if needed
-    if len(f0) != p_len:
-        f0 = np.interp(
-            np.linspace(0, len(f0) - 1, p_len),
-            np.arange(len(f0)),
-            f0,
         )
-    # Trim features to p_len
-    feats = feats[:, :p_len, :]
-    # Quantize F0 and convert to tensors
-    f0_coarse = _quantize_f0(f0)
-    pitch_t = torch.tensor(f0_coarse, device=device).unsqueeze(0).long()
-    pitchf_t = torch.tensor(f0, device=device).unsqueeze(0).float()
-    p_len_t = torch.tensor([p_len], device=device).long()
-    sid = torch.tensor([0], device=device).long()
-    # ---- Step 5: Generator inference ----
-    net_g = _load_generator(device, sample_rate=40000)
-    with torch.no_grad():
-        result = net_g.infer(feats.float(), p_len_t, pitch_t, pitchf_t, sid)
-        audio_out = result[0][0, 0].data.cpu().float().numpy()
-    # ---- Step 6: Post-processing ----
     # Normalize
     audio_max = np.abs(audio_out).max()
     if audio_max > 0.01:
         audio_out = audio_out / audio_max * 0.95
-    # Resample 40kHz -> 44.1kHz for standard output
-    audio_44k = librosa.resample(audio_out, orig_sr=40000, target_sr=44100)
-    # Save as WAV 16-bit
-    sf.write(output_path, audio_44k, 44100, subtype="PCM_16")
-    logger.info("Conversion complete: {} ({:.1f}s)".format(output_path, len(audio_44k) / 44100))
     return output_path
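
Review note: both versions of `convert_voice` transpose by scaling voiced F0 by `2 ** (pitch / 12)`, and the Seed-VC version additionally moves the source's median log-F0 onto the reference's before transposing. The following is a minimal pure-Python sketch of that idea, not the code from this commit. The function name `align_and_shift_f0` and the `voiced_threshold` parameter are assumptions made for illustration; the threshold of 1 Hz mirrors the `F0 > 1` test used in the diff.

```python
import math
from statistics import median


def align_and_shift_f0(source_f0, reference_f0, semitones=0, voiced_threshold=1.0):
    """Move the source's median pitch onto the reference's, then transpose.

    Frames at or below `voiced_threshold` Hz count as unvoiced and stay 0.0.
    (Hypothetical helper: names and defaults are illustrative assumptions.)
    """
    src_log = [math.log(f) for f in source_f0 if f > voiced_threshold]
    ref_log = [math.log(f) for f in reference_f0 if f > voiced_threshold]

    # Without voiced frames on both sides there is nothing to align against.
    offset = median(ref_log) - median(src_log) if src_log and ref_log else 0.0

    # Adding an offset in the log domain equals multiplying in Hz;
    # one semitone is a factor of 2 ** (1 / 12).
    factor = math.exp(offset) * 2.0 ** (semitones / 12.0)
    return [f * factor if f > voiced_threshold else 0.0 for f in source_f0]
```

Working in the log domain keeps musical intervals intact: every voiced frame is multiplied by the same ratio, so the melody is preserved and only its register changes.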

 """
+Voice conversion module using Seed-VC (zero-shot diffusion transformer).
+No training needed - just reference audio + source audio.
+Uses the singing voice conversion model with F0 conditioning.
 """
 import os
 import sys
 import logging
+import argparse
 import numpy as np
 import torch
+import torchaudio
+import librosa
 logger = logging.getLogger(__name__)
                 return fn
             return decorator
+from pipeline.setup import SEED_VC_DIR, ensure_seed_vc_path
 OUTPUT_DIR = "/tmp/rvc_output"
+# Cached models (loaded once, reused across calls)
+_model_cache = {}
+def _load_seed_vc_models(device):
+    """Load Seed-VC singing voice conversion models."""
+    if "model" in _model_cache:
+        return _model_cache
+    ensure_seed_vc_path()
+    # Import Seed-VC's model loading utilities
+    from modules.commons import recursive_munch, build_model, load_checkpoint
+    from hf_utils import load_custom_model_from_hf
+    import yaml
+    # Load the singing model (F0-conditioned, whisper-base, 44kHz, BigVGAN)
+    dit_checkpoint_path, dit_config_path = load_custom_model_from_hf(
+        "Plachta/Seed-VC",
+        "DiT_seed_v2_uvit_whisper_base_f0_44k_bigvgan_pruned_ft_ema_v2.pth",
+        "config_dit_mel_seed_uvit_whisper_base_f0_44k.yml",
     )
+    with open(dit_config_path, "r") as f:
+        config = yaml.safe_load(f)
+    model_params = recursive_munch(config["model_params"])
+    model = build_model(model_params, stage="DiT")
+    # Load checkpoint
+    model, _, _, _ = load_checkpoint(
+        model, None, dit_checkpoint_path,
+        load_only_params=True, ignore_modules=[], is_distributed=False,
     )
+    for key in model:
+        model[key].eval()
+        model[key].to(device)
+    # FP16 for efficiency
+    for key in model:
+        if hasattr(model[key], "half"):
+            model[key] = model[key].half()
+    # Load speech tokenizer (Whisper)
+    from modules.speech_tokenizers.whisper.whisper_enc import WhisperSpeechEncoder
+    speech_tokenizer_type = config.get("model_params", {}).get(
+        "speech_tokenizer", {}
+    ).get("type", "whisper")
+    whisper_name = model_params.speech_tokenizer.get("name", "whisper-small")
+    whisper_model = WhisperSpeechEncoder.load_model(whisper_name).to(device).eval()
+    if hasattr(whisper_model, "half"):
+        whisper_model = whisper_model.half()
+    def semantic_fn(waves_16k):
+        wav = waves_16k.to(device).half()
+        if wav.dim() == 1:
+            wav = wav.unsqueeze(0)
+        with torch.no_grad():
+            return whisper_model.extract_features(wav)
+    # Load vocoder (BigVGAN)
+    vocoder_type = config.get("model_params", {}).get("vocoder", {}).get("type", "bigvgan")
+    if vocoder_type == "bigvgan":
+        from modules.bigvgan import bigvgan
+        vocoder_path = os.path.join(SEED_VC_DIR, "modules", "bigvgan")
+        vocoder = bigvgan.BigVGAN.from_pretrained(
+            "nvidia/bigvgan_v2_44khz_128band_512x", use_cuda_kernel=False
+        )
+        vocoder = vocoder.to(device).eval()
+        if hasattr(vocoder, "half"):
+            vocoder = vocoder.half()
+        def vocoder_fn(mel):
+            with torch.no_grad():
+                return vocoder(mel.half())
     else:
+        from modules.vocoder import load_vocoder
+        vocoder = load_vocoder(vocoder_type, config).to(device).eval()
+        def vocoder_fn(mel):
+            with torch.no_grad():
+                return vocoder(mel)
+    # Load CAMPPlus speaker embedding model
+    from modules.campplus.DTDNN import CAMPPlus
+    campplus_ckpt_path = load_custom_model_from_hf(
+        "funasr/campplus", "campplus_cn_common.bin", config_filename=None
+    )
+    if isinstance(campplus_ckpt_path, tuple):
+        campplus_ckpt_path = campplus_ckpt_path[0]
+    campplus_model = CAMPPlus(feat_dim=80, embedding_size=192)
+    campplus_model.load_state_dict(torch.load(campplus_ckpt_path, map_location="cpu"))
+    campplus_model = campplus_model.to(device).eval().half()
+    # Load F0 extractor (RMVPE)
+    from modules.rmvpe import RMVPE
+    rmvpe_path = load_custom_model_from_hf("lj1995/VoiceConversionWebUI", "rmvpe.pt", config_filename=None)
+    if isinstance(rmvpe_path, tuple):
+        rmvpe_path = rmvpe_path[0]
+    f0_extractor = RMVPE(rmvpe_path, is_half=True, device=device)
+    def f0_fn(wav, thred=0.03):
+        return f0_extractor.infer_from_audio(wav, thred=thred)
+    # Mel spectrogram config
+    from modules.commons import build_mel_fn
+    mel_fn_args = config["preprocess_params"]["spect_params"]
+    to_mel = build_mel_fn(mel_fn_args)
+    sr = config["preprocess_params"]["sr"]
+    hop_length = mel_fn_args["hop_length"]
+    _model_cache.update({
+        "model": model,
+        "semantic_fn": semantic_fn,
+        "vocoder_fn": vocoder_fn,
+        "campplus_model": campplus_model,
+        "f0_fn": f0_fn,
+        "to_mel": to_mel,
+        "sr": sr,
+        "hop_length": hop_length,
+        "device": device,
+        "max_context_window": model_params.DiT.max_context_window,
+        "overlap_frame_len": 16,
+    })
+    logger.info(f"Seed-VC models loaded (sr={sr}, hop={hop_length})")
+    return _model_cache
+@spaces.GPU(duration=120)
 def convert_voice(
     audio_path,
+    reference_path,
     index_path=None,
     pitch=0,
     f0_method="rmvpe",
+    index_rate=0.7,
     protect=0.33,
     volume_envelope=1.0,
     output_format="WAV",
+    diffusion_steps=25,
 ):
     """
+    Convert voice using Seed-VC zero-shot singing voice conversion.
+    Args:
+        audio_path: Path to source vocals (separated by Demucs)
+        reference_path: Path to reference voice audio (3-30 sec)
+        pitch: Semitone shift (-24 to +24)
+        index_rate: Forwarded to the model as inference_cfg_rate (name kept from the RVC signature)
+        diffusion_steps: Quality vs speed trade-off (10=fast, 30=quality)
     Returns path to converted audio file.
     """
     import soundfile as sf
     os.makedirs(OUTPUT_DIR, exist_ok=True)
     output_path = os.path.join(OUTPUT_DIR, "{}_converted.wav".format(base_name))
     device = "cuda" if torch.cuda.is_available() else "cpu"
+    logger.info("Converting voice with Seed-VC on {}".format(device))
+    logger.info("Source: {}, Reference: {}, Pitch: {}".format(audio_path, reference_path, pitch))
+    # Load models
+    cache = _load_seed_vc_models(device)
+    model = cache["model"]
+    semantic_fn = cache["semantic_fn"]
+    vocoder_fn = cache["vocoder_fn"]
+    campplus_model = cache["campplus_model"]
+    f0_fn = cache["f0_fn"]
+    to_mel = cache["to_mel"]
+    sr = cache["sr"]
+    hop_length = cache["hop_length"]
+    max_context_window = cache["max_context_window"]
+    overlap_frame_len = cache["overlap_frame_len"]
+    # Load source audio
+    source_audio = librosa.load(audio_path, sr=sr)[0]
+    source_audio = torch.tensor(source_audio).unsqueeze(0).float().to(device)
+    # Load reference audio
+    ref_audio = librosa.load(reference_path, sr=sr)[0]
+    # Limit reference to 30 seconds
+    max_ref_samples = 30 * sr
+    if len(ref_audio) > max_ref_samples:
+        ref_audio = ref_audio[:max_ref_samples]
+    ref_audio = torch.tensor(ref_audio).unsqueeze(0).float().to(device)
+    # Resample to 16kHz for speech tokenizer
+    source_16k = torchaudio.functional.resample(source_audio, sr, 16000)
+    ref_16k = torchaudio.functional.resample(ref_audio, sr, 16000)
+    # Extract semantic tokens
+    S_alt = semantic_fn(source_16k[0])
+    S_ori = semantic_fn(ref_16k[0])
+    # Extract mel spectrograms
+    mel_source = to_mel(source_audio.to(device))
+    mel_ref = to_mel(ref_audio.to(device))
+    target_lengths = torch.LongTensor([mel_ref.size(2)]).to(device)
+    # Extract speaker embedding from reference
+    feat_ref = torchaudio.compliance.kaldi.fbank(
+        ref_16k[0].unsqueeze(0) if ref_16k.dim() == 2 else ref_16k,
+        num_mel_bins=80, sample_frequency=16000,
+        dither=0, window_type="hamming",
+    )
+    feat_ref = feat_ref - feat_ref.mean(dim=0, keepdim=True)
+    style_ref = campplus_model(feat_ref.unsqueeze(0).half().to(device))
+    # Extract F0 for singing
+    F0_ori = f0_fn(ref_16k[0].cpu().numpy(), thred=0.03)
+    F0_alt = f0_fn(source_16k[0].cpu().numpy(), thred=0.03)
+    F0_ori = torch.tensor(F0_ori).to(device).float()
+    F0_alt = torch.tensor(F0_alt).to(device).float()
+    # Auto-adjust F0 to match reference pitch range
+    voiced_ori = F0_ori > 1
+    voiced_alt = F0_alt > 1
+    log_f0_alt = torch.zeros_like(F0_alt)
+    log_f0_alt[voiced_alt] = torch.log(F0_alt[voiced_alt])
+    shifted_log_f0_alt = log_f0_alt.clone()
+    if voiced_ori.any() and voiced_alt.any():
+        median_log_f0_ori = torch.log(F0_ori[voiced_ori]).median()
+        median_log_f0_alt = log_f0_alt[voiced_alt].median()
+        shifted_log_f0_alt[voiced_alt] = (
+            log_f0_alt[voiced_alt] - median_log_f0_alt + median_log_f0_ori
         )
+    shifted_f0_alt = torch.zeros_like(F0_alt)
+    shifted_f0_alt[voiced_alt] = torch.exp(shifted_log_f0_alt[voiced_alt])
+    # Apply semitone pitch shift
+    if pitch != 0:
+        factor = 2.0 ** (pitch / 12.0)
+        shifted_f0_alt[voiced_alt] = shifted_f0_alt[voiced_alt] * factor
+    # Process in chunks with crossfading
+    cond = model["DiT"].prepare_concat(S_alt, mel_source)
+    # Chunk sizing: frames left for the source after the reference prompt, and overlap in samples
+    max_source_window = max_context_window - mel_ref.size(2)
+    overlap_wave_len = overlap_frame_len * hop_length
+    # Interpolate F0 to match mel frames
+    n_mel_frames = cond.size(1)
+    if len(shifted_f0_alt) != n_mel_frames:
+        shifted_f0_alt_interp = torch.nn.functional.interpolate(
+            shifted_f0_alt.unsqueeze(0).unsqueeze(0),
+            size=n_mel_frames, mode="nearest",
+        ).squeeze()
+    else:
+        shifted_f0_alt_interp = shifted_f0_alt
+    # Generate in chunks
+    generated_wave_chunks = []
+    processed_frames = 0
+    while processed_frames < cond.size(1):
+        chunk_end = min(processed_frames + max_source_window, cond.size(1))
+        chunk_cond = cond[:, processed_frames:chunk_end]
+        chunk_f0 = shifted_f0_alt_interp[processed_frames:chunk_end]
+        # Concatenate reference mel with source chunk
+        cat_condition = torch.cat([mel_ref, chunk_cond], dim=2)
+        cat_f0 = torch.cat([
+            torch.zeros(mel_ref.size(2)).to(device),
+            chunk_f0,
+        ])
+        with torch.no_grad():
+            vc_target = model["cfm"].inference(
+                cat_condition.half(),
+                torch.LongTensor([cat_condition.size(2)]).to(device),
+                mel_ref.half(),
+                style_ref,
+                cat_f0.unsqueeze(0).half(),
+                diffusion_steps,
+                inference_cfg_rate=index_rate,
+            )
+            vc_target = vc_target[:, :, mel_ref.size(2):]
+        # Vocoder
+        vc_wave = vocoder_fn(vc_target.float())
+        if generated_wave_chunks:
+            # Crossfade with previous chunk
+            prev = generated_wave_chunks[-1]
+            if overlap_wave_len > 0 and len(prev) >= overlap_wave_len:
+                cross_len = min(overlap_wave_len, vc_wave.size(-1))
+                fade_out = np.linspace(1, 0, cross_len)
+                fade_in = np.linspace(0, 1, cross_len)
+                prev_np = prev  # chunks are stored as NumPy arrays
+                new_np = vc_wave[0].cpu().float().numpy()
+                prev_np[-cross_len:] = (
+                    prev_np[-cross_len:] * fade_out + new_np[:cross_len] * fade_in
+                )
+                generated_wave_chunks.append(new_np[cross_len:])
+            else:
+                generated_wave_chunks.append(vc_wave[0].cpu().float().numpy())
+        else:
+            generated_wave_chunks.append(vc_wave[0].cpu().float().numpy())
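+        # Stop once the final frame has been generated; stepping back by the
+        # overlap after the last chunk would re-enter the loop indefinitely.
+        if chunk_end >= cond.size(1):
+            break
+        processed_frames = chunk_end - overlap_frame_len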
+    # Concatenate all chunks
+    audio_out = np.concatenate(generated_wave_chunks)
     # Normalize
     audio_max = np.abs(audio_out).max()
     if audio_max > 0.01:
         audio_out = audio_out / audio_max * 0.95
+    # Resample to 44.1kHz if needed and save
+    if sr != 44100:
+        audio_out = librosa.resample(audio_out, orig_sr=sr, target_sr=44100)
+    sf.write(output_path, audio_out, 44100, subtype="PCM_16")
+    logger.info("Conversion complete: {} ({:.1f}s)".format(
+        output_path, len(audio_out) / 44100
+    ))
     return output_path
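
Review note: the chunked generation in `convert_voice` combines two ideas, namely overlapping frame windows and a linear crossfade at each seam. The sketch below isolates both in pure Python so they can be checked without a GPU. It is an illustration under stated assumptions, not the Seed-VC implementation: `plan_chunks` and `crossfade_append` are hypothetical names, audio is represented as plain lists of floats, and the window planner stops as soon as a chunk reaches the last frame.

```python
def plan_chunks(total_frames, window, overlap):
    """Return (start, end) frame ranges covering total_frames.

    Consecutive ranges share `overlap` frames. (Hypothetical helper.)
    """
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    if total_frames <= 0:
        return []
    spans, start = [], 0
    while True:
        end = min(start + window, total_frames)
        spans.append((start, end))
        if end >= total_frames:
            return spans  # last frame covered, so stop here
        start = end - overlap  # step back so the next chunk overlaps this one


def crossfade_append(previous, new, overlap):
    """Blend the tail of `previous` into the head of `new` with linear ramps."""
    n = min(overlap, len(previous), len(new))
    if n == 0:
        return previous + new
    blended = []
    for i in range(n):
        w = i / (n - 1) if n > 1 else 0.0  # ramps from 0 to 1 across the overlap
        blended.append(previous[len(previous) - n + i] * (1.0 - w) + new[i] * w)
    return previous[:len(previous) - n] + blended + new[n:]
```

Because the fade-out and fade-in weights sum to 1 at every sample, a signal that is identical in both chunks passes through the seam unchanged, and the joined result is `overlap` samples shorter than the two chunks laid end to end.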

pipeline/setup.py CHANGED Viewed

@@ -1,5 +1,6 @@
 """
-Setup module: clones Applio at startup and downloads pretrained models.
 """
 import os
@@ -9,134 +10,50 @@ import logging
 logger = logging.getLogger(__name__)
-APPLIO_DIR = "/tmp/Applio"
-APPLIO_REPO = "https://github.com/IAHispano/Applio.git"
-# Pretrained model URLs from HuggingFace
-HF_BASE_URL = "https://huggingface.co/IAHispano/Applio/resolve/main/Resources"
-REQUIRED_MODELS = {
-    # Pretrained v2 (HiFi-GAN) for 40k sample rate
-    "rvc/models/pretraineds/hifi-gan/f0G40k.pth": "pretrained_v2/f0G40k.pth",
-    "rvc/models/pretraineds/hifi-gan/f0D40k.pth": "pretrained_v2/f0D40k.pth",
-    # RMVPE pitch extractor
-    "rvc/models/predictors/rmvpe.pt": "predictors/rmvpe.pt",
-    # ContentVec embedder
-    "rvc/models/embedders/contentvec/pytorch_model.bin": "embedders/contentvec/pytorch_model.bin",
-    "rvc/models/embedders/contentvec/config.json": "embedders/contentvec/config.json",
-}
-def clone_applio():
-    """Clone Applio repository if not already present."""
-    if os.path.exists(os.path.join(APPLIO_DIR, "core.py")):
-        logger.info("Applio already cloned.")
         return True
-    logger.info("Cloning Applio repository...")
     try:
         subprocess.run(
-            ["git", "clone", "--depth", "1", APPLIO_REPO, APPLIO_DIR],
-            check=True,
-            capture_output=True,
-            text=True,
         )
-        logger.info("Applio cloned successfully.")
         return True
     except subprocess.CalledProcessError as e:
-        logger.error(f"Failed to clone Applio: {e.stderr}")
         return False
-def download_pretrained(local_path, remote_path):
-    """Download a single pretrained model file if not present."""
-    full_path = os.path.join(APPLIO_DIR, local_path)
-    if os.path.exists(full_path):
-        return True
-    os.makedirs(os.path.dirname(full_path), exist_ok=True)
-    url = f"{HF_BASE_URL}/{remote_path}"
-    logger.info(f"Downloading {remote_path}...")
-    try:
-        import requests
-        response = requests.get(url, stream=True, timeout=(10, 120))
-        response.raise_for_status()
-        with open(full_path, "wb") as f:
-            for chunk in response.iter_content(chunk_size=8192):
-                f.write(chunk)
-        logger.info(f"Downloaded {remote_path}")
-        return True
     except Exception as e:
-        logger.error(f"Failed to download {remote_path}: {e}")
-        return False
-def create_mute_files():
-    """Create mute audio files needed for training filelist generation."""
-    import numpy as np
-    from scipy.io import wavfile
-    sample_rate = 40000
-    mute_dir = os.path.join(APPLIO_DIR, "logs", "mute")
-    for subdir in ["sliced_audios", "sliced_audios_16k", "f0", "f0_voiced", "extracted"]:
-        os.makedirs(os.path.join(mute_dir, subdir), exist_ok=True)
-    # Create mute wav files
-    duration_samples = int(sample_rate * 0.4)
-    mute_audio = np.zeros(duration_samples, dtype=np.float32)
-    wavfile.write(
-        os.path.join(mute_dir, "sliced_audios", f"mute{sample_rate}.wav"),
-        sample_rate,
-        mute_audio,
-    )
-    wavfile.write(
-        os.path.join(mute_dir, "sliced_audios_16k", f"mute{16000}.wav"),
-        16000,
-        np.zeros(int(16000 * 0.4), dtype=np.float32),
-    )
-    # Create mute feature files
-    mute_f0 = np.zeros(int(16000 * 0.4 / 160), dtype=np.float32)
-    np.save(os.path.join(mute_dir, "f0", "mute.wav.npy"), mute_f0)
-    np.save(os.path.join(mute_dir, "f0_voiced", "mute.wav.npy"), mute_f0)
-    # Create mute embedding (768-dim contentvec)
-    mute_embed = np.zeros((int(16000 * 0.4 / 320), 768), dtype=np.float32)
-    np.save(os.path.join(mute_dir, "extracted", "mute.npy"), mute_embed)
-    logger.info("Mute files created.")
-def setup_applio():
-    """Full setup: clone + download models + create mute files."""
-    if not clone_applio():
-        raise RuntimeError("Failed to clone Applio")
-    # Add Applio to Python path
-    if APPLIO_DIR not in sys.path:
-        sys.path.insert(0, APPLIO_DIR)
-    # Download required models
-    all_ok = True
-    for local_path, remote_path in REQUIRED_MODELS.items():
-        if not download_pretrained(local_path, remote_path):
-            all_ok = False
-    if not all_ok:
-        logger.warning("Some models failed to download. Training may not work.")
-    # Create mute files for training
-    create_mute_files()
-    logger.info("Applio setup complete.")
     return True
-def ensure_applio_path():
-    """Ensure Applio is on the Python path."""
-    if APPLIO_DIR not in sys.path:
-        sys.path.insert(0, APPLIO_DIR)

 """
+Setup module: clone Seed-VC repo at startup.
+Seed-VC downloads its own pretrained models from HuggingFace on first use.
 """
 import os
 logger = logging.getLogger(__name__)
+SEED_VC_DIR = "/tmp/seed-vc"
+SEED_VC_REPO = "https://github.com/Plachtaa/seed-vc.git"
+def clone_seed_vc():
+    """Clone Seed-VC repository if not already present."""
+    if os.path.exists(os.path.join(SEED_VC_DIR, "inference.py")):
+        logger.info("Seed-VC already cloned.")
         return True
+    logger.info("Cloning Seed-VC repository...")
     try:
         subprocess.run(
+            ["git", "clone", "--depth", "1", SEED_VC_REPO, SEED_VC_DIR],
+            check=True, capture_output=True, text=True,
         )
+        logger.info("Seed-VC cloned successfully.")
         return True
     except subprocess.CalledProcessError as e:
+        logger.error(f"Failed to clone Seed-VC: {e.stderr}")
         return False
+def ensure_seed_vc_path():
+    """Ensure Seed-VC is on the Python path."""
+    if SEED_VC_DIR not in sys.path:
+        sys.path.insert(0, SEED_VC_DIR)
+def setup_seed_vc():
+    """Full setup: clone repo + add to path."""
+    if not clone_seed_vc():
+        raise RuntimeError("Failed to clone Seed-VC")
+    ensure_seed_vc_path()
+    # Install Seed-VC dependencies that might not be in our requirements.txt
+    try:
+        subprocess.run(
+            [sys.executable, "-m", "pip", "install", "-q",
+             "descript-audio-codec", "vocos", "bigvgan"],
+            check=True, capture_output=True, text=True, timeout=120,
+        )
     except Exception as e:
+        logger.warning(f"Some Seed-VC deps may be missing: {e}")
+    logger.info("Seed-VC setup complete.")
     return True
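
Note on the best-effort install step in `setup_seed_vc`: `subprocess.run` raises on a non-zero exit status only when `check=True` is passed, so a failed command can otherwise go unnoticed. The sketch below shows one way to report failure explicitly by inspecting the return code. The helper name `run_step` is hypothetical and not part of this commit, and the commands are harmless stand-ins rather than a real `pip install`.

```python
import subprocess
import sys


def run_step(cmd, timeout=120):
    """Run a command and return (ok, stderr) instead of raising."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing executable or timeout: report as a failed step.
        return False, str(exc)
    # A non-zero exit status is a failure even though run() did not raise.
    return result.returncode == 0, result.stderr


# Stand-in commands: one exits cleanly, one fails with a message on stderr.
ok_pass, _ = run_step([sys.executable, "-c", "raise SystemExit(0)"])
ok_fail, err = run_step(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
)
```

A caller could log `err` at warning level when `ok` is false, which keeps the step non-fatal while still making the failure visible.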

pipeline/storage.py CHANGED

@@ -1,5 +1,5 @@
 """
-Model storage module: persist trained RVC models to HuggingFace Dataset repo.
 """
 import os
@@ -8,64 +8,63 @@ from datetime import datetime
 logger = logging.getLogger(__name__)
-# Will be set from environment or app config
 MODELS_REPO_ID = None
 LOCAL_MODELS_DIR = "/tmp/rvc_models"
-def init_storage(repo_id: str):
     """Initialize storage with the HF dataset repo ID."""
     global MODELS_REPO_ID
     MODELS_REPO_ID = repo_id
     os.makedirs(LOCAL_MODELS_DIR, exist_ok=True)
-    logger.info(f"Storage initialized with repo: {repo_id}")
-def upload_model(model_name: str, pth_path: str, index_path: str = None, big_npy_path: str = None):
-    """Upload trained model files to HF dataset repo."""
     if not MODELS_REPO_ID:
         logger.warning("No HF repo configured. Model saved locally only.")
         return False
     try:
         from huggingface_hub import HfApi
         api = HfApi()
-        # Upload .pth file
-        api.upload_file(
-            path_or_fileobj=pth_path,
-            path_in_repo=f"models/{model_name}/{model_name}.pth",
-            repo_id=MODELS_REPO_ID,
-            repo_type="dataset",
-        )
-        logger.info(f"Uploaded {model_name}.pth to HF")
-        # Upload .index file if exists
-        if index_path and os.path.exists(index_path):
             api.upload_file(
-                path_or_fileobj=index_path,
-                path_in_repo=f"models/{model_name}/{model_name}.index",
                 repo_id=MODELS_REPO_ID,
                 repo_type="dataset",
             )
-            logger.info(f"Uploaded {model_name}.index to HF")
-        # Upload big_npy embeddings if exists
-        if big_npy_path and os.path.exists(big_npy_path):
             api.upload_file(
-                path_or_fileobj=big_npy_path,
-                path_in_repo=f"models/{model_name}/{model_name}_big_npy.npy",
                 repo_id=MODELS_REPO_ID,
                 repo_type="dataset",
             )
-            logger.info(f"Uploaded {model_name}_big_npy.npy to HF")
         # Upload metadata
         metadata = {
             "name": model_name,
             "created": datetime.now().isoformat(),
-            "sample_rate": 40000,
         }
         import json
         import tempfile
@@ -77,7 +76,7 @@ def upload_model(model_name: str, pth_path: str, index_path: str = None, big_npy
         try:
             api.upload_file(
                 path_or_fileobj=meta_path,
-                path_in_repo=f"models/{model_name}/metadata.json",
                 repo_id=MODELS_REPO_ID,
                 repo_type="dataset",
             )
@@ -86,14 +85,13 @@ def upload_model(model_name: str, pth_path: str, index_path: str = None, big_npy
         return True
     except Exception as e:
-        logger.error(f"Failed to upload model: {e}")
         return False
-def download_model(model_name: str):
-    """Download model from HF dataset repo. Returns (pth_path, index_path)."""
     if not MODELS_REPO_ID:
-        # Try local
         return _get_local_model(model_name)
     try:
@@ -105,58 +103,69 @@ def download_model(model_name: str):
         pth_path = hf_hub_download(
             repo_id=MODELS_REPO_ID,
             repo_type="dataset",
-            filename=f"models/{model_name}/{model_name}.pth",
             local_dir=local_dir,
         )
-        index_path = None
-        try:
-            index_path = hf_hub_download(
-                repo_id=MODELS_REPO_ID,
-                repo_type="dataset",
-                filename=f"models/{model_name}/{model_name}.index",
-                local_dir=local_dir,
-            )
-        except Exception:
-            pass  # Index file is optional
-        # Download big_npy embeddings (for FAISS retrieval)
         try:
-            hf_hub_download(
                 repo_id=MODELS_REPO_ID,
                 repo_type="dataset",
-                filename=f"models/{model_name}/{model_name}_big_npy.npy",
                 local_dir=local_dir,
             )
         except Exception:
-            pass  # Will reconstruct from index if missing
-        return pth_path, index_path
     except Exception as e:
-        logger.error(f"Failed to download model from HF: {e}")
         return _get_local_model(model_name)
-def _get_local_model(model_name: str):
     """Get model from local storage."""
     local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
-    pth_path = os.path.join(local_dir, f"{model_name}.pth")
-    index_path = os.path.join(local_dir, f"{model_name}.index")
     if os.path.exists(pth_path):
-        return pth_path, index_path if os.path.exists(index_path) else None
     return None, None
 def list_models():
-    """List all available models (from HF repo + local)."""
     models = set()
-    # Check HF repo
     if MODELS_REPO_ID:
         try:
             from huggingface_hub import HfApi
             api = HfApi()
             files = api.list_repo_files(MODELS_REPO_ID, repo_type="dataset")
             for f in files:
@@ -165,43 +174,37 @@ def list_models():
                     if len(parts) >= 3:
                         models.add(parts[1])
         except Exception as e:
-            logger.error(f"Failed to list models from HF: {e}")
-    # Check local models
     if os.path.exists(LOCAL_MODELS_DIR):
         for name in os.listdir(LOCAL_MODELS_DIR):
             model_dir = os.path.join(LOCAL_MODELS_DIR, name)
             if os.path.isdir(model_dir):
-                pth = os.path.join(model_dir, f"{name}.pth")
                 if os.path.exists(pth):
                     models.add(name)
     return sorted(models)
-def delete_model(model_name: str):
     """Delete a model from HF repo and local storage."""
-    # Delete from HF
     if MODELS_REPO_ID:
         try:
             from huggingface_hub import HfApi
             api = HfApi()
-            # Delete the entire model folder
             files = api.list_repo_files(MODELS_REPO_ID, repo_type="dataset")
             for f in files:
-                if f.startswith(f"models/{model_name}/"):
                     api.delete_file(f, MODELS_REPO_ID, repo_type="dataset")
-            logger.info(f"Deleted {model_name} from HF repo")
         except Exception as e:
-            logger.error(f"Failed to delete from HF: {e}")
-    # Delete local
     import shutil
     local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
     if os.path.exists(local_dir):
         shutil.rmtree(local_dir)
-        logger.info(f"Deleted {model_name} from local storage")
     return True

 """
+Model storage module: persist voice reference files to HuggingFace Dataset repo.
 """
 import os
 logger = logging.getLogger(__name__)
 MODELS_REPO_ID = None
 LOCAL_MODELS_DIR = "/tmp/rvc_models"
+def init_storage(repo_id):
     """Initialize storage with the HF dataset repo ID."""
     global MODELS_REPO_ID
     MODELS_REPO_ID = repo_id
     os.makedirs(LOCAL_MODELS_DIR, exist_ok=True)
+    logger.info("Storage initialized with repo: {}".format(repo_id))
+def upload_model(model_name, pth_path, index_path=None, big_npy_path=None, reference_path=None):
+    """Upload model files to HF dataset repo."""
     if not MODELS_REPO_ID:
         logger.warning("No HF repo configured. Model saved locally only.")
         return False
     try:
         from huggingface_hub import HfApi
         api = HfApi()
+        # Upload .pth marker
+        if pth_path and os.path.exists(pth_path):
+            api.upload_file(
+                path_or_fileobj=pth_path,
+                path_in_repo="models/{}/{}.pth".format(model_name, model_name),
+                repo_id=MODELS_REPO_ID,
+                repo_type="dataset",
+            )
+            logger.info("Uploaded {}.pth to HF".format(model_name))
+        # Upload reference audio
+        if reference_path and os.path.exists(reference_path):
+            ref_filename = os.path.basename(reference_path)
             api.upload_file(
+                path_or_fileobj=reference_path,
+                path_in_repo="models/{}/{}".format(model_name, ref_filename),
                 repo_id=MODELS_REPO_ID,
                 repo_type="dataset",
             )
+            logger.info("Uploaded {} to HF".format(ref_filename))
+        # Upload .index file if exists (backward compat)
+        if index_path and os.path.exists(index_path):
             api.upload_file(
+                path_or_fileobj=index_path,
+                path_in_repo="models/{}/{}.index".format(model_name, model_name),
                 repo_id=MODELS_REPO_ID,
                 repo_type="dataset",
             )
         # Upload metadata
         metadata = {
             "name": model_name,
             "created": datetime.now().isoformat(),
+            "engine": "seed-vc",
         }
         import json
         import tempfile
         try:
             api.upload_file(
                 path_or_fileobj=meta_path,
+                path_in_repo="models/{}/metadata.json".format(model_name),
                 repo_id=MODELS_REPO_ID,
                 repo_type="dataset",
             )
         return True
     except Exception as e:
+        logger.error("Failed to upload model: {}".format(e))
         return False
+def download_model(model_name):
+    """Download model from HF dataset repo. Returns (pth_path, reference_path)."""
     if not MODELS_REPO_ID:
         return _get_local_model(model_name)
     try:
         pth_path = hf_hub_download(
             repo_id=MODELS_REPO_ID,
             repo_type="dataset",
+            filename="models/{}/{}.pth".format(model_name, model_name),
             local_dir=local_dir,
         )
+        # Try to download reference audio
+        ref_path = None
         try:
+            ref_path = hf_hub_download(
                 repo_id=MODELS_REPO_ID,
                 repo_type="dataset",
+                filename="models/{}/{}_ref.wav".format(model_name, model_name),
                 local_dir=local_dir,
             )
         except Exception:
+            # Try .index for backward compat with old RVC models
+            try:
+                ref_path = hf_hub_download(
+                    repo_id=MODELS_REPO_ID,
+                    repo_type="dataset",
+                    filename="models/{}/{}.index".format(model_name, model_name),
+                    local_dir=local_dir,
+                )
+            except Exception:
+                pass
+        return pth_path, ref_path
     except Exception as e:
+        logger.error("Failed to download model from HF: {}".format(e))
         return _get_local_model(model_name)
+def _get_local_model(model_name):
     """Get model from local storage."""
     local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
+    pth_path = os.path.join(local_dir, "{}.pth".format(model_name))
+    ref_path = os.path.join(local_dir, "{}_ref.wav".format(model_name))
     if os.path.exists(pth_path):
+        return pth_path, ref_path if os.path.exists(ref_path) else None
     return None, None
+def get_reference_path(model_name):
+    """Get the reference audio path for a model."""
+    local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
+    ref_path = os.path.join(local_dir, "{}_ref.wav".format(model_name))
+    if os.path.exists(ref_path):
+        return ref_path
+    # Search in subdirectories (HF download structure)
+    for root, dirs, files in os.walk(local_dir):
+        for f in files:
+            if f.endswith("_ref.wav"):
+                return os.path.join(root, f)
+    return None
 def list_models():
+    """List all available models."""
     models = set()
     if MODELS_REPO_ID:
         try:
             from huggingface_hub import HfApi
             api = HfApi()
             files = api.list_repo_files(MODELS_REPO_ID, repo_type="dataset")
             for f in files:
                     if len(parts) >= 3:
                         models.add(parts[1])
         except Exception as e:
+            logger.error("Failed to list models from HF: {}".format(e))
     if os.path.exists(LOCAL_MODELS_DIR):
         for name in os.listdir(LOCAL_MODELS_DIR):
             model_dir = os.path.join(LOCAL_MODELS_DIR, name)
             if os.path.isdir(model_dir):
+                pth = os.path.join(model_dir, "{}.pth".format(name))
                 if os.path.exists(pth):
                     models.add(name)
     return sorted(models)
+def delete_model(model_name):
     """Delete a model from HF repo and local storage."""
     if MODELS_REPO_ID:
         try:
             from huggingface_hub import HfApi
             api = HfApi()
             files = api.list_repo_files(MODELS_REPO_ID, repo_type="dataset")
             for f in files:
+                if f.startswith("models/{}/".format(model_name)):
                     api.delete_file(f, MODELS_REPO_ID, repo_type="dataset")
+            logger.info("Deleted {} from HF repo".format(model_name))
         except Exception as e:
+            logger.error("Failed to delete from HF: {}".format(e))
     import shutil
     local_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
     if os.path.exists(local_dir):
         shutil.rmtree(local_dir)
+        logger.info("Deleted {} from local storage".format(model_name))
     return True
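
Note on `get_reference_path`: the lookup first tries the flat local layout and then walks subdirectories, since files fetched with `local_dir` keep their in-repo path (`models/<name>/...`). The sketch below reproduces that two-stage lookup against a temporary directory. The function name `find_reference` and its `models_dir` parameter are assumptions made for illustration; the module itself uses the `LOCAL_MODELS_DIR` global.

```python
import os
import tempfile


def find_reference(models_dir, model_name):
    """Return the path of <model_name>_ref.wav, or None if it is absent."""
    local_dir = os.path.join(models_dir, model_name)
    direct = os.path.join(local_dir, "{}_ref.wav".format(model_name))
    if os.path.exists(direct):
        return direct
    # Fall back to a recursive search (nested download layout).
    for root, _dirs, files in os.walk(local_dir):
        for fname in sorted(files):
            if fname.endswith("_ref.wav"):
                return os.path.join(root, fname)
    return None


def demo():
    """Build a nested layout in a temp dir and look up two model names."""
    with tempfile.TemporaryDirectory() as models_dir:
        nested = os.path.join(models_dir, "alice", "models", "alice")
        os.makedirs(nested)
        open(os.path.join(nested, "alice_ref.wav"), "wb").close()
        found = find_reference(models_dir, "alice")
        missing = find_reference(models_dir, "bob")
        return os.path.relpath(found, models_dir), missing
```

`os.walk` yields nothing for a directory that does not exist, so an unknown model name falls through to `None` without an explicit existence check.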

pipeline/training.py CHANGED

@@ -1,18 +1,12 @@
 """
-Training pipeline: wraps Applio's preprocess, extract, and train steps.
-All GPU-intensive operations run IN-PROCESS under @spaces.GPU decorators.
-Uses runpy.run_path to execute Applio scripts in the current process,
-ensuring ZeroGPU's GPU allocation is visible to the training code.
 """
 import os
-import sys
-import runpy
-import subprocess
 import logging
 import shutil
-import time
-import glob
 logger = logging.getLogger(__name__)
@@ -27,518 +21,105 @@ except ImportError:
             return decorator
-from pipeline.setup import APPLIO_DIR
-LOGS_DIR = os.path.join(APPLIO_DIR, "logs")
-# Prevent "context has already been set" errors from Applio/torch
-# by neutralizing mp.set_start_method calls
-import multiprocessing as _mp
-_orig_set_start_method = _mp.set_start_method
-def _safe_set_start_method(method=None, force=False):
-    try:
-        _orig_set_start_method(method, force=True)
-    except RuntimeError:
-        pass
-_mp.set_start_method = _safe_set_start_method
-def _setup_applio_env():
-    """Ensure Applio is on sys.path."""
-    if APPLIO_DIR not in sys.path:
-        sys.path.insert(0, APPLIO_DIR)
-    train_dir = os.path.join(APPLIO_DIR, "rvc", "train")
-    if train_dir not in sys.path:
-        sys.path.insert(0, train_dir)
-def preprocess(model_name: str, audio_path: str, sample_rate: int = 40000):
-    """
-    Preprocess audio: load, normalize, slice into segments, save at target SR and 16kHz.
-    Custom implementation (no Applio subprocess dependency).
-    """
-    import numpy as np
-    import librosa
-    import soundfile as sf
-    exp_dir = os.path.join(LOGS_DIR, model_name)
-    sliced_dir = os.path.join(exp_dir, "sliced_audios")
-    sliced_16k_dir = os.path.join(exp_dir, "sliced_audios_16k")
-    os.makedirs(sliced_dir, exist_ok=True)
-    os.makedirs(sliced_16k_dir, exist_ok=True)
-    logger.info(f"Preprocessing {audio_path} for model {model_name}...")
-    # Load audio at target sample rate
-    audio, sr = librosa.load(audio_path, sr=sample_rate, mono=True)
-    logger.info(f"Loaded audio: {len(audio)} samples, {len(audio)/sr:.1f}s at {sr}Hz")
-    if len(audio) < sr * 1:
-        raise RuntimeError("Audio trop court (< 1 seconde).")
-    # Normalize
-    peak = np.abs(audio).max()
-    if peak > 0:
-        audio = audio / peak * 0.95
-    # Also load at 16kHz
-    audio_16k, _ = librosa.load(audio_path, sr=16000, mono=True)
-    peak_16k = np.abs(audio_16k).max()
-    if peak_16k > 0:
-        audio_16k = audio_16k / peak_16k * 0.95
-    # Slice into segments of ~3.5 seconds with 0.5s overlap
-    segment_len = int(3.5 * sr)
-    hop = int(3.0 * sr)  # 3.5 - 0.5 overlap
-    segment_len_16k = int(3.5 * 16000)
-    hop_16k = int(3.0 * 16000)
-    MAX_SLICES = 40  # Balance quality vs GPU time (60s ZeroGPU limit)
-    n_slices = 0
-    idx = 0
-    while idx < len(audio) and n_slices < MAX_SLICES:
-        # Slice at target sample rate
-        end = min(idx + segment_len, len(audio))
-        segment = audio[idx:end]
-        # Skip very short segments (< 0.5s)
-        if len(segment) < int(0.5 * sr):
-            idx += hop
-            continue
-        # Skip silent segments
-        if np.abs(segment).max() < 0.01:
-            idx += hop
-            continue
-        # Compute corresponding 16k positions
-        ratio = 16000 / sr
-        idx_16k = int(idx * ratio)
-        end_16k = int(end * ratio)
-        segment_16k = audio_16k[idx_16k:min(end_16k, len(audio_16k))]
-        # Save slices
-        fname = f"{model_name}_{n_slices:04d}.wav"
-        sf.write(os.path.join(sliced_dir, fname), segment, sr)
-        sf.write(os.path.join(sliced_16k_dir, fname), segment_16k, 16000)
-        n_slices += 1
-        idx += hop
-    logger.info(f"Preprocessing complete: {n_slices} slices created.")
-    if n_slices == 0:
-        raise RuntimeError("Preprocessing produced no audio slices. L'audio est peut-être silencieux.")
-    return n_slices
-@spaces.GPU(duration=60)
-def extract_features(model_name: str, sample_rate: int = 40000, f0_method: str = "rmvpe"):
-    """
-    Extract F0 pitch and HuBERT embeddings.
-    Runs IN-PROCESS to access ZeroGPU's GPU allocation.
-    """
     import torch
-    import numpy as np
-    _setup_applio_env()
-    old_cwd = os.getcwd()
-    os.chdir(APPLIO_DIR)
-    try:
-        exp_dir = os.path.join(LOGS_DIR, model_name)
-        wav_path = os.path.join(exp_dir, "sliced_audios_16k")
-        os.makedirs(os.path.join(exp_dir, "f0"), exist_ok=True)
-        os.makedirs(os.path.join(exp_dir, "f0_voiced"), exist_ok=True)
-        os.makedirs(os.path.join(exp_dir, "extracted"), exist_ok=True)
-        files = []
-        for wav_file in sorted(glob.glob(os.path.join(wav_path, "*.wav"))):
-            file_name = os.path.basename(wav_file)
-            files.append([
-                wav_file,
-                os.path.join(exp_dir, "f0", file_name + ".npy"),
-                os.path.join(exp_dir, "f0_voiced", file_name + ".npy"),
-                os.path.join(exp_dir, "extracted", file_name.replace("wav", "npy")),
-            ])
-        if not files:
-            raise RuntimeError("No preprocessed audio files found for feature extraction.")
-        device = "cuda:0" if torch.cuda.is_available() else "cpu"
-        # F0 extraction
-        logger.info(f"Extracting F0 with {f0_method} on {device}...")
-        from rvc.train.extract.extract import FeatureInput
-        fe = FeatureInput(f0_method=f0_method, device=device)
-        for file_info in files:
-            fe.process_file(file_info)
-        # HuBERT embedding extraction
-        logger.info(f"Extracting embeddings on {device}...")
-        from rvc.lib.utils import load_audio_16k, load_embedding
-        emb_model = load_embedding("contentvec", None).to(device).float()
-        for file_info in files:
-            wav_file_path, _, _, out_file_path = file_info
-            if os.path.exists(out_file_path):
-                continue
-            feats = torch.from_numpy(load_audio_16k(wav_file_path)).to(device).float()
-            feats = feats.view(1, -1)
-            with torch.no_grad():
-                emb_result = emb_model(feats)["last_hidden_state"]
-            feats_out = emb_result.squeeze(0).float().cpu().numpy()
-            if not np.isnan(feats_out).any():
-                np.save(out_file_path, feats_out, allow_pickle=False)
-        # Save embedder model info
-        import json
-        model_info_path = os.path.join(exp_dir, "model_info.json")
-        model_info = {}
-        if os.path.exists(model_info_path):
-            with open(model_info_path, "r") as f:
-                model_info = json.load(f)
-        model_info["embedder_model"] = "contentvec"
-        with open(model_info_path, "w") as f:
-            json.dump(model_info, f, indent=4)
-        # Generate config and filelist
-        from rvc.train.extract.preparing_files import generate_config, generate_filelist
-        generate_config(sample_rate, exp_dir)
-        generate_filelist(exp_dir, sample_rate, include_mutes=2)
-        # Verify output
-        if len(os.listdir(os.path.join(exp_dir, "extracted"))) == 0:
-            raise RuntimeError("Feature extraction produced no embeddings.")
-        if len(os.listdir(os.path.join(exp_dir, "f0"))) == 0:
-            raise RuntimeError("F0 extraction produced no pitch files.")
-        logger.info("Feature extraction complete.")
-        return True
-    finally:
-        os.chdir(old_cwd)
-@spaces.GPU(duration=60)
-def train_model(
-    model_name: str,
-    sample_rate: int = 40000,
-    total_epochs: int = 20,
-    batch_size: int = 8,
 ):
     """
-    Train RVC v2 model. Runs IN-PROCESS with mp.Process patched to avoid
-    spawning child processes (which can't access ZeroGPU's GPU).
-    Limited to 60s on ZeroGPU (see the @spaces.GPU(duration=60) decorator).
-    """
-    import torch.multiprocessing as mp
-    import json
-    _setup_applio_env()
-    # Ensure assets/config.json exists (Applio reads precision from it)
-    assets_dir = os.path.join(APPLIO_DIR, "assets")
-    os.makedirs(assets_dir, exist_ok=True)
-    config_json = os.path.join(assets_dir, "config.json")
-    if not os.path.exists(config_json):
-        with open(config_json, "w") as f:
-            json.dump({"precision": "fp32"}, f)
-    # Select pretrained models
-    sr_prefix = str(sample_rate)[:2]
-    pg = os.path.join(APPLIO_DIR, "rvc", "models", "pretraineds", "hifi-gan", f"f0G{sr_prefix}k.pth")
-    pd = os.path.join(APPLIO_DIR, "rvc", "models", "pretraineds", "hifi-gan", f"f0D{sr_prefix}k.pth")
-    if not os.path.exists(pg) or not os.path.exists(pd):
-        logger.warning("Pretrained models not found, training from scratch.")
-        pg, pd = "", ""
-    # Patch mp.Process to run inline (single GPU only)
-    OrigProcess = mp.Process
-    class InlineProcess:
-        """Runs target function inline instead of spawning a new process."""
-        def __init__(self, target=None, args=(), kwargs=None, **kw):
-            self.target = target
-            self.args = args
-            self.kwargs = kwargs or {}
-            self.pid = os.getpid()
-        def start(self):
-            if self.target:
-                self.target(*self.args, **self.kwargs)
-        def join(self):
-            pass
-    train_script = os.path.join(APPLIO_DIR, "rvc", "train", "train.py")
-    argv_args = [
-        model_name,
-        str(total_epochs), str(total_epochs),
-        pg, pd,
-        "0", str(batch_size), str(sample_rate),
-        "True", "True", "False", "False", "50", "False", "HiFi-GAN", "False",
-    ]
-    logger.info(f"Training {model_name} for {total_epochs} epochs (in-process)...")
-    start_time = time.time()
-    old_argv = sys.argv
-    old_cwd = os.getcwd()
-    mp.Process = InlineProcess
-    try:
-        os.chdir(APPLIO_DIR)
-        sys.argv = [train_script] + argv_args
-        runpy.run_path(train_script, run_name="__main__")
-    except SystemExit as e:
-        if e.code not in (0, None):
-            raise RuntimeError(f"Training exited with code {e.code}")
-    finally:
-        mp.Process = OrigProcess
-        sys.argv = old_argv
-        os.chdir(old_cwd)
-    elapsed = time.time() - start_time
-    logger.info(f"Training completed in {elapsed:.1f}s")
-    return True
-def build_index(model_name: str):
-    """Build FAISS index from extracted embeddings."""
-    import numpy as np
-    try:
-        import faiss
-    except ImportError:
-        logger.warning("faiss not available, skipping index building.")
-        return None
-    exp_dir = os.path.join(LOGS_DIR, model_name)
-    extracted_dir = os.path.join(exp_dir, "extracted")
-    if not os.path.exists(extracted_dir):
-        logger.warning("No extracted features found for index building.")
-        return None
-    # Load all embeddings
-    embeddings = []
-    for npy_file in sorted(glob.glob(os.path.join(extracted_dir, "*.npy"))):
-        try:
-            emb = np.load(npy_file)
-            if emb.ndim == 2:
-                embeddings.append(emb)
-        except Exception as e:
-            logger.warning(f"Failed to load {npy_file}: {e}")
-    if not embeddings:
-        logger.warning("No valid embeddings found for index.")
-        return None
-    all_emb = np.concatenate(embeddings, axis=0).astype(np.float32)
-    logger.info(f"Building FAISS index from {all_emb.shape[0]} vectors ({all_emb.shape[1]}D)...")
-    # Build IVF index for fast retrieval
-    dim = all_emb.shape[1]
-    n_vectors = all_emb.shape[0]
-    if n_vectors < 40:
-        # Too few vectors for IVF, use flat index
-        index = faiss.IndexFlatL2(dim)
-    else:
-        n_clusters = min(int(np.sqrt(n_vectors)), n_vectors // 4)
-        n_clusters = max(n_clusters, 1)
-        quantizer = faiss.IndexFlatL2(dim)
-        index = faiss.IndexIVFFlat(quantizer, dim, n_clusters)
-        index.train(all_emb)
-    index.add(all_emb)
-    index_path = os.path.join(exp_dir, f"{model_name}.index")
-    faiss.write_index(index, index_path)
-    # Save raw embeddings for FAISS retrieval at inference time
-    big_npy_path = os.path.join(exp_dir, f"{model_name}_big_npy.npy")
-    np.save(big_npy_path, all_emb)
-    logger.info(f"FAISS index built: {index_path} ({n_vectors} vectors)")
-    return index_path, big_npy_path
-def find_trained_model(model_name: str):
-    """Find the trained .pth model file."""
-    exp_dir = os.path.join(LOGS_DIR, model_name)
-    if os.path.exists(exp_dir):
-        exact = os.path.join(exp_dir, f"{model_name}.pth")
-        if os.path.exists(exact):
-            return exact
-        for f in sorted(os.listdir(exp_dir), reverse=True):
-            if f.endswith(".pth") and f.startswith(model_name):
-                return os.path.join(exp_dir, f)
-    return None
-def find_pretrained_model(sample_rate: int = 40000):
-    """Find the pre-trained RVC generator model."""
-    sr_prefix = str(sample_rate)[:2]
-    pg = os.path.join(APPLIO_DIR, "rvc", "models", "pretraineds", "hifi-gan", f"f0G{sr_prefix}k.pth")
-    if os.path.exists(pg):
-        return pg
-    return None
-def _convert_to_inference_model(checkpoint_path, output_path, sample_rate=40000):
-    """
-    Convert a pretrained training checkpoint to RVC inference format.
-    Training checkpoints have keys: model, optimizer, iteration, learning_rate
-    Inference models need keys: weight, config, info, sr, f0, version
     """
-    import torch
-    import json
-    checkpoint = torch.load(checkpoint_path, map_location="cpu")
-    # Extract generator weights
-    if "model" in checkpoint:
-        state_dict = checkpoint["model"]
-    elif "state_dict" in checkpoint:
-        state_dict = checkpoint["state_dict"]
-    else:
-        state_dict = checkpoint
-    # Remove "module." prefix if present (from DataParallel)
-    weight = {}
-    for k, v in state_dict.items():
-        new_key = k.replace("module.", "")
-        weight[new_key] = v.half()
-    # Read config from Applio config file
-    sr_label = "40k" if sample_rate == 40000 else "48k"
-    config_path = os.path.join(APPLIO_DIR, "configs", "v2", f"{sr_label}.json")
-    if os.path.exists(config_path):
-        with open(config_path) as f:
-            cfg = json.load(f)
-        config = [
-            cfg["data"]["filter_length"] // 2 + 1,
-            cfg["train"]["segment_size"] // cfg["data"]["hop_length"],
-            cfg["model"]["inter_channels"],
-            cfg["model"]["hidden_channels"],
-            cfg["model"]["filter_channels"],
-            cfg["model"]["n_heads"],
-            cfg["model"]["n_layers"],
-            cfg["model"]["kernel_size"],
-            cfg["model"]["p_dropout"],
-            cfg["model"]["resblock"],
-            cfg["model"]["resblock_kernel_sizes"],
-            cfg["model"]["resblock_dilation_sizes"],
-            cfg["model"]["upsample_rates"],
-            cfg["model"]["upsample_initial_channel"],
-            cfg["model"]["upsample_kernel_sizes"],
-            cfg["model"]["spk_embed_dim"],
-            cfg["model"]["gin_channels"],
-            cfg["data"]["sampling_rate"],
-        ]
-    else:
-        # Fallback: standard RVC v2 40k config
-        config = [
-            1025, 32, 192, 192, 768, 2, 6, 3, 0, "1",
-            [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
-            [10, 10, 2, 2], 512, [16, 16, 4, 4], 109, 256, 40000,
-        ]
-    inference_model = {
-        "weight": weight,
-        "config": config,
-        "info": f"v2_{sr_label}",
-        "sr": sr_label,
-        "f0": 1,
-        "version": "v2",
-    }
-    torch.save(inference_model, output_path)
-    logger.info(f"Converted checkpoint to inference format: {output_path}")
-    return output_path
-def full_training_pipeline(
-    audio_path: str,
-    model_name: str,
-    epochs: int = 10,
-    sample_rate: int = 40000,
-    batch_size: int = 4,
-    progress_callback=None,
-):
-    """
-    Run the voice model creation pipeline.
-    On CPU: skips heavy HiFi-GAN training, uses pre-trained model + FAISS index.
-    Returns (pth_path, index_path) on success.
-    """
-    import torch
-    from pipeline.storage import upload_model, LOCAL_MODELS_DIR
-    has_gpu = torch.cuda.is_available()
     if progress_callback:
-        progress_callback(0.05, "Découpage de l'audio...")
-    n_slices = preprocess(model_name, audio_path, sample_rate)
-    if progress_callback:
-        progress_callback(0.20, f"{n_slices} segments créés. Extraction des caractéristiques vocales...")
-    extract_features(model_name, sample_rate)
-    if progress_callback:
-        progress_callback(0.60, "Caractéristiques extraites. Construction de l'index vocal...")
-    # Build FAISS index (fast, CPU-friendly)
-    result = build_index(model_name)
-    if result is None:
-        raise RuntimeError("Impossible de construire l'index FAISS. Pas d'embeddings extraits.")
-    index_path, big_npy_path = result
-    # The user's "model" is the FAISS index + embeddings.
-    # The pretrained generator is shared by all models (loaded at inference time).
-    # Voice identity comes from FAISS retrieval, not generator fine-tuning.
     if progress_callback:
-        progress_callback(0.75, "Finalisation du modèle vocal...")
     # Save to local models directory
     local_model_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
     os.makedirs(local_model_dir, exist_ok=True)
-    # Save FAISS index
-    local_index = os.path.join(local_model_dir, f"{model_name}.index")
-    shutil.copy2(index_path, local_index)
-    # Save big_npy embeddings (needed for FAISS retrieval at inference)
-    local_big_npy = os.path.join(local_model_dir, f"{model_name}_big_npy.npy")
-    shutil.copy2(big_npy_path, local_big_npy)
-    # Create a minimal model marker file (no actual model weights needed)
-    local_pth = os.path.join(local_model_dir, f"{model_name}.pth")
-    torch.save({"type": "faiss_voice_model", "sample_rate": sample_rate}, local_pth)
     if progress_callback:
-        progress_callback(0.90, "Sauvegarde du modèle...")
     try:
-        upload_model(model_name, local_pth, local_index, local_big_npy)
     except Exception as e:
-        logger.warning(f"Failed to upload to HF (non-critical): {e}")
     if progress_callback:
-        progress_callback(1.0, "Modèle vocal créé !")
-    return local_pth, local_index

 """
+Voice model creation: save a reference audio clip for Seed-VC zero-shot conversion.
+No neural network training needed - Seed-VC uses in-context learning from
+reference audio at inference time.
 """
 import os
 import logging
 import shutil
 logger = logging.getLogger(__name__)
             return decorator
+# Dummy GPU-decorated function so ZeroGPU detects a GPU function at startup
+@spaces.GPU(duration=10)
+def _gpu_warmup():
+    """Minimal GPU function for ZeroGPU detection."""
     import torch
+    return torch.cuda.is_available()
+def save_voice_reference(
+    audio_path,
+    model_name,
+    progress_callback=None,
 ):
     """
+    Save a voice reference audio clip as the user's 'voice model'.
+    With Seed-VC, no training is needed. The reference audio (3-30 seconds)
+    is used directly at inference time for zero-shot voice conversion.
+    Args:
+        audio_path: Path to the uploaded voice recording
+        model_name: Name for the voice model
+        progress_callback: Optional callback for progress updates
+    Returns:
+        (marker_path, reference_path) - path to the .pth marker file and path to the saved reference WAV
     """
+    import librosa
+    import soundfile as sf
+    import numpy as np
+    from pipeline.storage import LOCAL_MODELS_DIR, upload_model
+    if progress_callback:
+        progress_callback(0.1, "Chargement de l'audio...")
+    # Load and preprocess the reference audio
+    audio, sr = librosa.load(audio_path, sr=44100, mono=True)
+    duration = len(audio) / sr
+    logger.info("Reference audio: {:.1f}s at {}Hz".format(duration, sr))
+    # Reject clips shorter than 2 s; the error message quotes the recommended 3 s minimum
+    if duration < 2.0:
+        raise RuntimeError(
+            "Audio trop court ({:.1f}s). Minimum 3 secondes recommande.".format(duration)
+        )
+    # Limit to 30 seconds (Seed-VC max reference length)
+    max_samples = 30 * sr
+    if len(audio) > max_samples:
+        audio = audio[:max_samples]
+        logger.info("Trimmed reference to 30s (Seed-VC max).")
     if progress_callback:
+        progress_callback(0.3, "Normalisation et nettoyage...")
+    # Normalize audio
+    peak = np.abs(audio).max()
+    if peak > 0:
+        audio = audio / peak * 0.95
+    # Trim silence from start and end
+    audio_trimmed, _ = librosa.effects.trim(audio, top_db=25)
+    if len(audio_trimmed) > sr * 2:
+        audio = audio_trimmed
     if progress_callback:
+        progress_callback(0.6, "Sauvegarde de la reference vocale...")
     # Save to local models directory
     local_model_dir = os.path.join(LOCAL_MODELS_DIR, model_name)
     os.makedirs(local_model_dir, exist_ok=True)
+    reference_path = os.path.join(local_model_dir, "{}_ref.wav".format(model_name))
+    sf.write(reference_path, audio, 44100, subtype="PCM_16")
+    # Also save a .pth marker for compatibility with storage/listing
+    import torch
+    marker_path = os.path.join(local_model_dir, "{}.pth".format(model_name))
+    torch.save({
+        "type": "seed_vc_reference",
+        "reference_audio": "{}_ref.wav".format(model_name),
+        "duration": len(audio) / sr,
+        "sample_rate": 44100,
+    }, marker_path)
     if progress_callback:
+        progress_callback(0.8, "Upload vers HuggingFace...")
+    # Upload to HF
     try:
+        upload_model(model_name, marker_path, reference_path=reference_path)
     except Exception as e:
+        logger.warning("Failed to upload to HF (non-critical): {}".format(e))
     if progress_callback:
+        progress_callback(1.0, "Reference vocale sauvegardee !")
+    final_duration = len(audio) / sr
+    logger.info("Voice reference saved: {} ({:.1f}s)".format(reference_path, final_duration))
+    return marker_path, reference_path
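
The length check and peak normalization in `save_voice_reference` can be sketched without any audio dependencies. The version below is only an illustration in plain Python: the helper names `reference_samples_to_keep` and `peak_normalize` are hypothetical and not part of the commit. The 2 s floor, 30 s cap, and 0.95 target peak are taken from the diff above. The real function works on NumPy arrays loaded by librosa, not on lists.

```python
def reference_samples_to_keep(n_samples, sample_rate, min_seconds=2.0, max_seconds=30.0):
    """Return how many samples of the reference clip to keep.

    Hypothetical helper mirroring the checks in the diff: clips shorter than
    min_seconds are rejected, longer clips are cut at max_seconds.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration = n_samples / sample_rate
    if duration < min_seconds:
        raise RuntimeError("Audio too short ({:.1f}s).".format(duration))
    max_samples = int(max_seconds * sample_rate)
    return min(n_samples, max_samples)


def peak_normalize(samples, target_peak=0.95):
    """Scale samples so the largest absolute value equals target_peak.

    Silent input (peak of 0) is returned unchanged, as in the diff.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)
    return [s / peak * target_peak for s in samples]


if __name__ == "__main__":
    sr = 44100
    # A 45 s clip is cut down to the 30 s cap.
    print(reference_samples_to_keep(45 * sr, sr))
    # The loudest sample is scaled to the 0.95 target.
    print(peak_normalize([0.5, -0.25, 0.0]))
```

Keeping the duration logic in a function that takes only a sample count makes the 2 s and 30 s boundaries easy to unit-test without loading audio.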

requirements.txt CHANGED

@@ -13,31 +13,23 @@ soundfile==0.12.1
 scipy>=1.11.0
 numpy<2.0
 soxr
-noisereduce
 ffmpeg-python>=0.2.0
 pedalboard
-# RVC dependencies
-faiss-cpu==1.9.0.post1
-torchcrepe
-torchfcpe
-einops
-transformers==4.44.2
 # Demucs (stem separation)
 demucs
-# Pitch extraction
-praat-parselmouth
-# ML utilities
-tqdm
 pyyaml
 requests
 numba
-# Misc
-tensorboard
-tensorboardX
-stftpitchshift
-wget

 scipy>=1.11.0
 numpy<2.0
 soxr
 ffmpeg-python>=0.2.0
 pedalboard
 # Demucs (stem separation)
 demucs
+# Seed-VC dependencies
+einops
+transformers>=4.40.0
 pyyaml
 requests
+tqdm
 numba
+# Vocoder
+bigvgan
+# Audio codec & vocos (used by Seed-VC)
+descript-audio-codec
+vocos