Spaces: Harshil748/VoiceAPI


Harshil748 committed on Mar 12

Commit 51b23f6 (1 parent: b0dbe7f)

Add voice cloning endpoint and XTTS model integration


Files changed (3)

ARCHITECTURE.md +238 -0
src/api.py +100 -0
src/engine.py +106 -11

ARCHITECTURE.md ADDED

	@@ -0,0 +1,238 @@

+# 🏗️ VoiceAPI System Architecture
+## High-Level System Diagram
+```mermaid
+flowchart TB
+    subgraph Client["📱 Client Applications"]
+        Web["🌐 Web App"]
+        Mobile["📱 Mobile App"]
+        Healthcare["🏥 Healthcare Assistant"]
+    end
+    subgraph API["🚀 FastAPI Server (Port 7860)"]
+        Endpoint["/Get_Inference API"]
+        LangRouter["Language Router"]
+    end
+    subgraph Engine["⚙️ TTS Engine"]
+        Normalizer["Text Normalizer"]
+        Tokenizer["Tokenizer"]
+        StyleProc["Style Processor"]
+        subgraph Models["Model Types"]
+            VITS["VITS JIT Models\n(.pt files)"]
+            Coqui["Coqui TTS\n(.pth files)"]
+            MMS["Facebook MMS\n(HuggingFace)"]
+        end
+    end
+    subgraph Languages["🗣️ 11 Languages"]
+        Hindi["🇮🇳 Hindi"]
+        Bengali["🇧🇩 Bengali"]
+        Marathi["Marathi"]
+        Telugu["Telugu"]
+        Kannada["Kannada"]
+        Gujarati["Gujarati"]
+        Bhojpuri["Bhojpuri"]
+        Others["+ 4 more"]
+    end
+    subgraph Output["🔊 Audio Output"]
+        WAV["WAV File\n22050 Hz"]
+    end
+    Client -->|HTTP GET/POST| Endpoint
+    Endpoint -->|text, lang| LangRouter
+    LangRouter --> Normalizer
+    Normalizer --> Tokenizer
+    Tokenizer --> Models
+    VITS --> StyleProc
+    Coqui --> StyleProc
+    MMS --> StyleProc
+    StyleProc --> WAV
+    WAV -->|Response| Client
+    Models --> Languages
+```
+## Data Flow Diagram
+```mermaid
+sequenceDiagram
+    participant C as Client
+    participant A as API Server
+    participant E as TTS Engine
+    participant M as Model
+    participant S as Style Processor
+    C->>A: GET /Get_Inference?text=नमस्ते&lang=hindi
+    A->>A: Parse parameters
+    A->>E: synthesize(text, voice)
+    E->>E: Normalize text
+    E->>E: Tokenize to IDs
+    E->>M: Load model (if not cached)
+    M->>M: Forward pass (inference)
+    M-->>E: Raw audio tensor
+    E->>S: Apply style (pitch, speed, energy)
+    S-->>E: Processed audio
+    E-->>A: TTSOutput (audio, sample_rate)
+    A->>A: Convert to WAV bytes
+    A-->>C: audio/wav response
+```
+## Model Architecture
+```mermaid
+flowchart LR
+    subgraph Input["📝 Input"]
+        Text["Text Input"]
+    end
+    subgraph TextEncoder["🔤 Text Encoder"]
+        Embed["Character Embedding"]
+        TransEnc["Transformer Encoder\n(6 layers, 192 hidden)"]
+    end
+    subgraph FlowModel["🌊 Flow Model"]
+        Prior["Prior Encoder"]
+        Flow["Normalizing Flow"]
+        Duration["Duration Predictor"]
+    end
+    subgraph Decoder["🔊 HiFi-GAN Decoder"]
+        Upsample["Upsampling Layers"]
+        ResBlocks["Residual Blocks"]
+        Output["Audio Waveform"]
+    end
+    Text --> Embed --> TransEnc
+    TransEnc --> Prior
+    TransEnc --> Duration
+    Prior --> Flow
+    Duration --> Flow
+    Flow --> Upsample --> ResBlocks --> Output
+```
+## Training Pipeline
+```mermaid
+flowchart TD
+    subgraph Data["📊 Training Data"]
+        OpenSLR["OpenSLR Datasets"]
+        CommonVoice["Mozilla Common Voice"]
+        IndicTTS["IndicTTS Corpus"]
+        AI4Bharat["AI4Bharat Indic-Voices"]
+    end
+    subgraph Prep["🔧 Data Preparation"]
+        Download["Download Audio"]
+        Normalize["Normalize to 22050 Hz"]
+        Transcript["Generate Transcripts"]
+        Split["Train/Val Split"]
+    end
+    subgraph Train["🏋️ Training"]
+        Config["Load Config YAML"]
+        VITS_Train["VITS Training\n(1000 epochs)"]
+        Checkpoint["Save Checkpoints"]
+    end
+    subgraph Export["📦 Export"]
+        JIT["JIT Trace Model"]
+        Chars["Generate chars.txt"]
+        Package["Package for Inference"]
+    end
+    Data --> Download --> Normalize --> Transcript --> Split
+    Split --> Config --> VITS_Train --> Checkpoint
+    Checkpoint --> JIT --> Chars --> Package
+```
+## Deployment Architecture
+```mermaid
+flowchart TB
+    subgraph HF["☁️ HuggingFace Infrastructure"]
+        subgraph Space["🚀 HF Space (Docker)"]
+            Docker["Docker Container"]
+            FastAPI["FastAPI Server\n:7860"]
+            Models_Dir["models/ directory"]
+        end
+        subgraph ModelRepo["📦 Model Repository"]
+            ModelFiles["Harshil748/VoiceAPI-Models\n(~8GB)"]
+        end
+    end
+    subgraph External["🌐 External Services"]
+        MMS_HF["facebook/mms-tts-guj\n(Gujarati)"]
+    end
+    User["👤 User"] -->|HTTPS| FastAPI
+    Docker -->|Build time| ModelFiles
+    FastAPI -->|Runtime| MMS_HF
+    Models_Dir -.->|Loaded from| ModelFiles
+```
+## Voice Configuration Map
+```mermaid
+mindmap
+  root((VoiceAPI))
+    Hindi
+      hi_male
+      hi_female
+    Bengali
+      bn_male
+      bn_female
+    Marathi
+      mr_male
+      mr_female
+    Telugu
+      te_male
+      te_female
+    Kannada
+      kn_male
+      kn_female
+    Gujarati
+      gu_mms
+    Bhojpuri
+      bho_male
+      bho_female
+    Chhattisgarhi
+      hne_male
+      hne_female
+    Maithili
+      mai_male
+      mai_female
+    Magahi
+      mag_male
+      mag_female
+    English
+      en_male
+      en_female
+```
+## Component Interaction
+| Component | File | Purpose |
+|-----------|------|---------|
+| API Server | `src/api.py` | FastAPI REST endpoints |
+| TTS Engine | `src/engine.py` | Model loading & inference |
+| Tokenizer | `src/tokenizer.py` | Text → Token IDs |
+| Config | `src/config.py` | Language & model configs |
+| Model Loader | `src/model_loader.py` | Model file management |
+## Performance Characteristics
+| Metric | Value |
+|--------|-------|
+| Inference Time | ~200-500ms per sentence |
+| Model Load Time | ~2-5s per voice |
+| Audio Sample Rate | 22050 Hz (16000 Hz for Gujarati; 24000 Hz for XTTS voice cloning) |
+| Supported Formats | WAV |
+| Concurrent Requests | Limited by memory |
+---
+*Built for the Voice Tech for All Hackathon*
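
Reviewer note: the "Convert to WAV bytes" step in the data flow diagram can be illustrated with the standard library alone. This is a minimal sketch, not the project's code (the diff below shows `src/api.py` using `soundfile` for this step). `float_audio_to_wav_bytes` is a hypothetical helper, and the sine tone merely stands in for a model's output tensor.

```python
import io
import math
import struct
import wave


def float_audio_to_wav_bytes(samples, sample_rate=22050):
    """Encode float samples in [-1.0, 1.0] as 16-bit mono PCM WAV bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        # Clip to the valid range before scaling to int16.
        pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()


# Stand-in for model output: half a second of a 440 Hz tone.
sample_rate = 22050
tone = [
    0.5 * math.sin(2 * math.pi * 440 * n / sample_rate)
    for n in range(sample_rate // 2)
]
wav_bytes = float_audio_to_wav_bytes(tone, sample_rate)
duration = len(tone) / sample_rate  # same formula as TTSOutput.duration
```

The duration is simply the sample count divided by the sample rate, which is why the response must report the rate that matches the model that produced the audio.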

src/api.py CHANGED

@@ -37,6 +37,17 @@ from .config import (
     STYLE_PRESETS,
 )
 # Language name to voice key mapping (for hackathon API)
 LANG_TO_VOICE = {
     "hindi": "hi_female",
@@ -152,6 +163,16 @@ class SynthesizeResponse(BaseModel):
     inference_time: float
 class VoiceInfo(BaseModel):
     """Information about a voice"""
@@ -332,6 +353,85 @@ async def synthesize_stream(request: SynthesizeRequest):
         raise HTTPException(status_code=500, detail=str(e))
 @app.get("/synthesize/get")
 async def synthesize_get(
     text: str = Query(

     STYLE_PRESETS,
 )
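+# Language mapping for XTTS voice cloning.
+# NOTE: the XTTS v2 model card lists 17 supported languages; of the codes
+# below only "en" and "hi" are among them, so the other entries may be
+# rejected by the model at inference time.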
+XTTS_LANG_MAP = {
+    "english": "en",
+    "hindi": "hi",
+    "bengali": "bn",
+    "gujarati": "gu",
+    "marathi": "mr",
+    "telugu": "te",
+    "kannada": "kn",
+}
 # Language name to voice key mapping (for hackathon API)
 LANG_TO_VOICE = {
     "hindi": "hi_female",
     inference_time: float
+class CloneResponse(BaseModel):
+    """Response metadata for voice cloning"""
+    success: bool
+    duration: float
+    sample_rate: int
+    inference_time: float
+    language: str
 class VoiceInfo(BaseModel):
     """Information about a voice"""
         raise HTTPException(status_code=500, detail=str(e))
+@app.post("/clone", response_class=Response)
+async def clone_voice(
+    text: str = Query(..., description="Text to synthesize with cloned voice"),
+    lang: str = Query(
+        "english",
+        description="Language name (english, hindi, bengali, gujarati, marathi, telugu, kannada)",
+    ),
+    speed: float = Query(1.0, description="Speech speed", ge=0.5, le=2.0),
+    pitch: float = Query(1.0, description="Pitch", ge=0.5, le=2.0),
+    energy: float = Query(1.0, description="Energy", ge=0.5, le=2.0),
+    style: Optional[str] = Query(None, description="Style preset"),
+    speaker_wav: UploadFile = File(
+        ..., description="Reference speaker WAV (3-15 seconds recommended)"
+    ),
+):
+    """
+    Clone a custom voice from an uploaded reference sample using XTTS v2.
+    """
+    engine = get_engine()
+    lang_lower = lang.lower().strip()
+    if lang_lower not in XTTS_LANG_MAP:
+        supported = ", ".join(sorted(XTTS_LANG_MAP.keys()))
+        raise HTTPException(
+            status_code=400,
+            detail=f"Unsupported clone language: {lang}. Supported: {supported}",
+        )
+    temp_path = None
+    try:
+        data = await speaker_wav.read()
+        # A canonical PCM WAV header is 44 bytes; anything shorter cannot be valid.
+        if len(data) < 44:
+            raise HTTPException(status_code=400, detail="Invalid speaker_wav file")
+        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
+            tmp.write(data)
+            temp_path = tmp.name
+        start_time = time.time()
+        output = engine.clone_voice(
+            text=text,
+            speaker_wav_path=temp_path,
+            language_code=XTTS_LANG_MAP[lang_lower],
+            speed=speed,
+            pitch=pitch,
+            energy=energy,
+            style=style,
+            normalize_text=True,
+        )
+        inference_time = time.time() - start_time
+        buffer = io.BytesIO()
+        sf.write(buffer, output.audio, output.sample_rate, format="WAV")
+        buffer.seek(0)
+        return Response(
+            content=buffer.read(),
+            media_type="audio/wav",
+            headers={
+                "X-Duration": str(output.duration),
+                "X-Sample-Rate": str(output.sample_rate),
+                "X-Language": lang_lower,
+                "X-Voice": "custom_cloned",
+                "X-Inference-Time": str(inference_time),
+            },
+        )
+    except HTTPException:
+        raise
+    except Exception as e:
+        logger.error(f"Clone error: {e}")
+        raise HTTPException(status_code=500, detail=str(e))
+    finally:
+        if temp_path and os.path.exists(temp_path):
+            try:
+                os.remove(temp_path)
+            except OSError:
+                pass
 @app.get("/synthesize/get")
 async def synthesize_get(
     text: str = Query(
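
Reviewer note: a client-side view of the new `/clone` endpoint may help when testing it. The sketch below only builds the request, using the standard library. It assumes that `text` and `lang` travel in the query string (they are declared with `Query(...)`) and that `speaker_wav` travels in a multipart body (it is declared with `File(...)`). `build_clone_request`, the base URL, and the placeholder WAV bytes are hypothetical.

```python
import urllib.parse
import urllib.request
import uuid


def build_clone_request(base_url, text, lang, wav_bytes, filename="reference.wav"):
    """Build (but do not send) a multipart POST request for /clone."""
    # Query(...) parameters go in the URL.
    query = urllib.parse.urlencode({"text": text, "lang": lang})
    url = f"{base_url.rstrip('/')}/clone?{query}"

    # The File(...) parameter goes in a multipart/form-data body.
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="speaker_wav"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")

    return urllib.request.Request(
        url,
        data=head + wav_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Hypothetical local address; 7860 is the port named in ARCHITECTURE.md.
request = build_clone_request(
    "http://localhost:7860", "Hello there", "english", b"RIFF....WAVEfmt "
)
# Sending is left to the caller, for example:
# with urllib.request.urlopen(request) as response:
#     audio = response.read()
#     sample_rate = response.headers["X-Sample-Rate"]
```

On success the endpoint returns `audio/wav` bytes, with the metadata (`X-Duration`, `X-Sample-Rate`, `X-Language`, `X-Voice`, `X-Inference-Time`) carried in response headers instead of a JSON body.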

src/engine.py CHANGED

@@ -1,7 +1,7 @@
 """
 TTS Engine for Multi-lingual Indian Language Speech Synthesis
-This engine uses VITS (Variational Inference with adversarial learning
 for Text-to-Speech) models trained on various Indian language datasets.
 Supported Languages:
@@ -25,7 +25,11 @@ from dataclasses import dataclass
 from .config import LANGUAGE_CONFIGS, LanguageConfig, MODELS_DIR, STYLE_PRESETS
 from .tokenizer import TTSTokenizer, CharactersConfig, TextNormalizer
-from .model_loader import _ensure_models_available, get_model_path, list_available_models
 logger = logging.getLogger(__name__)
@@ -33,6 +37,7 @@ logger = logging.getLogger(__name__)
 @dataclass
 class TTSOutput:
     """Output from TTS synthesis"""
     audio: np.ndarray
     sample_rate: int
     duration: float
@@ -48,13 +53,16 @@ class StyleProcessor:
     """
     @staticmethod
-    def apply_pitch_shift(audio: np.ndarray, sample_rate: int, pitch_factor: float) -> np.ndarray:
         """Shift pitch without changing duration"""
         if pitch_factor == 1.0:
             return audio
         try:
             import librosa
             semitones = 12 * np.log2(pitch_factor)
             shifted = librosa.effects.pitch_shift(
                 audio.astype(np.float32), sr=sample_rate, n_steps=semitones
@@ -62,23 +70,28 @@ class StyleProcessor:
             return shifted
         except ImportError:
             from scipy import signal
             stretched = signal.resample(audio, int(len(audio) / pitch_factor))
             return signal.resample(stretched, len(audio))
     @staticmethod
-    def apply_speed_change(audio: np.ndarray, sample_rate: int, speed_factor: float) -> np.ndarray:
         """Change speed/tempo without changing pitch"""
         if speed_factor == 1.0:
             return audio
         try:
             import librosa
             stretched = librosa.effects.time_stretch(
                 audio.astype(np.float32), rate=speed_factor
             )
             return stretched
         except ImportError:
             from scipy import signal
             target_length = int(len(audio) / speed_factor)
             return signal.resample(audio, target_length)
@@ -160,6 +173,7 @@ class TTSEngine:
         self._coqui_models: Dict[str, Any] = {}
         self._mms_models: Dict[str, Any] = {}
         self._mms_tokenizers: Dict[str, Any] = {}
         # Text normalizer
         self.normalizer = TextNormalizer()
@@ -216,7 +230,9 @@ class TTSEngine:
         else:
             raise FileNotFoundError(f"No model file found in {model_dir}")
-    def _load_jit_voice(self, voice_key: str, model_dir: Path, model_path: Path) -> bool:
         """Load a JIT traced VITS model"""
         chars_path = model_dir / "chars.txt"
         if chars_path.exists():
@@ -238,7 +254,9 @@ class TTSEngine:
         logger.info(f"Loaded voice: {voice_key}")
         return True
-    def _load_coqui_voice(self, voice_key: str, model_dir: Path, checkpoint_path: Path) -> bool:
         """Load a Coqui TTS checkpoint model"""
         config_path = model_dir / "config.json"
         if not config_path.exists():
@@ -333,6 +351,71 @@ class TTSEngine:
         torch.cuda.empty_cache() if self.device.type == "cuda" else None
         logger.info(f"Unloaded voice: {voice_key}")
     def synthesize(
         self,
         text: str,
@@ -423,7 +506,9 @@ class TTSEngine:
         """Synthesize speech and save to file"""
         import soundfile as sf
-        output = self.synthesize(text, voice, speed, pitch, energy, style, normalize_text)
         sf.write(output_path, output.audio, output.sample_rate)
         logger.info(f"Saved audio to {output_path} (duration: {output.duration:.2f}s)")
@@ -454,8 +539,14 @@ class TTSEngine:
             voices[key] = {
                 "name": config.name,
                 "code": config.code,
-                "gender": "male" if "male" in key else ("female" if "female" in key else "neutral"),
-                "loaded": key in self._models or key in self._coqui_models or key in self._mms_models,
                 "downloaded": is_mms or get_model_path(key) is not None,
                 "type": model_type,
             }
@@ -465,12 +556,16 @@ class TTSEngine:
         """Get available style presets"""
         return STYLE_PRESETS
-    def batch_synthesize(self, texts: List[str], voice: str = "hi_male", speed: float = 1.0) -> List[TTSOutput]:
         """Synthesize multiple texts"""
         return [self.synthesize(text, voice, speed) for text in texts]
-def synthesize(text: str, voice: str = "hi_male", output_path: Optional[str] = None) -> Union[TTSOutput, str]:
     """Quick synthesis function"""
     engine = TTSEngine()

 """
 TTS Engine for Multi-lingual Indian Language Speech Synthesis
+This engine uses VITS (Variational Inference with adversarial learning
 for Text-to-Speech) models trained on various Indian language datasets.
 Supported Languages:
 from .config import LANGUAGE_CONFIGS, LanguageConfig, MODELS_DIR, STYLE_PRESETS
 from .tokenizer import TTSTokenizer, CharactersConfig, TextNormalizer
+from .model_loader import (
+    _ensure_models_available,
+    get_model_path,
+    list_available_models,
+)
 logger = logging.getLogger(__name__)
 @dataclass
 class TTSOutput:
     """Output from TTS synthesis"""
     audio: np.ndarray
     sample_rate: int
     duration: float
     """
     @staticmethod
+    def apply_pitch_shift(
+        audio: np.ndarray, sample_rate: int, pitch_factor: float
+    ) -> np.ndarray:
         """Shift pitch without changing duration"""
         if pitch_factor == 1.0:
             return audio
         try:
             import librosa
             semitones = 12 * np.log2(pitch_factor)
             shifted = librosa.effects.pitch_shift(
                 audio.astype(np.float32), sr=sample_rate, n_steps=semitones
             return shifted
         except ImportError:
             from scipy import signal
             stretched = signal.resample(audio, int(len(audio) / pitch_factor))
             return signal.resample(stretched, len(audio))
     @staticmethod
+    def apply_speed_change(
+        audio: np.ndarray, sample_rate: int, speed_factor: float
+    ) -> np.ndarray:
         """Change speed/tempo without changing pitch"""
         if speed_factor == 1.0:
             return audio
         try:
             import librosa
             stretched = librosa.effects.time_stretch(
                 audio.astype(np.float32), rate=speed_factor
             )
             return stretched
         except ImportError:
             from scipy import signal
             target_length = int(len(audio) / speed_factor)
             return signal.resample(audio, target_length)
         self._coqui_models: Dict[str, Any] = {}
         self._mms_models: Dict[str, Any] = {}
         self._mms_tokenizers: Dict[str, Any] = {}
+        self._xtts_model: Optional[Any] = None
         # Text normalizer
         self.normalizer = TextNormalizer()
         else:
             raise FileNotFoundError(f"No model file found in {model_dir}")
+    def _load_jit_voice(
+        self, voice_key: str, model_dir: Path, model_path: Path
+    ) -> bool:
         """Load a JIT traced VITS model"""
         chars_path = model_dir / "chars.txt"
         if chars_path.exists():
         logger.info(f"Loaded voice: {voice_key}")
         return True
+    def _load_coqui_voice(
+        self, voice_key: str, model_dir: Path, checkpoint_path: Path
+    ) -> bool:
         """Load a Coqui TTS checkpoint model"""
         config_path = model_dir / "config.json"
         if not config_path.exists():
         torch.cuda.empty_cache() if self.device.type == "cuda" else None
         logger.info(f"Unloaded voice: {voice_key}")
+    def _get_xtts_model(self):
+        """Lazy-load Coqui XTTS model for voice cloning."""
+        if self._xtts_model is not None:
+            return self._xtts_model
+        try:
+            from TTS.api import TTS
+        except ImportError as exc:
+            raise ImportError(
+                "Coqui TTS is required for voice cloning. Install with: pip install TTS"
+            ) from exc
+        logger.info("Loading XTTS v2 voice cloning model...")
+        self._xtts_model = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
+        if self.device.type == "cuda":
+            self._xtts_model = self._xtts_model.to("cuda")
+        logger.info("XTTS v2 loaded")
+        return self._xtts_model
+    def clone_voice(
+        self,
+        text: str,
+        speaker_wav_path: str,
+        language_code: str = "en",
+        speed: float = 1.0,
+        pitch: float = 1.0,
+        energy: float = 1.0,
+        style: Optional[str] = None,
+        normalize_text: bool = True,
+    ) -> TTSOutput:
+        """Clone a speaker voice from a reference WAV using XTTS v2."""
+        xtts = self._get_xtts_model()
+        if normalize_text:
+            text = self.normalizer.clean_text(text, language_code)
+        wav = xtts.tts(
+            text=text,
+            speaker_wav=speaker_wav_path,
+            language=language_code,
+        )
+        audio_np = np.array(wav, dtype=np.float32)
+        sample_rate = 24000  # XTTS v2 generates audio at 24 kHz
+        if style and style in STYLE_PRESETS:
+            preset = STYLE_PRESETS[style]
+            speed = speed * preset["speed"]
+            pitch = pitch * preset["pitch"]
+            energy = energy * preset["energy"]
+        audio_np = self.style_processor.apply_style(
+            audio_np, sample_rate, speed=speed, pitch=pitch, energy=energy
+        )
+        duration = len(audio_np) / sample_rate
+        return TTSOutput(
+            audio=audio_np,
+            sample_rate=sample_rate,
+            duration=duration,
+            voice="custom_cloned",
+            text=text,
+            style=style,
+        )
     def synthesize(
         self,
         text: str,
         """Synthesize speech and save to file"""
         import soundfile as sf
+        output = self.synthesize(
+            text, voice, speed, pitch, energy, style, normalize_text
+        )
         sf.write(output_path, output.audio, output.sample_rate)
         logger.info(f"Saved audio to {output_path} (duration: {output.duration:.2f}s)")
             voices[key] = {
                 "name": config.name,
                 "code": config.code,
+                "gender": (
+                    "male"
+                    if "male" in key
+                    else ("female" if "female" in key else "neutral")
+                ),
+                "loaded": key in self._models
+                or key in self._coqui_models
+                or key in self._mms_models,
                 "downloaded": is_mms or get_model_path(key) is not None,
                 "type": model_type,
             }
         """Get available style presets"""
         return STYLE_PRESETS
+    def batch_synthesize(
+        self, texts: List[str], voice: str = "hi_male", speed: float = 1.0
+    ) -> List[TTSOutput]:
         """Synthesize multiple texts"""
         return [self.synthesize(text, voice, speed) for text in texts]
+def synthesize(
+    text: str, voice: str = "hi_male", output_path: Optional[str] = None
+) -> Union[TTSOutput, str]:
     """Quick synthesis function"""
     engine = TTSEngine()
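
Reviewer note: two pieces of arithmetic in `src/engine.py` are easy to check in isolation. `apply_pitch_shift` converts a frequency ratio to semitones with `12 * log2(pitch_factor)`, and `clone_voice` multiplies the request's `speed`, `pitch`, and `energy` by the preset's factors. The sketch below reproduces both. The helper names and the preset values are hypothetical, since the real `STYLE_PRESETS` live in `src/config.py` and are not part of this diff.

```python
import math


def pitch_factor_to_semitones(pitch_factor):
    """Convert a frequency ratio to semitones: 12 * log2(ratio)."""
    return 12 * math.log2(pitch_factor)


def combine_with_preset(speed, pitch, energy, preset):
    """Multiply request values by a preset's factors, as clone_voice does."""
    return (
        speed * preset["speed"],
        pitch * preset["pitch"],
        energy * preset["energy"],
    )


# Hypothetical preset values, chosen only to make the arithmetic easy to follow.
example_preset = {"speed": 0.5, "pitch": 2.0, "energy": 1.0}

speed, pitch, energy = combine_with_preset(1.5, 1.0, 1.0, example_preset)
semitones = pitch_factor_to_semitones(pitch)
```

Under these assumptions the API's allowed pitch range of 0.5 to 2.0 corresponds to one octave down or up. Because presets are applied after the `Query` bounds are validated, the combined factor can fall outside that range, which may be worth clamping.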