# ๐Ÿ—๏ธ VoiceAPI System Architecture ## High-Level System Diagram ```mermaid flowchart TB subgraph Client["๐Ÿ“ฑ Client Applications"] Web["๐ŸŒ Web App"] Mobile["๐Ÿ“ฑ Mobile App"] Healthcare["๐Ÿฅ Healthcare Assistant"] end subgraph API["๐Ÿš€ FastAPI Server (Port 7860)"] Endpoint["/Get_Inference API"] LangRouter["Language Router"] end subgraph Engine["โš™๏ธ TTS Engine"] Normalizer["Text Normalizer"] Tokenizer["Tokenizer"] StyleProc["Style Processor"] subgraph Models["๏ฟฝ๏ฟฝ Model Types"] VITS["VITS JIT Models\n(.pt files)"] Coqui["Coqui TTS\n(.pth files)"] MMS["Facebook MMS\n(HuggingFace)"] end end subgraph Languages["๐Ÿ—ฃ๏ธ 11 Languages"] Hindi["๐Ÿ‡ฎ๐Ÿ‡ณ Hindi"] Bengali["๐Ÿ‡ง๐Ÿ‡ฉ Bengali"] Marathi["Marathi"] Telugu["Telugu"] Kannada["Kannada"] Gujarati["Gujarati"] Bhojpuri["Bhojpuri"] Others["+ 4 more"] end subgraph Output["๐Ÿ”Š Audio Output"] WAV["WAV File\n22050 Hz"] end Client -->|HTTP GET/POST| Endpoint Endpoint -->|text, lang| LangRouter LangRouter --> Normalizer Normalizer --> Tokenizer Tokenizer --> Models VITS --> StyleProc Coqui --> StyleProc MMS --> StyleProc StyleProc --> WAV WAV -->|Response| Client Models --> Languages ``` ## Data Flow Diagram ```mermaid sequenceDiagram participant C as Client participant A as API Server participant E as TTS Engine participant M as Model participant S as Style Processor C->>A: GET /Get_Inference?text=เคจเคฎเคธเฅเคคเฅ‡&lang=hindi A->>A: Parse parameters A->>E: synthesize(text, voice) E->>E: Normalize text E->>E: Tokenize to IDs E->>M: Load model (if not cached) M->>M: Forward pass (inference) M-->>E: Raw audio tensor E->>S: Apply style (pitch, speed, energy) S-->>E: Processed audio E-->>A: TTSOutput (audio, sample_rate) A->>A: Convert to WAV bytes A-->>C: audio/wav response ``` ## Model Architecture ```mermaid flowchart LR subgraph Input["๐Ÿ“ Input"] Text["Text Input"] end subgraph TextEncoder["๐Ÿ”ค Text Encoder"] Embed["Character Embedding"] TransEnc["Transformer Encoder\n(6 layers, 192 hidden)"] end subgraph 
FlowModel["๐ŸŒŠ Flow Model"] Prior["Prior Encoder"] Flow["Normalizing Flow"] Duration["Duration Predictor"] end subgraph Decoder["๐Ÿ”Š HiFi-GAN Decoder"] Upsample["Upsampling Layers"] ResBlocks["Residual Blocks"] Output["Audio Waveform"] end Text --> Embed --> TransEnc TransEnc --> Prior TransEnc --> Duration Prior --> Flow Duration --> Flow Flow --> Upsample --> ResBlocks --> Output ``` ## Training Pipeline ```mermaid flowchart TD subgraph Data["๐Ÿ“Š Training Data"] OpenSLR["OpenSLR Datasets"] CommonVoice["Mozilla Common Voice"] IndicTTS["IndicTTS Corpus"] AI4Bharat["AI4Bharat Indic-Voices"] end subgraph Prep["๐Ÿ”ง Data Preparation"] Download["Download Audio"] Normalize["Normalize to 22050 Hz"] Transcript["Generate Transcripts"] Split["Train/Val Split"] end subgraph Train["๐Ÿ‹๏ธ Training"] Config["Load Config YAML"] VITS_Train["VITS Training\n(1000 epochs)"] Checkpoint["Save Checkpoints"] end subgraph Export["๐Ÿ“ฆ Export"] JIT["JIT Trace Model"] Chars["Generate chars.txt"] Package["Package for Inference"] end Data --> Download --> Normalize --> Transcript --> Split Split --> Config --> VITS_Train --> Checkpoint Checkpoint --> JIT --> Chars --> Package ``` ## Deployment Architecture ```mermaid flowchart TB subgraph HF["โ˜๏ธ HuggingFace Infrastructure"] subgraph Space["๐Ÿš€ HF Space (Docker)"] Docker["Docker Container"] FastAPI["FastAPI Server\n:7860"] Models_Dir["models/ directory"] end subgraph ModelRepo["๐Ÿ“ฆ Model Repository"] ModelFiles["Harshil748/VoiceAPI-Models\n(~8GB)"] end end subgraph External["๐ŸŒ External Services"] MMS_HF["facebook/mms-tts-guj\n(Gujarati)"] end User["๐Ÿ‘ค User"] -->|HTTPS| FastAPI Docker -->|Build time| ModelFiles FastAPI -->|Runtime| MMS_HF Models_Dir -.->|Loaded from| ModelFiles ``` ## Voice Configuration Map ```mermaid mindmap root((VoiceAPI)) Hindi hi_male hi_female Bengali bn_male bn_female Marathi mr_male mr_female Telugu te_male te_female Kannada kn_male kn_female Gujarati gu_mms Bhojpuri bho_male bho_female Chhattisgarhi 
      hne_male
      hne_female
    Maithili
      mai_male
      mai_female
    Magahi
      mag_male
      mag_female
    English
      en_male
      en_female
```

## Component Interaction

| Component | File | Purpose |
|-----------|------|---------|
| API Server | `src/api.py` | FastAPI REST endpoints |
| TTS Engine | `src/engine.py` | Model loading & inference |
| Tokenizer | `src/tokenizer.py` | Text → Token IDs |
| Config | `src/config.py` | Language & model configs |
| Model Loader | `src/model_loader.py` | Model file management |

## Performance Characteristics

| Metric | Value |
|--------|-------|
| Inference Time | ~200-500 ms per sentence |
| Model Load Time | ~2-5 s per voice |
| Audio Sample Rate | 22050 Hz (16000 Hz for Gujarati) |
| Supported Formats | WAV |
| Concurrent Requests | Limited by memory |

---

*Built for Voice Tech for All Hackathon*
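As a usage sketch, the `GET /Get_Inference?text=...&lang=...` request from the data flow diagram can be built like this. The base URL below is a placeholder (substitute the address of a deployed instance), and the parameter set assumes only `text` and `lang` as shown in the diagram:

```python
from urllib.parse import urlencode

# Placeholder host -- replace with the address of a deployed VoiceAPI instance.
BASE_URL = "https://your-voiceapi-host.example"

def build_inference_url(text: str, lang: str) -> str:
    """Build the GET URL for the /Get_Inference endpoint."""
    return f"{BASE_URL}/Get_Inference?{urlencode({'text': text, 'lang': lang})}"

# Devanagari input is percent-encoded into the query string.
url = build_inference_url("नमस्ते", "hindi")
# Fetching this URL (e.g. with requests.get) should return an audio/wav response.
```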
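The "Tokenize to IDs" step maps characters to integer IDs against a vocabulary (the `chars.txt` file generated at export time). A minimal sketch of that mapping follows; the vocabulary string and both helper functions are hypothetical, and the real tokenizer in `src/tokenizer.py` likely handles normalization and out-of-vocabulary characters differently:

```python
def load_char_map(chars: str) -> dict:
    """Map each vocabulary character to an integer ID, chars.txt-style."""
    return {c: i for i, c in enumerate(chars)}

def tokenize(text: str, char_map: dict) -> list:
    """Convert text to token IDs, skipping characters outside the vocabulary."""
    return [char_map[c] for c in text if c in char_map]

# Hypothetical tiny vocabulary for illustration (space + the letters of नमस्ते).
char_map = load_char_map(" नमस्ते")
ids = tokenize("नमस्ते", char_map)  # six codepoints -> six IDs
```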
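The "Convert to WAV bytes" step in the sequence diagram can be illustrated with the standard library alone. This is a sketch, not the engine's actual conversion code: it assumes mono 16-bit PCM at the 22050 Hz rate from the performance table, and a sine tone stands in for the model's raw audio tensor:

```python
import io
import math
import struct
import wave

def audio_to_wav_bytes(samples, sample_rate=22050):
    """Pack float samples in [-1, 1] into mono 16-bit PCM WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)          # mono
        wf.setsampwidth(2)          # 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        ))
    return buf.getvalue()

# 100 ms of a 440 Hz tone as a stand-in for model output.
tone = [math.sin(2 * math.pi * 440 * n / 22050) for n in range(2205)]
wav = audio_to_wav_bytes(tone)
```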