---
base_model: kenpath/svara-tts-v1
models:
- kenpath/svara-tts-v1
- canopylabs/3b-hi-ft-research_release
- mlx-community/svara-tts-v1
- mlx-community/svara-tts-v1-8bit
license: apache-2.0
language:
- hi
- bn
- mr
- te
- kn
- bho
- mag
- hne
- mai
- as
- brx
- doi
- gu
- ml
- pa
- ta
- ne
- sa
- en
tags:
- text-to-speech
- speech-synthesis
- multilingual
- indic
- orpheus
- snac
- mlx
- mlx-audio
task_categories:
- text-to-speech
pipeline_tag: text-to-speech
pretty_name: Svara-TTS v1 (MLX, 4-bit)
datasets:
- SYSPIN
- RASA
- IndicTTS
- SPICOR
library_name: mlx
---

# Svara-TTS v1 (MLX 4-bit)

> **Parent model:** [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1), which holds the full upstream weights, model card, training data, and evaluation. All credit for the model itself goes to the [Kenpath](https://huggingface.co/kenpath) team. This repo only contains an MLX-format quantization for inference on Apple Silicon.
>
> **Orpheus base:** [`canopylabs/3b-hi-ft-research_release`](https://huggingface.co/canopylabs/3b-hi-ft-research_release), Canopy Labs' Orpheus Hindi research release, which Svara was fine-tuned from.

4-bit MLX-quantized port of [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1), an autoregressive multilingual text-to-speech model for 19 Indian languages, in the Orpheus / SNAC family.

Quantized at **~4.5 bits per weight** (`q-bits=4`, `q-group-size=64`), down from ~13.2 GB bf16 to **~1.9 GB**. Built for [mlx-audio](https://github.com/Blaizzy/mlx-audio) on Apple Silicon.

## Usage

Requires `mlx-audio` with TTS extras:

```bash
pip install "mlx-audio[tts]"
```

### Python

```python
import numpy as np
import soundfile as sf
import mlx.core as mx
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/svara-tts-v1-4bit")

chunks = []
for result in model.generate(
    text="नमस्ते, आप कैसे हैं? मैं ठीक हूँ।",
    voice="Hindi (Female)",
    temperature=0.75,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.1,
    max_tokens=1200,
):
    chunks.append(result.audio)

audio = mx.concatenate(chunks, axis=0)
sf.write("hello_hi.wav", np.asarray(audio), model.sample_rate)  # 24 kHz
```

### CLI

```bash
mlx_audio.tts.generate \
  --model mlx-community/svara-tts-v1-4bit \
  --text "नमस्ते, आप कैसे हैं?" \
  --voice "Hindi (Female)" \
  --temperature 0.75 \
  --top_p 0.9
```

## Voices

Use a string of the form `"<Language> (<Gender>)"`:

| Language         | Voices                                                 |
|------------------|--------------------------------------------------------|
| Hindi            | `Hindi (Male)`, `Hindi (Female)`                       |
| Bengali          | `Bengali (Male)`, `Bengali (Female)`                   |
| Marathi          | `Marathi (Male)`, `Marathi (Female)`                   |
| Telugu           | `Telugu (Male)`, `Telugu (Female)`                     |
| Kannada          | `Kannada (Male)`, `Kannada (Female)`                   |
| Tamil            | `Tamil (Male)`, `Tamil (Female)`                       |
| Malayalam        | `Malayalam (Male)`, `Malayalam (Female)`               |
| Gujarati         | `Gujarati (Male)`, `Gujarati (Female)`                 |
| Punjabi          | `Punjabi (Male)`, `Punjabi (Female)`                   |
| Assamese         | `Assamese (Male)`, `Assamese (Female)`                 |
| Bhojpuri         | `Bhojpuri (Male)`, `Bhojpuri (Female)`                 |
| Magahi           | `Magahi (Male)`, `Magahi (Female)`                     |
| Maithili         | `Maithili (Male)`, `Maithili (Female)`                 |
| Chhattisgarhi    | `Chhattisgarhi (Male)`, `Chhattisgarhi (Female)`       |
| Bodo             | `Bodo (Male)`, `Bodo (Female)`                         |
| Dogri            | `Dogri (Male)`, `Dogri (Female)`                       |
| Nepali           | `Nepali (Male)`, `Nepali (Female)`                     |
| Sanskrit         | `Sanskrit (Male)`, `Sanskrit (Female)`                 |
| English (Indian) | `English (Indian) (Male)`, `English (Indian) (Female)` |

Total: **38 voices** across 19 languages.

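
Because every voice name in the table follows the same language-plus-gender pattern, the full list can be generated rather than typed by hand. The sketch below is only an illustration: `LANGUAGES`, `GENDERS`, `voice_name`, and `ALL_VOICES` are hypothetical helper names introduced here, not part of `mlx-audio`, and the language list is copied from the table above.

```python
# Hypothetical helpers for building and validating voice strings.
# The language names are copied from the Voices table; nothing here
# is provided by mlx-audio itself.
LANGUAGES = [
    "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Tamil",
    "Malayalam", "Gujarati", "Punjabi", "Assamese", "Bhojpuri", "Magahi",
    "Maithili", "Chhattisgarhi", "Bodo", "Dogri", "Nepali", "Sanskrit",
    "English (Indian)",
]
GENDERS = ("Male", "Female")


def voice_name(language: str, gender: str) -> str:
    """Return a voice string such as 'Hindi (Female)', or raise ValueError."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    if gender not in GENDERS:
        raise ValueError(f"unknown gender: {gender!r}")
    return f"{language} ({gender})"


# Every language/gender combination listed in the table.
ALL_VOICES = [voice_name(lang, g) for lang in LANGUAGES for g in GENDERS]

if __name__ == "__main__":
    print(len(ALL_VOICES))
    print(voice_name("English (Indian)", "Male"))
```

Validating the string before calling `model.generate(...)` catches typos such as `"Hindi (female)"` early, instead of relying on however the model handles an unrecognized voice.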
## Sampling Recommendations

The upstream `svara-tts-inference` repo uses these defaults; they're a good starting point:

| Parameter            | Value     |
|----------------------|-----------|
| `temperature`        | 0.75      |
| `top_p`              | 0.9       |
| `top_k`              | 40        |
| `repetition_penalty` | 1.1       |
| `max_tokens`         | 1200–2048 |

## Architecture

- **Backbone:** Llama-3.2-3B (fine-tuned from [`canopylabs/3b-hi-ft-research_release`](https://huggingface.co/canopylabs/3b-hi-ft-research_release), Canopy's Orpheus Hindi base).
- **Codec:** [SNAC 24 kHz](https://huggingface.co/hubertsiuzdak/snac_24khz), a 3-level hierarchical RVQ with 7 codes per coarse frame (about 12 frames per second, so roughly 85 ms of audio per frame). Loaded automatically by `mlx-audio`.
- **Output:** 24 kHz mono PCM.

## Other Quants

- bf16 MLX: [`mlx-community/svara-tts-v1`](https://huggingface.co/mlx-community/svara-tts-v1) (~6.6 GB)
- 8-bit MLX: [`mlx-community/svara-tts-v1-8bit`](https://huggingface.co/mlx-community/svara-tts-v1-8bit) (~3.5 GB)
- bf16 source: [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1) (~13.2 GB)

## License

Apache 2.0. See the [base model card](https://huggingface.co/kenpath/svara-tts-v1) for full details.
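
## Appendix: Quantized Size Arithmetic

The ~4.5 bits-per-weight figure and the sizes listed under Other Quants can be sanity-checked with a little arithmetic. The sketch below assumes MLX's group-wise affine quantization stores one 16-bit scale and one 16-bit bias per group of weights, and that every weight is quantized the same way. Real checkpoints may keep some tensors in higher precision, so treat the results as rough estimates. The function names are hypothetical.

```python
# Rough size estimate for group-wise quantized weights.
# Assumption: each group of `group_size` weights carries one 16-bit scale
# and one 16-bit bias in addition to the packed q_bits-wide values.

def bits_per_weight(q_bits: int, group_size: int,
                    scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective storage cost per weight, including per-group overhead."""
    return q_bits + (scale_bits + bias_bits) / group_size


def estimated_size_gb(bf16_size_gb: float, q_bits: int,
                      group_size: int = 64) -> float:
    """Scale a bf16 checkpoint size (16 bits per weight) to a quantized one."""
    return bf16_size_gb * bits_per_weight(q_bits, group_size) / 16


BF16_MLX_GB = 6.6  # size of the bf16 MLX port listed under Other Quants

if __name__ == "__main__":
    print(bits_per_weight(4, 64))                       # 4.5
    print(round(estimated_size_gb(BF16_MLX_GB, 4), 1))  # 1.9
    print(round(estimated_size_gb(BF16_MLX_GB, 8), 1))  # 3.5
```

Under these assumptions the estimates agree with the ~1.9 GB (4-bit) and ~3.5 GB (8-bit) figures quoted in this card.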