---
base_model: kenpath/svara-tts-v1
models:
- kenpath/svara-tts-v1
- canopylabs/3b-hi-ft-research_release
- mlx-community/svara-tts-v1-4bit
- mlx-community/svara-tts-v1-8bit
license: apache-2.0
language:
- hi
- bn
- mr
- te
- kn
- bho
- mag
- hne
- mai
- as
- brx
- doi
- gu
- ml
- pa
- ta
- ne
- sa
- en
tags:
- text-to-speech
- speech-synthesis
- multilingual
- indic
- orpheus
- snac
- mlx
- mlx-audio
task_categories:
- text-to-speech
pipeline_tag: text-to-speech
pretty_name: Svara-TTS v1 (MLX, bfloat16)
datasets:
- SYSPIN
- RASA
- IndicTTS
- SPICOR
library_name: mlx
---
# Svara-TTS v1 — MLX bfloat16
> **Parent model:** [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1) — full upstream weights, model card, training data, and evaluation. All credit for the model itself goes to the [Kenpath](https://huggingface.co/kenpath) team. This repo only contains an MLX-format conversion for inference on Apple Silicon.
>
> **Orpheus base:** [`canopylabs/3b-hi-ft-research_release`](https://huggingface.co/canopylabs/3b-hi-ft-research_release) — Canopy Labs' Orpheus Hindi research release, which Svara was fine-tuned from.

Full-precision (bfloat16) MLX port of [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1), an autoregressive multilingual text-to-speech model for 19 Indian languages in the Orpheus / SNAC family. Same numerical precision as upstream, repackaged in MLX-native format (~6.6 GB sharded safetensors).

For smaller memory footprints, use the 4-bit or 8-bit quantized variants listed under Other Quants.

Built for [mlx-audio](https://github.com/Blaizzy/mlx-audio) on Apple Silicon.
## Usage
Requires `mlx-audio` with TTS extras:
```bash
pip install "mlx-audio[tts]"
```
### Python
```python
import numpy as np
import soundfile as sf
import mlx.core as mx
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/svara-tts-v1")

# generate() yields results incrementally; collect the audio from each one.
chunks = []
for result in model.generate(
    text="नमस्ते, आप कैसे हैं? मैं ठीक हूँ।",
    voice="Hindi (Female)",
    temperature=0.75,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.1,
    max_tokens=1200,
):
    chunks.append(result.audio)

# Join the pieces and write a WAV file at the model's sample rate.
audio = mx.concatenate(chunks, axis=0)
sf.write("hello_hi.wav", np.asarray(audio), model.sample_rate)  # 24 kHz
```
### CLI
```bash
mlx_audio.tts.generate \
  --model mlx-community/svara-tts-v1 \
  --text "नमस्ते, आप कैसे हैं?" \
  --voice "Hindi (Female)" \
  --temperature 0.75 \
  --top_p 0.9
```
## Voices
Use a string of the form `"<Language Name> (<Gender>)"`:
| Language | Voices |
|--------------|-------------------------------------|
| Hindi | `Hindi (Male)`, `Hindi (Female)` |
| Bengali | `Bengali (Male)`, `Bengali (Female)`|
| Marathi | `Marathi (Male)`, `Marathi (Female)`|
| Telugu | `Telugu (Male)`, `Telugu (Female)` |
| Kannada | `Kannada (Male)`, `Kannada (Female)`|
| Tamil | `Tamil (Male)`, `Tamil (Female)` |
| Malayalam | `Malayalam (Male)`, `Malayalam (Female)` |
| Gujarati | `Gujarati (Male)`, `Gujarati (Female)` |
| Punjabi | `Punjabi (Male)`, `Punjabi (Female)` |
| Assamese | `Assamese (Male)`, `Assamese (Female)` |
| Bhojpuri | `Bhojpuri (Male)`, `Bhojpuri (Female)` |
| Magahi | `Magahi (Male)`, `Magahi (Female)` |
| Maithili | `Maithili (Male)`, `Maithili (Female)` |
| Chhattisgarhi| `Chhattisgarhi (Male)`, `Chhattisgarhi (Female)` |
| Bodo | `Bodo (Male)`, `Bodo (Female)` |
| Dogri | `Dogri (Male)`, `Dogri (Female)` |
| Nepali | `Nepali (Male)`, `Nepali (Female)` |
| Sanskrit | `Sanskrit (Male)`, `Sanskrit (Female)` |
| English (Indian) | `English (Indian) (Male)`, `English (Indian) (Female)` |

Total: **38 voices** across 19 languages.
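
The naming pattern can be wrapped in a small helper that builds and validates voice strings before they are passed to `model.generate`. This is only a sketch: `LANGUAGES`, `GENDERS`, `voice_name`, and `ALL_VOICES` are illustrative names that are not part of `mlx-audio`, and the language list is copied from the table above.

```python
# Illustrative helper (not part of mlx-audio): build a voice string of the
# form "<Language Name> (<Gender>)" and reject names missing from the table.
LANGUAGES = (
    "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Tamil",
    "Malayalam", "Gujarati", "Punjabi", "Assamese", "Bhojpuri", "Magahi",
    "Maithili", "Chhattisgarhi", "Bodo", "Dogri", "Nepali", "Sanskrit",
    "English (Indian)",
)
GENDERS = ("Male", "Female")


def voice_name(language: str, gender: str) -> str:
    """Return e.g. 'Hindi (Female)'; raise ValueError for unknown inputs."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    if gender not in GENDERS:
        raise ValueError(f"unknown gender: {gender!r}")
    return f"{language} ({gender})"


# Every valid voice string, in table order.
ALL_VOICES = [voice_name(lang, g) for lang in LANGUAGES for g in GENDERS]
```

Validating up front catches typos such as a missing space before the parenthesis, so a bad voice name fails immediately instead of being passed to the model.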
## Sampling Recommendations
The upstream `svara-tts-inference` repo uses these defaults; they're a good starting point:
| Parameter | Value |
|-----------|-------|
| `temperature` | 0.75 |
| `top_p` | 0.9 |
| `top_k` | 40 |
| `repetition_penalty` | 1.1 |
| `max_tokens` | 1200–2048 |
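
One way to keep these values in a single place is a defaults dictionary with per-call overrides. This is a minimal sketch: `DEFAULT_SAMPLING` and `sampling_params` are hypothetical names, and `max_tokens` is set to the low end of the range in the table.

```python
# Upstream defaults from the table above; max_tokens uses the low end (1200).
DEFAULT_SAMPLING = {
    "temperature": 0.75,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 1200,
}


def sampling_params(**overrides):
    """Return a copy of the defaults with any keyword overrides applied."""
    unknown = set(overrides) - set(DEFAULT_SAMPLING)
    if unknown:
        raise TypeError(f"unknown sampling parameter(s): {sorted(unknown)}")
    return {**DEFAULT_SAMPLING, **overrides}


# Example: allow a longer utterance while keeping the other defaults.
long_form = sampling_params(max_tokens=2048)
```

The resulting dictionary can be unpacked into the call shown in the Usage section, for example `model.generate(text=..., voice=..., **long_form)`.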
## Architecture
- **Backbone:** Llama-3.2-3B (fine-tuned from [`canopylabs/3b-hi-ft-research_release`](https://huggingface.co/canopylabs/3b-hi-ft-research_release), Canopy's Orpheus Hindi base).
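- **Codec:** [SNAC 24 kHz](https://huggingface.co/hubertsiuzdak/snac_24khz), a 3-level hierarchical RVQ codec. Each frame carries 7 codes (1 + 2 + 4 across the three levels) and covers roughly 85 ms of audio, since the coarsest level runs at about 12 Hz. Loaded automatically by `mlx-audio`.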
- **Output:** 24 kHz mono PCM.
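
To make the "7 codes per frame" figure concrete, the sketch below splits a flat code stream into the three SNAC levels. It assumes the frame layout used by the reference Orpheus decoder (one coarse code, two mid-level codes, and four fine codes, interleaved as shown) and that any token-ID offsets have already been removed. `split_frames` is a hypothetical name; `mlx-audio` performs the real decoding internally.

```python
# Sketch of an Orpheus-style frame layout (an assumption, not mlx-audio's API):
# each frame of 7 codes holds 1 coarse, 2 mid-level and 4 fine SNAC codes.
CODES_PER_FRAME = 7


def split_frames(codes):
    """Split a flat list of codes into (level1, level2, level3) lists.

    Trailing codes that do not fill a whole frame are ignored.
    """
    level1, level2, level3 = [], [], []
    for i in range(len(codes) // CODES_PER_FRAME):
        f = codes[i * CODES_PER_FRAME:(i + 1) * CODES_PER_FRAME]
        level1.append(f[0])                      # coarsest level: 1 code
        level2.extend([f[1], f[4]])              # middle level: 2 codes
        level3.extend([f[2], f[3], f[5], f[6]])  # finest level: 4 codes
    return level1, level2, level3


l1, l2, l3 = split_frames(list(range(14)))  # two full frames
```

Each level ends up with 1, 2, and 4 codes per frame respectively, which matches the doubling of temporal resolution from one level to the next in the codec.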
## Other Quants
- 8-bit MLX: [`mlx-community/svara-tts-v1-8bit`](https://huggingface.co/mlx-community/svara-tts-v1-8bit) (~3.5 GB)
- 4-bit MLX: [`mlx-community/svara-tts-v1-4bit`](https://huggingface.co/mlx-community/svara-tts-v1-4bit) (~1.9 GB)
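
As a rough guide to choosing between the variants, the sketch below picks the largest one whose weights fit a given budget. The sizes are the approximate figures quoted in this card, and `pick_variant` is a hypothetical helper. Weight size is only a lower bound on memory use, because inference also needs room for activations, the KV cache, and the SNAC codec, so leave headroom when setting the budget.

```python
# Approximate weight sizes in GB, copied from this card.
VARIANTS = [
    ("mlx-community/svara-tts-v1", 6.6),       # bfloat16
    ("mlx-community/svara-tts-v1-8bit", 3.5),
    ("mlx-community/svara-tts-v1-4bit", 1.9),
]


def pick_variant(budget_gb: float) -> str:
    """Return the largest variant whose weights fit within budget_gb."""
    for repo, size_gb in VARIANTS:  # ordered largest to smallest
        if size_gb <= budget_gb:
            return repo
    raise ValueError(f"no variant fits in {budget_gb} GB")
```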
## License
Apache 2.0 — see [base model card](https://huggingface.co/kenpath/svara-tts-v1) for full details.