---
base_model: kenpath/svara-tts-v1
models:
- kenpath/svara-tts-v1
- canopylabs/3b-hi-ft-research_release
- mlx-community/svara-tts-v1
- mlx-community/svara-tts-v1-8bit
license: apache-2.0
language:
- hi
- bn
- mr
- te
- kn
- bho
- mag
- hne
- mai
- as
- brx
- doi
- gu
- ml
- pa
- ta
- ne
- sa
- en
tags:
- text-to-speech
- speech-synthesis
- multilingual
- indic
- orpheus
- snac
- mlx
- mlx-audio
task_categories:
- text-to-speech
pipeline_tag: text-to-speech
pretty_name: Svara-TTS v1 (MLX, 4-bit)
datasets:
- SYSPIN
- RASA
- IndicTTS
- SPICOR
library_name: mlx
---

# Svara-TTS v1 — MLX 4-bit

> **Parent model:** [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1) — full upstream weights, model card, training data, and evaluation. All credit for the model itself goes to the [Kenpath](https://huggingface.co/kenpath) team. This repo only contains an MLX-format quantization for inference on Apple Silicon.
>
> **Orpheus base:** [`canopylabs/3b-hi-ft-research_release`](https://huggingface.co/canopylabs/3b-hi-ft-research_release) — Canopy Labs' Orpheus Hindi research release, which Svara was fine-tuned from.

4-bit MLX-quantized port of [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1), an autoregressive multilingual text-to-speech model for 19 Indian languages in the Orpheus / SNAC family. Quantized at **~4.5 bits per weight** (`q-bits=4`, `q-group-size=64`), which brings the weights down to **~1.9 GB**, compared with ~6.6 GB for the bf16 MLX conversion and ~13.2 GB for the upstream checkpoint.
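The ~4.5 bits-per-weight figure can be reproduced with a little arithmetic. The sketch below assumes group-wise affine quantization in which each group of 64 weights shares one 16-bit scale and one 16-bit bias, and it treats every weight as quantized, so the sizes are rough estimates rather than exact file sizes. The function names are illustrative and not part of `mlx` or `mlx-audio`; the 64-weight group size for the 8-bit variant is also an assumption.

```python
def bits_per_weight(q_bits: int, group_size: int, overhead_bits: int = 32) -> float:
    """Effective bits per weight for group-wise affine quantization.

    Assumption: each group stores one 16-bit scale and one 16-bit bias,
    i.e. 32 bits of overhead shared by `group_size` weights.
    """
    return q_bits + overhead_bits / group_size


def estimated_size_gb(bf16_size_gb: float, q_bits: int, group_size: int) -> float:
    """Scale a bf16 checkpoint size (16 bits per weight) to a quantized size."""
    return bf16_size_gb * bits_per_weight(q_bits, group_size) / 16


print(bits_per_weight(4, 64))                   # 4.5
print(round(estimated_size_gb(6.6, 4, 64), 2))  # 1.86 (4-bit, this repo)
print(round(estimated_size_gb(6.6, 8, 64), 2))  # 3.51 (8-bit, assuming group size 64)
```

Both estimates land close to the ~1.9 GB and ~3.5 GB sizes listed for the 4-bit and 8-bit repos.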

Built for [mlx-audio](https://github.com/Blaizzy/mlx-audio) on Apple Silicon.

## Usage

Requires `mlx-audio` with TTS extras:

```bash
pip install "mlx-audio[tts]"
```

### Python

```python
import numpy as np
import soundfile as sf
import mlx.core as mx
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/svara-tts-v1-4bit")

# generate() is a generator; collect the audio from each result it yields.
chunks = []
for result in model.generate(
    text="नमस्ते, आप कैसे हैं? मैं ठीक हूँ।",
    voice="Hindi (Female)",
    temperature=0.75,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.1,
    max_tokens=1200,
):
    chunks.append(result.audio)

# Join the pieces into one waveform and write it at the model's sample rate.
audio = mx.concatenate(chunks, axis=0)
sf.write("hello_hi.wav", np.asarray(audio), model.sample_rate)  # 24 kHz
```

### CLI

```bash
mlx_audio.tts.generate \
    --model mlx-community/svara-tts-v1-4bit \
    --text "नमस्ते, आप कैसे हैं?" \
    --voice "Hindi (Female)" \
    --temperature 0.75 \
    --top_p 0.9
```

## Voices

Use a string of the form `"<Language Name> (<Gender>)"`:

| Language     | Voices                              |
|--------------|-------------------------------------|
| Hindi        | `Hindi (Male)`, `Hindi (Female)`    |
| Bengali      | `Bengali (Male)`, `Bengali (Female)`|
| Marathi      | `Marathi (Male)`, `Marathi (Female)`|
| Telugu       | `Telugu (Male)`, `Telugu (Female)`  |
| Kannada      | `Kannada (Male)`, `Kannada (Female)`|
| Tamil        | `Tamil (Male)`, `Tamil (Female)`    |
| Malayalam    | `Malayalam (Male)`, `Malayalam (Female)` |
| Gujarati     | `Gujarati (Male)`, `Gujarati (Female)` |
| Punjabi      | `Punjabi (Male)`, `Punjabi (Female)` |
| Assamese     | `Assamese (Male)`, `Assamese (Female)` |
| Bhojpuri     | `Bhojpuri (Male)`, `Bhojpuri (Female)` |
| Magahi       | `Magahi (Male)`, `Magahi (Female)`  |
| Maithili     | `Maithili (Male)`, `Maithili (Female)` |
| Chhattisgarhi| `Chhattisgarhi (Male)`, `Chhattisgarhi (Female)` |
| Bodo         | `Bodo (Male)`, `Bodo (Female)`      |
| Dogri        | `Dogri (Male)`, `Dogri (Female)`    |
| Nepali       | `Nepali (Male)`, `Nepali (Female)`  |
| Sanskrit     | `Sanskrit (Male)`, `Sanskrit (Female)` |
| English (Indian) | `English (Indian) (Male)`, `English (Indian) (Female)` |

Total: **38 voices** across 19 languages.
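Because voice names follow a fixed pattern, they can be generated and validated before calling the model. This is a minimal sketch built only from the table above; `LANGUAGES`, `VOICES`, and `voice_name` are hypothetical helpers, not part of `mlx-audio`.

```python
# Language names exactly as they appear in the voices table.
LANGUAGES = [
    "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Tamil", "Malayalam",
    "Gujarati", "Punjabi", "Assamese", "Bhojpuri", "Magahi", "Maithili",
    "Chhattisgarhi", "Bodo", "Dogri", "Nepali", "Sanskrit", "English (Indian)",
]
GENDERS = ("Male", "Female")

# Every valid "<Language Name> (<Gender>)" string.
VOICES = {f"{lang} ({gender})" for lang in LANGUAGES for gender in GENDERS}


def voice_name(language: str, gender: str) -> str:
    """Build a voice string and fail early if it is not in the table."""
    name = f"{language.strip()} ({gender.strip().capitalize()})"
    if name not in VOICES:
        raise ValueError(f"Unknown voice: {name!r}")
    return name


print(len(VOICES))                    # 38
print(voice_name("Hindi", "female"))  # Hindi (Female)
```

Validating up front gives a clear error for a typo such as `"Hindi (female)"` instead of relying on how the model handles an unrecognized voice string.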

## Sampling Recommendations

The upstream `svara-tts-inference` repo uses these defaults; they're a good starting point:

| Parameter | Value |
|-----------|-------|
| `temperature` | 0.75 |
| `top_p` | 0.9 |
| `top_k` | 40 |
| `repetition_penalty` | 1.1 |
| `max_tokens` | 1200–2048 |
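One convenient way to reuse these values is to keep them in a single dictionary and unpack it into `model.generate(...)`. The sketch below is an illustration only: `SAMPLING_DEFAULTS` and `sampling_kwargs` are hypothetical names, and the keys simply mirror the keyword arguments used in the Python example above.

```python
# Recommended starting values from the table; max_tokens uses the low end of the range.
SAMPLING_DEFAULTS = {
    "temperature": 0.75,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 1200,
}


def sampling_kwargs(**overrides):
    """Return the defaults with selected values replaced, rejecting unknown keys."""
    unknown = set(overrides) - set(SAMPLING_DEFAULTS)
    if unknown:
        raise TypeError(f"Unknown sampling parameter(s): {sorted(unknown)}")
    return {**SAMPLING_DEFAULTS, **overrides}


# Example: raise the token budget for a longer utterance.
kwargs = sampling_kwargs(max_tokens=2048)
print(kwargs["max_tokens"], kwargs["temperature"])  # 2048 0.75
```

The result can then be passed as `model.generate(text=..., voice=..., **kwargs)`.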

## Architecture

- **Backbone:** Llama-3.2-3B (fine-tuned from [`canopylabs/3b-hi-ft-research_release`](https://huggingface.co/canopylabs/3b-hi-ft-research_release), Canopy's Orpheus Hindi base).
- **Codec:** [SNAC 24 kHz](https://huggingface.co/hubertsiuzdak/snac_24khz), a 3-level hierarchical RVQ codec. Each frame carries 7 codes (1 + 2 + 4 across the three levels) and covers roughly 85 ms of audio. Loaded automatically by `mlx-audio`.
- **Output:** 24 kHz mono PCM.
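The codec layout also gives a rough way to translate a `max_tokens` budget into audio length. The sketch below assumes the SNAC 24 kHz configuration in which one 7-code frame spans 2048 samples at the coarsest level (about 85 ms), and that every generated token is an audio code, so treat the result as an upper bound. The constant and function names are hypothetical.

```python
CODES_PER_FRAME = 7        # 1 + 2 + 4 codes across the three SNAC levels
SAMPLES_PER_FRAME = 2048   # assumption: coarsest-level hop of SNAC 24 kHz
SAMPLE_RATE = 24_000       # 24 kHz mono output


def max_audio_seconds(max_tokens: int) -> float:
    """Upper bound on audio length for a given token budget."""
    frames = max_tokens // CODES_PER_FRAME  # only complete frames decode to audio
    return frames * SAMPLES_PER_FRAME / SAMPLE_RATE


print(round(max_audio_seconds(1200), 1))  # 14.6
print(round(max_audio_seconds(2048), 1))  # 24.9
```

Under these assumptions, a `max_tokens` of 1200 to 2048 corresponds to roughly 15 to 25 seconds of speech per call, which is worth keeping in mind when synthesizing long passages.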

## Other Quants

- bf16 MLX: [`mlx-community/svara-tts-v1`](https://huggingface.co/mlx-community/svara-tts-v1) (~6.6 GB)
- 8-bit MLX: [`mlx-community/svara-tts-v1-8bit`](https://huggingface.co/mlx-community/svara-tts-v1-8bit) (~3.5 GB)
- Upstream source: [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1) (~13.2 GB)

## License

Apache 2.0 — see [base model card](https://huggingface.co/kenpath/svara-tts-v1) for full details.