metadata
base_model: kenpath/svara-tts-v1
models:
  - kenpath/svara-tts-v1
  - canopylabs/3b-hi-ft-research_release
  - mlx-community/svara-tts-v1-4bit
  - mlx-community/svara-tts-v1-8bit
license: apache-2.0
language:
  - hi
  - bn
  - mr
  - te
  - kn
  - bho
  - mag
  - hne
  - mai
  - as
  - brx
  - doi
  - gu
  - ml
  - pa
  - ta
  - ne
  - sa
  - en
tags:
  - text-to-speech
  - speech-synthesis
  - multilingual
  - indic
  - orpheus
  - snac
  - mlx
  - mlx-audio
task_categories:
  - text-to-speech
pipeline_tag: text-to-speech
pretty_name: Svara-TTS v1 (MLX, bfloat16)
datasets:
  - SYSPIN
  - RASA
  - IndicTTS
  - SPICOR
library_name: mlx

Svara-TTS v1 — MLX bfloat16

Parent model: kenpath/svara-tts-v1 — full upstream weights, model card, training data, and evaluation. All credit for the model itself goes to the Kenpath team. This repo only contains an MLX-format conversion for inference on Apple Silicon.

Orpheus base: canopylabs/3b-hi-ft-research_release — Canopy Labs' Orpheus Hindi research release, which Svara was fine-tuned from.

Full-precision (bfloat16) MLX port of kenpath/svara-tts-v1 — an autoregressive multilingual text-to-speech model for 19 Indian languages, in the Orpheus / SNAC family. Same numerical precision as upstream, repackaged in MLX-native format (~6.6 GB sharded safetensors).

For smaller memory footprints, use the 4-bit (mlx-community/svara-tts-v1-4bit) or 8-bit (mlx-community/svara-tts-v1-8bit) quantized variants, also listed under Other Quants.
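As a rough guide to how much smaller the quantized variants might be, the sketch below scales the ~6.6 GB bfloat16 size by the effective bits per weight. It assumes every weight is quantized group-wise with one 16-bit scale and one 16-bit bias per group of 64 weights; the function name and these defaults are illustrative assumptions, and the actual repository sizes may differ.

```python
def estimate_quantized_size_gb(bf16_size_gb, bits, group_size=64, overhead_bits=32):
    """Estimate on-disk size after group-wise quantization.

    Assumptions (not taken from the model card): all weights are quantized,
    and each group of `group_size` weights stores `overhead_bits` of extra
    metadata (for example a 16-bit scale plus a 16-bit bias).
    """
    effective_bits = bits + overhead_bits / group_size
    # bfloat16 stores 16 bits per weight, so scale by the ratio of bit widths.
    return bf16_size_gb * effective_bits / 16


if __name__ == "__main__":
    for bits in (4, 8):
        size = estimate_quantized_size_gb(6.6, bits)
        print(f"{bits}-bit: roughly {size:.2f} GB")
```

Treat the result as an order-of-magnitude estimate for planning memory use, not as a published figure.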

Built for mlx-audio on Apple Silicon.

Usage

Requires mlx-audio with TTS extras:

pip install "mlx-audio[tts]"

Python

import numpy as np
import soundfile as sf
import mlx.core as mx
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/svara-tts-v1")

# generate() yields results as audio is produced; collect the audio from each one.
chunks = []
for result in model.generate(
    text="नमस्ते, आप कैसे हैं? मैं ठीक हूँ।",
    voice="Hindi (Female)",
    temperature=0.75,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.1,
    max_tokens=1200,
):
    chunks.append(result.audio)

# Join the pieces along the time axis and write a single WAV file.
audio = mx.concatenate(chunks, axis=0)
sf.write("hello_hi.wav", np.asarray(audio), model.sample_rate)  # 24 kHz

CLI

mlx_audio.tts.generate \
    --model mlx-community/svara-tts-v1 \
    --text "नमस्ते, आप कैसे हैं?" \
    --voice "Hindi (Female)" \
    --temperature 0.75 \
    --top_p 0.9

Voices

Use a string of the form "<Language Name> (<Gender>)":

| Language | Voices |
| --- | --- |
| Hindi | Hindi (Male), Hindi (Female) |
| Bengali | Bengali (Male), Bengali (Female) |
| Marathi | Marathi (Male), Marathi (Female) |
| Telugu | Telugu (Male), Telugu (Female) |
| Kannada | Kannada (Male), Kannada (Female) |
| Tamil | Tamil (Male), Tamil (Female) |
| Malayalam | Malayalam (Male), Malayalam (Female) |
| Gujarati | Gujarati (Male), Gujarati (Female) |
| Punjabi | Punjabi (Male), Punjabi (Female) |
| Assamese | Assamese (Male), Assamese (Female) |
| Bhojpuri | Bhojpuri (Male), Bhojpuri (Female) |
| Magahi | Magahi (Male), Magahi (Female) |
| Maithili | Maithili (Male), Maithili (Female) |
| Chhattisgarhi | Chhattisgarhi (Male), Chhattisgarhi (Female) |
| Bodo | Bodo (Male), Bodo (Female) |
| Dogri | Dogri (Male), Dogri (Female) |
| Nepali | Nepali (Male), Nepali (Female) |
| Sanskrit | Sanskrit (Male), Sanskrit (Female) |
| English (Indian) | English (Indian) (Male), English (Indian) (Female) |

Total: 38 voices across 19 languages.
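The voice strings follow a regular pattern, so they can be built and validated in code before calling the model. The sketch below is a convenience helper, not part of mlx-audio; the names `LANGUAGES`, `GENDERS`, and `voice_name` are hypothetical.

```python
# Language names exactly as they appear in the voice table.
LANGUAGES = [
    "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Tamil",
    "Malayalam", "Gujarati", "Punjabi", "Assamese", "Bhojpuri",
    "Magahi", "Maithili", "Chhattisgarhi", "Bodo", "Dogri",
    "Nepali", "Sanskrit", "English (Indian)",
]
GENDERS = ("Male", "Female")


def voice_name(language, gender):
    """Return a voice string such as "Hindi (Female)", rejecting unknown inputs."""
    if language not in LANGUAGES:
        raise ValueError(f"Unsupported language: {language!r}")
    if gender not in GENDERS:
        raise ValueError(f"Unsupported gender: {gender!r}")
    return f"{language} ({gender})"


# Every voice string listed in the table, in table order.
ALL_VOICES = [voice_name(lang, g) for lang in LANGUAGES for g in GENDERS]

if __name__ == "__main__":
    print(len(ALL_VOICES), "voices")
    print(voice_name("English (Indian)", "Female"))
```

Validating the string up front gives a clear error for a typo such as "Hindi (female)" instead of relying on how the model handles an unrecognized voice.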

Sampling Recommendations

The upstream svara-tts-inference repo uses these defaults; they're a good starting point:

| Parameter | Value |
| --- | --- |
| temperature | 0.75 |
| top_p | 0.9 |
| top_k | 40 |
| repetition_penalty | 1.1 |
| max_tokens | 1200–2048 |
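One way to keep these defaults in a single place is a small dictionary with per-call overrides. This is a minimal sketch; `DEFAULT_SAMPLING` and `sampling_kwargs` are hypothetical names, and the keys mirror the keyword arguments used in the Python usage example.

```python
# Recommended starting values from the table; max_tokens uses the lower bound.
DEFAULT_SAMPLING = {
    "temperature": 0.75,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 1200,
}


def sampling_kwargs(**overrides):
    """Return the defaults with selected values replaced.

    Unknown keys are rejected so a misspelled parameter fails loudly
    instead of being silently ignored.
    """
    unknown = set(overrides) - set(DEFAULT_SAMPLING)
    if unknown:
        raise ValueError(f"Unknown sampling parameters: {sorted(unknown)}")
    return {**DEFAULT_SAMPLING, **overrides}


if __name__ == "__main__":
    # Raise the token budget for a longer passage, keep everything else.
    print(sampling_kwargs(max_tokens=2048))
```

The result can then be unpacked into the call, for example `model.generate(text=..., voice=..., **sampling_kwargs(max_tokens=2048))`.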

Architecture

  • Backbone: Llama-3.2-3B (fine-tuned from canopylabs/3b-hi-ft-research_release, Canopy's Orpheus Hindi base).
  • Codec: SNAC 24 kHz, a 3-level hierarchical RVQ that emits 7 codes per frame (1 + 2 + 4 across the three levels; each frame covers roughly 85 ms of audio). Loaded automatically by mlx-audio.
  • Output: 24 kHz mono PCM.
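To make the 7-codes-per-frame layout concrete, the sketch below splits a flat stream of audio codes into the three SNAC levels (1, 2, and 4 codes per frame). The slot ordering and the per-slot offset of `codebook_size` are assumptions based on the Orpheus reference decoder; mlx-audio performs this step internally, and `split_snac_frames` is a hypothetical name used only for illustration.

```python
def split_snac_frames(codes, codebook_size=4096):
    """Split a flat list of 7-code frames into three hierarchical levels.

    Assumed layout per frame (slots 0..6):
      slot 0          -> level 0 (coarsest, 1 code per frame)
      slots 1, 4      -> level 1 (2 codes per frame)
      slots 2, 3, 5, 6 -> level 2 (finest, 4 codes per frame)
    Each slot is assumed to be shifted by slot_index * codebook_size.
    """
    if len(codes) % 7 != 0:
        raise ValueError("Expected a whole number of 7-code frames")

    level0, level1, level2 = [], [], []
    for start in range(0, len(codes), 7):
        # Undo the assumed per-slot offset to recover raw codebook indices.
        frame = [c - slot * codebook_size
                 for slot, c in enumerate(codes[start:start + 7])]
        level0.append(frame[0])
        level1.extend([frame[1], frame[4]])
        level2.extend([frame[2], frame[3], frame[5], frame[6]])
    return level0, level1, level2


if __name__ == "__main__":
    # One synthetic frame whose raw code in slot k is simply k.
    frame = [slot * 4096 + slot for slot in range(7)]
    print(split_snac_frames(frame))
```

Because each frame consumes 7 tokens, a `max_tokens` budget of N allows at most N // 7 complete audio frames.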

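Other Quants

  • mlx-community/svara-tts-v1-4bit (4-bit quantized)
  • mlx-community/svara-tts-v1-8bit (8-bit quantized)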

License

Apache 2.0 — see base model card for full details.