metadata
base_model: kenpath/svara-tts-v1
models:
  - kenpath/svara-tts-v1
  - canopylabs/3b-hi-ft-research_release
  - mlx-community/svara-tts-v1
  - mlx-community/svara-tts-v1-8bit
license: apache-2.0
language:
  - hi
  - bn
  - mr
  - te
  - kn
  - bho
  - mag
  - hne
  - mai
  - as
  - brx
  - doi
  - gu
  - ml
  - pa
  - ta
  - ne
  - sa
  - en
tags:
  - text-to-speech
  - speech-synthesis
  - multilingual
  - indic
  - orpheus
  - snac
  - mlx
  - mlx-audio
task_categories:
  - text-to-speech
pipeline_tag: text-to-speech
pretty_name: Svara-TTS v1 (MLX, 4-bit)
datasets:
  - SYSPIN
  - RASA
  - IndicTTS
  - SPICOR
library_name: mlx

Svara-TTS v1 — MLX 4-bit

Parent model: kenpath/svara-tts-v1 — full upstream weights, model card, training data, and evaluation. All credit for the model itself goes to the Kenpath team. This repo only contains an MLX-format quantization for inference on Apple Silicon.

Orpheus base: canopylabs/3b-hi-ft-research_release — Canopy Labs' Orpheus Hindi research release, which Svara was fine-tuned from.

4-bit MLX-quantized port of kenpath/svara-tts-v1 — an autoregressive multilingual text-to-speech model for 19 Indian languages, in the Orpheus / SNAC family. Quantized at ~4.5 bits per weight (q-bits=4, q-group-size=64), down from 13.2 GB bf16 to **1.9 GB**.
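The ~4.5 bits-per-weight figure can be reproduced from the quantization settings. The sketch below assumes that each group of 64 weights stores one 16-bit scale and one 16-bit bias next to its 4-bit values, which is how MLX's group-wise affine quantization is usually described. The helper name is illustrative, and the parameter count is only a rough back-of-envelope estimate from the 1.9 GB figure (it ignores any layers left unquantized).

```python
def effective_bits_per_weight(q_bits, group_size, scale_bits=16, bias_bits=16):
    """Average storage cost per weight for group-wise affine quantization.

    Assumption: every group of `group_size` weights carries one scale and
    one bias, each stored in 16 bits.
    """
    return q_bits + (scale_bits + bias_bits) / group_size


bpw = effective_bits_per_weight(q_bits=4, group_size=64)
print(bpw)  # 4.5

# Rough parameter count implied by a 1.9 GB checkpoint at that rate.
approx_params = 1.9e9 * 8 / bpw
print(f"{approx_params / 1e9:.2f}B parameters (estimate)")
```

The same formula gives 8.5 bits per weight for an 8-bit quantization with the same group size.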

Built for mlx-audio on Apple Silicon.

Usage

Requires mlx-audio with TTS extras:

pip install "mlx-audio[tts]"

Python

import numpy as np
import soundfile as sf
import mlx.core as mx
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/svara-tts-v1-4bit")

chunks = []
for result in model.generate(
    text="नमस्ते, आप कैसे हैं? मैं ठीक हूँ।",
    voice="Hindi (Female)",
    temperature=0.75,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.1,
    max_tokens=1200,
):
    chunks.append(result.audio)

audio = mx.concatenate(chunks, axis=0)
sf.write("hello_hi.wav", np.asarray(audio), model.sample_rate)  # 24 kHz

CLI

mlx_audio.tts.generate \
    --model mlx-community/svara-tts-v1-4bit \
    --text "नमस्ते, आप कैसे हैं?" \
    --voice "Hindi (Female)" \
    --temperature 0.75 \
    --top_p 0.9

Voices

Use a string of the form "<Language Name> (<Gender>)":

| Language | Voices |
|---|---|
| Hindi | Hindi (Male), Hindi (Female) |
| Bengali | Bengali (Male), Bengali (Female) |
| Marathi | Marathi (Male), Marathi (Female) |
| Telugu | Telugu (Male), Telugu (Female) |
| Kannada | Kannada (Male), Kannada (Female) |
| Tamil | Tamil (Male), Tamil (Female) |
| Malayalam | Malayalam (Male), Malayalam (Female) |
| Gujarati | Gujarati (Male), Gujarati (Female) |
| Punjabi | Punjabi (Male), Punjabi (Female) |
| Assamese | Assamese (Male), Assamese (Female) |
| Bhojpuri | Bhojpuri (Male), Bhojpuri (Female) |
| Magahi | Magahi (Male), Magahi (Female) |
| Maithili | Maithili (Male), Maithili (Female) |
| Chhattisgarhi | Chhattisgarhi (Male), Chhattisgarhi (Female) |
| Bodo | Bodo (Male), Bodo (Female) |
| Dogri | Dogri (Male), Dogri (Female) |
| Nepali | Nepali (Male), Nepali (Female) |
| Sanskrit | Sanskrit (Male), Sanskrit (Female) |
| English (Indian) | English (Indian) (Male), English (Indian) (Female) |

Total: 38 voices across 19 languages.
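If you build voice strings programmatically, a small helper can catch typos before they reach the model. This is a minimal sketch: `LANGUAGES`, `GENDERS`, and `voice_name` are hypothetical names that are not part of mlx-audio, and the language list is simply copied from the voices listed above.

```python
LANGUAGES = [
    "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Tamil",
    "Malayalam", "Gujarati", "Punjabi", "Assamese", "Bhojpuri", "Magahi",
    "Maithili", "Chhattisgarhi", "Bodo", "Dogri", "Nepali", "Sanskrit",
    "English (Indian)",
]
GENDERS = ("Male", "Female")


def voice_name(language, gender):
    """Return a voice string of the form '<Language Name> (<Gender>)'."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    if gender not in GENDERS:
        raise ValueError(f"unknown gender: {gender!r}")
    return f"{language} ({gender})"


ALL_VOICES = [voice_name(lang, g) for lang in LANGUAGES for g in GENDERS]
print(len(ALL_VOICES))  # 38
```

The result of `voice_name("Tamil", "Male")` can be passed as the `voice` argument in the Python example or as `--voice` on the command line.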

Sampling Recommendations

The upstream svara-tts-inference repo uses these defaults; they're a good starting point:

| Parameter | Value |
|---|---|
| temperature | 0.75 |
| top_p | 0.9 |
| top_k | 40 |
| repetition_penalty | 1.1 |
| max_tokens | 1200–2048 |
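Because the model emits audio codes as tokens, `max_tokens` effectively caps the length of the generated clip. The sketch below gives a rough upper bound, assuming 7 codes per SNAC frame and 2048 output samples per frame at 24 kHz (the layout used by the upstream Orpheus decoder, not something this repo states). The function name is hypothetical, and real output is somewhat shorter because a few tokens are spent on control markers.

```python
CODES_PER_FRAME = 7        # audio tokens per SNAC frame
SAMPLES_PER_FRAME = 2048   # assumption: samples decoded from one frame
SAMPLE_RATE = 24_000       # Hz


def max_audio_seconds(max_tokens):
    """Upper bound on clip length for a given token budget."""
    frames = max_tokens // CODES_PER_FRAME
    return frames * SAMPLES_PER_FRAME / SAMPLE_RATE


for budget in (1200, 2048):
    print(budget, round(max_audio_seconds(budget), 1))
# 1200 14.6
# 2048 24.9
```

Under these assumptions, longer passages need either a larger `max_tokens` or splitting the text into sentences and generating each one separately.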

Architecture

  • Backbone: Llama-3.2-3B (fine-tuned from canopylabs/3b-hi-ft-research_release, Canopy's Orpheus Hindi base).
  • Codec: SNAC 24 kHz, a 3-level hierarchical RVQ codec with 7 codes per frame (1 coarse, 2 mid, 4 fine), each frame covering roughly 85 ms of audio. Loaded automatically by mlx-audio.
  • Output: 24 kHz mono PCM.
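To make the 7-codes-per-frame layout concrete, the sketch below de-interleaves a flat code stream into SNAC's three levels. It assumes the ordering and per-position offsets used by the upstream Orpheus decoder (4096 entries per codebook, position `k` offset by `k * 4096`). mlx-audio does this internally, so the function is purely illustrative and its name is hypothetical.

```python
CODEBOOK_SIZE = 4096  # assumption: entries per SNAC codebook


def split_snac_frames(codes):
    """Split flat 7-code frames into (coarse, mid, fine) code lists.

    Assumed per-frame order: coarse, mid, fine, fine, mid, fine, fine.
    """
    if len(codes) % 7 != 0:
        raise ValueError("expected a multiple of 7 codes")
    coarse, mid, fine = [], [], []
    for start in range(0, len(codes), 7):
        # Remove the position-dependent offset from each code in the frame.
        frame = [c - k * CODEBOOK_SIZE
                 for k, c in enumerate(codes[start:start + 7])]
        coarse.append(frame[0])
        mid.extend([frame[1], frame[4]])
        fine.extend([frame[2], frame[3], frame[5], frame[6]])
    return coarse, mid, fine


# One synthetic frame whose de-offset codes are 5, 6, 7, 8, 9, 10, 11.
frame = [5 + 0 * 4096, 6 + 1 * 4096, 7 + 2 * 4096, 8 + 3 * 4096,
         9 + 4 * 4096, 10 + 5 * 4096, 11 + 6 * 4096]
print(split_snac_frames(frame))  # ([5], [6, 9], [7, 8, 10, 11])
```

Each frame contributes 1 coarse, 2 mid, and 4 fine codes, which is where the 7 comes from.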

Other Quants

The metadata for this repo lists two sibling MLX conversions: mlx-community/svara-tts-v1 and mlx-community/svara-tts-v1-8bit.

License

Apache 2.0 — see base model card for full details.