base_model: kenpath/svara-tts-v1
models:
  - kenpath/svara-tts-v1
  - canopylabs/3b-hi-ft-research_release
  - mlx-community/svara-tts-v1
  - mlx-community/svara-tts-v1-8bit
license: apache-2.0
language:
  - hi
  - bn
  - mr
  - te
  - kn
  - bho
  - mag
  - hne
  - mai
  - as
  - brx
  - doi
  - gu
  - ml
  - pa
  - ta
  - ne
  - sa
  - en
tags:
  - text-to-speech
  - speech-synthesis
  - multilingual
  - indic
  - orpheus
  - snac
  - mlx
  - mlx-audio
task_categories:
  - text-to-speech
pipeline_tag: text-to-speech
pretty_name: Svara-TTS v1 (MLX, 4-bit)
datasets:
  - SYSPIN
  - RASA
  - IndicTTS
  - SPICOR
library_name: mlx

Svara-TTS v1 — MLX 4-bit

Parent model: kenpath/svara-tts-v1, which holds the full upstream weights, model card, training data, and evaluation. All credit for the model itself goes to the Kenpath team. This repo contains only an MLX-format quantization for inference on Apple Silicon.

Orpheus base: canopylabs/3b-hi-ft-research_release — Canopy Labs' Orpheus Hindi research release, which Svara was fine-tuned from.

4-bit MLX-quantized port of kenpath/svara-tts-v1, an autoregressive multilingual text-to-speech model for 19 Indian languages, in the Orpheus / SNAC family. Quantized at ~4.5 bits per weight (q-bits=4, q-group-size=64), which shrinks the weights from ~6.6 GB in bf16 to **1.9 GB**.
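The ~4.5 bits-per-weight figure can be reproduced with a little arithmetic. The sketch below assumes that MLX-style affine group quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights, and it infers the parameter count from the ~6.6 GB bf16 MLX sibling listed under Other Quants. Both are assumptions, not values published in this card, and the function names are illustrative only.

```python
def effective_bits_per_weight(q_bits: int, group_size: int,
                              scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Bits per weight once the per-group scale and bias are amortized.

    Assumption: each group of `group_size` weights carries one scale and
    one bias, each stored in 16 bits.
    """
    return q_bits + (scale_bits + bias_bits) / group_size


def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough on-disk size in GB (10^9 bytes), ignoring any unquantized layers."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: parameter count inferred from ~6.6 GB at 2 bytes per bf16 weight.
N_PARAMS = 6.6e9 / 2

bpw = effective_bits_per_weight(q_bits=4, group_size=64)
size_gb = estimated_size_gb(N_PARAMS, bpw)

print(bpw)                   # 4.5
print(f"{size_gb:.2f} GB")   # 1.86 GB
```

The estimate lands close to the 1.9 GB quoted above; the small gap is expected if some tensors are kept at higher precision.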

Built for mlx-audio on Apple Silicon.

Usage

Requires mlx-audio with TTS extras:

pip install "mlx-audio[tts]"

Python

import numpy as np
import soundfile as sf
import mlx.core as mx
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/svara-tts-v1-4bit")

chunks = []
for result in model.generate(
    text="नमस्ते, आप कैसे हैं? मैं ठीक हूँ।",
    voice="Hindi (Female)",
    temperature=0.75,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.1,
    max_tokens=1200,
):
    chunks.append(result.audio)

audio = mx.concatenate(chunks, axis=0)
sf.write("hello_hi.wav", np.asarray(audio), model.sample_rate)  # 24 kHz

CLI

mlx_audio.tts.generate \
    --model mlx-community/svara-tts-v1-4bit \
    --text "नमस्ते, आप कैसे हैं?" \
    --voice "Hindi (Female)" \
    --temperature 0.75 \
    --top_p 0.9

Voices

Use a string of the form "<Language Name> (<Gender>)":

Language	Voices
Hindi	`Hindi (Male)`, `Hindi (Female)`
Bengali	`Bengali (Male)`, `Bengali (Female)`
Marathi	`Marathi (Male)`, `Marathi (Female)`
Telugu	`Telugu (Male)`, `Telugu (Female)`
Kannada	`Kannada (Male)`, `Kannada (Female)`
Tamil	`Tamil (Male)`, `Tamil (Female)`
Malayalam	`Malayalam (Male)`, `Malayalam (Female)`
Gujarati	`Gujarati (Male)`, `Gujarati (Female)`
Punjabi	`Punjabi (Male)`, `Punjabi (Female)`
Assamese	`Assamese (Male)`, `Assamese (Female)`
Bhojpuri	`Bhojpuri (Male)`, `Bhojpuri (Female)`
Magahi	`Magahi (Male)`, `Magahi (Female)`
Maithili	`Maithili (Male)`, `Maithili (Female)`
Chhattisgarhi	`Chhattisgarhi (Male)`, `Chhattisgarhi (Female)`
Bodo	`Bodo (Male)`, `Bodo (Female)`
Dogri	`Dogri (Male)`, `Dogri (Female)`
Nepali	`Nepali (Male)`, `Nepali (Female)`
Sanskrit	`Sanskrit (Male)`, `Sanskrit (Female)`
English (Indian)	`English (Indian) (Male)`, `English (Indian) (Female)`

Total: 38 voices across 19 languages.
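Because every voice follows the same "<Language Name> (<Gender>)" pattern, the full list can be generated rather than typed by hand. This is a minimal sketch; `LANGUAGES`, `GENDERS`, and `voice_name` are hypothetical helpers written for this card and are not part of mlx-audio.

```python
# Language names exactly as they appear in the table above.
LANGUAGES = [
    "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Tamil",
    "Malayalam", "Gujarati", "Punjabi", "Assamese", "Bhojpuri", "Magahi",
    "Maithili", "Chhattisgarhi", "Bodo", "Dogri", "Nepali", "Sanskrit",
    "English (Indian)",
]
GENDERS = ("Male", "Female")


def voice_name(language: str, gender: str) -> str:
    """Build a voice string, rejecting languages or genders not in the table."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    if gender not in GENDERS:
        raise ValueError(f"unknown gender: {gender!r}")
    return f"{language} ({gender})"


ALL_VOICES = [voice_name(lang, g) for lang in LANGUAGES for g in GENDERS]

print(len(ALL_VOICES))                           # 38
print(voice_name("English (Indian)", "Male"))    # English (Indian) (Male)
```

Validating the string up front gives a clear error for a typo in the language or gender before the model is ever called.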

Sampling Recommendations

The upstream svara-tts-inference repo uses these defaults; they're a good starting point:

Parameter	Value
`temperature`	0.75
`top_p`	0.9
`top_k`	40
`repetition_penalty`	1.1
`max_tokens`	1200–2048
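One convenient way to reuse these defaults is to keep them in a dictionary and override single values per call. The sketch below is an illustration only: `SAMPLING_DEFAULTS` and `sampling_kwargs` are hypothetical names, and the keyword names simply mirror the `model.generate` call shown in the Usage section.

```python
# Defaults from the table above; max_tokens uses the lower end of the range.
SAMPLING_DEFAULTS = {
    "temperature": 0.75,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 1200,
}


def sampling_kwargs(**overrides):
    """Return the defaults with selected values replaced.

    Unknown keys are rejected so that a typo does not silently fall back
    to a default.
    """
    unknown = set(overrides) - set(SAMPLING_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown sampling parameter(s): {sorted(unknown)}")
    return {**SAMPLING_DEFAULTS, **overrides}


# Longer passages: raise the token budget to the upper end of the range.
long_form = sampling_kwargs(max_tokens=2048)
print(long_form["max_tokens"])   # 2048

# Intended use with a loaded model (not executed here):
# model.generate(text="...", voice="Hindi (Female)", **long_form)
```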

Architecture

Backbone: Llama-3.2-3B (fine-tuned from canopylabs/3b-hi-ft-research_release, Canopy's Orpheus Hindi base).
Codec: SNAC 24 kHz, a 3-level hierarchical RVQ with 7 codes per ~85 ms frame. Loaded automatically by mlx-audio.
Output: 24 kHz mono PCM.
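The 7-codes-per-frame layout makes it possible to estimate how much audio a given `max_tokens` budget can produce. The sketch below rests on two assumptions that do not come from this card: that each frame is split as 1 coarse + 2 mid + 4 fine codes across the three RVQ levels, and that one frame decodes to 2048 samples at 24 kHz (the coarsest-level hop of the SNAC 24 kHz configuration). Treat the result as an upper bound, since prompt and control tokens may also count against the budget.

```python
# Assumed split of one frame across the three RVQ levels (coarse, mid, fine).
CODES_PER_LEVEL = (1, 2, 4)
CODES_PER_FRAME = sum(CODES_PER_LEVEL)   # 7, as stated above

SAMPLE_RATE = 24_000        # Hz, from the Output line above
SAMPLES_PER_FRAME = 2048    # assumption: samples decoded per 7-code frame


def max_audio_seconds(max_tokens: int) -> float:
    """Upper bound on audio length if every generated token is an audio code."""
    frames = max_tokens // CODES_PER_FRAME   # incomplete frames are dropped
    return frames * SAMPLES_PER_FRAME / SAMPLE_RATE


for budget in (1200, 2048):
    print(budget, round(max_audio_seconds(budget), 1))
# 1200 14.6
# 2048 24.9
```

Under these assumptions, the recommended `max_tokens` range of 1200 to 2048 corresponds to roughly 15 to 25 seconds of speech per call, so longer texts are best split into sentences.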

Other Quants

bf16 MLX: mlx-community/svara-tts-v1 (~6.6 GB)
8-bit MLX: mlx-community/svara-tts-v1-8bit (~3.5 GB)
bf16 source: kenpath/svara-tts-v1 (~13.2 GB)

License

Apache 2.0 — see base model card for full details.