	---
	base_model: kenpath/svara-tts-v1
	models:
	- kenpath/svara-tts-v1
	- canopylabs/3b-hi-ft-research_release
	- mlx-community/svara-tts-v1-4bit
	- mlx-community/svara-tts-v1-8bit
	license: apache-2.0
	language:
	- hi
	- bn
	- mr
	- te
	- kn
	- bho
	- mag
	- hne
	- mai
	- as
	- brx
	- doi
	- gu
	- ml
	- pa
	- ta
	- ne
	- sa
	- en
	tags:
	- text-to-speech
	- speech-synthesis
	- multilingual
	- indic
	- orpheus
	- snac
	- mlx
	- mlx-audio
	task_categories:
	- text-to-speech
	pipeline_tag: text-to-speech
	pretty_name: Svara-TTS v1 (MLX, bfloat16)
	datasets:
	- SYSPIN
	- RASA
	- IndicTTS
	- SPICOR
	library_name: mlx
	---

	# Svara-TTS v1 — MLX bfloat16

	> Parent model: [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1) — full upstream weights, model card, training data, and evaluation. All credit for the model itself goes to the [Kenpath](https://huggingface.co/kenpath) team. This repo only contains an MLX-format conversion for inference on Apple Silicon.
	>
	> Orpheus base: [`canopylabs/3b-hi-ft-research_release`](https://huggingface.co/canopylabs/3b-hi-ft-research_release) — Canopy Labs' Orpheus Hindi research release, which Svara was fine-tuned from.

	Full-precision (bfloat16) MLX port of [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1) — an autoregressive multilingual text-to-speech model for 19 Indian languages, in the Orpheus / SNAC family. Same numerical precision as upstream, repackaged in MLX-native format (~6.6 GB sharded safetensors).

	For smaller memory footprints, use the 4-bit or 8-bit quantized variants linked below.

	Built for [mlx-audio](https://github.com/Blaizzy/mlx-audio) on Apple Silicon.

	## Usage

	Requires `mlx-audio` with TTS extras:

	```bash
	pip install "mlx-audio[tts]"
	```

	### Python

	```python
	import numpy as np
	import soundfile as sf
	import mlx.core as mx
	from mlx_audio.tts.utils import load_model

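	# Load the bfloat16 MLX weights from the Hugging Face Hub.
	model = load_model("mlx-community/svara-tts-v1")

	# generate() is a generator; collect the audio from each result it yields.
	chunks = []
	for result in model.generate(
	    text="नमस्ते, आप कैसे हैं? मैं ठीक हूँ।",
	    voice="Hindi (Female)",
	    temperature=0.75,
	    top_p=0.9,
	    top_k=40,
	    repetition_penalty=1.1,
	    max_tokens=1200,
	):
	    chunks.append(result.audio)

	# Join the pieces and write them out at the model's sample rate (24 kHz).
	audio = mx.concatenate(chunks, axis=0)
	sf.write("hello_hi.wav", np.asarray(audio), model.sample_rate)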
	```

	### CLI

	```bash
	mlx_audio.tts.generate \
	  --model mlx-community/svara-tts-v1 \
	  --text "नमस्ते, आप कैसे हैं?" \
	  --voice "Hindi (Female)" \
	  --temperature 0.75 \
	  --top_p 0.9
	```

	## Voices

	Use a string of the form `"<Language Name> (<Gender>)"`:

	| Language | Voices |
	|----------|--------|
	| Hindi | `Hindi (Male)`, `Hindi (Female)` |
	| Bengali | `Bengali (Male)`, `Bengali (Female)` |
	| Marathi | `Marathi (Male)`, `Marathi (Female)` |
	| Telugu | `Telugu (Male)`, `Telugu (Female)` |
	| Kannada | `Kannada (Male)`, `Kannada (Female)` |
	| Tamil | `Tamil (Male)`, `Tamil (Female)` |
	| Malayalam | `Malayalam (Male)`, `Malayalam (Female)` |
	| Gujarati | `Gujarati (Male)`, `Gujarati (Female)` |
	| Punjabi | `Punjabi (Male)`, `Punjabi (Female)` |
	| Assamese | `Assamese (Male)`, `Assamese (Female)` |
	| Bhojpuri | `Bhojpuri (Male)`, `Bhojpuri (Female)` |
	| Magahi | `Magahi (Male)`, `Magahi (Female)` |
	| Maithili | `Maithili (Male)`, `Maithili (Female)` |
	| Chhattisgarhi | `Chhattisgarhi (Male)`, `Chhattisgarhi (Female)` |
	| Bodo | `Bodo (Male)`, `Bodo (Female)` |
	| Dogri | `Dogri (Male)`, `Dogri (Female)` |
	| Nepali | `Nepali (Male)`, `Nepali (Female)` |
	| Sanskrit | `Sanskrit (Male)`, `Sanskrit (Female)` |
	| English (Indian) | `English (Indian) (Male)`, `English (Indian) (Female)` |

	Total: 38 voices across 19 languages.
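
The naming scheme above is regular enough to build and check voice strings in code. The following is a minimal sketch; `LANGUAGES`, `GENDERS`, `voice_name`, and `all_voices` are hypothetical names used for illustration and are not part of `mlx-audio`.

```python
# Hypothetical helpers (not part of mlx-audio): build and validate voice
# strings of the form "<Language Name> (<Gender>)" from the table above.
LANGUAGES = (
    "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Tamil",
    "Malayalam", "Gujarati", "Punjabi", "Assamese", "Bhojpuri", "Magahi",
    "Maithili", "Chhattisgarhi", "Bodo", "Dogri", "Nepali", "Sanskrit",
    "English (Indian)",
)
GENDERS = ("Male", "Female")


def voice_name(language: str, gender: str) -> str:
    """Return a voice string such as "Hindi (Female)"; raise on unknown input."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    if gender not in GENDERS:
        raise ValueError(f"unknown gender: {gender!r}")
    return f"{language} ({gender})"


def all_voices() -> list[str]:
    """Enumerate every language/gender combination listed in the table."""
    return [voice_name(lang, gender) for lang in LANGUAGES for gender in GENDERS]
```

A string built this way can be passed as the `voice` argument shown in the Usage section.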

	## Sampling Recommendations

	The upstream `svara-tts-inference` repo uses these defaults; they're a good starting point:

	| Parameter | Value |
	|-----------|-------|
	| `temperature` | 0.75 |
	| `top_p` | 0.9 |
	| `top_k` | 40 |
	| `repetition_penalty` | 1.1 |
	| `max_tokens` | 1200–2048 |
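
One way to keep these values in a single place is a small defaults dictionary that individual calls can override. This sketch assumes the keyword names used in the Usage example; `SAMPLING_DEFAULTS` and `sampling_kwargs` are hypothetical names for illustration.

```python
# Recommended starting values from the table above.
SAMPLING_DEFAULTS = {
    "temperature": 0.75,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 1200,  # lower end of the 1200-2048 range
}


def sampling_kwargs(**overrides):
    """Return a copy of the defaults with any overrides applied.

    Hypothetical helper: rejects names that are not in the table so that
    typos do not go unnoticed.
    """
    unknown = set(overrides) - set(SAMPLING_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown sampling parameter(s): {sorted(unknown)}")
    return {**SAMPLING_DEFAULTS, **overrides}
```

The result can be unpacked into the call from the Usage section, for example `model.generate(text=..., voice="Hindi (Female)", **sampling_kwargs(max_tokens=2048))`.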

	## Architecture

	- Backbone: Llama-3.2-3B (fine-tuned from [`canopylabs/3b-hi-ft-research_release`](https://huggingface.co/canopylabs/3b-hi-ft-research_release), Canopy's Orpheus Hindi base).
	- Codec: [SNAC 24 kHz](https://huggingface.co/hubertsiuzdak/snac_24khz), a 3-level hierarchical RVQ codec that emits 7 codes (1 + 2 + 4 across the three levels) per coarse frame of roughly 85 ms. Loaded automatically by `mlx-audio`.
	- Output: 24 kHz mono PCM.
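
To make the "7 codes per frame" figure concrete, the sketch below splits a flat list of codes into the three SNAC levels: one code for the coarsest level, two for the middle level, and four for the finest. The position-to-level mapping is an assumption based on the interleaving used by the Orpheus reference decoder, and `split_snac_frames` is a hypothetical helper. `mlx-audio` performs the actual decoding internally, so none of this is needed for normal use.

```python
def split_snac_frames(codes):
    """Split a flat code list into (level_1, level_2, level_3).

    Assumed layout per 7-code frame (Orpheus-style interleaving):
    position 0 -> level 1, positions 1 and 4 -> level 2,
    positions 2, 3, 5 and 6 -> level 3.
    Codes are assumed to be plain codebook indices already.
    """
    if len(codes) % 7 != 0:
        raise ValueError("expected a multiple of 7 codes")

    level_1, level_2, level_3 = [], [], []
    for start in range(0, len(codes), 7):
        frame = codes[start:start + 7]
        level_1.append(frame[0])
        level_2.extend([frame[1], frame[4]])
        level_3.extend([frame[2], frame[3], frame[5], frame[6]])
    return level_1, level_2, level_3
```

For two frames, the three levels end up holding 2, 4, and 8 codes respectively, which reflects the coarse-to-fine hierarchy.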

	## Other Quants

	- 8-bit MLX: [`mlx-community/svara-tts-v1-8bit`](https://huggingface.co/mlx-community/svara-tts-v1-8bit) (~3.5 GB)
	- 4-bit MLX: [`mlx-community/svara-tts-v1-4bit`](https://huggingface.co/mlx-community/svara-tts-v1-4bit) (~1.9 GB)
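
As a rough guide for choosing between the variants, the sketch below picks the highest-precision repo whose weights fit a given size budget. It compares only against the approximate sizes quoted in this card; memory use during inference is typically higher than the weight size alone, and `pick_variant` is a hypothetical helper.

```python
# (repo id, approximate weight size in GB), ordered from highest precision down.
VARIANTS = [
    ("mlx-community/svara-tts-v1", 6.6),       # bfloat16
    ("mlx-community/svara-tts-v1-8bit", 3.5),  # 8-bit
    ("mlx-community/svara-tts-v1-4bit", 1.9),  # 4-bit
]


def pick_variant(budget_gb: float) -> str:
    """Return the first (highest-precision) repo whose weights fit the budget."""
    for repo, size_gb in VARIANTS:
        if size_gb <= budget_gb:
            return repo
    raise ValueError(f"no variant fits within {budget_gb} GB")
```

The returned repo id can be passed to `load_model` in place of the bfloat16 repo.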

	## License

	Apache 2.0 — see [base model card](https://huggingface.co/kenpath/svara-tts-v1) for full details.