---
base_model: kenpath/svara-tts-v1
library_name: transformers.js
models:
- kenpath/svara-tts-v1
- canopylabs/3b-hi-ft-research_release
- onnx-community/snac_24khz-ONNX
license: apache-2.0
language:
- hi
- bn
- mr
- te
- kn
- bho
- mag
- hne
- mai
- as
- brx
- doi
- gu
- ml
- pa
- ta
- ne
- sa
- en
tags:
- text-to-speech
- speech-synthesis
- multilingual
- indic
- orpheus
- snac
- onnx
- transformers.js
- webgpu
task_categories:
- text-to-speech
pipeline_tag: text-to-speech
pretty_name: Svara-TTS v1 (ONNX)
---

# Svara-TTS v1 - ONNX

> **Parent model:** [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1): full upstream weights, model card, training data, and evaluation. All credit for the model itself goes to the [Kenpath](https://huggingface.co/kenpath) team. This repo contains only an ONNX-format export for browser and cross-platform inference.
>
> **Orpheus base:** [`canopylabs/3b-hi-ft-research_release`](https://huggingface.co/canopylabs/3b-hi-ft-research_release): Canopy Labs' Orpheus Hindi research release, from which Svara was fine-tuned.
>
> **Codec:** Pair this LM with [`onnx-community/snac_24khz-ONNX`](https://huggingface.co/onnx-community/snac_24khz-ONNX) for SNAC decoding.

ONNX export of [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1) for [transformers.js v4](https://huggingface.co/blog/transformersjs-v4) WebGPU inference. Two quantization levels are published:

| dtype | Size | Use case |
|-------|------|----------|
| **q4f16** | ~1.95 GB (single file) | Default for browsers: int4 MatMul + Gather via `com.microsoft.MatMulNBits`, fp16 activations, `block_size=128`, symmetric. |
| **q8** | ~4.32 GB (3 sharded files, <2 GB each) | Higher fidelity: int8 MatMul + Gather, fp16 activations, same block layout. Use when bandwidth allows; closer in quality to bf16. |

Both are down from the ~13.2 GB bf16 source.

## Usage with transformers.js v4

```js
import { AutoTokenizer, AutoModelForCausalLM, Tensor } from "@huggingface/transformers";

const model_id = "shreyask/svara-tts-v1-ONNX";
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: "q4f16", // or "q8" for higher fidelity
  device: "webgpu",
  use_external_data_format: true,
});

// Prompt format (matches mlx-audio's Llama TTS loader):
//   [SOH=128259, BOS=128000, "<voice>: <text>" tokens, EOT=128009, EOH=128260]
// The model predicts SOAI and SOS itself, then the audio token stream, then EOS.
const text = "नमस्ते, आप कैसे हैं?"; // "Hello, how are you?"
const voice = "Hindi (Female)";
const body = tokenizer.encode(`${voice}: ${text}`, { add_special_tokens: false });
const ids = [128259, tokenizer.bos_token_id, ...body, 128009, 128260];

const out = await model.generate({
  inputs: new Tensor("int64", BigInt64Array.from(ids.map(BigInt)), [1, ids.length]),
  max_new_tokens: 2048,
  do_sample: true,
  temperature: 0.6, // 0.75 with q8 / fp16; lower with q4f16 to compensate for quant noise
  top_p: 0.9,
  top_k: 40,
  repetition_penalty: 1.0, // MUST be 1.0 -- transformers.js applies the penalty across the whole
                           // stream and progressively suppresses naturally recurring audio
                           // codes -> quieter output over time.
  eos_token_id: 128258,
});

// out.data is the LM token stream. Audio tokens fall in [128266, 156938).
// SNAC code = token_id - 128266 - (position % 7) * 4096.
// Group every 7 codes into a frame, then split into the three hierarchical
// levels expected by SNAC (band 0 -> level 0; bands 1,4 -> level 1;
// bands 2,3,5,6 -> level 2) and decode with the SNAC ONNX decoder.
```

A complete worked example (React + Vite + WebGPU) lives at [`shreyaskarnik/svara-tts-webgpu`](https://github.com/shreyaskarnik/svara-tts-webgpu).

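The token-to-code arithmetic described in the closing comments of the snippet above can be sketched as a small standalone function. This is a minimal illustration, not the code used by the worked example: `tokensToSnacCodes` is a hypothetical helper name, and it assumes that the position used for `position % 7` is the index among audio-range tokens (that is, no control tokens appear in the middle of the audio stream).

```js
// Constants taken from this model card.
const AUDIO_TOKEN_OFFSET = 128266;
const CODES_PER_BAND = 4096;
const BANDS_PER_FRAME = 7;
// Exclusive upper bound of the audio token range: 128266 + 7 * 4096 = 156938.
const AUDIO_TOKEN_END = AUDIO_TOKEN_OFFSET + BANDS_PER_FRAME * CODES_PER_BAND;

/**
 * Hypothetical helper: turn a generated token stream into flat SNAC codes.
 * `tokenIds` is any iterable of numbers or BigInts (for example `out.data`).
 */
function tokensToSnacCodes(tokenIds) {
  // Keep only audio tokens; prompt and control tokens fall outside the range.
  const audio = [];
  for (const t of tokenIds) {
    const id = Number(t);
    if (id >= AUDIO_TOKEN_OFFSET && id < AUDIO_TOKEN_END) audio.push(id);
  }

  // Drop a trailing partial frame so the result is a whole number of frames.
  const usable = audio.length - (audio.length % BANDS_PER_FRAME);
  const codes = new Array(usable);
  for (let i = 0; i < usable; i++) {
    const band = i % BANDS_PER_FRAME;
    const code = audio[i] - AUDIO_TOKEN_OFFSET - band * CODES_PER_BAND;
    // A token sampled from the wrong band yields a code outside [0, 4096).
    if (code < 0 || code >= CODES_PER_BAND) {
      throw new RangeError(`token at audio position ${i} is not in band ${band}`);
    }
    codes[i] = code;
  }
  return codes;
}
```

Throwing on an out-of-band token keeps the sketch simple; an application may prefer to drop the affected frame instead.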
## Repository Layout

```
.
├── README.md
├── config.json                      (transformers config, model_type=llama)
├── generation_config.json
├── special_tokens_map.json
├── tokenizer.json                   (Llama-3 BPE + Svara special tokens)
├── tokenizer_config.json
├── chat_template.jinja
└── onnx/
    ├── model_q4f16.onnx             (graph, ~1.3 MB)
    ├── model_q4f16.onnx_data        (weights, ~1.95 GB single file)
    ├── model_quantized.onnx         (q8 graph, ~1.3 MB)
    ├── model_quantized.onnx_data    (weights chunk 1, ~1.99 GB)
    ├── model_quantized.onnx_data_1  (weights chunk 2, ~1.84 GB)
    └── model_quantized.onnx_data_2  (weights chunk 3, ~0.49 GB)
```

The q8 weights are sharded into chunks under 2 GB to stay within browser ArrayBuffer limits, matching the layout convention of [`onnx-community/gpt-oss-20b-ONNX`](https://huggingface.co/onnx-community/gpt-oss-20b-ONNX).

## Voices

Voice prefix format: `"<Language Name> (<Gender>)"`. There are **38 voices across 19 Indian languages**, one male and one female voice for each: Hindi, Bengali, Marathi, Telugu, Kannada, Tamil, Malayalam, Gujarati, Punjabi, Assamese, Bhojpuri, Magahi, Maithili, Chhattisgarhi, Bodo, Dogri, Nepali, Sanskrit, and English (Indian).

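As a rough illustration of the prefix format, the sketch below enumerates the voice strings and builds the text that gets tokenized. The helper names (`voicePrefix`, `listVoices`, `buildPromptText`) are hypothetical, and the language names are copied from the list above; the exact spellings and capitalization the model was trained with (in particular for "English (Indian)" and for the gender labels) are an assumption and should be checked against the parent model card.

```js
// Language names as listed in this model card (assumed to match the trained prefixes).
const LANGUAGES = [
  "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Tamil", "Malayalam",
  "Gujarati", "Punjabi", "Assamese", "Bhojpuri", "Magahi", "Maithili",
  "Chhattisgarhi", "Bodo", "Dogri", "Nepali", "Sanskrit", "English (Indian)",
];
// Gender labels; "Female" matches the usage example, "Male" is assumed by analogy.
const GENDERS = ["Male", "Female"];

// Format one voice prefix, e.g. "Hindi (Female)".
function voicePrefix(language, gender) {
  return `${language} (${gender})`;
}

// Enumerate every language/gender combination: 19 x 2 = 38 voices.
function listVoices() {
  return LANGUAGES.flatMap((language) => GENDERS.map((gender) => voicePrefix(language, gender)));
}

// Build the "<voice>: <text>" string that is tokenized into the prompt body.
function buildPromptText(voice, text) {
  return `${voice}: ${text}`;
}
```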
## Architecture

- **Backbone:** Llama-3.2-3B fine-tuned from Canopy Labs' Orpheus Hindi base. Native KV-cache via `text-generation-with-past` ONNX export. GQA (8 KV heads), llama3 NTK rope scaling (factor=32), tied word embeddings, vocab_size=156940.
- **Audio token layout:** 7 token bands of 4096 codes each, for a total of 7 × 4096 = 28672 audio token IDs starting at offset 128266. Each generated token's "band" is `position_since_SOS % 7`, and the decoded SNAC code is `token_id - 128266 - band * 4096`.
- **Codec:** SNAC 24 kHz, 3-level hierarchical RVQ. Use [`onnx-community/snac_24khz-ONNX`](https://huggingface.co/onnx-community/snac_24khz-ONNX) `decoder_model_fp16.onnx` (~26 MB). Decoder input layout: `audio_codes.0` = band 0, `audio_codes.1` = bands 1,4, `audio_codes.2` = bands 2,3,5,6 (chronological per coarse frame). Output: 24 kHz mono PCM.

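The decoder input layout in the last bullet can be sketched as a regrouping step. This is a minimal illustration under stated assumptions: `splitSnacLevels` is a hypothetical helper, its input is a flat array of already-decoded SNAC codes in band order 0 to 6 per frame, and it returns plain arrays. Wrapping them in tensors (for example with shapes `[1, frames]`, `[1, 2 * frames]`, and `[1, 4 * frames]`) and the integer dtype the decoder expects are not specified here and should be checked against the SNAC ONNX repo.

```js
/**
 * Hypothetical helper: regroup flat 7-code frames into the three SNAC levels.
 * Per frame: band 0 -> level 0; bands 1,4 -> level 1; bands 2,3,5,6 -> level 2.
 */
function splitSnacLevels(codes) {
  const BANDS_PER_FRAME = 7;
  if (codes.length % BANDS_PER_FRAME !== 0) {
    throw new RangeError("expected a whole number of 7-code frames");
  }
  const frames = codes.length / BANDS_PER_FRAME;
  const level0 = new Array(frames);
  const level1 = new Array(frames * 2);
  const level2 = new Array(frames * 4);

  for (let f = 0; f < frames; f++) {
    const base = f * BANDS_PER_FRAME;
    // Coarsest level: one code per frame.
    level0[f] = codes[base];
    // Middle level: two codes per frame, in chronological order.
    level1[2 * f] = codes[base + 1];
    level1[2 * f + 1] = codes[base + 4];
    // Finest level: four codes per frame, in chronological order.
    level2[4 * f] = codes[base + 2];
    level2[4 * f + 1] = codes[base + 3];
    level2[4 * f + 2] = codes[base + 5];
    level2[4 * f + 3] = codes[base + 6];
  }
  return { level0, level1, level2 };
}
```

The three arrays correspond to `audio_codes.0`, `audio_codes.1`, and `audio_codes.2` respectively.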
## Reproducing the export

```bash
# 1. fp16 graph + external weights from the upstream Svara repo
optimum-cli export onnx \
  --model kenpath/svara-tts-v1 \
  --task text-generation-with-past \
  --dtype fp16 \
  --device cpu \
  ./svara-onnx

# 2. quantize using onnxruntime.quantization.MatMulNBitsQuantizer
#    (block=128, symmetric, accuracy_level=4 for fp16 accumulation;
#    op_types_to_quantize=("MatMul", "Gather") so the embed table is
#    quantized too -- otherwise it dominates the file size).
```

## MLX variants (Apple Silicon native)

If you're running natively on Apple Silicon (not in a browser), the MLX builds run ~2× faster:

- [`mlx-community/svara-tts-v1`](https://huggingface.co/mlx-community/svara-tts-v1) (bf16, ~6.6 GB)
- [`mlx-community/svara-tts-v1-8bit`](https://huggingface.co/mlx-community/svara-tts-v1-8bit) (~3.5 GB)
- [`mlx-community/svara-tts-v1-4bit`](https://huggingface.co/mlx-community/svara-tts-v1-4bit) (~1.9 GB)

## License

Apache 2.0. See the [base model card](https://huggingface.co/kenpath/svara-tts-v1) for full details, training data, and evaluation.