	---
	base_model: kenpath/svara-tts-v1
	library_name: transformers.js
	models:
	- kenpath/svara-tts-v1
	- canopylabs/3b-hi-ft-research_release
	- onnx-community/snac_24khz-ONNX
	license: apache-2.0
	language:
	- hi
	- bn
	- mr
	- te
	- kn
	- bho
	- mag
	- hne
	- mai
	- as
	- brx
	- doi
	- gu
	- ml
	- pa
	- ta
	- ne
	- sa
	- en
	tags:
	- text-to-speech
	- speech-synthesis
	- multilingual
	- indic
	- orpheus
	- snac
	- onnx
	- transformers.js
	- webgpu
	task_categories:
	- text-to-speech
	pipeline_tag: text-to-speech
	pretty_name: Svara-TTS v1 (ONNX)
	---

	# Svara-TTS v1 — ONNX

	> Parent model: [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1) — full upstream weights, model card, training data, and evaluation. All credit for the model itself goes to the [Kenpath](https://huggingface.co/kenpath) team. This repo only contains an ONNX-format export for browser / cross-platform inference.
	>
	> Orpheus base: [`canopylabs/3b-hi-ft-research_release`](https://huggingface.co/canopylabs/3b-hi-ft-research_release) — Canopy Labs' Orpheus Hindi research release, which Svara was fine-tuned from.
	>
	> Codec: Pair this LM with [`onnx-community/snac_24khz-ONNX`](https://huggingface.co/onnx-community/snac_24khz-ONNX) for SNAC decoding.

	ONNX export of [`kenpath/svara-tts-v1`](https://huggingface.co/kenpath/svara-tts-v1) for [transformers.js v4](https://huggingface.co/blog/transformersjs-v4) WebGPU inference. Two quantization levels are published:

	| dtype | Size | Use case |
	|-------|------|----------|
	| q4f16 | ~1.95 GB (single file) | Default for browsers: int4 MatMul + Gather via `com.microsoft.MatMulNBits`, fp16 activations, `block_size=128`, symmetric. |
	| q8 | ~4.32 GB (3 sharded files, <2 GB each) | Higher fidelity: int8 MatMul + Gather, fp16 activations, same block layout. Use when bandwidth allows; closer in quality to bf16. |

	Both are down from the ~13.2 GB bf16 source checkpoint.
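To make "`block_size=128`, symmetric" concrete, here is a minimal sketch of block-wise symmetric int4 quantization. It illustrates the general technique only. It is not the ONNX Runtime kernel: the function names and the `maxAbs / 7` scale rule are assumptions chosen for readability, and `MatMulNBits` packs and stores its weights differently.

```javascript
// Hypothetical helper: quantize a flat weight array in blocks, one scale per block.
// Symmetric means there is no zero point; values map to integers in [-8, 7].
function quantizeBlockwiseInt4(weights, blockSize = 128) {
  const numBlocks = Math.ceil(weights.length / blockSize);
  const q = new Int8Array(weights.length); // one int4 value per entry (left unpacked here)
  const scales = new Float32Array(numBlocks);
  for (let b = 0; b < numBlocks; b++) {
    const start = b * blockSize;
    const end = Math.min(start + blockSize, weights.length);
    let maxAbs = 0;
    for (let i = start; i < end; i++) maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
    // Assumed scale rule: the largest magnitude in the block maps to +/-7.
    scales[b] = maxAbs > 0 ? maxAbs / 7 : 1;
    for (let i = start; i < end; i++) {
      const rounded = Math.round(weights[i] / scales[b]);
      q[i] = Math.max(-8, Math.min(7, rounded));
    }
  }
  return { q, scales };
}

// Inverse mapping: multiply each integer by the scale of its block.
function dequantizeBlockwiseInt4({ q, scales }, blockSize = 128) {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) out[i] = q[i] * scales[Math.floor(i / blockSize)];
  return out;
}
```

Smaller blocks track local weight ranges more closely but store more scales, which is the trade-off behind choosing a block size.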

	## Usage with transformers.js v4

	```js
	import { AutoTokenizer, AutoModelForCausalLM, Tensor } from "@huggingface/transformers";

	const model_id = "shreyask/svara-tts-v1-ONNX";
	const tokenizer = await AutoTokenizer.from_pretrained(model_id);
	const model = await AutoModelForCausalLM.from_pretrained(model_id, {
	  dtype: "q4f16", // or "q8" for higher fidelity
	  device: "webgpu",
	  use_external_data_format: true,
	});

	// Prompt format (matches mlx-audio's Llama TTS loader):
	// [SOH=128259, BOS=128000, "<voice>: <text>" tokens, EOT=128009, EOH=128260]
	// The model predicts SOAI, SOS itself, then the audio token stream, then EOS.
	const text = "नमस्ते, आप कैसे हैं?";
	const voice = "Hindi (Female)";
	const body = tokenizer.encode(`${voice}: ${text}`, { add_special_tokens: false });
	const ids = [128259, tokenizer.bos_token_id, ...body, 128009, 128260];

	const out = await model.generate({
	  inputs: new Tensor("int64", BigInt64Array.from(ids.map(BigInt)), [1, ids.length]),
	  max_new_tokens: 2048,
	  do_sample: true,
	  temperature: 0.6, // 0.75 with q8 / fp16; lower with q4f16 to compensate for quant noise
	  top_p: 0.9,
	  top_k: 40,
	  // MUST be 1.0: transformers.js applies the penalty across the whole stream and
	  // progressively suppresses naturally recurring audio codes, so the output gets
	  // quieter over time.
	  repetition_penalty: 1.0,
	  eos_token_id: 128258,
	});
	// out.data is the LM token stream. Audio tokens fall in [128266, 156938).
	// SNAC code = token_id - 128266 - (position % 7) * 4096.
	// Group every 7 codes into a frame, then split into the three hierarchical
	// levels expected by SNAC (band 0 -> level 0; bands 1,4 -> level 1;
	// bands 2,3,5,6 -> level 2) and decode with the SNAC ONNX decoder.
	```
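The decoding rule in the closing comments can be sketched as a small helper. This is an illustrative sketch, not code from this repo: `extractSnacFrames` is a hypothetical name, and it assumes that the position of a token is its index among the audio-range tokens of the stream, so the prompt, SOAI/SOS and EOS are simply filtered out.

```javascript
const AUDIO_OFFSET = 128266; // first audio token ID
const CODES_PER_BAND = 4096;
const BANDS = 7;
const AUDIO_END = AUDIO_OFFSET + BANDS * CODES_PER_BAND; // 156938, exclusive

// Hypothetical helper: turn the LM token stream into frames of 7 SNAC codes.
// Accepts a plain array or a BigInt64Array such as out.data.
function extractSnacFrames(tokenIds) {
  // Keep only IDs inside the audio range (assumption: everything else is framing).
  const audio = Array.from(tokenIds, Number).filter(
    (id) => id >= AUDIO_OFFSET && id < AUDIO_END
  );
  const frames = [];
  // A trailing partial frame (fewer than 7 tokens) is dropped.
  for (let start = 0; start + BANDS <= audio.length; start += BANDS) {
    const frame = audio
      .slice(start, start + BANDS)
      .map((id, band) => id - AUDIO_OFFSET - band * CODES_PER_BAND);
    // Skip frames where a token landed in the wrong band.
    if (frame.every((code) => code >= 0 && code < CODES_PER_BAND)) frames.push(frame);
  }
  return frames;
}
```

Calling `extractSnacFrames(out.data)` on the generation result would give one 7-element array per frame, ready to be split into the three SNAC levels.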

	A complete worked example (React + Vite + WebGPU) lives at [`shreyaskarnik/svara-tts-webgpu`](https://github.com/shreyaskarnik/svara-tts-webgpu).

	## Repository Layout

	```
	.
	├── README.md
	├── config.json (transformers config, model_type=llama)
	├── generation_config.json
	├── special_tokens_map.json
	├── tokenizer.json (Llama-3 BPE + Svara special tokens)
	├── tokenizer_config.json
	├── chat_template.jinja
	└── onnx/
	    ├── model_q4f16.onnx (graph, ~1.3 MB)
	    ├── model_q4f16.onnx_data (weights, ~1.95 GB single file)
	    ├── model_quantized.onnx (graph, ~1.3 MB)
	    ├── model_quantized.onnx_data (weights chunk 1, ~1.99 GB)
	    ├── model_quantized.onnx_data_1 (weights chunk 2, ~1.84 GB)
	    └── model_quantized.onnx_data_2 (weights chunk 3, ~0.49 GB)
	```

	q8 is sharded into <2 GB chunks to stay under browser `ArrayBuffer` size limits, following the layout convention of [`onnx-community/gpt-oss-20b-ONNX`](https://huggingface.co/onnx-community/gpt-oss-20b-ONNX).
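The naming pattern in the tree above can be expressed as a small helper, which is handy for prefetching or cache checks. This is a sketch: `onnxFilesFor` and `DTYPE_SUFFIX` are hypothetical names, and the suffix table only covers the two dtypes published in this repo.

```javascript
// Suffixes as they appear in this repo's onnx/ directory.
const DTYPE_SUFFIX = { q4f16: "_q4f16", q8: "_quantized" };

// Hypothetical helper: list the graph file plus its external-data chunks.
// Chunk 0 is "<graph>_data"; later chunks are "<graph>_data_1", "<graph>_data_2", ...
function onnxFilesFor(dtype, numChunks) {
  const suffix = DTYPE_SUFFIX[dtype];
  if (suffix === undefined) throw new Error(`No files published for dtype "${dtype}"`);
  const graph = `onnx/model${suffix}.onnx`;
  const files = [graph];
  for (let i = 0; i < numChunks; i++) {
    files.push(i === 0 ? `${graph}_data` : `${graph}_data_${i}`);
  }
  return files;
}
```

For example, `onnxFilesFor("q8", 3)` lists the graph and the three weight chunks shown above, while `onnxFilesFor("q4f16", 1)` lists the graph and its single data file.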

	## Voices

	Voice prefix format: `"<Language Name> (<Gender>)"`. 38 voices across 19 Indian languages — Hindi, Bengali, Marathi, Telugu, Kannada, Tamil, Malayalam, Gujarati, Punjabi, Assamese, Bhojpuri, Magahi, Maithili, Chhattisgarhi, Bodo, Dogri, Nepali, Sanskrit, English (Indian) — male + female each.
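A minimal sketch of assembling the prompt from a voice and a text, using the prefix format above and the special-token framing from the usage section. The helper names are hypothetical, and the default BOS ID of 128000 is taken from the prompt-format comment; in real code `tokenizer.bos_token_id` would be the safer source.

```javascript
// Special token IDs from the prompt format in the usage section.
const SOH = 128259; // start of human turn
const EOT = 128009; // end of text
const EOH = 128260; // end of human turn

// Hypothetical helper: "<Language Name> (<Gender>)", e.g. "Hindi (Female)".
function voicePrefix(language, gender) {
  return `${language} (${gender})`;
}

// Hypothetical helper: the string that gets tokenized without special tokens.
function buildPromptText(language, gender, text) {
  return `${voicePrefix(language, gender)}: ${text}`;
}

// Hypothetical helper: wrap the tokenized body in the framing tokens.
function framePromptIds(bodyIds, bosTokenId = 128000) {
  return [SOH, bosTokenId, ...bodyIds, EOT, EOH];
}
```

The output of `buildPromptText` would be passed to `tokenizer.encode(..., { add_special_tokens: false })`, and the resulting IDs to `framePromptIds`.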

	## Architecture

	- Backbone: Llama-3.2-3B fine-tuned from Canopy Labs' Orpheus Hindi base. Native KV-cache via `text-generation-with-past` ONNX export. GQA (8 KV heads), llama3 NTK rope scaling (factor=32), tied word embeddings, vocab_size=156940.
	- Audio token layout: 7 token bands, each 4096 codes, totalling 7×4096=28672 audio token IDs at offset 128266. Each generated token's "band" is `position_since_SOS % 7`, decoded SNAC code = `token_id - 128266 - band * 4096`.
	- Codec: SNAC 24 kHz, 3-level hierarchical RVQ. Use [`onnx-community/snac_24khz-ONNX`](https://huggingface.co/onnx-community/snac_24khz-ONNX) `decoder_model_fp16.onnx` (~26 MB). Decoder input layout: `audio_codes.0`=band 0, `audio_codes.1`=bands 1,4, `audio_codes.2`=bands 2,3,5,6 (chronological per coarse frame). Output: 24 kHz mono PCM.
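The band-to-level mapping for the decoder inputs can be sketched as follows. `splitSnacLevels` is a hypothetical helper; it assumes each frame is an array of 7 already-decoded SNAC codes in band order (band 0 through band 6).

```javascript
// Hypothetical helper: regroup 7-code frames into the three SNAC levels.
// For N frames the result holds N, 2N and 4N codes respectively.
function splitSnacLevels(frames) {
  const level0 = []; // audio_codes.0: band 0
  const level1 = []; // audio_codes.1: bands 1, 4
  const level2 = []; // audio_codes.2: bands 2, 3, 5, 6
  for (const [b0, b1, b2, b3, b4, b5, b6] of frames) {
    level0.push(b0);
    level1.push(b1, b4);
    level2.push(b2, b3, b5, b6);
  }
  return { level0, level1, level2 };
}
```

Each level would then be wrapped in a tensor of shape `[1, length]` before being passed to the decoder; the integer dtype expected by the ONNX decoder is not stated here, so check the model's input signature rather than assuming one.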

	## Reproducing the export

	```bash
	# 1. fp16 graph + external weights from the upstream Svara repo
	optimum-cli export onnx \
	  --model kenpath/svara-tts-v1 \
	  --task text-generation-with-past \
	  --dtype fp16 \
	  --device cpu \
	  ./svara-onnx

	# 2. quantize using onnxruntime.quantization.MatMulNBitsQuantizer
	#    (block=128, symmetric, accuracy_level=4 for fp16 accumulation;
	#    op_types_to_quantize=("MatMul", "Gather") so the embed table is
	#    quantized too -- otherwise it dominates the file size).
	```

	## MLX variants (Apple Silicon native)

	If you're running natively on Apple Silicon rather than in a browser, the MLX builds run ~2× faster:

	- [`mlx-community/svara-tts-v1`](https://huggingface.co/mlx-community/svara-tts-v1) (bf16, ~6.6 GB)
	- [`mlx-community/svara-tts-v1-8bit`](https://huggingface.co/mlx-community/svara-tts-v1-8bit) (~3.5 GB)
	- [`mlx-community/svara-tts-v1-4bit`](https://huggingface.co/mlx-community/svara-tts-v1-4bit) (~1.9 GB)

	## License

	Apache 2.0 — see the [base model card](https://huggingface.co/kenpath/svara-tts-v1) for full details, training data, and evaluation.