base_model: kenpath/svara-tts-v1
library_name: transformers.js
models:
  - kenpath/svara-tts-v1
  - canopylabs/3b-hi-ft-research_release
  - onnx-community/snac_24khz-ONNX
license: apache-2.0
language:
  - hi
  - bn
  - mr
  - te
  - kn
  - bho
  - mag
  - hne
  - mai
  - as
  - brx
  - doi
  - gu
  - ml
  - pa
  - ta
  - ne
  - sa
  - en
tags:
  - text-to-speech
  - speech-synthesis
  - multilingual
  - indic
  - orpheus
  - snac
  - onnx
  - transformers.js
  - webgpu
task_categories:
  - text-to-speech
pipeline_tag: text-to-speech
pretty_name: Svara-TTS v1 (ONNX)

Svara-TTS v1 — ONNX

Parent model: kenpath/svara-tts-v1 — full upstream weights, model card, training data, and evaluation. All credit for the model itself goes to the Kenpath team. This repo only contains an ONNX-format export for browser / cross-platform inference.

Orpheus base: canopylabs/3b-hi-ft-research_release — Canopy Labs' Orpheus Hindi research release, which Svara was fine-tuned from.

Codec: Pair this LM with onnx-community/snac_24khz-ONNX for SNAC decoding.

ONNX export of kenpath/svara-tts-v1 for transformers.js v4 WebGPU inference. Two quantization levels are published:

| dtype | Size | Use case |
| --- | --- | --- |
| q4f16 | ~1.95 GB (single file) | Default for browsers: int4 MatMul + Gather via `com.microsoft.MatMulNBits`, fp16 activations, `block_size=128`, symmetric. |
| q8 | ~4.32 GB (3 sharded files, <2 GB each) | Higher fidelity: int8 MatMul + Gather, fp16 activations, same block layout. Use when bandwidth allows; closer in quality to bf16. |

Both are far smaller than the ~13.2 GB upstream source weights.

Usage with transformers.js v4

import { AutoTokenizer, AutoModelForCausalLM, Tensor } from "@huggingface/transformers";

const model_id = "shreyask/svara-tts-v1-ONNX";
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
    dtype: "q4f16",          // or "q8" for higher fidelity
    device: "webgpu",
    use_external_data_format: true,
});

// Prompt format (matches mlx-audio's Llama TTS loader):
//   [SOH=128259, BOS=128000, "<voice>: <text>" tokens, EOT=128009, EOH=128260]
// The model predicts SOAI, SOS itself, then the audio token stream, then EOS.
const text = "नमस्ते, आप कैसे हैं?";
const voice = "Hindi (Female)";
const body = tokenizer.encode(`${voice}: ${text}`, { add_special_tokens: false });
const ids = [128259, tokenizer.bos_token_id, ...body, 128009, 128260];

const out = await model.generate({
    inputs: new Tensor("int64", BigInt64Array.from(ids.map(BigInt)), [1, ids.length]),
    max_new_tokens: 2048,
    do_sample: true,
    temperature: 0.6,        // 0.75 with q8 / fp16; lower with q4f16 to compensate for quant noise
    top_p: 0.9,
    top_k: 40,
    repetition_penalty: 1.0, // MUST be 1.0 -- transformers.js applies penalty across the whole stream
                             // and progressively suppresses naturally-recurring audio codes -> quieter
                             // output over time.
    eos_token_id: 128258,
});
// out.data is the LM token stream. Audio tokens fall in [128266, 156938).
// SNAC code = token_id - 128266 - (position % 7) * 4096.
// Group every 7 codes into a frame, then split into the three hierarchical
// levels expected by SNAC (band 0 -> level 0; bands 1,4 -> level 1;
// bands 2,3,5,6 -> level 2) and decode with the SNAC ONNX decoder.

A complete worked example (React + Vite + WebGPU) lives at shreyaskarnik/svara-tts-webgpu.
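The trailing comments in the usage snippet describe how audio tokens map to SNAC codes. A minimal sketch of that first step follows. `extractSnacCodes` is a hypothetical helper, not part of transformers.js. It assumes the band position is counted from the first audio token in the stream, and that everything outside `[128266, 156938)` (prompt, SOAI, SOS, EOS) can simply be filtered out.

```javascript
// Constants taken from the usage comments above.
const AUDIO_OFFSET = 128266;                             // first audio token ID
const CODEBOOK_SIZE = 4096;                              // codes per band
const BANDS = 7;                                         // tokens per SNAC frame
const AUDIO_END = AUDIO_OFFSET + BANDS * CODEBOOK_SIZE;  // 156938, exclusive

/**
 * Hypothetical helper: convert an LM token stream into a flat list of SNAC codes.
 * Accepts a plain array of numbers or a BigInt64Array such as `out.data`.
 * Assumption: band position is counted from the first audio token.
 */
function extractSnacCodes(tokens) {
  // Keep only audio tokens; text and control tokens fall outside the range.
  const audio = Array.from(tokens, Number).filter(
    (t) => t >= AUDIO_OFFSET && t < AUDIO_END,
  );
  // Drop a trailing partial frame so the result is a whole number of frames.
  const usable = audio.length - (audio.length % BANDS);
  const codes = [];
  for (let i = 0; i < usable; i++) {
    const band = i % BANDS;
    const code = audio[i] - AUDIO_OFFSET - band * CODEBOOK_SIZE;
    if (code < 0 || code >= CODEBOOK_SIZE) {
      // The token belongs to a different band than its position implies.
      throw new RangeError(`token ${audio[i]} at audio position ${i} is not in band ${band}`);
    }
    codes.push(code);
  }
  return codes;
}
```

With the snippet above, `extractSnacCodes(out.data)` would yield the flat code list. Throwing on a misaligned token is a design choice for this sketch; a production decoder might instead skip the affected frame.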

Repository Layout

.
├── README.md
├── config.json                  (transformers config, model_type=llama)
├── generation_config.json
├── special_tokens_map.json
├── tokenizer.json               (Llama-3 BPE + Svara special tokens)
├── tokenizer_config.json
├── chat_template.jinja
└── onnx/
    ├── model_q4f16.onnx             (graph, ~1.3 MB)
    ├── model_q4f16.onnx_data        (weights, ~1.95 GB single file)
    ├── model_quantized.onnx         (graph, ~1.3 MB)
    ├── model_quantized.onnx_data    (weights chunk 1, ~1.99 GB)
    ├── model_quantized.onnx_data_1  (weights chunk 2, ~1.84 GB)
    └── model_quantized.onnx_data_2  (weights chunk 3, ~0.49 GB)

q8 is sharded into <2 GB chunks to fit browser ArrayBuffer ceilings (matches the onnx-community/gpt-oss-20b-ONNX layout convention).
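To make the naming pattern in the layout concrete, here is a small sketch that lists the files each dtype would need. `onnxFilesFor` and `DTYPE_SUFFIX` are hypothetical names for illustration, not a transformers.js API. The suffix mapping and the `_data`, `_data_1`, `_data_2` chunk naming are read off the tree above.

```javascript
// Suffix per dtype, as seen in the repository layout (q8 uses "_quantized").
const DTYPE_SUFFIX = { q4f16: "_q4f16", q8: "_quantized" };

/**
 * Hypothetical helper: list the graph file and its external-data chunks.
 * `chunks` is the number of weight files (1 for q4f16, 3 for q8 in this repo).
 */
function onnxFilesFor(dtype, chunks) {
  const suffix = DTYPE_SUFFIX[dtype];
  if (suffix === undefined) throw new Error(`unsupported dtype: ${dtype}`);
  const graph = `onnx/model${suffix}.onnx`;
  const files = [graph];
  for (let i = 0; i < chunks; i++) {
    // The first chunk has no numeric suffix; later chunks are numbered from 1.
    files.push(i === 0 ? `${graph}_data` : `${graph}_data_${i}`);
  }
  return files;
}
```

For example, `onnxFilesFor("q8", 3)` lists the four `model_quantized.*` files shown in the tree, which can be handy for pre-caching or for showing download progress.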

Voices

Voice prefix format: "<Language Name> (<Gender>)", for example "Hindi (Female)". There are 38 voices across 19 languages, one male and one female each: Hindi, Bengali, Marathi, Telugu, Kannada, Tamil, Malayalam, Gujarati, Punjabi, Assamese, Bhojpuri, Magahi, Maithili, Chhattisgarhi, Bodo, Dogri, Nepali, Sanskrit, English (Indian).
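A minimal sketch of building voice prefixes and prompt bodies from that format follows. The helper names are hypothetical. Only "Hindi (Female)" is confirmed by the usage example, so the "Male" spelling and the plain "English" label for the Indian English voices are assumptions to check against the upstream model card.

```javascript
// Language names from the list above. "English" is an assumed label.
const LANGUAGES = [
  "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Tamil", "Malayalam",
  "Gujarati", "Punjabi", "Assamese", "Bhojpuri", "Magahi", "Maithili",
  "Chhattisgarhi", "Bodo", "Dogri", "Nepali", "Sanskrit", "English",
];
// "Female" appears in the usage example; "Male" is assumed by symmetry.
const GENDERS = ["Male", "Female"];

// Hypothetical helper: format one voice prefix.
function voicePrefix(language, gender) {
  return `${language} (${gender})`;
}

// Hypothetical helper: text that is tokenized between BOS and EOT.
function promptBody(voice, text) {
  return `${voice}: ${text}`;
}

// All language/gender combinations.
const VOICES = LANGUAGES.flatMap((l) => GENDERS.map((g) => voicePrefix(l, g)));
```

`VOICES` can back a voice picker in a UI, and `promptBody(voice, text)` produces the string passed to `tokenizer.encode` in the usage example.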

Architecture

Backbone: Llama-3.2-3B fine-tuned from Canopy Labs' Orpheus Hindi base. Native KV-cache via text-generation-with-past ONNX export. GQA (8 KV heads), llama3 NTK rope scaling (factor=32), tied word embeddings, vocab_size=156940.
Audio token layout: 7 token bands of 4096 codes each, for 7 × 4096 = 28672 audio token IDs starting at offset 128266 (IDs 128266 through 156937). A generated token's band is position_since_SOS % 7, and its decoded SNAC code is token_id - 128266 - band * 4096.
Codec: SNAC 24 kHz, 3-level hierarchical RVQ. Use onnx-community/snac_24khz-ONNX decoder_model_fp16.onnx (~26 MB). Decoder input layout: audio_codes.0=band 0, audio_codes.1=bands 1,4, audio_codes.2=bands 2,3,5,6 (chronological per coarse frame). Output: 24 kHz mono PCM.
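The decoder input layout described above can be sketched as a regrouping of 7-code frames into three level arrays. `splitSnacLevels` is a hypothetical helper. It assumes its input is a flat list of SNAC codes with the offset and band shift already removed, whose length is a multiple of 7.

```javascript
/**
 * Hypothetical helper: regroup flat SNAC codes (7 per coarse frame) into the
 * three hierarchical levels:
 *   band 0          -> level 0 (1 code per frame)
 *   bands 1, 4      -> level 1 (2 codes per frame)
 *   bands 2,3,5,6   -> level 2 (4 codes per frame)
 */
function splitSnacLevels(codes) {
  if (codes.length % 7 !== 0) {
    throw new RangeError(`expected a multiple of 7 codes, got ${codes.length}`);
  }
  const level0 = [];
  const level1 = [];
  const level2 = [];
  for (let f = 0; f < codes.length; f += 7) {
    level0.push(codes[f]);
    level1.push(codes[f + 1], codes[f + 4]);
    level2.push(codes[f + 2], codes[f + 3], codes[f + 5], codes[f + 6]);
  }
  return { level0, level1, level2 };
}
```

Each array would then be wrapped in an int64 tensor of shape `[1, length]` and fed to the corresponding `audio_codes.*` input; the exact tensor construction depends on the runtime used for the SNAC decoder.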

Reproducing the export

# 1. fp16 graph + external weights from the upstream Svara repo
optimum-cli export onnx \
    --model kenpath/svara-tts-v1 \
    --task text-generation-with-past \
    --dtype fp16 \
    --device cpu \
    ./svara-onnx

# 2. quantize using onnxruntime.quantization.MatMulNBitsQuantizer
#    (block=128, symmetric, accuracy_level=4 for fp16 accumulation;
#    op_types_to_quantize=("MatMul", "Gather") so the embed table is
#    quantized too -- otherwise it dominates the file size).

MLX variants (Apple Silicon native)

If you're running natively on Apple Silicon (not in a browser), the MLX builds run ~2× faster than this ONNX export:

mlx-community/svara-tts-v1 (bf16, ~6.6 GB)
mlx-community/svara-tts-v1-8bit (~3.5 GB)
mlx-community/svara-tts-v1-4bit (~1.9 GB)

License

Apache 2.0 — see the base model card for full details, training data, and evaluation.