---
license: openrail
language:
- en
- ja
- zh
- ko
- es
- fr
- de
- multilingual
library_name: ai-edge-litert
tags:
- litert
- tflite
- tensorflow-lite
- text-to-speech
- tts
- audio
- diffusion
- flow-matching
- on-device
- mobile
- android
- int4
- int8
- weight-only-quantization
- quantized
pipeline_tag: text-to-speech
base_model: Supertone/supertonic-3
base_model_relation: quantized
---
# Supertonic-3 — LiteRT (.tflite, INT4) + ONNX vector_estimator
LiteRT / TensorFlow Lite conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3),
a 99M-parameter multilingual TTS model. Three of the four components
(duration predictor, text encoder, vocoder) convert cleanly to `.tflite`
with weight-only quantization via Google's
[ai-edge-quantizer](https://github.com/google-ai-edge/ai-edge-quantizer)
and run on the [`ai_edge_litert`](https://github.com/google-ai-edge/litert)
runtime. The shipped configurations use true INT4 for the duration
predictor and text encoder and INT8 for the vocoder. `vector_estimator`
(the diffusion denoiser) is kept as ONNX: its rotary multi-head attention
defeats onnx2tf's NCW↔NHWC shape inference, and `litert_torch.convert`
deadlocks in MLIR lowering when fed the model with loaded weights. The
ONNX VE ships in both fp32 (`vector_estimator.onnx`, 256 MB) and
**INT8 dynamic quantization** (`vector_estimator_int8.onnx`, 65 MB).
INT8 is the recommended config.
## Configurations
| Config | Components | Size | Notes |
| --- | --- | ---: | --- |
| **int4 + INT8 VE (recommended)** | `int4/{dp,te}.tflite` + `vector_estimator_int8.onnx` + `int8/vocoder.tflite` | **106 MB** | smallest viable; **65% smaller than fp32 VE config** |
| int4 + fp32 VE | `int4/{dp,te}.tflite` + `vector_estimator.onnx` + `int8/vocoder.tflite` | 310 MB | larger, but audibly identical to the INT8 VE |
| fp32 | `fp32/{dp,te,vocoder}.tflite` + `vector_estimator.onnx` | 398 MB | float reference |

Per-component file sizes:

| Component file | Size |
| --- | ---: |
| `fp32/duration_predictor.tflite` | 4 MB |
| `fp32/text_encoder.tflite` | 37 MB |
| `fp32/vocoder.tflite` | 101 MB |
| `int4/duration_predictor.tflite` | 2.5 MB |
| `int4/text_encoder.tflite` | 13 MB |
| `int8/vocoder.tflite` (recommended) | 26 MB |
| **`vector_estimator_int8.onnx` (recommended)** | **65 MB** |
| `vector_estimator.onnx` (full fp32) | 256 MB |
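
As a rough sketch of how a configuration maps onto the two runtimes, the snippet below resolves the component files listed in the tables above and opens each one with the matching runtime. `CONFIGS`, `resolve_config`, and `load_components` are hypothetical helpers for illustration and are not part of `inference.py`; the config keys are made-up labels, and tensor names and shapes are left out because they are not documented here.

```python
from pathlib import Path

# Hypothetical config labels -> component files (file names from the tables above).
CONFIGS = {
    "int4_int8ve": {
        "duration_predictor": "int4/duration_predictor.tflite",
        "text_encoder": "int4/text_encoder.tflite",
        "vector_estimator": "vector_estimator_int8.onnx",
        "vocoder": "int8/vocoder.tflite",
    },
    "int4_fp32ve": {
        "duration_predictor": "int4/duration_predictor.tflite",
        "text_encoder": "int4/text_encoder.tflite",
        "vector_estimator": "vector_estimator.onnx",
        "vocoder": "int8/vocoder.tflite",
    },
    "fp32": {
        "duration_predictor": "fp32/duration_predictor.tflite",
        "text_encoder": "fp32/text_encoder.tflite",
        "vector_estimator": "vector_estimator.onnx",
        "vocoder": "fp32/vocoder.tflite",
    },
}


def resolve_config(name, repo_dir="."):
    """Return {component: Path} for one of the configurations above."""
    root = Path(repo_dir)
    return {part: root / rel for part, rel in CONFIGS[name].items()}


def load_components(name, repo_dir="."):
    """Open .tflite files with ai_edge_litert and .onnx files with onnxruntime."""
    # Imported lazily so resolve_config works without the runtimes installed.
    from ai_edge_litert.interpreter import Interpreter
    import onnxruntime as ort

    loaded = {}
    for part, path in resolve_config(name, repo_dir).items():
        if path.suffix == ".onnx":
            loaded[part] = ort.InferenceSession(
                str(path), providers=["CPUExecutionProvider"]
            )
        else:
            interpreter = Interpreter(model_path=str(path))
            interpreter.allocate_tensors()
            loaded[part] = interpreter
    return loaded
```

Calling `load_components("int4_int8ve")` from the repository root would open the recommended set of files; wiring the components together is left to `inference.py`.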
## Quickstart
```bash
pip install ai-edge-litert onnxruntime soundfile numpy supertonic
git clone https://huggingface.co/Reza2kn/supertonic-3-litert
cd supertonic-3-litert
# Recommended INT4 + INT8 VE config (default)
python inference.py --text "Hello, world." --voice F1 --out hello.wav
# Long prompt — use --auto-pad for full content rendering
python inference.py \
--text "<longer prompt>" \
--voice F5 --auto-pad --out long.wav
# Explicit FP32 baseline (uses fp32 vector_estimator.onnx)
python inference.py --text "Hello" --dp-quant fp32 --te-quant fp32 --voc-quant fp32 --ve-fp32
```
10 voice styles ship in `voice_styles/`: F1–F5 (female), M1–M5 (male).
31 languages supported via `unicode_indexer.json`.
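
To render the same text with every bundled voice, the CLI can be driven from Python. The sketch below only uses flags that appear in the Quickstart (`--text`, `--voice`, `--out`, `--auto-pad`); `build_cmd` and `render_all_voices` are hypothetical helper names, and the script is assumed to be run from the repository root.

```python
import subprocess
import sys

# The ten bundled voice styles: F1..F5 and M1..M5.
VOICES = [f"{gender}{index}" for gender in "FM" for index in range(1, 6)]


def build_cmd(text, voice, out, auto_pad=False):
    """Build the argument list for one inference.py call."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    cmd = [sys.executable, "inference.py",
           "--text", text, "--voice", voice, "--out", out]
    if auto_pad:
        cmd.append("--auto-pad")
    return cmd


def render_all_voices(text, out_dir=".", auto_pad=False):
    """Run inference.py once per voice, writing <out_dir>/<voice>.wav."""
    for voice in VOICES:
        out_path = f"{out_dir}/{voice}.wav"
        subprocess.run(build_cmd(text, voice, out_path, auto_pad), check=True)
```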
## ⚠️ Known limitation: rushed pacing on long prompts (vs CoreML build)
The supertonic-3 model has a soft content cap per utterance (~13.7 s of
speech for the included long_en_F5 prompt). The LiteRT pipeline runs
`vector_estimator` at native input shapes via ONNX Runtime, which
stays within that cap and **truncates** long prompts.
The [CoreML build of this same model](https://huggingface.co/Reza2kn/supertonic-3-coreml)
benefits from an accidental "bucket-leak" in the CoreML conversion
(padded latent positions leak through ConvNeXt's dilated convolutions),
which extends content by ~3 s and gives more natural pacing. **This
extension does not exist in LiteRT** — we tested padding the ONNX VE
inputs to the same bucket: 13.00s → 13.05s (essentially no extension).
In practice:
- Short prompts (under ~10 s of speech): fine.
- Long prompts (over ~13 s): LiteRT will sound rushed and may truncate
the last words. Use the CoreML build for those if you're on Apple.
`--auto-pad` is still useful — it appends a filler sentence that the
model partially renders, then trims at the silence gap. It recovers
some content but cannot match CoreML's bucket-leak extension.
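
The trimming step behind `--auto-pad` is described only in outline above. As a minimal sketch, assuming the audio is a mono sequence of float samples and that the boundary between the real text and the filler sentence shows up as a quiet stretch, the hypothetical helper below cuts at the last silence gap that is followed by more sound. The function name, threshold, and gap length are illustrative assumptions, not what `inference.py` actually does.

```python
def trim_at_last_gap(samples, sample_rate, threshold=0.01, min_gap_s=0.15):
    """Drop everything from the last qualifying silence gap onward.

    A gap qualifies if it is at least min_gap_s long, does not start at
    sample 0, and is followed by non-silent audio (the assumed filler).
    """
    min_len = max(1, int(round(min_gap_s * sample_rate)))
    gap_start = None  # index where the current quiet run began
    cut = None        # start of the last qualifying gap seen so far

    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            if gap_start is None:
                gap_start = i
        else:
            # A quiet run just ended; keep it if it was long enough.
            if gap_start is not None and gap_start > 0 and i - gap_start >= min_len:
                cut = gap_start
            gap_start = None

    # Trailing silence is never followed by sound, so it does not trigger a cut.
    return samples[:cut] if cut is not None else samples
```

With a speech/silence/filler layout this returns only the speech part; audio without an interior gap is returned unchanged.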
## Conversion pipeline
```
Supertone/supertonic-3 (ONNX)
-> onnxsim.simplify (T=L=320)
-> fuse_gelu (Div/Erf/Add/Mul/Mul -> ONNX Gelu opset 20)
-> onnx2tf -kt -coion (TF SavedModel)
-> tf.lite.TFLiteConverter (fp32 .tflite)
-> ai-edge-quantizer weight_only_wi4_afp32() (true INT4)
-> ai_edge_litert.Interpreter at runtime
vector_estimator:
-> onnxruntime.quantization.quantize_dynamic(QInt8, per_channel=True)
(4× compression, kept ONNX because onnx2tf/litert_torch both
fail on the rotary multi-head attention)
```
The **GELU fuse** is what makes INT4 on LiteRT possible. Without it,
`onnx2tf` emits `FlexErf` ops, which disqualify the model from
`ai_edge_litert` (the runtime that supports INT4). Replacing the
Erf-based GELU expansion with a single ONNX `Gelu` op (opset 20) keeps
the model in builtin TFLite ops and unblocks INT4 inference.
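
For reference, the five-op chain that the fuse replaces (Div/Erf/Add/Mul/Mul) is just the exact GELU, `x * Φ(x)`, written out step by step. The sketch below checks that equivalence numerically with the standard library; it illustrates the math only and is not the graph-rewriting code used for the conversion.

```python
import math
from statistics import NormalDist


def gelu_erf_chain(x):
    """GELU as the unfused ONNX pattern: Div -> Erf -> Add -> Mul -> Mul."""
    t = x / math.sqrt(2.0)  # Div
    t = math.erf(t)         # Erf (the op that becomes FlexErf in TFLite)
    t = t + 1.0             # Add
    t = x * t               # Mul
    return 0.5 * t          # Mul


def gelu_reference(x):
    """Exact GELU: x times the standard normal CDF."""
    return x * NormalDist().cdf(x)


for value in (-3.0, -1.0, 0.0, 0.5, 2.0):
    assert math.isclose(gelu_erf_chain(value), gelu_reference(value), abs_tol=1e-12)
```

Because the two forms agree, swapping the chain for a single `Gelu` node changes the op set, not the function being computed.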
`vector_estimator` is kept as ONNX because onnx2tf's transpose
optimization breaks rotary attention masking, and `litert_torch.convert`
deadlocks on its loaded weights. INT8 dynamic quantization via
`onnxruntime.quantization.quantize_dynamic` works cleanly on its Conv and
MatMul ops and gives roughly 4× compression (256 MB to 65 MB) with output
that is audibly identical to fp32.
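
The INT8 step can be sketched as follows, using the same arguments named in the pipeline above (`QInt8`, `per_channel=True`). `quantize_ve` and `size_ratio` are hypothetical helper names, the default paths assume the fp32 file sits in the current directory, and any additional options the actual conversion may have used are not known here.

```python
import os


def size_ratio(src, dst):
    """How many times smaller dst is than src, by file size."""
    return os.path.getsize(src) / os.path.getsize(dst)


def quantize_ve(src="vector_estimator.onnx", dst="vector_estimator_int8.onnx"):
    """Dynamically quantize the ONNX vector estimator weights to INT8."""
    # Imported lazily so size_ratio is usable without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        src,
        dst,
        per_channel=True,             # one scale per output channel
        weight_type=QuantType.QInt8,  # signed 8-bit weights
    )
    return size_ratio(src, dst)
```

Dynamic quantization stores weights as INT8 and computes activation scales at run time, so no calibration data is needed.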
## License
OpenRAIL — same as the original Supertone/supertonic-3.
## Credits
- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
- LiteRT conversion + auto-pad workflow: this repo
- Companion CoreML build: [Reza2kn/supertonic-3-coreml](https://huggingface.co/Reza2kn/supertonic-3-coreml)
- Quantization: [`ai-edge-quantizer`](https://github.com/google-ai-edge/ai-edge-quantizer), `onnxruntime.quantization`
- Runtime: [`ai_edge_litert`](https://github.com/google-ai-edge/litert), `onnxruntime`