| license: openrail | |
| language: | |
| - en | |
| - ja | |
| - zh | |
| - ko | |
| - es | |
| - fr | |
| - de | |
| - multilingual | |
| library_name: ai-edge-litert | |
| tags: | |
| - litert | |
| - tflite | |
| - tensorflow-lite | |
| - text-to-speech | |
| - tts | |
| - audio | |
| - diffusion | |
| - flow-matching | |
| - on-device | |
| - mobile | |
| - android | |
| - int4 | |
| - int8 | |
| - weight-only-quantization | |
| - quantized | |
| pipeline_tag: text-to-speech | |
| base_model: Supertone/supertonic-3 | |
| base_model_relation: quantized | |
| # Supertonic-3 — LiteRT (.tflite, INT4) + ONNX vector_estimator | |
| LiteRT / TensorFlow Lite conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3), | |
| a 99M-parameter multilingual TTS model. Three of the four components | |
| convert cleanly to LiteRT and run on the | |
| [`ai_edge_litert`](https://github.com/google-ai-edge/litert) runtime: the | |
| duration predictor and text encoder use true INT4 weight-only | |
| quantization via Google's | |
| [ai-edge-quantizer](https://github.com/google-ai-edge/ai-edge-quantizer), | |
| and the vocoder ships as INT8 in the recommended config. | |
| `vector_estimator` (VE, the diffusion denoiser) is kept as ONNX: its | |
| rotary multi-head attention defeats onnx2tf's NCW↔NHWC shape inference, | |
| and `litert_torch.convert` deadlocks in MLIR lowering when fed the model | |
| with loaded weights. The ONNX VE ships in both fp32 | |
| (`vector_estimator.onnx`, 256 MB) and **INT8 dynamic quantization** | |
| (`vector_estimator_int8.onnx`, 65 MB); INT8 is the recommended config. | |
| ## Configurations | |
| | Config | Components | Size | Notes | | |
| | --- | --- | ---: | --- | | |
| | **int4 + INT8 VE (recommended)** | `int4/{dp,te}.tflite` + `vector_estimator_int8.onnx` + `int8/vocoder.tflite` | **106 MB** | smallest viable; **65% smaller than the int4 + fp32 VE config** | |
| | int4 + fp32 VE | `int4/{dp,te}.tflite` + `vector_estimator.onnx` + `int8/vocoder.tflite` | 310 MB | larger; output is audibly identical to the INT8 VE config | |
| | fp32 | `fp32/{dp,te,vocoder}.tflite` + `vector_estimator.onnx` | 398 MB | float reference | | |
| | Component file | Size | | |
| | --- | ---: | | |
| | `fp32/duration_predictor.tflite` | 4 MB | | |
| | `fp32/text_encoder.tflite` | 37 MB | | |
| | `fp32/vocoder.tflite` | 101 MB | | |
| | `int4/duration_predictor.tflite` | 2.5 MB | | |
| | `int4/text_encoder.tflite` | 13 MB | | |
| | `int8/vocoder.tflite` (recommended) | 26 MB | | |
| | **`vector_estimator_int8.onnx` (recommended)** | **65 MB** | | |
| | `vector_estimator.onnx` (full fp32) | 256 MB | | |
| ## Quickstart | |
| ```bash | |
| pip install ai-edge-litert onnxruntime soundfile numpy supertonic | |
| git clone https://huggingface.co/Reza2kn/supertonic-3-litert | |
| cd supertonic-3-litert | |
| # Recommended INT4 + INT8 VE config (default) | |
| python inference.py --text "Hello, world." --voice F1 --out hello.wav | |
| # Long prompt — use --auto-pad for full content rendering | |
| python inference.py \ | |
| --text "<longer prompt>" \ | |
| --voice F5 --auto-pad --out long.wav | |
| # Explicit FP32 baseline (uses fp32 vector_estimator.onnx) | |
| python inference.py --text "Hello" --dp-quant fp32 --te-quant fp32 --voc-quant fp32 --ve-fp32 | |
| ``` | |
| 10 voice styles ship in `voice_styles/`: F1–F5 (female), M1–M5 (male). | |
| 31 languages supported via `unicode_indexer.json`. | |
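
Before wiring the components together, it can help to check the tensor signatures of each `.tflite` file. The sketch below assumes the `ai_edge_litert.interpreter.Interpreter` class (which mirrors the `tf.lite.Interpreter` interface) and the file layout from the component table above. `summarize_io` and `inspect_model` are hypothetical helpers, not part of `inference.py`.

```python
from typing import Any, Dict, List


def summarize_io(details: List[Dict[str, Any]]) -> List[str]:
    """Format interpreter tensor details as 'name: [shape] dtype' strings."""
    lines = []
    for d in details:
        shape = "x".join(str(int(n)) for n in d["shape"])
        # dtype is normally a NumPy type class; fall back to str() otherwise.
        dtype = getattr(d["dtype"], "__name__", str(d["dtype"]))
        lines.append(f"{d['name']}: [{shape}] {dtype}")
    return lines


def inspect_model(path: str) -> Dict[str, List[str]]:
    """Load a .tflite file and report its input and output tensors."""
    # Imported lazily so summarize_io works without the runtime installed.
    from ai_edge_litert.interpreter import Interpreter

    interpreter = Interpreter(model_path=path)
    interpreter.allocate_tensors()
    return {
        "inputs": summarize_io(interpreter.get_input_details()),
        "outputs": summarize_io(interpreter.get_output_details()),
    }
```

From the repository root, `inspect_model("int4/text_encoder.tflite")` should list the tensors that the text encoder expects and produces; the actual tensor names and shapes depend on the converted model and are not reproduced here.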
| ## ⚠️ Known limitation: rushed pacing on long prompts (vs CoreML build) | |
| The supertonic-3 model has a soft content cap per utterance (~13.7 s of | |
| speech for the included long_en_F5 prompt). The LiteRT pipeline runs | |
| `vector_estimator` at native input shapes via ONNX Runtime, which | |
| respects that cap and **truncates** long prompts. | |
| The [CoreML build of this same model](https://huggingface.co/Reza2kn/supertonic-3-coreml) | |
| benefits from an accidental "bucket-leak" in the CoreML conversion | |
| (padded latent positions leak through ConvNeXt's dilated convolutions), | |
| which extends content by ~3 s and gives more natural pacing. **This | |
| extension does not exist in LiteRT** — we tested padding the ONNX VE | |
| inputs to the same bucket: 13.00s → 13.05s (essentially no extension). | |
| In practice: | |
| - Short prompts (under ~10 s of speech): fine. | |
| - Long prompts (over ~13 s): LiteRT will sound rushed and may truncate | |
| the last words. Use the CoreML build for those if you're on Apple hardware. | |
| `--auto-pad` is still useful — it appends a filler sentence that the | |
| model partially renders, then trims at the silence gap. It recovers | |
| some content but cannot match CoreML's bucket-leak extension. | |
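
The "trim at the silence gap" step can be illustrated with a small NumPy routine. This is a minimal sketch of the idea, not the code in `inference.py`: the function name, the frame length, the relative RMS threshold, and the minimum gap duration are all assumptions chosen for illustration.

```python
import numpy as np


def trim_at_last_silence(wave, sample_rate, frame_ms=20,
                         rel_threshold=0.02, min_gap_ms=150):
    """Cut `wave` at the start of its last interior silent gap.

    A gap is a run of frames whose RMS is below `rel_threshold` times the
    loudest frame and that lasts at least `min_gap_ms`. Silence that runs
    to the very end of the clip is ignored, since no filler follows it.
    If no gap is found, the input is returned unchanged.
    """
    wave = np.asarray(wave, dtype=np.float32)
    frame = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(wave) // frame
    if n_frames == 0:
        return wave

    frames = wave[: n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    peak = float(rms.max())
    if peak == 0.0:
        return wave

    silent = rms < rel_threshold * peak
    min_frames = max(1, int(np.ceil(min_gap_ms / frame_ms)))

    # Scan backwards for the last sufficiently long run of silent frames.
    i = n_frames - 1
    while i >= 0:
        if not silent[i]:
            i -= 1
            continue
        j = i
        while j >= 0 and silent[j]:
            j -= 1
        # The run covers frames j+1 .. i inclusive.
        if i - j >= min_frames and i < n_frames - 1:
            return wave[: (j + 1) * frame]
        i = j
    return wave
```

With a clip made of real content, a pause, and a partially rendered filler sentence, the function keeps only the audio before the pause. The thresholds would need tuning against real model output.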
| ## Conversion pipeline | |
| ``` | |
| Supertone/supertonic-3 (ONNX) | |
| -> onnxsim.simplify (T=L=320) | |
| -> fuse_gelu (Div/Erf/Add/Mul/Mul -> ONNX Gelu opset 20) | |
| -> onnx2tf -kt -coion (TF SavedModel) | |
| -> tf.lite.TFLiteConverter (fp32 .tflite) | |
| -> ai-edge-quantizer weight_only_wi4_afp32() (true INT4) | |
| -> ai_edge_litert.Interpreter at runtime | |
| vector_estimator: | |
| -> onnxruntime.quantization.quantize_dynamic(QInt8, per_channel=True) | |
| (4× compression, kept ONNX because onnx2tf/litert_torch both | |
| fail on the rotary multi-head attention) | |
| ``` | |
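
The INT4 step of the pipeline can be sketched as follows. It assumes the `Quantizer` / `load_quantization_recipe` / `export_model` flow from the ai-edge-quantizer README together with the `weight_only_wi4_afp32()` recipe named above. The `fp32/` to `int4/` path mapping and both function names are hypothetical and simply mirror this repository's layout.

```python
from pathlib import Path


def int4_output_path(fp32_path: str) -> str:
    """Map 'fp32/<name>.tflite' to 'int4/<name>.tflite' (assumed layout)."""
    p = Path(fp32_path)
    return str(p.parent.parent / "int4" / p.name)


def quantize_to_int4(fp32_path: str) -> str:
    """Apply the INT4 weight-only recipe to one fp32 .tflite file."""
    # Lazy import: ai-edge-quantizer is only needed at conversion time.
    from ai_edge_quantizer import quantizer, recipe

    out_path = int4_output_path(fp32_path)
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)

    qt = quantizer.Quantizer(fp32_path)
    # INT4 weights, fp32 activations, weight-only (no dynamic activation quant).
    qt.load_quantization_recipe(recipe.weight_only_wi4_afp32())
    qt.quantize().export_model(out_path)
    return out_path
```

For example, `quantize_to_int4("fp32/text_encoder.tflite")` would write `int4/text_encoder.tflite`. Whether the shipped files were produced with exactly these calls is not stated in this README.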
| The **GELU fuse** is the key unlock for INT4 LiteRT. Without it, | |
| `onnx2tf` emits FlexErf ops which disqualify the model from | |
| `ai_edge_litert` (the runtime that supports INT4). Replacing the | |
| Erf-based GELU expansion with a single ONNX `Gelu` op (opset 20) keeps | |
| the model in pure-TFLite ops and unblocks INT4 inference. | |
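
The fuse is safe because the five-op chain and the single `Gelu` op compute the same function, GELU(x) = x · Φ(x), where Φ is the standard normal CDF. The sketch below checks that numerically with the standard library only. The operand order inside the chain is an assumption, since the exact graph layout of the exported model is not shown here.

```python
import math
from statistics import NormalDist


def gelu_expanded(x: float) -> float:
    """GELU written as the op chain Div -> Erf -> Add -> Mul -> Mul."""
    a = x / math.sqrt(2.0)  # Div
    b = math.erf(a)         # Erf (the op that becomes FlexErf after onnx2tf)
    c = b + 1.0             # Add
    d = x * c               # Mul
    return d * 0.5          # Mul


def gelu_fused(x: float) -> float:
    """Exact (non-approximate) GELU, x * Phi(x), as a single fused op computes it."""
    return x * NormalDist().cdf(x)


def max_abs_difference(xs) -> float:
    """Largest absolute gap between the two formulations over `xs`."""
    return max(abs(gelu_expanded(x) - gelu_fused(x)) for x in xs)
```

Evaluating `max_abs_difference` on a handful of inputs gives differences at floating-point rounding level, which is why replacing the pattern does not change model output beyond numerical noise.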
| `vector_estimator` is kept as ONNX because onnx2tf's transpose | |
| optimization breaks rotary attention masking, and `litert_torch.convert` | |
| deadlocks on its loaded weights. INT8 dynamic quantization via | |
| `onnxruntime.quantization.quantize_dynamic` works cleanly on Conv + | |
| MatMul ops and gives roughly 4× compression (256 MB → 65 MB) with output audibly identical to fp32. | |
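
A minimal sketch of the VE quantization call, assuming the standard `quantize_dynamic` signature from `onnxruntime.quantization` and the file names listed in the component table. The wrapper functions are hypothetical, and any extra options used for the shipped file are not documented here.

```python
def quantize_vector_estimator(src: str = "vector_estimator.onnx",
                              dst: str = "vector_estimator_int8.onnx") -> str:
    """Write an INT8 dynamically quantized copy of the ONNX vector estimator."""
    # Lazy import: onnxruntime is only needed when the conversion runs.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        src,
        dst,
        weight_type=QuantType.QInt8,  # signed 8-bit weights
        per_channel=True,             # one scale per output channel
    )
    return dst


def compression_ratio(fp32_mb: float, int8_mb: float) -> float:
    """Size ratio between the fp32 and INT8 model files."""
    return fp32_mb / int8_mb
```

With the sizes from the component table, `compression_ratio(256, 65)` comes out just under 4, consistent with the compression figure quoted above.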
| ## License | |
| OpenRAIL — same as the original Supertone/supertonic-3. | |
| ## Credits | |
| - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) | |
| - LiteRT conversion + auto-pad workflow: this repo | |
| - Companion CoreML build: [Reza2kn/supertonic-3-coreml](https://huggingface.co/Reza2kn/supertonic-3-coreml) | |
| - Quantization: [`ai-edge-quantizer`](https://github.com/google-ai-edge/ai-edge-quantizer), `onnxruntime.quantization` | |
| - Runtime: [`ai_edge_litert`](https://github.com/google-ai-edge/litert), `onnxruntime` | |