Supertonic-3 — LiteRT (.tflite, INT4) + ONNX vector_estimator
LiteRT / TensorFlow Lite conversion of Supertone/supertonic-3,
a 99M-parameter multilingual TTS model. Three of the four components
convert cleanly to true INT4 weight-only quantization via Google's
ai-edge-quantizer and run on the ai_edge_litert runtime.
vector_estimator (the diffusion denoiser) is kept as ONNX: its rotary
multi-head attention defeats onnx2tf's NCW↔NHWC shape inference, and
litert_torch.convert deadlocks in MLIR lowering when fed the model
with loaded weights. The ONNX VE ships in both fp32
(vector_estimator.onnx) and INT8 dynamic quantization
(vector_estimator_int8.onnx, 65 MB); INT8 is the recommended config.
Configurations
| Config | Components | Size | Notes |
|---|---|---|---|
| int4 + INT8 VE (recommended) | int4/{dp,te}.tflite + vector_estimator_int8.onnx + int8/vocoder.tflite | 106 MB | smallest viable; 65% smaller than fp32 VE config |
| int4 + fp32 VE | int4/{dp,te}.tflite + vector_estimator.onnx + int8/vocoder.tflite | 310 MB | larger but auditory-identical to INT8 VE |
| fp32 | fp32/{dp,te,vocoder}.tflite + vector_estimator.onnx | 398 MB | float reference |
| Component file | Size |
|---|---|
| fp32/duration_predictor.tflite | 4 MB |
| fp32/text_encoder.tflite | 37 MB |
| fp32/vocoder.tflite | 101 MB |
| int4/duration_predictor.tflite | 2.5 MB |
| int4/text_encoder.tflite | 13 MB |
| int8/vocoder.tflite (recommended) | 26 MB |
| vector_estimator_int8.onnx (recommended) | 65 MB |
| vector_estimator.onnx (full fp32) | 256 MB |
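The headline ratios can be re-derived from the sizes listed in the tables. The snippet below is only a worked check of that arithmetic; the constant and function names are made up for illustration and do not come from this repo.

```python
# Sizes in MB, copied from the tables in this README.
VE_FP32_MB = 256              # vector_estimator.onnx
VE_INT8_MB = 65               # vector_estimator_int8.onnx
CONFIG_INT4_INT8_VE_MB = 106  # recommended config
CONFIG_INT4_FP32_VE_MB = 310  # same config with the fp32 VE


def compression_ratio(original_mb, compressed_mb):
    """How many times smaller the compressed file is."""
    return original_mb / compressed_mb


def size_reduction_pct(baseline_mb, smaller_mb):
    """Percentage saved relative to the baseline."""
    return 100.0 * (1.0 - smaller_mb / baseline_mb)


ve_ratio = compression_ratio(VE_FP32_MB, VE_INT8_MB)
config_saving = size_reduction_pct(CONFIG_INT4_FP32_VE_MB, CONFIG_INT4_INT8_VE_MB)

print(f"vector_estimator INT8 compression: {ve_ratio:.2f}x")   # 3.94x
print(f"recommended vs fp32-VE config: {config_saving:.1f}% smaller")  # 65.8% smaller
```

Both results line up with the rounded "4× compression" and "65% smaller" figures quoted in this card.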
Quickstart
pip install ai-edge-litert onnxruntime soundfile numpy supertonic
git clone https://huggingface.co/Reza2kn/supertonic-3-litert
cd supertonic-3-litert
# Recommended INT4 + INT8 VE config (default)
python inference.py --text "Hello, world." --voice F1 --out hello.wav
# Long prompt — use --auto-pad for full content rendering
python inference.py \
--text "<longer prompt>" \
--voice F5 --auto-pad --out long.wav
# Explicit FP32 baseline (uses fp32 vector_estimator.onnx)
python inference.py --text "Hello" --dp-quant fp32 --te-quant fp32 --voc-quant fp32 --ve-fp32
10 voice styles ship in voice_styles/: F1–F5 (female), M1–M5 (male).
31 languages supported via unicode_indexer.json.
⚠️ Known limitation: rushed pacing on long prompts (vs CoreML build)
The supertonic-3 model has a soft content cap per utterance (~13.7 s of
speech for the included long_en_F5 prompt). The LiteRT pipeline runs
vector_estimator at native input shapes via ONNX Runtime, which
respects the model's hard limit and truncates long prompts.
The CoreML build of this same model benefits from an accidental "bucket-leak" in the CoreML conversion (padded latent positions leak through ConvNeXt's dilated convolutions), which extends content by ~3 s and gives more natural pacing. This extension does not exist in LiteRT — we tested padding the ONNX VE inputs to the same bucket: 13.00s → 13.05s (essentially no extension).
In practice:
- Short prompts (under ~10 s of speech): fine.
- Long prompts (over ~13 s): LiteRT will sound rushed and may truncate the last words. If you are on Apple hardware, use the CoreML build for those.
--auto-pad is still useful — it appends a filler sentence that the
model partially renders, then trims at the silence gap. It recovers
some content but cannot match CoreML's bucket-leak extension.
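The trimming step described above can be sketched roughly as follows. This is not the code in inference.py: the function name, the amplitude threshold, the minimum gap length, and the choice to cut at the last interior gap are all assumptions made for illustration. Plain Python lists are used so the sketch has no dependencies; a real pipeline would more likely operate on a numpy array.

```python
def trim_at_last_silence_gap(samples, sample_rate, min_gap_s=0.15, threshold=0.01):
    """Cut the audio at the start of the last silent gap that is followed by sound.

    Hypothetical helper: `threshold` (absolute amplitude) and `min_gap_s`
    are illustrative defaults, not values taken from this repo.
    """
    min_len = max(1, int(min_gap_s * sample_rate))
    run_start = None  # index where the current silent run began
    cut = None        # start of the last qualifying interior gap
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            if run_start is None:
                run_start = i
        else:
            # A silent run just ended; keep it if it is long enough and
            # is not leading silence at the very start of the clip.
            if run_start is not None and run_start > 0 and i - run_start >= min_len:
                cut = run_start
            run_start = None
    return list(samples) if cut is None else list(samples[:cut])


# Toy signal: real speech, a pause, then a partially rendered filler sentence.
speech = [0.5] * 6
filler = [0.4] * 3
padded = speech + [0.0] * 4 + filler
trimmed = trim_at_last_silence_gap(padded, sample_rate=8, min_gap_s=0.5)
```

With the toy signal, the filler and the pause before it are dropped and only the `speech` samples remain. Trailing silence is deliberately ignored, since a gap only counts once sound follows it.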
Conversion pipeline
Supertone/supertonic-3 (ONNX)
-> onnxsim.simplify (T=L=320)
-> fuse_gelu (Div/Erf/Add/Mul/Mul -> ONNX Gelu opset 20)
-> onnx2tf -kt -coion (TF SavedModel)
-> tf.lite.TFLiteConverter (fp32 .tflite)
-> ai-edge-quantizer weight_only_wi4_afp32() (true INT4)
-> ai_edge_litert.Interpreter at runtime
vector_estimator:
-> onnxruntime.quantization.quantize_dynamic(QInt8, per_channel=True)
(4× compression, kept ONNX because onnx2tf/litert_torch both
fail on the rotary multi-head attention)
The GELU fuse is the key unlock for INT4 LiteRT. Without it,
onnx2tf emits FlexErf ops which disqualify the model from
ai_edge_litert (the runtime that supports INT4). Replacing the
Erf-based GELU expansion with a single ONNX Gelu op (opset 20) keeps
the model in pure-TFLite ops and unblocks INT4 inference.
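To see why the fuse is safe, it helps to check that the five-op chain computes exactly the erf-based GELU that a single ONNX Gelu op (opset 20, default `approximate="none"`) defines. The sketch below only verifies the math on scalars; it is not the fuse_gelu graph rewrite itself, and the operand order assumed for the Div/Erf/Add/Mul/Mul chain may differ from the exported graph.

```python
import math


def gelu_decomposed(x):
    """Erf-based GELU written as the chain Div -> Erf -> Add -> Mul -> Mul."""
    t = x / math.sqrt(2.0)  # Div
    t = math.erf(t)         # Erf
    t = t + 1.0             # Add
    t = x * t               # Mul
    return t * 0.5          # Mul


def gelu_fused(x):
    """Exact GELU as one expression: x * Phi(x), Phi = standard normal CDF."""
    phi = 0.5 * math.erfc(-x / math.sqrt(2.0))
    return x * phi


# The two forms agree to floating-point precision on a few sample inputs.
max_diff = max(
    abs(gelu_decomposed(x) - gelu_fused(x)) for x in (-3.0, -1.0, 0.0, 0.5, 2.0)
)
print(f"max abs difference: {max_diff:.2e}")
```

Because the two forms are mathematically identical, collapsing the chain changes only which ops the converter sees, not the model's output.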
vector_estimator is kept as ONNX because onnx2tf's transpose
optimization breaks rotary attention masking, and litert_torch.convert
deadlocks on its loaded weights. INT8 dynamic quantization via
onnxruntime.quantization.quantize_dynamic works cleanly on Conv +
MatMul ops and gives 4× compression with audio-identical output to fp32.
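A minimal sketch of that quantization step, assuming the fp32 file sits in the current directory. `quantize_dynamic` and `QuantType` are real onnxruntime APIs; the two helper functions and the `_int8` naming rule are illustrative and may not match the script actually used for this repo.

```python
from pathlib import Path


def int8_output_path(fp32_path):
    """Hypothetical naming rule: foo.onnx -> foo_int8.onnx."""
    p = Path(fp32_path)
    return p.with_name(f"{p.stem}_int8{p.suffix}")


def quantize_vector_estimator(fp32_path="vector_estimator.onnx"):
    """Write an INT8 dynamically quantized copy next to the fp32 model."""
    # Imported lazily so the module loads even without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    out_path = int8_output_path(fp32_path)
    quantize_dynamic(
        model_input=str(fp32_path),
        model_output=str(out_path),
        weight_type=QuantType.QInt8,  # signed 8-bit weights
        per_channel=True,             # one scale per output channel
    )
    return out_path
```

Calling `quantize_vector_estimator()` in a clone of this repo should produce a file comparable to the shipped vector_estimator_int8.onnx, though exact bytes can vary with the onnxruntime version.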
License
OpenRAIL — same as the original Supertone/supertonic-3.
Credits
- Original model: Supertone/supertonic-3
- LiteRT conversion + auto-pad workflow: this repo
- Companion CoreML build: Reza2kn/supertonic-3-coreml
- Quantization: ai-edge-quantizer, onnxruntime.quantization
- Runtime: ai_edge_litert, onnxruntime