	---
	license: openrail
	language:
	- en
	- ja
	- zh
	- ko
	- es
	- fr
	- de
	- multilingual
	library_name: ai-edge-litert
	tags:
	- litert
	- tflite
	- tensorflow-lite
	- text-to-speech
	- tts
	- audio
	- diffusion
	- flow-matching
	- on-device
	- mobile
	- android
	- int4
	- int8
	- weight-only-quantization
	- quantized
	pipeline_tag: text-to-speech
	base_model: Supertone/supertonic-3
	base_model_relation: quantized
	---

	# Supertonic-3 — LiteRT (.tflite, INT4) + ONNX vector_estimator

	LiteRT / TensorFlow Lite conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3),
	a 99M-parameter multilingual TTS model. Three of the four components
	convert cleanly to LiteRT and run on the
	[`ai_edge_litert`](https://github.com/google-ai-edge/litert) runtime.
	The duration predictor and text encoder ship with true INT4 weight-only
	quantization via Google's
	[ai-edge-quantizer](https://github.com/google-ai-edge/ai-edge-quantizer);
	the recommended vocoder build is INT8. `vector_estimator` (the diffusion
	denoiser) is kept as ONNX: its rotary multi-head attention defeats
	onnx2tf's NCW↔NHWC shape inference, and `litert_torch.convert` deadlocks
	in MLIR lowering when fed the model with loaded weights. The ONNX VE
	ships in both fp32 (`vector_estimator.onnx`) and INT8 dynamic
	quantization (`vector_estimator_int8.onnx`, 65 MB); INT8 is the
	recommended config.

	## Configurations

	| Config | Components | Size | Notes |
	| --- | --- | ---: | --- |
	| int4 + INT8 VE (recommended) | `int4/{dp,te}.tflite` + `vector_estimator_int8.onnx` + `int8/vocoder.tflite` | 106 MB | smallest viable; 65% smaller than fp32 VE config |
	| int4 + fp32 VE | `int4/{dp,te}.tflite` + `vector_estimator.onnx` + `int8/vocoder.tflite` | 310 MB | larger but audibly identical to INT8 VE |
	| fp32 | `fp32/{dp,te,vocoder}.tflite` + `vector_estimator.onnx` | 398 MB | float reference |

	| Component file | Size |
	| --- | ---: |
	| `fp32/duration_predictor.tflite` | 4 MB |
	| `fp32/text_encoder.tflite` | 37 MB |
	| `fp32/vocoder.tflite` | 101 MB |
	| `int4/duration_predictor.tflite` | 2.5 MB |
	| `int4/text_encoder.tflite` | 13 MB |
	| `int8/vocoder.tflite` (recommended) | 26 MB |
	| `vector_estimator_int8.onnx` (recommended) | 65 MB |
	| `vector_estimator.onnx` (full fp32) | 256 MB |
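
As a quick sanity check on the totals, the sketch below sums the per-file sizes for the recommended and fp32 configs. The dictionary and helper names are illustrative only, and the numbers are the rounded sizes from the component table, not values measured from disk.

```python
# Rounded per-file sizes in MB, copied from the component table.
SIZES_MB = {
    "fp32/duration_predictor.tflite": 4,
    "fp32/text_encoder.tflite": 37,
    "fp32/vocoder.tflite": 101,
    "int4/duration_predictor.tflite": 2.5,
    "int4/text_encoder.tflite": 13,
    "int8/vocoder.tflite": 26,
    "vector_estimator_int8.onnx": 65,
    "vector_estimator.onnx": 256,
}

# Hypothetical config names; the file lists mirror the Configurations table.
CONFIGS = {
    "int4 + INT8 VE": [
        "int4/duration_predictor.tflite",
        "int4/text_encoder.tflite",
        "vector_estimator_int8.onnx",
        "int8/vocoder.tflite",
    ],
    "fp32": [
        "fp32/duration_predictor.tflite",
        "fp32/text_encoder.tflite",
        "fp32/vocoder.tflite",
        "vector_estimator.onnx",
    ],
}


def config_size_mb(name):
    """Total download size of one configuration, in MB."""
    return sum(SIZES_MB[path] for path in CONFIGS[name])


if __name__ == "__main__":
    for name in CONFIGS:
        print(f"{name}: {config_size_mb(name):.1f} MB")
```

Swapping `SIZES_MB` for `os.path.getsize` lookups on a local clone would give exact byte counts instead of the rounded figures.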

	## Quickstart

	```bash
	pip install ai-edge-litert onnxruntime soundfile numpy supertonic
	git clone https://huggingface.co/Reza2kn/supertonic-3-litert
	cd supertonic-3-litert

	# Recommended INT4 + INT8 VE config (default)
	python inference.py --text "Hello, world." --voice F1 --out hello.wav

	# Long prompt: --auto-pad recovers some of the content that would otherwise be cut
	python inference.py \
	--text "<longer prompt>" \
	--voice F5 --auto-pad --out long.wav

	# Explicit FP32 baseline (uses fp32 vector_estimator.onnx)
	python inference.py --text "Hello" --dp-quant fp32 --te-quant fp32 --voc-quant fp32 --ve-fp32
	```

	10 voice styles ship in `voice_styles/`: F1–F5 (female), M1–M5 (male).
	31 languages supported via `unicode_indexer.json`.
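
To render the same text with every shipped voice, the CLI can be driven from Python. This is a minimal sketch that only uses the flags shown in the Quickstart (`--text`, `--voice`, `--out`, `--auto-pad`); the helper names and the output file naming are assumptions, not part of this repo.

```python
import subprocess
import sys

# The ten voice styles listed above: F1..F5 and M1..M5.
VOICES = [f"{gender}{i}" for gender in "FM" for i in range(1, 6)]


def build_command(text, voice, out_path, auto_pad=False):
    """Build the argv list for one inference.py run (hypothetical helper)."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    cmd = [sys.executable, "inference.py",
           "--text", text, "--voice", voice, "--out", out_path]
    if auto_pad:
        cmd.append("--auto-pad")
    return cmd


def render_all_voices(text, out_dir="."):
    """Run inference.py once per voice; must be called from the repo root."""
    for voice in VOICES:
        subprocess.run(build_command(text, voice, f"{out_dir}/{voice}.wav"),
                       check=True)
```

Calling `render_all_voices("Hello, world.")` from the cloned repo directory would write one WAV file per voice, assuming `inference.py` behaves as in the Quickstart.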

	## ⚠️ Known limitation: rushed pacing on long prompts (vs CoreML build)

	The supertonic-3 model has a soft content cap per utterance (~13.7 s of
	speech for the included `long_en_F5` prompt). The LiteRT pipeline runs
	`vector_estimator` at native input shapes via ONNX Runtime, so it hits
	that cap directly: long prompts come out rushed or truncated.

	The [CoreML build of this same model](https://huggingface.co/Reza2kn/supertonic-3-coreml)
	benefits from an accidental "bucket-leak" in the CoreML conversion
	(padded latent positions leak through ConvNeXt's dilated convolutions),
	which extends content by ~3 s and gives more natural pacing. **This
	extension does not exist in LiteRT.** We tested padding the ONNX VE
	inputs to the same bucket: 13.00 s → 13.05 s (essentially no extension).

	In practice:
	- Short prompts (under ~10 s of speech): fine.
	- Long prompts (over ~13 s): LiteRT will sound rushed and may truncate
	the last words. Use the CoreML build for those if you're on Apple hardware.

	`--auto-pad` is still useful — it appends a filler sentence that the
	model partially renders, then trims at the silence gap. It recovers
	some content but cannot match CoreML's bucket-leak extension.
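
The "trim at the silence gap" step can be sketched as follows. This is not the code in `inference.py`; it is a plain-Python illustration under assumed parameters (10 ms frames, an RMS threshold of 0.01, a minimum gap of 150 ms) that cuts the waveform at the start of the last silent gap followed by more speech, on the assumption that the filler sentence is whatever comes after that gap.

```python
import math


def frame_rms(samples, frame_len):
    """RMS level of consecutive, non-overlapping frames."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def trim_at_last_gap(samples, sample_rate, min_gap_s=0.15,
                     frame_s=0.01, threshold=0.01):
    """Drop everything from the last interior silence gap onward.

    Returns `samples` unchanged when no gap of at least `min_gap_s`
    seconds is found between two stretches of speech.
    """
    frame_len = max(1, round(sample_rate * frame_s))
    min_frames = max(1, round(min_gap_s / frame_s))
    silent = [level < threshold for level in frame_rms(samples, frame_len)]

    cut_frame = None
    run_start = None
    for i, is_silent in enumerate(silent):
        if is_silent and run_start is None:
            run_start = i  # a silent run begins
        elif not is_silent and run_start is not None:
            # A silent run just ended and speech follows it.
            # Leading silence (run_start == 0) is not a sentence gap.
            if run_start > 0 and i - run_start >= min_frames:
                cut_frame = run_start
            run_start = None
    # Trailing silence never reaches the elif branch, so it is ignored.

    if cut_frame is None:
        return samples
    return samples[:cut_frame * frame_len]
```

If the filler itself contains a long pause, this heuristic would cut inside the filler instead of before it, so the gap length and threshold need tuning per voice. A production version would likely operate on NumPy arrays.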

	## Conversion pipeline

	```
	Supertone/supertonic-3 (ONNX)
	-> onnxsim.simplify (T=L=320)
	-> fuse_gelu (Div/Erf/Add/Mul/Mul -> ONNX Gelu opset 20)
	-> onnx2tf -kt -coion (TF SavedModel)
	-> tf.lite.TFLiteConverter (fp32 .tflite)
	-> ai-edge-quantizer weight_only_wi4_afp32() (true INT4)
	-> ai_edge_litert.Interpreter at runtime

	vector_estimator:
	-> onnxruntime.quantization.quantize_dynamic(QInt8, per_channel=True)
	(4× compression, kept ONNX because onnx2tf/litert_torch both
	fail on the rotary multi-head attention)
	```
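
The last two TFLite steps of the pipeline can be sketched in Python. Treat this as an approximation under assumptions: it assumes the `Quantizer` class and `recipe.weight_only_wi4_afp32()` from `ai_edge_quantizer` and the `Interpreter` class from `ai_edge_litert.interpreter` behave as in their public documentation, and that inputs are passed in the order reported by `get_input_details()`, which may differ from the ONNX input order. `component_path` is a hypothetical helper matching this repo's folder layout.

```python
def component_path(quant, name):
    """Relative path of one component, e.g. 'int4/text_encoder.tflite'."""
    return f"{quant}/{name}.tflite"


def quantize_to_int4(fp32_tflite_path, int4_tflite_path):
    """Apply INT4 weight-only quantization to an fp32 .tflite file."""
    # Imported lazily so the module loads without ai-edge-quantizer installed.
    from ai_edge_quantizer import quantizer, recipe

    qt = quantizer.Quantizer(fp32_tflite_path)
    qt.load_quantization_recipe(recipe.weight_only_wi4_afp32())
    qt.quantize().export_model(int4_tflite_path)


def run_tflite(model_path, inputs):
    """Run one forward pass; `inputs` is a list of NumPy arrays."""
    from ai_edge_litert.interpreter import Interpreter

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    # Assumption: `inputs` is already ordered like get_input_details().
    for detail, array in zip(interpreter.get_input_details(), inputs):
        interpreter.set_tensor(detail["index"], array)
    interpreter.invoke()
    return [interpreter.get_tensor(detail["index"])
            for detail in interpreter.get_output_details()]
```

Checking each entry's `name`, `shape`, and `dtype` in `get_input_details()` before calling `set_tensor` is a cheap way to catch ordering mistakes.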

	The GELU fuse is what makes INT4 LiteRT possible. Without it,
	`onnx2tf` emits FlexErf ops, which disqualify the model from
	`ai_edge_litert` (the runtime that supports INT4). Replacing the
	Erf-based GELU expansion with a single ONNX `Gelu` op (opset 20) keeps
	the model in pure-TFLite ops and unblocks INT4 inference.
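
The fusion is safe because the five-op chain is just the exact (erf-based) GELU written out step by step, which is what ONNX `Gelu` computes with its default `approximate="none"`. The sketch below checks that numerically; the op order shown (Div, Erf, Add, Mul, Mul) follows the pipeline listing, but which `Mul` applies the 0.5 factor in the exported graph is an assumption.

```python
import math


def gelu_reference(x):
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_decomposed(x):
    """The same function as the unfused five-op chain."""
    t = x / math.sqrt(2.0)  # Div
    t = math.erf(t)         # Erf (the op that becomes FlexErf after onnx2tf)
    t = t + 1.0             # Add
    t = x * t               # Mul
    return t * 0.5          # Mul


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
        diff = abs(gelu_reference(x) - gelu_decomposed(x))
        print(f"x={x:+.1f}  max abs diff={diff:.2e}")
```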

	`vector_estimator` is kept as ONNX because onnx2tf's transpose
	optimization breaks rotary attention masking, and `litert_torch.convert`
	deadlocks when given the model with loaded weights. INT8 dynamic
	quantization via `onnxruntime.quantization.quantize_dynamic` works
	cleanly on Conv + MatMul ops and gives about 4× compression (256 MB to
	65 MB) with output that is audibly identical to fp32.
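
A minimal sketch of the INT8 step, assuming the fp32 file sits in the current directory under the name used in this repo. The wrapper and `compression_ratio` are illustrative helpers; only `quantize_dynamic`, `QuantType.QInt8`, and `per_channel=True` come from the pipeline listing, and any other options used for the shipped file are not shown here.

```python
def quantize_vector_estimator(src="vector_estimator.onnx",
                              dst="vector_estimator_int8.onnx"):
    """Write an INT8 dynamically quantized copy of the fp32 ONNX model."""
    # Imported lazily so the module loads without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=src,
        model_output=dst,
        per_channel=True,             # one scale per output channel
        weight_type=QuantType.QInt8,  # signed 8-bit weights
    )


def compression_ratio(fp32_mb, int8_mb):
    """How many times smaller the quantized file is."""
    return fp32_mb / int8_mb


if __name__ == "__main__":
    # Sizes from the component table: 256 MB fp32, 65 MB INT8.
    print(f"{compression_ratio(256, 65):.2f}x")
```

The resulting file loads with a regular `onnxruntime.InferenceSession`, so the rest of the pipeline does not need to know which variant it was given.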

	## License

	OpenRAIL — same as the original Supertone/supertonic-3.

	## Credits

	- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
	- LiteRT conversion + auto-pad workflow: this repo
	- Companion CoreML build: [Reza2kn/supertonic-3-coreml](https://huggingface.co/Reza2kn/supertonic-3-coreml)
	- Quantization: [`ai-edge-quantizer`](https://github.com/google-ai-edge/ai-edge-quantizer), `onnxruntime.quantization`
	- Runtime: [`ai_edge_litert`](https://github.com/google-ai-edge/litert), `onnxruntime`