---
license: openrail
language:
- en
- ja
- zh
- ko
- es
- fr
- de
- multilingual
library_name: ai-edge-litert
tags:
- litert
- tflite
- tensorflow-lite
- text-to-speech
- tts
- audio
- diffusion
- flow-matching
- on-device
- mobile
- android
- int4
- int8
- weight-only-quantization
- quantized
pipeline_tag: text-to-speech
base_model: Supertone/supertonic-3
base_model_relation: quantized
---

# Supertonic-3 — LiteRT (.tflite, INT4) + ONNX vector_estimator

LiteRT / TensorFlow Lite conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3),
a 99M-parameter multilingual TTS model. Three of the four components convert
cleanly to true INT4 weight-only quantization via Google's
[ai-edge-quantizer](https://github.com/google-ai-edge/ai-edge-quantizer)
and run on the [`ai_edge_litert`](https://github.com/google-ai-edge/litert)
runtime. The fourth, `vector_estimator` (the diffusion denoiser), is kept as ONNX:
its rotary multi-head attention defeats onnx2tf's NCW↔NHWC shape
inference, and `litert_torch.convert` deadlocks in MLIR lowering when
fed the model with loaded weights. The ONNX VE ships in both fp32
(`vector_estimator.onnx`, 256 MB) and **INT8 dynamic quantization**
(`vector_estimator_int8.onnx`, 65 MB); INT8 is the recommended configuration.

## Configurations

| Config | Components | Size | Notes |
| --- | --- | ---: | --- |
| **int4 + INT8 VE (recommended)** | `int4/{dp,te}.tflite` + `vector_estimator_int8.onnx` + `int8/vocoder.tflite` | **106 MB** | smallest viable; **65% smaller than fp32 VE config** |
| int4 + fp32 VE | `int4/{dp,te}.tflite` + `vector_estimator.onnx` + `int8/vocoder.tflite` | 310 MB | larger, but audibly identical to the INT8 VE config |
| fp32 | `fp32/{dp,te,vocoder}.tflite` + `vector_estimator.onnx` | 398 MB | float reference |

| Component file | Size |
| --- | ---: |
| `fp32/duration_predictor.tflite` | 4 MB |
| `fp32/text_encoder.tflite` | 37 MB |
| `fp32/vocoder.tflite` | 101 MB |
| `int4/duration_predictor.tflite` | 2.5 MB |
| `int4/text_encoder.tflite` | 13 MB |
| `int8/vocoder.tflite` (recommended) | 26 MB |
| **`vector_estimator_int8.onnx` (recommended)** | **65 MB** |
| `vector_estimator.onnx` (full fp32) | 256 MB |
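
To check the totals above against a local clone, the per-file sizes can be summed directly. The snippet below is a minimal sketch, not part of this repo: `config_size_mb` is a hypothetical helper, it assumes `int4/{dp,te}.tflite` expands to the `duration_predictor` and `text_encoder` files listed in the component table, it assumes 1 MB = 10^6 bytes, and it assumes the large files were actually downloaded rather than left as Git LFS pointers.

```python
from pathlib import Path

# File lists taken from the tables above (the {dp,te} expansion is an assumption).
CONFIGS = {
    "int4 + INT8 VE": [
        "int4/duration_predictor.tflite",
        "int4/text_encoder.tflite",
        "vector_estimator_int8.onnx",
        "int8/vocoder.tflite",
    ],
    "fp32": [
        "fp32/duration_predictor.tflite",
        "fp32/text_encoder.tflite",
        "fp32/vocoder.tflite",
        "vector_estimator.onnx",
    ],
}


def config_size_mb(root, files):
    """Sum the on-disk size of `files` under `root`, in MB (10**6 bytes)."""
    total_bytes = sum((Path(root) / name).stat().st_size for name in files)
    return total_bytes / 1e6


if __name__ == "__main__":
    for name, files in CONFIGS.items():
        try:
            print(f"{name}: {config_size_mb('.', files):.1f} MB")
        except FileNotFoundError as err:
            # Reported instead of raised so one missing file does not hide the rest.
            print(f"{name}: missing {err.filename}")
```

Run it from the root of the cloned repository.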

## Quickstart

```bash
pip install ai-edge-litert onnxruntime soundfile numpy supertonic
git clone https://huggingface.co/Reza2kn/supertonic-3-litert
cd supertonic-3-litert

# Recommended INT4 + INT8 VE config (default)
python inference.py --text "Hello, world." --voice F1 --out hello.wav

# Long prompt — use --auto-pad for full content rendering
python inference.py \
  --text "<longer prompt>" \
  --voice F5 --auto-pad --out long.wav

# Explicit FP32 baseline (uses fp32 vector_estimator.onnx)
python inference.py --text "Hello" --dp-quant fp32 --te-quant fp32 --voc-quant fp32 --ve-fp32
```

Ten voice styles ship in `voice_styles/`: F1–F5 (female) and M1–M5 (male).
31 languages are supported via `unicode_indexer.json`.

## ⚠️ Known limitation: rushed pacing on long prompts (vs CoreML build)

The supertonic-3 model has a content cap per utterance (~13.7 s of
speech for the included long_en_F5 prompt). The LiteRT pipeline runs
`vector_estimator` at native input shapes via ONNX Runtime, which
respects that cap, so long prompts are **truncated**.

The [CoreML build of this same model](https://huggingface.co/Reza2kn/supertonic-3-coreml)
benefits from an accidental "bucket-leak" in the CoreML conversion
(padded latent positions leak through ConvNeXt's dilated convolutions),
which extends content by ~3 s and gives more natural pacing. **This
extension does not exist in LiteRT.** When we padded the ONNX VE
inputs to the same bucket, the result went from 13.00 s to 13.05 s
(essentially no extension).

In practice:
- Short prompts (under ~10 s of speech): fine.
- Long prompts (over ~13 s): LiteRT will sound rushed and may truncate
  the last words. Use the CoreML build for those if you're on Apple.

`--auto-pad` is still useful — it appends a filler sentence that the
model partially renders, then trims at the silence gap. It recovers
some content but cannot match CoreML's bucket-leak extension.
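
The "trim at the silence gap" step can be illustrated with a small energy-based detector. This is only a sketch of the general idea, not the code in `inference.py`: the function name, the 20 ms frame, the 0.01 RMS threshold, and the 150 ms minimum gap are all assumptions chosen for illustration. It takes any sequence of float samples in [-1, 1], finds the last sufficiently long silent run that is followed by more audio (the partially rendered filler), and cuts there.

```python
import math


def trim_at_last_silence(samples, sample_rate, frame_ms=20,
                         threshold=0.01, min_gap_ms=150):
    """Cut `samples` at the start of the last interior silent gap.

    Returns `samples` unchanged when no gap of at least `min_gap_ms` exists.
    """
    frame = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame
    min_frames = max(1, math.ceil(min_gap_ms / frame_ms))

    # Mark each frame as silent when its RMS energy is below the threshold.
    silent = []
    for i in range(n_frames):
        chunk = samples[i * frame:(i + 1) * frame]
        rms = math.sqrt(sum(float(s) * float(s) for s in chunk) / frame)
        silent.append(rms < threshold)

    # Skip trailing silence so the end of the clip is not mistaken for the gap.
    i = n_frames - 1
    while i >= 0 and silent[i]:
        i -= 1

    # Walk backwards through the filler audio looking for a long silent run.
    while i >= 0:
        if not silent[i]:
            i -= 1
            continue
        end = i
        while i >= 0 and silent[i]:
            i -= 1
        start = i + 1
        # Ignore leading silence (start == 0) and gaps that are too short.
        if start > 0 and end - start + 1 >= min_frames:
            return samples[:start * frame]
    return samples
```

Because it only indexes and slices its input, the same function works on a Python list or a 1-D NumPy array as returned by `soundfile.read`.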

## Conversion pipeline

```
Supertone/supertonic-3 (ONNX)
  -> onnxsim.simplify (T=L=320)
  -> fuse_gelu (Div/Erf/Add/Mul/Mul -> ONNX Gelu opset 20)
  -> onnx2tf -kt -coion (TF SavedModel)
  -> tf.lite.TFLiteConverter (fp32 .tflite)
  -> ai-edge-quantizer weight_only_wi4_afp32() (true INT4)
  -> ai_edge_litert.Interpreter at runtime

vector_estimator:
  -> onnxruntime.quantization.quantize_dynamic(QInt8, per_channel=True)
     (4× compression, kept ONNX because onnx2tf/litert_torch both
      fail on the rotary multi-head attention)
```

The **GELU fuse** is the key unlock for INT4 LiteRT. Without it,
`onnx2tf` emits FlexErf ops which disqualify the model from
`ai_edge_litert` (the runtime that supports INT4). Replacing the
Erf-based GELU expansion with a single ONNX `Gelu` op (opset 20) keeps
the model in pure-TFLite ops and unblocks INT4 inference.
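
The fusion is safe because the five-op chain and the single `Gelu` op compute the same function, GELU(x) = 0.5 · x · (1 + erf(x / √2)). The sketch below checks this numerically in plain Python. It is not the repo's `fuse_gelu` pass; the operand order of the Div/Erf/Add/Mul/Mul chain is one plausible arrangement, and the exact order in the exported graph is an assumption.

```python
import math


def gelu_expanded(x):
    """GELU written as the chain Div -> Erf -> Add -> Mul -> Mul."""
    t = x / math.sqrt(2.0)  # Div
    t = math.erf(t)         # Erf
    t = t + 1.0             # Add
    t = x * t               # Mul
    return 0.5 * t          # Mul


def gelu_fused(x):
    """Exact GELU, x * Phi(x), as ONNX Gelu computes with approximate="none"."""
    phi = 0.5 * math.erfc(-x / math.sqrt(2.0))  # standard normal CDF
    return x * phi


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
        print(x, gelu_expanded(x), gelu_fused(x))
```

Both functions agree to floating-point precision, so replacing the subgraph changes which ops the converter sees, not the model's output.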

`vector_estimator` is kept as ONNX because onnx2tf's transpose
optimization breaks rotary attention masking, and `litert_torch.convert`
deadlocks on its loaded weights. INT8 dynamic quantization via
`onnxruntime.quantization.quantize_dynamic` works cleanly on Conv +
MatMul ops and gives roughly 4× compression (256 MB to 65 MB) with
output audibly identical to fp32.
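
The roughly 4× figure follows from storing each weight in one byte instead of four, plus a small number of scale values. The sketch below shows one common per-channel scheme (symmetric, one scale per output channel) in plain Python. It illustrates what `per_channel=True` means in principle and is not ONNX Runtime's implementation, whose exact scheme may differ; the function names are hypothetical.

```python
def quantize_per_channel(weights):
    """Symmetric int8 quantization with one scale per row (output channel)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # An all-zero row gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate float weights from int8 values and row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


if __name__ == "__main__":
    # Two channels with very different ranges: a single shared scale would
    # round the second row to zero, per-channel scales keep it resolvable.
    w = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.005]]
    q, s = quantize_per_channel(w)
    print(q, s)
    print(dequantize(q, s))
```

Each row's reconstruction error is bounded by half of that row's scale, which is why channels with small weights are not swamped by channels with large ones.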

## License

OpenRAIL — same as the original Supertone/supertonic-3.

## Credits

- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
- LiteRT conversion + auto-pad workflow: this repo
- Companion CoreML build: [Reza2kn/supertonic-3-coreml](https://huggingface.co/Reza2kn/supertonic-3-coreml)
- Quantization: [`ai-edge-quantizer`](https://github.com/google-ai-edge/ai-edge-quantizer), `onnxruntime.quantization`
- Runtime: [`ai_edge_litert`](https://github.com/google-ai-edge/litert), `onnxruntime`