---
license: openrail
language:
- en
- ja
- zh
- ko
- es
- fr
- de
- multilingual
library_name: coremltools
tags:
- coreml
- ane
- apple-neural-engine
- text-to-speech
- tts
- audio
- diffusion
- flow-matching
- on-device
- ios
- macos
- fp16
pipeline_tag: text-to-speech
base_model: Supertone/supertonic-3
base_model_relation: quantized
---

# Supertonic-3: CoreML (fp16, ANE-ready)

CoreML conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3),
a 99M-parameter multilingual TTS model. All 4 components run on the
Apple Neural Engine (1.8–3.7× faster than CPU on M-series chips).

| Component | Size | Role |
| --- | ---: | --- |
| `fp16/duration_predictor.mlpackage` | 15 MB | text -> frame count |
| `fp16/text_encoder.mlpackage` | 71 MB | text -> conditioning latent |
| `fp16/vector_estimator.mlpackage` | 135 MB | flow-matching denoiser (8 steps) |
| `fp16/vocoder.mlpackage` | 51 MB | latent -> 44.1 kHz waveform |
| **Total** | **272 MB** | (originals: ~400 MB ONNX) |

## Quickstart

```bash
pip install coremltools soundfile numpy supertonic
git clone https://huggingface.co/Reza2kn/supertonic-3-coreml
cd supertonic-3-coreml

# Short prompt
python inference.py --text "Hello, world." --voice F1 --lang en --out hello.wav

# Long prompt: use --auto-pad for full content rendering
python inference.py \
  --text "A gentle breeze moved through the open window while everyone listened to the story. The narrator paused, took a slow breath, and continued in a softer tone. Outside, the city carried on, unaware of the quiet moment unfolding inside." \
  --voice F5 --lang en --auto-pad --out long.wav
```

10 voice styles ship in `voice_styles/`: F1–F5 (female), M1–M5 (male).
31 languages supported via `unicode_indexer.json`.

## The auto-pad trick (why `--auto-pad` matters)

The supertonic-3 model has a soft cap on how much speech it renders per
utterance. For long inputs (more than ~13 s of natural speech) the model
truncates the prompt and emits a low-amplitude filler tone for the rest
of the budget. The CoreML conversion's static bucket (T=L=320) extends
this cap by ~3 s, because the bucket's padded positions leak into the
real positions through ConvNeXt's dilated convolutions. That leak is
**why CoreML inference sounds more natural than the original ONNX
library** (proper word separation, intonation), but long prompts still
cut off mid-sentence.
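To make these numbers concrete, the sketch below converts between seconds of audio and latent frames. It assumes the bucket's 320 latent frames map linearly onto the 22.3 s quoted in the conversion notes; the constant and function names are illustrative and are not taken from `inference.py`.

```python
import math

# Assumption: 320 latent frames map linearly onto the 22.3 s bucket
# quoted in the conversion notes. All names here are illustrative.
BUCKET_FRAMES = 320
BUCKET_SECONDS = 22.3
SECONDS_PER_FRAME = BUCKET_SECONDS / BUCKET_FRAMES  # roughly 0.07 s per frame


def seconds_to_frames(seconds: float) -> int:
    """Latent frames needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds / SECONDS_PER_FRAME)


def fits_in_bucket(seconds: float) -> bool:
    """True if `seconds` of audio fits inside the static bucket."""
    return seconds_to_frames(seconds) <= BUCKET_FRAMES


soft_cap_s = 13.0                 # approximate soft cap described above
extended_cap_s = soft_cap_s + 3.0  # approximate cap with the bucket leak
print(seconds_to_frames(soft_cap_s), seconds_to_frames(extended_cap_s))
```

Under that assumption, both caps sit well inside the 320-frame bucket, which is why the remaining frames end up filled with the filler tone rather than speech.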

`--auto-pad` is a two-pass workaround:

1. **Pass 1** synthesizes the prompt alone at full bucket length to find
   where the model's content naturally stops (`t_orig`).
2. **Pass 2** appends a long filler sentence
   (`" And with that, the gentle silence wrapped itself around the room."`).
   The extra text gives the model more frames, so it fully renders the
   original prompt, then the filler sentence, and only then drops into
   the filler tone.
3. The longest clean-silence gap after `t_orig` is the boundary between
   the original prompt and the appended filler. The pipeline trims
   there and tail-pads with 0.5 s of true silence.
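Step 3 can be sketched in plain Python as follows. This is a minimal illustration, not the code in `inference.py`: the function names, the amplitude threshold used to decide what counts as "clean silence", and the list-of-floats audio representation are all assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def longest_silence_gap(
    samples: Sequence[float],
    sample_rate: int,
    t_orig: float,
    threshold: float = 1e-4,  # assumed amplitude below which audio counts as silent
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the longest silent run that
    begins at or after `t_orig` seconds, or None if there is no such run."""
    start = int(t_orig * sample_rate)
    best: Optional[Tuple[int, int]] = None
    run_start: Optional[int] = None
    # Iterate one past the end so a run that reaches the end is closed too.
    for i in range(start, len(samples) + 1):
        silent = i < len(samples) and abs(samples[i]) < threshold
        if silent:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best


def trim_and_pad(
    samples: Sequence[float],
    sample_rate: int,
    t_orig: float,
    tail_seconds: float = 0.5,
) -> List[float]:
    """Cut at the start of the longest silent gap after `t_orig`, then
    append `tail_seconds` of true silence."""
    gap = longest_silence_gap(samples, sample_rate, t_orig)
    cut = gap[0] if gap is not None else len(samples)
    return list(samples[:cut]) + [0.0] * int(tail_seconds * sample_rate)
```

Searching only after `t_orig` keeps natural pauses inside the original prompt from being mistaken for the prompt/filler boundary.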

Cost: ~2× synthesis time. Worth it for any prompt over ~5 s.

## ANE engagement

All 4 components compile to ANE-resident programs when loaded with
`compute_units=ALL` (default). Measured speedups on M2 Pro vs CPU:

| Component | ANE speedup |
| --- | --- |
| duration_predictor | 1.9× |
| text_encoder | 2.8× |
| vector_estimator | 2.4× (per step; 8 steps total) |
| vocoder | 3.7× |

Verify ANE engagement with:

```bash
xctrace record --template "Core ML" --output trace.trace --launch -- "$(which python)" inference.py --text "test"
```
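A rough way to reproduce a per-component speedup from Python is to load the same package twice, once with `ct.ComputeUnit.ALL` and once with `ct.ComputeUnit.CPU_ONLY`, and compare median latencies. The sketch below is an assumption about how such a measurement could be set up, not the procedure used for the table above; the helper names are hypothetical, and the input dictionary depends on the model's actual input names, which are not listed here.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(run: Callable[[], object], warmup: int = 3, iters: int = 20) -> float:
    """Median wall-clock latency of `run()` in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        run()
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        run()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def load_vocoder(cpu_only: bool = False):
    """Load the vocoder with or without ANE access (needs macOS and coremltools)."""
    import coremltools as ct  # lazy import so the timing helper works anywhere

    units = ct.ComputeUnit.CPU_ONLY if cpu_only else ct.ComputeUnit.ALL
    return ct.models.MLModel("fp16/vocoder.mlpackage", compute_units=units)


# Usage sketch; `inputs` is a placeholder dict keyed by the model's input names:
#   model_all = load_vocoder()
#   model_cpu = load_vocoder(cpu_only=True)
#   speedup = (median_latency_ms(lambda: model_cpu.predict(inputs))
#              / median_latency_ms(lambda: model_all.predict(inputs)))
```

A latency ratio only shows that the `ALL` configuration is faster; the `xctrace` recording remains the way to confirm that the work actually lands on the ANE.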

## Conversion notes

- Static bucket: T=320 (text length), L=320 (latent length). Inputs are
  zero-padded on the right and masked. Bucket = 22.3 s of audio.
- `duration_predictor`, `text_encoder`, `vocoder` are hand-reimplemented
  in PyTorch from the ONNX initializers, then traced to CoreML.
  Per-component float parity vs ONNX: max-abs 2e-5 (dp), cos 0.998
  (text_encoder), cos 0.9998 (vocoder).
- `vector_estimator` (the heavy diffusion model) goes through
  `onnxsim.simplify(T=L=320)` -> `onnx2torch.convert` -> `torch.jit.trace`
  -> coremltools. Cos 0.998 vs ONNX per diffusion step.
- The diffusion sampler stays host-side (8 Euler steps over the single
  step graph). All 4 components are individually quantizable.
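The host-side sampler in the last bullet can be sketched as a fixed-step Euler loop. This is a minimal illustration under stated assumptions: it treats the single-step graph as a function returning a velocity for the current latent and time, and integrates from t=0 to t=1. The real `vector_estimator` may use a different time convention or return the updated latent directly, and `step_fn` is a hypothetical stand-in for the CoreML call.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    step_fn: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int = 8,
) -> Vector:
    """Integrate dx/dt = step_fn(x, t) from t=0 to t=1 with fixed-step Euler.

    `step_fn` stands in for one call to the single-step graph (assumed to
    return a velocity with the same shape as `x`).
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        velocity = step_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


# Toy check with a constant velocity field: the latent moves by exactly
# that velocity over the unit time interval.
print(euler_sample(lambda x, t: [1.0, -2.0], [0.0, 0.0]))
```

Keeping the loop on the host means only one step graph has to be converted, at the cost of eight separate model calls per utterance.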

## License

This conversion follows the original Supertone/supertonic-3 license
(OpenRAIL). See `LICENSE` (or the upstream model card).

## Why fp16 and not INT4?

We attempted to ship an INT4 variant. After exhaustive testing
(summarized in the table below), the supertonic-3 architecture cannot
go below INT8 for the vocoder and vector_estimator:

| Component | INT4 result | Why |
| --- | --- | --- |
| **vocoder** | cos ≈ 0 across all 6 palette configs (kmeans/uniform, pgc with group_size 8/16/32/64). Mixed-precision (head-only carve-out) also broken. | HiFi-GAN-style upsampling is uniformly sensitive; INT8 (cos 0.99) is the floor. |
| **vector_estimator** | per-step cos 0.82-0.999; over 8 diffusion steps this compounds to garbled / "backwards-sounding" audio. | Flow-matching diffusion accumulates error. INT8 works (per-step cos 0.96-0.998 except short_ja_F3 which is architecturally cos 0.82 even at INT8). |
| **duration_predictor** | smallest drift was 0.11 s with pt_uniform, but that is enough to shift L_real → bucket-leak boundary moves → pacing perceptibly breaks. | dp output sets the diffusion frame budget; any drift propagates. |
| **text_encoder** | cos 0.97 at pgc_g32 (works alone). | Conditioning quality compounds with VE drift. |

The best achievable mixed config (only voc INT8, others fp16) saves
~25 MB out of 272 MB, which is not worth a separate variant. The fp16
build shipped here is the final deliverable.
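The `cos` and max-abs figures quoted in this card are read here as cosine similarity and maximum absolute difference between a reference output and the converted or quantized output. The sketch below shows how such parity metrics could be computed on flattened outputs; it is an illustration of the metrics under that reading, not the evaluation script used for the sweep.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened outputs (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two outputs."""
    return max(abs(x - y) for x, y in zip(a, b))


reference = [0.10, -0.20, 0.30, 0.05]   # made-up reference values
quantized = [0.11, -0.19, 0.29, 0.05]   # made-up quantized values
print(cosine_similarity(reference, quantized), max_abs_diff(reference, quantized))
```

Cosine similarity ignores overall scale, so pairing it with max-abs difference gives a fuller picture of how far a quantized component has drifted.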

## Companion build

The cross-platform LiteRT version is at
[Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert).
LiteRT ships INT4/INT8 + an INT8 quantized ONNX vector_estimator
(65 MB instead of 256 MB), but LiteRT can't reproduce the
CoreML-only "bucket-leak" extension, so long prompts sound rushed on
LiteRT. Use CoreML for full quality on Apple platforms.

## Credits

- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
- CoreML conversion + auto-pad workflow: this repo
- Companion LiteRT build: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)