---
license: openrail
language:
- en
- ja
- zh
- ko
- es
- fr
- de
- multilingual
library_name: coremltools
tags:
- coreml
- ane
- apple-neural-engine
- text-to-speech
- tts
- audio
- diffusion
- flow-matching
- on-device
- ios
- macos
- fp16
pipeline_tag: text-to-speech
base_model: Supertone/supertonic-3
base_model_relation: quantized
---
# Supertonic-3: CoreML (fp16, ANE-ready)
CoreML conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3),
a 99M-parameter multilingual TTS model. All 4 components run on the
Apple Neural Engine (1.8–3.7× faster than CPU on M-series chips).
| Component | Size | Role |
| --- | ---: | --- |
| `fp16/duration_predictor.mlpackage` | 15 MB | text -> frame count |
| `fp16/text_encoder.mlpackage` | 71 MB | text -> conditioning latent |
| `fp16/vector_estimator.mlpackage` | 135 MB | flow-matching denoiser (8 steps) |
| `fp16/vocoder.mlpackage` | 51 MB | latent -> 44.1 kHz waveform |
| **Total** | **272 MB** | (originals: ~400 MB ONNX) |
## Quickstart
```bash
pip install coremltools soundfile numpy supertonic
git clone https://huggingface.co/Reza2kn/supertonic-3-coreml
cd supertonic-3-coreml
# Short prompt
python inference.py --text "Hello, world." --voice F1 --lang en --out hello.wav
# Long prompt: use --auto-pad for full content rendering
python inference.py \
--text "A gentle breeze moved through the open window while everyone listened to the story. The narrator paused, took a slow breath, and continued in a softer tone. Outside, the city carried on, unaware of the quiet moment unfolding inside." \
--voice F5 --lang en --auto-pad --out long.wav
```
10 voice styles ship in `voice_styles/`: F1–F5 (female), M1–M5 (male).
31 languages supported via `unicode_indexer.json`.
## The auto-pad trick (why `--auto-pad` matters)
The supertonic-3 model has a soft cap on how much speech it renders per
utterance. For long inputs (more than ~13 s of natural speech) the model
truncates the prompt and emits a low-amplitude filler tone for the rest
of the budget. The CoreML conversion's static bucket (T=L=320) extends
this cap by ~3 s due to the way the bucket's padded positions leak into
the real positions through ConvNeXt's dilated convolutions, which is
**why CoreML inference sounds more natural than the original ONNX
library** (proper word separation, intonation), but it still cuts off
mid-sentence on long prompts.
`--auto-pad` is a two-pass workaround:
1. **Pass 1** synthesizes the prompt alone at full bucket length to find
where the model's content naturally stops (`t_orig`).
2. **Pass 2** appends a long filler sentence
(`" And with that, the gentle silence wrapped itself around the room."`)
that gives the model extra frames to fully render the original
prompt, then renders the filler sentence, then drops into the filler
tone.
3. The longest clean-silence gap after `t_orig` is the boundary between
the original prompt and the appended filler. The pipeline trims
there and tail-pads with 0.5 s of true silence.
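The trimming in step 3 can be pictured with the following minimal sketch. It is not the code in `inference.py`: the function names, the 10 ms frame size, the silence threshold, the choice to cut at the start of the gap, and the assumption that `t_orig` is in seconds are all hypothetical and chosen only for illustration.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy of each non-overlapping frame of `frame_len` samples."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def longest_silence_after(samples, sr, t_orig, frame_ms=10, threshold=1e-3):
    """Return (start, end) sample indices of the longest silent run that
    begins at or after t_orig (seconds, ASSUMED unit), or None if none exists."""
    frame_len = max(1, int(sr * frame_ms / 1000))
    start_frame = int(t_orig * sr) // frame_len
    rms = frame_rms(samples, frame_len)
    best, run_start = None, None
    # Iterate one past the end so a run that reaches the last frame is closed.
    for i in range(start_frame, len(rms) + 1):
        silent = i < len(rms) and rms[i] < threshold
        if silent and run_start is None:
            run_start = i
        elif not silent and run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    if best is None:
        return None
    return best[0] * frame_len, best[1] * frame_len


def trim_and_pad(samples, sr, t_orig, tail_s=0.5):
    """Cut at the start of the longest silence gap (ASSUMED cut point),
    then append tail_s seconds of true silence."""
    gap = longest_silence_after(samples, sr, t_orig)
    cut = gap[0] if gap else len(samples)
    return list(samples[:cut]) + [0.0] * int(sr * tail_s)
```

The threshold sits below the amplitude of the filler tone on purpose, so that only true silence counts as a gap; a real pipeline would need to tune it against actual model output.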
Cost: ~2× synthesis time. Worth it for any prompt over ~5 s.
## ANE engagement
All 4 components compile to ANE-resident programs when loaded with
`compute_units=ALL` (default). Measured speedups on M2 Pro vs CPU:
| Component | ANE speedup |
| --- | --- |
| duration_predictor | 1.9× |
| text_encoder | 2.8× |
| vector_estimator | 2.4× (per step; 8 steps total) |
| vocoder | 3.7× |
Verify ANE engagement with:
```bash
xctrace record --template "Core ML" --output trace.trace --launch -- python inference.py --text "test"
```
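One way to reproduce a speedup figure like those in the table is to time the same call under two compute-unit settings and take the ratio of the medians. The helper below is a generic sketch, not the benchmark used for this repo; the function names, run counts, and warmup policy are assumptions.

```python
import statistics
import time


def median_latency(fn, runs=20, warmup=3):
    """Median wall-clock seconds of fn() over `runs` calls, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s
```

With coremltools, `fn` would typically wrap `model.predict(inputs)` for one model loaded with `ct.ComputeUnit.CPU_ONLY` and one loaded with `ct.ComputeUnit.ALL`. The input dictionary depends on each component and is not specified here.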
## Conversion notes
- Static bucket: T=320 (text length), L=320 (latent length). Inputs are
zero-padded on the right and masked. Bucket = 22.3 s of audio.
- `duration_predictor`, `text_encoder`, `vocoder` are hand-reimplemented
in PyTorch from the ONNX initializers, then traced to CoreML.
Per-component float parity vs ONNX: max-abs 2e-5 (dp), cos 0.998
(text_encoder), cos 0.9998 (vocoder).
- `vector_estimator` (the heavy diffusion model) goes through
`onnxsim.simplify(T=L=320)` -> `onnx2torch.convert` -> `torch.jit.trace`
-> coremltools. Cos 0.998 vs ONNX per diffusion step.
- The diffusion sampler stays host-side (8 Euler steps over the single
step graph). All 4 components are individually quantizable.
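
Two of the notes above, the right-padded static bucket and the host-side Euler sampler, can be sketched in plain Python. This is an illustration under stated assumptions rather than the repo's implementation: `pad_to_bucket`, `bucket_seconds`, `euler_sample`, and `velocity_fn` are hypothetical names, the 3072 samples-per-frame constant is inferred only because it reproduces the 22.3 s figure, and the exported step graph may return the updated latent instead of a velocity.

```python
SAMPLE_RATE = 44_100       # vocoder output rate, from the component table
BUCKET = 320               # T = L = 320, from the notes above
SAMPLES_PER_FRAME = 3072   # ASSUMPTION: picked so 320 frames come to ~22.3 s


def pad_to_bucket(ids, bucket=BUCKET, pad_id=0):
    """Right-pad a token-id sequence to the bucket length and build a 1/0 mask."""
    if len(ids) > bucket:
        raise ValueError(f"{len(ids)} tokens do not fit in a bucket of {bucket}")
    n_pad = bucket - len(ids)
    return list(ids) + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad


def bucket_seconds(frames=BUCKET):
    """Audio duration covered by `frames` latent frames, under the assumed hop."""
    return frames * SAMPLES_PER_FRAME / SAMPLE_RATE


def euler_sample(velocity_fn, x0, n_steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps.

    ASSUMPTION: time runs 0 -> 1 and the per-step model yields a velocity.
    """
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In the real pipeline `velocity_fn` would call the `vector_estimator` model once per step, which is why the per-step speedup in the ANE table is multiplied by 8 for a full utterance.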
## License
This conversion follows the original Supertone/supertonic-3 license
(OpenRAIL). See `LICENSE` (or the upstream model card).
## Why fp16 and not INT4?
We attempted to ship an INT4 variant. After exhaustive testing (INT4
sweep results in the table below), we found that the vocoder and
vector_estimator in the supertonic-3 architecture cannot be quantized
below INT8:
| Component | INT4 result | Why |
| --- | --- | --- |
| **vocoder** | cos ≈ 0 across all 6 palette configs (kmeans/uniform, pgc with group_size 8/16/32/64). Mixed-precision (head-only carve-out) also broken. | HiFi-GAN-style upsampling is uniformly sensitive; INT8 (cos 0.99) is the floor. |
| **vector_estimator** | per-step cos 0.82-0.999; over 8 diffusion steps this compounds to garbled / "backwards-sounding" audio. | Flow-matching diffusion accumulates error. INT8 works (per-step cos 0.96-0.998 except short_ja_F3 which is architecturally cos 0.82 even at INT8). |
| **duration_predictor** | Smallest drift was 0.11 s with pt_uniform, but that is enough to shift L_real → bucket-leak boundary moves → pacing perceptibly breaks. | dp output sets the diffusion frame budget; any drift propagates. |
| **text_encoder** | cos 0.97 at pgc_g32 (works alone). | Conditioning quality compounds with VE drift. |
The best achievable mixed config (vocoder at INT8, everything else
fp16) saves ~25 MB out of 272 MB, which is not worth a separate variant.
The fp16 build shipped here is the final deliverable.
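
The `cos` and `max-abs` figures quoted in this README read as cosine similarity and maximum absolute difference between a reference output and a converted or quantized output. A minimal sketch of both metrics, assuming flattened outputs of equal length (the function names are illustrative, not taken from the repo):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length vectors."""
    return max(abs(x - y) for x, y in zip(a, b))
```

Cosine similarity ignores overall scale, so it is best read together with an absolute-error metric when judging a quantized component.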
## Companion build
The cross-platform LiteRT version is at
[Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert).
LiteRT ships INT4/INT8 plus an INT8-quantized ONNX vector_estimator
(65 MB instead of 256 MB), but LiteRT can't reproduce the
CoreML-only "bucket-leak" extension, so long prompts sound rushed on
LiteRT. Use CoreML for full quality on Apple platforms.
## Credits
- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
- CoreML conversion + auto-pad workflow: this repo
- Companion LiteRT build: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)