---
license: openrail
language:
- en
- ja
- zh
- ko
- es
- fr
- de
- multilingual
library_name: coremltools
tags:
- coreml
- ane
- apple-neural-engine
- text-to-speech
- tts
- audio
- diffusion
- flow-matching
- on-device
- ios
- macos
- fp16
pipeline_tag: text-to-speech
base_model: Supertone/supertonic-3
base_model_relation: quantized
---

# Supertonic-3 — CoreML (fp16, ANE-ready)

CoreML conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3),
a 99M-parameter multilingual TTS model. All 4 components run on the
Apple Neural Engine (1.9–3.7× faster than CPU, measured on an M2 Pro).

| Component | Size | Role |
| --- | ---: | --- |
| `fp16/duration_predictor.mlpackage` | 15 MB | text -> frame count |
| `fp16/text_encoder.mlpackage` | 71 MB | text -> conditioning latent |
| `fp16/vector_estimator.mlpackage` | 135 MB | flow-matching denoiser (8 steps) |
| `fp16/vocoder.mlpackage` | 51 MB | latent -> 44.1 kHz waveform |
| **Total** | **272 MB** | (originals: ~400 MB ONNX) |

## Quickstart

```bash
pip install coremltools soundfile numpy supertonic
git clone https://huggingface.co/Reza2kn/supertonic-3-coreml
cd supertonic-3-coreml

# Short prompt
python inference.py --text "Hello, world." --voice F1 --lang en --out hello.wav

# Long prompt — use --auto-pad for full content rendering
python inference.py \
  --text "A gentle breeze moved through the open window while everyone listened to the story. The narrator paused, took a slow breath, and continued in a softer tone. Outside, the city carried on, unaware of the quiet moment unfolding inside." \
  --voice F5 --lang en --auto-pad --out long.wav
```

10 voice styles ship in `voice_styles/`: F1–F5 (female), M1–M5 (male).
31 languages supported via `unicode_indexer.json`.

## The auto-pad trick (why `--auto-pad` matters)

The supertonic-3 model has a soft cap on how much speech it renders per
utterance. For long inputs (more than ~13 s of natural speech) the model
truncates the prompt and emits a low-amplitude filler tone for the rest
of the budget. The CoreML conversion's static bucket (T=L=320) extends
this cap by ~3 s because the bucket's padded positions leak into
the real positions through ConvNeXt's dilated convolutions. That is
**why CoreML inference sounds more natural than the original ONNX
library** (proper word separation, intonation), but it still cuts off
mid-sentence on long prompts.

`--auto-pad` is a two-pass workaround followed by a trim:

1. **Pass 1** synthesizes the prompt alone at full bucket length to find
   where the model's content naturally stops (`t_orig`).
2. **Pass 2** re-synthesizes the prompt with a long filler sentence appended
   (`" And with that, the gentle silence wrapped itself around the room."`).
   The extra text gives the model enough frames to render the original
   prompt in full; it then renders the filler sentence and finally drops
   into the filler tone.
3. **Trim.** The longest clean-silence gap after `t_orig` marks the boundary
   between the original prompt and the appended filler. The pipeline trims
   there and tail-pads with 0.5 s of true silence.

Cost: ~2× synthesis time. Worth it for any prompt over ~5 s.
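
The trim step can be sketched in NumPy as follows. This is a minimal illustration, not the code in `inference.py`: the amplitude threshold, the 10 ms analysis frame, the choice to cut at the start of the gap, and the function names are all assumptions.

```python
import numpy as np

SAMPLE_RATE = 44_100  # vocoder output rate from the component table


def longest_quiet_gap(wave, start, threshold=1e-3, frame=441):
    """Return (begin, end) sample indices of the longest quiet run at or after `start`.

    A frame (441 samples = 10 ms) counts as quiet when its peak absolute
    amplitude is below `threshold` (an assumed value). Returns None when
    no quiet frame is found.
    """
    n_frames = (len(wave) - start) // frame
    if n_frames <= 0:
        return None
    frames = np.abs(wave[start:start + n_frames * frame]).reshape(n_frames, frame)
    quiet = frames.max(axis=1) < threshold
    best = None
    run_start = None
    # The trailing False is a sentinel that closes a run reaching the end.
    for i, is_quiet in enumerate(np.append(quiet, False)):
        if is_quiet and run_start is None:
            run_start = i
        elif not is_quiet and run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    if best is None:
        return None
    return start + best[0] * frame, start + best[1] * frame


def trim_at_gap(wave, t_orig, tail_seconds=0.5):
    """Cut `wave` at the start of the longest quiet gap after `t_orig`, then append silence."""
    gap = longest_quiet_gap(wave, t_orig)
    cut = len(wave) if gap is None else gap[0]
    tail = np.zeros(int(tail_seconds * SAMPLE_RATE), dtype=wave.dtype)
    return np.concatenate([wave[:cut], tail])
```

Here `t_orig` is a sample index taken from pass 1 and `wave` is the pass 2 waveform. If no quiet gap is found, the sketch keeps the full waveform rather than guessing a cut point.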

## ANE engagement

All 4 components compile to ANE-resident programs when loaded with
`compute_units=ct.ComputeUnit.ALL` (the coremltools default). Measured
speedups on M2 Pro vs CPU:

| Component | ANE speedup |
| --- | --- |
| duration_predictor | 1.9× |
| text_encoder | 2.8× |
| vector_estimator | 2.4× (per step; 8 steps total) |
| vocoder | 3.7× |

Verify ANE engagement with:

```bash
xctrace record --template "Core ML" --output trace.trace --launch -- python inference.py --text "test"
```
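
A rough way to reproduce a CPU-vs-ALL comparison is to load each package twice and time the two variants. The sketch below is an illustration under assumptions: the helper names are hypothetical, the warmup and run counts are arbitrary, and the input dictionary for `predict` is not shown because the input names of each package are not listed in this card.

```python
import time
from statistics import median

COMPONENTS = ("duration_predictor", "text_encoder", "vector_estimator", "vocoder")


def median_latency(fn, warmup=3, runs=20):
    """Median wall-clock seconds of calling `fn()`, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return median(samples)


def speedup(baseline_fn, candidate_fn, **kwargs):
    """How many times faster `candidate_fn` is than `baseline_fn`."""
    return median_latency(baseline_fn, **kwargs) / median_latency(candidate_fn, **kwargs)


def load_component(name, cpu_only=False, root="fp16"):
    """Load one .mlpackage with CPU-only or all compute units (requires macOS)."""
    import coremltools as ct  # imported lazily so the timing helpers run anywhere

    units = ct.ComputeUnit.CPU_ONLY if cpu_only else ct.ComputeUnit.ALL
    return ct.models.MLModel(f"{root}/{name}.mlpackage", compute_units=units)
```

With `cpu = load_component("vocoder", cpu_only=True)` and `ane = load_component("vocoder")`, a call such as `speedup(lambda: cpu.predict(inputs), lambda: ane.predict(inputs))` gives a comparable ratio, where `inputs` is a dict matching the names reported by `cpu.get_spec().description.input`. A timing ratio alone does not prove the ANE was used; the `xctrace` recording above is the direct check.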

## Conversion notes

- Static bucket: T=320 (text length), L=320 (latent length). Inputs are
  zero-padded on the right and masked. Bucket = 22.3 s of audio.
- `duration_predictor`, `text_encoder`, `vocoder` are hand-reimplemented
  in PyTorch from the ONNX initializers, then traced to CoreML.
  Per-component float parity vs ONNX: max-abs error 2e-5
  (duration_predictor), cosine similarity 0.998 (text_encoder) and
  0.9998 (vocoder).
- `vector_estimator` (the heavy diffusion model) goes through
  `onnxsim.simplify(T=L=320)` -> `onnx2torch.convert` -> `torch.jit.trace`
  -> coremltools. Cos 0.998 vs ONNX per diffusion step.
- The diffusion sampler stays host-side (8 Euler steps over the single
  step graph). All 4 components are individually quantizable.
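
Two of the points above, right-padding to the static bucket and the host-side Euler loop, can be sketched as follows. This is a minimal illustration under assumptions: the padded axis being the last one, the time range running from 0 to 1, and the `velocity_fn(x, t)` signature are hypothetical and may differ from what `inference.py` actually does.

```python
import numpy as np

BUCKET = 320  # static T (text) and L (latent) length from the notes above


def pad_to_bucket(x, bucket=BUCKET):
    """Zero-pad the last axis of `x` on the right to `bucket`; return (padded, mask).

    The mask holds 1.0 at real positions and 0.0 at padded positions.
    """
    n = x.shape[-1]
    if n > bucket:
        raise ValueError(f"length {n} exceeds bucket {bucket}")
    pad_width = [(0, 0)] * (x.ndim - 1) + [(0, bucket - n)]
    mask = np.zeros(bucket, dtype=np.float32)
    mask[:n] = 1.0
    return np.pad(x, pad_width), mask


def euler_sample(velocity_fn, x0, steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0.astype(np.float32)
    dt = 1.0 / steps
    for i in range(steps):
        # One call to the single-step graph per iteration.
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

In a real pipeline `velocity_fn` would wrap one `predict` call on `vector_estimator`, so the loop costs 8 model invocations per utterance.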

## License

This conversion follows the original Supertone/supertonic-3 license
(OpenRAIL). See `LICENSE` (or the upstream model card).

## Why fp16 and not INT4?

We attempted to ship an INT4 variant. After exhaustive testing
(summarized in the table below), the supertonic-3 architecture
cannot go below INT8 for the vocoder and vector_estimator:

| Component | INT4 result | Why |
| --- | --- | --- |
| **vocoder** | cos ≈ 0 across all 6 palette configs (kmeans/uniform, pgc with group_size 8/16/32/64). Mixed-precision (head-only carve-out) also broken. | HiFi-GAN-style upsampling is uniformly sensitive — INT8 (cos 0.99) is the floor. |
| **vector_estimator** | per-step cos 0.82-0.999; over 8 diffusion steps this compounds to garbled / "backwards-sounding" audio. | Flow-matching diffusion accumulates error. INT8 works (per-step cos 0.96-0.998 except short_ja_F3 which is architecturally cos 0.82 even at INT8). |
| **duration_predictor** | Smallest drift was 0.11 s (pt_uniform), which is still enough to shift L_real, move the bucket-leak boundary, and perceptibly break pacing. | dp output sets the diffusion frame budget; any drift propagates. |
| **text_encoder** | cos 0.97 at pgc_g32 (works alone). | Conditioning quality compounds with VE drift. |

The best achievable mixed config (vocoder at INT8, everything else fp16)
saves ~25 MB out of 272 MB, which is not worth a separate variant. The
fp16 build shipped here is the final deliverable.
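
The `cos` figures in this card are presumably cosine similarities between a reference output and the converted or quantized output. A minimal sketch of that metric, plus the angle it implies, is shown here for readers who want to run their own parity checks. The function names are hypothetical and this is not the evaluation script used for the table.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two arrays, flattened and computed in float64."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Two all-zero outputs have no defined direction; report 0.0 in that case.
    return float(a @ b / denom) if denom > 0 else 0.0


def angle_degrees(cos_value):
    """Angle between two output vectors implied by a cosine score."""
    return float(np.degrees(np.arccos(np.clip(cos_value, -1.0, 1.0))))
```

Converting to an angle makes the gap easier to read: a score of 0.998 is a deviation of under 4 degrees, while 0.82 is roughly 35 degrees.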

## Companion build

The cross-platform LiteRT version is at
[Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert).
It ships INT4/INT8 variants plus an INT8-quantized ONNX vector_estimator
(65 MB instead of 256 MB). LiteRT cannot reproduce the CoreML-only
"bucket-leak" extension, however, so long prompts sound rushed there.
Use CoreML for full quality on Apple platforms.

## Credits

- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
- CoreML conversion + auto-pad workflow: this repo
- Companion LiteRT build: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)