metadata
license: openrail
language:
  - en
  - ja
  - zh
  - ko
  - es
  - fr
  - de
  - multilingual
library_name: coremltools
tags:
  - coreml
  - ane
  - apple-neural-engine
  - text-to-speech
  - tts
  - audio
  - diffusion
  - flow-matching
  - on-device
  - ios
  - macos
  - fp16
pipeline_tag: text-to-speech
base_model: Supertone/supertonic-3
base_model_relation: quantized

Supertonic-3: CoreML (fp16, ANE-ready)

CoreML conversion of Supertone/supertonic-3, a 99M-parameter multilingual TTS model. All 4 components run on the Apple Neural Engine (1.8–3.7× faster than CPU on M-series chips).

| Component | Size | Role |
|---|---|---|
| fp16/duration_predictor.mlpackage | 15 MB | text -> frame count |
| fp16/text_encoder.mlpackage | 71 MB | text -> conditioning latent |
| fp16/vector_estimator.mlpackage | 135 MB | flow-matching denoiser (8 steps) |
| fp16/vocoder.mlpackage | 51 MB | latent -> 44.1 kHz waveform |
| Total | 272 MB | (originals: ~400 MB ONNX) |

Quickstart

pip install coremltools soundfile numpy supertonic
git clone https://huggingface.co/Reza2kn/supertonic-3-coreml
cd supertonic-3-coreml

# Short prompt
python inference.py --text "Hello, world." --voice F1 --lang en --out hello.wav

# Long prompt: use --auto-pad for full content rendering
python inference.py \
  --text "A gentle breeze moved through the open window while everyone listened to the story. The narrator paused, took a slow breath, and continued in a softer tone. Outside, the city carried on, unaware of the quiet moment unfolding inside." \
  --voice F5 --lang en --auto-pad --out long.wav

10 voice styles ship in voice_styles/: F1–F5 (female), M1–M5 (male). 31 languages supported via unicode_indexer.json.

The auto-pad trick (why --auto-pad matters)

The supertonic-3 model has a soft cap on how much speech it renders per utterance. For long inputs (more than ~13 s of natural speech) the model truncates the prompt and emits a low-amplitude filler tone for the rest of the budget. The CoreML conversion's static bucket (T=L=320) extends this cap by ~3 s, because the bucket's padded positions leak into the real positions through ConvNeXt's dilated convolutions. That is why CoreML inference sounds more natural than the original ONNX library (proper word separation, intonation), but it still cuts off mid-sentence on long prompts.

--auto-pad is a two-pass workaround followed by a trim:

  1. Pass 1 synthesizes the prompt alone at full bucket length to find where the model's content naturally stops (t_orig).
  2. Pass 2 appends a long filler sentence (" And with that, the gentle silence wrapped itself around the room.") and synthesizes again. The extra text gives the model enough frames to render the original prompt in full; it then renders the filler sentence and finally drops into the filler tone.
  3. The longest clean-silence gap after t_orig marks the boundary between the original prompt and the appended filler. The pipeline trims there and tail-pads with 0.5 s of true silence.

Cost: ~2× synthesis time. Worth it for any prompt over ~5 s.
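The trim step can be sketched roughly as below. This is an illustration only, not the code in inference.py: the function names, the RMS threshold, and the 10 ms frame size are assumptions, and cutting at the start of the gap is one plausible reading of "trims there".

```python
import numpy as np


def longest_silence_gap(wave, sr, t_start, threshold=1e-3, frame_ms=10.0):
    """Return (start_s, end_s) of the longest quiet run after t_start, or None.

    A frame counts as quiet when its RMS is below `threshold` (assumed value).
    """
    hop = max(1, int(sr * frame_ms / 1000))
    first = int(t_start * sr) // hop
    n_frames = len(wave) // hop
    best = None
    run_start = None
    # Iterate one step past the end so a run touching the end gets closed.
    for i in range(first, n_frames + 1):
        quiet = False
        if i < n_frames:
            frame = wave[i * hop:(i + 1) * hop]
            quiet = float(np.sqrt(np.mean(frame ** 2))) < threshold
        if quiet and run_start is None:
            run_start = i
        elif not quiet and run_start is not None:
            if best is None or (i - run_start) > (best[1] - best[0]):
                best = (run_start, i)
            run_start = None
    if best is None:
        return None
    return best[0] * hop / sr, best[1] * hop / sr


def trim_at_gap(wave, sr, t_orig, tail_s=0.5, **kwargs):
    """Cut at the start of the longest gap after t_orig, then append silence."""
    gap = longest_silence_gap(wave, sr, t_orig, **kwargs)
    cut = len(wave) if gap is None else int(gap[0] * sr)
    tail = np.zeros(int(tail_s * sr), dtype=wave.dtype)
    return np.concatenate([wave[:cut], tail])
```

Here `wave` would be the pass-2 waveform and `t_orig` the content end time found in pass 1. If no quiet run is found, the sketch keeps the whole waveform and only adds the tail.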

ANE engagement

All 4 components compile to ANE-resident programs when loaded with compute_units=ALL (default). Measured speedups on M2 Pro vs CPU:

| Component | ANE speedup |
|---|---|
| duration_predictor | 1.9× |
| text_encoder | 2.8× |
| vector_estimator | 2.4× (per step; 8 steps total) |
| vocoder | 3.7× |

Verify ANE engagement with:

xctrace record --template "Core ML" --output trace.trace --launch -- python inference.py --text "test"
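For reference, loading the four packages with an explicit compute-unit setting might look like the sketch below. The helper names `component_paths` and `load_components` are hypothetical and not part of this repository; the package paths follow the component table, and `coremltools.ComputeUnit` and `coremltools.models.MLModel` are the standard coremltools entry points.

```python
from pathlib import Path

COMPONENTS = ("duration_predictor", "text_encoder", "vector_estimator", "vocoder")


def component_paths(root=".", precision="fp16"):
    """Map each component name to its .mlpackage path under `root`."""
    return {name: Path(root) / precision / f"{name}.mlpackage" for name in COMPONENTS}


def load_components(root=".", compute_units="ALL"):
    """Load all four packages; compute_units is a ComputeUnit member name.

    Valid names include ALL, CPU_ONLY, CPU_AND_GPU, and CPU_AND_NE.
    """
    import coremltools as ct  # imported lazily; running Core ML models needs macOS

    units = getattr(ct.ComputeUnit, compute_units)
    return {
        name: ct.models.MLModel(str(path), compute_units=units)
        for name, path in component_paths(root).items()
    }
```

Loading once with `compute_units="CPU_ONLY"` and once with `"ALL"` is one way to set up a timing comparison like the table above.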

Conversion notes

  • Static bucket: T=320 (text length), L=320 (latent length). Inputs are zero-padded on the right and masked. Bucket = 22.3 s of audio.
  • duration_predictor, text_encoder, and vocoder are hand-reimplemented in PyTorch from the ONNX initializers, then traced to CoreML. Per-component float parity vs ONNX: max-abs 2e-5 (duration_predictor), cos 0.998 (text_encoder), cos 0.9998 (vocoder).
  • vector_estimator (the heavy diffusion model) goes through onnxsim.simplify(T=L=320) -> onnx2torch.convert -> torch.jit.trace -> coremltools. Cos 0.998 vs ONNX per diffusion step.
  • The diffusion sampler stays host-side (8 Euler steps over the single step graph). All 4 components are individually quantizable.
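A host-side Euler loop over a single-step graph can be sketched as follows. This is a generic flow-matching Euler integrator under assumed conventions (time running from 0 to 1, the model returning a velocity); the actual inputs and outputs of vector_estimator may differ, and `velocity_fn` is a hypothetical stand-in for one call to the CoreML model.

```python
import numpy as np


def euler_sample(velocity_fn, x0, n_steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = np.asarray(x0, dtype=np.float32).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        # One model evaluation per step; eight steps means eight model calls.
        x = x + dt * velocity_fn(x, t)
    return x
```

Keeping the loop on the host means the converted graph only needs to cover a single step, and the step count can be changed without reconverting the model.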

License

This conversion follows the original Supertone/supertonic-3 license (OpenRAIL). See LICENSE (or the upstream model card).

Why fp16 and not INT4?

We attempted to ship an INT4 variant. Exhaustive testing (INT4 sweep notes below) showed that INT8 is the lowest workable precision for the vocoder and vector_estimator:

| Component | INT4 result | Why |
|---|---|---|
| vocoder | cos ≈ 0 across all 6 palette configs (kmeans/uniform, pgc with group_size 8/16/32/64). Mixed-precision (head-only carve-out) also broken. | HiFi-GAN-style upsampling is uniformly sensitive; INT8 (cos 0.99) is the floor. |
| vector_estimator | per-step cos 0.82-0.999; over 8 diffusion steps this compounds to garbled / "backwards-sounding" audio. | Flow-matching diffusion accumulates error. INT8 works (per-step cos 0.96-0.998 except short_ja_F3 which is architecturally cos 0.82 even at INT8). |
| duration_predictor | smallest drift was 0.11s with pt_uniform, but enough to shift L_real → bucket-leak boundary moves → pacing perceptibly breaks. | dp output sets the diffusion frame budget; any drift propagates. |
| text_encoder | cos 0.97 at pgc_g32 (works alone). | Conditioning quality compounds with VE drift. |

The best achievable mixed config (only voc INT8, others fp16) saves ~25 MB out of 272 MB, which is not worth a separate variant. The fp16 build shipped here is the final deliverable.
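The max-abs and cos figures quoted in this card are standard parity metrics between a reference output and a converted or quantized output. A minimal sketch of how such numbers can be computed, assuming both outputs are available as NumPy arrays of the same shape (the `parity` helper is hypothetical, not a function from this repository):

```python
import numpy as np


def parity(ref, test):
    """Return (max_abs_difference, cosine_similarity) between two arrays."""
    ref = np.asarray(ref, dtype=np.float64).ravel()
    test = np.asarray(test, dtype=np.float64).ravel()
    max_abs = float(np.max(np.abs(ref - test)))
    denom = np.linalg.norm(ref) * np.linalg.norm(test)
    # Cosine is undefined for an all-zero vector; report 0.0 in that case.
    cos = float(ref @ test / denom) if denom > 0 else 0.0
    return max_abs, cos
```

A cosine near 1 means the two outputs point in nearly the same direction, while a cosine near 0, as seen for the INT4 vocoder, means the quantized output is essentially unrelated to the reference.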

Companion build

The cross-platform LiteRT version is at Reza2kn/supertonic-3-litert. LiteRT ships INT4/INT8 plus an INT8 quantized ONNX vector_estimator (65 MB instead of 256 MB), but LiteRT can't reproduce the CoreML-only "bucket-leak" extension, so long prompts sound rushed on LiteRT. Use CoreML for full quality on Apple platforms.

Credits