| --- |
| license: openrail |
| language: |
| - en |
| - ja |
| - zh |
| - ko |
| - es |
| - fr |
| - de |
| - multilingual |
| library_name: coremltools |
| tags: |
| - coreml |
| - ane |
| - apple-neural-engine |
| - text-to-speech |
| - tts |
| - audio |
| - diffusion |
| - flow-matching |
| - on-device |
| - ios |
| - macos |
| - fp16 |
| pipeline_tag: text-to-speech |
| base_model: Supertone/supertonic-3 |
| base_model_relation: quantized |
| --- |
| |
| # Supertonic-3: CoreML (fp16, ANE-ready) |
|
|
| CoreML conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3), |
| a 99M-parameter multilingual TTS model. All 4 components run on the |
| Apple Neural Engine (1.8–3.7× faster than CPU on M-series chips). |
|
|
| | Component | Size | Role | |
| | --- | ---: | --- | |
| | `fp16/duration_predictor.mlpackage` | 15 MB | text -> frame count | |
| | `fp16/text_encoder.mlpackage` | 71 MB | text -> conditioning latent | |
| | `fp16/vector_estimator.mlpackage` | 135 MB | flow-matching denoiser (8 steps) | |
| | `fp16/vocoder.mlpackage` | 51 MB | latent -> 44.1 kHz waveform | |
| | **Total** | **272 MB** | (originals: ~400 MB ONNX) | |
|
|
| ## Quickstart |
|
|
| ```bash |
| pip install coremltools soundfile numpy supertonic |
| git clone https://huggingface.co/Reza2kn/supertonic-3-coreml |
| cd supertonic-3-coreml |
| |
| # Short prompt |
| python inference.py --text "Hello, world." --voice F1 --lang en --out hello.wav |
| |
| # Long prompt: use --auto-pad for full content rendering |
| python inference.py \ |
| --text "A gentle breeze moved through the open window while everyone listened to the story. The narrator paused, took a slow breath, and continued in a softer tone. Outside, the city carried on, unaware of the quiet moment unfolding inside." \ |
| --voice F5 --lang en --auto-pad --out long.wav |
| ``` |
|
|
| 10 voice styles ship in `voice_styles/`: F1–F5 (female), M1–M5 (male). |
| 31 languages supported via `unicode_indexer.json`. |
|
|
| ## The auto-pad trick (why `--auto-pad` matters) |
|
|
| The supertonic-3 model has a soft cap on how much speech it renders per |
| utterance. For long inputs (more than ~13 s of natural speech) the model |
| truncates the prompt and emits a low-amplitude filler tone for the rest |
| of the budget. The CoreML conversion's static bucket (T=L=320) extends |
| this cap by ~3 s, because the bucket's padded positions leak into the |
| real positions through ConvNeXt's dilated convolutions. That leak is |
| **why CoreML inference sounds more natural than the original ONNX |
| library** (proper word separation, intonation), but the output still |
| cuts off mid-sentence on long prompts. |
|
|
| `--auto-pad` is a two-pass workaround: |
|
|
| 1. **Pass 1** synthesizes the prompt alone at full bucket length to find |
| where the model's content naturally stops (`t_orig`). |
| 2. **Pass 2** re-synthesizes the prompt with a long filler sentence |
| (`" And with that, the gentle silence wrapped itself around the room."`) |
| appended. The extra text gives the model enough frames to render the |
| original prompt in full; it then renders the filler sentence and |
| finally drops into the filler tone. |
| 3. The longest clean-silence gap after `t_orig` is the boundary between |
| the original prompt and the appended filler. The pipeline trims |
| there and tail-pads with 0.5 s of true silence. |
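
Step 3 can be sketched in plain Python as follows. This is an illustrative sketch, not the code in `inference.py`: the function names, the 10 ms analysis frame, the `1e-3` amplitude threshold, and the choice to cut at the start of the gap are all assumptions made for the example.

```python
def longest_silent_gap(samples, sample_rate, t_orig, threshold=1e-3, frame_ms=10):
    """Return (start_sample, end_sample) of the longest quiet run at or after t_orig.

    A frame counts as quiet when its peak absolute amplitude is below
    `threshold` (assumed value). Returns None if no quiet frame is found.
    """
    frame = max(1, int(sample_rate * frame_ms / 1000))
    start_frame = int(t_orig * sample_rate) // frame
    n_frames = len(samples) // frame

    best = None        # (first_quiet_frame, first_loud_frame_after_it)
    run_start = None   # frame index where the current quiet run began

    # Iterate one frame past the end so a trailing quiet run is closed too.
    for i in range(start_frame, n_frames + 1):
        quiet = False
        if i < n_frames:
            chunk = samples[i * frame:(i + 1) * frame]
            quiet = max(abs(x) for x in chunk) < threshold
        if quiet:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if best is None or (i - run_start) > (best[1] - best[0]):
                best = (run_start, i)
            run_start = None

    if best is None:
        return None
    return best[0] * frame, best[1] * frame


def trim_and_pad(samples, sample_rate, t_orig, tail_s=0.5, **gap_kwargs):
    """Cut at the start of the longest gap after t_orig, then append true silence."""
    gap = longest_silent_gap(samples, sample_rate, t_orig, **gap_kwargs)
    cut = len(samples) if gap is None else gap[0]
    return list(samples[:cut]) + [0.0] * int(tail_s * sample_rate)
```

The low-amplitude filler tone is not true silence, so a peak threshold is one simple way to tell it apart from the clean gap between the prompt and the appended sentence; the right threshold would need tuning against real output.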
|
|
| Cost: ~2× synthesis time. Worth it for any prompt over ~5 s. |
|
|
| ## ANE engagement |
|
|
| All 4 components compile to ANE-resident programs when loaded with |
| `compute_units=ct.ComputeUnit.ALL` (the coremltools default). Measured speedups on M2 Pro vs CPU: |
|
|
| | Component | ANE speedup | |
| | --- | --- | |
| | duration_predictor | 1.9× | |
| | text_encoder | 2.8× | |
| | vector_estimator | 2.4× (per step; 8 steps total) | |
| | vocoder | 3.7× | |
| |
| Verify ANE engagement with: |
| |
| ```bash |
| xctrace record --template "Core ML" --output trace.trace --launch -- python inference.py --text "test" |
| ``` |
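
A rough way to reproduce per-component speedups like those in the table above is to time `predict` calls under two compute-unit settings. The helper below is a minimal sketch: `median_latency_ms` and `speedup` are hypothetical names, and the commented usage leaves the input dict unfilled because the models' input names are not listed in this card.

```python
import statistics
import time


def median_latency_ms(predict, inputs, warmup=3, iters=20):
    """Median wall-clock latency of predict(inputs) in milliseconds.

    Warm-up calls are discarded so one-time compilation and caching
    costs do not skew the measurement.
    """
    for _ in range(warmup):
        predict(inputs)
    timings = []
    for _ in range(iters):
        t0 = time.perf_counter()
        predict(inputs)
        timings.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(timings)


def speedup(cpu_ms, accel_ms):
    """How many times faster the accelerated run is than the CPU run."""
    return cpu_ms / accel_ms


# Usage with coremltools (macOS only). The feed dict is an assumption:
# fill it with the model's real input names and arrays.
#
# import coremltools as ct
# path = "fp16/vocoder.mlpackage"
# cpu = ct.models.MLModel(path, compute_units=ct.ComputeUnit.CPU_ONLY)
# ane = ct.models.MLModel(path, compute_units=ct.ComputeUnit.ALL)
# feed = {}  # e.g. {"<input name>": numpy_array}
# print(speedup(median_latency_ms(cpu.predict, feed),
#               median_latency_ms(ane.predict, feed)))
```

A timing ratio only shows that `ALL` is faster than CPU; it does not prove the work ran on the ANE rather than the GPU, which is why a trace is still the more direct check.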
| |
| ## Conversion notes |
| |
| - Static bucket: T=320 (text length), L=320 (latent length). Inputs are |
| zero-padded on the right and masked. Bucket = 22.3 s of audio. |
| - `duration_predictor`, `text_encoder`, `vocoder` are hand-reimplemented |
| in PyTorch from the ONNX initializers, then traced to CoreML. |
| Per-component float parity vs ONNX: max-abs 2e-5 (dp), cos 0.998 |
| (text_encoder), cos 0.9998 (vocoder). |
| - `vector_estimator` (the heavy diffusion model) goes through |
| `onnxsim.simplify(T=L=320)` -> `onnx2torch.convert` -> `torch.jit.trace` |
| -> coremltools. Cos 0.998 vs ONNX per diffusion step. |
| - The diffusion sampler stays host-side (8 Euler steps over the single |
| step graph). All 4 components are individually quantizable. |
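
The static-bucket padding and the host-side sampler described above can be sketched roughly as follows. This is not the repository's code: the pad id of 0, the 1.0/0.0 mask convention, the `velocity(x, t)` callable standing in for one `vector_estimator` call, and the uniform time grid from 0 to 1 are all assumptions for illustration. The real step graph may take different inputs or return the updated latent directly.

```python
def pad_to_bucket(ids, bucket=320, pad_id=0):
    """Right-pad a token sequence to the fixed bucket length and build its mask."""
    n = len(ids)
    if n > bucket:
        raise ValueError(f"sequence of length {n} exceeds bucket {bucket}")
    padded = list(ids) + [pad_id] * (bucket - n)
    mask = [1.0] * n + [0.0] * (bucket - n)  # 1.0 = real position, 0.0 = padding
    return padded, mask


def euler_sample(velocity, x0, steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    `velocity` is a hypothetical wrapper around a single-step model call
    that returns one value per latent element.
    """
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Keeping the loop on the host means only one step graph needs converting, and the step count can be changed without re-exporting the model.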
|
|
| ## License |
|
|
| This conversion follows the original Supertone/supertonic-3 license |
| (OpenRAIL). See `LICENSE` (or the upstream model card). |
|
|
| ## Why fp16 and not INT4? |
|
|
| We attempted to ship an INT4 variant. Exhaustive testing (summarized |
| in the table below) showed that the supertonic-3 architecture cannot |
| go below INT8 for the vocoder and vector_estimator: |
| |
| | Component | INT4 result | Why | |
| | --- | --- | --- | |
| | **vocoder** | cos ≈ 0 across all 6 palette configs (kmeans/uniform, pgc with group_size 8/16/32/64). Mixed-precision (head-only carve-out) also broken. | HiFi-GAN-style upsampling is uniformly sensitive; INT8 (cos 0.99) is the floor. | |
| | **vector_estimator** | per-step cos 0.82-0.999; over 8 diffusion steps this compounds to garbled / "backwards-sounding" audio. | Flow-matching diffusion accumulates error. INT8 works (per-step cos 0.96-0.998 except short_ja_F3 which is architecturally cos 0.82 even at INT8). | |
| | **duration_predictor** | smallest drift was 0.11 s with pt_uniform, but that is enough to shift L_real -> bucket-leak boundary moves -> pacing perceptibly breaks. | dp output sets the diffusion frame budget; any drift propagates. | |
| | **text_encoder** | cos 0.97 at pgc_g32 (works alone). | Conditioning quality compounds with VE drift. | |
| |
| The best achievable mixed config (vocoder at INT8, the rest at fp16) |
| saves ~25 MB out of 272 MB, which is not worth a separate variant. The |
| fp16 build shipped here is the final deliverable. |
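
The `cos` and max-abs figures quoted in this card are presumably cosine similarity and maximum absolute difference between a reference output and the converted or quantized output. A minimal sketch of both metrics on flat sequences of floats, with hypothetical function names:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))
```

Cosine similarity ignores overall scale, so it suits waveforms and latents; max-abs suits a scalar-like output such as a predicted duration, where the absolute error is what matters.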
| |
| ## Companion build |
| |
| The cross-platform LiteRT version is at |
| [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert). |
| LiteRT ships INT4/INT8 + an INT8 quantized ONNX vector_estimator |
| (65 MB instead of 256 MB), but LiteRT can't reproduce the |
| CoreML-only "bucket-leak" extension, so long prompts sound rushed on |
| LiteRT. Use CoreML for full quality on Apple platforms. |
| |
| ## Credits |
| |
| - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) |
| - CoreML conversion + auto-pad workflow: this repo |
| - Companion LiteRT build: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert) |
| |