---
license: openrail
language:
- en
- ja
- zh
- ko
- es
- fr
- de
- multilingual
library_name: coremltools
tags:
- coreml
- ane
- apple-neural-engine
- text-to-speech
- tts
- audio
- diffusion
- flow-matching
- on-device
- ios
- macos
- fp16
pipeline_tag: text-to-speech
base_model: Supertone/supertonic-3
base_model_relation: quantized
---

# Supertonic-3: CoreML (fp16, ANE-ready)

CoreML conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3), a 99M-parameter multilingual TTS model. All 4 components run on the Apple Neural Engine (1.8–3.7× faster than CPU on M-series chips).

| Component | Size | Role |
| --- | ---: | --- |
| `fp16/duration_predictor.mlpackage` | 15 MB | text -> frame count |
| `fp16/text_encoder.mlpackage` | 71 MB | text -> conditioning latent |
| `fp16/vector_estimator.mlpackage` | 135 MB | flow-matching denoiser (8 steps) |
| `fp16/vocoder.mlpackage` | 51 MB | latent -> 44.1 kHz waveform |
| **Total** | **272 MB** | (originals: ~400 MB ONNX) |

## Quickstart

```bash
pip install coremltools soundfile numpy supertonic
git clone https://huggingface.co/Reza2kn/supertonic-3-coreml
cd supertonic-3-coreml

# Short prompt
python inference.py --text "Hello, world." --voice F1 --lang en --out hello.wav

# Long prompt: use --auto-pad for full content rendering
python inference.py \
  --text "A gentle breeze moved through the open window while everyone listened to the story. The narrator paused, took a slow breath, and continued in a softer tone. Outside, the city carried on, unaware of the quiet moment unfolding inside." \
  --voice F5 --lang en --auto-pad --out long.wav
```

10 voice styles ship in `voice_styles/`: F1–F5 (female), M1–M5 (male). 31 languages supported via `unicode_indexer.json`.

## The auto-pad trick (why `--auto-pad` matters)

The supertonic-3 model has a soft cap on how much speech it renders per utterance.
For long inputs (more than ~13 s of natural speech) the model truncates the prompt and emits a low-amplitude filler tone for the rest of the budget. The CoreML conversion's static bucket (T=L=320) extends this cap by ~3 s, because the bucket's padded positions leak into the real positions through ConvNeXt's dilated convolutions. That is **why CoreML inference sounds more natural than the original ONNX library** (proper word separation, intonation), but it still cuts off mid-sentence on long prompts.

`--auto-pad` is a two-pass workaround:

1. **Pass 1** synthesizes the prompt alone at full bucket length to find where the model's content naturally stops (`t_orig`).
2. **Pass 2** appends a long filler sentence (`" And with that, the gentle silence wrapped itself around the room."`). This gives the model extra frames to fully render the original prompt; it then renders the filler sentence and finally drops into the filler tone.
3. The longest clean-silence gap after `t_orig` is the boundary between the original prompt and the appended filler. The pipeline trims there and tail-pads with 0.5 s of true silence.

Cost: ~2× synthesis time. Worth it for any prompt over ~5 s.

## ANE engagement

All 4 components compile to ANE-resident programs when loaded with `compute_units=ALL` (default). Measured speedups on M2 Pro vs CPU:

| Component | ANE speedup |
| --- | --- |
| duration_predictor | 1.9× |
| text_encoder | 2.8× |
| vector_estimator | 2.4× (per step; 8 steps total) |
| vocoder | 3.7× |

Verify ANE engagement with:

```bash
xctrace record --template "Core ML" --output trace.trace --launch -- python inference.py --text "test"
```

## Conversion notes

- Static bucket: T=320 (text length), L=320 (latent length). Inputs are zero-padded on the right and masked. Bucket = 22.3 s of audio.
- `duration_predictor`, `text_encoder`, `vocoder` are hand-reimplemented in PyTorch from the ONNX initializers, then traced to CoreML.
  Per-component float parity vs ONNX: max-abs 2e-5 (duration_predictor), cos 0.998 (text_encoder), cos 0.9998 (vocoder).
- `vector_estimator` (the heavy diffusion model) goes through `onnxsim.simplify(T=L=320)` -> `onnx2torch.convert` -> `torch.jit.trace` -> coremltools. Cos 0.998 vs ONNX per diffusion step.
- The diffusion sampler stays host-side (8 Euler steps over the single step graph). All 4 components are individually quantizable.

## License

This conversion follows the original Supertone/supertonic-3 license (OpenRAIL). See `LICENSE` (or the upstream model card).

## Why fp16 and not INT4?

We attempted to ship an INT4 variant. After exhaustive testing (summarized in the table below), INT8 turned out to be the lowest usable precision for the vocoder and vector_estimator, and the other two components have problems of their own at INT4:

| Component | INT4 result | Why |
| --- | --- | --- |
| **vocoder** | cos ≈ 0 across all 6 palette configs (kmeans/uniform, pgc with group_size 8/16/32/64). Mixed precision (head-only carve-out) is also broken. | HiFi-GAN-style upsampling is uniformly sensitive; INT8 (cos 0.99) is the floor. |
| **vector_estimator** | Per-step cos 0.82-0.999; over 8 diffusion steps this compounds to garbled / "backwards-sounding" audio. | Flow-matching diffusion accumulates error. INT8 works (per-step cos 0.96-0.998, except short_ja_F3, which is architecturally cos 0.82 even at INT8). |
| **duration_predictor** | Smallest drift was 0.11 s (pt_uniform), still enough to shift L_real, which moves the bucket-leak boundary and perceptibly breaks pacing. | The duration_predictor output sets the diffusion frame budget; any drift propagates. |
| **text_encoder** | cos 0.97 at pgc_g32 (works alone). | Conditioning quality compounds with vector_estimator drift. |

The best achievable mixed config (only the vocoder at INT8, the others at fp16) saves ~25 MB out of 272 MB, which is not worth a separate variant. The fp16 build shipped here is the final deliverable.

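
The parity figures above are quoted as cosine similarity and max-abs difference. The snippet below is a minimal sketch of how such numbers could be computed, assuming the metrics are taken over flattened float outputs; the helper names `cosine_similarity` and `max_abs_diff`, the array shape, and the synthetic data are illustrative and are not part of this repo.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two arrays, each flattened to 1-D."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Convention chosen for this sketch: an all-zero input scores 0.
        return 0.0
    return float(np.dot(a, b) / denom)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.max(np.abs(a - b)))


# Synthetic stand-ins for a reference output and a converted/quantized output.
# The shape is arbitrary; only the last axis mirrors the L=320 bucket.
rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 8, 320))
candidate = reference + 0.01 * rng.standard_normal(reference.shape)

print(f"cos     = {cosine_similarity(reference, candidate):.5f}")
print(f"max-abs = {max_abs_diff(reference, candidate):.5f}")
```

In practice the two arrays would be the ONNX and CoreML outputs of one component for the same input. A per-step score like this does not by itself predict audio quality after 8 sampler steps, which is why the table above also reports the audible result.
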
## Companion build

The cross-platform LiteRT version is at [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert). LiteRT ships INT4/INT8 plus an INT8 quantized ONNX vector_estimator (65 MB instead of 256 MB), but LiteRT can't reproduce the CoreML-only "bucket-leak" extension, so long prompts sound rushed on LiteRT. Use CoreML for full quality on Apple platforms.

## Credits

- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
- CoreML conversion + auto-pad workflow: this repo
- Companion LiteRT build: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)