Reza2kn
/

supertonic-3-coreml

@@ -20,8 +20,12 @@ tags:
 - diffusion
 - flow-matching
 - on-device
 pipeline_tag: text-to-speech
 base_model: Supertone/supertonic-3
 ---
 # Supertonic-3 — CoreML (fp16, ANE-ready)
@@ -121,9 +125,34 @@ xctrace record --template "Core ML" --output trace.trace -- python inference.py
 This conversion follows the original Supertone/supertonic-3 license
 (OpenRAIL). See `LICENSE` (or the upstream model card).
 ## Credits
 - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
 - CoreML conversion + auto-pad workflow: this repo
-INT4 quantized variants coming next.

 - diffusion
 - flow-matching
 - on-device
+- ios
+- macos
+- fp16
 pipeline_tag: text-to-speech
 base_model: Supertone/supertonic-3
+base_model_relation: quantized
 ---
 # Supertonic-3 — CoreML (fp16, ANE-ready)
 This conversion follows the original Supertone/supertonic-3 license
 (OpenRAIL). See `LICENSE` (or the upstream model card).
+## Why fp16 and not INT4?
+We attempted to ship an INT4 variant. After exhaustive testing
+([INT4 sweep notes below](#int4-sweep-results)), INT8 proved to be the
+lowest usable precision for the vocoder and vector_estimator:
+| Component | INT4 result | Why |
+| --- | --- | --- |
+| **vocoder** | cos ≈ 0 across all 6 palette configs (kmeans/uniform, pgc with group_size 8/16/32/64). Mixed-precision (head-only carve-out) also broken. | HiFi-GAN-style upsampling is uniformly sensitive — INT8 (cos 0.99) is the floor. |
+| **vector_estimator** | per-step cos 0.82-0.999; over 8 diffusion steps this compounds to garbled / "backwards-sounding" audio. | Flow-matching diffusion accumulates error. INT8 works (per-step cos 0.96-0.998 except short_ja_F3 which is architecturally cos 0.82 even at INT8). |
+| **duration_predictor** | The smallest drift was 0.11 s (pt_uniform), which is still enough to shift L_real, move the bucket-leak boundary, and perceptibly break pacing. | The duration_predictor output sets the diffusion frame budget; any drift propagates. |
+| **text_encoder** | cos 0.97 at pgc_g32 (acceptable in isolation). | Conditioning error compounds with vector_estimator drift. |
+The best achievable mixed config (vocoder at INT8, everything else fp16)
+saves ~25 MB out of 272 MB (about 9%), which does not justify a separate
+variant. The fp16 build shipped here is the final deliverable.
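The `cos` figures in the table above read as cosine similarity between a reference output and the quantized model's output. The evaluation script itself is not part of this diff, so the following is only a minimal sketch of such a check under that assumption. The function name `cosine_similarity` and the synthetic signals are hypothetical stand-ins, not the repo's actual tooling or data.

```python
import math
import random


def cosine_similarity(reference, candidate):
    """Cosine similarity of two equal-length sequences (1.0 = same direction)."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    dot = sum(a * b for a, b in zip(reference, candidate))
    norm_ref = math.sqrt(sum(a * a for a in reference))
    norm_cand = math.sqrt(sum(b * b for b in candidate))
    if norm_ref == 0.0 or norm_cand == 0.0:
        return 0.0  # an all-zero output carries no direction to compare
    return dot / (norm_ref * norm_cand)


# Synthetic stand-ins (assumption): in a real check these would be the fp16
# model's output and the quantized model's output for the same input.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(24_000)]

# Small additive noise, loosely imitating a mild quantization error.
mild = [x + 0.1 * rng.gauss(0.0, 1.0) for x in reference]

# An unrelated signal, imitating a collapsed quantized output.
broken = [rng.gauss(0.0, 1.0) for _ in range(24_000)]

cos_mild = cosine_similarity(reference, mild)
cos_broken = cosine_similarity(reference, broken)
print(f"mild: {cos_mild:.3f}  broken: {cos_broken:.3f}")
```

A value near 1 means the quantized output still points in the same direction as the reference, while a value near 0 means the two are effectively unrelated, which is how the vocoder row's "cos ≈ 0" would be interpreted under this reading.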
+## Companion build
+The cross-platform LiteRT version is at
+[Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert).
+The LiteRT build ships INT4/INT8 variants plus an INT8-quantized ONNX
+vector_estimator (65 MB instead of 256 MB). However, LiteRT cannot
+reproduce the CoreML-only "bucket-leak" extension, so long prompts sound
+rushed on LiteRT. Use CoreML for full quality on Apple platforms.
 ## Credits
 - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
 - CoreML conversion + auto-pad workflow: this repo
+- Companion LiteRT build: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)