Update README: explain INT4 sweep findings, add base_model_relation tag
README.md
CHANGED
|
@@ -20,8 +20,12 @@ tags:
|
|
| 20 |
- diffusion
|
| 21 |
- flow-matching
|
| 22 |
- on-device
|
| 23 |
pipeline_tag: text-to-speech
|
| 24 |
base_model: Supertone/supertonic-3
|
| 25 |
---
|
| 26 |
|
| 27 |
# Supertonic-3 — CoreML (fp16, ANE-ready)
|
|
@@ -121,9 +125,34 @@ xctrace record --template "Core ML" --output trace.trace -- python inference.py
|
|
| 121 |
This conversion follows the original Supertone/supertonic-3 license
|
| 122 |
(OpenRAIL). See `LICENSE` (or the upstream model card).
|
| 123 |
|
| 124 |
## Credits
|
| 125 |
|
| 126 |
- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
|
| 127 |
- CoreML conversion + auto-pad workflow: this repo
|
| 128 |
-
|
| 129 |
-
INT4 quantized variants coming next.
|
|
|
|
| 20 |
- diffusion
|
| 21 |
- flow-matching
|
| 22 |
- on-device
|
| 23 |
+
- ios
|
| 24 |
+
- macos
|
| 25 |
+
- fp16
|
| 26 |
pipeline_tag: text-to-speech
|
| 27 |
base_model: Supertone/supertonic-3
|
| 28 |
+
base_model_relation: quantized
|
| 29 |
---
|
| 30 |
|
| 31 |
# Supertonic-3 — CoreML (fp16, ANE-ready)
|
|
|
|
| 125 |
This conversion follows the original Supertone/supertonic-3 license
|
| 126 |
(OpenRAIL). See `LICENSE` (or the upstream model card).
|
| 127 |
|
| 128 |
+
## Why fp16 and not INT4?
|
| 129 |
+
|
| 130 |
+
We attempted to ship an INT4 variant. After exhaustive testing
|
| 131 |
+
([INT4 sweep notes below](#int4-sweep-results)), the supertonic-3
|
| 132 |
+
architecture cannot be quantized below INT8 for the vocoder and vector_estimator:
|
| 133 |
+
|
| 134 |
+
| Component | INT4 result | Why |
|
| 135 |
+
| --- | --- | --- |
|
| 136 |
+
| **vocoder** | cos ≈ 0 across all 6 palette configs (kmeans/uniform, pgc with group_size 8/16/32/64). Mixed-precision (head-only carve-out) also broken. | HiFi-GAN-style upsampling is uniformly sensitive — INT8 (cos 0.99) is the floor. |
|
| 137 |
+
| **vector_estimator** | per-step cos 0.82-0.999; over 8 diffusion steps this compounds into garbled, "backwards-sounding" audio. | Flow-matching diffusion accumulates error. INT8 works (per-step cos 0.96-0.998), except on short_ja_F3, which is architecturally limited to cos 0.82 even at INT8. |
|
| 138 |
+
| **duration_predictor** | Smallest drift was 0.11 s (with pt_uniform), still enough to shift L_real, which moves the bucket-leak boundary and perceptibly breaks pacing. | The duration predictor's output sets the diffusion frame budget; any drift propagates. |
|
| 139 |
+
| **text_encoder** | cos 0.97 at pgc_g32 (works alone). | Conditioning quality compounds with VE drift. |
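The `cos` figures in the table can be read as cosine similarity between a quantized component's output and its fp16 reference output (an assumption about the metric, since the sweep scripts are not shown here). The sketch below computes that metric in plain Python and adds a deliberately pessimistic illustration of compounding: it assumes every one of the 8 steps rotates the output by the same angle in the same direction. That assumption is for intuition only and is not a measurement from the sweep; `cosine_similarity` and `worst_case_cos` are hypothetical helper names, not functions from this repo.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def worst_case_cos(per_step_cos, steps=8):
    """Final cosine if each step adds the same angular error (illustrative
    assumption). The total angle is clamped at pi, i.e. fully opposed."""
    angle = math.acos(per_step_cos) * steps
    return math.cos(min(angle, math.pi))

# Made-up stand-ins for an fp16 output and a quantized output.
reference = [0.10, -0.40, 0.25, 0.90]
quantized = [0.12, -0.35, 0.20, 0.95]
print(f"cos = {cosine_similarity(reference, quantized):.4f}")

# Per-step values taken from the ranges quoted in the table.
for c in (0.999, 0.96, 0.82):
    print(f"per-step cos {c}: worst case after 8 steps = {worst_case_cos(c):.3f}")
```

Under this pessimistic model only the top of the quoted range survives 8 steps with a high cosine. Real accumulation depends on the sampler and need not be this adversarial, which is consistent with INT8 working in practice despite per-step values as low as 0.96.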
|
| 140 |
+
|
| 141 |
+
The best achievable mixed config (vocoder at INT8, everything else fp16) saves
|
| 142 |
+
~25 MB out of 272 MB — not worth a separate variant. The fp16 build
|
| 143 |
+
shipped here is the final deliverable.
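To put the saving in proportion, the arithmetic can be checked with a few lines of Python. This is only a sketch using the approximate sizes quoted above; `percent_saved` is a hypothetical helper, not part of this repo.

```python
def percent_saved(saved_mb, total_mb):
    """Share of the total package size removed, in percent."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    return 100.0 * saved_mb / total_mb

# Approximate figures quoted in this README: ~25 MB saved out of 272 MB.
print(f"mixed config saves about {percent_saved(25, 272):.1f}% of the package")
```

That works out to roughly 9% of the download, which is the basis for not shipping a separate variant.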
|
| 144 |
+
|
| 145 |
+
## Companion build
|
| 146 |
+
|
| 147 |
+
The cross-platform LiteRT version is at
|
| 148 |
+
[Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert).
|
| 149 |
+
LiteRT ships INT4/INT8 variants plus an INT8-quantized ONNX vector_estimator
|
| 150 |
+
(65 MB instead of 256 MB) — but LiteRT can't reproduce the
|
| 151 |
+
CoreML-only "bucket-leak" extension, so long prompts sound rushed on
|
| 152 |
+
LiteRT. Use CoreML for full quality on Apple platforms.
|
| 153 |
+
|
| 154 |
## Credits
|
| 155 |
|
| 156 |
- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
|
| 157 |
- CoreML conversion + auto-pad workflow: this repo
|
| 158 |
+
- Companion LiteRT build: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)
|
|
|