Reza2kn committed on
Commit 2a9df0d · verified · 1 Parent(s): 25a89bd

Update README: explain INT4 sweep findings, add base_model_relation tag

Files changed (1)
  1. README.md +31 -2
README.md CHANGED
@@ -20,8 +20,12 @@ tags:
20
  - diffusion
21
  - flow-matching
22
  - on-device
23
  pipeline_tag: text-to-speech
24
  base_model: Supertone/supertonic-3
 
25
  ---
26
 
27
  # Supertonic-3 — CoreML (fp16, ANE-ready)
@@ -121,9 +125,34 @@ xctrace record --template "Core ML" --output trace.trace -- python inference.py
121
  This conversion follows the original Supertone/supertonic-3 license
122
  (OpenRAIL). See `LICENSE` (or the upstream model card).
123
 
124
  ## Credits
125
 
126
  - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
127
  - CoreML conversion + auto-pad workflow: this repo
128
-
129
- INT4 quantized variants coming next.
 
20
  - diffusion
21
  - flow-matching
22
  - on-device
23
+ - ios
24
+ - macos
25
+ - fp16
26
  pipeline_tag: text-to-speech
27
  base_model: Supertone/supertonic-3
28
+ base_model_relation: quantized
29
  ---
30
 
31
  # Supertonic-3 — CoreML (fp16, ANE-ready)
 
125
  This conversion follows the original Supertone/supertonic-3 license
126
  (OpenRAIL). See `LICENSE` (or the upstream model card).
127
 
128
+ ## Why fp16 and not INT4?
129
+
130
+ We attempted to ship an INT4 variant. After exhaustive testing
131
+ ([INT4 sweep notes below](#int4-sweep-results)), the supertonic-3
132
+ architecture cannot go below INT8 for the vocoder and vector_estimator:
133
+
134
+ | Component | INT4 result | Why |
135
+ | --- | --- | --- |
136
+ | **vocoder** | cos ≈ 0 across all 6 palette configs (kmeans/uniform, pgc with group_size 8/16/32/64). Mixed-precision (head-only carve-out) also broken. | HiFi-GAN-style upsampling is uniformly sensitive — INT8 (cos 0.99) is the floor. |
137
+ | **vector_estimator** | per-step cos 0.82-0.999; over 8 diffusion steps this compounds to garbled / "backwards-sounding" audio. | Flow-matching diffusion accumulates error. INT8 works (per-step cos 0.96-0.998 except short_ja_F3 which is architecturally cos 0.82 even at INT8). |
138
+ | **duration_predictor** | Smallest drift was 0.11 s (with pt_uniform), but that is enough to shift L_real, which moves the bucket-leak boundary and perceptibly breaks pacing. | dp output sets the diffusion frame budget; any drift propagates. |
139
+ | **text_encoder** | cos 0.97 at pgc_g32 (works alone). | Conditioning quality compounds with VE drift. |
140
+
141
+ The best achievable mixed config (vocoder at INT8, everything else fp16) saves
142
+ ~25 MB out of 272 MB — not worth a separate variant. The fp16 build
143
+ shipped here is the final deliverable.
144
+
145
+ ## Companion build
146
+
147
+ The cross-platform LiteRT version is at
148
+ [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert).
149
+ LiteRT ships INT4/INT8 + an INT8 quantized ONNX vector_estimator
150
+ (65 MB instead of 256 MB) — but LiteRT can't reproduce the
151
+ CoreML-only "bucket-leak" extension, so long prompts sound rushed on
152
+ LiteRT. Use CoreML for full quality on Apple platforms.
153
+
154
  ## Credits
155
 
156
  - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
157
  - CoreML conversion + auto-pad workflow: this repo
158
+ - Companion LiteRT build: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)
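
The `cos` figures in the INT4 table read as cosine similarities between a component's fp16 output and its quantized output, although the README does not state this explicitly. A minimal sketch of such a check, assuming both outputs are already flattened to plain lists of floats; the function name and the sample values are hypothetical and are not taken from the repo:

```python
import math


def cosine_similarity(reference, candidate):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_cand = math.sqrt(sum(c * c for c in candidate))
    if norm_ref == 0.0 or norm_cand == 0.0:
        # Treat an all-zero output as sharing no direction with the reference.
        return 0.0
    return dot / (norm_ref * norm_cand)


# Hypothetical stand-ins for flattened model outputs (not real measurements).
fp16_output = [0.10, -0.25, 0.40, 0.05]
quantized_output = [0.11, -0.24, 0.38, 0.06]   # small perturbation of fp16
unrelated_output = [0.25, 0.10, 0.05, -0.40]   # orthogonal to fp16

close_score = cosine_similarity(fp16_output, quantized_output)
broken_score = cosine_similarity(fp16_output, unrelated_output)
print(f"close: {close_score:.4f}, broken: {broken_score:.4f}")
```

A score near 1 means the quantized output points in almost the same direction as the fp16 reference, while a score near 0 (as reported for the INT4 vocoder) means the two outputs are essentially unrelated.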