Commit 25a89bd by Reza2kn (verified) · 1 Parent(s): 71acad6

Initial upload: fp16 CoreML + auto-pad inference + README

README.md ADDED
@@ -0,0 +1,129 @@
+ ---
+ license: openrail
+ language:
+ - en
+ - ja
+ - zh
+ - ko
+ - es
+ - fr
+ - de
+ - multilingual
+ library_name: coremltools
+ tags:
+ - coreml
+ - ane
+ - apple-neural-engine
+ - text-to-speech
+ - tts
+ - audio
+ - diffusion
+ - flow-matching
+ - on-device
+ pipeline_tag: text-to-speech
+ base_model: Supertone/supertonic-3
+ ---
+
+ # Supertonic-3: CoreML (fp16, ANE-ready)
+
+ CoreML conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3),
+ a 99M-parameter multilingual TTS model. All 4 components run on the
+ Apple Neural Engine (1.9–3.7× faster than CPU, measured on an M2 Pro).
+
+ | Component | Size | Role |
+ | --- | ---: | --- |
+ | `fp16/duration_predictor.mlpackage` | 15 MB | text -> frame count |
+ | `fp16/text_encoder.mlpackage` | 71 MB | text -> conditioning latent |
+ | `fp16/vector_estimator.mlpackage` | 135 MB | flow-matching denoiser (8 steps) |
+ | `fp16/vocoder.mlpackage` | 51 MB | latent -> 44.1 kHz waveform |
+ | **Total** | **272 MB** | (originals: ~400 MB ONNX) |
+
+ ## Quickstart
+
+ ```bash
+ pip install coremltools soundfile numpy supertonic
+ git clone https://huggingface.co/Reza2kn/supertonic-3-coreml
+ cd supertonic-3-coreml
+
+ # Short prompt
+ python inference.py --text "Hello, world." --voice F1 --lang en --out hello.wav
+
+ # Long prompt: use --auto-pad for full content rendering
+ python inference.py \
+   --text "A gentle breeze moved through the open window while everyone listened to the story. The narrator paused, took a slow breath, and continued in a softer tone. Outside, the city carried on, unaware of the quiet moment unfolding inside." \
+   --voice F5 --lang en --auto-pad --out long.wav
+ ```
+
+ 10 voice styles ship in `voice_styles/`: F1–F5 (female), M1–M5 (male).
+ 31 languages are supported via `unicode_indexer.json`.
+
+ ## The auto-pad trick (why `--auto-pad` matters)
+
+ The supertonic-3 model has a soft cap on how much speech it renders per
+ utterance. For long inputs (more than ~13 s of natural speech) the model
+ truncates the prompt and emits a low-amplitude filler tone for the rest
+ of the budget. The CoreML conversion's static bucket (T=L=320) extends
+ this cap by ~3 s because the bucket's padded positions leak into
+ the real positions through ConvNeXt's dilated convolutions. That is
+ **why CoreML inference sounds more natural than the original ONNX
+ library** (proper word separation, intonation), but it still cuts off
+ mid-sentence on long prompts.
+
+ `--auto-pad` is a two-pass workaround:
+
+ 1. **Pass 1** synthesizes the prompt alone at full bucket length to find
+    where the model's content naturally stops (`t_orig`).
+ 2. **Pass 2** appends a long filler sentence
+    (`" And with that, the gentle silence wrapped itself around the room."`).
+    The extra text gives the model enough frames to fully render the
+    original prompt; the model then renders the filler sentence and drops
+    into the filler tone.
+ 3. The longest clean-silence gap after `t_orig` is the boundary between
+    the original prompt and the appended filler. The pipeline trims
+    there and tail-pads with 0.5 s of true silence.
+
+ Cost: ~2× synthesis time. Worth it for any prompt over ~5 s.
+
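Reviewer note on step 3 of the auto-pad section: the boundary search can be illustrated on a toy signal. The sketch below is a simplified, standard-library-only illustration, not the code in `inference.py` (which uses NumPy, restricts the search to the span between the last loud windows of the two passes, and only counts runs of at least two quiet windows). The helper names `window_rms` and `longest_quiet_run`, the 10-sample window and the toy amplitudes are assumptions made for this example; only the 0.025 threshold is taken from the script.

```python
import math


def window_rms(samples, win):
    """RMS of each complete, non-overlapping window of `win` samples."""
    n = len(samples) // win
    return [
        math.sqrt(sum(x * x for x in samples[i * win:(i + 1) * win]) / win)
        for i in range(n)
    ]


def longest_quiet_run(rms, start, thresh):
    """Return (first_index, length) of the longest run of windows whose RMS
    is below `thresh`, searching from `start`; None if no window is quiet."""
    best = None
    i = start
    while i < len(rms):
        if rms[i] < thresh:
            j = i
            while j < len(rms) and rms[j] < thresh:
                j += 1
            if best is None or j - i > best[1]:
                best = (i, j - i)
            i = j
        else:
            i += 1
    return best


# Toy signal: speech, short pause, speech, long pause, appended filler.
WIN = 10  # samples per window (hypothetical; the script uses 50 ms windows)
loud, quiet = [0.5] * WIN, [0.0] * WIN
signal = loud * 3 + quiet * 1 + loud * 2 + quiet * 4 + loud * 3

rms = window_rms(signal, WIN)
gap = longest_quiet_run(rms, start=0, thresh=0.025)
cut_sample = gap[0] * WIN  # trim point: first sample of the longest gap
print(gap, cut_sample)
```

The short pause inside the prompt is skipped because the pause before the appended filler is longer, which is the property the trimming step relies on.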
+ ## ANE engagement
+
+ All 4 components compile to ANE-resident programs when loaded with
+ `compute_units=ALL` (default). Measured speedups on M2 Pro vs CPU:
+
+ | Component | ANE speedup |
+ | --- | --- |
+ | duration_predictor | 1.9× |
+ | text_encoder | 2.8× |
+ | vector_estimator | 2.4× (per step; 8 steps total) |
+ | vocoder | 3.7× |
+
+ Verify ANE engagement with:
+
+ ```bash
+ xctrace record --template "Core ML" --output trace.trace --launch -- python inference.py --text "test"
+ ```
+
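Reviewer note on the speedup table: the README does not say how the numbers were measured. One plausible way to reproduce a ratio like these is to time the same `predict` call on a model loaded twice, once with `compute_units=ct.ComputeUnit.CPU_ONLY` and once with `ct.ComputeUnit.ALL`. The sketch below shows only the timing harness, with stand-in workloads so that it runs anywhere; `median_latency` and `speedup` are hypothetical helper names, and the warm-up and run counts are arbitrary choices.

```python
import statistics
import time


def median_latency(fn, warmup=2, runs=5):
    """Median wall-clock seconds of fn() after a few warm-up calls.

    Warm-up matters for Core ML because the first prediction typically
    includes one-time compilation and loading cost.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Stand-in workloads. With coremltools on macOS these would instead be
#   lambda: cpu_model.predict(inputs)   # MLModel(..., compute_units=ct.ComputeUnit.CPU_ONLY)
#   lambda: ane_model.predict(inputs)   # MLModel(..., compute_units=ct.ComputeUnit.ALL)
def slow():
    sum(range(200_000))


def fast():
    sum(range(20_000))


ratio = speedup(median_latency(slow), median_latency(fast))
print(f"{ratio:.1f}x")
```

Using the median rather than the mean keeps a single scheduling hiccup from distorting the ratio.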
+ ## Conversion notes
+
+ - Static bucket: T=320 (text length), L=320 (latent length). Inputs are
+   zero-padded on the right and masked. Bucket = 22.3 s of audio.
+ - `duration_predictor`, `text_encoder`, `vocoder` are hand-reimplemented
+   in PyTorch from the ONNX initializers, then traced to CoreML.
+   Per-component float parity vs ONNX: max-abs 2e-5 (duration_predictor),
+   cos 0.998 (text_encoder), cos 0.9998 (vocoder).
+ - `vector_estimator` (the heavy diffusion model) goes through
+   `onnxsim.simplify(T=L=320)` -> `onnx2torch.convert` -> `torch.jit.trace`
+   -> coremltools. Cos 0.998 vs ONNX per diffusion step.
+ - The diffusion sampler stays host-side (8 Euler steps over the
+   single-step graph). All 4 components are individually quantizable.
+
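Reviewer note on the conversion notes: the bucket length and the two parity metrics can be checked with a few lines of arithmetic. The sketch below assumes "max-abs" means the largest element-wise absolute difference and "cos" means cosine similarity over the flattened output; the README does not define either term. The constants 6, 512 and 44100 come from `inference.py` in this commit, while the two short vectors are made-up stand-ins for an ONNX output and a CoreML output.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two flat sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def bucket_seconds(latent_frames, chunk_compress_factor=6,
                   base_chunk_size=512, sample_rate=44_100):
    """Audio duration covered by `latent_frames` latent positions."""
    return latent_frames * chunk_compress_factor * base_chunk_size / sample_rate


# Hypothetical reference (ONNX) and converted (CoreML) outputs.
reference = [0.10, -0.20, 0.30, 0.40]
converted = [0.10, -0.20, 0.30, 0.41]

print(max_abs_diff(reference, converted))
print(cosine_similarity(reference, converted))
print(round(bucket_seconds(320), 1))  # L=320 latent frames
```

Cosine similarity is insensitive to a uniform scale error, so reporting it alongside a max-abs figure, as the notes do for the duration predictor, gives a fuller picture.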
+ ## License
+
+ This conversion follows the original Supertone/supertonic-3 license
+ (OpenRAIL). See `LICENSE` (or the upstream model card).
+
+ ## Credits
+
+ - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
+ - CoreML conversion + auto-pad workflow: this repo
+
+ INT4 quantized variants coming next.
fp16/duration_predictor.mlpackage/Data/com.apple.CoreML/model.mlmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0e21c025c5b03d75cf4bee36f1143d313acc3f6126554fee795d3ce81215bb5b
+ size 292296
fp16/duration_predictor.mlpackage/Data/com.apple.CoreML/weights/weight.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c09b368b045582ade6d153550d64d42ce3642fd27d270d539b923a0054c80d53
+ size 15128768
fp16/duration_predictor.mlpackage/Manifest.json ADDED
@@ -0,0 +1,18 @@
+ {
+     "fileFormatVersion": "1.0.0",
+     "itemInfoEntries": {
+         "CC4B4953-3EB5-4D7E-A01F-10DBAF546402": {
+             "author": "com.apple.CoreML",
+             "description": "CoreML Model Weights",
+             "name": "weights",
+             "path": "com.apple.CoreML/weights"
+         },
+         "DD17ECC9-71D9-4AF3-92F7-F6695D27F1F9": {
+             "author": "com.apple.CoreML",
+             "description": "CoreML Model Specification",
+             "name": "model.mlmodel",
+             "path": "com.apple.CoreML/model.mlmodel"
+         }
+     },
+     "rootModelIdentifier": "DD17ECC9-71D9-4AF3-92F7-F6695D27F1F9"
+ }
fp16/text_encoder.mlpackage/Data/com.apple.CoreML/model.mlmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0ec80244148794cc813893607a52af0b55c7a7596649039cc142b686944d0de9
+ size 540341
fp16/text_encoder.mlpackage/Data/com.apple.CoreML/weights/weight.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6f94addefeb3a05c1c5f28ff594cb976fc50f8596c982fb37f53c6ff200c2f14
+ size 70399168
fp16/text_encoder.mlpackage/Manifest.json ADDED
@@ -0,0 +1,18 @@
+ {
+     "fileFormatVersion": "1.0.0",
+     "itemInfoEntries": {
+         "1FEF1B1A-5400-4BCB-807C-F65684E0B270": {
+             "author": "com.apple.CoreML",
+             "description": "CoreML Model Specification",
+             "name": "model.mlmodel",
+             "path": "com.apple.CoreML/model.mlmodel"
+         },
+         "CF6B5B08-E226-47FF-9857-965147569173": {
+             "author": "com.apple.CoreML",
+             "description": "CoreML Model Weights",
+             "name": "weights",
+             "path": "com.apple.CoreML/weights"
+         }
+     },
+     "rootModelIdentifier": "1FEF1B1A-5400-4BCB-807C-F65684E0B270"
+ }
fp16/vector_estimator.mlpackage/Data/com.apple.CoreML/model.mlmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f02c1334d3bbbcc96718dd862d9ef6252d0b27dd9915b1402062ffc13c965159
+ size 362380
fp16/vector_estimator.mlpackage/Data/com.apple.CoreML/weights/weight.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dfb90a6f12980209b627aeca9641866826a3085ac6dda6590d19847806c8d4f0
+ size 134491840
fp16/vector_estimator.mlpackage/Manifest.json ADDED
@@ -0,0 +1,18 @@
+ {
+     "fileFormatVersion": "1.0.0",
+     "itemInfoEntries": {
+         "2441BE30-7851-43DD-8CF1-66C3121FBF9C": {
+             "author": "com.apple.CoreML",
+             "description": "CoreML Model Weights",
+             "name": "weights",
+             "path": "com.apple.CoreML/weights"
+         },
+         "F97584CB-19BF-403B-80AF-E194801E6AC9": {
+             "author": "com.apple.CoreML",
+             "description": "CoreML Model Specification",
+             "name": "model.mlmodel",
+             "path": "com.apple.CoreML/model.mlmodel"
+         }
+     },
+     "rootModelIdentifier": "F97584CB-19BF-403B-80AF-E194801E6AC9"
+ }
fp16/vocoder.mlpackage/Data/com.apple.CoreML/model.mlmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:71ab85bf16a8e088a5511c4dbcc67e7b167563d643d22d4dc3f873100359e1d7
+ size 70325
fp16/vocoder.mlpackage/Data/com.apple.CoreML/weights/weight.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b45e4c17a45cc4de30dcedae9f662548aa35f8f9763632c342b8bbef3089fcee
+ size 50672512
fp16/vocoder.mlpackage/Manifest.json ADDED
@@ -0,0 +1,18 @@
+ {
+     "fileFormatVersion": "1.0.0",
+     "itemInfoEntries": {
+         "1AAA4924-9DF4-4960-8DFF-D575A8583886": {
+             "author": "com.apple.CoreML",
+             "description": "CoreML Model Weights",
+             "name": "weights",
+             "path": "com.apple.CoreML/weights"
+         },
+         "250BB268-FD46-4919-822F-7B1BB10FCECC": {
+             "author": "com.apple.CoreML",
+             "description": "CoreML Model Specification",
+             "name": "model.mlmodel",
+             "path": "com.apple.CoreML/model.mlmodel"
+         }
+     },
+     "rootModelIdentifier": "250BB268-FD46-4919-822F-7B1BB10FCECC"
+ }
inference.py ADDED
@@ -0,0 +1,194 @@
+ """End-to-end TTS inference using the 4 CoreML components.
+
+ Pipeline (mirrors supertonic.core.Supertonic):
+     text -> tokenize
+          -> duration_predictor -> frame count
+          -> text_encoder -> text embedding
+          -> sample noisy latent ~ N(0, I)
+          -> vector_estimator x 8 (flow-matching ODE step, runs on ANE)
+          -> vocoder -> 44.1 kHz waveform
+
+ All four mlpackages are static-shape buckets at T=L=320. The driver pads
+ inputs to that bucket and trims outputs.
+
+ The supertonic-3 model truncates long prompts at its content limit
+ (~13.7s natural; CoreML's bucket-leak extends this to ~16.7s but still
+ short for long inputs). The `--auto-pad` mode does a two-pass synthesis
+ (once unpadded to find the natural endpoint, once with a long filler
+ sentence appended that gives the model more frames to render the full
+ original prompt), then trims at the silence gap between original and
+ appended content. Recommended for prompts longer than ~5s.
+
+ Usage:
+     python inference.py --text "Hello, world." --voice F1 --lang en
+     python inference.py --text "<longer prompt>" --voice F5 --lang en --auto-pad
+ """
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import sys
+ import time
+ from pathlib import Path
+
+ import coremltools as ct
+ import numpy as np
+ import soundfile as sf
+
+ HERE = Path(__file__).parent
+ T_BUCKET = 320
+ L_BUCKET = 320
+ SAMPLE_RATE = 44_100
+ LATENT_DIM = 24
+ CHUNK_COMPRESS_FACTOR = 6
+ BASE_CHUNK_SIZE = 512
+ DEFAULT_TOTAL_STEPS = 8
+ DEFAULT_SPEED = 1.05
+ DEFAULT_AUTO_PAD = " And with that, the gentle silence wrapped itself around the room."
+
+
+ def _pad(arr: np.ndarray, axis: int, target: int) -> np.ndarray:
+     if arr.shape[axis] >= target:
+         return arr
+     pad = [(0, 0)] * arr.ndim
+     pad[axis] = (0, target - arr.shape[axis])
+     return np.pad(arr, pad)
+
+
+ def _load_voice(name: str) -> tuple[np.ndarray, np.ndarray]:
+     j = json.loads((HERE / "voice_styles" / f"{name}.json").read_text())
+     def r(part): return np.array(part["data"], dtype=np.float32).reshape(*part["dims"])
+     return r(j["style_ttl"]), r(j["style_dp"])
+
+
+ def _load_tokenizer(indexer_path: Path):
+     """Reuse the official supertonic UnicodeProcessor (handles the 31
+     languages, abbreviation expansion, punctuation rules, etc.).
+     Install with: pip install supertonic
+     """
+     try:
+         from supertonic.core import UnicodeProcessor
+     except ImportError as e:
+         raise RuntimeError(
+             "supertonic package is required for tokenization. "
+             "Install with: pip install supertonic"
+         ) from e
+     return UnicodeProcessor(str(indexer_path))
+
+
+ def _last_loud_window(audio: np.ndarray, thresh: float = 0.025, win_s: float = 0.05) -> int:
+     win = int(win_s * SAMPLE_RATE)
+     n = len(audio) // win
+     rms = np.sqrt(np.mean(audio[: n * win].reshape(n, win) ** 2, axis=1))
+     loud = np.where(rms > thresh)[0]
+     return int(loud[-1]) if len(loud) else 0
+
+
+ def trim_padded(unpad: np.ndarray, padded: np.ndarray) -> np.ndarray:
+     """Trim padded synthesis at the longest clean silence between original
+     prompt and appended suffix. Tail-pad with 0.5 s of true silence."""
+     win = int(0.05 * SAMPLE_RATE)
+     n = len(padded) // win
+     rms = np.sqrt(np.mean(padded[: n * win].reshape(n, win) ** 2, axis=1))
+     floor = _last_loud_window(unpad)
+     ceil_ = _last_loud_window(padded) + 1
+     candidates = []
+     j = floor
+     while j < ceil_ - 1:
+         if rms[j] < 0.025 and rms[j + 1] < 0.025:
+             start = j; total = 0.0; cnt = 0
+             while j < ceil_ and rms[j] < 0.025:
+                 total += float(rms[j]); cnt += 1; j += 1
+             candidates.append((start, cnt, total / max(cnt, 1)))
+         else:
+             j += 1
+     if not candidates:
+         return padded[: ceil_ * win]
+     start_win, length, avg = max(candidates, key=lambda c: (c[1], -c[0]))
+     end_samples = start_win * win
+     out = padded[:end_samples].copy()
+     fade = min(int(0.06 * SAMPLE_RATE), len(out))
+     out[-fade:] *= np.linspace(1.0, 0.0, fade, dtype=np.float32)
+     return np.concatenate([out, np.zeros(int(0.5 * SAMPLE_RATE), dtype=np.float32)])
+
+
+ class Supertonic3CoreML:
+     def __init__(self, quant: str = "fp16"):
+         d = HERE / quant
+         self.dp = ct.models.MLModel(str(d / "duration_predictor.mlpackage"))
+         self.te = ct.models.MLModel(str(d / "text_encoder.mlpackage"))
+         self.ve = ct.models.MLModel(str(d / "vector_estimator.mlpackage"))
+         self.voc = ct.models.MLModel(str(d / "vocoder.mlpackage"))
+         self.tok = _load_tokenizer(HERE / "unicode_indexer.json")
+
+     def _synth(self, text: str, voice: str, lang: str, seed: int,
+                total_steps: int, speed: float, full_bucket: bool) -> np.ndarray:
+         text_ids, text_mask = self.tok([text], lang)
+         text_ids = text_ids.astype(np.int64); text_mask = text_mask.astype(np.float32)
+         style_ttl, style_dp = _load_voice(voice)
+         text_ids_p = _pad(text_ids.astype(np.int32), 1, T_BUCKET)
+         text_mask_p = _pad(text_mask, 2, T_BUCKET)
+         dur = float(self.dp.predict({"text_ids": text_ids_p, "style_dp": style_dp,
+                                      "text_mask": text_mask_p})["duration"][0]) / speed
+         text_emb = self.te.predict({"text_ids": text_ids_p, "style_ttl": style_ttl,
+                                     "text_mask": text_mask_p})["text_emb"]
+         L_real = max(1, min(L_BUCKET, (int(dur * SAMPLE_RATE) + BASE_CHUNK_SIZE * CHUNK_COMPRESS_FACTOR - 1) // (BASE_CHUNK_SIZE * CHUNK_COMPRESS_FACTOR)))
+         np.random.seed(seed)
+         xt = (np.random.randn(1, LATENT_DIM * CHUNK_COMPRESS_FACTOR, L_real)).astype(np.float32)
+         latent_mask = np.ones((1, 1, L_real), dtype=np.float32)
+         xt = xt * latent_mask
+         xt = _pad(xt, 2, L_BUCKET)
+         latent_mask = _pad(latent_mask, 2, L_BUCKET)
+         total_step_arr = np.array([float(total_steps)], dtype=np.float32)
+         for step in range(total_steps):
+             xt = self.ve.predict({
+                 "noisy_latent": xt, "text_emb": text_emb, "style_ttl": style_ttl,
+                 "text_mask": text_mask_p, "latent_mask": latent_mask,
+                 "current_step": np.array([float(step)], dtype=np.float32),
+                 "total_step": total_step_arr,
+             })["denoised_latent"]
+         wav = self.voc.predict({"latent": xt})["wav_tts"][0]
+         if full_bucket:
+             return wav
+         return wav[: L_real * CHUNK_COMPRESS_FACTOR * BASE_CHUNK_SIZE]
+
+     def synthesize(self, text: str, voice: str = "F1", lang: str = "en", seed: int = 0,
+                    total_steps: int = DEFAULT_TOTAL_STEPS, speed: float = DEFAULT_SPEED,
+                    auto_pad: str | None = DEFAULT_AUTO_PAD) -> np.ndarray:
+         """Synthesize speech. With ``auto_pad`` set, runs the 2-pass auto-pad
+         flow for full content rendering on longer prompts."""
+         if auto_pad is None:
+             return self._synth(text, voice, lang, seed, total_steps, speed, full_bucket=False)
+         unpad_audio = self._synth(text, voice, lang, seed, total_steps, speed, full_bucket=True)
+         pad_audio = self._synth(text + auto_pad, voice, lang, seed, total_steps, speed, full_bucket=True)
+         return trim_padded(unpad_audio, pad_audio)
+
+
+ def main() -> int:
+     ap = argparse.ArgumentParser()
+     ap.add_argument("--text", required=True, help="Text to synthesize")
+     ap.add_argument("--voice", default="F1", choices=[f"F{i}" for i in range(1, 6)] + [f"M{i}" for i in range(1, 6)])
+     ap.add_argument("--lang", default="en")
+     ap.add_argument("--seed", type=int, default=0)
+     ap.add_argument("--total-steps", type=int, default=DEFAULT_TOTAL_STEPS)
+     ap.add_argument("--auto-pad", nargs="?", const=DEFAULT_AUTO_PAD, default=None,
+                     help="2-pass synthesis with filler suffix + auto-trim (recommended).")
+     ap.add_argument("--quant", default="fp16", choices=["fp16"])
+     ap.add_argument("--out", default="out.wav")
+     args = ap.parse_args()
+
+     t0 = time.time()
+     tts = Supertonic3CoreML(quant=args.quant)
+     print(f"Loaded models in {time.time() - t0:.2f}s")
+
+     t0 = time.time()
+     audio = tts.synthesize(args.text, voice=args.voice, lang=args.lang, seed=args.seed,
+                            total_steps=args.total_steps, auto_pad=args.auto_pad)
+     dur = len(audio) / SAMPLE_RATE
+     sf.write(args.out, audio, SAMPLE_RATE)
+     print(f"Synthesized {dur:.2f}s of audio in {time.time() - t0:.2f}s -> {args.out}")
+     return 0
+
+
+ if __name__ == "__main__":
+     sys.exit(main())
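Reviewer note on the sampling loop in `_synth`: the script calls `vector_estimator` once per step and feeds each output back in as the next input, passing `current_step` and `total_step`. The README describes this as 8 Euler steps. What the graph computes internally is not visible in this diff, so the sketch below is only a generic illustration of fixed-step Euler integration for flow matching, assuming an update of the form x <- x + v(x, t) / N. The function `euler_sample` and the toy straight-line velocity field are hypothetical and stand in for the real model.

```python
def euler_sample(x0, velocity, total_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / total_steps
    for step in range(total_steps):
        t = step / total_steps
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup: the path from noise to target is the straight line
# x_t = (1 - t) * noise + t * target, whose velocity is constant.
noise = [1.0, -2.0, 0.5]
target = [0.0, 1.0, 2.0]


def straight_line_velocity(x, t):
    # Ignores x and t on purpose: a straight path has constant velocity.
    return [b - a for a, b in zip(noise, target)]


sample = euler_sample(noise, straight_line_velocity, total_steps=8)
print(sample)
```

With a constant velocity field any number of Euler steps lands on the target; a learned field is only approximately straight, which is why the step count affects quality in practice.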
tts.json ADDED
@@ -0,0 +1,311 @@
+ {
+     "tts_version": "v1.7.3",
+     "split": "opensource-multilingual",
+     "ttl": {
+         "latent_dim": 24,
+         "chunk_compress_factor": 6,
+         "batch_expander": {
+             "n_batch_expand": 6
+         },
+         "normalizer": {
+             "scale": 0.25
+         },
+         "text_encoder": {
+             "n_langs": 0,
+             "lang_emb_dim": 0,
+             "text_embedder": {
+                 "char_emb_dim": 256
+             },
+             "convnext": {
+                 "idim": 256,
+                 "ksz": 5,
+                 "intermediate_dim": 1024,
+                 "num_layers": 6,
+                 "dilation_lst": [
+                     1,
+                     1,
+                     2,
+                     2,
+                     4,
+                     4
+                 ]
+             },
+             "attn_encoder": {
+                 "hidden_channels": 256,
+                 "filter_channels": 1024,
+                 "n_heads": 4,
+                 "n_layers": 4,
+                 "p_dropout": 0.0
+             },
+             "proj_out": {
+                 "idim": 256,
+                 "odim": 256
+             }
+         },
+         "flow_matching": {
+             "sig_min": 1e-08
+         },
+         "style_encoder": {
+             "proj_in": {
+                 "ldim": 24,
+                 "chunk_compress_factor": 6,
+                 "odim": 256
+             },
+             "convnext": {
+                 "idim": 256,
+                 "ksz": 5,
+                 "intermediate_dim": 1024,
+                 "num_layers": 6,
+                 "dilation_lst": [
+                     1,
+                     1,
+                     1,
+                     1,
+                     1,
+                     1
+                 ]
+             },
+             "style_token_layer": {
+                 "input_dim": 256,
+                 "n_style": 50,
+                 "style_key_dim": 256,
+                 "style_value_dim": 256,
+                 "prototype_dim": 256,
+                 "n_units": 256,
+                 "n_heads": 2
+             }
+         },
+         "speech_prompted_text_encoder": {
+             "text_dim": 256,
+             "style_dim": 256,
+             "n_units": 256,
+             "n_heads": 2
+         },
+         "uncond_masker": {
+             "prob_both_uncond": 0.04,
+             "prob_text_uncond": 0.01,
+             "std": 0.1,
+             "text_dim": 256,
+             "n_style": 50,
+             "style_key_dim": 256,
+             "style_value_dim": 256
+         },
+         "vector_field": {
+             "n_langs": 0,
+             "lang_emb_dim": 0,
+             "proj_in": {
+                 "ldim": 24,
+                 "chunk_compress_factor": 6,
+                 "odim": 512
+             },
+             "time_encoder": {
+                 "time_dim": 64,
+                 "hdim": 256
+             },
+             "main_blocks": {
+                 "n_blocks": 4,
+                 "time_cond_layer": {
+                     "idim": 512,
+                     "time_dim": 64
+                 },
+                 "style_cond_layer": {
+                     "idim": 512,
+                     "style_dim": 256
+                 },
+                 "text_cond_layer": {
+                     "idim": 512,
+                     "text_dim": 256,
+                     "n_heads": 8,
+                     "n_units": 512,
+                     "use_residual": true,
+                     "rotary_base": 10000,
+                     "rotary_scale": 10
+                 },
+                 "convnext_0": {
+                     "idim": 512,
+                     "ksz": 5,
+                     "intermediate_dim": 2048,
+                     "num_layers": 4,
+                     "dilation_lst": [
+                         1,
+                         2,
+                         4,
+                         8
+                     ]
+                 },
+                 "convnext_1": {
+                     "idim": 512,
+                     "ksz": 5,
+                     "intermediate_dim": 2048,
+                     "num_layers": 1,
+                     "dilation_lst": [
+                         1
+                     ]
+                 },
+                 "convnext_2": {
+                     "idim": 512,
+                     "ksz": 5,
+                     "intermediate_dim": 2048,
+                     "num_layers": 1,
+                     "dilation_lst": [
+                         1
+                     ]
+                 }
+             },
+             "last_convnext": {
+                 "idim": 512,
+                 "ksz": 5,
+                 "intermediate_dim": 2048,
+                 "num_layers": 4,
+                 "dilation_lst": [
+                     1,
+                     1,
+                     1,
+                     1
+                 ]
+             },
+             "proj_out": {
+                 "idim": 512,
+                 "chunk_compress_factor": 6,
+                 "ldim": 24
+             }
+         }
+     },
+     "ae": {
+         "sample_rate": 44100,
+         "n_delay": 0,
+         "base_chunk_size": 512,
+         "chunk_compress_factor": 1,
+         "ldim": 24,
+         "encoder": {
+             "spec_processor": {
+                 "n_fft": 2048,
+                 "win_length": 2048,
+                 "hop_length": 512,
+                 "n_mels": 228,
+                 "sample_rate": 44100,
+                 "eps": 1e-05,
+                 "norm_mean": 0.0,
+                 "norm_std": 1.0
+             },
+             "ksz_init": 7,
+             "ksz": 7,
+             "num_layers": 10,
+             "dilation_lst": [
+                 1,
+                 1,
+                 1,
+                 1,
+                 1,
+                 1,
+                 1,
+                 1,
+                 1,
+                 1
+             ],
+             "intermediate_dim": 2048,
+             "idim": 1253,
+             "hdim": 512,
+             "odim": 24
+         },
+         "decoder": {
+             "ksz_init": 7,
+             "ksz": 7,
+             "num_layers": 10,
+             "dilation_lst": [
+                 1,
+                 2,
+                 4,
+                 1,
+                 2,
+                 4,
+                 1,
+                 1,
+                 1,
+                 1
+             ],
+             "intermediate_dim": 2048,
+             "idim": 24,
+             "hdim": 512,
+             "head": {
+                 "idim": 512,
+                 "hdim": 2048,
+                 "odim": 512,
+                 "ksz": 3
+             }
+         }
+     },
+     "dp": {
+         "latent_dim": 24,
+         "chunk_compress_factor": 6,
+         "normalizer": {
+             "scale": 1.0
+         },
+         "sentence_encoder": {
+             "char_emb_dim": 64,
+             "text_embedder": {
+                 "char_emb_dim": 64
+             },
+             "convnext": {
+                 "idim": 64,
+                 "ksz": 5,
+                 "intermediate_dim": 256,
+                 "num_layers": 6,
+                 "dilation_lst": [
+                     1,
+                     1,
+                     1,
+                     1,
+                     1,
+                     1
+                 ]
+             },
+             "attn_encoder": {
+                 "hidden_channels": 64,
+                 "filter_channels": 256,
+                 "n_heads": 2,
+                 "n_layers": 2,
+                 "p_dropout": 0.0
+             },
+             "proj_out": {
+                 "idim": 64,
+                 "odim": 64
+             }
+         },
+         "style_encoder": {
+             "proj_in": {
+                 "ldim": 24,
+                 "chunk_compress_factor": 6,
+                 "odim": 64
+             },
+             "convnext": {
+                 "idim": 64,
+                 "ksz": 5,
+                 "intermediate_dim": 256,
+                 "num_layers": 4,
+                 "dilation_lst": [
+                     1,
+                     1,
+                     1,
+                     1
+                 ]
+             },
+             "style_token_layer": {
+                 "input_dim": 64,
+                 "n_style": 8,
+                 "style_key_dim": 0,
+                 "style_value_dim": 16,
+                 "prototype_dim": 64,
+                 "n_units": 64,
+                 "n_heads": 2
+             }
+         },
+         "predictor": {
+             "sentence_dim": 64,
+             "n_style": 8,
+             "style_dim": 16,
+             "hdim": 128,
+             "n_layer": 2
+         }
+     }
+ }
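Reviewer note on the `ksz` and `dilation_lst` fields: the README attributes the longer content cap to padded positions leaking into real ones through dilated convolutions. How far one position can see is given by the receptive field of a convolution stack. The sketch below assumes each listed layer is a single stride-1 convolution of kernel size `ksz` with symmetric padding; the config does not state this, so treat the figures as an estimate for one stack in isolation. The helper name `receptive_field` is hypothetical.

```python
def receptive_field(ksz, dilation_lst):
    """Receptive field (in positions) of stacked stride-1 dilated convolutions.

    Each layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    return 1 + sum((ksz - 1) * d for d in dilation_lst)


# Values copied from tts.json above.
text_encoder_rf = receptive_field(5, [1, 1, 2, 2, 4, 4])  # ttl.text_encoder.convnext
main_block_rf = receptive_field(5, [1, 2, 4, 8])          # ttl.vector_field.main_blocks.convnext_0

# With symmetric padding, each side reaches (rf - 1) // 2 positions.
text_encoder_reach = (text_encoder_rf - 1) // 2
main_block_reach = (main_block_rf - 1) // 2
print(text_encoder_rf, text_encoder_reach, main_block_rf, main_block_reach)
```

Under these assumptions a single `convnext_0` stack already mixes information from 30 positions on each side, and `main_blocks` repeats such stacks (`n_blocks` is 4), so the reach of padded positions into the real region can be considerably wider.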
unicode_indexer.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/F1.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/F2.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/F3.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/F4.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/F5.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M1.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M2.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M3.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M4.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M5.json ADDED
The diff for this file is too large to render. See raw diff