Commit dbcccfe (verified) by Reza2kn · 1 parent: fd2bb20

Initial upload: fp32 + INT4 LiteRT + ONNX vector_estimator + auto-pad inference + README

README.md ADDED
@@ -0,0 +1,138 @@
+ ---
+ license: openrail
+ language:
+ - en
+ - ja
+ - zh
+ - ko
+ - es
+ - fr
+ - de
+ - multilingual
+ library_name: ai-edge-litert
+ tags:
+ - litert
+ - tflite
+ - tensorflow-lite
+ - text-to-speech
+ - tts
+ - audio
+ - diffusion
+ - flow-matching
+ - on-device
+ - mobile
+ - android
+ - int4
+ - weight-only-quantization
+ pipeline_tag: text-to-speech
+ base_model: Supertone/supertonic-3
+ ---
+
+ # Supertonic-3: LiteRT (.tflite, INT4)
+
+ LiteRT / TensorFlow Lite conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3),
+ a 99M-parameter multilingual TTS model. Three of the four components
+ convert cleanly to LiteRT and run on the
+ [`ai_edge_litert`](https://github.com/google-ai-edge/litert) runtime.
+ The duration predictor and text encoder use true INT4 weight-only
+ quantization via Google's
+ [ai-edge-quantizer](https://github.com/google-ai-edge/ai-edge-quantizer);
+ the vocoder ships as INT8 because its INT4 build is broken (see
+ Configurations). `vector_estimator` (the diffusion denoiser) is kept as
+ ONNX for two reasons: its rotary multi-head attention defeats onnx2tf's
+ NCW↔NHWC shape inference, and `litert_torch.convert` deadlocks in MLIR
+ lowering when given the model with its trained weights loaded. The same
+ module with freshly initialized weights converts cleanly in 11 s, which
+ isolates the trigger to specific weight patterns and points to a likely
+ upstream bug.
+
+ ## Configurations
+
+ | Config | Components | Size | Notes |
+ | --- | --- | ---: | --- |
+ | **int4 (recommended)** | `int4/{duration_predictor,text_encoder}.tflite` + `vector_estimator.onnx` + `int8/vocoder.tflite` | **~298 MB** | true 4-bit weights via ai-edge-quantizer; vocoder kept at INT8 because the INT4 vocoder is broken (cos ~0) |
+ | fp32 | `fp32/{duration_predictor,text_encoder,vocoder}.tflite` + `vector_estimator.onnx` | ~399 MB | float reference |
+ | int4 (all) | `int4/{duration_predictor,text_encoder,vocoder}.tflite` + `vector_estimator.onnx` | ~285 MB | broken: the INT4 vocoder produces white noise |
+
+ | Component file | Size |
+ | --- | ---: |
+ | `fp32/duration_predictor.tflite` | 4 MB |
+ | `fp32/text_encoder.tflite` | 37 MB |
+ | `fp32/vocoder.tflite` | 101 MB |
+ | `int4/duration_predictor.tflite` | 2.5 MB |
+ | `int4/text_encoder.tflite` | 13 MB |
+ | `int4/vocoder.tflite` (broken) | 13 MB |
+ | `int8/vocoder.tflite` (recommended) | 26 MB |
+ | `vector_estimator.onnx` (always) | 257 MB |
+
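As a rough cross-check, the per-configuration download size can be re-derived from the byte counts recorded in this commit's Git LFS pointer files. The sketch below is illustrative only: the helper names `config_files` and `config_megabytes` are hypothetical, and sizes are reported in decimal megabytes (1 MB = 10^6 bytes), which is an assumption about the unit used in the tables.

```python
# Byte sizes copied from the Git LFS pointer files in this commit.
SIZES = {
    "fp32/duration_predictor.tflite": 3_855_484,
    "fp32/text_encoder.tflite": 36_932_784,
    "fp32/vocoder.tflite": 101_421_512,
    "int4/duration_predictor.tflite": 2_491_168,
    "int4/text_encoder.tflite": 12_552_576,
    "int4/vocoder.tflite": 13_321_728,
    "int8/vocoder.tflite": 25_965_568,
    "vector_estimator.onnx": 256_534_781,
}


def config_files(dp: str, te: str, voc: str) -> list[str]:
    """Files loaded for one configuration (hypothetical helper)."""
    return [
        f"{dp}/duration_predictor.tflite",
        f"{te}/text_encoder.tflite",
        f"{voc}/vocoder.tflite",
        "vector_estimator.onnx",  # always ONNX, shared by every config
    ]


def config_megabytes(dp: str, te: str, voc: str) -> float:
    """Total size of one configuration in decimal megabytes."""
    return sum(SIZES[name] for name in config_files(dp, te, voc)) / 1e6


if __name__ == "__main__":
    for label, quants in {
        "int4 (recommended)": ("int4", "int4", "int8"),
        "fp32": ("fp32", "fp32", "fp32"),
        "int4 (all)": ("int4", "int4", "int4"),
    }.items():
        print(f"{label}: {config_megabytes(*quants):.1f} MB")
```

The three quant arguments mirror the `--dp-quant`, `--te-quant` and `--voc-quant` flags of `inference.py`, so any other mix can be sized the same way.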
+ ## Quickstart
+
+ ```bash
+ pip install ai-edge-litert onnxruntime soundfile numpy supertonic
+ git clone https://huggingface.co/Reza2kn/supertonic-3-litert
+ cd supertonic-3-litert
+
+ # Recommended INT4 config (default)
+ python inference.py --text "Hello, world." --voice F1 --out hello.wav
+
+ # Long prompt: use --auto-pad so the full text is rendered
+ python inference.py \
+   --text "A gentle breeze moved through the open window while everyone listened to the story. The narrator paused, took a slow breath, and continued in a softer tone. Outside, the city carried on, unaware of the quiet moment unfolding inside." \
+   --voice F5 --auto-pad --out long.wav
+
+ # Explicit FP32 baseline
+ python inference.py --text "Hello" --dp-quant fp32 --te-quant fp32 --voc-quant fp32
+ ```
+
+ 10 voice styles ship in `voice_styles/`: F1–F5 (female), M1–M5 (male).
+ 31 languages supported via `unicode_indexer.json`.
+
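The voice style files are read by `_load_voice` in `inference.py`, which expects two entries, `style_ttl` and `style_dp`, each holding a `data` array and a `dims` shape. A minimal, standard-library sketch for sanity-checking such a file before use might look like the following. The helper names and the toy shapes are made up for illustration; the real files' shapes are not shown in this diff.

```python
import json
import math


def flatten(value):
    """Yield the scalar leaves of an arbitrarily nested list."""
    if isinstance(value, list):
        for item in value:
            yield from flatten(item)
    else:
        yield value


def check_style_part(part: dict) -> list[float]:
    """Return the flattened data if its length matches prod(dims)."""
    flat = [float(v) for v in flatten(part["data"])]
    expected = math.prod(part["dims"])
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, found {len(flat)}")
    return flat


def check_voice(text: str) -> dict[str, list[float]]:
    """Validate both style tensors of one voice-style JSON document."""
    style = json.loads(text)
    return {key: check_style_part(style[key]) for key in ("style_ttl", "style_dp")}


# Toy payload with made-up shapes, standing in for voice_styles/F1.json.
TOY_VOICE = json.dumps({
    "style_ttl": {"data": [[[0.1, 0.2], [0.3, 0.4]]], "dims": [1, 2, 2]},
    "style_dp": {"data": [[0.5, 0.6, 0.7]], "dims": [1, 3]},
})

if __name__ == "__main__":
    parts = check_voice(TOY_VOICE)
    print({name: len(values) for name, values in parts.items()})
```

A mismatch between `data` and `dims` would otherwise only surface as a reshape error inside `_load_voice`, so checking up front gives a clearer message.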
+ ## The auto-pad trick (why `--auto-pad` matters)
+
+ The supertonic-3 model has a soft cap on per-utterance content: it
+ truncates long prompts and drops into a stable filler tone for the rest
+ of the frame budget. The LiteRT pipeline runs the ONNX `vector_estimator`
+ at native shapes, so truncation happens at the model's own limit
+ (~13.7 s for the long test prompt) rather than at CoreML's
+ bucket-extended ~16.7 s.
+
+ `--auto-pad`:
+
+ 1. **Pass 1** synthesizes the prompt alone to find its natural endpoint.
+ 2. **Pass 2** appends a long filler sentence
+    (`" And with that, the gentle silence wrapped itself around the room."`),
+    which gives the model more text tokens and more diffusion frames, so
+    the original prompt is fully rendered before the output degrades
+    into filler.
+ 3. **Trim** cuts the pass-2 audio at the longest clean-silence gap
+    between the original prompt's natural endpoint and the appended
+    sentence's endpoint, fades out the last 60 ms, and tail-pads with
+    0.5 s of true silence.
+
+ Cost: 2× synthesis. Recommended for any prompt over ~5 s.
+
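The trimming step can be illustrated with a simplified, standard-library version of the silence search. This is a sketch rather than the shipped `trim_padded`: it keeps the same 50 ms RMS windows and 0.025 threshold, but the function names, the 8 kHz toy sample rate, and the synthetic tone are assumptions made for the example.

```python
import math


def window_rms(samples: list[float], win: int) -> list[float]:
    """RMS of each complete, non-overlapping window of `win` samples."""
    n = len(samples) // win
    return [
        math.sqrt(sum(x * x for x in samples[i * win:(i + 1) * win]) / win)
        for i in range(n)
    ]


def longest_quiet_run(rms, floor, ceil_, thresh=0.025, min_len=2):
    """Longest run of quiet windows in [floor, ceil_); earliest run wins ties.

    Returns (start_window, length), or None if no run has min_len windows.
    """
    best = None
    j = floor
    while j < ceil_:
        if rms[j] < thresh:
            start = j
            while j < ceil_ and rms[j] < thresh:
                j += 1
            length = j - start
            if length >= min_len and (best is None or length > best[1]):
                best = (start, length)
        else:
            j += 1
    return best


def tone(n: int, rate: int, freq: float = 200.0, amp: float = 0.5) -> list[float]:
    """Synthetic stand-in for speech: a plain sine tone."""
    return [amp * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]


RATE = 8000        # toy sample rate; the real pipeline uses 44.1 kHz
WIN = RATE // 20   # 50 ms windows, as in inference.py

# speech (1.0 s) | gap (0.3 s) | filler (0.5 s) | gap (0.1 s) | filler (0.4 s)
SIGNAL = (tone(20 * WIN, RATE) + [0.0] * (6 * WIN) + tone(10 * WIN, RATE)
          + [0.0] * (2 * WIN) + tone(8 * WIN, RATE))

if __name__ == "__main__":
    rms = window_rms(SIGNAL, WIN)
    print(longest_quiet_run(rms, 0, len(rms)))
```

In the real pipeline, `floor` comes from the last loud window of the pass-1 audio and `ceil_` from the last loud window of the pass-2 audio, so the search is confined to the region where the prompt ends and the filler begins.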
+ ## Conversion pipeline
+
+ ```
+ Supertone/supertonic-3 (ONNX)
+   -> onnxsim.simplify (T=L=320)
+   -> fuse_gelu (Div/Erf/Add/Mul/Mul -> ONNX Gelu opset 20)   # required to keep ai_edge_litert eligible
+   -> onnx2tf -kt -coion (TF SavedModel)
+   -> tf.lite.TFLiteConverter (fp32 .tflite)
+   -> ai-edge-quantizer weight_only_wi4_afp32() (true INT4)
+   -> ai_edge_litert.Interpreter at runtime
+ ```
+
+ The **GELU fuse** is the key unlock. Without it, `onnx2tf` emits FlexErf
+ ops, which disqualify the model from `ai_edge_litert` (the runtime that
+ supports INT4). Replacing the Erf-based GELU expansion with a single
+ ONNX `Gelu` op (opset 20) keeps the model in pure-TFLite ops and unblocks
+ INT4 inference.
+
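To see why the fuse is safe, note that the five-node pattern is simply the exact (erf-based) GELU written out operation by operation. The sketch below assumes the usual export order, `x * 0.5 * (1 + erf(x / sqrt(2)))`; the actual node order in the Supertonic graph is not shown in this commit. It compares that expansion against an independent formulation via the normal CDF.

```python
import math


def gelu_expanded(x: float) -> float:
    """GELU spelled out as the Div/Erf/Add/Mul/Mul chain."""
    t = x / math.sqrt(2.0)   # Div
    t = math.erf(t)          # Erf (the op that becomes FlexErf if left unfused)
    t = t + 1.0              # Add
    t = x * t                # Mul
    return 0.5 * t           # Mul


def gelu_reference(x: float) -> float:
    """Exact GELU, x * Phi(x), with Phi computed through erfc."""
    return x * 0.5 * math.erfc(-x / math.sqrt(2.0))


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
        print(x, gelu_expanded(x), gelu_reference(x))
```

Because the two forms agree to floating-point precision, collapsing the chain into one `Gelu` node with the default (non-tanh) approximation should leave the model's numerics unchanged.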
+ `vector_estimator` is kept as ONNX because onnx2tf's transpose
+ optimization breaks rotary attention masking, and `litert_torch.convert`
+ deadlocks on its loaded weights. Running the ONNX VE step by step on CPU
+ takes ~3.5 s of wall-clock time in total for an 8-step long-prompt
+ synthesis on an M2 Pro.
+
+ ## License
+
+ OpenRAIL, the same license as the original Supertone/supertonic-3.
+
+ ## Credits
+
+ - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
+ - LiteRT conversion + auto-pad workflow: this repo
+ - Quantization: [`ai-edge-quantizer`](https://github.com/google-ai-edge/ai-edge-quantizer)
+ - Runtime: [`ai_edge_litert`](https://github.com/google-ai-edge/litert)
fp32/duration_predictor.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:083179cec4e187c81b6d4be7c3e827acc90adc9cb1dc8c587e06fc5ae9b6a8e1
+ size 3855484
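The model binaries in this commit are stored as Git LFS pointers: three `key value` lines giving the spec version, a SHA-256 object id, and the size in bytes. A small sketch of how a downloaded file could be checked against its pointer is shown below. The function names are hypothetical, and the pointer text is the file content without the diff's leading `+`.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """True if `data` has the size and hash recorded in the pointer."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:083179cec4e187c81b6d4be7c3e827acc90adc9cb1dc8c587e06fc5ae9b6a8e1
size 3855484
"""

if __name__ == "__main__":
    print(parse_lfs_pointer(POINTER_TEXT))
```

For multi-hundred-megabyte files such as `vector_estimator.onnx`, hashing in chunks from disk would be preferable to loading the whole file into memory as this sketch does.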
fp32/text_encoder.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c18aa25615e95363585b41673596ffe0d96736cf21aa4461af2f76e17991b507
+ size 36932784
fp32/vocoder.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b6296bcb7b7de728aeb0cfc8ee89443ba1c404a13a5806d52f43b1fd8e378d42
+ size 101421512
inference.py ADDED
@@ -0,0 +1,230 @@
+ """End-to-end TTS inference using the LiteRT (.tflite) + ONNX components.
+
+ Architecture:
+     text -> tokenize
+          -> duration_predictor (.tflite) -> duration (seconds)
+          -> text_encoder (.tflite)       -> text embedding
+          -> sample noisy latent ~ N(0, I)
+          -> vector_estimator (.onnx)     -> ODE step x 8
+          -> vocoder (.tflite)            -> 44.1 kHz waveform
+
+ 3 of the 4 components convert cleanly to LiteRT via onnx2tf +
+ ai-edge-quantizer. `vector_estimator` is kept as ONNX because its rotary
+ multi-head attention defeats onnx2tf's NCW-NHWC shape inference (and
+ litert-torch deadlocks on loaded weights with specific patterns). This
+ ONNX fallback runs on CPU via onnxruntime; the other three run on the
+ LiteRT runtime (`ai_edge_litert`), which supports true INT4 inference.
+
+ Two recommended configurations:
+
+     fp32:  fp32/dp + fp32/te + vector_estimator.onnx + fp32/vocoder
+            (142 MB tflite + 257 MB ONNX = ~399 MB)
+
+     int4:  int4/dp + int4/te + vector_estimator.onnx + int8/vocoder
+            (15 MB INT4 tflite + 26 MB INT8 vocoder + 257 MB ONNX = ~298 MB)
+            (INT4 vocoder is broken, cos ~0, so we ship INT8 for vocoder)
+
+ Usage:
+     python inference.py --text "Hello, world." --voice F1 --lang en
+     python inference.py --text "<longer prompt>" --voice F5 --auto-pad
+ """
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import sys
+ import time
+ from pathlib import Path
+
+ import numpy as np
+ import soundfile as sf
+ import onnxruntime as ort
+
+ HERE = Path(__file__).parent
+ T_BUCKET = 320
+ L_BUCKET = 320
+ SAMPLE_RATE = 44_100
+ LATENT_DIM = 24
+ CHUNK_COMPRESS_FACTOR = 6
+ BASE_CHUNK_SIZE = 512
+ DEFAULT_TOTAL_STEPS = 8
+ DEFAULT_SPEED = 1.05
+ DEFAULT_AUTO_PAD = " And with that, the gentle silence wrapped itself around the room."
+
+
+ def _pad(arr: np.ndarray, axis: int, target: int) -> np.ndarray:
+     # Zero-pad `arr` along `axis` up to `target`; longer inputs pass through.
+     if arr.shape[axis] >= target:
+         return arr
+     pad = [(0, 0)] * arr.ndim
+     pad[axis] = (0, target - arr.shape[axis])
+     return np.pad(arr, pad)
+
+
+ def _load_voice(name: str) -> tuple[np.ndarray, np.ndarray]:
+     j = json.loads((HERE / "voice_styles" / f"{name}.json").read_text())
+     def r(part): return np.array(part["data"], dtype=np.float32).reshape(*part["dims"])
+     return r(j["style_ttl"]), r(j["style_dp"])
+
+
+ def _load_tokenizer(indexer_path: Path):
+     try:
+         from supertonic.core import UnicodeProcessor
+     except ImportError as e:
+         raise RuntimeError(
+             "supertonic package is required for tokenization. "
+             "Install with: pip install supertonic"
+         ) from e
+     return UnicodeProcessor(str(indexer_path))
+
+
+ class TFLiteRunner:
+     """Convenience wrapper around ai_edge_litert.Interpreter (true LiteRT
+     runtime, supports INT4); falls back to tf.lite.Interpreter for FP32
+     if ai_edge_litert is unavailable."""
+     def __init__(self, path: Path):
+         try:
+             from ai_edge_litert.interpreter import Interpreter as AILiteRT
+             self._interp = AILiteRT(model_path=str(path))
+         except ImportError:
+             import tensorflow as tf
+             self._interp = tf.lite.Interpreter(model_path=str(path))
+         self._interp.allocate_tensors()
+         self._in_details = {d["name"]: d for d in self._interp.get_input_details()}
+         # Map short names (last path segment) to full tensor names.
+         self._in_keys = {full.split("/")[-1]: full for full in self._in_details}
+         self._out = self._interp.get_output_details()[0]
+
+     def predict(self, feed: dict[str, np.ndarray]) -> np.ndarray:
+         for short, value in feed.items():
+             # Exact short-name match first, then substring match.
+             full = self._in_keys.get(short) or next(
+                 (k for k in self._in_details if short in k), None)
+             if full is None:
+                 raise KeyError(f"no model input matches {short!r}")
+             d = self._in_details[full]
+             v = value if value.dtype == d["dtype"] else value.astype(d["dtype"])
+             self._interp.set_tensor(d["index"], v)
+         self._interp.invoke()
+         return self._interp.get_tensor(self._out["index"])
+
+
+ def _last_loud_window(audio: np.ndarray, thresh: float = 0.025) -> int:
+     # Index of the last 50 ms window whose RMS exceeds `thresh` (0 if none).
+     win = int(0.05 * SAMPLE_RATE)
+     n = len(audio) // win
+     rms = np.sqrt(np.mean(audio[: n * win].reshape(n, win) ** 2, axis=1))
+     loud = np.where(rms > thresh)[0]
+     return int(loud[-1]) if len(loud) else 0
+
+
+ def trim_padded(unpad: np.ndarray, padded: np.ndarray) -> np.ndarray:
+     win = int(0.05 * SAMPLE_RATE)
+     n = len(padded) // win
+     rms = np.sqrt(np.mean(padded[: n * win].reshape(n, win) ** 2, axis=1))
+     floor = _last_loud_window(unpad)
+     ceil_ = _last_loud_window(padded) + 1
+     # Collect runs of quiet windows as (start, length, mean RMS).
+     candidates = []
+     j = floor
+     while j < ceil_ - 1:
+         if rms[j] < 0.025 and rms[j + 1] < 0.025:
+             start = j
+             total = 0.0
+             cnt = 0
+             while j < ceil_ and rms[j] < 0.025:
+                 total += float(rms[j])
+                 cnt += 1
+                 j += 1
+             candidates.append((start, cnt, total / max(cnt, 1)))
+         else:
+             j += 1
+     if not candidates:
+         return padded[: ceil_ * win]
+     # Longest run wins; ties go to the earliest start.
+     start_win, length, avg = max(candidates, key=lambda c: (c[1], -c[0]))
+     end_samples = start_win * win
+     out = padded[:end_samples].copy()
+     fade = min(int(0.06 * SAMPLE_RATE), len(out))
+     out[-fade:] *= np.linspace(1.0, 0.0, fade, dtype=np.float32)
+     return np.concatenate([out, np.zeros(int(0.5 * SAMPLE_RATE), dtype=np.float32)])
+
+
+ class Supertonic3LiteRT:
+     """LiteRT TTS with ONNX vector_estimator fallback. Pass quants per
+     component; defaults give the recommended (int4 dp + te, int8 vocoder)
+     configuration."""
+     def __init__(self, dp_quant: str = "int4", te_quant: str = "int4",
+                  voc_quant: str = "int8"):
+         self.dp = TFLiteRunner(HERE / dp_quant / "duration_predictor.tflite")
+         self.te = TFLiteRunner(HERE / te_quant / "text_encoder.tflite")
+         self.voc = TFLiteRunner(HERE / voc_quant / "vocoder.tflite")
+         self.ve = ort.InferenceSession(
+             str(HERE / "vector_estimator.onnx"),
+             providers=["CPUExecutionProvider"],
+         )
+         self.tok = _load_tokenizer(HERE / "unicode_indexer.json")
+
+     def _synth(self, text: str, voice: str, lang: str, seed: int,
+                total_steps: int, speed: float, full_bucket: bool) -> np.ndarray:
+         text_ids, text_mask = self.tok([text], lang)
+         text_ids = text_ids.astype(np.int64)
+         text_mask = text_mask.astype(np.float32)
+         style_ttl, style_dp = _load_voice(voice)
+         text_ids_p = _pad(text_ids, 1, T_BUCKET)
+         text_mask_p = _pad(text_mask, 2, T_BUCKET)
+         dur = float(self.dp.predict({"text_ids": text_ids_p, "style_dp": style_dp,
+                                      "text_mask": text_mask_p})[0]) / speed
+         text_emb_full = self.te.predict({"text_ids": text_ids_p, "style_ttl": style_ttl,
+                                          "text_mask": text_mask_p})
+         # ONNX VE accepts native shapes, so trim text_emb back to T_real.
+         T_real = text_ids.shape[1]
+         text_emb_real = text_emb_full[:, :, :T_real]
+         # Latent frames needed for `dur` seconds (ceiling division), capped at L_BUCKET.
+         samples_per_frame = BASE_CHUNK_SIZE * CHUNK_COMPRESS_FACTOR
+         L_real = max(1, min(L_BUCKET, (int(dur * SAMPLE_RATE) + samples_per_frame - 1)
+                             // samples_per_frame))
+         np.random.seed(seed)
+         xt = (np.random.randn(1, LATENT_DIM * CHUNK_COMPRESS_FACTOR, L_real)).astype(np.float32)
+         latent_mask = np.ones((1, 1, L_real), dtype=np.float32)
+         xt = xt * latent_mask
+         total_step_arr = np.array([float(total_steps)], dtype=np.float32)
+         for step in range(total_steps):
+             xt = self.ve.run(None, {
+                 "noisy_latent": xt, "text_emb": text_emb_real, "style_ttl": style_ttl,
+                 "text_mask": text_mask, "latent_mask": latent_mask,
+                 "current_step": np.array([float(step)], dtype=np.float32),
+                 "total_step": total_step_arr,
+             })[0]
+         xt_padded = _pad(xt, 2, L_BUCKET)
+         wav = self.voc.predict({"latent": xt_padded})[0]
+         if full_bucket:
+             return wav
+         return wav[: L_real * CHUNK_COMPRESS_FACTOR * BASE_CHUNK_SIZE]
+
+     def synthesize(self, text: str, voice: str = "F1", lang: str = "en", seed: int = 0,
+                    total_steps: int = DEFAULT_TOTAL_STEPS, speed: float = DEFAULT_SPEED,
+                    auto_pad: str | None = DEFAULT_AUTO_PAD) -> np.ndarray:
+         if auto_pad is None:
+             return self._synth(text, voice, lang, seed, total_steps, speed, full_bucket=False)
+         unpad = self._synth(text, voice, lang, seed, total_steps, speed, full_bucket=True)
+         padded = self._synth(text + auto_pad, voice, lang, seed, total_steps, speed, full_bucket=True)
+         return trim_padded(unpad, padded)
+
+
+ def main() -> int:
+     ap = argparse.ArgumentParser()
+     ap.add_argument("--text", required=True)
+     ap.add_argument("--voice", default="F1",
+                     choices=[f"F{i}" for i in range(1, 6)] + [f"M{i}" for i in range(1, 6)])
+     ap.add_argument("--lang", default="en")
+     ap.add_argument("--seed", type=int, default=0)
+     ap.add_argument("--total-steps", type=int, default=DEFAULT_TOTAL_STEPS)
+     ap.add_argument("--auto-pad", nargs="?", const=DEFAULT_AUTO_PAD, default=None,
+                     help="2-pass synthesis with filler suffix + auto-trim (recommended for long prompts).")
+     ap.add_argument("--dp-quant", default="int4", choices=["fp32", "int4"])
+     ap.add_argument("--te-quant", default="int4", choices=["fp32", "int4"])
+     ap.add_argument("--voc-quant", default="int8", choices=["fp32", "int8", "int4"],
+                     help="INT4 vocoder is broken (cos ~0); use int8 or fp32.")
+     ap.add_argument("--out", default="out.wav")
+     args = ap.parse_args()
+
+     t0 = time.time()
+     tts = Supertonic3LiteRT(dp_quant=args.dp_quant, te_quant=args.te_quant, voc_quant=args.voc_quant)
+     print(f"Loaded models in {time.time() - t0:.2f}s (dp={args.dp_quant}, te={args.te_quant}, voc={args.voc_quant})")
+
+     t0 = time.time()
+     audio = tts.synthesize(args.text, voice=args.voice, lang=args.lang, seed=args.seed,
+                            total_steps=args.total_steps, auto_pad=args.auto_pad)
+     sf.write(args.out, audio, SAMPLE_RATE)
+     print(f"Synthesized {len(audio)/SAMPLE_RATE:.2f}s in {time.time() - t0:.2f}s -> {args.out}")
+     return 0
+
+
+ if __name__ == "__main__":
+     sys.exit(main())
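The latent-length arithmetic in `_synth` determines how much audio a single call can produce. The sketch below isolates that calculation using the constants from `inference.py`; the helper names `latent_frames` and `max_bucket_seconds` are illustrative and not part of the script.

```python
SAMPLE_RATE = 44_100
BASE_CHUNK_SIZE = 512
CHUNK_COMPRESS_FACTOR = 6
L_BUCKET = 320

# One latent frame decodes to 512 * 6 = 3072 waveform samples.
SAMPLES_PER_FRAME = BASE_CHUNK_SIZE * CHUNK_COMPRESS_FACTOR


def latent_frames(duration_s: float) -> int:
    """Frames needed for `duration_s` seconds, clamped to [1, L_BUCKET]."""
    samples = int(duration_s * SAMPLE_RATE)
    frames = (samples + SAMPLES_PER_FRAME - 1) // SAMPLES_PER_FRAME  # ceiling division
    return max(1, min(L_BUCKET, frames))


def max_bucket_seconds() -> float:
    """Longest waveform the fixed-size vocoder bucket can hold."""
    return L_BUCKET * SAMPLES_PER_FRAME / SAMPLE_RATE


if __name__ == "__main__":
    print(latent_frames(1.0), latent_frames(13.7), round(max_bucket_seconds(), 2))
```

Under these constants the 320-frame bucket corresponds to roughly 22.3 s of audio, so any predicted duration beyond that is clamped before the denoising loop starts.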
int4/duration_predictor.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d0ca55b4aba85cd5de5daf1a91a68ffc27d64bd8bc33d98c2cdab20d8c98ebd7
+ size 2491168
int4/text_encoder.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:62d0fdfcb5a368bd36ffd3664d323486b24de772e82bcb133608c6abd39e5577
+ size 12552576
int4/vocoder.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:51e0ed88c0d5c089c0275812aecfc260ab0d77e5373783857b8a6f6ef463a4ad
+ size 13321728
int8/vocoder.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6bccafcdf53d4b359cf8ed923ad547a99d84b356b84805a61ab8488278603a7d
+ size 25965568
tts.json ADDED
@@ -0,0 +1,311 @@
+ {
+     "tts_version": "v1.7.3",
+     "split": "opensource-multilingual",
+     "ttl": {
+         "latent_dim": 24,
+         "chunk_compress_factor": 6,
+         "batch_expander": {
+             "n_batch_expand": 6
+         },
+         "normalizer": {
+             "scale": 0.25
+         },
+         "text_encoder": {
+             "n_langs": 0,
+             "lang_emb_dim": 0,
+             "text_embedder": {
+                 "char_emb_dim": 256
+             },
+             "convnext": {
+                 "idim": 256,
+                 "ksz": 5,
+                 "intermediate_dim": 1024,
+                 "num_layers": 6,
+                 "dilation_lst": [
+                     1,
+                     1,
+                     2,
+                     2,
+                     4,
+                     4
+                 ]
+             },
+             "attn_encoder": {
+                 "hidden_channels": 256,
+                 "filter_channels": 1024,
+                 "n_heads": 4,
+                 "n_layers": 4,
+                 "p_dropout": 0.0
+             },
+             "proj_out": {
+                 "idim": 256,
+                 "odim": 256
+             }
+         },
+         "flow_matching": {
+             "sig_min": 1e-08
+         },
+         "style_encoder": {
+             "proj_in": {
+                 "ldim": 24,
+                 "chunk_compress_factor": 6,
+                 "odim": 256
+             },
+             "convnext": {
+                 "idim": 256,
+                 "ksz": 5,
+                 "intermediate_dim": 1024,
+                 "num_layers": 6,
+                 "dilation_lst": [
+                     1,
+                     1,
+                     1,
+                     1,
+                     1,
+                     1
+                 ]
+             },
+             "style_token_layer": {
+                 "input_dim": 256,
+                 "n_style": 50,
+                 "style_key_dim": 256,
+                 "style_value_dim": 256,
+                 "prototype_dim": 256,
+                 "n_units": 256,
+                 "n_heads": 2
+             }
+         },
+         "speech_prompted_text_encoder": {
+             "text_dim": 256,
+             "style_dim": 256,
+             "n_units": 256,
+             "n_heads": 2
+         },
+         "uncond_masker": {
+             "prob_both_uncond": 0.04,
+             "prob_text_uncond": 0.01,
+             "std": 0.1,
+             "text_dim": 256,
+             "n_style": 50,
+             "style_key_dim": 256,
+             "style_value_dim": 256
+         },
+         "vector_field": {
+             "n_langs": 0,
+             "lang_emb_dim": 0,
+             "proj_in": {
+                 "ldim": 24,
+                 "chunk_compress_factor": 6,
+                 "odim": 512
+             },
+             "time_encoder": {
+                 "time_dim": 64,
+                 "hdim": 256
+             },
+             "main_blocks": {
+                 "n_blocks": 4,
+                 "time_cond_layer": {
+                     "idim": 512,
+                     "time_dim": 64
+                 },
+                 "style_cond_layer": {
+                     "idim": 512,
+                     "style_dim": 256
+                 },
+                 "text_cond_layer": {
+                     "idim": 512,
+                     "text_dim": 256,
+                     "n_heads": 8,
+                     "n_units": 512,
+                     "use_residual": true,
+                     "rotary_base": 10000,
+                     "rotary_scale": 10
+                 },
+                 "convnext_0": {
+                     "idim": 512,
+                     "ksz": 5,
+                     "intermediate_dim": 2048,
+                     "num_layers": 4,
+                     "dilation_lst": [
+                         1,
+                         2,
+                         4,
+                         8
+                     ]
+                 },
+                 "convnext_1": {
+                     "idim": 512,
+                     "ksz": 5,
+                     "intermediate_dim": 2048,
+                     "num_layers": 1,
+                     "dilation_lst": [
+                         1
+                     ]
+                 },
+                 "convnext_2": {
+                     "idim": 512,
+                     "ksz": 5,
+                     "intermediate_dim": 2048,
+                     "num_layers": 1,
+                     "dilation_lst": [
+                         1
+                     ]
+                 }
+             },
+             "last_convnext": {
+                 "idim": 512,
+                 "ksz": 5,
+                 "intermediate_dim": 2048,
+                 "num_layers": 4,
+                 "dilation_lst": [
+                     1,
+                     1,
+                     1,
+                     1
+                 ]
+             },
+             "proj_out": {
+                 "idim": 512,
+                 "chunk_compress_factor": 6,
+                 "ldim": 24
+             }
+         }
+     },
+     "ae": {
+         "sample_rate": 44100,
+         "n_delay": 0,
+         "base_chunk_size": 512,
+         "chunk_compress_factor": 1,
+         "ldim": 24,
+         "encoder": {
+             "spec_processor": {
+                 "n_fft": 2048,
+                 "win_length": 2048,
+                 "hop_length": 512,
+                 "n_mels": 228,
+                 "sample_rate": 44100,
+                 "eps": 1e-05,
+                 "norm_mean": 0.0,
+                 "norm_std": 1.0
+             },
+             "ksz_init": 7,
+             "ksz": 7,
+             "num_layers": 10,
+             "dilation_lst": [
+                 1,
+                 1,
+                 1,
+                 1,
+                 1,
+                 1,
+                 1,
+                 1,
+                 1,
+                 1
+             ],
+             "intermediate_dim": 2048,
+             "idim": 1253,
+             "hdim": 512,
+             "odim": 24
+         },
+         "decoder": {
+             "ksz_init": 7,
+             "ksz": 7,
+             "num_layers": 10,
+             "dilation_lst": [
+                 1,
+                 2,
+                 4,
+                 1,
+                 2,
+                 4,
+                 1,
+                 1,
+                 1,
+                 1
+             ],
+             "intermediate_dim": 2048,
+             "idim": 24,
+             "hdim": 512,
+             "head": {
+                 "idim": 512,
+                 "hdim": 2048,
+                 "odim": 512,
+                 "ksz": 3
+             }
+         }
+     },
+     "dp": {
+         "latent_dim": 24,
+         "chunk_compress_factor": 6,
+         "normalizer": {
+             "scale": 1.0
+         },
+         "sentence_encoder": {
+             "char_emb_dim": 64,
+             "text_embedder": {
+                 "char_emb_dim": 64
+             },
+             "convnext": {
+                 "idim": 64,
+                 "ksz": 5,
+                 "intermediate_dim": 256,
+                 "num_layers": 6,
+                 "dilation_lst": [
+                     1,
+                     1,
+                     1,
+                     1,
+                     1,
+                     1
+                 ]
+             },
+             "attn_encoder": {
+                 "hidden_channels": 64,
+                 "filter_channels": 256,
+                 "n_heads": 2,
+                 "n_layers": 2,
+                 "p_dropout": 0.0
+             },
+             "proj_out": {
+                 "idim": 64,
+                 "odim": 64
+             }
+         },
+         "style_encoder": {
+             "proj_in": {
+                 "ldim": 24,
+                 "chunk_compress_factor": 6,
+                 "odim": 64
+             },
+             "convnext": {
+                 "idim": 64,
+                 "ksz": 5,
+                 "intermediate_dim": 256,
+                 "num_layers": 4,
+                 "dilation_lst": [
+                     1,
+                     1,
+                     1,
+                     1
+                 ]
+             },
+             "style_token_layer": {
+                 "input_dim": 64,
+                 "n_style": 8,
+                 "style_key_dim": 0,
+                 "style_value_dim": 16,
+                 "prototype_dim": 64,
+                 "n_units": 64,
+                 "n_heads": 2
+             }
+         },
+         "predictor": {
+             "sentence_dim": 64,
+             "n_style": 8,
+             "style_dim": 16,
+             "hdim": 128,
+             "n_layer": 2
+         }
+     }
+ }
unicode_indexer.json ADDED
The diff for this file is too large to render.
 
vector_estimator.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:883ac868ea0275ef0e991524dc64f16b3c0376efd7c320af6b53f5b780d7c61c
+ size 256534781
voice_styles/F1.json through F5.json and voice_styles/M1.json through M5.json ADDED
The diffs for these files are too large to render.