Initial upload: fp32 + INT4 LiteRT + ONNX vector_estimator + auto-pad inference + README

Files changed (22)

README.md +138 -0
fp32/duration_predictor.tflite +3 -0
fp32/text_encoder.tflite +3 -0
fp32/vocoder.tflite +3 -0
inference.py +230 -0
int4/duration_predictor.tflite +3 -0
int4/text_encoder.tflite +3 -0
int4/vocoder.tflite +3 -0
int8/vocoder.tflite +3 -0
tts.json +311 -0
unicode_indexer.json +0 -0
vector_estimator.onnx +3 -0
voice_styles/F1.json +0 -0
voice_styles/F2.json +0 -0
voice_styles/F3.json +0 -0
voice_styles/F4.json +0 -0
voice_styles/F5.json +0 -0
voice_styles/M1.json +0 -0
voice_styles/M2.json +0 -0
voice_styles/M3.json +0 -0
voice_styles/M4.json +0 -0
voice_styles/M5.json +0 -0

README.md ADDED

	@@ -0,0 +1,138 @@

+---
+license: openrail
+language:
+- en
+- ja
+- zh
+- ko
+- es
+- fr
+- de
+- multilingual
+library_name: ai-edge-litert
+tags:
+- litert
+- tflite
+- tensorflow-lite
+- text-to-speech
+- tts
+- audio
+- diffusion
+- flow-matching
+- on-device
+- mobile
+- android
+- int4
+- weight-only-quantization
+pipeline_tag: text-to-speech
+base_model: Supertone/supertonic-3
+---
+# Supertonic-3 — LiteRT (.tflite, INT4)
+LiteRT / TensorFlow Lite conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3),
+a 99M-parameter multilingual TTS model. Three of the four components convert
+cleanly to true INT4 weight-only quantization via Google's
+[ai-edge-quantizer](https://github.com/google-ai-edge/ai-edge-quantizer)
+and run on the [`ai_edge_litert`](https://github.com/google-ai-edge/litert)
+runtime. `vector_estimator` (the diffusion denoiser) is kept as ONNX —
+its rotary multi-head attention defeats onnx2tf's NCW↔NHWC shape
+inference, and `litert_torch.convert` deadlocks in MLIR lowering when
+fed the model with loaded weights (the same fresh-initialized module
+converts cleanly in 11 s, isolating the trigger to specific weight
+patterns; a likely upstream bug).
+## Configurations
+| Config | Components | Size | Notes |
+| --- | --- | ---: | --- |
+| **int4 (recommended)** | `int4/{duration_predictor,text_encoder}.tflite` + `vector_estimator.onnx` + `int8/vocoder.tflite` | **297 MB** | true 4-bit weights via ai-edge-quantizer; vocoder kept at INT8 because the INT4 vocoder is broken (cosine similarity ~0) |
+| fp32 | `fp32/{duration_predictor,text_encoder,vocoder}.tflite` + `vector_estimator.onnx` | 398 MB | float reference |
+| int4 (all) | `int4/{duration_predictor,text_encoder,vocoder}.tflite` + `vector_estimator.onnx` | 284 MB | broken: the INT4 vocoder produces white noise |
+| Component file | Size |
+| --- | ---: |
+| `fp32/duration_predictor.tflite` | 4 MB |
+| `fp32/text_encoder.tflite` | 37 MB |
+| `fp32/vocoder.tflite` | 101 MB |
+| `int4/duration_predictor.tflite` | 2.5 MB |
+| `int4/text_encoder.tflite` | 13 MB |
+| `int8/vocoder.tflite` (recommended) | 26 MB |
+| `vector_estimator.onnx` (always) | 256 MB |
+## Quickstart
+```bash
+pip install ai-edge-litert onnxruntime soundfile numpy supertonic
+git clone https://huggingface.co/Reza2kn/supertonic-3-litert
+cd supertonic-3-litert
+# Recommended INT4 config (default)
+python inference.py --text "Hello, world." --voice F1 --out hello.wav
+# Long prompt — use --auto-pad for full content rendering
+python inference.py \
+  --text "A gentle breeze moved through the open window while everyone listened to the story. The narrator paused, took a slow breath, and continued in a softer tone. Outside, the city carried on, unaware of the quiet moment unfolding inside." \
+  --voice F5 --auto-pad --out long.wav
+# Explicit FP32 baseline
+python inference.py --text "Hello" --dp-quant fp32 --te-quant fp32 --voc-quant fp32
+```
+10 voice styles ship in `voice_styles/`: F1–F5 (female), M1–M5 (male).
+31 languages supported via `unicode_indexer.json`.
+## The auto-pad trick (why `--auto-pad` matters)
+The supertonic-3 model has a soft cap on per-utterance content: it
+truncates long prompts and fills the rest of the frame budget with a
+stable filler tone. The LiteRT pipeline runs the ONNX `vector_estimator`
+at native shapes, so truncation happens at the model's own limit (~13.7 s
+for the long test prompt) rather than at the ~16.7 s reached by a
+bucket-extended CoreML build.
+`--auto-pad`:
+1. **Pass 1** synthesizes the prompt alone to find the natural endpoint.
+2. **Pass 2** appends a long filler sentence
+   (`" And with that, the gentle silence wrapped itself around the room."`)
+   that gives the model more text tokens + more diffusion frames to
+   fully render the original prompt before truncating into filler.
+3. **Trim** cuts at the longest clean-silence gap between the original
+   prompt's natural endpoint and the appended sentence's endpoint, then
+   tail-pads with 0.5 s of true silence.
+Cost: 2× synthesis. Recommended for any prompt over ~5 s.
+## Conversion pipeline
+```
+Supertone/supertonic-3 (ONNX)
+  -> onnxsim.simplify (T=L=320)
+  -> fuse_gelu (Div/Erf/Add/Mul/Mul -> ONNX Gelu opset 20)   # required to keep ai_edge_litert eligible
+  -> onnx2tf -kt -coion (TF SavedModel)
+  -> tf.lite.TFLiteConverter (fp32 .tflite)
+  -> ai-edge-quantizer weight_only_wi4_afp32() (true INT4)
+  -> ai_edge_litert.Interpreter at runtime
+```
+The **GELU fuse** is the key unlock. Without it, `onnx2tf` emits FlexErf
+ops which disqualify the model from `ai_edge_litert` (the runtime that
+supports INT4). Replacing the Erf-based GELU expansion with a single
+ONNX `Gelu` op (opset 20) keeps the model in pure-TFLite ops and unblocks
+INT4 inference.
+`vector_estimator` is kept as ONNX because onnx2tf's transpose
+optimization breaks rotary attention masking, and `litert_torch.convert`
+deadlocks on its loaded weights. Running the ONNX VE on CPU takes
+~3.5 s wall-clock in total for an 8-step long-prompt synthesis on an M2 Pro.
+## License
+OpenRAIL — same as the original Supertone/supertonic-3.
+## Credits
+- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
+- LiteRT conversion + auto-pad workflow: this repo
+- Quantization: [`ai-edge-quantizer`](https://github.com/google-ai-edge/ai-edge-quantizer)
+- Runtime: [`ai_edge_litert`](https://github.com/google-ai-edge/litert)
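
A note on the GELU fuse described in the README's conversion pipeline: the rewrite is safe because the decomposed chain and the single op compute the same function. The sketch below is only a numerical illustration in plain Python, not the repository's `fuse_gelu` pass. The function names are hypothetical, and the exact operand order of the exported `Div/Erf/Add/Mul/Mul` chain is an assumption.

```python
import math

def gelu_decomposed(x: float) -> float:
    """Erf-based GELU written as an explicit Div -> Erf -> Add -> Mul -> Mul chain.

    The operand order is assumed; the real ONNX export may order the Muls differently.
    """
    t = x / math.sqrt(2.0)  # Div
    t = math.erf(t)         # Erf
    t = t + 1.0             # Add
    t = x * t               # Mul
    return t * 0.5          # Mul

def gelu_fused(x: float) -> float:
    """Closed form of exact (non-tanh) GELU, which a single Gelu op computes."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Compare both forms on a grid from -5.0 to 5.0 in steps of 0.1.
max_diff = max(
    abs(gelu_decomposed(v / 10) - gelu_fused(v / 10)) for v in range(-50, 51)
)
print(f"max |decomposed - fused| = {max_diff:.3e}")
```

Because the two forms agree to floating-point precision, replacing the five-op pattern with one `Gelu` node should not change model outputs, only the set of ops the converter has to lower.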

fp32/duration_predictor.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:083179cec4e187c81b6d4be7c3e827acc90adc9cb1dc8c587e06fc5ae9b6a8e1
+size 3855484
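
The `.tflite` and `.onnx` entries in this commit are Git LFS pointers (a `version`, `oid sha256:...`, and `size` line), not the binaries themselves. A downloaded file can be checked against its pointer as sketched below. The helper names are hypothetical and the demo uses an in-memory payload rather than a real model file. For a multi-hundred-megabyte file you would hash in chunks instead of loading everything at once.

```python
import hashlib

def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields: dict[str, str] = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

def matches_pointer(data: bytes, pointer: dict[str, str]) -> bool:
    """Return True if data has the size and SHA-256 digest the pointer records."""
    algo, _, digest = pointer["oid"].partition(":")
    if algo != "sha256":
        return False
    return (
        len(data) == int(pointer["size"])
        and hashlib.sha256(data).hexdigest() == digest
    )

# Demo with a made-up payload; a real check would read the downloaded file.
payload = b"not a real tflite model"
pointer_text = (
    "version https://git-lfs.github.com/spec/v1\n"
    f"oid sha256:{hashlib.sha256(payload).hexdigest()}\n"
    f"size {len(payload)}\n"
)
pointer = parse_lfs_pointer(pointer_text)
print(matches_pointer(payload, pointer))         # intact payload
print(matches_pointer(payload + b"x", pointer))  # corrupted payload
```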

fp32/text_encoder.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c18aa25615e95363585b41673596ffe0d96736cf21aa4461af2f76e17991b507
+size 36932784

fp32/vocoder.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b6296bcb7b7de728aeb0cfc8ee89443ba1c404a13a5806d52f43b1fd8e378d42
+size 101421512

inference.py ADDED

	@@ -0,0 +1,230 @@

+"""End-to-end TTS inference using the LiteRT (.tflite) + ONNX components.
+Architecture:
+    text -> tokenize
+        -> duration_predictor (.tflite)  -> frame count
+        -> text_encoder       (.tflite)  -> text embedding
+        -> sample noisy latent ~ N(0, I)
+        -> vector_estimator   (.onnx)    -> ODE step x 8
+        -> vocoder            (.tflite)  -> 44.1 kHz waveform
+3 of the 4 components convert cleanly to LiteRT via onnx2tf + ai-edge-
+quantizer. `vector_estimator` is kept as ONNX because its rotary
+multi-head attention defeats onnx2tf's NCW-NHWC shape inference (and
+litert-torch deadlocks on loaded weights with specific patterns). This
+ONNX fallback runs on CPU via onnxruntime; the other three run on the
+LiteRT runtime (`ai_edge_litert`) which supports true INT4 inference.
+Two recommended configurations:
+  fp32:  fp32/dp + fp32/te + vector_estimator.onnx + fp32/vocoder
+         (142 MB tflite + 256 MB ONNX = ~398 MB)
+  int4:  int4/dp + int4/te + vector_estimator.onnx + int8/vocoder
+         (15 MB tflite + 26 MB INT8 vocoder + 256 MB ONNX = ~297 MB)
+         (INT4 vocoder is broken — cos ~0 — so we ship INT8 for vocoder)
+Usage:
+    python inference.py --text "Hello, world." --voice F1 --lang en
+    python inference.py --text "<longer prompt>" --voice F5 --auto-pad
+"""
+from __future__ import annotations
+import argparse
+import json
+import sys
+import time
+from pathlib import Path
+import numpy as np
+import soundfile as sf
+import onnxruntime as ort
+HERE = Path(__file__).parent
+T_BUCKET = 320
+L_BUCKET = 320
+SAMPLE_RATE = 44_100
+LATENT_DIM = 24
+CHUNK_COMPRESS_FACTOR = 6
+BASE_CHUNK_SIZE = 512
+DEFAULT_TOTAL_STEPS = 8
+DEFAULT_SPEED = 1.05
+DEFAULT_AUTO_PAD = " And with that, the gentle silence wrapped itself around the room."
+def _pad(arr: np.ndarray, axis: int, target: int) -> np.ndarray:
+    if arr.shape[axis] >= target:
+        return arr
+    pad = [(0, 0)] * arr.ndim
+    pad[axis] = (0, target - arr.shape[axis])
+    return np.pad(arr, pad)
+def _load_voice(name: str) -> tuple[np.ndarray, np.ndarray]:
+    j = json.loads((HERE / "voice_styles" / f"{name}.json").read_text())
+    def r(part):
+        # Rebuild the tensor from its flat "data" list and "dims" shape.
+        return np.array(part["data"], dtype=np.float32).reshape(*part["dims"])
+    return r(j["style_ttl"]), r(j["style_dp"])
+def _load_tokenizer(indexer_path: Path):
+    try:
+        from supertonic.core import UnicodeProcessor
+    except ImportError as e:
+        raise RuntimeError(
+            "supertonic package is required for tokenization. "
+            "Install with: pip install supertonic"
+        ) from e
+    return UnicodeProcessor(str(indexer_path))
+class TFLiteRunner:
+    """Convenience wrapper around ai_edge_litert.Interpreter (true LiteRT
+    runtime, supports INT4) — falls back to tf.lite.Interpreter for FP32
+    if ai_edge_litert is unavailable."""
+    def __init__(self, path: Path):
+        try:
+            from ai_edge_litert.interpreter import Interpreter as AILiteRT
+            self._interp = AILiteRT(model_path=str(path))
+        except ImportError:
+            import tensorflow as tf
+            self._interp = tf.lite.Interpreter(model_path=str(path))
+        self._interp.allocate_tensors()
+        self._in_details = {d["name"]: d for d in self._interp.get_input_details()}
+        self._in_keys = {full.split("/")[-1]: full for full in self._in_details}
+        self._out = self._interp.get_output_details()[0]
+    def predict(self, feed: dict[str, np.ndarray]) -> np.ndarray:
+        for short, value in feed.items():
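+            full = self._in_keys.get(short) or next(
+                (k for k in self._in_details if short in k), None)
+            if full is None:
+                # Fail with a readable message instead of KeyError: None.
+                raise KeyError(f"No input tensor matches {short!r}; "
+                               f"available: {sorted(self._in_details)}")
+            d = self._in_details[full]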
+            v = value if value.dtype == d["dtype"] else value.astype(d["dtype"])
+            self._interp.set_tensor(d["index"], v)
+        self._interp.invoke()
+        return self._interp.get_tensor(self._out["index"])
+def _last_loud_window(audio: np.ndarray, thresh: float = 0.025) -> int:
+    win = int(0.05 * SAMPLE_RATE)
+    n = len(audio) // win
+    rms = np.sqrt(np.mean(audio[: n * win].reshape(n, win) ** 2, axis=1))
+    loud = np.where(rms > thresh)[0]
+    return int(loud[-1]) if len(loud) else 0
+def trim_padded(unpad: np.ndarray, padded: np.ndarray) -> np.ndarray:
+    win = int(0.05 * SAMPLE_RATE)
+    n = len(padded) // win
+    rms = np.sqrt(np.mean(padded[: n * win].reshape(n, win) ** 2, axis=1))
+    floor = _last_loud_window(unpad)
+    ceil_ = _last_loud_window(padded) + 1
+    candidates = []
+    j = floor
+    while j < ceil_ - 1:
+        if rms[j] < 0.025 and rms[j + 1] < 0.025:
+            start = j
+            total = 0.0
+            cnt = 0
+            while j < ceil_ and rms[j] < 0.025:
+                total += float(rms[j])
+                cnt += 1
+                j += 1
+            candidates.append((start, cnt, total / max(cnt, 1)))
+        else:
+            j += 1
+    if not candidates:
+        return padded[: ceil_ * win]
+    start_win, length, avg = max(candidates, key=lambda c: (c[1], -c[0]))
+    end_samples = start_win * win
+    out = padded[:end_samples].copy()
+    fade = min(int(0.06 * SAMPLE_RATE), len(out))
+    out[-fade:] *= np.linspace(1.0, 0.0, fade, dtype=np.float32)
+    return np.concatenate([out, np.zeros(int(0.5 * SAMPLE_RATE), dtype=np.float32)])
+class Supertonic3LiteRT:
+    """LiteRT TTS with ONNX vector_estimator fallback. Pass quants per
+    component; defaults give the recommended (int4 dp + te, int8 vocoder)
+    configuration."""
+    def __init__(self, dp_quant: str = "int4", te_quant: str = "int4",
+                 voc_quant: str = "int8"):
+        self.dp = TFLiteRunner(HERE / dp_quant / "duration_predictor.tflite")
+        self.te = TFLiteRunner(HERE / te_quant / "text_encoder.tflite")
+        self.voc = TFLiteRunner(HERE / voc_quant / "vocoder.tflite")
+        self.ve = ort.InferenceSession(
+            str(HERE / "vector_estimator.onnx"),
+            providers=["CPUExecutionProvider"],
+        )
+        self.tok = _load_tokenizer(HERE / "unicode_indexer.json")
+    def _synth(self, text: str, voice: str, lang: str, seed: int,
+               total_steps: int, speed: float, full_bucket: bool) -> np.ndarray:
+        text_ids, text_mask = self.tok([text], lang)
+        text_ids = text_ids.astype(np.int64)
+        text_mask = text_mask.astype(np.float32)
+        style_ttl, style_dp = _load_voice(voice)
+        text_ids_p = _pad(text_ids, 1, T_BUCKET)
+        text_mask_p = _pad(text_mask, 2, T_BUCKET)
+        dur = float(self.dp.predict({"text_ids": text_ids_p, "style_dp": style_dp,
+                                     "text_mask": text_mask_p})[0]) / speed
+        text_emb_full = self.te.predict({"text_ids": text_ids_p, "style_ttl": style_ttl,
+                                         "text_mask": text_mask_p})
+        # ONNX VE accepts native shapes — trim text_emb back to T_real.
+        T_real = text_ids.shape[1]
+        text_emb_real = text_emb_full[:, :, :T_real]
+        L_real = max(1, min(L_BUCKET, (int(dur * SAMPLE_RATE) + BASE_CHUNK_SIZE * CHUNK_COMPRESS_FACTOR - 1)
+                            // (BASE_CHUNK_SIZE * CHUNK_COMPRESS_FACTOR)))
+        np.random.seed(seed)
+        xt = (np.random.randn(1, LATENT_DIM * CHUNK_COMPRESS_FACTOR, L_real)).astype(np.float32)
+        latent_mask = np.ones((1, 1, L_real), dtype=np.float32)
+        xt = xt * latent_mask
+        total_step_arr = np.array([float(total_steps)], dtype=np.float32)
+        for step in range(total_steps):
+            xt = self.ve.run(None, {
+                "noisy_latent": xt, "text_emb": text_emb_real, "style_ttl": style_ttl,
+                "text_mask": text_mask, "latent_mask": latent_mask,
+                "current_step": np.array([float(step)], dtype=np.float32),
+                "total_step": total_step_arr,
+            })[0]
+        xt_padded = _pad(xt, 2, L_BUCKET)
+        wav = self.voc.predict({"latent": xt_padded})[0]
+        if full_bucket:
+            return wav
+        return wav[: L_real * CHUNK_COMPRESS_FACTOR * BASE_CHUNK_SIZE]
+    def synthesize(self, text: str, voice: str = "F1", lang: str = "en", seed: int = 0,
+                   total_steps: int = DEFAULT_TOTAL_STEPS, speed: float = DEFAULT_SPEED,
+                   auto_pad: str | None = DEFAULT_AUTO_PAD) -> np.ndarray:
+        if auto_pad is None:
+            return self._synth(text, voice, lang, seed, total_steps, speed, full_bucket=False)
+        unpad = self._synth(text, voice, lang, seed, total_steps, speed, full_bucket=True)
+        padded = self._synth(text + auto_pad, voice, lang, seed, total_steps, speed, full_bucket=True)
+        return trim_padded(unpad, padded)
+def main() -> int:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--text", required=True)
+    ap.add_argument("--voice", default="F1",
+                    choices=[f"F{i}" for i in range(1, 6)] + [f"M{i}" for i in range(1, 6)])
+    ap.add_argument("--lang", default="en")
+    ap.add_argument("--seed", type=int, default=0)
+    ap.add_argument("--total-steps", type=int, default=DEFAULT_TOTAL_STEPS)
+    ap.add_argument("--auto-pad", nargs="?", const=DEFAULT_AUTO_PAD, default=None,
+                    help="2-pass synthesis with filler suffix + auto-trim (recommended for long prompts).")
+    ap.add_argument("--dp-quant", default="int4", choices=["fp32", "int4"])
+    ap.add_argument("--te-quant", default="int4", choices=["fp32", "int4"])
+    ap.add_argument("--voc-quant", default="int8", choices=["fp32", "int8", "int4"],
+                    help="INT4 vocoder is broken (cos ~0) — use int8 or fp32.")
+    ap.add_argument("--out", default="out.wav")
+    args = ap.parse_args()
+    t0 = time.time()
+    tts = Supertonic3LiteRT(dp_quant=args.dp_quant, te_quant=args.te_quant, voc_quant=args.voc_quant)
+    print(f"Loaded models in {time.time() - t0:.2f}s (dp={args.dp_quant}, te={args.te_quant}, voc={args.voc_quant})")
+    t0 = time.time()
+    audio = tts.synthesize(args.text, voice=args.voice, lang=args.lang, seed=args.seed,
+                           total_steps=args.total_steps, auto_pad=args.auto_pad)
+    sf.write(args.out, audio, SAMPLE_RATE)
+    print(f"Synthesized {len(audio)/SAMPLE_RATE:.2f}s in {time.time() - t0:.2f}s -> {args.out}")
+    return 0
+if __name__ == "__main__":
+    sys.exit(main())
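
The gap selection inside `trim_padded` can be hard to follow in diff form. The sketch below restates it in pure Python, without NumPy, under the same rules: scan 50 ms RMS windows between the unpadded pass's last loud window (`floor`) and the padded pass's end (`ceil_`), keep runs of at least two quiet windows, and pick the longest, preferring the earliest on ties. The function name and the sample RMS values are hypothetical.

```python
SAMPLE_RATE = 44_100
WIN = SAMPLE_RATE // 20  # 50 ms window = 2205 samples

def longest_quiet_run(rms, floor, ceil_, thresh=0.025):
    """Return (start_window, length) of the longest quiet run, or None.

    A run counts only if it spans at least two consecutive windows whose
    RMS is below thresh. The strict '>' keeps the earliest run on ties.
    """
    best = None
    j = floor
    while j < ceil_:
        if rms[j] < thresh:
            start = j
            while j < ceil_ and rms[j] < thresh:
                j += 1
            length = j - start
            if length >= 2 and (best is None or length > best[1]):
                best = (start, length)
        else:
            j += 1
    return best

# Made-up per-window RMS values: speech, a one-window dip, then two real gaps.
rms = [0.2, 0.3, 0.01, 0.2, 0.01, 0.01, 0.01, 0.3, 0.01, 0.01, 0.2]
run = longest_quiet_run(rms, floor=1, ceil_=len(rms))
cut_sample = run[0] * WIN if run else None
print(run, cut_sample)
```

In this example the single quiet window at index 2 is ignored, the three-window gap starting at index 4 wins over the two-window gap at index 8, and the audio would be cut at the start of that gap before the fade-out and 0.5 s of silence are applied.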

int4/duration_predictor.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d0ca55b4aba85cd5de5daf1a91a68ffc27d64bd8bc33d98c2cdab20d8c98ebd7
+size 2491168

int4/text_encoder.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:62d0fdfcb5a368bd36ffd3664d323486b24de772e82bcb133608c6abd39e5577
+size 12552576

int4/vocoder.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:51e0ed88c0d5c089c0275812aecfc260ab0d77e5373783857b8a6f6ef463a4ad
+size 13321728

int8/vocoder.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6bccafcdf53d4b359cf8ed923ad547a99d84b356b84805a61ab8488278603a7d
+size 25965568

tts.json ADDED

	@@ -0,0 +1,311 @@

+{
+    "tts_version": "v1.7.3",
+    "split": "opensource-multilingual",
+    "ttl": {
+        "latent_dim": 24,
+        "chunk_compress_factor": 6,
+        "batch_expander": {
+            "n_batch_expand": 6
+        },
+        "normalizer": {
+            "scale": 0.25
+        },
+        "text_encoder": {
+            "n_langs": 0,
+            "lang_emb_dim": 0,
+            "text_embedder": {
+                "char_emb_dim": 256
+            },
+            "convnext": {
+                "idim": 256,
+                "ksz": 5,
+                "intermediate_dim": 1024,
+                "num_layers": 6,
+                "dilation_lst": [
+                    1,
+                    1,
+                    2,
+                    2,
+                    4,
+                    4
+                ]
+            },
+            "attn_encoder": {
+                "hidden_channels": 256,
+                "filter_channels": 1024,
+                "n_heads": 4,
+                "n_layers": 4,
+                "p_dropout": 0.0
+            },
+            "proj_out": {
+                "idim": 256,
+                "odim": 256
+            }
+        },
+        "flow_matching": {
+            "sig_min": 1e-08
+        },
+        "style_encoder": {
+            "proj_in": {
+                "ldim": 24,
+                "chunk_compress_factor": 6,
+                "odim": 256
+            },
+            "convnext": {
+                "idim": 256,
+                "ksz": 5,
+                "intermediate_dim": 1024,
+                "num_layers": 6,
+                "dilation_lst": [
+                    1,
+                    1,
+                    1,
+                    1,
+                    1,
+                    1
+                ]
+            },
+            "style_token_layer": {
+                "input_dim": 256,
+                "n_style": 50,
+                "style_key_dim": 256,
+                "style_value_dim": 256,
+                "prototype_dim": 256,
+                "n_units": 256,
+                "n_heads": 2
+            }
+        },
+        "speech_prompted_text_encoder": {
+            "text_dim": 256,
+            "style_dim": 256,
+            "n_units": 256,
+            "n_heads": 2
+        },
+        "uncond_masker": {
+            "prob_both_uncond": 0.04,
+            "prob_text_uncond": 0.01,
+            "std": 0.1,
+            "text_dim": 256,
+            "n_style": 50,
+            "style_key_dim": 256,
+            "style_value_dim": 256
+        },
+        "vector_field": {
+            "n_langs": 0,
+            "lang_emb_dim": 0,
+            "proj_in": {
+                "ldim": 24,
+                "chunk_compress_factor": 6,
+                "odim": 512
+            },
+            "time_encoder": {
+                "time_dim": 64,
+                "hdim": 256
+            },
+            "main_blocks": {
+                "n_blocks": 4,
+                "time_cond_layer": {
+                    "idim": 512,
+                    "time_dim": 64
+                },
+                "style_cond_layer": {
+                    "idim": 512,
+                    "style_dim": 256
+                },
+                "text_cond_layer": {
+                    "idim": 512,
+                    "text_dim": 256,
+                    "n_heads": 8,
+                    "n_units": 512,
+                    "use_residual": true,
+                    "rotary_base": 10000,
+                    "rotary_scale": 10
+                },
+                "convnext_0": {
+                    "idim": 512,
+                    "ksz": 5,
+                    "intermediate_dim": 2048,
+                    "num_layers": 4,
+                    "dilation_lst": [
+                        1,
+                        2,
+                        4,
+                        8
+                    ]
+                },
+                "convnext_1": {
+                    "idim": 512,
+                    "ksz": 5,
+                    "intermediate_dim": 2048,
+                    "num_layers": 1,
+                    "dilation_lst": [
+                        1
+                    ]
+                },
+                "convnext_2": {
+                    "idim": 512,
+                    "ksz": 5,
+                    "intermediate_dim": 2048,
+                    "num_layers": 1,
+                    "dilation_lst": [
+                        1
+                    ]
+                }
+            },
+            "last_convnext": {
+                "idim": 512,
+                "ksz": 5,
+                "intermediate_dim": 2048,
+                "num_layers": 4,
+                "dilation_lst": [
+                    1,
+                    1,
+                    1,
+                    1
+                ]
+            },
+            "proj_out": {
+                "idim": 512,
+                "chunk_compress_factor": 6,
+                "ldim": 24
+            }
+        }
+    },
+    "ae": {
+        "sample_rate": 44100,
+        "n_delay": 0,
+        "base_chunk_size": 512,
+        "chunk_compress_factor": 1,
+        "ldim": 24,
+        "encoder": {
+            "spec_processor": {
+                "n_fft": 2048,
+                "win_length": 2048,
+                "hop_length": 512,
+                "n_mels": 228,
+                "sample_rate": 44100,
+                "eps": 1e-05,
+                "norm_mean": 0.0,
+                "norm_std": 1.0
+            },
+            "ksz_init": 7,
+            "ksz": 7,
+            "num_layers": 10,
+            "dilation_lst": [
+                1,
+                1,
+                1,
+                1,
+                1,
+                1,
+                1,
+                1,
+                1,
+                1
+            ],
+            "intermediate_dim": 2048,
+            "idim": 1253,
+            "hdim": 512,
+            "odim": 24
+        },
+        "decoder": {
+            "ksz_init": 7,
+            "ksz": 7,
+            "num_layers": 10,
+            "dilation_lst": [
+                1,
+                2,
+                4,
+                1,
+                2,
+                4,
+                1,
+                1,
+                1,
+                1
+            ],
+            "intermediate_dim": 2048,
+            "idim": 24,
+            "hdim": 512,
+            "head": {
+                "idim": 512,
+                "hdim": 2048,
+                "odim": 512,
+                "ksz": 3
+            }
+        }
+    },
+    "dp": {
+        "latent_dim": 24,
+        "chunk_compress_factor": 6,
+        "normalizer": {
+            "scale": 1.0
+        },
+        "sentence_encoder": {
+            "char_emb_dim": 64,
+            "text_embedder": {
+                "char_emb_dim": 64
+            },
+            "convnext": {
+                "idim": 64,
+                "ksz": 5,
+                "intermediate_dim": 256,
+                "num_layers": 6,
+                "dilation_lst": [
+                    1,
+                    1,
+                    1,
+                    1,
+                    1,
+                    1
+                ]
+            },
+            "attn_encoder": {
+                "hidden_channels": 64,
+                "filter_channels": 256,
+                "n_heads": 2,
+                "n_layers": 2,
+                "p_dropout": 0.0
+            },
+            "proj_out": {
+                "idim": 64,
+                "odim": 64
+            }
+        },
+        "style_encoder": {
+            "proj_in": {
+                "ldim": 24,
+                "chunk_compress_factor": 6,
+                "odim": 64
+            },
+            "convnext": {
+                "idim": 64,
+                "ksz": 5,
+                "intermediate_dim": 256,
+                "num_layers": 4,
+                "dilation_lst": [
+                    1,
+                    1,
+                    1,
+                    1
+                ]
+            },
+            "style_token_layer": {
+                "input_dim": 64,
+                "n_style": 8,
+                "style_key_dim": 0,
+                "style_value_dim": 16,
+                "prototype_dim": 64,
+                "n_units": 64,
+                "n_heads": 2
+            }
+        },
+        "predictor": {
+            "sentence_dim": 64,
+            "n_style": 8,
+            "style_dim": 16,
+            "hdim": 128,
+            "n_layer": 2
+        }
+    }
+}
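
`inference.py` hardcodes `LATENT_DIM`, `CHUNK_COMPRESS_FACTOR`, `BASE_CHUNK_SIZE`, and `SAMPLE_RATE`, all of which also appear in `tts.json`. As a sketch, the same values and the latent frame count can be derived from the config. The inline JSON is a hand-copied excerpt of the file above, and `latent_frames` is a hypothetical helper that mirrors the `L_real` expression, with the 320-frame cap taken from `L_BUCKET`.

```python
import json

# Hand-copied excerpt of tts.json; a real script would read the file itself.
CONFIG_EXCERPT = """
{
  "ttl": {"latent_dim": 24, "chunk_compress_factor": 6},
  "ae": {"sample_rate": 44100, "base_chunk_size": 512}
}
"""

cfg = json.loads(CONFIG_EXCERPT)
sample_rate = cfg["ae"]["sample_rate"]
# Audio samples covered by one latent frame: 512 * 6.
samples_per_frame = cfg["ae"]["base_chunk_size"] * cfg["ttl"]["chunk_compress_factor"]
# Channel count of the noisy latent fed to vector_estimator: 24 * 6.
latent_channels = cfg["ttl"]["latent_dim"] * cfg["ttl"]["chunk_compress_factor"]

def latent_frames(duration_s: float, max_frames: int = 320) -> int:
    """Ceil-divide the sample count by samples_per_frame, clamped to [1, max_frames]."""
    samples = int(duration_s * sample_rate)
    frames = (samples + samples_per_frame - 1) // samples_per_frame
    return max(1, min(max_frames, frames))

print(samples_per_frame, latent_channels)
print(latent_frames(1.0), latent_frames(60.0))
```

With these values a full 320-frame bucket covers 320 × 3072 / 44100, about 22.3 s of audio, which is the upper bound on what a single pass can render.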

unicode_indexer.json ADDED