Initial upload: fp16 CoreML + auto-pad inference + README

Files changed (26)

README.md +129 -0
fp16/duration_predictor.mlpackage/Data/com.apple.CoreML/model.mlmodel +3 -0
fp16/duration_predictor.mlpackage/Data/com.apple.CoreML/weights/weight.bin +3 -0
fp16/duration_predictor.mlpackage/Manifest.json +18 -0
fp16/text_encoder.mlpackage/Data/com.apple.CoreML/model.mlmodel +3 -0
fp16/text_encoder.mlpackage/Data/com.apple.CoreML/weights/weight.bin +3 -0
fp16/text_encoder.mlpackage/Manifest.json +18 -0
fp16/vector_estimator.mlpackage/Data/com.apple.CoreML/model.mlmodel +3 -0
fp16/vector_estimator.mlpackage/Data/com.apple.CoreML/weights/weight.bin +3 -0
fp16/vector_estimator.mlpackage/Manifest.json +18 -0
fp16/vocoder.mlpackage/Data/com.apple.CoreML/model.mlmodel +3 -0
fp16/vocoder.mlpackage/Data/com.apple.CoreML/weights/weight.bin +3 -0
fp16/vocoder.mlpackage/Manifest.json +18 -0
inference.py +194 -0
tts.json +311 -0
unicode_indexer.json +0 -0
voice_styles/F1.json +0 -0
voice_styles/F2.json +0 -0
voice_styles/F3.json +0 -0
voice_styles/F4.json +0 -0
voice_styles/F5.json +0 -0
voice_styles/M1.json +0 -0
voice_styles/M2.json +0 -0
voice_styles/M3.json +0 -0
voice_styles/M4.json +0 -0
voice_styles/M5.json +0 -0

README.md ADDED

	@@ -0,0 +1,129 @@

+---
+license: openrail
+language:
+- en
+- ja
+- zh
+- ko
+- es
+- fr
+- de
+- multilingual
+library_name: coremltools
+tags:
+- coreml
+- ane
+- apple-neural-engine
+- text-to-speech
+- tts
+- audio
+- diffusion
+- flow-matching
+- on-device
+pipeline_tag: text-to-speech
+base_model: Supertone/supertonic-3
+---
+# Supertonic-3 — CoreML (fp16, ANE-ready)
+CoreML conversion of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3),
+a 99M-parameter multilingual TTS model. All 4 components run on the
+Apple Neural Engine (1.9–3.7× faster than CPU, measured on an M2 Pro).
+| Component | Size | Role |
+| --- | ---: | --- |
+| `fp16/duration_predictor.mlpackage` | 15 MB | text -> frame count |
+| `fp16/text_encoder.mlpackage` | 71 MB | text -> conditioning latent |
+| `fp16/vector_estimator.mlpackage` | 135 MB | flow-matching denoiser (8 steps) |
+| `fp16/vocoder.mlpackage` | 51 MB | latent -> 44.1 kHz waveform |
+| **Total** | **272 MB** | (originals: ~400 MB ONNX) |
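The sizes in the table can be cross-checked against the Git LFS pointer sizes recorded further down in this commit. The sketch below assumes decimal megabytes (1 MB = 10^6 bytes) and ignores each package's small `Manifest.json`; the helper names are illustrative and not part of the repo.

```python
# Byte sizes copied from the Git LFS pointers in this commit:
# (model.mlmodel, weights/weight.bin) for each component.
LFS_SIZES = {
    "duration_predictor": (292_296, 15_128_768),
    "text_encoder": (540_341, 70_399_168),
    "vector_estimator": (362_380, 134_491_840),
    "vocoder": (70_325, 50_672_512),
}


def size_mb(component: str) -> float:
    """Spec plus weights for one component, in decimal megabytes."""
    return sum(LFS_SIZES[component]) / 1e6


def total_mb() -> float:
    """Combined size of all four components, in decimal megabytes."""
    return sum(size_mb(name) for name in LFS_SIZES)


if __name__ == "__main__":
    for name in LFS_SIZES:
        print(f"{name}: {size_mb(name):.1f} MB")
    print(f"total: {total_mb():.1f} MB")
```

Rounded to whole megabytes, the per-component figures and the total agree with the table.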
+## Quickstart
+```bash
+pip install coremltools soundfile numpy supertonic
+git clone https://huggingface.co/Reza2kn/supertonic-3-coreml
+cd supertonic-3-coreml
+# Short prompt
+python inference.py --text "Hello, world." --voice F1 --lang en --out hello.wav
+# Long prompt — use --auto-pad for full content rendering
+python inference.py \
+  --text "A gentle breeze moved through the open window while everyone listened to the story. The narrator paused, took a slow breath, and continued in a softer tone. Outside, the city carried on, unaware of the quiet moment unfolding inside." \
+  --voice F5 --lang en --auto-pad --out long.wav
+```
+10 voice styles ship in `voice_styles/`: F1–F5 (female), M1–M5 (male).
+31 languages supported via `unicode_indexer.json`.
+## The auto-pad trick (why `--auto-pad` matters)
+The supertonic-3 model has a soft cap on how much speech it renders per
+utterance. For long inputs (more than ~13.7 s of natural speech) the model
+truncates the prompt and emits a low-amplitude filler tone for the rest
+of the budget. The CoreML conversion's static bucket (T=L=320) extends
+this cap by ~3 s, to ~16.7 s, because the bucket's padded positions leak
+into the real positions through ConvNeXt's dilated convolutions. That leak is
+**why CoreML inference sounds more natural than the original ONNX
+library** (proper word separation, intonation), but it still cuts off
+mid-sentence on long prompts.
+`--auto-pad` is a two-pass workaround followed by a trim:
+1. **Pass 1** synthesizes the prompt alone at full bucket length to find
+   where the model's content naturally stops (`t_orig`).
+2. **Pass 2** synthesizes the prompt with a long filler sentence appended
+   (`" And with that, the gentle silence wrapped itself around the room."`).
+   The extra text gives the model enough frames to render the original
+   prompt in full; it then renders the filler sentence and finally drops
+   into the filler tone.
+3. **Trim.** The longest clean-silence gap after `t_orig` marks the boundary
+   between the original prompt and the appended filler. The pipeline cuts
+   there, applies a 60 ms fade-out, and tail-pads with 0.5 s of true silence.
+Cost: ~2× synthesis time. Worth it for any prompt over ~5 s.
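The boundary search in step 3 can be sketched in isolation. This is a simplified, dependency-free illustration, not the repo's `trim_padded`: it accepts a quiet run of a single window (the repo requires at least two), and the toy signal, the 4-sample window, and the value of `t_orig` are made up for the example.

```python
import math


def window_rms(samples, win):
    """RMS of each full, non-overlapping window of `win` samples."""
    n = len(samples) // win
    return [
        math.sqrt(sum(x * x for x in samples[i * win:(i + 1) * win]) / win)
        for i in range(n)
    ]


def longest_quiet_run(rms, start, stop, thresh):
    """Return (first_window, length) of the longest run of windows below
    `thresh` inside rms[start:stop], or None if every window is loud.
    Ties go to the earliest run."""
    best = None
    i = start
    while i < stop:
        if rms[i] >= thresh:
            i += 1
            continue
        j = i
        while j < stop and rms[j] < thresh:
            j += 1
        if best is None or j - i > best[1]:
            best = (i, j - i)
        i = j
    return best


# Toy signal: speech, short pause, speech, long gap, filler sentence.
loud, quiet = [0.5] * 4, [0.0] * 4
signal = loud * 3 + quiet * 1 + loud * 2 + quiet * 3 + loud * 4
rms = window_rms(signal, win=4)
t_orig = 2  # hypothetical: last loud window found in pass 1
cut = longest_quiet_run(rms, t_orig, len(rms), thresh=0.025)
print(cut)
```

The short pause inside the prompt loses to the longer gap before the filler sentence, which is the property the trim relies on.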
+## ANE engagement
+All 4 components compile to ANE-resident programs when loaded with
+`compute_units=ct.ComputeUnit.ALL` (the coremltools default). Measured speedups on M2 Pro vs CPU:
+| Component | ANE speedup |
+| --- | --- |
+| duration_predictor | 1.9× |
+| text_encoder | 2.8× |
+| vector_estimator | 2.4× (per step; 8 steps total) |
+| vocoder | 3.7× |
+Verify ANE engagement with:
+```bash
+xcrun xctrace record --template "Core ML" --output trace.trace --launch -- "$(which python)" inference.py --text "test"
+```
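A rough complement to the trace is to time one package under different compute units. The sketch below uses `ct.models.MLModel(..., compute_units=...)` and `ct.ComputeUnit` from coremltools; `median_latency_ms` and `compare_compute_units` are illustrative helpers that are not part of this repo, and the caller has to supply inputs matching the model's names and static shapes. A latency gap between `CPU_ONLY` and `ALL` is consistent with ANE or GPU dispatch but does not by itself show which unit ran the model.

```python
import statistics
import time


def median_latency_ms(fn, warmup: int = 2, runs: int = 10) -> float:
    """Median wall-clock latency of calling fn(), in milliseconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)


def compare_compute_units(package_path: str, inputs: dict) -> dict:
    """Time one mlpackage on CPU only versus all compute units.

    Requires macOS with coremltools installed. `inputs` must match the
    model's input names and static shapes (supplied by the caller).
    """
    import coremltools as ct  # lazy import so the timing helper runs anywhere

    timings = {}
    for unit in (ct.ComputeUnit.CPU_ONLY, ct.ComputeUnit.ALL):
        model = ct.models.MLModel(package_path, compute_units=unit)
        timings[unit.name] = median_latency_ms(lambda: model.predict(inputs))
    return timings
```

Dividing the `CPU_ONLY` timing by the `ALL` timing gives a speedup figure comparable to the table above.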
+## Conversion notes
+- Static bucket: T=320 (text length), L=320 (latent length). Inputs are
+  zero-padded on the right and masked. Bucket = 22.3 s of audio.
+- `duration_predictor`, `text_encoder`, `vocoder` are hand-reimplemented
+  in PyTorch from the ONNX initializers, then traced to CoreML.
+  Per-component float parity vs ONNX: max-abs error 2e-5 (duration_predictor),
+  cosine similarity 0.998 (text_encoder), cosine similarity 0.9998 (vocoder).
+- `vector_estimator` (the heavy diffusion model) goes through
+  `onnxsim.simplify(T=L=320)` -> `onnx2torch.convert` -> `torch.jit.trace`
+  -> coremltools. Cosine similarity 0.998 vs ONNX per diffusion step.
+- The diffusion sampler stays host-side (8 Euler steps over the single
+  step graph). All 4 components are individually quantizable.
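The host-side sampler mentioned in the last bullet follows the usual fixed-step Euler pattern. The sketch below is a generic illustration with an explicit velocity function; in this repo the update is computed inside the `vector_estimator` graph, which returns the updated latent directly. `TARGET` and `toy_velocity` are hypothetical stand-ins for the learned vector field.

```python
def euler_sample(x0, velocity, total_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / total_steps
    for step in range(total_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for the learned vector field: a straight-line
# flow from the current point toward a fixed target.
TARGET = [1.0, -2.0, 0.5]


def toy_velocity(x, t):
    return [(target - xi) / (1.0 - t) for target, xi in zip(TARGET, x)]


noise = [0.3, 0.8, -1.1]
sample = euler_sample(noise, toy_velocity)
print(sample)
```

For this straight-line field the final step lands on `TARGET` up to floating-point rounding, whatever the starting noise.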
+## License
+This conversion follows the original Supertone/supertonic-3 license
+(OpenRAIL). No `LICENSE` file ships in this commit; see the upstream model card for the terms.
+## Credits
+- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
+- CoreML conversion + auto-pad workflow: this repo
+INT4 quantized variants coming next.

fp16/duration_predictor.mlpackage/Data/com.apple.CoreML/model.mlmodel ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0e21c025c5b03d75cf4bee36f1143d313acc3f6126554fee795d3ce81215bb5b
+size 292296

fp16/duration_predictor.mlpackage/Data/com.apple.CoreML/weights/weight.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c09b368b045582ade6d153550d64d42ce3642fd27d270d539b923a0054c80d53
+size 15128768

fp16/duration_predictor.mlpackage/Manifest.json ADDED

	@@ -0,0 +1,18 @@

+{
+    "fileFormatVersion": "1.0.0",
+    "itemInfoEntries": {
+        "CC4B4953-3EB5-4D7E-A01F-10DBAF546402": {
+            "author": "com.apple.CoreML",
+            "description": "CoreML Model Weights",
+            "name": "weights",
+            "path": "com.apple.CoreML/weights"
+        },
+        "DD17ECC9-71D9-4AF3-92F7-F6695D27F1F9": {
+            "author": "com.apple.CoreML",
+            "description": "CoreML Model Specification",
+            "name": "model.mlmodel",
+            "path": "com.apple.CoreML/model.mlmodel"
+        }
+    },
+    "rootModelIdentifier": "DD17ECC9-71D9-4AF3-92F7-F6695D27F1F9"
+}

fp16/text_encoder.mlpackage/Data/com.apple.CoreML/model.mlmodel ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0ec80244148794cc813893607a52af0b55c7a7596649039cc142b686944d0de9
+size 540341

fp16/text_encoder.mlpackage/Data/com.apple.CoreML/weights/weight.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6f94addefeb3a05c1c5f28ff594cb976fc50f8596c982fb37f53c6ff200c2f14
+size 70399168

fp16/text_encoder.mlpackage/Manifest.json ADDED

	@@ -0,0 +1,18 @@

+{
+    "fileFormatVersion": "1.0.0",
+    "itemInfoEntries": {
+        "1FEF1B1A-5400-4BCB-807C-F65684E0B270": {
+            "author": "com.apple.CoreML",
+            "description": "CoreML Model Specification",
+            "name": "model.mlmodel",
+            "path": "com.apple.CoreML/model.mlmodel"
+        },
+        "CF6B5B08-E226-47FF-9857-965147569173": {
+            "author": "com.apple.CoreML",
+            "description": "CoreML Model Weights",
+            "name": "weights",
+            "path": "com.apple.CoreML/weights"
+        }
+    },
+    "rootModelIdentifier": "1FEF1B1A-5400-4BCB-807C-F65684E0B270"
+}

fp16/vector_estimator.mlpackage/Data/com.apple.CoreML/model.mlmodel ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f02c1334d3bbbcc96718dd862d9ef6252d0b27dd9915b1402062ffc13c965159
+size 362380

fp16/vector_estimator.mlpackage/Data/com.apple.CoreML/weights/weight.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dfb90a6f12980209b627aeca9641866826a3085ac6dda6590d19847806c8d4f0
+size 134491840

fp16/vector_estimator.mlpackage/Manifest.json ADDED

	@@ -0,0 +1,18 @@

+{
+    "fileFormatVersion": "1.0.0",
+    "itemInfoEntries": {
+        "2441BE30-7851-43DD-8CF1-66C3121FBF9C": {
+            "author": "com.apple.CoreML",
+            "description": "CoreML Model Weights",
+            "name": "weights",
+            "path": "com.apple.CoreML/weights"
+        },
+        "F97584CB-19BF-403B-80AF-E194801E6AC9": {
+            "author": "com.apple.CoreML",
+            "description": "CoreML Model Specification",
+            "name": "model.mlmodel",
+            "path": "com.apple.CoreML/model.mlmodel"
+        }
+    },
+    "rootModelIdentifier": "F97584CB-19BF-403B-80AF-E194801E6AC9"
+}

fp16/vocoder.mlpackage/Data/com.apple.CoreML/model.mlmodel ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:71ab85bf16a8e088a5511c4dbcc67e7b167563d643d22d4dc3f873100359e1d7
+size 70325

fp16/vocoder.mlpackage/Data/com.apple.CoreML/weights/weight.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b45e4c17a45cc4de30dcedae9f662548aa35f8f9763632c342b8bbef3089fcee
+size 50672512

fp16/vocoder.mlpackage/Manifest.json ADDED

	@@ -0,0 +1,18 @@

+{
+    "fileFormatVersion": "1.0.0",
+    "itemInfoEntries": {
+        "1AAA4924-9DF4-4960-8DFF-D575A8583886": {
+            "author": "com.apple.CoreML",
+            "description": "CoreML Model Weights",
+            "name": "weights",
+            "path": "com.apple.CoreML/weights"
+        },
+        "250BB268-FD46-4919-822F-7B1BB10FCECC": {
+            "author": "com.apple.CoreML",
+            "description": "CoreML Model Specification",
+            "name": "model.mlmodel",
+            "path": "com.apple.CoreML/model.mlmodel"
+        }
+    },
+    "rootModelIdentifier": "250BB268-FD46-4919-822F-7B1BB10FCECC"
+}

inference.py ADDED

	@@ -0,0 +1,194 @@

+"""End-to-end TTS inference using the 4 CoreML components.
+Pipeline (mirrors supertonic.core.Supertonic):
+    text -> tokenize
+        -> duration_predictor -> frame count
+        -> text_encoder       -> text embedding
+        -> sample noisy latent ~ N(0, I)
+        -> vector_estimator x 8 (flow-matching ODE step, runs on ANE)
+        -> vocoder -> 44.1 kHz waveform
+All four mlpackages are static-shape buckets at T=L=320. The driver pads
+inputs to that bucket and trims outputs.
+The supertonic-3 model truncates long prompts at its content limit
+(~13.7s natural; CoreML's bucket-leak extends this to ~16.7s but still
+short for long inputs). The `--auto-pad` mode does a two-pass synthesis
+(once unpadded to find the natural endpoint, once with a long filler
+sentence appended that gives the model more frames to render the full
+original prompt), then trims at the silence gap between original and
+appended content. Recommended for prompts longer than ~5s.
+Usage:
+    python inference.py --text "Hello, world." --voice F1 --lang en
+    python inference.py --text "<longer prompt>" --voice F5 --lang en --auto-pad
+"""
+from __future__ import annotations
+import argparse
+import json
+import sys
+import time
+from pathlib import Path
+import coremltools as ct
+import numpy as np
+import soundfile as sf
+HERE = Path(__file__).parent
+T_BUCKET = 320
+L_BUCKET = 320
+SAMPLE_RATE = 44_100
+LATENT_DIM = 24
+CHUNK_COMPRESS_FACTOR = 6
+BASE_CHUNK_SIZE = 512
+DEFAULT_TOTAL_STEPS = 8
+DEFAULT_SPEED = 1.05
+DEFAULT_AUTO_PAD = " And with that, the gentle silence wrapped itself around the room."
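+def _pad(arr: np.ndarray, axis: int, target: int) -> np.ndarray:
+    """Zero-pad ``arr`` on the right along ``axis`` up to ``target`` entries."""
+    size = arr.shape[axis]
+    if size > target:
+        # The mlpackages are fixed-shape; oversize inputs cannot be fed to them.
+        raise ValueError(f"axis {axis} has length {size}, exceeds bucket size {target}")
+    if size == target:
+        return arr
+    pad = [(0, 0)] * arr.ndim
+    pad[axis] = (0, target - size)
+    return np.pad(arr, pad)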
+def _load_voice(name: str) -> tuple[np.ndarray, np.ndarray]:
+    """Load a voice style and return its (style_ttl, style_dp) arrays."""
+    j = json.loads((HERE / "voice_styles" / f"{name}.json").read_text())
+    def r(part):
+        return np.array(part["data"], dtype=np.float32).reshape(*part["dims"])
+    return r(j["style_ttl"]), r(j["style_dp"])
+def _load_tokenizer(indexer_path: Path):
+    """Reuse the official supertonic UnicodeProcessor (handles the 31
+    languages, abbreviation expansion, punctuation rules, etc.).
+    Install with: pip install supertonic
+    """
+    try:
+        from supertonic.core import UnicodeProcessor
+    except ImportError as e:
+        raise RuntimeError(
+            "supertonic package is required for tokenization. "
+            "Install with: pip install supertonic"
+        ) from e
+    return UnicodeProcessor(str(indexer_path))
+def _last_loud_window(audio: np.ndarray, thresh: float = 0.025, win_s: float = 0.05) -> int:
+    """Index of the last ``win_s``-second window whose RMS exceeds ``thresh``
+    (0 if no window does)."""
+    win = int(win_s * SAMPLE_RATE)
+    n = len(audio) // win
+    rms = np.sqrt(np.mean(audio[: n * win].reshape(n, win) ** 2, axis=1))
+    loud = np.where(rms > thresh)[0]
+    return int(loud[-1]) if len(loud) else 0
+def trim_padded(unpad: np.ndarray, padded: np.ndarray) -> np.ndarray:
+    """Trim padded synthesis at the longest clean silence between original
+    prompt and appended suffix. Tail-pad with 0.5 s of true silence."""
+    thresh = 0.025  # RMS level below which a 50 ms window counts as silent
+    win = int(0.05 * SAMPLE_RATE)
+    n = len(padded) // win
+    rms = np.sqrt(np.mean(padded[: n * win].reshape(n, win) ** 2, axis=1))
+    floor = _last_loud_window(unpad)       # last loud window of pass 1 (t_orig)
+    ceil_ = _last_loud_window(padded) + 1  # one past the last loud window of pass 2
+    # Collect silent runs of at least two windows as (start_window, length).
+    candidates = []
+    j = floor
+    while j < ceil_ - 1:
+        if rms[j] < thresh and rms[j + 1] < thresh:
+            start = j
+            while j < ceil_ and rms[j] < thresh:
+                j += 1
+            candidates.append((start, j - start))
+        else:
+            j += 1
+    if not candidates:
+        return padded[: ceil_ * win]
+    # Longest run wins; ties go to the earliest run.
+    start_win, _ = max(candidates, key=lambda c: (c[1], -c[0]))
+    end_samples = start_win * win
+    out = padded[:end_samples].copy()
+    fade = min(int(0.06 * SAMPLE_RATE), len(out))
+    out[-fade:] *= np.linspace(1.0, 0.0, fade, dtype=np.float32)
+    return np.concatenate([out, np.zeros(int(0.5 * SAMPLE_RATE), dtype=np.float32)])
+class Supertonic3CoreML:
+    def __init__(self, quant: str = "fp16"):
+        d = HERE / quant
+        self.dp = ct.models.MLModel(str(d / "duration_predictor.mlpackage"))
+        self.te = ct.models.MLModel(str(d / "text_encoder.mlpackage"))
+        self.ve = ct.models.MLModel(str(d / "vector_estimator.mlpackage"))
+        self.voc = ct.models.MLModel(str(d / "vocoder.mlpackage"))
+        self.tok = _load_tokenizer(HERE / "unicode_indexer.json")
+    def _synth(self, text: str, voice: str, lang: str, seed: int,
+               total_steps: int, speed: float, full_bucket: bool) -> np.ndarray:
+        text_ids, text_mask = self.tok([text], lang)
+        text_mask = text_mask.astype(np.float32)  # text_ids are cast to int32 when padded
+        style_ttl, style_dp = _load_voice(voice)
+        text_ids_p = _pad(text_ids.astype(np.int32), 1, T_BUCKET)
+        text_mask_p = _pad(text_mask, 2, T_BUCKET)
+        dur = float(self.dp.predict({"text_ids": text_ids_p, "style_dp": style_dp,
+                                     "text_mask": text_mask_p})["duration"][0]) / speed
+        text_emb = self.te.predict({"text_ids": text_ids_p, "style_ttl": style_ttl,
+                                    "text_mask": text_mask_p})["text_emb"]
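+        samples_per_frame = BASE_CHUNK_SIZE * CHUNK_COMPRESS_FACTOR
+        # Ceil-divide the predicted duration into latent frames, clamped to [1, L_BUCKET].
+        n_frames = (int(dur * SAMPLE_RATE) + samples_per_frame - 1) // samples_per_frame
+        L_real = max(1, min(L_BUCKET, n_frames))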
+        np.random.seed(seed)
+        xt = (np.random.randn(1, LATENT_DIM * CHUNK_COMPRESS_FACTOR, L_real)).astype(np.float32)
+        latent_mask = np.ones((1, 1, L_real), dtype=np.float32)
+        xt = xt * latent_mask
+        xt = _pad(xt, 2, L_BUCKET)
+        latent_mask = _pad(latent_mask, 2, L_BUCKET)
+        total_step_arr = np.array([float(total_steps)], dtype=np.float32)
+        for step in range(total_steps):
+            xt = self.ve.predict({
+                "noisy_latent": xt, "text_emb": text_emb, "style_ttl": style_ttl,
+                "text_mask": text_mask_p, "latent_mask": latent_mask,
+                "current_step": np.array([float(step)], dtype=np.float32),
+                "total_step": total_step_arr,
+            })["denoised_latent"]
+        wav = self.voc.predict({"latent": xt})["wav_tts"][0]
+        if full_bucket:
+            return wav
+        return wav[: L_real * CHUNK_COMPRESS_FACTOR * BASE_CHUNK_SIZE]
+    def synthesize(self, text: str, voice: str = "F1", lang: str = "en", seed: int = 0,
+                   total_steps: int = DEFAULT_TOTAL_STEPS, speed: float = DEFAULT_SPEED,
+                   auto_pad: str | None = DEFAULT_AUTO_PAD) -> np.ndarray:
+        """Synthesize speech. With ``auto_pad`` set, runs the 2-pass auto-pad
+        flow for full content rendering on longer prompts."""
+        if auto_pad is None:
+            return self._synth(text, voice, lang, seed, total_steps, speed, full_bucket=False)
+        unpad_audio = self._synth(text, voice, lang, seed, total_steps, speed, full_bucket=True)
+        pad_audio = self._synth(text + auto_pad, voice, lang, seed, total_steps, speed, full_bucket=True)
+        return trim_padded(unpad_audio, pad_audio)
+def main() -> int:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--text", required=True, help="Text to synthesize")
+    ap.add_argument("--voice", default="F1", choices=[f"F{i}" for i in range(1, 6)] + [f"M{i}" for i in range(1, 6)])
+    ap.add_argument("--lang", default="en")
+    ap.add_argument("--seed", type=int, default=0)
+    ap.add_argument("--total-steps", type=int, default=DEFAULT_TOTAL_STEPS)
+    ap.add_argument("--auto-pad", nargs="?", const=DEFAULT_AUTO_PAD, default=None,
+                    help="2-pass synthesis with filler suffix + auto-trim (recommended).")
+    ap.add_argument("--quant", default="fp16", choices=["fp16"])
+    ap.add_argument("--out", default="out.wav")
+    args = ap.parse_args()
+    t0 = time.time()
+    tts = Supertonic3CoreML(quant=args.quant)
+    print(f"Loaded models in {time.time() - t0:.2f}s")
+    t0 = time.time()
+    audio = tts.synthesize(args.text, voice=args.voice, lang=args.lang, seed=args.seed,
+                           total_steps=args.total_steps, auto_pad=args.auto_pad)
+    dur = len(audio) / SAMPLE_RATE
+    sf.write(args.out, audio, SAMPLE_RATE)
+    print(f"Synthesized {dur:.2f}s of audio in {time.time() - t0:.2f}s -> {args.out}")
+    return 0
+if __name__ == "__main__":
+    sys.exit(main())
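The frame-count arithmetic in `_synth` can be checked on its own. The constants below are copied from `inference.py`; `latent_frames` and `bucket_seconds` are illustrative names for this sketch. One latent frame covers 512 × 6 = 3072 samples, about 69.7 ms at 44.1 kHz.

```python
SAMPLE_RATE = 44_100
BASE_CHUNK_SIZE = 512
CHUNK_COMPRESS_FACTOR = 6
L_BUCKET = 320

SAMPLES_PER_FRAME = BASE_CHUNK_SIZE * CHUNK_COMPRESS_FACTOR  # 3072


def latent_frames(duration_s: float) -> int:
    """Ceil-divide a duration into latent frames, clamped to [1, L_BUCKET]."""
    samples = int(duration_s * SAMPLE_RATE)
    frames = (samples + SAMPLES_PER_FRAME - 1) // SAMPLES_PER_FRAME
    return max(1, min(L_BUCKET, frames))


def bucket_seconds() -> float:
    """Audio capacity of the full static bucket, in seconds."""
    return L_BUCKET * SAMPLES_PER_FRAME / SAMPLE_RATE


print(latent_frames(2.0))          # frames needed for a 2 s utterance
print(round(bucket_seconds(), 1))  # capacity of the T=L=320 bucket
```

The bucket capacity works out to the 22.3 s quoted in the README's conversion notes, and any predicted duration beyond it is clamped to 320 frames.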

tts.json ADDED

	@@ -0,0 +1,311 @@

+{
+    "tts_version": "v1.7.3",
+    "split": "opensource-multilingual",
+    "ttl": {
+        "latent_dim": 24,
+        "chunk_compress_factor": 6,
+        "batch_expander": {
+            "n_batch_expand": 6
+        },
+        "normalizer": {
+            "scale": 0.25
+        },
+        "text_encoder": {
+            "n_langs": 0,
+            "lang_emb_dim": 0,
+            "text_embedder": {
+                "char_emb_dim": 256
+            },
+            "convnext": {
+                "idim": 256,
+                "ksz": 5,
+                "intermediate_dim": 1024,
+                "num_layers": 6,
+                "dilation_lst": [
+                    1,
+                    1,
+                    2,
+                    2,
+                    4,
+                    4
+                ]
+            },
+            "attn_encoder": {
+                "hidden_channels": 256,
+                "filter_channels": 1024,
+                "n_heads": 4,
+                "n_layers": 4,
+                "p_dropout": 0.0
+            },
+            "proj_out": {
+                "idim": 256,
+                "odim": 256
+            }
+        },
+        "flow_matching": {
+            "sig_min": 1e-08
+        },
+        "style_encoder": {
+            "proj_in": {
+                "ldim": 24,
+                "chunk_compress_factor": 6,
+                "odim": 256
+            },
+            "convnext": {
+                "idim": 256,
+                "ksz": 5,
+                "intermediate_dim": 1024,
+                "num_layers": 6,
+                "dilation_lst": [
+                    1,
+                    1,
+                    1,
+                    1,
+                    1,
+                    1
+                ]
+            },
+            "style_token_layer": {
+                "input_dim": 256,
+                "n_style": 50,
+                "style_key_dim": 256,
+                "style_value_dim": 256,
+                "prototype_dim": 256,
+                "n_units": 256,
+                "n_heads": 2
+            }
+        },
+        "speech_prompted_text_encoder": {
+            "text_dim": 256,
+            "style_dim": 256,
+            "n_units": 256,
+            "n_heads": 2
+        },
+        "uncond_masker": {
+            "prob_both_uncond": 0.04,
+            "prob_text_uncond": 0.01,
+            "std": 0.1,
+            "text_dim": 256,
+            "n_style": 50,
+            "style_key_dim": 256,
+            "style_value_dim": 256
+        },
+        "vector_field": {
+            "n_langs": 0,
+            "lang_emb_dim": 0,
+            "proj_in": {
+                "ldim": 24,
+                "chunk_compress_factor": 6,
+                "odim": 512
+            },
+            "time_encoder": {
+                "time_dim": 64,
+                "hdim": 256
+            },
+            "main_blocks": {
+                "n_blocks": 4,
+                "time_cond_layer": {
+                    "idim": 512,
+                    "time_dim": 64
+                },
+                "style_cond_layer": {
+                    "idim": 512,
+                    "style_dim": 256
+                },
+                "text_cond_layer": {
+                    "idim": 512,
+                    "text_dim": 256,
+                    "n_heads": 8,
+                    "n_units": 512,
+                    "use_residual": true,
+                    "rotary_base": 10000,
+                    "rotary_scale": 10
+                },
+                "convnext_0": {
+                    "idim": 512,
+                    "ksz": 5,
+                    "intermediate_dim": 2048,
+                    "num_layers": 4,
+                    "dilation_lst": [
+                        1,
+                        2,
+                        4,
+                        8
+                    ]
+                },
+                "convnext_1": {
+                    "idim": 512,
+                    "ksz": 5,
+                    "intermediate_dim": 2048,
+                    "num_layers": 1,
+                    "dilation_lst": [
+                        1
+                    ]
+                },
+                "convnext_2": {
+                    "idim": 512,
+                    "ksz": 5,
+                    "intermediate_dim": 2048,
+                    "num_layers": 1,
+                    "dilation_lst": [
+                        1
+                    ]
+                }
+            },
+            "last_convnext": {
+                "idim": 512,
+                "ksz": 5,
+                "intermediate_dim": 2048,
+                "num_layers": 4,
+                "dilation_lst": [
+                    1,
+                    1,
+                    1,
+                    1
+                ]
+            },
+            "proj_out": {
+                "idim": 512,
+                "chunk_compress_factor": 6,
+                "ldim": 24
+            }
+        }
+    },
+    "ae": {
+        "sample_rate": 44100,
+        "n_delay": 0,
+        "base_chunk_size": 512,
+        "chunk_compress_factor": 1,
+        "ldim": 24,
+        "encoder": {
+            "spec_processor": {
+                "n_fft": 2048,
+                "win_length": 2048,
+                "hop_length": 512,
+                "n_mels": 228,
+                "sample_rate": 44100,
+                "eps": 1e-05,
+                "norm_mean": 0.0,
+                "norm_std": 1.0
+            },
+            "ksz_init": 7,
+            "ksz": 7,
+            "num_layers": 10,
+            "dilation_lst": [
+                1,
+                1,
+                1,
+                1,
+                1,
+                1,
+                1,
+                1,
+                1,
+                1
+            ],
+            "intermediate_dim": 2048,
+            "idim": 1253,
+            "hdim": 512,
+            "odim": 24
+        },
+        "decoder": {
+            "ksz_init": 7,
+            "ksz": 7,
+            "num_layers": 10,
+            "dilation_lst": [
+                1,
+                2,
+                4,
+                1,
+                2,
+                4,
+                1,
+                1,
+                1,
+                1
+            ],
+            "intermediate_dim": 2048,
+            "idim": 24,
+            "hdim": 512,
+            "head": {
+                "idim": 512,
+                "hdim": 2048,
+                "odim": 512,
+                "ksz": 3
+            }
+        }
+    },
+    "dp": {
+        "latent_dim": 24,
+        "chunk_compress_factor": 6,
+        "normalizer": {
+            "scale": 1.0
+        },
+        "sentence_encoder": {
+            "char_emb_dim": 64,
+            "text_embedder": {
+                "char_emb_dim": 64
+            },
+            "convnext": {
+                "idim": 64,
+                "ksz": 5,
+                "intermediate_dim": 256,
+                "num_layers": 6,
+                "dilation_lst": [
+                    1,
+                    1,
+                    1,
+                    1,
+                    1,
+                    1
+                ]
+            },
+            "attn_encoder": {
+                "hidden_channels": 64,
+                "filter_channels": 256,
+                "n_heads": 2,
+                "n_layers": 2,
+                "p_dropout": 0.0
+            },
+            "proj_out": {
+                "idim": 64,
+                "odim": 64
+            }
+        },
+        "style_encoder": {
+            "proj_in": {
+                "ldim": 24,
+                "chunk_compress_factor": 6,
+                "odim": 64
+            },
+            "convnext": {
+                "idim": 64,
+                "ksz": 5,
+                "intermediate_dim": 256,
+                "num_layers": 4,
+                "dilation_lst": [
+                    1,
+                    1,
+                    1,
+                    1
+                ]
+            },
+            "style_token_layer": {
+                "input_dim": 64,
+                "n_style": 8,
+                "style_key_dim": 0,
+                "style_value_dim": 16,
+                "prototype_dim": 64,
+                "n_units": 64,
+                "n_heads": 2
+            }
+        },
+        "predictor": {
+            "sentence_dim": 64,
+            "n_style": 8,
+            "style_dim": 16,
+            "hdim": 128,
+            "n_layer": 2
+        }
+    }
+}
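The README attributes the bucket leak to ConvNeXt's dilated convolutions, and the `ksz` and `dilation_lst` values above give a rough idea of how far that mixing reaches. The sketch assumes one stride-1 dilated convolution per listed layer and ignores attention layers, `ksz_init`, and the decoder head, so the numbers are lower bounds on how far padded positions can influence real ones. Whether masking is applied between layers is not visible in this commit.

```python
def receptive_field(ksz, dilations):
    """Receptive field, in positions, of stacked stride-1 dilated convolutions."""
    return 1 + sum((ksz - 1) * d for d in dilations)


# Kernel sizes and dilation lists copied from tts.json.
STACKS = {
    "ttl.text_encoder.convnext": (5, [1, 1, 2, 2, 4, 4]),
    "ttl.vector_field.main_blocks.convnext_0": (5, [1, 2, 4, 8]),
    "ttl.vector_field.last_convnext": (5, [1, 1, 1, 1]),
    "ae.decoder": (7, [1, 2, 4, 1, 2, 4, 1, 1, 1, 1]),
}

for name, (ksz, dilations) in STACKS.items():
    print(name, receptive_field(ksz, dilations))
```

Each `main_blocks` entry repeats four times (`n_blocks: 4`), so the effective reach inside the vector field is larger than a single `convnext_0` stack suggests.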

unicode_indexer.json ADDED