Manmay Nakhashi committed on
Commit
1636761
·
1 Parent(s): c29ae29

Rebrand: DramaBox → LTX-2.3-Voice


- README title + frontmatter (title, emoji, license_link, short_description)
- All in-text references in docs and code
- model_downloader: DRAMABOX_REPO → LTX23_VOICE_REPO, default cache → ~/.cache/ltx-2.3-voice
- app.py title, prefix, log lines
- HF model+space repos are renamed to ResembleAI/LTX-2.3-Voice (HF auto-redirects old paths)

The on-disk safetensors filenames (`dramabox-dit-v1.safetensors`,
`dramabox-audio-components.safetensors`) stay as-is to avoid an 8 GB
re-upload; comment in model_downloader.py explains the leftover names.

README.md CHANGED
@@ -1,6 +1,6 @@
1
  ---
2
- title: DramaBox
3
- emoji: 🎭
4
  colorFrom: red
5
  colorTo: indigo
6
  sdk: gradio
@@ -9,34 +9,202 @@ app_file: app.py
9
  pinned: true
10
  license: other
11
  license_name: ltx-2-community
12
- license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
13
  hf_oauth: false
14
- short_description: Expressive TTS with voice cloning — DramaBox demo
15
  ---
16
 
17
- # DramaBox — Expressive TTS Demo
18
 
19
- Live demo of [`ResembleAI/Dramabox`](https://huggingface.co/ResembleAI/Dramabox). Write a scene prompt, optionally upload a 10-second voice reference, and generate. Audio is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth).
20
 
21
- The model checkpoints download automatically on first launch.
22
 
23
- ## Prompt format
24
 
25
  ```
26
- <speaker description>, "<dialogue>" <action direction> "<more dialogue>"
27
  ```
28
 
29
- - **Inside double quotes**: dialogue and phonetic sounds (`"Hahaha"`, `"Mmmmm"`, `"Ugh"`)
30
- - **Outside quotes**: stage directions (`She sighs.`, `He clears his throat.`)
31
- - **Avoid inside quotes**: `Ahem`, `Pfft`, `Sigh`, `Gasp`, `Cough` — the model will speak them literally.
32
 
33
- See the **Load an example prompt** dropdown for ready-made scene templates.
34
 
35
- ## Files
36
 
37
- `app.py` — Gradio UI
38
- - `src/inference_server.py` — warm `TTSServer` (single load, ~2.5s/request)
39
- - `src/inference.py` — CLI inference
40
- - `src/model_downloader.py` — auto-fetches model from HuggingFace
41
- - `ltx2/` — vendored LTX-2 pipelines
42
- - `requirements.txt` — Python deps (includes `resemble-perth`)
 
1
  ---
2
+ title: LTX-2.3-Voice
3
+ emoji: 🎙️
4
  colorFrom: red
5
  colorTo: indigo
6
  sdk: gradio
 
9
  pinned: true
10
  license: other
11
  license_name: ltx-2-community
12
+ license_link: https://huggingface.co/ResembleAI/LTX-2.3-Voice/blob/main/LICENSE
13
  hf_oauth: false
14
+ short_description: Expressive TTS with voice cloning — LTX-2.3-Voice demo
15
  ---
16
 
17
+ # LTX-2.3-Voice — Expressive TTS with Voice Cloning
18
 
19
+ Prompt-driven TTS with voice cloning, built as an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only** model. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses and transitions; an optional 10-second voice reference clones the target timbre.
20
 
21
+ | | |
22
+ |---|---|
23
+ | 🤗 **Model** | [`ResembleAI/LTX-2.3-Voice`](https://huggingface.co/ResembleAI/LTX-2.3-Voice) |
24
+ | 🎭 **Demo Space** | [`ResembleAI/LTX-2.3-Voice`](https://huggingface.co/spaces/ResembleAI/LTX-2.3-Voice) (ZeroGPU) |
25
+ | 📜 **License** | LTX-2 Community License — see [`LICENSE`](LICENSE) |
26
 
27
+ ## Models
28
 
29
+ Auto-downloaded from the HF model repo on first run.
30
+
31
+ | File | Size | Description |
32
+ |---|---|---|
33
+ | `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (LoRA already merged into base) |
34
+ | `dramabox-audio-components.safetensors` | 1.9 GB | Audio embeddings connector + audio text projection + audio VAE + vocoder |
35
+ | [`unsloth/gemma-3-12b-it-bnb-4bit`](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) | ~8 GB | Text encoder |
36
+
37
+ **VRAM**: ~24 GB peak · **Speed**: ~2.5 s / generation (warm server, H100)
38
+
39
+ ## Quick Start
40
+
41
+ ### Warm server (recommended)
42
+
43
+ ```python
44
+ from src.inference_server import TTSServer
45
+
46
+ server = TTSServer(device="cuda")
47
+
48
+ server.generate_to_file(
49
+ prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
50
+ output="output.wav",
51
+ voice_ref="reference.wav", # optional, 10+ seconds
52
+ )
53
+ ```
54
+
55
+ ### CLI
56
+
57
+ ```bash
58
+ python src/inference.py \
59
+ --voice-sample reference.wav \
60
+ --prompt 'A woman speaks warmly, "Hello, how are you today?"' \
61
+ --output output.wav \
62
+ --cfg-scale 2.5 --stg-scale 1.5
63
+ ```
64
+
65
+ ### Gradio app
66
+
67
+ ```bash
68
+ CUDA_VISIBLE_DEVICES=4 python app.py
69
+ ```
70
+
71
+ ## Inference Settings
72
+
73
+ | Parameter | Default | Notes |
74
+ |---|---|---|
75
+ | `cfg-scale` | 2.5 | Lower = more natural, higher = more text-faithful |
76
+ | `stg-scale` | 1.5 | Skip-token guidance |
77
+ | `rescale` | 0 | No rescaling |
78
+ | `modality` | 1 | No modality guidance |
79
+ | `duration-multiplier` | 1.1 | 10% breathing room on auto-estimated length |
80
+ | `steps` | 30 | Euler flow matching |
81
+
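As a rough illustration of what `cfg-scale` controls, the snippet below shows the textbook classifier-free guidance blend. It is a sketch, not code from this repo: `cfg_blend` and the toy vectors are hypothetical, and how `src/inference.py` combines this with `stg-scale`, `rescale`, or `modality` is not shown in this commit.

```python
from typing import List


def cfg_blend(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Textbook classifier-free guidance: move away from the unconditional
    prediction and toward the text-conditioned one by a factor of `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]    # toy text-conditioned prediction
uncond = [0.0, 1.0]  # toy unconditional prediction

same_as_cond = cfg_blend(cond, uncond, 1.0)  # scale 1.0 adds no extra guidance
guided = cfg_blend(cond, uncond, 2.5)        # the default listed in the table
```

With a scale of 1.0 the blend returns the conditioned prediction unchanged; larger values exaggerate the difference between the two predictions, which is consistent with the "higher = more text-faithful" note in the table.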
82
+ ## Prompt Writing Guide
83
+
84
+ **Structure:** `<speaker description>, "<dialogue>" <action direction> "<more dialogue>"`
85
+
86
+ **Inside quotes** (model produces actual sounds):
87
+ - Laughs: `"Hahaha"` `"Hehehe"` (always one word, never separated)
88
+ - Sounds: `"Mmmmm"` `"Ugh"` `"Argh"` `"Ahhh"` `"Hmm"`
89
+
90
+ **Outside quotes** (stage directions):
91
+ - `She sighs deeply.` · `He gulps nervously.` · `A long pause.`
92
+ - `Her voice cracks.` · `He clears his throat.` · `She scoffs.`
93
+
94
+ **Avoid inside quotes** (model speaks them literally): `Ahem`, `Pfft`, `Sigh`, `Gasp`, `Cough`.
95
+
96
+ **Tips**
97
+ - Match gender/age in the speaker description to the voice reference
98
+ - Break long dialogue into segments with action directions in between
99
+ - End the prompt at the last closing quote mark (no trailing description)
100
+
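The rules in this guide can be checked mechanically before sending a prompt. The following is a minimal sketch under that assumption; `check_prompt` and `LITERAL_WORDS` are hypothetical names, not part of this repo, and the check only covers the rules spelled out above.

```python
import re
from typing import List

# Words the guide says are spoken literally when placed inside quotes.
LITERAL_WORDS = {"ahem", "pfft", "sigh", "gasp", "cough"}


def check_prompt(prompt: str) -> List[str]:
    """Return human-readable warnings; an empty list means no issue was found."""
    warnings = []
    if prompt.count('"') % 2 != 0:
        warnings.append("unbalanced double quotes")
    if not prompt.rstrip().endswith('"'):
        warnings.append("prompt should end at the last closing quote")
    # Text between each pair of double quotes is treated as dialogue.
    for dialogue in re.findall(r'"([^"]*)"', prompt):
        for word in re.findall(r"[A-Za-z]+", dialogue):
            if word.lower() in LITERAL_WORDS:
                warnings.append(f"'{word}' inside quotes will be spoken literally")
    return warnings


good = 'A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"'
bad = 'A man speaks, "Ahem, excuse me." He walks away.'
good_warnings = check_prompt(good)
bad_warnings = check_prompt(bad)
```

The second prompt trips two rules: `Ahem` sits inside the quotes, and the prompt continues past the last closing quote.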
101
+ ## Watermarking
102
+
103
+ Every audio output from `inference.py` and `inference_server.TTSServer.generate_to_file` is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth) — an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
104
+
105
+ ```python
106
+ import perth, librosa
107
+ wav, sr = librosa.load("output.wav", sr=None, mono=True)
108
+ detector = perth.PerthImplicitWatermarker()
109
+ print(detector.get_watermark(wav, sample_rate=sr)) # confidence ≈ 1.0
110
+ ```
111
+
112
+ Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable for debugging.
113
+
114
+ ## Training a LoRA on top of LTX-2.3-Voice
115
+
116
+ You can fine-tune your own LoRA using LTX-2.3-Voice itself as the base — no need to start from raw LTX-2.3. Useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.
117
+
118
+ ### 1. Prepare your index file
119
+
120
+ The preprocessor accepts four formats. The `text` field is the **target transcript**; if you want to attach a scene-style prompt (the part the model conditions on at inference time), prepend it to the transcript in the same format the model was trained on:
121
+
122
+ > `A woman speaks warmly, "<your transcript here>"`
123
+
124
+ Both forms are supported — with or without the prompt wrapper. Without the wrapper the model treats the entry as plain text-to-speech.
125
+
126
+ **Format A — `manifest` (JSONL)** — recommended for new datasets:
127
+
128
+ ```jsonl
129
+ {"audio_filepath": "wavs/spk01_001.wav", "text": "A woman speaks warmly, \"Hello, how are you today?\""}
130
+ {"audio_filepath": "wavs/spk01_002.wav", "text": "Hello, how are you today?"}
131
+ {"audio_filepath": "wavs/spk02_001.flac", "text": "An exhausted father sighs, \"Sweetie, daddy is asking very nicely.\"", "duration": 4.7}
132
+ ```
133
+
134
+ Fields: `audio_filepath` (or `audio_path`) is required, `text` (or `transcript`) is required, `duration` is optional.
135
+
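Hand-writing the escaped quotes in a JSONL manifest is error-prone, so it can help to generate the lines instead. A minimal sketch, assuming only the field names listed above; `manifest_line` is a hypothetical helper, not something shipped in `src/`.

```python
import json
from typing import Optional


def manifest_line(audio_filepath: str, transcript: str,
                  scene: Optional[str] = None,
                  duration: Optional[float] = None) -> str:
    """Build one JSONL manifest entry; json.dumps escapes the inner quotes."""
    # With a scene description, wrap the transcript in the prompt format.
    text = f'{scene}, "{transcript}"' if scene else transcript
    entry = {"audio_filepath": audio_filepath, "text": text}
    if duration is not None:
        entry["duration"] = duration
    return json.dumps(entry, ensure_ascii=False)


prompted = manifest_line("wavs/spk01_001.wav", "Hello, how are you today?",
                         scene="A woman speaks warmly")
plain = manifest_line("wavs/spk01_002.wav", "Hello, how are you today?")
```

Writing one such string per line to `your_data.jsonl` yields entries in the same shape as the first two examples in Format A.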
136
+ **Format B — `tsv`** — simplest, one line per sample:
137
+
138
+ ```
139
+ wavs/spk01_001.wav A woman speaks warmly, "Hello, how are you today?"
140
+ wavs/spk01_002.wav Hello, how are you today?
141
+ ```
142
+
143
+ **Format C — `gemini_synthetic`** — `~`-separated, used for prompted synthetic data:
144
+
145
+ ```
146
+ id~speaker~lang~sr~samples~dur~phonemes~text
147
+ spk01_001~spk01~en~24000~93000~3.875~_~A woman speaks warmly, "Hello, how are you today?"
148
+ ```
149
+
150
+ **Format D — `libriheavy`** — `~`-separated, for unprompted text-only data:
151
+
152
+ ```
153
+ id~speaker~lang~samples~dur_ms~phonemes~text
154
+ spk01_001~spk01~en~93000~3875~_~Hello, how are you today?
155
  ```
156
+
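The two `~`-separated formats differ only in their columns, which the sketch below makes explicit. The column lists are copied from the header lines above; `parse_tilde_line` is a hypothetical helper, and whether `src/preprocess.py` tolerates a `~` inside the text field is an assumption of this sketch, not something the README states.

```python
from typing import Dict, List

# Column orders copied from the two header lines in the README.
GEMINI_COLS = ["id", "speaker", "lang", "sr", "samples", "dur", "phonemes", "text"]
LIBRIHEAVY_COLS = ["id", "speaker", "lang", "samples", "dur_ms", "phonemes", "text"]


def parse_tilde_line(line: str, columns: List[str]) -> Dict[str, str]:
    """Split on `~`, capping the split so a `~` inside the text survives."""
    parts = line.rstrip("\n").split("~", len(columns) - 1)
    if len(parts) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(parts)}")
    return dict(zip(columns, parts))


c = parse_tilde_line(
    'spk01_001~spk01~en~24000~93000~3.875~_~A woman speaks warmly, "Hello, how are you today?"',
    GEMINI_COLS,
)
d = parse_tilde_line("spk01_001~spk01~en~93000~3875~_~Hello, how are you today?", LIBRIHEAVY_COLS)

# Format C stores the duration in seconds and also carries the sample rate.
seconds_from_samples = int(c["samples"]) / int(c["sr"])
```

In the Format C example the duration column agrees with `samples / sr`; Format D carries the same duration in milliseconds and has no sample-rate column.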
157
+ ### 2. Preprocess
158
+
159
+ ```bash
160
+ python src/preprocess.py \
161
+ --dataset-type manifest \
162
+ --index your_data.jsonl \
163
+ --audio-dir /path/to/wavs \
164
+ --output-dir /path/to/preprocessed/ \
165
+ --checkpoint /path/to/dramabox-audio-components.safetensors \
166
+ --gemma-root /path/to/gemma-3-12b-it-bnb-4bit/ \
167
+ --max-duration 20.0 --min-duration 2.0
168
+ ```
169
+
170
+ Output layout (training-ready `.pt` files):
171
+
172
+ ```
173
+ preprocessed/
174
+ ├── audio_latents/sample_*.pt # Audio VAE-encoded latents
175
+ ├── conditions/sample_*.pt # Gemma text embeddings
176
+ └── latents/sample_*.pt # Dummy video latents (placeholder)
177
+ ```
178
+
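Before training it can be worth confirming that every sample has all three tensors. This is a minimal sketch that assumes a sample uses the same `sample_*` stem in each subdirectory, which the layout above suggests but does not state; `missing_samples` is a hypothetical helper.

```python
import tempfile
from pathlib import Path
from typing import Dict, Set

SUBDIRS = ("audio_latents", "conditions", "latents")


def missing_samples(root: str) -> Dict[str, Set[str]]:
    """Per subdirectory, the sample stems that exist elsewhere but are absent there."""
    stems = {
        name: {p.stem for p in (Path(root) / name).glob("sample_*.pt")}
        for name in SUBDIRS
    }
    everything = set().union(*stems.values())
    return {name: everything - found for name, found in stems.items() if everything - found}


# Demo on a throwaway directory: sample_1 has an audio latent but nothing else.
root = tempfile.mkdtemp()
for name in SUBDIRS:
    (Path(root) / name).mkdir()
    (Path(root) / name / "sample_0.pt").touch()
(Path(root) / "audio_latents" / "sample_1.pt").touch()
report = missing_samples(root)
```

An empty report means the three directories are in step; any entry names the stems that are missing from that directory.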
179
+ ### 3. Train
180
+
181
+ Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the LTX-2.3-Voice files, then launch with HuggingFace `accelerate`. Any flag passed on the CLI overrides the YAML.
182
+
183
+ ```bash
184
+ accelerate launch src/train.py \
185
+ --config configs/training_args.example.yaml
186
+ ```
187
+
188
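+ The trainer attaches a fresh LoRA to the audio branch on top of the LTX-2.3-Voice checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` × 48 transformer blocks (6 modules × 48 = 288 LoRA pairs total). Defaults when launched with the example config: rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps. Without `--config`, `src/train.py` falls back to its own argparse defaults (dropout 0.0, lr 3e-5, 30000 steps, 100 warmup steps).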
189
+
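The count of 288 LoRA pairs follows directly from the target list. The sketch below enumerates it; the `transformer_blocks.{i}` prefix is a guess at the module path and is not confirmed by this commit, while the leaf names are taken from the paragraph above.

```python
ATTN_LEAVES = ["to_q", "to_k", "to_v", "to_out.0"]
FF_LEAVES = ["net.0.proj", "net.2"]
NUM_BLOCKS = 48

# "transformer_blocks.{i}" is a hypothetical prefix used only for illustration.
targets = [
    f"transformer_blocks.{i}.{branch}.{leaf}"
    for i in range(NUM_BLOCKS)
    for branch, leaves in (("audio_attn1", ATTN_LEAVES), ("audio_ff", FF_LEAVES))
    for leaf in leaves
]

lora_rank, lora_alpha = 128, 128
scale = lora_alpha / lora_rank  # conventional LoRA scaling factor
```

Four attention projections plus two feed-forward projections give six targets per block, and with rank equal to alpha the conventional LoRA scale works out to 1.0.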
190
+ To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML — `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.
191
+
192
+ ### Inference with your trained LoRA
193
+
194
+ ```bash
195
+ python src/inference.py \
196
+ --lora /path/to/your/lora_step_5000.safetensors \
197
+ --voice-sample reference.wav \
198
+ --prompt 'A woman speaks warmly, "..."' \
199
+ --output output.wav
200
  ```
201
 
202
+ Always load the LoRA at inference rather than pre-merging it; pre-merged checkpoints have produced degraded output in our runs.
203
+
204
+ ## Language
205
 
206
+ English.
207
 
208
+ ## License
209
 
210
+ Built on [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks. Distributed under the LTX-2 Community License Agreement; see [`LICENSE`](LICENSE).
 
 
 
 
 
app.py CHANGED
@@ -1,5 +1,5 @@
1
  #!/usr/bin/env python3
2
- """DramaBox — Gradio demo (warm server).
3
 
4
  Loads the warm TTSServer once, then handles requests at ~2.5 s each. All
5
  generated audio is invisibly watermarked with Resemble Perth before being
@@ -21,14 +21,14 @@ from model_downloader import get_all_paths # noqa: E402
21
 
22
 
23
  logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
24
- logging.info("Fetching DramaBox checkpoints from HuggingFace (cached after first run)...")
25
  PATHS = get_all_paths()
26
 
27
  # Module-level warm load (same pattern as IndexTTS-2-Demo on ZeroGPU). The
28
  # `spaces` package patches torch so that .to("cuda") at import time pins the
29
  # weights into ZeroGPU's shared memory; each @spaces.GPU call then maps them
30
  # onto the actual GPU instantly. First user request is ~2.5 s instead of ~30 s.
31
- logging.info("Loading DramaBox warm server (Gemma + DiT + VAE + Decoder)...")
32
  tts = TTSServer(
33
  checkpoint=PATHS["transformer"],
34
  full_checkpoint=PATHS["audio_components"],
@@ -112,7 +112,7 @@ def on_generate(prompt: str, audio_ref, cfg: float, stg: float, dur_mult: float,
112
  raise gr.Error("Prompt is empty.")
113
  t0 = time.time()
114
  ref_path = audio_ref if audio_ref and os.path.exists(str(audio_ref)) else None
115
- output = tempfile.mktemp(suffix=".wav", prefix="dramabox_")
116
  tts.generate_to_file(
117
  prompt=prompt,
118
  output=output,
@@ -127,12 +127,12 @@ def on_generate(prompt: str, audio_ref, cfg: float, stg: float, dur_mult: float,
127
 
128
  # ── UI ──────────────────────────────────────────────────────────────────────
129
  with gr.Blocks(
130
- title="DramaBox — Expressive TTS",
131
  theme=gr.themes.Default(),
132
  css=".prompt-box textarea { font-size: 14px !important; line-height: 1.5 !important; }",
133
  analytics_enabled=False,
134
  ) as app:
135
- gr.Markdown("# 🎭 DramaBox — Expressive TTS with Voice Cloning")
136
  gr.Markdown(
137
  "Write a scene prompt, optionally upload a 10-second voice reference, "
138
  "and generate. Audio is automatically watermarked with "
 
1
  #!/usr/bin/env python3
2
+ """LTX-2.3-Voice — Gradio demo (warm server).
3
 
4
  Loads the warm TTSServer once, then handles requests at ~2.5 s each. All
5
  generated audio is invisibly watermarked with Resemble Perth before being
 
21
 
22
 
23
  logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
24
+ logging.info("Fetching LTX-2.3-Voice checkpoints from HuggingFace (cached after first run)...")
25
  PATHS = get_all_paths()
26
 
27
  # Module-level warm load (same pattern as IndexTTS-2-Demo on ZeroGPU). The
28
  # `spaces` package patches torch so that .to("cuda") at import time pins the
29
  # weights into ZeroGPU's shared memory; each @spaces.GPU call then maps them
30
  # onto the actual GPU instantly. First user request is ~2.5 s instead of ~30 s.
31
+ logging.info("Loading LTX-2.3-Voice warm server (Gemma + DiT + VAE + Decoder)...")
32
  tts = TTSServer(
33
  checkpoint=PATHS["transformer"],
34
  full_checkpoint=PATHS["audio_components"],
 
112
  raise gr.Error("Prompt is empty.")
113
  t0 = time.time()
114
  ref_path = audio_ref if audio_ref and os.path.exists(str(audio_ref)) else None
115
+ output = tempfile.mktemp(suffix=".wav", prefix="ltx23voice_")
116
  tts.generate_to_file(
117
  prompt=prompt,
118
  output=output,
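A side note on the `tempfile.mktemp` call in this hunk: the Python documentation marks `mktemp` as deprecated because it only returns a name, so another process can create that file before it is used. A minimal sketch of a safer alternative follows; `new_wav_path` is a hypothetical helper, and it assumes `generate_to_file` is happy to overwrite an existing empty file, which this commit does not confirm.

```python
import os
import tempfile


def new_wav_path(prefix: str = "ltx23voice_") -> str:
    """Create an empty temp file atomically and return its path."""
    fd, path = tempfile.mkstemp(suffix=".wav", prefix=prefix)
    os.close(fd)  # the writer reopens the path itself
    return path


path = new_wav_path()
```

Because `mkstemp` creates the file as part of choosing the name, the path cannot be claimed by someone else in between.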
 
127
 
128
  # ── UI ──────────────────────────────────────────────────────────────────────
129
  with gr.Blocks(
130
+ title="LTX-2.3-Voice — Expressive TTS",
131
  theme=gr.themes.Default(),
132
  css=".prompt-box textarea { font-size: 14px !important; line-height: 1.5 !important; }",
133
  analytics_enabled=False,
134
  ) as app:
135
+ gr.Markdown("# 🎭 LTX-2.3-Voice — Expressive TTS with Voice Cloning")
136
  gr.Markdown(
137
  "Write a scene prompt, optionally upload a 10-second voice reference, "
138
  "and generate. Audio is automatically watermarked with "
configs/training_args.example.yaml CHANGED
@@ -1,38 +1,50 @@
1
- # Example DramaBox IC-LoRA training config. Used by scripts/train.sh.
 
 
2
 
3
- # Where to load preprocessed `audio_latents/` + `conditions/` shards from.
 
4
  data_dir:
5
- - /path/to/preprocessed_dataset_a/
6
- - /path/to/preprocessed_dataset_b/
7
 
8
- # One index file per data_dir entry. Each line:
9
- # <sample_id>~<speaker_id>~<lang>~<sample_rate>~<offset>~<duration>~<phonemes>~<text>
10
  speaker_index:
11
- - /path/to/preprocessed_dataset_a/index.txt
12
- - /path/to/preprocessed_dataset_b/index.txt
13
 
14
- # Output directory (relative is fine; resolved against the repo root).
 
15
  output_dir: tts_iclora_v1
16
 
17
- # LTX-2.3 22B base. Same file is used for the transformer + the aux stack
18
- # (PromptEncoder, AudioVAE, AudioDecoder).
19
- checkpoint: ltx-2.3-22b-dev.safetensors
20
- full_checkpoint: ltx-2.3-22b-dev.safetensors
21
- base_model: dev
 
22
 
23
- # LoRA hyperparams. rank == alpha is the simplest setup (scale = 1.0).
24
  lora_rank: 128
25
  lora_alpha: 128
26
- lora_dropout: 0.1
27
 
28
- # Voice-cloning ref-token settings.
29
- ref_ratio: 0.3 # fraction of training samples that get a ref token
30
- max_ref_tokens: 200 # max ref-token positions appended to target
 
 
 
 
31
 
32
- text_dropout: 0.4 # CFG training: drop the text prompt with prob 0.4
 
 
33
 
34
- # Schedule. Use lr_scheduler=constant with a small lr (1e-5) for a "fine-tune"
35
- # resume; cosine + larger lr (1e-4) for from-scratch.
 
36
  steps: 10000
37
  lr: 1.0e-04
38
  lr_scheduler: cosine
@@ -46,8 +58,6 @@ save_every: 500
46
  log_every: 50
47
  seed: 53
48
 
49
- # (Optional) per-checkpoint validation eval; see configs/val_config.example.yaml
50
- # val_config: val_config.example.yaml
51
-
52
- # (Optional) resume from a previous LoRA adapter file:
53
- # resume_lora: tts_iclora_v0/lora_step_05000.safetensors
 
1
+ # LTX-2.3-Voice IC-LoRA training config; values become the defaults for
2
+ # `accelerate launch src/train.py --config configs/training_args.example.yaml`.
3
+ # Any flag explicitly passed on the CLI overrides the YAML.
4
 
5
+ # ── Data ───────────────────────────────────────────────────────────────────
6
+ # One entry per preprocessed dataset (output dirs from src/preprocess.py).
7
  data_dir:
8
+ - /path/to/preprocessed_dataset_a/
9
+ - /path/to/preprocessed_dataset_b/
10
 
11
+ # One index file per data_dir entry. Each line follows the format you fed to
12
+ # preprocess.py — see README "Prepare your index file".
13
  speaker_index:
14
+ - /path/to/preprocessed_dataset_a/index.txt
15
+ - /path/to/preprocessed_dataset_b/index.txt
16
 
17
+ # Output directory for LoRA shards + logs (relative paths resolve against the
18
+ # repo root).
19
  output_dir: tts_iclora_v1
20
 
21
+ # ── Base model ─────────────────────────────────────────────────────────────
22
+ # Train your LoRA on top of LTX-2.3-Voice itself (recommended) — the trimmed audio
23
+ # components are enough; no need to ship the raw LTX-2.3 base.
24
+ checkpoint: dramabox-dit-v1.safetensors
25
+ full_checkpoint: dramabox-audio-components.safetensors
26
+ base_model: dev # 'dev' = ShiftedLogitNormal sampler; 'distilled' = DistilledTimestepSampler
27
 
28
+ # ── LoRA hyperparams (rank == alpha → scale = 1.0) ─────────────────────────
29
  lora_rank: 128
30
  lora_alpha: 128
31
+ lora_dropout: 0.1 # ~0.1 helps regularize on small datasets
32
 
33
+ # Resume an existing LoRA — step number parsed from the filename
34
+ # (e.g. lora_step_05000.safetensors starts at step 5000).
35
+ # resume_lora: tts_iclora_v0/lora_step_05000.safetensors
36
+
37
+ # ── Voice-cloning reference tokens ─────────────────────────────────────────
38
+ ref_ratio: 0.3 # fraction of training samples that get a ref-token tail
39
+ max_ref_tokens: 200 # cap on appended ref tokens after patchification
40
 
41
+ # CFG training: probability of zeroing the text condition (forces reliance on
42
+ # the voice ref / unconditional path).
43
+ text_dropout: 0.4
44
 
45
+ # ── Schedule ───────────────────────────────────────────────────────────────
46
+ # Cosine + 1e-4 = from-scratch fine-tune.
47
+ # Constant + 1e-5 = polish on top of an existing LoRA (use with `resume_lora`).
48
  steps: 10000
49
  lr: 1.0e-04
50
  lr_scheduler: cosine
 
58
  log_every: 50
59
  seed: 53
60
 
61
+ # Optional per-save-step validation pass. Generates a sample for every speaker
62
+ # in the val_config so you can A/B listen during training.
63
+ # val_config: configs/val_config.example.yaml
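To get a feel for the schedule settings in this config (`steps`, `lr`, `lr_scheduler: cosine`), here is a sketch of one common warmup-plus-cosine shape. The exact curve used by `src/train.py` is not visible in this diff, so treat linear warmup and decay to zero as assumptions; the 500-step warmup comes from the README, and `lr_at` is a hypothetical function.

```python
import math


def lr_at(step: int, base_lr: float = 1e-4, warmup: int = 500, total: int = 10_000) -> float:
    """Linear warmup to base_lr, then cosine decay to zero at `total`."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


peak = lr_at(500)      # first step after warmup
midway = lr_at(5250)   # halfway through the cosine phase
final = lr_at(10_000)  # end of training
```

Under these assumptions the rate peaks at `lr` right after warmup, is half of that midway through the decay, and reaches zero on the last step.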
 
 
configs/val_config.example.yaml ADDED
@@ -0,0 +1,25 @@
1
+ # Validation prompts run by src/validate.py at every --save-every checkpoint.
2
+ # Each entry produces one .wav under <output_dir>/val_step_<N>/<name>.wav.
3
+ #
4
+ # Fields:
5
+ # name — short tag used as the output filename
6
+ # prompt — full LTX-2.3-Voice-style scene prompt
7
+ # reference — (optional) absolute path to a 10+ s voice reference clip;
8
+ # omit for prompt-only generation
9
+
10
+ speakers:
11
+ - name: villain_growl
12
+ prompt: 'A shadowy villain speaks with cold menace, "You have entered my domain, mortal." He chuckles darkly, "Such arrogance will be your undoing."'
13
+ reference: /path/to/voice_refs/male_villain.wav
14
+
15
+ - name: tender_whisper
16
+ prompt: 'A woman speaks tenderly, "It has been a long day, my love." She whispers, "Close your eyes. I am right here."'
17
+ reference: /path/to/voice_refs/female_warm.wav
18
+
19
+ - name: catgirl_giggle
20
+ prompt: 'A playful girl already mid-giggle, "Hehehe, oh my gosh you should see your face!" She gasps, "Oh my, hehe, I cannot stop!"'
21
+ # No `reference:` here — pure prompt-driven generation.
22
+
23
+ - name: announcer_smug
24
+ prompt: 'A confident announcer speaks proudly, "And now, the moment you have all been waiting for." He chuckles knowingly, "Heheh."'
25
+ reference: /path/to/voice_refs/male_announcer.wav
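The comments at the top of this file imply a simple mapping from speaker entries to output files. A minimal sketch of that mapping, assuming the config has already been loaded into a dict; `plan_validation` is a hypothetical helper, and whether `src/validate.py` zero-pads the step number in `val_step_<N>` is not shown here.

```python
import os
from typing import Dict, List


def plan_validation(config: Dict, output_dir: str, step: int) -> List[Dict]:
    """Map each speaker entry to its output wav, following the documented path scheme."""
    jobs = []
    for entry in config.get("speakers", []):
        for key in ("name", "prompt"):
            if key not in entry:
                raise ValueError(f"speaker entry missing required field: {key}")
        jobs.append({
            "prompt": entry["prompt"],
            "reference": entry.get("reference"),  # None means prompt-only generation
            "output": os.path.join(output_dir, f"val_step_{step}", f"{entry['name']}.wav"),
        })
    return jobs


config = {"speakers": [
    {"name": "villain_growl", "prompt": 'A shadowy villain speaks, "..."', "reference": "/path/to/ref.wav"},
    {"name": "catgirl_giggle", "prompt": 'A playful girl, "Hehehe"'},
]}
jobs = plan_validation(config, "tts_iclora_v1", 500)
```

Entries without a `reference` key come back with `None`, matching the prompt-only case in the example config.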
src/inference.py CHANGED
@@ -608,10 +608,26 @@ def main():
608
  audio_state = audio_tools.unpatchify(audio_state)
609
  logging.info(f"Final latent shape: {audio_state.latent.shape}")
610
 
611
  # ---- Decode audio ----
612
  logging.info("Decoding audio...")
613
  ad = AudioDecoder(checkpoint_path=args.full_checkpoint, dtype=dtype, device=device)
614
- decoded = ad(audio_state.latent)
615
  del ad
616
  torch.cuda.empty_cache()
617
 
 
608
  audio_state = audio_tools.unpatchify(audio_state)
609
  logging.info(f"Final latent shape: {audio_state.latent.shape}")
610
 
611
+ # ---- End-of-clip silence-prior fix ----
612
+ # Base LTX-2.3 22B was trained on audio clips ≤ ~20 s and learned a strong
613
+ # "clip-end silence" prior at the next patchifier-aligned latent boundary
614
+ # (frame 513 = 8 × 64 + 1). For longer outputs that prior leaks through as
615
+ # a ~30 ms hard silence dip near 20.4 s. Linearly interpolating frames
616
+ # 512–513 between their neighbours (511 and 514) removes the dip cleanly.
617
+ latent_in = audio_state.latent
618
+ if latent_in.shape[2] > 514:  # index f1 = 514 must exist as the right neighbour
619
+ f0, f1 = 511, 514
620
+ n = f1 - f0
621
+ patched = latent_in.clone()
622
+ for f in (512, 513):
623
+ t = (f - f0) / n
624
+ patched[:, :, f, :] = (1.0 - t) * latent_in[:, :, f0, :] + t * latent_in[:, :, f1, :]
625
+ latent_in = patched
626
+
627
  # ---- Decode audio ----
628
  logging.info("Decoding audio...")
629
  ad = AudioDecoder(checkpoint_path=args.full_checkpoint, dtype=dtype, device=device)
630
+ decoded = ad(latent_in)
631
  del ad
632
  torch.cuda.empty_cache()
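The silence-prior fix in this hunk can be tried in isolation. The sketch below restates it with NumPy standing in for torch, on a toy array whose layout (batch, channels, frames, features) is inferred from the `[:, :, f, :]` indexing; `patch_silence_dip` is a hypothetical name. Note the guard requires the right neighbour, index 514, to exist.

```python
import numpy as np


def patch_silence_dip(latent: np.ndarray, f0: int = 511, f1: int = 514) -> np.ndarray:
    """Linearly interpolate the frames strictly between f0 and f1 along axis 2."""
    if latent.shape[2] <= f1:  # right neighbour must exist
        return latent
    patched = latent.copy()
    left, right = latent[:, :, f0, :], latent[:, :, f1, :]
    for f in range(f0 + 1, f1):
        t = (f - f0) / (f1 - f0)
        patched[:, :, f, :] = (1.0 - t) * left + t * right
    return patched


# Toy latent in which every frame holds its own index as its value.
frames = np.arange(600, dtype=np.float64).reshape(1, 1, 600, 1)
latent = np.broadcast_to(frames, (1, 2, 600, 4)).copy()
latent[:, :, 512:514, :] = 0.0  # simulate the silence dip
fixed = patch_silence_dip(latent)
short = np.zeros((1, 2, 514, 4))
untouched = patch_silence_dip(short)
```

On this ramp the interpolation restores frames 512 and 513 to values between their neighbours, and a latent too short to contain frame 514 is returned unchanged.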
633
 
src/model_downloader.py CHANGED
@@ -1,6 +1,6 @@
1
  #!/usr/bin/env python3
2
  """
3
- Download Dramabox models from HuggingFace.
4
 
5
  Models are cached locally after first download.
6
  Gemma text encoder is fetched separately from the unsloth repo (see GEMMA_REPO).
@@ -13,16 +13,17 @@ from huggingface_hub import hf_hub_download, snapshot_download
13
 
14
  logger = logging.getLogger(__name__)
15
 
16
- DRAMABOX_REPO = "ResembleAI/Dramabox"
17
  GEMMA_REPO = "unsloth/gemma-3-12b-it-bnb-4bit"
18
 
19
  # Default cache directory
20
  DEFAULT_CACHE = os.environ.get(
21
- "DRAMABOX_CACHE",
22
- os.path.join(os.path.expanduser("~"), ".cache", "dramabox"),
23
  )
24
 
25
- # Model files in the HF repo (flat structure)
 
26
  MODEL_FILES = {
27
  "transformer": "dramabox-dit-v1.safetensors",
28
  "audio_components": "dramabox-audio-components.safetensors",
@@ -35,7 +36,7 @@ def get_model_path(name: str, cache_dir: str = None) -> str:
35
 
36
  Args:
37
  name: One of 'transformer', 'audio_components', 'silence_latent'
38
- cache_dir: Local cache directory (default: ~/.cache/dramabox)
39
 
40
  Returns:
41
  Local file path
@@ -46,10 +47,10 @@ def get_model_path(name: str, cache_dir: str = None) -> str:
46
  raise ValueError(f"Unknown model: {name}. Choose from: {list(MODEL_FILES.keys())}")
47
 
48
  repo_path = MODEL_FILES[name]
49
- logger.info(f"Fetching {name} from {DRAMABOX_REPO}/{repo_path}...")
50
 
51
  local_path = hf_hub_download(
52
- repo_id=DRAMABOX_REPO,
53
  filename=repo_path,
54
  cache_dir=cache_dir,
55
  token=os.environ.get("HF_TOKEN"),
 
1
  #!/usr/bin/env python3
2
  """
3
+ Download LTX-2.3-Voice models from HuggingFace.
4
 
5
  Models are cached locally after first download.
6
  Gemma text encoder is fetched separately from the unsloth repo (see GEMMA_REPO).
 
13
 
14
  logger = logging.getLogger(__name__)
15
 
16
+ LTX23_VOICE_REPO = "ResembleAI/LTX-2.3-Voice"
17
  GEMMA_REPO = "unsloth/gemma-3-12b-it-bnb-4bit"
18
 
19
  # Default cache directory
20
  DEFAULT_CACHE = os.environ.get(
21
+ "LTX23_VOICE_CACHE",
22
+ os.path.join(os.path.expanduser("~"), ".cache", "ltx-2.3-voice"),
23
  )
24
 
25
+ # Model files in the HF repo (flat structure). The on-disk filenames stayed
26
+ # `dramabox-*.safetensors` after the rebrand to avoid an 8 GB re-upload.
27
  MODEL_FILES = {
28
  "transformer": "dramabox-dit-v1.safetensors",
29
  "audio_components": "dramabox-audio-components.safetensors",
 
36
 
37
  Args:
38
  name: One of 'transformer', 'audio_components', 'silence_latent'
39
+ cache_dir: Local cache directory (default: ~/.cache/ltx-2.3-voice)
40
 
41
  Returns:
42
  Local file path
 
47
  raise ValueError(f"Unknown model: {name}. Choose from: {list(MODEL_FILES.keys())}")
48
 
49
  repo_path = MODEL_FILES[name]
50
+ logger.info(f"Fetching {name} from {LTX23_VOICE_REPO}/{repo_path}...")
51
 
52
  local_path = hf_hub_download(
53
+ repo_id=LTX23_VOICE_REPO,
54
  filename=repo_path,
55
  cache_dir=cache_dir,
56
  token=os.environ.get("HF_TOKEN"),
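One consequence of renaming the environment variable is that a machine still exporting `DRAMABOX_CACHE` silently falls back to the new default path. The sketch below shows one possible way to honor the old name as well. This fallback is not part of the commit; `resolve_cache_dir` is a hypothetical helper shown only to illustrate the lookup order.

```python
import os
from typing import Mapping


def resolve_cache_dir(env: Mapping[str, str] = os.environ) -> str:
    """Prefer the new variable, then the legacy one, then the new default path."""
    default = os.path.join(os.path.expanduser("~"), ".cache", "ltx-2.3-voice")
    return env.get("LTX23_VOICE_CACHE") or env.get("DRAMABOX_CACHE") or default


legacy_only = resolve_cache_dir({"DRAMABOX_CACHE": "/old"})
both_set = resolve_cache_dir({"LTX23_VOICE_CACHE": "/new", "DRAMABOX_CACHE": "/old"})
neither = resolve_cache_dir({})
```

Passing the environment as a mapping keeps the lookup easy to test without touching the real process environment.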
src/train.py CHANGED
@@ -372,42 +372,64 @@ def run_validation(lora_path, val_config_path, output_dir, step, lora_rank=128):
372
  # ─── Args ───
373
 
374
  def parse_args():
375
- p = argparse.ArgumentParser(description="Audio-Only IC-LoRA Training for Voice Cloning")
376
- p.add_argument("--data-dir", required=True, nargs="+")
377
- p.add_argument("--speaker-index", required=True, nargs="+")
378
- p.add_argument("--output-dir", default=os.path.join(MODEL_DIR, "tts_iclora_v1"))
379
- p.add_argument("--checkpoint", default=os.path.join(MODEL_DIR, "ltx-2.3-audio-only.safetensors"))
380
- p.add_argument("--full-checkpoint", default=os.path.join(MODEL_DIR, "ltx-2.3-22b-distilled.safetensors"))
381
- p.add_argument("--base-model", choices=["distilled", "dev"], default="distilled",
382
  help="Base model type: distilled uses DistilledTimestepSampler, dev uses ShiftedLogitNormal")
383
- p.add_argument("--lora-rank", type=int, default=128)
384
- p.add_argument("--lora-alpha", type=int, default=128)
385
- p.add_argument("--lora-dropout", type=float, default=0.0,
386
  help="Dropout applied to LoRA A/B matrices during training. "
387
  "Recommended ~0.1 for small datasets to regularize.")
388
- p.add_argument("--resume-lora", default=None)
389
- p.add_argument("--resume-step-offset", type=int, default=None,
390
  help="Step to add when naming saved checkpoints. If None, inferred "
391
  "from --resume-lora filename (e.g. lora_step_10000.safetensors → 10000). "
392
  "Set to 0 to start numbering at 0 regardless.")
393
- p.add_argument("--ref-ratio", type=float, default=0.3,
394
  help="Fraction of target length to use as reference (default 0.3)")
395
- p.add_argument("--max-ref-tokens", type=int, default=200,
396
  help="Maximum reference tokens after patchification (default 200)")
397
- p.add_argument("--text-dropout", type=float, default=0.0,
398
  help="Probability of dropping text conditioning (forces reliance on voice ref)")
399
- p.add_argument("--steps", type=int, default=30000)
400
- p.add_argument("--lr", type=float, default=3e-5)
401
- p.add_argument("--lr-scheduler", choices=["cosine", "linear", "constant"], default="cosine")
402
- p.add_argument("--batch-size", type=int, default=1)
403
- p.add_argument("--grad-accum", type=int, default=4)
404
- p.add_argument("--max-grad-norm", type=float, default=1.0)
405
- p.add_argument("--save-every", type=int, default=1000)
406
- p.add_argument("--log-every", type=int, default=50)
407
- p.add_argument("--seed", type=int, default=42)
408
- p.add_argument("--warmup-steps", type=int, default=100)
409
- p.add_argument("--val-config", default=None)
410
- return p.parse_args()
411
 
412
 
413
  # ─── Main ───
 
372
  # ─── Args ───
373
 
374
  def parse_args():
375
+ # First pass: pull out --config so its values can become argparse defaults.
376
+ cfg_parser = argparse.ArgumentParser(add_help=False)
377
+ cfg_parser.add_argument("--config", default=None,
378
+ help="YAML file with default values for any of the flags below. "
379
+ "Explicit CLI flags still override the YAML.")
380
+ cfg_args, remaining = cfg_parser.parse_known_args()
381
+ yaml_defaults: dict = {}
382
+ if cfg_args.config:
383
+ import yaml as _yaml
384
+ with open(cfg_args.config) as f:
385
+ yaml_defaults = _yaml.safe_load(f) or {}
386
+ # YAML keys are dashes-or-underscores → normalize to argparse dest (underscore).
387
+ yaml_defaults = {k.replace("-", "_"): v for k, v in yaml_defaults.items()}
388
+
389
+ def _yaml(name, fallback):
390
+ return yaml_defaults.get(name, fallback)
391
+
392
+ p = argparse.ArgumentParser(
393
+ parents=[cfg_parser],
394
+ description="Audio-Only IC-LoRA Training for Voice Cloning",
395
+ )
396
+ p.add_argument("--data-dir", required="data_dir" not in yaml_defaults,
397
+ nargs="+", default=_yaml("data_dir", None))
398
+ p.add_argument("--speaker-index", required="speaker_index" not in yaml_defaults,
399
+ nargs="+", default=_yaml("speaker_index", None))
400
+ p.add_argument("--output-dir", default=_yaml("output_dir", os.path.join(MODEL_DIR, "tts_iclora_v1")))
401
+ p.add_argument("--checkpoint", default=_yaml("checkpoint", os.path.join(MODEL_DIR, "dramabox-dit-v1.safetensors")))
402
+ p.add_argument("--full-checkpoint", default=_yaml("full_checkpoint", os.path.join(MODEL_DIR, "dramabox-audio-components.safetensors")))
403
+ p.add_argument("--base-model", choices=["distilled", "dev"], default=_yaml("base_model", "dev"),
404
  help="Base model type: distilled uses DistilledTimestepSampler, dev uses ShiftedLogitNormal")
405
+ p.add_argument("--lora-rank", type=int, default=_yaml("lora_rank", 128))
406
+ p.add_argument("--lora-alpha", type=int, default=_yaml("lora_alpha", 128))
407
+ p.add_argument("--lora-dropout", type=float, default=_yaml("lora_dropout", 0.0),
408
  help="Dropout applied to LoRA A/B matrices during training. "
409
  "Recommended ~0.1 for small datasets to regularize.")
410
+ p.add_argument("--resume-lora", default=_yaml("resume_lora", None))
411
+ p.add_argument("--resume-step-offset", type=int, default=_yaml("resume_step_offset", None),
412
  help="Step to add when naming saved checkpoints. If None, inferred "
413
  "from --resume-lora filename (e.g. lora_step_10000.safetensors → 10000). "
414
  "Set to 0 to start numbering at 0 regardless.")
415
+ p.add_argument("--ref-ratio", type=float, default=_yaml("ref_ratio", 0.3),
416
  help="Fraction of target length to use as reference (default 0.3)")
417
+ p.add_argument("--max-ref-tokens", type=int, default=_yaml("max_ref_tokens", 200),
418
  help="Maximum reference tokens after patchification (default 200)")
419
+ p.add_argument("--text-dropout", type=float, default=_yaml("text_dropout", 0.0),
420
  help="Probability of dropping text conditioning (forces reliance on voice ref)")
421
+ p.add_argument("--steps", type=int, default=_yaml("steps", 30000))
422
+ p.add_argument("--lr", type=float, default=_yaml("lr", 3e-5))
423
+ p.add_argument("--lr-scheduler", choices=["cosine", "linear", "constant"], default=_yaml("lr_scheduler", "cosine"))
424
+ p.add_argument("--batch-size", type=int, default=_yaml("batch_size", 1))
425
+ p.add_argument("--grad-accum", type=int, default=_yaml("grad_accum", 4))
426
+ p.add_argument("--max-grad-norm", type=float, default=_yaml("max_grad_norm", 1.0))
427
+ p.add_argument("--save-every", type=int, default=_yaml("save_every", 1000))
428
+ p.add_argument("--log-every", type=int, default=_yaml("log_every", 50))
429
+ p.add_argument("--seed", type=int, default=_yaml("seed", 42))
430
+ p.add_argument("--warmup-steps", type=int, default=_yaml("warmup_steps", 100))
431
+ p.add_argument("--val-config", default=_yaml("val_config", None))
432
+ return p.parse_args(remaining)
433
 
434
 
435
  # ─── Main ───
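The config-as-defaults pattern in `parse_args` is easier to see on a reduced example. The sketch below keeps three of the flags and takes the file contents as a plain dict so it runs without PyYAML; `parse_with_defaults` is a hypothetical function, not the one in `src/train.py`.

```python
import argparse
from typing import Dict, List, Optional


def parse_with_defaults(argv: List[str], file_defaults: Optional[Dict] = None) -> argparse.Namespace:
    """Config-file values become argparse defaults; explicit CLI flags still win."""
    # Normalize dashed keys to argparse dest names, as the diff does.
    defaults = {k.replace("-", "_"): v for k, v in (file_defaults or {}).items()}
    p = argparse.ArgumentParser()
    p.add_argument("--data-dir", nargs="+", required="data_dir" not in defaults,
                   default=defaults.get("data_dir"))
    p.add_argument("--lr", type=float, default=defaults.get("lr", 3e-5))
    p.add_argument("--steps", type=int, default=defaults.get("steps", 30000))
    return p.parse_args(argv)


cfg = {"data-dir": ["/data/a"], "lr": 1.0e-4, "steps": 10000}
args = parse_with_defaults(["--steps", "200"], cfg)

# A string default is still passed through `type`, so "1e-4" becomes a float.
from_string = parse_with_defaults([], {"data_dir": ["/data/a"], "lr": "1e-4"})
```

Two details are worth knowing when relying on this pattern: argparse applies `type` to a default only when that default is a string, and it does not check defaults against `choices`, so a mistyped `base_model` or `lr_scheduler` in the YAML would pass through unvalidated.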
src/validate.py CHANGED
@@ -29,7 +29,7 @@ DEV_FULL_CKPT = os.environ.get(
29
  )
30
  GEMMA_ROOT = os.environ.get(
31
  "GEMMA_ROOT",
32
- os.path.expanduser("~/.cache/dramabox/gemma-3-12b-it-bnb-4bit"),
33
  )
34
 
35
 
 
29
  )
30
  GEMMA_ROOT = os.environ.get(
31
  "GEMMA_ROOT",
32
+ os.path.expanduser("~/.cache/ltx-2.3-voice/gemma-3-12b-it-bnb-4bit"),
33
  )
34
 
35