Manmay Nakhashi committed on
Commit ·
1636761
1
Parent(s): c29ae29
Rebrand: DramaBox → LTX-2.3-Voice
- README title + frontmatter (title, emoji, license_link, short_description)
- All in-text references in docs and code
- model_downloader: DRAMABOX_REPO → LTX23_VOICE_REPO, default cache → ~/.cache/ltx-2.3-voice
- app.py title, prefix, log lines
- HF model+space repos are renamed to ResembleAI/LTX-2.3-Voice (HF auto-redirects old paths)
The on-disk safetensors filenames (`dramabox-dit-v1.safetensors`,
`dramabox-audio-components.safetensors`) stay as-is to avoid an 8 GB
re-upload; comment in model_downloader.py explains the leftover names.
- README.md +188 -20
- app.py +6 -6
- configs/training_args.example.yaml +37 -27
- configs/val_config.example.yaml +25 -0
- src/inference.py +17 -1
- src/model_downloader.py +9 -8
- src/train.py +49 -27
- src/validate.py +1 -1
README.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
colorFrom: red
|
| 5 |
colorTo: indigo
|
| 6 |
sdk: gradio
|
|
@@ -9,34 +9,202 @@ app_file: app.py
|
|
| 9 |
pinned: true
|
| 10 |
license: other
|
| 11 |
license_name: ltx-2-community
|
| 12 |
-
license_link: https://huggingface.co/ResembleAI/
|
| 13 |
hf_oauth: false
|
| 14 |
-
short_description: Expressive TTS with voice cloning —
|
| 15 |
---
|
| 16 |
|
| 17 |
-
#
|
| 18 |
|
| 19 |
-
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
##
|
| 24 |
|
| 25 |
```
|
| 26 |
-
|
| 27 |
```
|
| 28 |
|
| 29 |
-
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
|
| 33 |
-
|
| 34 |
|
| 35 |
-
##
|
| 36 |
|
| 37 |
-
-
|
| 38 |
-
- `src/inference_server.py` — warm `TTSServer` (single load, ~2.5s/request)
|
| 39 |
-
- `src/inference.py` — CLI inference
|
| 40 |
-
- `src/model_downloader.py` — auto-fetches model from HuggingFace
|
| 41 |
-
- `ltx2/` — vendored LTX-2 pipelines
|
| 42 |
-
- `requirements.txt` — Python deps (includes `resemble-perth`)
|
|
|
|
| 1 |
---
|
| 2 |
+
title: LTX-2.3-Voice
|
| 3 |
+
emoji: 🎙️
|
| 4 |
colorFrom: red
|
| 5 |
colorTo: indigo
|
| 6 |
sdk: gradio
|
|
|
|
| 9 |
pinned: true
|
| 10 |
license: other
|
| 11 |
license_name: ltx-2-community
|
| 12 |
+
license_link: https://huggingface.co/ResembleAI/LTX-2.3-Voice/blob/main/LICENSE
|
| 13 |
hf_oauth: false
|
| 14 |
+
short_description: Expressive TTS with voice cloning — LTX-2.3-Voice demo
|
| 15 |
---
|
| 16 |
|
| 17 |
+
# LTX-2.3-Voice — Expressive TTS with Voice Cloning
|
| 18 |
|
| 19 |
+
Prompt-driven TTS with voice cloning, built as an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only** model. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses and transitions; an optional 10-second voice reference clones the target timbre.
|
| 20 |
|
| 21 |
+
| | |
|
| 22 |
+
|---|---|
|
| 23 |
+
| 🤗 **Model** | [`ResembleAI/LTX-2.3-Voice`](https://huggingface.co/ResembleAI/LTX-2.3-Voice) |
|
| 24 |
+
| 🎭 **Demo Space** | [`ResembleAI/LTX-2.3-Voice`](https://huggingface.co/spaces/ResembleAI/LTX-2.3-Voice) (ZeroGPU) |
|
| 25 |
+
| 📜 **License** | LTX-2 Community License — see [`LICENSE`](LICENSE) |
|
| 26 |
|
| 27 |
+
## Models
|
| 28 |
|
| 29 |
+
Auto-downloaded from the HF model repo on first run.
|
| 30 |
+
|
| 31 |
+
| File | Size | Description |
|
| 32 |
+
|---|---|---|
|
| 33 |
+
| `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (LoRA already merged into base) |
|
| 34 |
+
| `dramabox-audio-components.safetensors` | 1.9 GB | Audio embeddings connector + audio text projection + audio VAE + vocoder |
|
| 35 |
+
| [`unsloth/gemma-3-12b-it-bnb-4bit`](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) | ~8 GB | Text encoder |
|
| 36 |
+
|
| 37 |
+
**VRAM**: ~24 GB peak · **Speed**: ~2.5 s / generation (warm server, H100)
|
| 38 |
+
|
| 39 |
+
## Quick Start
|
| 40 |
+
|
| 41 |
+
### Warm server (recommended)
|
| 42 |
+
|
| 43 |
+
```python
|
| 44 |
+
from src.inference_server import TTSServer
|
| 45 |
+
|
| 46 |
+
server = TTSServer(device="cuda")
|
| 47 |
+
|
| 48 |
+
server.generate_to_file(
|
| 49 |
+
prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
|
| 50 |
+
output="output.wav",
|
| 51 |
+
voice_ref="reference.wav", # optional, 10+ seconds
|
| 52 |
+
)
|
| 53 |
+
```
|
| 54 |
+
|
| 55 |
+
### CLI
|
| 56 |
+
|
| 57 |
+
```bash
|
| 58 |
+
python src/inference.py \
|
| 59 |
+
--voice-sample reference.wav \
|
| 60 |
+
--prompt 'A woman speaks warmly, "Hello, how are you today?"' \
|
| 61 |
+
--output output.wav \
|
| 62 |
+
--cfg-scale 2.5 --stg-scale 1.5
|
| 63 |
+
```
|
| 64 |
+
|
| 65 |
+
### Gradio app
|
| 66 |
+
|
| 67 |
+
```bash
|
| 68 |
+
CUDA_VISIBLE_DEVICES=4 python app.py
|
| 69 |
+
```
|
| 70 |
+
|
| 71 |
+
## Inference Settings
|
| 72 |
+
|
| 73 |
+
| Parameter | Default | Notes |
|
| 74 |
+
|---|---|---|
|
| 75 |
+
| `cfg-scale` | 2.5 | Lower = more natural, higher = more text-faithful |
|
| 76 |
+
| `stg-scale` | 1.5 | Skip-token guidance |
|
| 77 |
+
| `rescale` | 0 | No rescaling |
|
| 78 |
+
| `modality` | 1 | No modality guidance |
|
| 79 |
+
| `duration-multiplier` | 1.1 | 10% breathing room on auto-estimated length |
|
| 80 |
+
| `steps` | 30 | Euler flow matching |
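To make the `cfg-scale` and `steps` rows concrete, here is a minimal sketch of classifier-free guidance combined with fixed-step Euler integration of a flow-matching ODE. It is a toy under stated assumptions: the function names are hypothetical, the time direction and guidance formula are the generic textbook forms rather than anything confirmed from `src/inference.py`, and STG, rescaling, and modality guidance are left out.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def guided_velocity(v_cond: Sequence[float], v_uncond: Sequence[float],
                    cfg_scale: float) -> Vector:
    """Generic classifier-free guidance: push the unconditional prediction
    toward the conditional one. A scale of 1.0 returns v_cond unchanged."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(x0: Sequence[float],
                 velocity_fn: Callable[[Vector, float], Vector],
                 steps: int = 30) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed steps.
    (Assumed time direction; the real sampler may run the schedule reversed.)"""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # Toy fields: the "conditional" model points at a fixed target,
    # the "unconditional" model predicts zero velocity.
    target = [1.0, -2.0]
    sample = euler_sample(
        [0.0, 0.0],
        lambda x, t: guided_velocity(target, [0.0, 0.0], cfg_scale=2.5),
    )
    print(sample)
```

In this toy, raising the scale above 1.0 overshoots the conditional target, which is one way to picture why higher `cfg-scale` values follow the text more strictly and sound less natural.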
|
| 81 |
+
|
| 82 |
+
## Prompt Writing Guide
|
| 83 |
+
|
| 84 |
+
**Structure:** `<speaker description>, "<dialogue>" <action direction> "<more dialogue>"`
|
| 85 |
+
|
| 86 |
+
**Inside quotes** (model produces actual sounds):
|
| 87 |
+
- Laughs: `"Hahaha"` `"Hehehe"` (always one word, never separated)
|
| 88 |
+
- Sounds: `"Mmmmm"` `"Ugh"` `"Argh"` `"Ahhh"` `"Hmm"`
|
| 89 |
+
|
| 90 |
+
**Outside quotes** (stage directions):
|
| 91 |
+
- `She sighs deeply.` · `He gulps nervously.` · `A long pause.`
|
| 92 |
+
- `Her voice cracks.` · `He clears his throat.` · `She scoffs.`
|
| 93 |
+
|
| 94 |
+
**Avoid inside quotes** (model speaks them literally): `Ahem`, `Pfft`, `Sigh`, `Gasp`, `Cough`.
|
| 95 |
+
|
| 96 |
+
**Tips**
|
| 97 |
+
- Match gender/age in the speaker description to the voice reference
|
| 98 |
+
- Break long dialogue into segments with action directions in between
|
| 99 |
+
- End the prompt at the last closing quote mark (no trailing description)
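The guide above can be turned into a small helper that assembles prompts in the documented structure and flags words the guide says to keep out of quotes. This is a sketch only: `build_prompt` and `risky_words` are hypothetical names, not part of the repository.

```python
import re
from typing import Iterable, List, Tuple

# Words the guide says the model speaks literally when placed inside quotes.
AVOID_IN_QUOTES = {"ahem", "pfft", "sigh", "gasp", "cough"}


def build_prompt(speaker: str, first_line: str,
                 beats: Iterable[Tuple[str, str]] = ()) -> str:
    """Assemble '<speaker>, "<dialogue>" <direction> "<more dialogue>"'.
    The result ends at the last closing quote, as the tips recommend."""
    parts = [f'{speaker}, "{first_line}"']
    for direction, dialogue in beats:
        parts.append(f'{direction} "{dialogue}"')
    return " ".join(parts)


def risky_words(dialogue: str) -> List[str]:
    """Return quoted words that are better written as stage directions."""
    words = set(re.findall(r"[a-z]+", dialogue.lower()))
    return sorted(words & AVOID_IN_QUOTES)


if __name__ == "__main__":
    print(build_prompt(
        "A woman speaks warmly",
        "Hello, how are you today?",
        [("She laughs,", "Hahaha, it is so good to see you!")],
    ))
    print(risky_words("Ahem, may I have your attention?"))
```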
|
| 100 |
+
|
| 101 |
+
## Watermarking
|
| 102 |
+
|
| 103 |
+
Every audio output from `inference.py` and `inference_server.TTSServer.generate_to_file` is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth) — an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
|
| 104 |
+
|
| 105 |
+
```python
|
| 106 |
+
import perth, librosa
|
| 107 |
+
wav, sr = librosa.load("output.wav", sr=None, mono=True)
|
| 108 |
+
detector = perth.PerthImplicitWatermarker()
|
| 109 |
+
print(detector.get_watermark(wav, sample_rate=sr)) # confidence ≈ 1.0
|
| 110 |
+
```
|
| 111 |
+
|
| 112 |
+
Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable for debugging.
|
| 113 |
+
|
| 114 |
+
## Training a LoRA on top of LTX-2.3-Voice
|
| 115 |
+
|
| 116 |
+
You can fine-tune your own LoRA using LTX-2.3-Voice itself as the base — no need to start from raw LTX-2.3. Useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.
|
| 117 |
+
|
| 118 |
+
### 1. Prepare your index file
|
| 119 |
+
|
| 120 |
+
The preprocessor accepts four formats. The `text` field is the **target transcript**; if you want to attach a scene-style prompt (the part the model conditions on at inference time), prepend it to the transcript in the same format the model was trained on:
|
| 121 |
+
|
| 122 |
+
> `A woman speaks warmly, "<your transcript here>"`
|
| 123 |
+
|
| 124 |
+
Both forms are supported — with or without the prompt wrapper. Without the wrapper the model treats the entry as plain text-to-speech.
|
| 125 |
+
|
| 126 |
+
**Format A — `manifest` (JSONL)** — recommended for new datasets:
|
| 127 |
+
|
| 128 |
+
```jsonl
|
| 129 |
+
{"audio_filepath": "wavs/spk01_001.wav", "text": "A woman speaks warmly, \"Hello, how are you today?\""}
|
| 130 |
+
{"audio_filepath": "wavs/spk01_002.wav", "text": "Hello, how are you today?"}
|
| 131 |
+
{"audio_filepath": "wavs/spk02_001.flac", "text": "An exhausted father sighs, \"Sweetie, daddy is asking very nicely.\"", "duration": 4.7}
|
| 132 |
+
```
|
| 133 |
+
|
| 134 |
+
Fields: `audio_filepath` (or `audio_path`) is required, `text` (or `transcript`) is required, `duration` is optional.
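Before running the preprocessor it can help to check each manifest line against the field rules above. The following is a minimal sketch, assuming only what the field list states; `validate_manifest_line` is a hypothetical helper, and the real `src/preprocess.py` may apply additional checks.

```python
import json
from typing import Optional, TypedDict


class ManifestRow(TypedDict):
    audio: str
    text: str
    duration: Optional[float]


def validate_manifest_line(line: str) -> ManifestRow:
    """Parse one JSONL line and enforce the documented required fields."""
    row = json.loads(line)
    audio = row.get("audio_filepath") or row.get("audio_path")
    text = row.get("text") or row.get("transcript")
    if not audio:
        raise ValueError("missing audio_filepath / audio_path")
    if not text:
        raise ValueError("missing text / transcript")
    duration = row.get("duration")
    if duration is not None:
        duration = float(duration)
        if duration <= 0:
            raise ValueError("duration must be positive when present")
    return {"audio": audio, "text": text, "duration": duration}


if __name__ == "__main__":
    sample = json.dumps({
        "audio_filepath": "wavs/spk01_001.wav",
        "text": 'A woman speaks warmly, "Hello, how are you today?"',
    })
    print(validate_manifest_line(sample))
```

Writing lines with `json.dumps` rather than by hand avoids mistakes when escaping the inner double quotes of a prompt-wrapped transcript.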
|
| 135 |
+
|
| 136 |
+
**Format B — `tsv`** — simplest, one line per sample:
|
| 137 |
+
|
| 138 |
+
```
|
| 139 |
+
wavs/spk01_001.wav A woman speaks warmly, "Hello, how are you today?"
|
| 140 |
+
wavs/spk01_002.wav Hello, how are you today?
|
| 141 |
+
```
|
| 142 |
+
|
| 143 |
+
**Format C — `gemini_synthetic`** — `~`-separated, used for prompted synthetic data:
|
| 144 |
+
|
| 145 |
+
```
|
| 146 |
+
id~speaker~lang~sr~samples~dur~phonemes~text
|
| 147 |
+
spk01_001~spk01~en~24000~93000~3.875~_~A woman speaks warmly, "Hello, how are you today?"
|
| 148 |
+
```
|
| 149 |
+
|
| 150 |
+
**Format D — `libriheavy`** — `~`-separated, for unprompted text-only data:
|
| 151 |
+
|
| 152 |
+
```
|
| 153 |
+
id~speaker~lang~samples~dur_ms~phonemes~text
|
| 154 |
+
spk01_001~spk01~en~93000~3875~_~Hello, how are you today?
|
| 155 |
```
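The two `~`-separated layouts (Formats C and D) differ only in their column lists, so one parser can cover both. The sketch below is an assumption-laden illustration: `parse_index_line` is a hypothetical helper, and whether real index files carry the header row shown in the examples is not stated in the source.

```python
from typing import Dict, List

GEMINI_FIELDS = ["id", "speaker", "lang", "sr", "samples", "dur", "phonemes", "text"]
LIBRIHEAVY_FIELDS = ["id", "speaker", "lang", "samples", "dur_ms", "phonemes", "text"]


def parse_index_line(line: str, fields: List[str]) -> Dict[str, str]:
    """Split on '~' at most len(fields) - 1 times, so a '~' inside the
    trailing text column does not break the row."""
    values = line.rstrip("\n").split("~", len(fields) - 1)
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} columns, got {len(values)}")
    return dict(zip(fields, values))


if __name__ == "__main__":
    row = parse_index_line(
        'spk01_001~spk01~en~24000~93000~3.875~_~A woman speaks warmly, "Hello, how are you today?"',
        GEMINI_FIELDS,
    )
    # Sanity check: samples / sample rate should match the duration column.
    print(int(row["samples"]) / int(row["sr"]), float(row["dur"]))
```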
|
| 156 |
+
|
| 157 |
+
### 2. Preprocess
|
| 158 |
+
|
| 159 |
+
```bash
|
| 160 |
+
python src/preprocess.py \
|
| 161 |
+
--dataset-type manifest \
|
| 162 |
+
--index your_data.jsonl \
|
| 163 |
+
--audio-dir /path/to/wavs \
|
| 164 |
+
--output-dir /path/to/preprocessed/ \
|
| 165 |
+
--checkpoint /path/to/dramabox-audio-components.safetensors \
|
| 166 |
+
--gemma-root /path/to/gemma-3-12b-it-bnb-4bit/ \
|
| 167 |
+
--max-duration 20.0 --min-duration 2.0
|
| 168 |
+
```
|
| 169 |
+
|
| 170 |
+
Output layout (training-ready `.pt` files):
|
| 171 |
+
|
| 172 |
+
```
|
| 173 |
+
preprocessed/
|
| 174 |
+
├── audio_latents/sample_*.pt # Audio VAE-encoded latents
|
| 175 |
+
├── conditions/sample_*.pt # Gemma text embeddings
|
| 176 |
+
└── latents/sample_*.pt # Dummy video latents (placeholder)
|
| 177 |
+
```
|
| 178 |
+
|
| 179 |
+
### 3. Train
|
| 180 |
+
|
| 181 |
+
Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the LTX-2.3-Voice files, then launch with HuggingFace `accelerate`. Any flag passed on the CLI overrides the YAML.
|
| 182 |
+
|
| 183 |
+
```bash
|
| 184 |
+
accelerate launch src/train.py \
|
| 185 |
+
--config configs/training_args.example.yaml
|
| 186 |
+
```
|
| 187 |
+
|
| 188 |
+
The trainer attaches a fresh LoRA to the audio branch on top of the LTX-2.3-Voice checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` × 48 transformer blocks (288 LoRA pairs total). Default rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.
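The numbers in this paragraph can be checked with a short sketch. Two things here are assumptions and not taken from `src/train.py`: the `transformer_blocks` module prefix is hypothetical, and the schedule assumes linear warmup followed by cosine decay to zero.

```python
import math
from typing import List

ATTN_TARGETS = ["to_q", "to_k", "to_v", "to_out.0"]
FF_TARGETS = ["net.0.proj", "net.2"]


def lora_target_names(num_blocks: int = 48,
                      prefix: str = "transformer_blocks") -> List[str]:
    """Enumerate the audio-branch layers that receive a LoRA pair.
    `prefix` is a hypothetical module path used only for illustration."""
    names: List[str] = []
    for i in range(num_blocks):
        names += [f"{prefix}.{i}.audio_attn1.{m}" for m in ATTN_TARGETS]
        names += [f"{prefix}.{i}.audio_ff.{m}" for m in FF_TARGETS]
    return names


def lr_at(step: int, base_lr: float = 1e-4, warmup: int = 500,
          total: int = 10_000) -> float:
    """Assumed shape: linear warmup, then cosine decay to zero."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(len(lora_target_names()))      # 6 targets per block x 48 blocks
    print(128 / 128)                     # LoRA scale = alpha / rank
    print([lr_at(s) for s in (0, 250, 500, 5_250, 10_000)])
```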
|
| 189 |
+
|
| 190 |
+
To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML — `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.
|
| 191 |
+
|
| 192 |
+
### Inference with your trained LoRA
|
| 193 |
+
|
| 194 |
+
```bash
|
| 195 |
+
python src/inference.py \
|
| 196 |
+
--lora /path/to/your/lora_step_5000.safetensors \
|
| 197 |
+
--voice-sample reference.wav \
|
| 198 |
+
--prompt 'A woman speaks warmly, "..."' \
|
| 199 |
+
--output output.wav
|
| 200 |
```
|
| 201 |
|
| 202 |
+
Always load the LoRA at inference rather than pre-merging it — pre-merged checkpoints have produced degraded output in our runs.
|
| 203 |
+
|
| 204 |
+
## Language
|
| 205 |
|
| 206 |
+
English.
|
| 207 |
|
| 208 |
+
## License
|
| 209 |
|
| 210 |
+
Built on [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks. Distributed under the LTX-2 Community License Agreement — see [`LICENSE`](LICENSE).
|
app.py
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
-
"""
|
| 3 |
|
| 4 |
Loads the warm TTSServer once, then handles requests at ~2.5 s each. All
|
| 5 |
generated audio is invisibly watermarked with Resemble Perth before being
|
|
@@ -21,14 +21,14 @@ from model_downloader import get_all_paths # noqa: E402
|
|
| 21 |
|
| 22 |
|
| 23 |
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
|
| 24 |
-
logging.info("Fetching
|
| 25 |
PATHS = get_all_paths()
|
| 26 |
|
| 27 |
# Module-level warm load (same pattern as IndexTTS-2-Demo on ZeroGPU). The
|
| 28 |
# `spaces` package patches torch so that .to("cuda") at import time pins the
|
| 29 |
# weights into ZeroGPU's shared memory; each @spaces.GPU call then maps them
|
| 30 |
# onto the actual GPU instantly. First user request is ~2.5 s instead of ~30 s.
|
| 31 |
-
logging.info("Loading
|
| 32 |
tts = TTSServer(
|
| 33 |
checkpoint=PATHS["transformer"],
|
| 34 |
full_checkpoint=PATHS["audio_components"],
|
|
@@ -112,7 +112,7 @@ def on_generate(prompt: str, audio_ref, cfg: float, stg: float, dur_mult: float,
|
|
| 112 |
raise gr.Error("Prompt is empty.")
|
| 113 |
t0 = time.time()
|
| 114 |
ref_path = audio_ref if audio_ref and os.path.exists(str(audio_ref)) else None
|
| 115 |
-
output = tempfile.mktemp(suffix=".wav", prefix="
|
| 116 |
tts.generate_to_file(
|
| 117 |
prompt=prompt,
|
| 118 |
output=output,
|
|
@@ -127,12 +127,12 @@ def on_generate(prompt: str, audio_ref, cfg: float, stg: float, dur_mult: float,
|
|
| 127 |
|
| 128 |
# ── UI ──────────────────────────────────────────────────────────────────────
|
| 129 |
with gr.Blocks(
|
| 130 |
-
title="
|
| 131 |
theme=gr.themes.Default(),
|
| 132 |
css=".prompt-box textarea { font-size: 14px !important; line-height: 1.5 !important; }",
|
| 133 |
analytics_enabled=False,
|
| 134 |
) as app:
|
| 135 |
-
gr.Markdown("# 🎭
|
| 136 |
gr.Markdown(
|
| 137 |
"Write a scene prompt, optionally upload a 10-second voice reference, "
|
| 138 |
"and generate. Audio is automatically watermarked with "
|
|
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
+
"""LTX-2.3-Voice — Gradio demo (warm server).
|
| 3 |
|
| 4 |
Loads the warm TTSServer once, then handles requests at ~2.5 s each. All
|
| 5 |
generated audio is invisibly watermarked with Resemble Perth before being
|
|
|
|
| 21 |
|
| 22 |
|
| 23 |
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
|
| 24 |
+
logging.info("Fetching LTX-2.3-Voice checkpoints from HuggingFace (cached after first run)...")
|
| 25 |
PATHS = get_all_paths()
|
| 26 |
|
| 27 |
# Module-level warm load (same pattern as IndexTTS-2-Demo on ZeroGPU). The
|
| 28 |
# `spaces` package patches torch so that .to("cuda") at import time pins the
|
| 29 |
# weights into ZeroGPU's shared memory; each @spaces.GPU call then maps them
|
| 30 |
# onto the actual GPU instantly. First user request is ~2.5 s instead of ~30 s.
|
| 31 |
+
logging.info("Loading LTX-2.3-Voice warm server (Gemma + DiT + VAE + Decoder)...")
|
| 32 |
tts = TTSServer(
|
| 33 |
checkpoint=PATHS["transformer"],
|
| 34 |
full_checkpoint=PATHS["audio_components"],
|
|
|
|
| 112 |
raise gr.Error("Prompt is empty.")
|
| 113 |
t0 = time.time()
|
| 114 |
ref_path = audio_ref if audio_ref and os.path.exists(str(audio_ref)) else None
|
| 115 |
+
output = tempfile.mktemp(suffix=".wav", prefix="ltx23voice_")
|
| 116 |
tts.generate_to_file(
|
| 117 |
prompt=prompt,
|
| 118 |
output=output,
|
|
|
|
| 127 |
|
| 128 |
# ── UI ──────────────────────────────────────────────────────────────────────
|
| 129 |
with gr.Blocks(
|
| 130 |
+
title="LTX-2.3-Voice — Expressive TTS",
|
| 131 |
theme=gr.themes.Default(),
|
| 132 |
css=".prompt-box textarea { font-size: 14px !important; line-height: 1.5 !important; }",
|
| 133 |
analytics_enabled=False,
|
| 134 |
) as app:
|
| 135 |
+
gr.Markdown("# 🎭 LTX-2.3-Voice — Expressive TTS with Voice Cloning")
|
| 136 |
gr.Markdown(
|
| 137 |
"Write a scene prompt, optionally upload a 10-second voice reference, "
|
| 138 |
"and generate. Audio is automatically watermarked with "
|
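A side note on the `tempfile.mktemp` call in `on_generate`: the Python documentation marks `mktemp` as deprecated because another process can claim the returned name before the file is created. A possible replacement, sketched here with a hypothetical helper name, creates the file atomically and hands back its path:

```python
import os
import tempfile


def new_output_path(prefix: str = "ltx23voice_", suffix: str = ".wav") -> str:
    """Create an empty temp file up front and return its path.
    delete=False keeps the file on disk after the handle is closed, so the
    caller can overwrite it and must remove it when done."""
    with tempfile.NamedTemporaryFile(prefix=prefix, suffix=suffix,
                                     delete=False) as handle:
        return handle.name


if __name__ == "__main__":
    path = new_output_path()
    print(path, os.path.exists(path))
    os.remove(path)  # clean up the placeholder in this demo
```

Whether `generate_to_file` is happy to overwrite an existing empty file is an assumption that would need checking against the server code.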
configs/training_args.example.yaml
CHANGED
|
@@ -1,38 +1,50 @@
|
|
| 1 |
-
#
|
|
|
|
|
|
|
| 2 |
|
| 3 |
-
#
|
|
|
|
| 4 |
data_dir:
|
| 5 |
-
- /path/to/preprocessed_dataset_a/
|
| 6 |
-
- /path/to/preprocessed_dataset_b/
|
| 7 |
|
| 8 |
-
# One index file per data_dir entry. Each line
|
| 9 |
-
#
|
| 10 |
speaker_index:
|
| 11 |
-
- /path/to/preprocessed_dataset_a/index.txt
|
| 12 |
-
- /path/to/preprocessed_dataset_b/index.txt
|
| 13 |
|
| 14 |
-
# Output directory
|
|
|
|
| 15 |
output_dir: tts_iclora_v1
|
| 16 |
|
| 17 |
-
#
|
| 18 |
-
#
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
|
|
|
| 22 |
|
| 23 |
-
# LoRA hyperparams
|
| 24 |
lora_rank: 128
|
| 25 |
lora_alpha: 128
|
| 26 |
-
lora_dropout: 0.1
|
| 27 |
|
| 28 |
-
#
|
| 29 |
-
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
|
|
|
|
|
|
| 33 |
|
| 34 |
-
# Schedule
|
| 35 |
-
#
|
|
|
|
| 36 |
steps: 10000
|
| 37 |
lr: 1.0e-04
|
| 38 |
lr_scheduler: cosine
|
|
@@ -46,8 +58,6 @@ save_every: 500
|
|
| 46 |
log_every: 50
|
| 47 |
seed: 53
|
| 48 |
|
| 49 |
-
#
|
| 50 |
-
#
|
| 51 |
-
|
| 52 |
-
# (Optional) resume from a previous LoRA adapter file:
|
| 53 |
-
# resume_lora: tts_iclora_v0/lora_step_05000.safetensors
|
|
|
|
| 1 |
+
# LTX-2.3-Voice IC-LoRA training config — values become the defaults for
|
| 2 |
+
# `accelerate launch src/train.py --config configs/training_args.example.yaml`.
|
| 3 |
+
# Any flag explicitly passed on the CLI overrides the YAML.
|
| 4 |
|
| 5 |
+
# ── Data ───────────────────────────────────────────────────────────────────
|
| 6 |
+
# One entry per preprocessed dataset (output dirs from src/preprocess.py).
|
| 7 |
data_dir:
|
| 8 |
+
- /path/to/preprocessed_dataset_a/
|
| 9 |
+
- /path/to/preprocessed_dataset_b/
|
| 10 |
|
| 11 |
+
# One index file per data_dir entry. Each line follows the format you fed to
|
| 12 |
+
# preprocess.py — see README "Prepare your index file".
|
| 13 |
speaker_index:
|
| 14 |
+
- /path/to/preprocessed_dataset_a/index.txt
|
| 15 |
+
- /path/to/preprocessed_dataset_b/index.txt
|
| 16 |
|
| 17 |
+
# Output directory for LoRA shards + logs (relative paths resolve against the
|
| 18 |
+
# repo root).
|
| 19 |
output_dir: tts_iclora_v1
|
| 20 |
|
| 21 |
+
# ── Base model ─────────────────────────────────────────────────────────────
|
| 22 |
+
# Train your LoRA on top of LTX-2.3-Voice itself (recommended) — the trimmed audio
|
| 23 |
+
# components are enough; no need to ship the raw LTX-2.3 base.
|
| 24 |
+
checkpoint: dramabox-dit-v1.safetensors
|
| 25 |
+
full_checkpoint: dramabox-audio-components.safetensors
|
| 26 |
+
base_model: dev # 'dev' = ShiftedLogitNormal sampler; 'distilled' = DistilledTimestepSampler
|
| 27 |
|
| 28 |
+
# ── LoRA hyperparams (rank == alpha → scale = 1.0) ─────────────────────────
|
| 29 |
lora_rank: 128
|
| 30 |
lora_alpha: 128
|
| 31 |
+
lora_dropout: 0.1 # ~0.1 helps regularize on small datasets
|
| 32 |
|
| 33 |
+
# Resume an existing LoRA — step number parsed from the filename
|
| 34 |
+
# (e.g. lora_step_05000.safetensors → starts at step 5000).
|
| 35 |
+
# resume_lora: tts_iclora_v0/lora_step_05000.safetensors
|
| 36 |
+
|
| 37 |
+
# ── Voice-cloning reference tokens ─────────────────────────────────────────
|
| 38 |
+
ref_ratio: 0.3 # fraction of training samples that get a ref-token tail
|
| 39 |
+
max_ref_tokens: 200 # cap on appended ref tokens after patchification
|
| 40 |
|
| 41 |
+
# CFG training: probability of zeroing the text condition (forces reliance on
|
| 42 |
+
# the voice ref / unconditional path).
|
| 43 |
+
text_dropout: 0.4
|
| 44 |
|
| 45 |
+
# ── Schedule ───────────────────────────────────────────────────────────────
|
| 46 |
+
# Cosine + 1e-4 = from-scratch fine-tune.
|
| 47 |
+
# Constant + 1e-5 = polish on top of an existing LoRA (use with `resume_lora`).
|
| 48 |
steps: 10000
|
| 49 |
lr: 1.0e-04
|
| 50 |
lr_scheduler: cosine
|
|
|
|
| 58 |
log_every: 50
|
| 59 |
seed: 53
|
| 60 |
|
| 61 |
+
# Optional per-save-step validation pass. Generates a sample for every speaker
|
| 62 |
+
# in the val_config so you can A/B listen during training.
|
| 63 |
+
# val_config: configs/val_config.example.yaml
|
|
|
|
|
|
configs/val_config.example.yaml
ADDED
|
@@ -0,0 +1,25 @@
|
|
| 1 |
+
# Validation prompts run by src/validate.py at every --save-every checkpoint.
|
| 2 |
+
# Each entry produces one .wav under <output_dir>/val_step_<N>/<name>.wav.
|
| 3 |
+
#
|
| 4 |
+
# Fields:
|
| 5 |
+
# name — short tag used as the output filename
|
| 6 |
+
# prompt — full LTX-2.3-Voice-style scene prompt
|
| 7 |
+
# reference — (optional) absolute path to a 10+ s voice reference clip;
|
| 8 |
+
# omit for prompt-only generation
|
| 9 |
+
|
| 10 |
+
speakers:
|
| 11 |
+
- name: villain_growl
|
| 12 |
+
prompt: 'A shadowy villain speaks with cold menace, "You have entered my domain, mortal." He chuckles darkly, "Such arrogance will be your undoing."'
|
| 13 |
+
reference: /path/to/voice_refs/male_villain.wav
|
| 14 |
+
|
| 15 |
+
- name: tender_whisper
|
| 16 |
+
prompt: 'A woman speaks tenderly, "It has been a long day, my love." She whispers, "Close your eyes. I am right here."'
|
| 17 |
+
reference: /path/to/voice_refs/female_warm.wav
|
| 18 |
+
|
| 19 |
+
- name: catgirl_giggle
|
| 20 |
+
prompt: 'A playful girl already mid-giggle, "Hehehe, oh my gosh you should see your face!" She gasps, "Oh my, hehe, I cannot stop!"'
|
| 21 |
+
# No `reference:` here — pure prompt-driven generation.
|
| 22 |
+
|
| 23 |
+
- name: announcer_smug
|
| 24 |
+
prompt: 'A confident announcer speaks proudly, "And now, the moment you have all been waiting for." He chuckles knowingly, "Heheh."'
|
| 25 |
+
reference: /path/to/voice_refs/male_announcer.wav
|
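As a rough illustration of how the entries in this config map to output files, the sketch below takes an already-parsed config (a plain dict, as `yaml.safe_load` would return) and plans one wav per speaker. `plan_validation_outputs` is a hypothetical helper, and the exact formatting of the step number in `val_step_<N>` is an assumption; `src/validate.py` may pad or format it differently.

```python
import os
from typing import Dict, List, Optional


def plan_validation_outputs(config: Dict, output_dir: str,
                            step: int) -> List[Dict[str, Optional[str]]]:
    """Return one plan per speaker entry: where the wav goes, which prompt
    to use, and the optional voice reference (None for prompt-only)."""
    plans: List[Dict[str, Optional[str]]] = []
    seen = set()
    for entry in config.get("speakers", []):
        name = entry["name"]      # required: used as the output filename
        prompt = entry["prompt"]  # required: full scene prompt
        if name in seen:
            raise ValueError(f"duplicate speaker name: {name}")
        seen.add(name)
        wav = os.path.join(output_dir, f"val_step_{step}", f"{name}.wav")
        plans.append({"wav": wav, "prompt": prompt,
                      "reference": entry.get("reference")})
    return plans


if __name__ == "__main__":
    cfg = {"speakers": [
        {"name": "tender_whisper",
         "prompt": 'A woman speaks tenderly, "It has been a long day, my love."',
         "reference": "/path/to/voice_refs/female_warm.wav"},
        {"name": "catgirl_giggle",
         "prompt": 'A playful girl already mid-giggle, "Hehehe, I cannot stop!"'},
    ]}
    for plan in plan_validation_outputs(cfg, "tts_iclora_v1", 500):
        print(plan["wav"], plan["reference"])
```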
src/inference.py
CHANGED
|
@@ -608,10 +608,26 @@ def main():
|
|
| 608 |
audio_state = audio_tools.unpatchify(audio_state)
|
| 609 |
logging.info(f"Final latent shape: {audio_state.latent.shape}")
|
| 610 |
|
| 611 |
# ---- Decode audio ----
|
| 612 |
logging.info("Decoding audio...")
|
| 613 |
ad = AudioDecoder(checkpoint_path=args.full_checkpoint, dtype=dtype, device=device)
|
| 614 |
-
decoded = ad(
|
| 615 |
del ad
|
| 616 |
torch.cuda.empty_cache()
|
| 617 |
|
|
|
|
| 608 |
audio_state = audio_tools.unpatchify(audio_state)
|
| 609 |
logging.info(f"Final latent shape: {audio_state.latent.shape}")
|
| 610 |
|
| 611 |
+
# ---- End-of-clip silence-prior fix ----
|
| 612 |
+
# Base LTX-2.3 22B was trained on audio clips ≤ ~20 s and learned a strong
|
| 613 |
+
# "clip-end silence" prior at the next patchifier-aligned latent boundary
|
| 614 |
+
# (frame 513 = 8 × 64 + 1). For longer outputs that prior leaks through as
|
| 615 |
+
# a ~30 ms hard silence dip near 20.4 s. Linearly interpolating frames
|
| 616 |
+
# 512–513 between their neighbours (511 and 514) removes the dip cleanly.
|
| 617 |
+
latent_in = audio_state.latent
|
| 618 |
+
if latent_in.shape[2] > 514:  # frame index 514 (f1) is read below, so it must exist
|
| 619 |
+
f0, f1 = 511, 514
|
| 620 |
+
n = f1 - f0
|
| 621 |
+
patched = latent_in.clone()
|
| 622 |
+
for f in (512, 513):
|
| 623 |
+
t = (f - f0) / n
|
| 624 |
+
patched[:, :, f, :] = (1.0 - t) * latent_in[:, :, f0, :] + t * latent_in[:, :, f1, :]
|
| 625 |
+
latent_in = patched
|
| 626 |
+
|
| 627 |
# ---- Decode audio ----
|
| 628 |
logging.info("Decoding audio...")
|
| 629 |
ad = AudioDecoder(checkpoint_path=args.full_checkpoint, dtype=dtype, device=device)
|
| 630 |
+
decoded = ad(latent_in)
|
| 631 |
del ad
|
| 632 |
torch.cuda.empty_cache()
|
| 633 |
|
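The end-of-clip fix in this hunk is plain linear interpolation along the frame axis. The sketch below reproduces the idea on nested Python lists so it runs without PyTorch; `patch_frames` is a hypothetical name, and the list-of-rows layout stands in for the real 4-D latent tensor. The guard requires index `f1` to exist, since that frame is read as the right-hand neighbour.

```python
from typing import List

Frames = List[List[float]]


def patch_frames(frames: Frames, f0: int = 511, f1: int = 514) -> Frames:
    """Replace frames strictly between f0 and f1 with values linearly
    interpolated from frames f0 and f1. Shorter inputs pass through."""
    if len(frames) <= f1:  # frames[f1] must exist
        return frames
    patched = [list(row) for row in frames]  # copy, like latent.clone()
    span = f1 - f0
    for f in range(f0 + 1, f1):  # 512 and 513 with the defaults
        t = (f - f0) / span
        patched[f] = [(1.0 - t) * a + t * b
                      for a, b in zip(frames[f0], frames[f1])]
    return patched


if __name__ == "__main__":
    # A linear ramp with a simulated "silence dip" at frames 512 and 513.
    clip = [[float(i)] for i in range(520)]
    clip[512] = [0.0]
    clip[513] = [0.0]
    fixed = patch_frames(clip)
    print(fixed[511], fixed[512], fixed[513], fixed[514])
```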
src/model_downloader.py
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
-
Download
|
| 4 |
|
| 5 |
Models are cached locally after first download.
|
| 6 |
Gemma text encoder is fetched separately from Google's repo.
|
|
@@ -13,16 +13,17 @@ from huggingface_hub import hf_hub_download, snapshot_download
|
|
| 13 |
|
| 14 |
logger = logging.getLogger(__name__)
|
| 15 |
|
| 16 |
-
|
| 17 |
GEMMA_REPO = "unsloth/gemma-3-12b-it-bnb-4bit"
|
| 18 |
|
| 19 |
# Default cache directory
|
| 20 |
DEFAULT_CACHE = os.environ.get(
|
| 21 |
-
"
|
| 22 |
-
os.path.join(os.path.expanduser("~"), ".cache", "
|
| 23 |
)
|
| 24 |
|
| 25 |
-
# Model files in the HF repo (flat structure)
|
|
|
|
| 26 |
MODEL_FILES = {
|
| 27 |
"transformer": "dramabox-dit-v1.safetensors",
|
| 28 |
"audio_components": "dramabox-audio-components.safetensors",
|
|
@@ -35,7 +36,7 @@ def get_model_path(name: str, cache_dir: str = None) -> str:
|
|
| 35 |
|
| 36 |
Args:
|
| 37 |
name: One of 'transformer', 'audio_components', 'silence_latent'
|
| 38 |
-
cache_dir: Local cache directory (default: ~/.cache/
|
| 39 |
|
| 40 |
Returns:
|
| 41 |
Local file path
|
|
@@ -46,10 +47,10 @@ def get_model_path(name: str, cache_dir: str = None) -> str:
|
|
| 46 |
raise ValueError(f"Unknown model: {name}. Choose from: {list(MODEL_FILES.keys())}")
|
| 47 |
|
| 48 |
repo_path = MODEL_FILES[name]
|
| 49 |
-
logger.info(f"Fetching {name} from {
|
| 50 |
|
| 51 |
local_path = hf_hub_download(
|
| 52 |
-
repo_id=
|
| 53 |
filename=repo_path,
|
| 54 |
cache_dir=cache_dir,
|
| 55 |
token=os.environ.get("HF_TOKEN"),
|
|
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
+
Download LTX-2.3-Voice models from HuggingFace.
|
| 4 |
|
| 5 |
Models are cached locally after first download.
|
| 6 |
Gemma text encoder is fetched separately from Google's repo.
|
|
|
|
| 13 |
|
| 14 |
logger = logging.getLogger(__name__)
|
| 15 |
|
| 16 |
+
LTX23_VOICE_REPO = "ResembleAI/LTX-2.3-Voice"
|
| 17 |
GEMMA_REPO = "unsloth/gemma-3-12b-it-bnb-4bit"
|
| 18 |
|
| 19 |
# Default cache directory
|
| 20 |
DEFAULT_CACHE = os.environ.get(
|
| 21 |
+
"LTX23_VOICE_CACHE",
|
| 22 |
+
os.path.join(os.path.expanduser("~"), ".cache", "ltx-2.3-voice"),
|
| 23 |
)
|
| 24 |
|
| 25 |
+
# Model files in the HF repo (flat structure). The on-disk filenames stayed
|
| 26 |
+
# `dramabox-*.safetensors` after the rebrand to avoid an 8 GB re-upload.
|
| 27 |
MODEL_FILES = {
|
| 28 |
"transformer": "dramabox-dit-v1.safetensors",
|
| 29 |
"audio_components": "dramabox-audio-components.safetensors",
|
|
|
|
| 36 |
|
| 37 |
Args:
|
| 38 |
name: One of 'transformer', 'audio_components', 'silence_latent'
|
| 39 |
+
cache_dir: Local cache directory (default: ~/.cache/ltx-2.3-voice)
|
| 40 |
|
| 41 |
Returns:
|
| 42 |
Local file path
|
|
|
|
| 47 |
raise ValueError(f"Unknown model: {name}. Choose from: {list(MODEL_FILES.keys())}")
|
| 48 |
|
| 49 |
repo_path = MODEL_FILES[name]
|
| 50 |
+
logger.info(f"Fetching {name} from {LTX23_VOICE_REPO}/{repo_path}...")
|
| 51 |
|
| 52 |
local_path = hf_hub_download(
|
| 53 |
+
repo_id=LTX23_VOICE_REPO,
|
| 54 |
filename=repo_path,
|
| 55 |
cache_dir=cache_dir,
|
| 56 |
token=os.environ.get("HF_TOKEN"),
|
src/train.py
CHANGED
|
@@ -372,42 +372,64 @@ def run_validation(lora_path, val_config_path, output_dir, step, lora_rank=128):
|
|
| 372 |
# ─── Args ───
|
| 373 |
|
| 374 |
def parse_args():
|
| 375 |
-
|
| 376 |
-
|
| 377 |
-
|
| 378 |
-
|
| 379 |
-
|
| 380 |
-
|
| 381 |
-
|
| 382 |
help="Base model type: distilled uses DistilledTimestepSampler, dev uses ShiftedLogitNormal")
|
| 383 |
-
p.add_argument("--lora-rank", type=int, default=128)
|
| 384 |
-
p.add_argument("--lora-alpha", type=int, default=128)
|
| 385 |
-
p.add_argument("--lora-dropout", type=float, default=0.0,
|
| 386 |
help="Dropout applied to LoRA A/B matrices during training. "
|
| 387 |
"Recommended ~0.1 for small datasets to regularize.")
|
| 388 |
-
p.add_argument("--resume-lora", default=None)
|
| 389 |
-
p.add_argument("--resume-step-offset", type=int, default=None,
|
| 390 |
help="Step to add when naming saved checkpoints. If None, inferred "
|
| 391 |
"from --resume-lora filename (e.g. lora_step_10000.safetensors → 10000). "
|
| 392 |
"Set to 0 to start numbering at 0 regardless.")
|
| 393 |
-
p.add_argument("--ref-ratio", type=float, default=0.3,
|
| 394 |
help="Fraction of target length to use as reference (default 0.3)")
|
| 395 |
-
p.add_argument("--max-ref-tokens", type=int, default=200,
|
| 396 |
help="Maximum reference tokens after patchification (default 200)")
|
| 397 |
-
p.add_argument("--text-dropout", type=float, default=0.0,
|
| 398 |
help="Probability of dropping text conditioning (forces reliance on voice ref)")
|
| 399 |
-
p.add_argument("--steps", type=int, default=30000)
|
| 400 |
-
p.add_argument("--lr", type=float, default=3e-5)
|
| 401 |
-
p.add_argument("--lr-scheduler", choices=["cosine", "linear", "constant"], default="cosine")
|
| 402 |
-
p.add_argument("--batch-size", type=int, default=1)
|
| 403 |
-
p.add_argument("--grad-accum", type=int, default=4)
|
| 404 |
-
p.add_argument("--max-grad-norm", type=float, default=1.0)
|
| 405 |
-
p.add_argument("--save-every", type=int, default=1000)
|
| 406 |
-
p.add_argument("--log-every", type=int, default=50)
|
| 407 |
-
p.add_argument("--seed", type=int, default=42)
|
| 408 |
-
p.add_argument("--warmup-steps", type=int, default=100)
|
| 409 |
-
p.add_argument("--val-config", default=None)
|
| 410 |
-
return p.parse_args()
|
| 411 |
|
| 412 |
|
| 413 |
# ─── Main ───
|
|
|
|
| 372 |
# ─── Args ───
|
| 373 |
|
| 374 |
def parse_args():
|
| 375 |
+
# First pass: pull out --config so its values can become argparse defaults.
|
| 376 |
+
cfg_parser = argparse.ArgumentParser(add_help=False)
|
| 377 |
+
cfg_parser.add_argument("--config", default=None,
|
| 378 |
+
help="YAML file with default values for any of the flags below. "
|
| 379 |
+
"Explicit CLI flags still override the YAML.")
|
| 380 |
+
cfg_args, remaining = cfg_parser.parse_known_args()
|
| 381 |
+
yaml_defaults: dict = {}
|
| 382 |
+
if cfg_args.config:
|
| 383 |
+
import yaml as _yaml
|
| 384 |
+
with open(cfg_args.config) as f:
|
| 385 |
+
yaml_defaults = _yaml.safe_load(f) or {}
|
| 386 |
+
# YAML keys are dashes-or-underscores → normalize to argparse dest (underscore).
|
| 387 |
+
yaml_defaults = {k.replace("-", "_"): v for k, v in yaml_defaults.items()}
|
| 388 |
+
|
| 389 |
+
def _yaml(name, fallback):
|
| 390 |
+
return yaml_defaults.get(name, fallback)
|
| 391 |
+
|
| 392 |
+
p = argparse.ArgumentParser(
|
| 393 |
+
parents=[cfg_parser],
|
| 394 |
+
description="Audio-Only IC-LoRA Training for Voice Cloning",
|
| 395 |
+
)
|
| 396 |
+
p.add_argument("--data-dir", required="data_dir" not in yaml_defaults,
|
| 397 |
+
nargs="+", default=_yaml("data_dir", None))
|
| 398 |
+
p.add_argument("--speaker-index", required="speaker_index" not in yaml_defaults,
|
| 399 |
+
nargs="+", default=_yaml("speaker_index", None))
|
| 400 |
+
p.add_argument("--output-dir", default=_yaml("output_dir", os.path.join(MODEL_DIR, "tts_iclora_v1")))
|
| 401 |
+
p.add_argument("--checkpoint", default=_yaml("checkpoint", os.path.join(MODEL_DIR, "dramabox-dit-v1.safetensors")))
|
| 402 |
+
p.add_argument("--full-checkpoint", default=_yaml("full_checkpoint", os.path.join(MODEL_DIR, "dramabox-audio-components.safetensors")))
|
| 403 |
+
p.add_argument("--base-model", choices=["distilled", "dev"], default=_yaml("base_model", "dev"),
|
| 404 |
help="Base model type: distilled uses DistilledTimestepSampler, dev uses ShiftedLogitNormal")
|
| 405 |
+
p.add_argument("--lora-rank", type=int, default=_yaml("lora_rank", 128))
|
| 406 |
+
p.add_argument("--lora-alpha", type=int, default=_yaml("lora_alpha", 128))
|
| 407 |
+
p.add_argument("--lora-dropout", type=float, default=_yaml("lora_dropout", 0.0),
|
| 408 |
help="Dropout applied to LoRA A/B matrices during training. "
|
| 409 |
"Recommended ~0.1 for small datasets to regularize.")
|
| 410 |
+
p.add_argument("--resume-lora", default=_yaml("resume_lora", None))
|
| 411 |
+
p.add_argument("--resume-step-offset", type=int, default=_yaml("resume_step_offset", None),
|
| 412 |
help="Step to add when naming saved checkpoints. If None, inferred "
|
| 413 |
"from --resume-lora filename (e.g. lora_step_10000.safetensors → 10000). "
|
| 414 |
"Set to 0 to start numbering at 0 regardless.")
|
| 415 |
+
p.add_argument("--ref-ratio", type=float, default=_yaml("ref_ratio", 0.3),
|
| 416 |
help="Fraction of target length to use as reference (default 0.3)")
|
| 417 |
+
p.add_argument("--max-ref-tokens", type=int, default=_yaml("max_ref_tokens", 200),
|
| 418 |
help="Maximum reference tokens after patchification (default 200)")
|
| 419 |
+
p.add_argument("--text-dropout", type=float, default=_yaml("text_dropout", 0.0),
|
| 420 |
help="Probability of dropping text conditioning (forces reliance on voice ref)")
|
| 421 |
+
p.add_argument("--steps", type=int, default=_yaml("steps", 30000))
|
| 422 |
+
p.add_argument("--lr", type=float, default=_yaml("lr", 3e-5))
|
| 423 |
+
p.add_argument("--lr-scheduler", choices=["cosine", "linear", "constant"], default=_yaml("lr_scheduler", "cosine"))
|
| 424 |
+
p.add_argument("--batch-size", type=int, default=_yaml("batch_size", 1))
|
| 425 |
+
p.add_argument("--grad-accum", type=int, default=_yaml("grad_accum", 4))
|
| 426 |
+
p.add_argument("--max-grad-norm", type=float, default=_yaml("max_grad_norm", 1.0))
|
| 427 |
+
p.add_argument("--save-every", type=int, default=_yaml("save_every", 1000))
|
| 428 |
+
p.add_argument("--log-every", type=int, default=_yaml("log_every", 50))
|
| 429 |
+
p.add_argument("--seed", type=int, default=_yaml("seed", 42))
|
| 430 |
+
p.add_argument("--warmup-steps", type=int, default=_yaml("warmup_steps", 100))
|
| 431 |
+
p.add_argument("--val-config", default=_yaml("val_config", None))
|
| 432 |
+
return p.parse_args(remaining)
|
| 433 |
|
| 434 |
|
| 435 |
# ─── Main ───
|
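The precedence rule in `parse_args` (built-in default, then config file, then explicit CLI flag) can also be expressed with `ArgumentParser.set_defaults`. The sketch below is a variant and not the code from this commit: it takes an already-loaded dict in place of a YAML file so it stays stdlib-only, uses a hypothetical function name, and rejects unknown keys, whereas the lookup-by-name approach in the diff ignores keys that match no flag.

```python
import argparse
from typing import Dict, List


def parse_with_defaults(argv: List[str],
                        file_defaults: Dict[str, object]) -> argparse.Namespace:
    """CLI flags override file values, which override built-in defaults."""
    # Accept both dashed and underscored keys, like the YAML normalization.
    normalized = {k.replace("-", "_"): v for k, v in file_defaults.items()}

    p = argparse.ArgumentParser(description="config-defaults sketch")
    p.add_argument("--steps", type=int, default=30000)
    p.add_argument("--lr", type=float, default=3e-5)
    p.add_argument("--lr-scheduler",
                   choices=["cosine", "linear", "constant"], default="cosine")

    # Parsing an empty argv yields every known destination name.
    known = set(vars(p.parse_args([])))
    unknown = set(normalized) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")

    p.set_defaults(**normalized)
    return p.parse_args(argv)


if __name__ == "__main__":
    args = parse_with_defaults(["--steps", "500"],
                               {"lr": 1e-4, "lr-scheduler": "constant"})
    print(args.steps, args.lr, args.lr_scheduler)
```

Failing loudly on unknown keys is a design choice: a misspelled key such as `lora_dropuot` would otherwise leave the built-in default in force without any warning.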
src/validate.py
CHANGED
|
@@ -29,7 +29,7 @@ DEV_FULL_CKPT = os.environ.get(
|
|
| 29 |
)
|
| 30 |
GEMMA_ROOT = os.environ.get(
|
| 31 |
"GEMMA_ROOT",
|
| 32 |
-
os.path.expanduser("~/.cache/
|
| 33 |
)
|
| 34 |
|
| 35 |
|
|
|
|
| 29 |
)
|
| 30 |
GEMMA_ROOT = os.environ.get(
|
| 31 |
"GEMMA_ROOT",
|
| 32 |
+
os.path.expanduser("~/.cache/ltx-2.3-voice/gemma-3-12b-it-bnb-4bit"),
|
| 33 |
)
|
| 34 |
|
| 35 |
|