Dramabox

Manmay Nakhashi committed 26 days ago

Commit

1636761

1 Parent(s): c29ae29

Rebrand: DramaBox → LTX-2.3-Voice

- README title + frontmatter (title, emoji, license_link, short_description)
- All in-text references in docs and code
- model_downloader: DRAMABOX_REPO → LTX23_VOICE_REPO, default cache → ~/.cache/ltx-2.3-voice
- app.py title, prefix, log lines
- HF model+space repos are renamed to ResembleAI/LTX-2.3-Voice (HF auto-redirects old paths)

The on-disk safetensors filenames (`dramabox-dit-v1.safetensors`,
`dramabox-audio-components.safetensors`) stay as-is to avoid an 8 GB
re-upload; comment in model_downloader.py explains the leftover names.

Files changed (8)

README.md +188 -20
app.py +6 -6
configs/training_args.example.yaml +37 -27
configs/val_config.example.yaml +25 -0
src/inference.py +17 -1
src/model_downloader.py +9 -8
src/train.py +49 -27
src/validate.py +1 -1

README.md CHANGED

@@ -1,6 +1,6 @@
 ---
-title: DramaBox
-emoji: 🎭
 colorFrom: red
 colorTo: indigo
 sdk: gradio
@@ -9,34 +9,202 @@ app_file: app.py
 pinned: true
 license: other
 license_name: ltx-2-community
-license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
 hf_oauth: false
-short_description: Expressive TTS with voice cloning — DramaBox demo
 ---
-# DramaBox — Expressive TTS Demo
-Live demo of [`ResembleAI/Dramabox`](https://huggingface.co/ResembleAI/Dramabox). Write a scene prompt, optionally upload a 10-second voice reference, and generate. Audio is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth).
-The model checkpoints download automatically on first launch.
-## Prompt format
 ```
-<speaker description>, "<dialogue>" <action direction> "<more dialogue>"
 ```
-- **Inside double quotes**: dialogue and phonetic sounds (`"Hahaha"`, `"Mmmmm"`, `"Ugh"`)
-- **Outside quotes**: stage directions (`She sighs.`, `He clears his throat.`)
-- **Avoid inside quotes**: `Ahem`, `Pfft`, `Sigh`, `Gasp`, `Cough` — the model will speak them literally.
-See the **Load an example prompt** dropdown for ready-made scene templates.
-## Files
-- `app.py` — Gradio UI
-- `src/inference_server.py` — warm `TTSServer` (single load, ~2.5s/request)
-- `src/inference.py` — CLI inference
-- `src/model_downloader.py` — auto-fetches model from HuggingFace
-- `ltx2/` — vendored LTX-2 pipelines
-- `requirements.txt` — Python deps (includes `resemble-perth`)

 ---
+title: LTX-2.3-Voice
+emoji: 🎙️
 colorFrom: red
 colorTo: indigo
 sdk: gradio
 pinned: true
 license: other
 license_name: ltx-2-community
+license_link: https://huggingface.co/ResembleAI/LTX-2.3-Voice/blob/main/LICENSE
 hf_oauth: false
+short_description: Expressive TTS with voice cloning — LTX-2.3-Voice demo
 ---
+# LTX-2.3-Voice — Expressive TTS with Voice Cloning
+Prompt-driven TTS with voice cloning, built as an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only** model. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses and transitions; an optional 10-second voice reference clones the target timbre.
+| | |
+|---|---|
+| 🤗 **Model** | [`ResembleAI/LTX-2.3-Voice`](https://huggingface.co/ResembleAI/LTX-2.3-Voice) |
+| 🎭 **Demo Space** | [`ResembleAI/LTX-2.3-Voice`](https://huggingface.co/spaces/ResembleAI/LTX-2.3-Voice) (ZeroGPU) |
+| 📜 **License** | LTX-2 Community License — see [`LICENSE`](LICENSE) |
+## Models
+Auto-downloaded from the HF model repo on first run.
+| File | Size | Description |
+|---|---|---|
+| `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (LoRA already merged into base) |
+| `dramabox-audio-components.safetensors` | 1.9 GB | Audio embeddings connector + audio text projection + audio VAE + vocoder |
+| [`unsloth/gemma-3-12b-it-bnb-4bit`](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) | ~8 GB | Text encoder |
+**VRAM**: ~24 GB peak · **Speed**: ~2.5 s / generation (warm server, H100)
+## Quick Start
+### Warm server (recommended)
+```python
+from src.inference_server import TTSServer
+server = TTSServer(device="cuda")
+server.generate_to_file(
+    prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
+    output="output.wav",
+    voice_ref="reference.wav",   # optional, 10+ seconds
+)
+```
+### CLI
+```bash
+python src/inference.py \
+  --voice-sample reference.wav \
+  --prompt 'A woman speaks warmly, "Hello, how are you today?"' \
+  --output output.wav \
+  --cfg-scale 2.5 --stg-scale 1.5
+```
+### Gradio app
+```bash
+CUDA_VISIBLE_DEVICES=4 python app.py
+```
+## Inference Settings
+| Parameter | Default | Notes |
+|---|---|---|
+| `cfg-scale` | 2.5 | Lower = more natural, higher = more text-faithful |
+| `stg-scale` | 1.5 | Skip-token guidance |
+| `rescale` | 0 | No rescaling |
+| `modality` | 1 | No modality guidance |
+| `duration-multiplier` | 1.1 | 10% breathing room on auto-estimated length |
+| `steps` | 30 | Euler flow matching |
+## Prompt Writing Guide
+**Structure:** `<speaker description>, "<dialogue>" <action direction> "<more dialogue>"`
+**Inside quotes** (model produces actual sounds):
+- Laughs: `"Hahaha"` `"Hehehe"` (always one word, never separated)
+- Sounds: `"Mmmmm"` `"Ugh"` `"Argh"` `"Ahhh"` `"Hmm"`
+**Outside quotes** (stage directions):
+- `She sighs deeply.` · `He gulps nervously.` · `A long pause.`
+- `Her voice cracks.` · `He clears his throat.` · `She scoffs.`
+**Avoid inside quotes** (model speaks them literally): `Ahem`, `Pfft`, `Sigh`, `Gasp`, `Cough`.
+**Tips**
+- Match gender/age in the speaker description to the voice reference
+- Break long dialogue into segments with action directions in between
+- End the prompt at the last closing quote mark (no trailing description)
+## Watermarking
+Every audio output from `inference.py` and `inference_server.TTSServer.generate_to_file` is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth) — an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
+```python
+import perth, librosa
+wav, sr = librosa.load("output.wav", sr=None, mono=True)
+detector = perth.PerthImplicitWatermarker()
+print(detector.get_watermark(wav, sample_rate=sr))   # confidence ≈ 1.0
+```
+Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable for debugging.
+## Training a LoRA on top of LTX-2.3-Voice
+You can fine-tune your own LoRA using LTX-2.3-Voice itself as the base — no need to start from raw LTX-2.3. Useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.
+### 1. Prepare your index file
+The preprocessor accepts four formats. The `text` field is the **target transcript**; if you want to attach a scene-style prompt (the part the model conditions on at inference time), prepend it to the transcript in the same format the model was trained on:
+> `A woman speaks warmly, "<your transcript here>"`
+Both forms are supported — with or without the prompt wrapper. Without the wrapper the model treats the entry as plain text-to-speech.
+**Format A — `manifest` (JSONL)** — recommended for new datasets:
+```jsonl
+{"audio_filepath": "wavs/spk01_001.wav", "text": "A woman speaks warmly, \"Hello, how are you today?\""}
+{"audio_filepath": "wavs/spk01_002.wav", "text": "Hello, how are you today?"}
+{"audio_filepath": "wavs/spk02_001.flac", "text": "An exhausted father sighs, \"Sweetie, daddy is asking very nicely.\"", "duration": 4.7}
+```
+Fields: `audio_filepath` (or `audio_path`) is required, `text` (or `transcript`) is required, `duration` is optional.
+**Format B — `tsv`** — simplest, one line per sample:
+```
+wavs/spk01_001.wav	A woman speaks warmly, "Hello, how are you today?"
+wavs/spk01_002.wav	Hello, how are you today?
+```
+**Format C — `gemini_synthetic`** — `~`-separated, used for prompted synthetic data:
+```
+id~speaker~lang~sr~samples~dur~phonemes~text
+spk01_001~spk01~en~24000~93000~3.875~_~A woman speaks warmly, "Hello, how are you today?"
+```
+**Format D — `libriheavy`** — `~`-separated, for unprompted text-only data:
+```
+id~speaker~lang~samples~dur_ms~phonemes~text
+spk01_001~spk01~en~93000~3875~_~Hello, how are you today?
 ```
+### 2. Preprocess
+```bash
+python src/preprocess.py \
+  --dataset-type manifest \
+  --index your_data.jsonl \
+  --audio-dir /path/to/wavs \
+  --output-dir /path/to/preprocessed/ \
+  --checkpoint /path/to/dramabox-audio-components.safetensors \
+  --gemma-root /path/to/gemma-3-12b-it-bnb-4bit/ \
+  --max-duration 20.0 --min-duration 2.0
+```
+Output layout (training-ready `.pt` files):
+```
+preprocessed/
+├── audio_latents/sample_*.pt     # Audio VAE-encoded latents
+├── conditions/sample_*.pt        # Gemma text embeddings
+└── latents/sample_*.pt           # Dummy video latents (placeholder)
+```
+### 3. Train
+Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the LTX-2.3-Voice files, then launch with HuggingFace `accelerate`. Any flag passed on the CLI overrides the YAML.
+```bash
+accelerate launch src/train.py \
+  --config configs/training_args.example.yaml
+```
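+The trainer attaches a fresh LoRA to the audio branch on top of the LTX-2.3-Voice checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` × 48 transformer blocks (6 modules × 48 = 288 LoRA pairs total). The example config uses rank 128 / alpha 128 / dropout 0.1 and a cosine LR schedule from 1e-4 with 500-step warmup over 10k steps; the built-in `src/train.py` fallbacks differ (for example dropout 0.0, lr 3e-5, 30000 steps), so pass `--config` to get these values.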
+To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML — `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.
+### Inference with your trained LoRA
+```bash
+python src/inference.py \
+  --lora /path/to/your/lora_step_5000.safetensors \
+  --voice-sample reference.wav \
+  --prompt 'A woman speaks warmly, "..."' \
+  --output output.wav
 ```
+Always load the LoRA at inference rather than pre-merging it — pre-merged checkpoints have produced degraded output in our runs.
+## Language
+English.
+## License
+Built on [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks. Distributed under the LTX-2 Community License Agreement — see [`LICENSE`](LICENSE).

app.py CHANGED

@@ -1,5 +1,5 @@
 #!/usr/bin/env python3
-"""DramaBox — Gradio demo (warm server).
 Loads the warm TTSServer once, then handles requests at ~2.5 s each. All
 generated audio is invisibly watermarked with Resemble Perth before being
@@ -21,14 +21,14 @@ from model_downloader import get_all_paths  # noqa: E402
 logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
-logging.info("Fetching DramaBox checkpoints from HuggingFace (cached after first run)...")
 PATHS = get_all_paths()
 # Module-level warm load (same pattern as IndexTTS-2-Demo on ZeroGPU). The
 # `spaces` package patches torch so that .to("cuda") at import time pins the
 # weights into ZeroGPU's shared memory; each @spaces.GPU call then maps them
 # onto the actual GPU instantly. First user request is ~2.5 s instead of ~30 s.
-logging.info("Loading DramaBox warm server (Gemma + DiT + VAE + Decoder)...")
 tts = TTSServer(
     checkpoint=PATHS["transformer"],
     full_checkpoint=PATHS["audio_components"],
@@ -112,7 +112,7 @@ def on_generate(prompt: str, audio_ref, cfg: float, stg: float, dur_mult: float,
         raise gr.Error("Prompt is empty.")
     t0 = time.time()
     ref_path = audio_ref if audio_ref and os.path.exists(str(audio_ref)) else None
-    output = tempfile.mktemp(suffix=".wav", prefix="dramabox_")
     tts.generate_to_file(
         prompt=prompt,
         output=output,
@@ -127,12 +127,12 @@ def on_generate(prompt: str, audio_ref, cfg: float, stg: float, dur_mult: float,
 # ── UI ──────────────────────────────────────────────────────────────────────
 with gr.Blocks(
-    title="DramaBox — Expressive TTS",
     theme=gr.themes.Default(),
     css=".prompt-box textarea { font-size: 14px !important; line-height: 1.5 !important; }",
     analytics_enabled=False,
 ) as app:
-    gr.Markdown("# 🎭 DramaBox — Expressive TTS with Voice Cloning")
     gr.Markdown(
         "Write a scene prompt, optionally upload a 10-second voice reference, "
         "and generate. Audio is automatically watermarked with "

 #!/usr/bin/env python3
+"""LTX-2.3-Voice — Gradio demo (warm server).
 Loads the warm TTSServer once, then handles requests at ~2.5 s each. All
 generated audio is invisibly watermarked with Resemble Perth before being
 logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+logging.info("Fetching LTX-2.3-Voice checkpoints from HuggingFace (cached after first run)...")
 PATHS = get_all_paths()
 # Module-level warm load (same pattern as IndexTTS-2-Demo on ZeroGPU). The
 # `spaces` package patches torch so that .to("cuda") at import time pins the
 # weights into ZeroGPU's shared memory; each @spaces.GPU call then maps them
 # onto the actual GPU instantly. First user request is ~2.5 s instead of ~30 s.
+logging.info("Loading LTX-2.3-Voice warm server (Gemma + DiT + VAE + Decoder)...")
 tts = TTSServer(
     checkpoint=PATHS["transformer"],
     full_checkpoint=PATHS["audio_components"],
         raise gr.Error("Prompt is empty.")
     t0 = time.time()
     ref_path = audio_ref if audio_ref and os.path.exists(str(audio_ref)) else None
+    output = tempfile.mktemp(suffix=".wav", prefix="ltx23voice_")
     tts.generate_to_file(
         prompt=prompt,
         output=output,
 # ── UI ──────────────────────────────────────────────────────────────────────
 with gr.Blocks(
+    title="LTX-2.3-Voice — Expressive TTS",
     theme=gr.themes.Default(),
     css=".prompt-box textarea { font-size: 14px !important; line-height: 1.5 !important; }",
     analytics_enabled=False,
 ) as app:
+    gr.Markdown("# 🎭 LTX-2.3-Voice — Expressive TTS with Voice Cloning")
     gr.Markdown(
         "Write a scene prompt, optionally upload a 10-second voice reference, "
         "and generate. Audio is automatically watermarked with "

configs/training_args.example.yaml CHANGED

@@ -1,38 +1,50 @@
-# Example DramaBox IC-LoRA training config. Used by scripts/train.sh.
-# Where to load preprocessed `audio_latents/` + `conditions/` shards from.
 data_dir:
-- /path/to/preprocessed_dataset_a/
-- /path/to/preprocessed_dataset_b/
-# One index file per data_dir entry. Each line:
-#   <sample_id>~<speaker_id>~<lang>~<sample_rate>~<offset>~<duration>~<phonemes>~<text>
 speaker_index:
-- /path/to/preprocessed_dataset_a/index.txt
-- /path/to/preprocessed_dataset_b/index.txt
-# Output directory (relative is fine — resolved against the repo root).
 output_dir: tts_iclora_v1
-# LTX-2.3 22B base. Same file is used for the transformer + the aux stack
-# (PromptEncoder, AudioVAE, AudioDecoder).
-checkpoint: ltx-2.3-22b-dev.safetensors
-full_checkpoint: ltx-2.3-22b-dev.safetensors
-base_model: dev
-# LoRA hyperparams. rank == alpha is the simplest setup (scale = 1.0).
 lora_rank: 128
 lora_alpha: 128
-lora_dropout: 0.1
-# Voice-cloning ref-token settings.
-ref_ratio: 0.3            # fraction of training samples that get a ref token
-max_ref_tokens: 200       # max ref-token positions appended to target
-text_dropout: 0.4         # CFG training: drop the text prompt with prob 0.4
-# Schedule. Use lr_scheduler=constant with a small lr (1e-5) for a "fine-tune"
-# resume; cosine + larger lr (1e-4) for from-scratch.
 steps: 10000
 lr: 1.0e-04
 lr_scheduler: cosine
@@ -46,8 +58,6 @@ save_every: 500
 log_every: 50
 seed: 53
-# (Optional) per-checkpoint validation eval — see configs/val_config.example.yaml
-# val_config: val_config.example.yaml
-# (Optional) resume from a previous LoRA adapter file:
-# resume_lora: tts_iclora_v0/lora_step_05000.safetensors

+# LTX-2.3-Voice IC-LoRA training config — values become the defaults for
+# `accelerate launch src/train.py --config configs/training_args.example.yaml`.
+# Any flag explicitly passed on the CLI overrides the YAML.
+# ── Data ───────────────────────────────────────────────────────────────────
+# One entry per preprocessed dataset (output dirs from src/preprocess.py).
 data_dir:
+  - /path/to/preprocessed_dataset_a/
+  - /path/to/preprocessed_dataset_b/
+# One index file per data_dir entry. Each line follows the format you fed to
+# preprocess.py — see README "Prepare your index file".
 speaker_index:
+  - /path/to/preprocessed_dataset_a/index.txt
+  - /path/to/preprocessed_dataset_b/index.txt
+# Output directory for LoRA shards + logs (relative paths resolve against the
+# repo root).
 output_dir: tts_iclora_v1
+# ── Base model ─────────────────────────────────────────────────────────────
+# Train your LoRA on top of LTX-2.3-Voice itself (recommended) — the trimmed audio
+# components are enough; no need to ship the raw LTX-2.3 base.
+checkpoint: dramabox-dit-v1.safetensors
+full_checkpoint: dramabox-audio-components.safetensors
+base_model: dev          # 'dev' = ShiftedLogitNormal sampler; 'distilled' = DistilledTimestepSampler
+# ── LoRA hyperparams (rank == alpha → scale = 1.0) ─────────────────────────
 lora_rank: 128
 lora_alpha: 128
+lora_dropout: 0.1        # ~0.1 helps regularize on small datasets
+# Resume an existing LoRA — step number parsed from the filename
+# (e.g. lora_step_05000.safetensors → starts at step 5000).
+# resume_lora: tts_iclora_v0/lora_step_05000.safetensors
+# ── Voice-cloning reference tokens ─────────────────────────────────────────
+ref_ratio: 0.3           # fraction of training samples that get a ref-token tail
+max_ref_tokens: 200      # cap on appended ref tokens after patchification
+# CFG training: probability of zeroing the text condition (forces reliance on
+# the voice ref / unconditional path).
+text_dropout: 0.4
+# ── Schedule ───────────────────────────────────────────────────────────────
+# Cosine + 1e-4 = from-scratch fine-tune.
+# Constant + 1e-5 = polish on top of an existing LoRA (use with `resume_lora`).
 steps: 10000
 lr: 1.0e-04
 lr_scheduler: cosine
 log_every: 50
 seed: 53
+# Optional per-save-step validation pass. Generates a sample for every speaker
+# in the val_config so you can A/B listen during training.
+# val_config: configs/val_config.example.yaml
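
Reviewer note, not part of the commit: the schedule described in the README (cosine from 1e-4, 500-step warmup, 10k steps) can be pictured with the sketch below. The exact shape is an assumption: linear warmup from zero and decay all the way to zero are common choices, but the trainer's actual scheduler is not shown in this diff. `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step: int, peak_lr: float = 1e-4, warmup: int = 500, total: int = 10_000) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# LoRA scale as described in the config comment: alpha / rank.
lora_scale = 128 / 128
```

Under these assumptions the rate peaks at step 500 and is halved at step 5250, the midpoint of the decay phase.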

configs/val_config.example.yaml ADDED

@@ -0,0 +1,25 @@

+# Validation prompts run by src/validate.py at every --save-every checkpoint.
+# Each entry produces one .wav under <output_dir>/val_step_<N>/<name>.wav.
+#
+# Fields:
+#   name      — short tag used as the output filename
+#   prompt    — full LTX-2.3-Voice-style scene prompt
+#   reference — (optional) absolute path to a 10+ s voice reference clip;
+#               omit for prompt-only generation
+speakers:
+  - name: villain_growl
+    prompt: 'A shadowy villain speaks with cold menace, "You have entered my domain, mortal." He chuckles darkly, "Such arrogance will be your undoing."'
+    reference: /path/to/voice_refs/male_villain.wav
+  - name: tender_whisper
+    prompt: 'A woman speaks tenderly, "It has been a long day, my love." She whispers, "Close your eyes. I am right here."'
+    reference: /path/to/voice_refs/female_warm.wav
+  - name: catgirl_giggle
+    prompt: 'A playful girl already mid-giggle, "Hehehe, oh my gosh you should see your face!" She gasps, "Oh my, hehe, I cannot stop!"'
+    # No `reference:` here — pure prompt-driven generation.
+  - name: announcer_smug
+    prompt: 'A confident announcer speaks proudly, "And now, the moment you have all been waiting for." He chuckles knowingly, "Heheh."'
+    reference: /path/to/voice_refs/male_announcer.wav
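
Reviewer note, not part of the commit: the README's prompt guide says words such as `Ahem`, `Pfft`, `Sigh`, `Gasp` and `Cough` are spoken literally when they appear inside quotes. A small check like the one below could be run over validation prompts before training. It is a sketch: `literal_sound_words` is a hypothetical helper, and it assumes prompts use straight double quotes as in the examples above.

```python
import re

# Words the README says the model speaks literally when placed inside quotes.
AVOID_IN_QUOTES = {"ahem", "pfft", "sigh", "gasp", "cough"}


def literal_sound_words(prompt: str) -> list[str]:
    """Return avoided words that appear inside double-quoted dialogue."""
    hits = []
    for dialogue in re.findall(r'"([^"]*)"', prompt):
        for word in re.findall(r"[A-Za-z]+", dialogue):
            if word.lower() in AVOID_IN_QUOTES:
                hits.append(word)
    return hits


ok = literal_sound_words('A playful girl already mid-giggle, "Hehehe, oh my gosh!" She gasps, "Oh my, hehe!"')
bad = literal_sound_words('A man speaks, "Ahem, may I have your attention?"')
```

`She gasps` passes because it sits outside the quotes, where it acts as a stage direction.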

src/inference.py CHANGED

@@ -608,10 +608,26 @@ def main():
     audio_state = audio_tools.unpatchify(audio_state)
     logging.info(f"Final latent shape: {audio_state.latent.shape}")
     # ---- Decode audio ----
     logging.info("Decoding audio...")
     ad = AudioDecoder(checkpoint_path=args.full_checkpoint, dtype=dtype, device=device)
-    decoded = ad(audio_state.latent)
     del ad
     torch.cuda.empty_cache()

     audio_state = audio_tools.unpatchify(audio_state)
     logging.info(f"Final latent shape: {audio_state.latent.shape}")
+    # ---- End-of-clip silence-prior fix ----
+    # Base LTX-2.3 22B was trained on audio clips ≤ ~20 s and learned a strong
+    # "clip-end silence" prior at the next patchifier-aligned latent boundary
+    # (frame 513 = 8 × 64 + 1). For longer outputs that prior leaks through as
+    # a ~30 ms hard silence dip near 20.4 s. Linearly interpolating frames
+    # 512–513 between their neighbours (511 and 514) removes the dip cleanly.
+    latent_in = audio_state.latent
+    if latent_in.shape[2] > 514:  # frame 514 must exist to interpolate toward it
+        f0, f1 = 511, 514
+        n = f1 - f0
+        patched = latent_in.clone()
+        for f in (512, 513):
+            t = (f - f0) / n
+            patched[:, :, f, :] = (1.0 - t) * latent_in[:, :, f0, :] + t * latent_in[:, :, f1, :]
+        latent_in = patched
     # ---- Decode audio ----
     logging.info("Decoding audio...")
     ad = AudioDecoder(checkpoint_path=args.full_checkpoint, dtype=dtype, device=device)
+    decoded = ad(latent_in)
     del ad
     torch.cuda.empty_cache()
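
Reviewer note, not part of the commit: the silence-prior fix interpolates frames 512 and 513 between frames 511 and 514, so the sequence must contain index 514 (at least 515 frames) for the read to be valid. The pure-Python sketch below mirrors that logic on a list of per-frame feature lists instead of a tensor; `patch_boundary_frames` is a hypothetical name and the toy signal is invented for illustration.

```python
def patch_boundary_frames(frames, f0=511, f1=514):
    """Linearly interpolate frames strictly between f0 and f1.

    `frames` is a list of per-frame feature lists; a copy is returned.
    """
    if len(frames) <= f1:  # frame f1 must exist, so at least f1 + 1 frames
        return [list(fr) for fr in frames]
    patched = [list(fr) for fr in frames]
    n = f1 - f0
    for f in range(f0 + 1, f1):
        t = (f - f0) / n
        patched[f] = [(1.0 - t) * a + t * b for a, b in zip(frames[f0], frames[f1])]
    return patched


# Toy signal: one feature per frame that ramps linearly, with a "dip" at 512-513.
signal = [[float(i)] for i in range(600)]
signal[512] = [0.0]
signal[513] = [0.0]
fixed = patch_boundary_frames(signal)
short = patch_boundary_frames(signal[:514])  # too short: returned unchanged
```

With weights 1/3 and 2/3, the two patched frames land back on the straight line between their neighbours.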

src/model_downloader.py CHANGED

@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Download Dramabox models from HuggingFace.
 Models are cached locally after first download.
 Gemma text encoder is fetched separately from Google's repo.
@@ -13,16 +13,17 @@ from huggingface_hub import hf_hub_download, snapshot_download
 logger = logging.getLogger(__name__)
-DRAMABOX_REPO = "ResembleAI/Dramabox"
 GEMMA_REPO = "unsloth/gemma-3-12b-it-bnb-4bit"
 # Default cache directory
 DEFAULT_CACHE = os.environ.get(
-    "DRAMABOX_CACHE",
-    os.path.join(os.path.expanduser("~"), ".cache", "dramabox"),
 )
-# Model files in the HF repo (flat structure)
 MODEL_FILES = {
     "transformer": "dramabox-dit-v1.safetensors",
     "audio_components": "dramabox-audio-components.safetensors",
@@ -35,7 +36,7 @@ def get_model_path(name: str, cache_dir: str = None) -> str:
     Args:
         name: One of 'transformer', 'audio_components', 'silence_latent'
-        cache_dir: Local cache directory (default: ~/.cache/dramabox)
     Returns:
         Local file path
@@ -46,10 +47,10 @@ def get_model_path(name: str, cache_dir: str = None) -> str:
         raise ValueError(f"Unknown model: {name}. Choose from: {list(MODEL_FILES.keys())}")
     repo_path = MODEL_FILES[name]
-    logger.info(f"Fetching {name} from {DRAMABOX_REPO}/{repo_path}...")
     local_path = hf_hub_download(
-        repo_id=DRAMABOX_REPO,
         filename=repo_path,
         cache_dir=cache_dir,
         token=os.environ.get("HF_TOKEN"),

 #!/usr/bin/env python3
 """
+Download LTX-2.3-Voice models from HuggingFace.
 Models are cached locally after first download.
 Gemma text encoder is fetched separately from Google's repo.
 logger = logging.getLogger(__name__)
+LTX23_VOICE_REPO = "ResembleAI/LTX-2.3-Voice"
 GEMMA_REPO = "unsloth/gemma-3-12b-it-bnb-4bit"
 # Default cache directory
 DEFAULT_CACHE = os.environ.get(
+    "LTX23_VOICE_CACHE",
+    os.path.join(os.path.expanduser("~"), ".cache", "ltx-2.3-voice"),
 )
+# Model files in the HF repo (flat structure). The on-disk filenames stayed
+# `dramabox-*.safetensors` after the rebrand to avoid an 8 GB re-upload.
 MODEL_FILES = {
     "transformer": "dramabox-dit-v1.safetensors",
     "audio_components": "dramabox-audio-components.safetensors",
     Args:
         name: One of 'transformer', 'audio_components', 'silence_latent'
+        cache_dir: Local cache directory (default: ~/.cache/ltx-2.3-voice)
     Returns:
         Local file path
         raise ValueError(f"Unknown model: {name}. Choose from: {list(MODEL_FILES.keys())}")
     repo_path = MODEL_FILES[name]
+    logger.info(f"Fetching {name} from {LTX23_VOICE_REPO}/{repo_path}...")
     local_path = hf_hub_download(
+        repo_id=LTX23_VOICE_REPO,
         filename=repo_path,
         cache_dir=cache_dir,
         token=os.environ.get("HF_TOKEN"),

src/train.py CHANGED

@@ -372,42 +372,64 @@ def run_validation(lora_path, val_config_path, output_dir, step, lora_rank=128):
 # ─── Args ───
 def parse_args():
-    p = argparse.ArgumentParser(description="Audio-Only IC-LoRA Training for Voice Cloning")
-    p.add_argument("--data-dir", required=True, nargs="+")
-    p.add_argument("--speaker-index", required=True, nargs="+")
-    p.add_argument("--output-dir", default=os.path.join(MODEL_DIR, "tts_iclora_v1"))
-    p.add_argument("--checkpoint", default=os.path.join(MODEL_DIR, "ltx-2.3-audio-only.safetensors"))
-    p.add_argument("--full-checkpoint", default=os.path.join(MODEL_DIR, "ltx-2.3-22b-distilled.safetensors"))
-    p.add_argument("--base-model", choices=["distilled", "dev"], default="distilled",
                    help="Base model type: distilled uses DistilledTimestepSampler, dev uses ShiftedLogitNormal")
-    p.add_argument("--lora-rank", type=int, default=128)
-    p.add_argument("--lora-alpha", type=int, default=128)
-    p.add_argument("--lora-dropout", type=float, default=0.0,
                    help="Dropout applied to LoRA A/B matrices during training. "
                         "Recommended ~0.1 for small datasets to regularize.")
-    p.add_argument("--resume-lora", default=None)
-    p.add_argument("--resume-step-offset", type=int, default=None,
                    help="Step to add when naming saved checkpoints. If None, inferred "
                         "from --resume-lora filename (e.g. lora_step_10000.safetensors → 10000). "
                         "Set to 0 to start numbering at 0 regardless.")
-    p.add_argument("--ref-ratio", type=float, default=0.3,
                    help="Fraction of target length to use as reference (default 0.3)")
-    p.add_argument("--max-ref-tokens", type=int, default=200,
                    help="Maximum reference tokens after patchification (default 200)")
-    p.add_argument("--text-dropout", type=float, default=0.0,
                    help="Probability of dropping text conditioning (forces reliance on voice ref)")
-    p.add_argument("--steps", type=int, default=30000)
-    p.add_argument("--lr", type=float, default=3e-5)
-    p.add_argument("--lr-scheduler", choices=["cosine", "linear", "constant"], default="cosine")
-    p.add_argument("--batch-size", type=int, default=1)
-    p.add_argument("--grad-accum", type=int, default=4)
-    p.add_argument("--max-grad-norm", type=float, default=1.0)
-    p.add_argument("--save-every", type=int, default=1000)
-    p.add_argument("--log-every", type=int, default=50)
-    p.add_argument("--seed", type=int, default=42)
-    p.add_argument("--warmup-steps", type=int, default=100)
-    p.add_argument("--val-config", default=None)
-    return p.parse_args()
 # ─── Main ───

 # ─── Args ───
 def parse_args():
+    # First pass: pull out --config so its values can become argparse defaults.
+    cfg_parser = argparse.ArgumentParser(add_help=False)
+    cfg_parser.add_argument("--config", default=None,
+                            help="YAML file with default values for any of the flags below. "
+                                 "Explicit CLI flags still override the YAML.")
+    cfg_args, remaining = cfg_parser.parse_known_args()
+    yaml_defaults: dict = {}
+    if cfg_args.config:
+        import yaml as _yaml
+        with open(cfg_args.config) as f:
+            yaml_defaults = _yaml.safe_load(f) or {}
+        # YAML keys are dashes-or-underscores → normalize to argparse dest (underscore).
+        yaml_defaults = {k.replace("-", "_"): v for k, v in yaml_defaults.items()}
+    def _yaml(name, fallback):
+        return yaml_defaults.get(name, fallback)
+    p = argparse.ArgumentParser(
+        parents=[cfg_parser],
+        description="Audio-Only IC-LoRA Training for Voice Cloning",
+    )
+    p.add_argument("--data-dir", required="data_dir" not in yaml_defaults,
+                   nargs="+", default=_yaml("data_dir", None))
+    p.add_argument("--speaker-index", required="speaker_index" not in yaml_defaults,
+                   nargs="+", default=_yaml("speaker_index", None))
+    p.add_argument("--output-dir", default=_yaml("output_dir", os.path.join(MODEL_DIR, "tts_iclora_v1")))
+    p.add_argument("--checkpoint", default=_yaml("checkpoint", os.path.join(MODEL_DIR, "dramabox-dit-v1.safetensors")))
+    p.add_argument("--full-checkpoint", default=_yaml("full_checkpoint", os.path.join(MODEL_DIR, "dramabox-audio-components.safetensors")))
+    p.add_argument("--base-model", choices=["distilled", "dev"], default=_yaml("base_model", "dev"),
                    help="Base model type: distilled uses DistilledTimestepSampler, dev uses ShiftedLogitNormal")
+    p.add_argument("--lora-rank", type=int, default=_yaml("lora_rank", 128))
+    p.add_argument("--lora-alpha", type=int, default=_yaml("lora_alpha", 128))
+    p.add_argument("--lora-dropout", type=float, default=_yaml("lora_dropout", 0.0),
                    help="Dropout applied to LoRA A/B matrices during training. "
                         "Recommended ~0.1 for small datasets to regularize.")
+    p.add_argument("--resume-lora", default=_yaml("resume_lora", None))
+    p.add_argument("--resume-step-offset", type=int, default=_yaml("resume_step_offset", None),
                    help="Step to add when naming saved checkpoints. If None, inferred "
                         "from --resume-lora filename (e.g. lora_step_10000.safetensors → 10000). "
                         "Set to 0 to start numbering at 0 regardless.")
+    p.add_argument("--ref-ratio", type=float, default=_yaml("ref_ratio", 0.3),
                    help="Fraction of target length to use as reference (default 0.3)")
+    p.add_argument("--max-ref-tokens", type=int, default=_yaml("max_ref_tokens", 200),
                    help="Maximum reference tokens after patchification (default 200)")
+    p.add_argument("--text-dropout", type=float, default=_yaml("text_dropout", 0.0),
                    help="Probability of dropping text conditioning (forces reliance on voice ref)")
+    p.add_argument("--steps", type=int, default=_yaml("steps", 30000))
+    p.add_argument("--lr", type=float, default=_yaml("lr", 3e-5))
+    p.add_argument("--lr-scheduler", choices=["cosine", "linear", "constant"], default=_yaml("lr_scheduler", "cosine"))
+    p.add_argument("--batch-size", type=int, default=_yaml("batch_size", 1))
+    p.add_argument("--grad-accum", type=int, default=_yaml("grad_accum", 4))
+    p.add_argument("--max-grad-norm", type=float, default=_yaml("max_grad_norm", 1.0))
+    p.add_argument("--save-every", type=int, default=_yaml("save_every", 1000))
+    p.add_argument("--log-every", type=int, default=_yaml("log_every", 50))
+    p.add_argument("--seed", type=int, default=_yaml("seed", 42))
+    p.add_argument("--warmup-steps", type=int, default=_yaml("warmup_steps", 100))
+    p.add_argument("--val-config", default=_yaml("val_config", None))
+    return p.parse_args(remaining)
 # ─── Main ───
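
Reviewer note, not part of the commit: the new `parse_args` gives a three-level precedence (explicit CLI flag, then YAML value, then hard-coded fallback) by feeding YAML values in as argparse defaults. The reduced sketch below shows the same idea with only three flags; a plain dict stands in for the parsed YAML so the example needs nothing outside the standard library, and the two-argument signature is an illustration, not the signature used in `src/train.py`.

```python
import argparse


def parse_args(argv, file_defaults):
    """CLI flag > config-file value > built-in fallback."""
    # Normalize keys so `lora-rank` and `lora_rank` both map to the argparse dest.
    cfg = {k.replace("-", "_"): v for k, v in file_defaults.items()}
    p = argparse.ArgumentParser()
    p.add_argument("--data-dir", nargs="+", required="data_dir" not in cfg,
                   default=cfg.get("data_dir"))
    p.add_argument("--lr", type=float, default=cfg.get("lr", 3e-5))
    p.add_argument("--steps", type=int, default=cfg.get("steps", 30000))
    return p.parse_args(argv)


cfg = {"data-dir": ["/path/a"], "lr": 1e-4}      # stands in for the parsed YAML
from_file = parse_args([], cfg)                  # lr from file, steps from fallback
from_cli = parse_args(["--lr", "1e-5"], cfg)     # explicit flag wins
```

One thing to keep in mind with this pattern: argparse applies `type=` to a default only when that default is a string, so non-string values from the config file reach the namespace unconverted.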

src/validate.py CHANGED

@@ -29,7 +29,7 @@ DEV_FULL_CKPT = os.environ.get(
 )
 GEMMA_ROOT = os.environ.get(
     "GEMMA_ROOT",
-    os.path.expanduser("~/.cache/dramabox/gemma-3-12b-it-bnb-4bit"),
 )

 )
 GEMMA_ROOT = os.environ.get(
     "GEMMA_ROOT",
+    os.path.expanduser("~/.cache/ltx-2.3-voice/gemma-3-12b-it-bnb-4bit"),
 )