Manmay Nakhashi committed
Commit fdc2b0b · 1 Parent(s): 1636761

Revert: keep DramaBox naming (rebrand reverted per CEO)

README.md CHANGED
@@ -1,6 +1,6 @@
  ---
- title: LTX-2.3-Voice
- emoji: 🎙️
+ title: DramaBox
+ emoji: 🎭
  colorFrom: red
  colorTo: indigo
  sdk: gradio
@@ -9,19 +9,19 @@ app_file: app.py
  pinned: true
  license: other
  license_name: ltx-2-community
- license_link: https://huggingface.co/ResembleAI/LTX-2.3-Voice/blob/main/LICENSE
+ license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
  hf_oauth: false
- short_description: Expressive TTS with voice cloning — LTX-2.3-Voice demo
+ short_description: Expressive TTS with voice cloning — DramaBox demo
  ---

- # LTX-2.3-Voice — Expressive TTS with Voice Cloning
+ # DramaBox — Expressive TTS with Voice Cloning

  Prompt-driven TTS with voice cloning, built as an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only**. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses and transitions; an optional 10-second voice reference clones the target timbre.

  | | |
  |---|---|
- | 🤗 **Model** | [`ResembleAI/LTX-2.3-Voice`](https://huggingface.co/ResembleAI/LTX-2.3-Voice) |
- | 🎭 **Demo Space** | [`ResembleAI/LTX-2.3-Voice`](https://huggingface.co/spaces/ResembleAI/LTX-2.3-Voice) (ZeroGPU) |
+ | 🤗 **Model** | [`ResembleAI/Dramabox`](https://huggingface.co/ResembleAI/Dramabox) |
+ | 🎭 **Demo Space** | [`ResembleAI/Dramabox`](https://huggingface.co/spaces/ResembleAI/Dramabox) (ZeroGPU) |
  | 📜 **License** | LTX-2 Community License — see [`LICENSE`](LICENSE) |

  ## Models
@@ -111,9 +111,9 @@ print(detector.get_watermark(wav, sample_rate=sr)) # confidence ≈ 1.0

  Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable for debugging.

- ## Training a LoRA on top of LTX-2.3-Voice
+ ## Training a LoRA on top of DramaBox

- You can fine-tune your own LoRA using LTX-2.3-Voice itself as the base — no need to start from raw LTX-2.3. Useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.
+ You can fine-tune your own LoRA using DramaBox itself as the base — no need to start from raw LTX-2.3. Useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.

  ### 1. Prepare your index file

@@ -178,14 +178,14 @@ preprocessed/

  ### 3. Train

- Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the LTX-2.3-Voice files, then launch with HuggingFace `accelerate`. Any flag passed on the CLI overrides the YAML.
+ Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the DramaBox files, then launch with HuggingFace `accelerate`. Any flag passed on the CLI overrides the YAML.

  ```bash
  accelerate launch src/train.py \
  --config configs/training_args.example.yaml
  ```

- The trainer attaches a fresh LoRA to the audio branch on top of the LTX-2.3-Voice checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` × 48 transformer blocks (288 LoRA pairs total). Default rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.
+ The trainer attaches a fresh LoRA to the audio branch on top of the DramaBox checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` × 48 transformer blocks (288 LoRA pairs total). Default rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.

  To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML — `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.
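Review note on the README.md training paragraph: the stated total of 288 LoRA pairs follows from 6 target modules per block times 48 blocks. The sketch below checks that arithmetic and works through the described learning-rate schedule. The `blocks.{i}.` name prefix, the linear warmup shape, and the decay to exactly zero are assumptions made for illustration; the README only gives the module suffixes, the peak rate, the warmup length, and the total step count.

```python
import math

# Module suffixes quoted in the README.
ATTN_TARGETS = ["to_q", "to_k", "to_v", "to_out.0"]
FF_TARGETS = ["net.0.proj", "net.2"]
NUM_BLOCKS = 48


def lora_target_names(num_blocks: int = NUM_BLOCKS) -> list[str]:
    """Enumerate one name per LoRA-wrapped linear layer.

    The "blocks.{i}." prefix is hypothetical; only the suffixes come from the README.
    """
    names = []
    for i in range(num_blocks):
        names += [f"blocks.{i}.audio_attn1.{m}" for m in ATTN_TARGETS]
        names += [f"blocks.{i}.audio_ff.{m}" for m in FF_TARGETS]
    return names


def lr_at(step: int, base_lr: float = 1e-4, warmup: int = 500, total: int = 10_000) -> float:
    """Cosine schedule with warmup (linear warmup and zero floor are assumptions)."""
    if step < warmup:
        return base_lr * step / warmup
    progress = min((step - warmup) / (total - warmup), 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(len(lora_target_names()))  # 6 modules x 48 blocks
    for step in (0, 250, 500, 5_250, 10_000):
        print(step, lr_at(step))
```

With rank 128 and alpha 128 the LoRA scaling factor alpha / rank is 1.0, so the adapter output is added to the base layer without rescaling.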
 
app.py CHANGED
@@ -1,5 +1,5 @@
  #!/usr/bin/env python3
- """LTX-2.3-Voice — Gradio demo (warm server).
+ """DramaBox — Gradio demo (warm server).

  Loads the warm TTSServer once, then handles requests at ~2.5 s each. All
  generated audio is invisibly watermarked with Resemble Perth before being
@@ -21,14 +21,14 @@ from model_downloader import get_all_paths # noqa: E402


  logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
- logging.info("Fetching LTX-2.3-Voice checkpoints from HuggingFace (cached after first run)...")
+ logging.info("Fetching DramaBox checkpoints from HuggingFace (cached after first run)...")
  PATHS = get_all_paths()

  # Module-level warm load (same pattern as IndexTTS-2-Demo on ZeroGPU). The
  # `spaces` package patches torch so that .to("cuda") at import time pins the
  # weights into ZeroGPU's shared memory; each @spaces.GPU call then maps them
  # onto the actual GPU instantly. First user request is ~2.5 s instead of ~30 s.
- logging.info("Loading LTX-2.3-Voice warm server (Gemma + DiT + VAE + Decoder)...")
+ logging.info("Loading DramaBox warm server (Gemma + DiT + VAE + Decoder)...")
  tts = TTSServer(
  checkpoint=PATHS["transformer"],
  full_checkpoint=PATHS["audio_components"],
@@ -112,7 +112,7 @@ def on_generate(prompt: str, audio_ref, cfg: float, stg: float, dur_mult: float,
  raise gr.Error("Prompt is empty.")
  t0 = time.time()
  ref_path = audio_ref if audio_ref and os.path.exists(str(audio_ref)) else None
- output = tempfile.mktemp(suffix=".wav", prefix="ltx23voice_")
+ output = tempfile.mktemp(suffix=".wav", prefix="dramabox_")
  tts.generate_to_file(
  prompt=prompt,
  output=output,
@@ -127,12 +127,12 @@ def on_generate(prompt: str, audio_ref, cfg: float, stg: float, dur_mult: float,

  # ── UI ──────────────────────────────────────────────────────────────────────
  with gr.Blocks(
- title="LTX-2.3-Voice — Expressive TTS",
+ title="DramaBox — Expressive TTS",
  theme=gr.themes.Default(),
  css=".prompt-box textarea { font-size: 14px !important; line-height: 1.5 !important; }",
  analytics_enabled=False,
  ) as app:
- gr.Markdown("# 🎭 LTX-2.3-Voice — Expressive TTS with Voice Cloning")
+ gr.Markdown("# 🎭 DramaBox — Expressive TTS with Voice Cloning")
  gr.Markdown(
  "Write a scene prompt, optionally upload a 10-second voice reference, "
  "and generate. Audio is automatically watermarked with "
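
Review note on `on_generate` in app.py: both sides of the hunk build the output path with `tempfile.mktemp`, which the Python documentation marks as deprecated because another process can claim the returned name before the file is created. A minimal sketch of a race-free alternative using `tempfile.mkstemp` follows. The helper name `new_output_path` is hypothetical, and it assumes `tts.generate_to_file` is happy to overwrite an existing empty file, which the diff does not confirm.

```python
import os
import tempfile


def new_output_path(prefix: str = "dramabox_", suffix: str = ".wav") -> str:
    """Create an empty temp file atomically and return its path.

    mkstemp creates the file itself, so the name cannot be taken by
    another process between name generation and first write.
    """
    fd, path = tempfile.mkstemp(suffix=suffix, prefix=prefix)
    os.close(fd)  # the writer reopens the path; we only need the name
    return path


if __name__ == "__main__":
    out = new_output_path()
    print(out)
    os.remove(out)  # caller owns cleanup
```

The caller stays responsible for deleting the file once the audio has been served.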
configs/training_args.example.yaml CHANGED
@@ -1,4 +1,4 @@
- # LTX-2.3-Voice IC-LoRA training config — values become the defaults for
+ # DramaBox IC-LoRA training config — values become the defaults for
  # `accelerate launch src/train.py --config configs/training_args.example.yaml`.
  # Any flag explicitly passed on the CLI overrides the YAML.

@@ -19,7 +19,7 @@ speaker_index:
  output_dir: tts_iclora_v1

  # ── Base model ─────────────────────────────────────────────────────────────
- # Train your LoRA on top of LTX-2.3-Voice itself (recommended) — the trimmed audio
+ # Train your LoRA on top of DramaBox itself (recommended) — the trimmed audio
  # components are enough; no need to ship the raw LTX-2.3 base.
  checkpoint: dramabox-dit-v1.safetensors
  full_checkpoint: dramabox-audio-components.safetensors
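
Review note on the config header comment: it says YAML values become defaults and that explicit CLI flags override them. One common way to get that precedence with the standard library is `ArgumentParser.set_defaults`, sketched below. This is not taken from `src/train.py`, which the diff does not show; the function name and the two flags used are illustrative, and a plain dict stands in for the parsed YAML to keep the example dependency-free.

```python
import argparse


def parse_with_yaml_defaults(yaml_values: dict, argv: list[str]) -> argparse.Namespace:
    """Apply config-file values as defaults, then let CLI flags win."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", default="out")
    parser.add_argument("--checkpoint", default=None)
    # Parser-level defaults take precedence over per-argument defaults,
    # but not over values given explicitly on the command line.
    parser.set_defaults(**yaml_values)
    return parser.parse_args(argv)


if __name__ == "__main__":
    cfg = {
        "output_dir": "tts_iclora_v1",
        "checkpoint": "dramabox-dit-v1.safetensors",
    }
    args = parse_with_yaml_defaults(cfg, ["--output_dir", "run2"])
    print(args.output_dir, args.checkpoint)
```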
configs/val_config.example.yaml CHANGED
@@ -3,7 +3,7 @@
  #
  # Fields:
  # name — short tag used as the output filename
- # prompt — full LTX-2.3-Voice-style scene prompt
+ # prompt — full DramaBox-style scene prompt
  # reference — (optional) absolute path to a 10+ s voice reference clip;
  # omit for prompt-only generation

src/model_downloader.py CHANGED
@@ -1,6 +1,6 @@
  #!/usr/bin/env python3
  """
- Download LTX-2.3-Voice models from HuggingFace.
+ Download Dramabox models from HuggingFace.

  Models are cached locally after first download.
  Gemma text encoder is fetched separately from Google's repo.
@@ -13,17 +13,16 @@ from huggingface_hub import hf_hub_download, snapshot_download

  logger = logging.getLogger(__name__)

- LTX23_VOICE_REPO = "ResembleAI/LTX-2.3-Voice"
+ DRAMABOX_REPO = "ResembleAI/Dramabox"
  GEMMA_REPO = "unsloth/gemma-3-12b-it-bnb-4bit"

  # Default cache directory
  DEFAULT_CACHE = os.environ.get(
- "LTX23_VOICE_CACHE",
- os.path.join(os.path.expanduser("~"), ".cache", "ltx-2.3-voice"),
+ "DRAMABOX_CACHE",
+ os.path.join(os.path.expanduser("~"), ".cache", "dramabox"),
  )

- # Model files in the HF repo (flat structure). The on-disk filenames stayed
- # `dramabox-*.safetensors` after the rebrand to avoid a 8 GB re-upload.
+ # Model files in the HF repo (flat structure)
  MODEL_FILES = {
  "transformer": "dramabox-dit-v1.safetensors",
  "audio_components": "dramabox-audio-components.safetensors",
@@ -36,7 +35,7 @@ def get_model_path(name: str, cache_dir: str = None) -> str:

  Args:
  name: One of 'transformer', 'audio_components', 'silence_latent'
- cache_dir: Local cache directory (default: ~/.cache/ltx-2.3-voice)
+ cache_dir: Local cache directory (default: ~/.cache/dramabox)

  Returns:
  Local file path
@@ -47,10 +46,10 @@ def get_model_path(name: str, cache_dir: str = None) -> str:
  raise ValueError(f"Unknown model: {name}. Choose from: {list(MODEL_FILES.keys())}")

  repo_path = MODEL_FILES[name]
- logger.info(f"Fetching {name} from {LTX23_VOICE_REPO}/{repo_path}...")
+ logger.info(f"Fetching {name} from {DRAMABOX_REPO}/{repo_path}...")

  local_path = hf_hub_download(
- repo_id=LTX23_VOICE_REPO,
+ repo_id=DRAMABOX_REPO,
  filename=repo_path,
  cache_dir=cache_dir,
  token=os.environ.get("HF_TOKEN"),
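
Review note on `DEFAULT_CACHE` in src/model_downloader.py: the revert switches the override variable back from `LTX23_VOICE_CACHE` to `DRAMABOX_CACHE` and the default directory back to `~/.cache/dramabox`, so anything downloaded under the short-lived `~/.cache/ltx-2.3-voice` path would be fetched again. The sketch below restates the new lookup as a function so it can be exercised with a fake environment. The function name and the `env` parameter are illustrative and not part of the module.

```python
import os
from typing import Mapping, Optional


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Mirror the post-revert lookup: DRAMABOX_CACHE wins, else ~/.cache/dramabox."""
    env = os.environ if env is None else env
    default = os.path.join(os.path.expanduser("~"), ".cache", "dramabox")
    return env.get("DRAMABOX_CACHE", default)


if __name__ == "__main__":
    print(resolve_cache_dir({}))
    print(resolve_cache_dir({"DRAMABOX_CACHE": "/data/models"}))
    # The pre-revert variable name is no longer consulted:
    print(resolve_cache_dir({"LTX23_VOICE_CACHE": "/old/cache"}))
```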
src/validate.py CHANGED
@@ -29,7 +29,7 @@ DEV_FULL_CKPT = os.environ.get(
  )
  GEMMA_ROOT = os.environ.get(
  "GEMMA_ROOT",
- os.path.expanduser("~/.cache/ltx-2.3-voice/gemma-3-12b-it-bnb-4bit"),
+ os.path.expanduser("~/.cache/dramabox/gemma-3-12b-it-bnb-4bit"),
  )

