Spaces: ResembleAI/Dramabox


Manmay Nakhashi committed 26 days ago

Commit fdc2b0b (1 parent: 1636761)

Revert: keep DramaBox naming (rebrand reverted per CEO)


Files changed (6)

README.md +11 -11
app.py +6 -6
configs/training_args.example.yaml +2 -2
configs/val_config.example.yaml +1 -1
src/model_downloader.py +8 -9
src/validate.py +1 -1

README.md CHANGED

@@ -1,6 +1,6 @@
 ---
-title: LTX-2.3-Voice
-emoji: 🎙️
 colorFrom: red
 colorTo: indigo
 sdk: gradio
@@ -9,19 +9,19 @@ app_file: app.py
 pinned: true
 license: other
 license_name: ltx-2-community
-license_link: https://huggingface.co/ResembleAI/LTX-2.3-Voice/blob/main/LICENSE
 hf_oauth: false
-short_description: Expressive TTS with voice cloning — LTX-2.3-Voice demo
 ---
-# LTX-2.3-Voice — Expressive TTS with Voice Cloning
 Prompt-driven TTS with voice cloning, built as an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only**. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses and transitions; an optional 10-second voice reference clones the target timbre.
 | | |
 |---|---|
-| 🤗 **Model** | [`ResembleAI/LTX-2.3-Voice`](https://huggingface.co/ResembleAI/LTX-2.3-Voice) |
-| 🎭 **Demo Space** | [`ResembleAI/LTX-2.3-Voice`](https://huggingface.co/spaces/ResembleAI/LTX-2.3-Voice) (ZeroGPU) |
 | 📜 **License** | LTX-2 Community License — see [`LICENSE`](LICENSE) |
 ## Models
@@ -111,9 +111,9 @@ print(detector.get_watermark(wav, sample_rate=sr))   # confidence ≈ 1.0
 Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable for debugging.
-## Training a LoRA on top of LTX-2.3-Voice
-You can fine-tune your own LoRA using LTX-2.3-Voice itself as the base — no need to start from raw LTX-2.3. Useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.
 ### 1. Prepare your index file
@@ -178,14 +178,14 @@ preprocessed/
 ### 3. Train
-Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the LTX-2.3-Voice files, then launch with HuggingFace `accelerate`. Any flag passed on the CLI overrides the YAML.
 ```bash
 accelerate launch src/train.py \
   --config configs/training_args.example.yaml
 ```
-The trainer attaches a fresh LoRA to the audio branch on top of the LTX-2.3-Voice checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` × 48 transformer blocks (288 LoRA pairs total). Default rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.
 To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML — `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.

 ---
+title: DramaBox
+emoji: 🎭
 colorFrom: red
 colorTo: indigo
 sdk: gradio
 pinned: true
 license: other
 license_name: ltx-2-community
+license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
 hf_oauth: false
+short_description: Expressive TTS with voice cloning — DramaBox demo
 ---
+# DramaBox — Expressive TTS with Voice Cloning
 Prompt-driven TTS with voice cloning, built as an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only**. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses and transitions; an optional 10-second voice reference clones the target timbre.
 | | |
 |---|---|
+| 🤗 **Model** | [`ResembleAI/Dramabox`](https://huggingface.co/ResembleAI/Dramabox) |
+| 🎭 **Demo Space** | [`ResembleAI/Dramabox`](https://huggingface.co/spaces/ResembleAI/Dramabox) (ZeroGPU) |
 | 📜 **License** | LTX-2 Community License — see [`LICENSE`](LICENSE) |
 ## Models
 Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable for debugging.
+## Training a LoRA on top of DramaBox
+You can fine-tune your own LoRA using DramaBox itself as the base — no need to start from raw LTX-2.3. Useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.
 ### 1. Prepare your index file
 ### 3. Train
+Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the DramaBox files, then launch with HuggingFace `accelerate`. Any flag passed on the CLI overrides the YAML.
 ```bash
 accelerate launch src/train.py \
   --config configs/training_args.example.yaml
 ```
+The trainer attaches a fresh LoRA to the audio branch on top of the DramaBox checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` × 48 transformer blocks (288 LoRA pairs total). Default rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.
 To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML — `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.
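
Note on the README figures above: the "288 LoRA pairs" count and the learning-rate schedule can be checked with a few lines of Python. This is only a sketch. The target names and the numbers (48 blocks, 1e-4, 500 warmup steps, 10k steps) come from the README text; the linear warmup shape, decay to zero, and warmup being counted inside the 10k steps are assumptions, and `lora_pair_count` / `lr_at` are hypothetical helper names, not functions from `src/train.py`.

```python
import math

# Per-block LoRA targets as listed in the README.
ATTN_TARGETS = ["to_q", "to_k", "to_v", "to_out.0"]   # audio_attn1.*
FF_TARGETS = ["net.0.proj", "net.2"]                  # audio_ff.*
NUM_BLOCKS = 48


def lora_pair_count(num_blocks: int = NUM_BLOCKS) -> int:
    """One LoRA (A, B) pair per targeted module per transformer block."""
    return num_blocks * (len(ATTN_TARGETS) + len(FF_TARGETS))


def lr_at(step: int, base_lr: float = 1e-4, warmup: int = 500,
          total: int = 10_000) -> float:
    """Assumed schedule: linear warmup to base_lr, then cosine decay to 0."""
    if step < warmup:
        return base_lr * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(lora_pair_count())          # (4 + 2) targets x 48 blocks
    for s in (0, 250, 500, 5_250, 10_000):
        print(s, lr_at(s))
```

The pair count follows directly from the README: 4 attention projections plus 2 feed-forward projections per block, times 48 blocks.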

app.py CHANGED

@@ -1,5 +1,5 @@
 #!/usr/bin/env python3
-"""LTX-2.3-Voice — Gradio demo (warm server).
 Loads the warm TTSServer once, then handles requests at ~2.5 s each. All
 generated audio is invisibly watermarked with Resemble Perth before being
@@ -21,14 +21,14 @@ from model_downloader import get_all_paths  # noqa: E402
 logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
-logging.info("Fetching LTX-2.3-Voice checkpoints from HuggingFace (cached after first run)...")
 PATHS = get_all_paths()
 # Module-level warm load (same pattern as IndexTTS-2-Demo on ZeroGPU). The
 # `spaces` package patches torch so that .to("cuda") at import time pins the
 # weights into ZeroGPU's shared memory; each @spaces.GPU call then maps them
 # onto the actual GPU instantly. First user request is ~2.5 s instead of ~30 s.
-logging.info("Loading LTX-2.3-Voice warm server (Gemma + DiT + VAE + Decoder)...")
 tts = TTSServer(
     checkpoint=PATHS["transformer"],
     full_checkpoint=PATHS["audio_components"],
@@ -112,7 +112,7 @@ def on_generate(prompt: str, audio_ref, cfg: float, stg: float, dur_mult: float,
         raise gr.Error("Prompt is empty.")
     t0 = time.time()
     ref_path = audio_ref if audio_ref and os.path.exists(str(audio_ref)) else None
-    output = tempfile.mktemp(suffix=".wav", prefix="ltx23voice_")
     tts.generate_to_file(
         prompt=prompt,
         output=output,
@@ -127,12 +127,12 @@ def on_generate(prompt: str, audio_ref, cfg: float, stg: float, dur_mult: float,
 # ── UI ──────────────────────────────────────────────────────────────────────
 with gr.Blocks(
-    title="LTX-2.3-Voice — Expressive TTS",
     theme=gr.themes.Default(),
     css=".prompt-box textarea { font-size: 14px !important; line-height: 1.5 !important; }",
     analytics_enabled=False,
 ) as app:
-    gr.Markdown("# 🎭 LTX-2.3-Voice — Expressive TTS with Voice Cloning")
     gr.Markdown(
         "Write a scene prompt, optionally upload a 10-second voice reference, "
         "and generate. Audio is automatically watermarked with "

 #!/usr/bin/env python3
+"""DramaBox — Gradio demo (warm server).
 Loads the warm TTSServer once, then handles requests at ~2.5 s each. All
 generated audio is invisibly watermarked with Resemble Perth before being
 logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+logging.info("Fetching DramaBox checkpoints from HuggingFace (cached after first run)...")
 PATHS = get_all_paths()
 # Module-level warm load (same pattern as IndexTTS-2-Demo on ZeroGPU). The
 # `spaces` package patches torch so that .to("cuda") at import time pins the
 # weights into ZeroGPU's shared memory; each @spaces.GPU call then maps them
 # onto the actual GPU instantly. First user request is ~2.5 s instead of ~30 s.
+logging.info("Loading DramaBox warm server (Gemma + DiT + VAE + Decoder)...")
 tts = TTSServer(
     checkpoint=PATHS["transformer"],
     full_checkpoint=PATHS["audio_components"],
         raise gr.Error("Prompt is empty.")
     t0 = time.time()
     ref_path = audio_ref if audio_ref and os.path.exists(str(audio_ref)) else None
+    output = tempfile.mktemp(suffix=".wav", prefix="dramabox_")
     tts.generate_to_file(
         prompt=prompt,
         output=output,
 # ── UI ──────────────────────────────────────────────────────────────────────
 with gr.Blocks(
+    title="DramaBox — Expressive TTS",
     theme=gr.themes.Default(),
     css=".prompt-box textarea { font-size: 14px !important; line-height: 1.5 !important; }",
     analytics_enabled=False,
 ) as app:
+    gr.Markdown("# 🎭 DramaBox — Expressive TTS with Voice Cloning")
     gr.Markdown(
         "Write a scene prompt, optionally upload a 10-second voice reference, "
         "and generate. Audio is automatically watermarked with "
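
Note on the `output = tempfile.mktemp(...)` line touched above: the commit only changes the prefix, but `tempfile.mktemp` is documented as deprecated in the Python standard library because the returned name can be claimed by another process before the file is created. A possible alternative is sketched below, assuming `generate_to_file` is happy to overwrite an existing empty file (not confirmed by this diff). `new_output_path` is a hypothetical helper name.

```python
import os
import tempfile


def new_output_path(prefix: str = "dramabox_", suffix: str = ".wav") -> str:
    """Create the temp file atomically and return its path.

    mkstemp creates the file itself, so the name cannot be taken by
    another process between choosing it and writing to it.
    """
    fd, path = tempfile.mkstemp(suffix=suffix, prefix=prefix)
    os.close(fd)  # the writer reopens the file by path later
    return path


if __name__ == "__main__":
    out = new_output_path()
    print(out)
    os.remove(out)
```

The caller stays responsible for deleting the file once it has been served.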

configs/training_args.example.yaml CHANGED

@@ -1,4 +1,4 @@
-# LTX-2.3-Voice IC-LoRA training config — values become the defaults for
 # `accelerate launch src/train.py --config configs/training_args.example.yaml`.
 # Any flag explicitly passed on the CLI overrides the YAML.
@@ -19,7 +19,7 @@ speaker_index:
 output_dir: tts_iclora_v1
 # ── Base model ─────────────────────────────────────────────────────────────
-# Train your LoRA on top of LTX-2.3-Voice itself (recommended) — the trimmed audio
 # components are enough; no need to ship the raw LTX-2.3 base.
 checkpoint: dramabox-dit-v1.safetensors
 full_checkpoint: dramabox-audio-components.safetensors

+# DramaBox IC-LoRA training config — values become the defaults for
 # `accelerate launch src/train.py --config configs/training_args.example.yaml`.
 # Any flag explicitly passed on the CLI overrides the YAML.
 output_dir: tts_iclora_v1
 # ── Base model ─────────────────────────────────────────────────────────────
+# Train your LoRA on top of DramaBox itself (recommended) — the trimmed audio
 # components are enough; no need to ship the raw LTX-2.3 base.
 checkpoint: dramabox-dit-v1.safetensors
 full_checkpoint: dramabox-audio-components.safetensors

configs/val_config.example.yaml CHANGED

@@ -3,7 +3,7 @@
 #
 # Fields:
 #   name      — short tag used as the output filename
-#   prompt    — full LTX-2.3-Voice-style scene prompt
 #   reference — (optional) absolute path to a 10+ s voice reference clip;
 #               omit for prompt-only generation

 #
 # Fields:
 #   name      — short tag used as the output filename
+#   prompt    — full DramaBox-style scene prompt
 #   reference — (optional) absolute path to a 10+ s voice reference clip;
 #               omit for prompt-only generation

src/model_downloader.py CHANGED

@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Download LTX-2.3-Voice models from HuggingFace.
 Models are cached locally after first download.
 Gemma text encoder is fetched separately from Google's repo.
@@ -13,17 +13,16 @@ from huggingface_hub import hf_hub_download, snapshot_download
 logger = logging.getLogger(__name__)
-LTX23_VOICE_REPO = "ResembleAI/LTX-2.3-Voice"
 GEMMA_REPO = "unsloth/gemma-3-12b-it-bnb-4bit"
 # Default cache directory
 DEFAULT_CACHE = os.environ.get(
-    "LTX23_VOICE_CACHE",
-    os.path.join(os.path.expanduser("~"), ".cache", "ltx-2.3-voice"),
 )
-# Model files in the HF repo (flat structure). The on-disk filenames stayed
-# `dramabox-*.safetensors` after the rebrand to avoid a 8 GB re-upload.
 MODEL_FILES = {
     "transformer": "dramabox-dit-v1.safetensors",
     "audio_components": "dramabox-audio-components.safetensors",
@@ -36,7 +35,7 @@ def get_model_path(name: str, cache_dir: str = None) -> str:
     Args:
         name: One of 'transformer', 'audio_components', 'silence_latent'
-        cache_dir: Local cache directory (default: ~/.cache/ltx-2.3-voice)
     Returns:
         Local file path
@@ -47,10 +46,10 @@ def get_model_path(name: str, cache_dir: str = None) -> str:
         raise ValueError(f"Unknown model: {name}. Choose from: {list(MODEL_FILES.keys())}")
     repo_path = MODEL_FILES[name]
-    logger.info(f"Fetching {name} from {LTX23_VOICE_REPO}/{repo_path}...")
     local_path = hf_hub_download(
-        repo_id=LTX23_VOICE_REPO,
         filename=repo_path,
         cache_dir=cache_dir,
         token=os.environ.get("HF_TOKEN"),

 #!/usr/bin/env python3
 """
+Download Dramabox models from HuggingFace.
 Models are cached locally after first download.
 Gemma text encoder is fetched separately from Google's repo.
 logger = logging.getLogger(__name__)
+DRAMABOX_REPO = "ResembleAI/Dramabox"
 GEMMA_REPO = "unsloth/gemma-3-12b-it-bnb-4bit"
 # Default cache directory
 DEFAULT_CACHE = os.environ.get(
+    "DRAMABOX_CACHE",
+    os.path.join(os.path.expanduser("~"), ".cache", "dramabox"),
 )
+# Model files in the HF repo (flat structure)
 MODEL_FILES = {
     "transformer": "dramabox-dit-v1.safetensors",
     "audio_components": "dramabox-audio-components.safetensors",
     Args:
         name: One of 'transformer', 'audio_components', 'silence_latent'
+        cache_dir: Local cache directory (default: ~/.cache/dramabox)
     Returns:
         Local file path
         raise ValueError(f"Unknown model: {name}. Choose from: {list(MODEL_FILES.keys())}")
     repo_path = MODEL_FILES[name]
+    logger.info(f"Fetching {name} from {DRAMABOX_REPO}/{repo_path}...")
     local_path = hf_hub_download(
+        repo_id=DRAMABOX_REPO,
         filename=repo_path,
         cache_dir=cache_dir,
         token=os.environ.get("HF_TOKEN"),
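
Note on the cache change above: the environment variable and the default directory are both renamed, so a value set under the old name `LTX23_VOICE_CACHE` is no longer read. The lookup can be sketched as below. `resolve_cache_dir` is a hypothetical helper that mirrors the `DEFAULT_CACHE` expression in the diff; it is not a function in `src/model_downloader.py`. Taking the environment as a parameter is only there to make the behaviour easy to check.

```python
import os
from typing import Mapping, Optional


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Return DRAMABOX_CACHE if set, else ~/.cache/dramabox."""
    env = os.environ if env is None else env
    default = os.path.join(os.path.expanduser("~"), ".cache", "dramabox")
    return env.get("DRAMABOX_CACHE", default)


if __name__ == "__main__":
    # Explicit override wins.
    print(resolve_cache_dir({"DRAMABOX_CACHE": "/data/models"}))
    # The pre-revert variable name is ignored; the default is used.
    print(resolve_cache_dir({"LTX23_VOICE_CACHE": "/data/old"}))
```

Files already downloaded under `~/.cache/ltx-2.3-voice` would presumably not be picked up from the new default location unless `DRAMABOX_CACHE` is pointed at that directory.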

src/validate.py CHANGED

@@ -29,7 +29,7 @@ DEV_FULL_CKPT = os.environ.get(
 )
 GEMMA_ROOT = os.environ.get(
     "GEMMA_ROOT",
-    os.path.expanduser("~/.cache/ltx-2.3-voice/gemma-3-12b-it-bnb-4bit"),
 )

 )
 GEMMA_ROOT = os.environ.get(
     "GEMMA_ROOT",
+    os.path.expanduser("~/.cache/dramabox/gemma-3-12b-it-bnb-4bit"),
 )