Manmay Nakhashi committed on
Commit ·
fdc2b0b
1
Parent(s): 1636761
Revert: keep DramaBox naming (rebrand reverted per CEO)
- README.md +11 -11
- app.py +6 -6
- configs/training_args.example.yaml +2 -2
- configs/val_config.example.yaml +1 -1
- src/model_downloader.py +8 -9
- src/validate.py +1 -1
README.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
colorFrom: red
|
| 5 |
colorTo: indigo
|
| 6 |
sdk: gradio
|
|
@@ -9,19 +9,19 @@ app_file: app.py
|
|
| 9 |
pinned: true
|
| 10 |
license: other
|
| 11 |
license_name: ltx-2-community
|
| 12 |
-
license_link: https://huggingface.co/ResembleAI/
|
| 13 |
hf_oauth: false
|
| 14 |
-
short_description: Expressive TTS with voice cloning β
|
| 15 |
---
|
| 16 |
|
| 17 |
-
#
|
| 18 |
|
| 19 |
Prompt-driven TTS with voice cloning, built as an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only**. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses and transitions; an optional 10-second voice reference clones the target timbre.
|
| 20 |
|
| 21 |
| | |
|
| 22 |
|---|---|
|
| 23 |
-
| 🤗 **Model** | [`ResembleAI/
|
| 24 |
-
| π **Demo Space** | [`ResembleAI/
|
| 25 |
| π **License** | LTX-2 Community License - see [`LICENSE`](LICENSE) |
|
| 26 |
|
| 27 |
## Models
|
|
@@ -111,9 +111,9 @@ print(detector.get_watermark(wav, sample_rate=sr)) # confidence β 1.0
|
|
| 111 |
|
| 112 |
Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable for debugging.
|
| 113 |
|
| 114 |
-
## Training a LoRA on top of
|
| 115 |
|
| 116 |
-
You can fine-tune your own LoRA using
|
| 117 |
|
| 118 |
### 1. Prepare your index file
|
| 119 |
|
|
@@ -178,14 +178,14 @@ preprocessed/
|
|
| 178 |
|
| 179 |
### 3. Train
|
| 180 |
|
| 181 |
-
Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the
|
| 182 |
|
| 183 |
```bash
|
| 184 |
accelerate launch src/train.py \
|
| 185 |
--config configs/training_args.example.yaml
|
| 186 |
```
|
| 187 |
|
| 188 |
-
The trainer attaches a fresh LoRA to the audio branch on top of the
|
| 189 |
|
| 190 |
To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML - `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.
|
| 191 |
|
|
|
|
| 1 |
---
|
| 2 |
+
title: DramaBox
|
| 3 |
+
emoji: π
|
| 4 |
colorFrom: red
|
| 5 |
colorTo: indigo
|
| 6 |
sdk: gradio
|
|
|
|
| 9 |
pinned: true
|
| 10 |
license: other
|
| 11 |
license_name: ltx-2-community
|
| 12 |
+
license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
|
| 13 |
hf_oauth: false
|
| 14 |
+
short_description: Expressive TTS with voice cloning - DramaBox demo
|
| 15 |
---
|
| 16 |
|
| 17 |
+
# DramaBox - Expressive TTS with Voice Cloning
|
| 18 |
|
| 19 |
Prompt-driven TTS with voice cloning, built as an IC-LoRA fine-tune of the **LTX-2.3 3.3B audio-only**. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses and transitions; an optional 10-second voice reference clones the target timbre.
|
| 20 |
|
| 21 |
| | |
|
| 22 |
|---|---|
|
| 23 |
+
| 🤗 **Model** | [`ResembleAI/Dramabox`](https://huggingface.co/ResembleAI/Dramabox) |
|
| 24 |
+
| π **Demo Space** | [`ResembleAI/Dramabox`](https://huggingface.co/spaces/ResembleAI/Dramabox) (ZeroGPU) |
|
| 25 |
| π **License** | LTX-2 Community License - see [`LICENSE`](LICENSE) |
|
| 26 |
|
| 27 |
## Models
|
|
|
|
| 111 |
|
| 112 |
Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable for debugging.
|
| 113 |
|
| 114 |
+
## Training a LoRA on top of DramaBox
|
| 115 |
|
| 116 |
+
You can fine-tune your own LoRA using DramaBox itself as the base - no need to start from raw LTX-2.3. Useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.
|
| 117 |
|
| 118 |
### 1. Prepare your index file
|
| 119 |
|
|
|
|
| 178 |
|
| 179 |
### 3. Train
|
| 180 |
|
| 181 |
+
Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the DramaBox files, then launch with HuggingFace `accelerate`. Any flag passed on the CLI overrides the YAML.
|
| 182 |
|
| 183 |
```bash
|
| 184 |
accelerate launch src/train.py \
|
| 185 |
--config configs/training_args.example.yaml
|
| 186 |
```
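The override rule mentioned for this step (values from the YAML act as defaults, and any flag given on the CLI wins) can be sketched with `argparse`. This is a minimal illustration, not the actual `src/train.py`: the flag names `--learning_rate` and `--max_steps`, and the helper `parse_with_config`, are hypothetical, and a plain dict stands in for the loaded YAML file.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flags; the real trainer may define different ones.
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default=None)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--max_steps", type=int, default=10_000)
    return parser


def parse_with_config(argv, config_values):
    """Parse argv, using config_values (a stand-in for the YAML dict) as defaults."""
    parser = build_parser()
    # Values from the config file replace the built-in defaults...
    parser.set_defaults(**config_values)
    # ...while anything passed explicitly on the command line still wins.
    return parser.parse_args(argv)


yaml_values = {"learning_rate": 5e-5, "max_steps": 2_000}
args = parse_with_config(["--max_steps", "500"], yaml_values)
print(args.learning_rate, args.max_steps)
```

Here `learning_rate` comes from the config dict and `max_steps` from the command line, which is the precedence the text describes.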
|
| 187 |
|
| 188 |
+
The trainer attaches a fresh LoRA to the audio branch on top of the DramaBox checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` × 48 transformer blocks (288 LoRA pairs total). Default rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.
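The numbers in this paragraph can be checked with a short sketch: six target modules per block across 48 blocks gives 288 LoRA pairs. The module prefix `transformer_blocks` is a hypothetical name chosen for illustration. The schedule function assumes a linear warmup and a cosine decay to zero; the text only says "cosine" with a 500-step warmup, so the exact shape used by the trainer may differ.

```python
import math

ATTN_TARGETS = ["to_q", "to_k", "to_v", "to_out.0"]
FF_TARGETS = ["net.0.proj", "net.2"]
NUM_BLOCKS = 48


def lora_target_names(prefix="transformer_blocks"):
    """List the module names a LoRA would attach to (prefix is hypothetical)."""
    names = []
    for i in range(NUM_BLOCKS):
        names += [f"{prefix}.{i}.audio_attn1.{t}" for t in ATTN_TARGETS]
        names += [f"{prefix}.{i}.audio_ff.{t}" for t in FF_TARGETS]
    return names


def lr_at(step, base_lr=1e-4, warmup=500, total=10_000):
    """Learning rate at a step: linear warmup (assumed), then cosine decay to 0 (assumed)."""
    if step < warmup:
        return base_lr * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(len(lora_target_names()))
print(lr_at(250), lr_at(500), lr_at(10_000))
```

The peak rate is reached at the end of warmup (step 500) and the rate falls to zero at step 10,000 under these assumptions.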
|
| 189 |
|
| 190 |
To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML - `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.
|
| 191 |
|
app.py
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
-
"""
|
| 3 |
|
| 4 |
Loads the warm TTSServer once, then handles requests at ~2.5 s each. All
|
| 5 |
generated audio is invisibly watermarked with Resemble Perth before being
|
|
@@ -21,14 +21,14 @@ from model_downloader import get_all_paths # noqa: E402
|
|
| 21 |
|
| 22 |
|
| 23 |
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
|
| 24 |
-
logging.info("Fetching
|
| 25 |
PATHS = get_all_paths()
|
| 26 |
|
| 27 |
# Module-level warm load (same pattern as IndexTTS-2-Demo on ZeroGPU). The
|
| 28 |
# `spaces` package patches torch so that .to("cuda") at import time pins the
|
| 29 |
# weights into ZeroGPU's shared memory; each @spaces.GPU call then maps them
|
| 30 |
# onto the actual GPU instantly. First user request is ~2.5 s instead of ~30 s.
|
| 31 |
-
logging.info("Loading
|
| 32 |
tts = TTSServer(
|
| 33 |
checkpoint=PATHS["transformer"],
|
| 34 |
full_checkpoint=PATHS["audio_components"],
|
|
@@ -112,7 +112,7 @@ def on_generate(prompt: str, audio_ref, cfg: float, stg: float, dur_mult: float,
|
|
| 112 |
raise gr.Error("Prompt is empty.")
|
| 113 |
t0 = time.time()
|
| 114 |
ref_path = audio_ref if audio_ref and os.path.exists(str(audio_ref)) else None
|
| 115 |
-
output = tempfile.mktemp(suffix=".wav", prefix="
|
| 116 |
tts.generate_to_file(
|
| 117 |
prompt=prompt,
|
| 118 |
output=output,
|
|
@@ -127,12 +127,12 @@ def on_generate(prompt: str, audio_ref, cfg: float, stg: float, dur_mult: float,
|
|
| 127 |
|
| 128 |
# ── UI ──────────────────────────────────────────────────────────────────────
|
| 129 |
with gr.Blocks(
|
| 130 |
-
title="
|
| 131 |
theme=gr.themes.Default(),
|
| 132 |
css=".prompt-box textarea { font-size: 14px !important; line-height: 1.5 !important; }",
|
| 133 |
analytics_enabled=False,
|
| 134 |
) as app:
|
| 135 |
-
gr.Markdown("# π
|
| 136 |
gr.Markdown(
|
| 137 |
"Write a scene prompt, optionally upload a 10-second voice reference, "
|
| 138 |
"and generate. Audio is automatically watermarked with "
|
|
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
+
"""DramaBox - Gradio demo (warm server).
|
| 3 |
|
| 4 |
Loads the warm TTSServer once, then handles requests at ~2.5 s each. All
|
| 5 |
generated audio is invisibly watermarked with Resemble Perth before being
|
|
|
|
| 21 |
|
| 22 |
|
| 23 |
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
|
| 24 |
+
logging.info("Fetching DramaBox checkpoints from HuggingFace (cached after first run)...")
|
| 25 |
PATHS = get_all_paths()
|
| 26 |
|
| 27 |
# Module-level warm load (same pattern as IndexTTS-2-Demo on ZeroGPU). The
|
| 28 |
# `spaces` package patches torch so that .to("cuda") at import time pins the
|
| 29 |
# weights into ZeroGPU's shared memory; each @spaces.GPU call then maps them
|
| 30 |
# onto the actual GPU instantly. First user request is ~2.5 s instead of ~30 s.
|
| 31 |
+
logging.info("Loading DramaBox warm server (Gemma + DiT + VAE + Decoder)...")
|
| 32 |
tts = TTSServer(
|
| 33 |
checkpoint=PATHS["transformer"],
|
| 34 |
full_checkpoint=PATHS["audio_components"],
|
|
|
|
| 112 |
raise gr.Error("Prompt is empty.")
|
| 113 |
t0 = time.time()
|
| 114 |
ref_path = audio_ref if audio_ref and os.path.exists(str(audio_ref)) else None
|
| 115 |
+
output = tempfile.mktemp(suffix=".wav", prefix="dramabox_")
|
| 116 |
tts.generate_to_file(
|
| 117 |
prompt=prompt,
|
| 118 |
output=output,
|
|
|
|
| 127 |
|
| 128 |
# ── UI ──────────────────────────────────────────────────────────────────────
|
| 129 |
with gr.Blocks(
|
| 130 |
+
title="DramaBox - Expressive TTS",
|
| 131 |
theme=gr.themes.Default(),
|
| 132 |
css=".prompt-box textarea { font-size: 14px !important; line-height: 1.5 !important; }",
|
| 133 |
analytics_enabled=False,
|
| 134 |
) as app:
|
| 135 |
+
gr.Markdown("# π DramaBox - Expressive TTS with Voice Cloning")
|
| 136 |
gr.Markdown(
|
| 137 |
"Write a scene prompt, optionally upload a 10-second voice reference, "
|
| 138 |
"and generate. Audio is automatically watermarked with "
|
configs/training_args.example.yaml
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
#
|
| 2 |
# `accelerate launch src/train.py --config configs/training_args.example.yaml`.
|
| 3 |
# Any flag explicitly passed on the CLI overrides the YAML.
|
| 4 |
|
|
@@ -19,7 +19,7 @@ speaker_index:
|
|
| 19 |
output_dir: tts_iclora_v1
|
| 20 |
|
| 21 |
# ── Base model ─────────────────────────────────────────────────────────────
|
| 22 |
-
# Train your LoRA on top of
|
| 23 |
# components are enough; no need to ship the raw LTX-2.3 base.
|
| 24 |
checkpoint: dramabox-dit-v1.safetensors
|
| 25 |
full_checkpoint: dramabox-audio-components.safetensors
|
|
|
|
| 1 |
+
# DramaBox IC-LoRA training config - values become the defaults for
|
| 2 |
# `accelerate launch src/train.py --config configs/training_args.example.yaml`.
|
| 3 |
# Any flag explicitly passed on the CLI overrides the YAML.
|
| 4 |
|
|
|
|
| 19 |
output_dir: tts_iclora_v1
|
| 20 |
|
| 21 |
# ── Base model ─────────────────────────────────────────────────────────────
|
| 22 |
+
# Train your LoRA on top of DramaBox itself (recommended) - the trimmed audio
|
| 23 |
# components are enough; no need to ship the raw LTX-2.3 base.
|
| 24 |
checkpoint: dramabox-dit-v1.safetensors
|
| 25 |
full_checkpoint: dramabox-audio-components.safetensors
|
configs/val_config.example.yaml
CHANGED
|
@@ -3,7 +3,7 @@
|
|
| 3 |
#
|
| 4 |
# Fields:
|
| 5 |
# name - short tag used as the output filename
|
| 6 |
-
# prompt - full
|
| 7 |
# reference β (optional) absolute path to a 10+ s voice reference clip;
|
| 8 |
# omit for prompt-only generation
|
| 9 |
|
|
|
|
| 3 |
#
|
| 4 |
# Fields:
|
| 5 |
# name - short tag used as the output filename
|
| 6 |
+
# prompt - full DramaBox-style scene prompt
|
| 7 |
# reference β (optional) absolute path to a 10+ s voice reference clip;
|
| 8 |
# omit for prompt-only generation
|
| 9 |
|
src/model_downloader.py
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
-
Download
|
| 4 |
|
| 5 |
Models are cached locally after first download.
|
| 6 |
Gemma text encoder is fetched separately from Google's repo.
|
|
@@ -13,17 +13,16 @@ from huggingface_hub import hf_hub_download, snapshot_download
|
|
| 13 |
|
| 14 |
logger = logging.getLogger(__name__)
|
| 15 |
|
| 16 |
-
|
| 17 |
GEMMA_REPO = "unsloth/gemma-3-12b-it-bnb-4bit"
|
| 18 |
|
| 19 |
# Default cache directory
|
| 20 |
DEFAULT_CACHE = os.environ.get(
|
| 21 |
-
"
|
| 22 |
-
os.path.join(os.path.expanduser("~"), ".cache", "
|
| 23 |
)
|
| 24 |
|
| 25 |
-
# Model files in the HF repo (flat structure)
|
| 26 |
-
# `dramabox-*.safetensors` after the rebrand to avoid a 8 GB re-upload.
|
| 27 |
MODEL_FILES = {
|
| 28 |
"transformer": "dramabox-dit-v1.safetensors",
|
| 29 |
"audio_components": "dramabox-audio-components.safetensors",
|
|
@@ -36,7 +35,7 @@ def get_model_path(name: str, cache_dir: str = None) -> str:
|
|
| 36 |
|
| 37 |
Args:
|
| 38 |
name: One of 'transformer', 'audio_components', 'silence_latent'
|
| 39 |
-
cache_dir: Local cache directory (default: ~/.cache/
|
| 40 |
|
| 41 |
Returns:
|
| 42 |
Local file path
|
|
@@ -47,10 +46,10 @@ def get_model_path(name: str, cache_dir: str = None) -> str:
|
|
| 47 |
raise ValueError(f"Unknown model: {name}. Choose from: {list(MODEL_FILES.keys())}")
|
| 48 |
|
| 49 |
repo_path = MODEL_FILES[name]
|
| 50 |
-
logger.info(f"Fetching {name} from {
|
| 51 |
|
| 52 |
local_path = hf_hub_download(
|
| 53 |
-
repo_id=
|
| 54 |
filename=repo_path,
|
| 55 |
cache_dir=cache_dir,
|
| 56 |
token=os.environ.get("HF_TOKEN"),
|
|
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
+
Download Dramabox models from HuggingFace.
|
| 4 |
|
| 5 |
Models are cached locally after first download.
|
| 6 |
Gemma text encoder is fetched separately from Google's repo.
|
|
|
|
| 13 |
|
| 14 |
logger = logging.getLogger(__name__)
|
| 15 |
|
| 16 |
+
DRAMABOX_REPO = "ResembleAI/Dramabox"
|
| 17 |
GEMMA_REPO = "unsloth/gemma-3-12b-it-bnb-4bit"
|
| 18 |
|
| 19 |
# Default cache directory
|
| 20 |
DEFAULT_CACHE = os.environ.get(
|
| 21 |
+
"DRAMABOX_CACHE",
|
| 22 |
+
os.path.join(os.path.expanduser("~"), ".cache", "dramabox"),
|
| 23 |
)
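The cache-directory default above follows a common pattern: an environment variable, when set, takes precedence over a path under the user's home directory. A minimal sketch of that lookup, assuming a hypothetical helper name `resolve_cache_dir` and an injectable mapping so it can be exercised without touching the real environment:

```python
import os


def resolve_cache_dir(env=None):
    """Return DRAMABOX_CACHE if set, else ~/.cache/dramabox (helper name is hypothetical)."""
    env = os.environ if env is None else env
    default = os.path.join(os.path.expanduser("~"), ".cache", "dramabox")
    return env.get("DRAMABOX_CACHE", default)


# An explicit override wins; otherwise the home-directory default is used.
print(resolve_cache_dir({"DRAMABOX_CACHE": "/tmp/dramabox-cache"}))
print(resolve_cache_dir({}))
```

Setting `DRAMABOX_CACHE` before launching is then enough to redirect downloads to a different disk.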
|
| 24 |
|
| 25 |
+
# Model files in the HF repo (flat structure)
|
|
|
|
| 26 |
MODEL_FILES = {
|
| 27 |
"transformer": "dramabox-dit-v1.safetensors",
|
| 28 |
"audio_components": "dramabox-audio-components.safetensors",
|
|
|
|
| 35 |
|
| 36 |
Args:
|
| 37 |
name: One of 'transformer', 'audio_components', 'silence_latent'
|
| 38 |
+
cache_dir: Local cache directory (default: ~/.cache/dramabox)
|
| 39 |
|
| 40 |
Returns:
|
| 41 |
Local file path
|
|
|
|
| 46 |
raise ValueError(f"Unknown model: {name}. Choose from: {list(MODEL_FILES.keys())}")
|
| 47 |
|
| 48 |
repo_path = MODEL_FILES[name]
|
| 49 |
+
logger.info(f"Fetching {name} from {DRAMABOX_REPO}/{repo_path}...")
|
| 50 |
|
| 51 |
local_path = hf_hub_download(
|
| 52 |
+
repo_id=DRAMABOX_REPO,
|
| 53 |
filename=repo_path,
|
| 54 |
cache_dir=cache_dir,
|
| 55 |
token=os.environ.get("HF_TOKEN"),
|
src/validate.py
CHANGED
|
@@ -29,7 +29,7 @@ DEV_FULL_CKPT = os.environ.get(
|
|
| 29 |
)
|
| 30 |
GEMMA_ROOT = os.environ.get(
|
| 31 |
"GEMMA_ROOT",
|
| 32 |
-
os.path.expanduser("~/.cache/
|
| 33 |
)
|
| 34 |
|
| 35 |
|
|
|
|
| 29 |
)
|
| 30 |
GEMMA_ROOT = os.environ.get(
|
| 31 |
"GEMMA_ROOT",
|
| 32 |
+
os.path.expanduser("~/.cache/dramabox/gemma-3-12b-it-bnb-4bit"),
|
| 33 |
)
|
| 34 |
|
| 35 |
|