Restore Space README with sdk frontmatter
README.md
CHANGED
|
@@ -1,162 +1,42 @@
|
|
| 1 |
-
|
| 2 |
|
| 3 |
-
|
| 4 |
|
| 5 |
-
|
| 6 |
|
| 7 |
-
|
| 8 |
-
DramaBox/
|
| 9 |
-
├── src/
|
| 10 |
-
│   ├── inference.py           # TTS inference with voice cloning
|
| 11 |
-
│   ├── inference_server.py    # Warm server (~2.5s per generation)
|
| 12 |
-
│   ├── audio_conditioning.py  # Reference audio conditioning
|
| 13 |
-
│   └── model_downloader.py    # Auto-download models from HuggingFace
|
| 14 |
-
├── patches/
|
| 15 |
-
│   ├── attention.py           # dtype fix for mask allocation
|
| 16 |
-
│   └── guiders.py             # Per-token CFG clamping
|
| 17 |
-
├── assets/
|
| 18 |
-
│   └── silence_latent_frame.pt
|
| 19 |
-
├── evals/
|
| 20 |
-
│   ├── eval_short.txt         # 30 short prompts (~5-15s)
|
| 21 |
-
│   ├── eval_long.txt          # 15 long prompts (~20-37s)
|
| 22 |
-
│   └── eval_expressive.txt    # 15 expressive prompts (laughs, sighs, stammers)
|
| 23 |
-
├── scripts/
|
| 24 |
-
│   ├── inference.sh           # Inference wrapper
|
| 25 |
-
│   └── eval.sh                # Evaluation runner
|
| 26 |
-
├── app.py                     # Gradio demo app
|
| 27 |
-
├── ltx2/                      # LTX-2 dependency packages
|
| 28 |
-
└── README.md
|
| 29 |
-
```
|
| 30 |
-
|
| 31 |
-
## Models
|
| 32 |
-
|
| 33 |
-
Models auto-download from [ResembleAI/Dramabox](https://huggingface.co/ResembleAI/Dramabox) on HuggingFace.
|
| 34 |
-
|
| 35 |
-
| Model | Size | Description |
|
| 36 |
-
|-------|------|-------------|
|
| 37 |
-
| `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer |
|
| 38 |
-
| `dramabox-audio-components.safetensors` | 2.7 GB | Audio VAE + vocoder + text projection |
|
| 39 |
-
| [unsloth/gemma-3-12b-it-bnb-4bit](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) | ~8 GB | Text encoder (auto-downloaded) |
|
| 40 |
-
|
| 41 |
-
**VRAM**: ~24 GB peak | **Speed**: ~2.5s per generation (warm server, H100)
|
| 42 |
-
|
| 43 |
-
## Quick Start
|
| 44 |
-
|
| 45 |
-
### Warm Server (recommended, ~2.5s per request)
|
| 46 |
-
|
| 47 |
-
```python
|
| 48 |
-
from src.inference_server import TTSServer
|
| 49 |
-
|
| 50 |
-
server = TTSServer(device="cuda")
|
| 51 |
-
|
| 52 |
-
server.generate_to_file(
|
| 53 |
-
    prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
|
| 54 |
-
    output="output.wav",
|
| 55 |
-
    voice_ref="reference.wav",  # optional, 10+ seconds
|
| 56 |
-
)
|
| 57 |
-
```
|
| 58 |
-
|
| 59 |
-
### Gradio App
|
| 60 |
-
|
| 61 |
-
```bash
|
| 62 |
-
GEMINI_API_KEY=your_key CUDA_VISIBLE_DEVICES=4 python app.py
|
| 63 |
-
```
|
| 64 |
-
|
| 65 |
-
### CLI Inference
|
| 66 |
-
|
| 67 |
-
```bash
|
| 68 |
-
python src/inference.py \
|
| 69 |
-
--voice-sample reference.wav \
|
| 70 |
-
--prompt 'A woman speaks warmly, "Hello, how are you today?"' \
|
| 71 |
-
--output output.wav \
|
| 72 |
-
--cfg-scale 2.5 --stg-scale 1.5
|
| 73 |
-
```
|
| 74 |
-
|
| 75 |
-
### Evaluation
|
| 76 |
|
| 77 |
-
|
| 78 |
-
bash scripts/eval.sh --eval expressive --output eval_results/
|
| 79 |
-
```
|
| 80 |
-
|
| 81 |
-
## Inference Settings
|
| 82 |
-
|
| 83 |
-
| Parameter | Default | Notes |
|
| 84 |
-
|-----------|---------|-------|
|
| 85 |
-
| cfg-scale | 2.5 | Lower = more natural, higher = more text following |
|
| 86 |
-
| stg-scale | 1.5 | Skip-token guidance |
|
| 87 |
-
| rescale | 0 | No rescaling |
|
| 88 |
-
| modality | 1 | No modality guidance |
|
| 89 |
-
| duration-multiplier | 1.1 | 10% breathing room |
|
| 90 |
-
| steps | 30 | Euler flow matching |
|
| 91 |
-
|
| 92 |
-
## Prompt Writing Guide
|
| 93 |
-
|
| 94 |
-
**Structure:** `<speaker description>, "<dialogue>" <action direction> "<more dialogue>"`
|
| 95 |
-
|
| 96 |
-
### What works inside quotes (model produces actual sounds)
|
| 97 |
-
- Laughs: `"Hahaha"` `"Hehehe"` (always one word, never separated)
|
| 98 |
-
- Sounds: `"Mmmmm"` `"Ugh"` `"Argh"` `"Ahhh"` `"Hmm"`
|
| 99 |
-
|
| 100 |
-
### What goes outside quotes (stage directions)
|
| 101 |
-
- `She sighs deeply.` `He gulps nervously.` `A long pause.`
|
| 102 |
-
- `Her voice cracks.` `He clears his throat.` `She scoffs.`
|
| 103 |
-
|
| 104 |
-
### Never inside quotes (model speaks them literally)
|
| 105 |
-
- Ahem, Pfft, Sigh, Gasp, Cough
|
| 106 |
-
|
| 107 |
-
### Tips
|
| 108 |
-
- Match gender/age in speaker description to voice reference
|
| 109 |
-
- Break long dialogue into segments with acting directions between them
|
| 110 |
-
- End prompt at the last closing quote mark (no trailing descriptions)
|
| 111 |
-
|
| 112 |
-
## Watermarking
|
| 113 |
|
| 114 |
-
Every audio output from `inference.py` and `inference_server.TTSServer.generate_to_file` is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth), an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
|
| 115 |
-
|
| 116 |
-
```python
|
| 117 |
-
import perth, librosa
|
| 118 |
-
wav, sr = librosa.load("output.wav", sr=None, mono=True)
|
| 119 |
-
detector = perth.PerthImplicitWatermarker()
|
| 120 |
-
print(detector.get_watermark(wav, sample_rate=sr))  # confidence ≈ 1.0 for our outputs
|
| 121 |
```
|
| 122 |
-
|
| 123 |
-
Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable for debugging.
|
| 124 |
-
|
| 125 |
-
## Training
|
| 126 |
-
|
| 127 |
-
DramaBox is an IC-LoRA fine-tune of the LTX-2.3 22B audio-only branch. To train your own:
|
| 128 |
-
|
| 129 |
-
```bash
|
| 130 |
-
# 1. Preprocess raw (audio, transcript) pairs → audio_latents/ + conditions/
|
| 131 |
-
python src/preprocess.py \
|
| 132 |
-
--dataset-type manifest \
|
| 133 |
-
--index your_data.jsonl \
|
| 134 |
-
--output-dir /path/to/preprocessed/ \
|
| 135 |
-
--checkpoint dramabox-audio-components.safetensors \
|
| 136 |
-
--gemma-root /path/to/gemma-3-12b-it-bnb-4bit/
|
| 137 |
-
|
| 138 |
-
# 2. Edit configs/training_args.example.yaml to point at your data paths
|
| 139 |
-
|
| 140 |
-
# 3. Launch (uses HuggingFace accelerate)
|
| 141 |
-
bash scripts/train.sh \
|
| 142 |
-
--config configs/training_args.example.yaml \
|
| 143 |
-
--gpus 0,1,2,3,4,5,6 \
|
| 144 |
-
--train-val-gpu 7
|
| 145 |
```
|
| 146 |
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
| `src/train.py` | IC-LoRA training loop with peft, accelerate multi-GPU, periodic validation |
|
| 151 |
-
| `src/validate.py` | Spawned by `train.py` at each save step; runs the warm validator on a held-out prompt set |
|
| 152 |
-
| `scripts/train.sh` | YAML-config wrapper around `accelerate launch src/train.py` |
|
| 153 |
-
|
| 154 |
-
LoRA targets the audio branch only: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` × 48 transformer blocks (288 LoRA pairs total). Default rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.
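
The arithmetic behind the 288 LoRA pairs and the shape of the learning-rate schedule can be checked with a short sketch. It is illustrative only and is not taken from `src/train.py`: the `transformer_blocks.{i}` prefix, the function names, the linear warmup shape, and the decay to zero at step 10k are all assumptions.

```python
import math

# Projection names as listed above; six targets per transformer block.
ATTN_PROJECTIONS = ["to_q", "to_k", "to_v", "to_out.0"]
FF_PROJECTIONS = ["net.0.proj", "net.2"]
NUM_BLOCKS = 48


def lora_target_modules(num_blocks: int = NUM_BLOCKS) -> list[str]:
    """Enumerate the audio-branch modules that would receive a LoRA pair."""
    names = []
    for i in range(num_blocks):
        prefix = f"transformer_blocks.{i}"  # hypothetical module prefix
        names += [f"{prefix}.audio_attn1.{p}" for p in ATTN_PROJECTIONS]
        names += [f"{prefix}.audio_ff.{p}" for p in FF_PROJECTIONS]
    return names


def lr_at(step: int, peak: float = 1e-4, warmup: int = 500, total: int = 10_000) -> float:
    """Assumed schedule: linear warmup to `peak`, then cosine decay to zero."""
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(len(lora_target_modules()))  # 6 targets x 48 blocks
    for step in (0, 250, 500, 5_000, 10_000):
        print(step, lr_at(step))
```

Whether the 500 warmup steps count toward the 10k total is not stated above; the sketch assumes they do.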
|
| 155 |
-
|
| 156 |
-
## Language
|
| 157 |
|
| 158 |
-
|
| 159 |
|
| 160 |
-
##
|
| 161 |
|
| 162 |
-
|
| 1 |
+
---
|
| 2 |
+
title: DramaBox
|
| 3 |
+
emoji: π
|
| 4 |
+
colorFrom: red
|
| 5 |
+
colorTo: indigo
|
| 6 |
+
sdk: gradio
|
| 7 |
+
sdk_version: 4.44.1
|
| 8 |
+
app_file: app.py
|
| 9 |
+
pinned: true
|
| 10 |
+
license: other
|
| 11 |
+
license_name: ltx-2-community
|
| 12 |
+
license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
|
| 13 |
+
hardware: l40s
|
| 14 |
+
short_description: Expressive TTS with voice cloning - DramaBox demo
|
| 15 |
+
---
|
| 16 |
|
| 17 |
+
# DramaBox: Expressive TTS Demo
|
| 18 |
|
| 19 |
+
Live demo of [`ResembleAI/Dramabox`](https://huggingface.co/ResembleAI/Dramabox). Write a scene prompt, optionally upload a 10-second voice reference, and generate. Audio is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth).
|
| 20 |
|
| 21 |
+
The model checkpoints download automatically on first launch.
|
| 22 |
|
| 23 |
+
## Prompt format
|
| 24 |
|
| 25 |
```
|
| 26 |
+
<speaker description>, "<dialogue>" <action direction> "<more dialogue>"
|
| 27 |
```
|
| 28 |
|
| 29 |
+
- **Inside double quotes**: dialogue and phonetic sounds (`"Hahaha"`, `"Mmmmm"`, `"Ugh"`)
|
| 30 |
+
- **Outside quotes**: stage directions (`She sighs.`, `He clears his throat.`)
|
| 31 |
+
- **Avoid inside quotes**: `Ahem`, `Pfft`, `Sigh`, `Gasp`, `Cough` (the model will speak them literally).
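
The quoting rules above can be checked mechanically before a prompt is sent to the model. The following is a minimal sketch of such a check. `lint_prompt` and `LITERAL_WORDS` are hypothetical names that do not exist in this repository, and the word list covers only the five examples given above.

```python
import re

# Words the list above says are spoken literally when placed inside quotes.
LITERAL_WORDS = {"ahem", "pfft", "sigh", "gasp", "cough"}


def lint_prompt(prompt: str) -> list[str]:
    """Return warnings for a scene prompt; an empty list means no issues found."""
    warnings = []
    text = prompt.strip()
    if text.count('"') % 2 != 0:
        warnings.append("unbalanced double quotes")
    # Text between each pair of double quotes is treated as dialogue.
    for dialogue in re.findall(r'"([^"]*)"', text):
        for word in re.findall(r"[A-Za-z]+", dialogue):
            if word.lower() in LITERAL_WORDS:
                warnings.append(f"'{word}' inside quotes will be spoken literally")
    return warnings


if __name__ == "__main__":
    print(lint_prompt('A woman speaks warmly, "Hello there." She sighs. "Hahaha!"'))
    print(lint_prompt('He steps forward, "Ahem, excuse me."'))
```

Stage directions such as `She sighs.` pass the check because only text inside double quotes is inspected.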
|
| 32 |
|
| 33 |
+
See the **Load an example prompt** dropdown for ready-made scene templates.
|
| 34 |
|
| 35 |
+
## Files
|
| 36 |
|
| 37 |
+
- `app.py`: Gradio UI
|
| 38 |
+
- `src/inference_server.py`: warm `TTSServer` (single load, ~2.5s/request)
|
| 39 |
+
- `src/inference.py`: CLI inference
|
| 40 |
+
- `src/model_downloader.py`: auto-fetches the models from HuggingFace
|
| 41 |
+
- `ltx2/`: vendored LTX-2 pipelines
|
| 42 |
+
- `requirements.txt`: Python deps (includes `resemble-perth`)
|