Manmay committed on
Commit 1850702 · verified · 1 Parent(s): 08c5e28

Restore Space README with sdk frontmatter

  1. README.md +31 -151
README.md CHANGED
@@ -1,162 +1,42 @@
- # Dramabox - Expressive TTS with Voice Cloning

- Prompt-driven TTS with voice cloning built on a 3.3B Diffusion Transformer with flow matching.

- ## Folder Structure

- ```
- DramaBox/
- ├── src/
- │   ├── inference.py            # TTS inference with voice cloning
- │   ├── inference_server.py     # Warm server (~2.5s per generation)
- │   ├── audio_conditioning.py   # Reference audio conditioning
- │   └── model_downloader.py     # Auto-download models from HuggingFace
- ├── patches/
- │   ├── attention.py            # dtype fix for mask allocation
- │   └── guiders.py              # Per-token CFG clamping
- ├── assets/
- │   └── silence_latent_frame.pt
- ├── evals/
- │   ├── eval_short.txt          # 30 short prompts (~5-15s)
- │   ├── eval_long.txt           # 15 long prompts (~20-37s)
- │   └── eval_expressive.txt     # 15 expressive prompts (laughs, sighs, stammers)
- ├── scripts/
- │   ├── inference.sh            # Inference wrapper
- │   └── eval.sh                 # Evaluation runner
- ├── app.py                      # Gradio demo app
- ├── ltx2/                       # LTX-2 dependency packages
- └── README.md
- ```
-
- ## Models
-
- Models auto-download from [ResembleAI/Dramabox](https://huggingface.co/ResembleAI/Dramabox) on HuggingFace.
-
- | Model | Size | Description |
- |-------|------|-------------|
- | `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer |
- | `dramabox-audio-components.safetensors` | 2.7 GB | Audio VAE + vocoder + text projection |
- | [unsloth/gemma-3-12b-it-bnb-4bit](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) | ~8 GB | Text encoder (auto-downloaded) |
-
- **VRAM**: ~24 GB peak | **Speed**: ~2.5s per generation (warm server, H100)
-
- ## Quick Start
-
- ### Warm Server (recommended, ~2.5s per request)
-
- ```python
- from src.inference_server import TTSServer
-
- server = TTSServer(device="cuda")
-
- server.generate_to_file(
-     prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
-     output="output.wav",
-     voice_ref="reference.wav",  # optional, 10+ seconds
- )
- ```
-
- ### Gradio App
-
- ```bash
- GEMINI_API_KEY=your_key CUDA_VISIBLE_DEVICES=4 python app.py
- ```
-
- ### CLI Inference
-
- ```bash
- python src/inference.py \
-     --voice-sample reference.wav \
-     --prompt 'A woman speaks warmly, "Hello, how are you today?"' \
-     --output output.wav \
-     --cfg-scale 2.5 --stg-scale 1.5
- ```
-
- ### Evaluation

- ```bash
- bash scripts/eval.sh --eval expressive --output eval_results/
- ```
-
- ## Inference Settings
-
- | Parameter | Default | Notes |
- |-----------|---------|-------|
- | cfg-scale | 2.5 | Lower = more natural, higher = more text following |
- | stg-scale | 1.5 | Skip-token guidance |
- | rescale | 0 | No rescaling |
- | modality | 1 | No modality guidance |
- | duration-multiplier | 1.1 | 10% breathing room |
- | steps | 30 | Euler flow matching |
-
- ## Prompt Writing Guide
-
- **Structure:** `<speaker description>, "<dialogue>" <action direction> "<more dialogue>"`
-
- ### What works inside quotes (model produces actual sounds)
- - Laughs: `"Hahaha"` `"Hehehe"` (always one word, never separated)
- - Sounds: `"Mmmmm"` `"Ugh"` `"Argh"` `"Ahhh"` `"Hmm"`
-
- ### What goes outside quotes (stage directions)
- - `She sighs deeply.` `He gulps nervously.` `A long pause.`
- - `Her voice cracks.` `He clears his throat.` `She scoffs.`
-
- ### Never inside quotes (model speaks them literally)
- - Ahem, Pfft, Sigh, Gasp, Cough
-
- ### Tips
- - Match gender/age in speaker description to voice reference
- - Break long dialogue into segments with acting directions between them
- - End prompt at the last closing quote mark (no trailing descriptions)
-
- ## Watermarking

- Every audio output from `inference.py` and `inference_server.TTSServer.generate_to_file` is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth), an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
-
- ```python
- import perth, librosa
- wav, sr = librosa.load("output.wav", sr=None, mono=True)
- detector = perth.PerthImplicitWatermarker()
- print(detector.get_watermark(wav, sample_rate=sr))  # confidence ≈ 1.0 for our outputs
  ```
-
- Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable for debugging.
-
- ## Training
-
- DramaBox is an IC-LoRA fine-tune of the LTX-2.3 22B audio-only branch. To train your own:
-
- ```bash
- # 1. Preprocess raw (audio, transcript) pairs → audio_latents/ + conditions/
- python src/preprocess.py \
-     --dataset-type manifest \
-     --index your_data.jsonl \
-     --output-dir /path/to/preprocessed/ \
-     --checkpoint dramabox-audio-components.safetensors \
-     --gemma-root /path/to/gemma-3-12b-it-bnb-4bit/
-
- # 2. Edit configs/training_args.example.yaml → your data paths
-
- # 3. Launch (uses HuggingFace accelerate)
- bash scripts/train.sh \
-     --config configs/training_args.example.yaml \
-     --gpus 0,1,2,3,4,5,6 \
-     --train-val-gpu 7
  ```

- | Script | Purpose |
- |---|---|
- | `src/preprocess.py` | Encode audio (Audio VAE) + text (Gemma) into training-ready `.pt` files |
- | `src/train.py` | IC-LoRA training loop with peft, accelerate multi-GPU, periodic validation |
- | `src/validate.py` | Spawned by `train.py` at each save step; runs the warm validator on a held-out prompt set |
- | `scripts/train.sh` | YAML-config wrapper around `accelerate launch src/train.py` |
-
- LoRA targets the audio branch only: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` × 48 transformer blocks (288 LoRA pairs total). Default rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.
-
- ## Language

- English.

- ## License

- Built on [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks. Distributed under the LTX-2 Community License Agreement; see [`LICENSE`](LICENSE).
+ ---
+ title: DramaBox
+ emoji: 🎭
+ colorFrom: red
+ colorTo: indigo
+ sdk: gradio
+ sdk_version: 4.44.1
+ app_file: app.py
+ pinned: true
+ license: other
+ license_name: ltx-2-community
+ license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
+ hardware: l40s
+ short_description: Expressive TTS with voice cloning - DramaBox demo
+ ---

+ # DramaBox - Expressive TTS Demo

+ Live demo of [`ResembleAI/Dramabox`](https://huggingface.co/ResembleAI/Dramabox). Write a scene prompt, optionally upload a 10-second voice reference, and generate. Audio is automatically watermarked with [Resemble Perth](https://github.com/resemble-ai/Perth).

+ The model checkpoints download automatically on first launch.

+ ## Prompt format

  ```
+ <speaker description>, "<dialogue>" <action direction> "<more dialogue>"
  ```

+ - **Inside double quotes**: dialogue and phonetic sounds (`"Hahaha"`, `"Mmmmm"`, `"Ugh"`)
+ - **Outside quotes**: stage directions (`She sighs.`, `He clears his throat.`)
+ - **Avoid inside quotes**: `Ahem`, `Pfft`, `Sigh`, `Gasp`, `Cough`, which the model will speak literally.

+ See the **Load an example prompt** dropdown for ready-made scene templates.
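
The quoting rules in the prompt format section can be checked mechanically before a prompt is sent to the model. The sketch below is only an illustration: `lint_prompt` and `LITERAL_WORDS` are hypothetical names that do not exist in the DramaBox repository, and the word list is copied from the "Avoid inside quotes" bullet rather than from any model specification.

```python
import re

# Hypothetical helper, not part of the DramaBox codebase.
# Words the README says the model speaks literally when they appear in quotes.
LITERAL_WORDS = {"ahem", "pfft", "sigh", "gasp", "cough"}


def lint_prompt(prompt: str) -> list[str]:
    """Return warnings for a prompt, based on the README's quoting rules."""
    warnings = []

    # Dialogue must sit inside balanced double quotes.
    if prompt.count('"') % 2 != 0:
        warnings.append("unbalanced double quotes")

    # Text between each pair of double quotes is treated as spoken dialogue.
    for quoted in re.findall(r'"([^"]*)"', prompt):
        for word in re.findall(r"[A-Za-z]+", quoted):
            if word.lower() in LITERAL_WORDS:
                warnings.append(f'"{word}" inside quotes will be spoken literally')

    return warnings


if __name__ == "__main__":
    print(lint_prompt('A man clears his throat. "Ahem, may I have your attention?"'))
```

Stage directions such as `She sighs.` pass the check because only quoted text is inspected, which mirrors the split between dialogue and directions described above.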

+ ## Files

+ - `app.py`: Gradio UI
+ - `src/inference_server.py`: warm `TTSServer` (single load, ~2.5s/request)
+ - `src/inference.py`: CLI inference
+ - `src/model_downloader.py`: auto-fetches model from HuggingFace
+ - `ltx2/`: vendored LTX-2 pipelines
+ - `requirements.txt`: Python deps (includes `resemble-perth`)
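
Since the point of this commit is to restore the `sdk` frontmatter, a quick check that a README still carries it may be useful in review. The following is a minimal sketch under stated assumptions: `parse_frontmatter` is a hypothetical helper, it handles only the flat `key: value` lines seen in this diff, and real YAML (nesting, quoting, multi-line values) would need a proper YAML parser.

```python
def parse_frontmatter(readme_text: str) -> dict[str, str]:
    """Parse the flat `key: value` block between the leading '---' markers.

    Hypothetical helper for illustration; not a general YAML parser.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter at the top of the file

    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing marker reached
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # opening marker without a closing one


README = """---
title: DramaBox
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
---

# DramaBox
"""

if __name__ == "__main__":
    fields = parse_frontmatter(README)
    print("sdk" in fields, fields.get("sdk"))
```

A README like the pre-commit version, which starts directly with a heading, yields an empty dictionary, so a missing `sdk` key is easy to flag.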