Manmay committed on
Commit
84b8b88
·
0 Parent(s):

Initial release: Dramabox v1 - Expressive TTS with Voice Cloning

.gitattributes ADDED
@@ -0,0 +1,43 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ samples/01_queen_sighs_rage.wav filter=lfs diff=lfs merge=lfs -text
+ samples/refs/01_queen_sighs_rage.wav filter=lfs diff=lfs merge=lfs -text
+ samples/04_catgirl_giggles_snort.wav filter=lfs diff=lfs merge=lfs -text
+ samples/refs/04_catgirl_giggles_snort.wav filter=lfs diff=lfs merge=lfs -text
+ samples/06_arnie_panting_triumph.wav filter=lfs diff=lfs merge=lfs -text
+ samples/09_villain_sinister_laugh.wav filter=lfs diff=lfs merge=lfs -text
+ samples/refs/09_villain_sinister_laugh.wav filter=lfs diff=lfs merge=lfs -text
+ samples/13_conan_wheezing_laughter.wav filter=lfs diff=lfs merge=lfs -text
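Each rule in this `.gitattributes` routes matching paths through Git LFS. As a rough illustration of which files in this commit the rules would catch, the sketch below approximates the matching with Python's `fnmatchcase`. This is not Git's own matcher: Git applies gitignore-style rules, and constructs such as `**` behave differently there. The helper name `is_lfs_tracked` and the shortened pattern list are assumptions made for the example.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the patterns from the .gitattributes in this commit.
LFS_PATTERNS = [
    "*.safetensors",
    "*.pt",
    "*.tar.*",
    "*tfevents*",
    "samples/01_queen_sighs_rage.wav",
]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check (hypothetical helper, not Git's implementation).

    Patterns without a slash are compared against the file name only;
    patterns with a slash are compared against the whole path.
    """
    name = PurePosixPath(path).name
    for pattern in patterns:
        target = path if "/" in pattern else name
        if fnmatchcase(target, pattern):
            return True
    return False


for candidate in ["dramabox-dit-v1.safetensors", "assets/silence_latent_frame.pt", "README.md"]:
    print(candidate, is_lfs_tracked(candidate))
```

For an authoritative answer on a real checkout, `git check-attr filter -- <path>` reports the attribute Git itself resolves.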
README.md ADDED
@@ -0,0 +1,159 @@
+ ---
+ language:
+ - en
+ - hi
+ - es
+ - de
+ - fr
+ - ja
+ - it
+ - ko
+ - pt
+ - zh
+ license: other
+ pipeline_tag: text-to-speech
+ tags:
+ - tts
+ - voice-cloning
+ - audio-generation
+ - diffusion-transformer
+ - flow-matching
+ - ltx-2
+ library_name: ltx-audio-tts
+ ---
+
+ # Dramabox - Expressive TTS with Voice Cloning
+
+ Dramabox generates expressive, emotionally rich speech from scene descriptions with optional voice cloning. Built on a 3.3B Diffusion Transformer with flow matching, conditioned on Gemma 3 12B text embeddings.
+
+ ## Audio Samples
+
+ ### Regal Queen - Cold Fury to Venomous Whisper
+
+ **Prompt:** A regal woman speaks with cold fury in a measured, low voice. She sighs deeply, "I have told you a thousand times, and yet here we are again." Her voice sharpens with rising anger, "Do you honestly think I enjoy repeating myself?! Do you?!" She lets out a cold, mocking laugh, "Hahaha, how utterly pathetic you are." She drops to a venomous whisper, leaning close, "Now get out of my sight before I do something we will both regret."
+
+ <audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/01_queen_sighs_rage.wav"></audio>
+
+ ### Catgirl - Uncontrollable Giggling
+
+ **Prompt:** A playful girl speaks in a bright, singsong voice, already mid-giggle, "Hehehe, oh my gosh you should see your face right now, it is priceless!" She gasps for air between giggles, "Oh my, hehe, oh my, I cannot stop laughing!" She tries to compose herself with a long sigh, "Ahhhhh okay okay okay, I will stop, I promise I will stop." She leans in and whispers conspiratorially, "But seriously though, between you and me," then immediately loses it again, "Haha, no I, hehehe, I just cannot! You are way too funny, haha!" She snorts mid-laugh, "Pfft, oh no no no, that was so embarrassing, pretend you did not hear that!"
+
+ <audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/04_catgirl_giggles_snort.wav"></audio>
+
+ ### Action Hero - Panting Triumph
+
+ **Prompt:** A muscular man speaks with a thick accent, panting heavily, completely out of breath, "Hah... hah... we made it, we actually made it." He coughs roughly, "Ugh, that was the hardest fight of my entire life, I swear." He groans and clutches his side, "Argh, my ribs, I think something is broken." But then a grin spreads and he laughs heartily despite the pain, "Hahaha! But we WON! Can you believe it? We actually won!" He takes a deep, shuddering breath, "I told you, heh, I told you we would make it. Ahhh, it is finally over."
+
+ <audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/06_arnie_panting_triumph.wav"></audio>
+
+ ### Villain - Sinister Laugh
+
+ **Prompt:** A deep-voiced villain speaks with theatrical menace, chuckling softly at first, "Heheheh. Hahahahahahaha! Oh, forgive me, forgive me." He catches his breath with a sinister grin, He clears his throat. "It is just SO amusing when they struggle, is it not?" His voice drips with contempt, "I expected more from you, truly I did. How disappointing." He leans in close and whispers with vicious intensity, "But fear not, my dear. The REAL entertainment has only just begun." He chuckles one last time, "Heheheh."
+
+ <audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/09_villain_sinister_laugh.wav"></audio>
+
+ ### Talk Show Host - Wheezing Laughter
+
+ **Prompt:** A talk show host speaks with animated enthusiasm. He gasps with exaggerated shock, "No! You did NOT just say that, tell me you did not just say that!" He bursts into uncontrollable laughter, "HAHAHA! Oh my god, oh my god!" He wheezes, barely getting words out, "I cannot, I literally cannot breathe right now!" He wipes his eyes, sniffling, "Oh that is so good, that is really genuinely good." He sighs happily, "Ahhh okay okay, let me compose myself, I am a professional." He takes one breath then immediately cracks up again, "Pfft hehehe, no I absolutely cannot, I am so sorry everybody!" He claps, "Folks, THIS, this right here, is why I love my job!"
+
+ <audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/13_conan_wheezing_laughter.wav"></audio>
+
+ ---
+
+ ## Model Description
+
+ Dramabox is a prompt-driven TTS model where **the text prompt controls everything** - speaker identity, emotion, delivery style, laughs, sighs, pauses, and transitions. With voice cloning, a 10-second reference clip conditions the model to reproduce the speaker's timbre and characteristics.
+
+ ### Key Features
+
+ - **Prompt-driven expressiveness** - laughs, sighs, whispers, shouts, and emotional transitions are all controlled by the scene description
+ - **Voice cloning** from a 10-second reference clip
+ - **10 languages** - EN, HI, ES, DE, FR, JA, IT, KO, PT, ZH
+ - **Fast inference** - ~2.5s per generation with a warm server on an H100
+
+ ### Architecture
+
+ | Component | Details |
+ |-----------|---------|
+ | **Transformer** | 3.3B parameter DiT, 48 layers, flow matching (30-step Euler) |
+ | **Text Encoder** | Gemma 3 12B (q4 quantized) + learned embeddings processor |
+ | **Audio VAE** | Encodes/decodes 48kHz audio via mel spectrogram latents |
+ | **Voice Cloning** | Reference audio tokens appended to target with asymmetric attention mask |
+
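The README does not spell out what "asymmetric" means for the voice-cloning mask. One plausible reading, shown here purely as an assumption, is that target tokens may attend to both target and reference tokens, while reference tokens attend only to each other, so the reference representation is not influenced by the noisy target. The function name and the token layout (target first, reference appended) are hypothetical and are not taken from the Dramabox code.

```python
def build_asymmetric_mask(num_target: int, num_ref: int) -> list[list[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed layout: target tokens first, reference tokens appended after.
    Assumed rule: target queries see everything, reference queries see
    only reference keys. The real model may use a different rule.
    """
    total = num_target + num_ref
    mask = []
    for q in range(total):
        q_is_target = q < num_target
        row = []
        for k in range(total):
            k_is_target = k < num_target
            row.append(q_is_target or not k_is_target)
        mask.append(row)
    return mask


# Two target tokens followed by two reference tokens.
for row in build_asymmetric_mask(2, 2):
    print(["x" if allowed else "." for allowed in row])
```

The mask is not symmetric under transposition: the top-right block (target attending to reference) is open while the bottom-left block (reference attending to target) is closed.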
+ ## Files
+
+ | File | Size | Description |
+ |------|------|-------------|
+ | `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (voice cloning weights merged) |
+ | `dramabox-audio-components.safetensors` | 2.7 GB | Audio VAE encoder/decoder + vocoder + text projection |
+ | `assets/silence_latent_frame.pt` | 1.5 KB | VAE-encoded silence frame |
+ | `config.json` | - | Model configuration |
+
+ **Additional requirement**: [unsloth/gemma-3-12b-it-bnb-4bit](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) (text encoder, pre-quantized 4-bit, auto-downloaded)
+
+ ## Quick Start
+
+ ```python
+ from inference_server import TTSServer
+
+ # Models auto-download from HuggingFace
+ server = TTSServer(device="cuda", bnb_4bit=True)
+
+ # Text-to-speech
+ server.generate_to_file(
+     prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
+     output="output.wav",
+ )
+
+ # Voice cloning
+ server.generate_to_file(
+     prompt='A woman speaks warmly, "Hello, how are you today?"',
+     output="cloned.wav",
+     voice_ref="reference.wav",  # 10+ seconds of target voice
+ )
+ ```
+
+ ## Prompt Format
+
+ The prompt is a scene description that controls how the model speaks:
+
+ ```
+ <speaker description>, "<dialogue>" <action direction> "<more dialogue>"
+ ```
+
+ ### What Works Inside Quotes (model produces actual sounds)
+ - Laughs: `"Hahaha"` `"Hehehe"` (always as one word, never separated)
+ - Sounds: `"Mmmmm"` `"Ugh"` `"Argh"` `"Ahhh"` `"Hmm"`
+
+ ### What Goes Outside Quotes (stage directions)
+ - `She sighs deeply.` `He gulps nervously.` `A long pause.`
+ - `Her voice cracks.` `He clears his throat.` `She scoffs.`
+
+ ### Never Inside Quotes (model speaks them literally)
+ - Ahem, Pfft, Sigh, Gasp, Cough
+
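The quoting rules above are easy to break when prompts are assembled by hand. The following is a minimal sketch of a prompt checker, assuming only the rules listed in the README. The function `lint_prompt` and its warning messages are hypothetical and not part of Dramabox.

```python
import re

# Words the README says the model reads out literally when they are quoted.
LITERAL_WORDS = {"ahem", "pfft", "sigh", "gasp", "cough"}


def lint_prompt(prompt: str) -> list[str]:
    """Return warnings for quoted dialogue that breaks the guidelines above."""
    warnings = []
    for dialogue in re.findall(r'"([^"]*)"', prompt):
        for word in re.findall(r"[A-Za-z]+", dialogue.lower()):
            if word in LITERAL_WORDS:
                warnings.append(f'"{word}" is inside quotes; use a stage direction instead')
        # Laughs should be written as one word ("Hahaha"), not "Ha ha ha".
        if re.search(r"\b(ha|he)(\s+(ha|he))+\b", dialogue, flags=re.IGNORECASE):
            warnings.append("laugh is split into separate words; write it as one word")
    return warnings


bad = 'A man speaks, "Ha ha ha, that is funny." He pauses, "Sigh, fine."'
good = 'A woman speaks warmly, "Hahaha, it is so good to see you!" She sighs deeply.'
print(lint_prompt(bad))
print(lint_prompt(good))
```

Stage directions such as `She sighs deeply.` sit outside the quotes, so the checker ignores them.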
+ ## Inference Settings
+
+ | Parameter | Default | Notes |
+ |-----------|---------|-------|
+ | cfg_scale | 2.5 | Text adherence (lower = more natural) |
+ | stg_scale | 1.5 | Skip-token guidance |
+ | rescale | 0.0 | No rescaling |
+ | modality | 1.0 | No modality guidance |
+ | duration_multiplier | 1.1 | 10% extra breathing room |
+ | steps | 30 | Euler flow matching |
+
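To make the `steps` and `cfg_scale` rows concrete, here is a toy sketch of fixed-step Euler integration of a velocity field, combined with the standard classifier-free guidance blend. It is an illustration under stated assumptions, not the Dramabox sampler: the time direction (0 to 1), the scalar state, and the toy velocity field are all made up for the example, and `stg_scale`, `rescale`, and `modality` are not modelled.

```python
from typing import Callable


def guided_velocity(v_cond: float, v_uncond: float, cfg_scale: float = 2.5) -> float:
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return v_uncond + cfg_scale * (v_cond - v_uncond)


def euler_sample(x0: float, velocity: Callable[[float, float], float], steps: int = 30) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with equal Euler steps."""
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x


# Toy velocity for a straight path toward a fixed target value.
target = 3.0
result = euler_sample(-1.0, lambda x, t: (target - x) / (1.0 - t))
print(result)  # approximately 3.0
```

With `cfg_scale = 1.0` the blend reduces to the conditional prediction; larger values push further away from the unconditional one, which is consistent with the table's note that lower values sound more natural.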
+ ## VRAM Requirements
+
+ | Setup | VRAM | Speed |
+ |-------|------|-------|
+ | Warm server (recommended) | **~24 GB** | **~2.5s** |
+ | Cold inference (per-call loading) | ~8 GB peak | ~30s |
+
+ ## Supported Languages
+
+ English, Hindi, Spanish, German, French, Japanese, Italian, Korean, Portuguese, Mandarin
+
+ ## License
+
+ Built on [LTX-2.3](https://github.com/Lightricks/LTX-2) by Lightricks. Please refer to the LTX-2 license for usage terms.
assets/silence_latent_frame.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f73746d2163f8f1742c5de89005404ccaeeff05154bbb10a3337bf9bd13f161c
+ size 1501
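The three lines above are a Git LFS pointer: the repository stores this small text stub, and the real binary is fetched from LFS storage by its SHA-256 digest. A minimal sketch of reading such a pointer follows. The helper name `parse_lfs_pointer` and the returned field names are assumptions for illustration.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the space-separated key/value lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:f73746d2163f8f1742c5de89005404ccaeeff05154bbb10a3337bf9bd13f161c
size 1501
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algorithm"], info["size_bytes"])
```

The parsed size of 1501 bytes agrees with the 1.5 KB listed for `assets/silence_latent_frame.pt` in the README's file table.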
config.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "model_type": "dramabox-tts",
+   "architecture": "DiT-FlowMatching",
+   "base_model": "ltx-2.3-22b-dev-audio-only",
+   "parameters": "3.3B",
+   "num_layers": 48,
+   "audio_inner_dim": 2048,
+   "audio_num_attention_heads": 32,
+   "audio_attention_head_dim": 64,
+   "audio_cross_attention_dim": 2048,
+   "denoising_steps": 30,
+   "scheduler": "euler_flow_matching",
+   "text_encoder": "google/gemma-3-12b-it-qat-q4_0-unquantized",
+   "text_encoder_hidden_size": 3840,
+   "ic_lora": {
+     "rank": 128,
+     "alpha": 128,
+     "merged": true,
+     "training_version": "v13",
+     "text_dropout": 0.4,
+     "training_steps": "v12@3000 + v13@1000"
+   },
+   "audio": {
+     "sample_rate": 48000,
+     "vae_channels": 8,
+     "mel_bins": 16,
+     "fps": 25.0
+   },
+   "inference_defaults": {
+     "cfg_scale": 2.5,
+     "stg_scale": 1.5,
+     "rescale_scale": 0.0,
+     "modality_scale": 1.0,
+     "duration_multiplier": 1.1,
+     "seed": 42
+   },
+   "files": {
+     "transformer": "dramabox-dit-v1.safetensors",
+     "audio_components": "dramabox-audio-components.safetensors",
+     "silence_latent": "assets/silence_latent_frame.pt"
+   }
+ }
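A few values in this config can be cross-checked with simple arithmetic. The sketch below loads an excerpt of the JSON above and derives two quantities. Reading `fps` as latent frames per second of audio is an assumption; the config itself does not define the field.

```python
import json

# Excerpt of the config.json added in this commit.
config = json.loads("""
{
  "audio_inner_dim": 2048,
  "audio_num_attention_heads": 32,
  "audio_attention_head_dim": 64,
  "audio": {"sample_rate": 48000, "fps": 25.0}
}
""")

heads = config["audio_num_attention_heads"]
head_dim = config["audio_attention_head_dim"]

# Heads times per-head width should equal the transformer's inner width.
inner_dim_matches = heads * head_dim == config["audio_inner_dim"]

# Assumption: "fps" counts latent frames per second of audio.
samples_per_frame = config["audio"]["sample_rate"] / config["audio"]["fps"]

print(inner_dim_matches, samples_per_frame)
```

Under that reading, each latent frame covers 1920 audio samples, or 40 ms at 48 kHz.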
dramabox-audio-components.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f6bb7195f91ffac65f8773215851bf751c86bab9f7d130e9fc29e9fef2bd7954
+ size 2676984708
dramabox-dit-v1.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:248d292627f8fa67ed3e587171c28051edb2c06ce7d2d2a9e15132f0bff0540f
+ size 6573055336
samples/01_queen_sighs_rage.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:758fb1412f9af73721e59a6e4c949bbd14aeda802d52786471fe3130a84a447e
+ size 4855758
samples/04_catgirl_giggles_snort.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d12ac377fceed7488493e6ee4ae9c8d7f9294bb64822068fc54e2e6350ca1453
+ size 7620558
samples/06_arnie_panting_triumph.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:68b67153ba93c500254f963840b46da1439785e50b3ff432df8f9d8c3f47a035
+ size 6268878
samples/09_villain_sinister_laugh.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fd2368c3fad976f9ce54cf8d0608a78574ab83a7584d5754c3703c9ade64fb69
+ size 5285838
samples/13_conan_wheezing_laughter.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b8ee6e28c11c7844599213f8eebe72ab20dc43c55621ca453e16cad0609d45d3
+ size 7190478
samples/refs/01_queen_sighs_rage.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c0bf624251cc325098863e3b5e280505c4dccfd5591e6312a1844a467b1a3f14
+ size 351616
samples/refs/04_catgirl_giggles_snort.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e6e4a21b962c30a2644a6e7f6b5e2b0a7db8b63d2cf2efa69b009bd9b62b0bf3
+ size 414478
samples/refs/09_villain_sinister_laugh.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:41f266980881a7c61027f73831b559dde846469e74966d37bb06c52992ae472c
+ size 349946