scenema-ai committed on
Commit
67537d3
·
verified ·
1 Parent(s): 1a0bb71

Update README.md

Files changed (1)
  1. README.md +257 -3
README.md CHANGED
@@ -1,3 +1,257 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ language:
3
+ - en
4
+ - de
5
+ - fr
6
+ - es
7
+ - it
8
+ - pt
9
+ - ja
10
+ - zh
11
+ - ko
12
+ - ru
13
+ - ar
14
+ - hi
15
+ - sw
16
+ license: mit
17
+ tags:
18
+ - audio-generation
19
+ - diffusion
20
+ - text-to-audio
21
+ - voice-cloning
22
+ - speech-generation
23
+ - expressive-speech
24
+ - voice-acting
25
+ pipeline_tag: text-to-audio
26
+ library_name: scenema-audio
27
+ ---
28
+
29
+ # Scenema Audio
30
+
31
+ **Zero-shot expressive voice cloning and speech generation.**
32
+
33
+ **[Visit scenema.ai/audio to hear all demos and try it out.](https://scenema.ai/audio)**
34
+
35
+ [![Demo Video](https://img.youtube.com/vi/DW1JzkZn_u0/maxresdefault.jpg)](https://youtu.be/DW1JzkZn_u0)
36
+
37
+ Most text-to-speech systems convert words into sound without performing them. Scenema Audio generates speech with intention, pacing, breath control, and emotional arcs that shift within a single generation, all from a text prompt that describes not just what to say but how to say it.
38
+
39
+ Built on an audio diffusion transformer extracted from [LTX 2.3](https://github.com/Lightricks/LTX-2)'s 22B parameter audiovisual model, it learned how people actually sound in real scenes: angry, laughing, whispering, crying, exhausted, terrified.
40
+
41
+ ## Capabilities
42
+
43
+ - **Emotional acting**: Rage, grief, joy, fear, exhaustion. Emotional state shifts within a single generation via action tags.
44
+ - **Child voices**: Six-year-olds, toddlers, teenagers. Naturally voiced, not pitch-shifted adults.
45
+ - **Scene-aware audio**: Describe the environment and the model generates speech with rain, thunder, crowds, or any ambient audio alongside the voice.
46
+ - **Zero-shot voice cloning**: Provide 10-20 seconds of reference audio with some emotional variability. The model transfers the voice identity onto any emotional performance. No fine-tuning, no enrollment.
47
+ - **Long-form narration**: Generates audio of any length by automatically splitting the text into segments and maintaining voice continuity across them.
48
+ - **Multilingual**: English, German, French, Spanish, Italian, Portuguese, Japanese, Chinese, Korean, Russian, Arabic, Hindi, Swahili.
49
+
50
+ ## Model Checkpoints
51
+
52
+ | File | Size | Description |
53
+ |------|------|-------------|
54
+ | `scenema-audio-transformer.safetensors` | 9.8 GB | Audio diffusion transformer (bf16) |
55
+ | `scenema-audio-transformer-int8.safetensors` | 4.9 GB | Audio diffusion transformer (INT8, near-identical quality) |
56
+ | `scenema-audio-pipeline.safetensors` | 6.7 GB | Audio VAE decoder + vocoder + text projection |
57
+ | `scenema-audio-vae-encoder.safetensors` | 42.7 MB | Audio VAE encoder for reference voice encoding |
58
+
59
+ ## Quick Start
60
+
61
+ ```bash
62
+ git clone https://github.com/ScenemaAI/scenema-audio.git
63
+ cd scenema-audio
64
+
65
+ export HF_TOKEN=your_huggingface_token
66
+ docker compose up
67
+ ```
68
+
69
+ Models are downloaded on first start (~38 GB) and cached in a Docker volume. See the [GitHub repo](https://github.com/ScenemaAI/scenema-audio) for full documentation.
70
+
71
+ ## Prompt Format
72
+
73
+ ```xml
74
+ <speak voice="VOICE_DESCRIPTION" gender="male|female"
75
+ scene="OPTIONAL_SCENE" language="OPTIONAL_LANG_CODE">
76
+ <action>Performance direction.</action>
77
+ Speech text here.
78
+ </speak>
79
+ ```
80
+
81
+ | Attribute | Required | Default | Description |
82
+ |-----------|----------|---------|-------------|
83
+ | `voice` | Yes | | Detailed voice description. Drives vocal quality, emotion, accent, age, timbre, delivery style. |
84
+ | `gender` | Yes | | `"male"` or `"female"`. Controls pronoun assignment in compiled prompts. |
85
+ | `scene` | No | | Environmental context. Conditions the ambient audio around the speech. |
86
+ | `language` | No | `"en"` | Language code. |
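Voice descriptions and speech text often contain characters such as `&`, `<`, or quotes that would break the XML. The sketch below assembles a `<speak>` string with escaping handled by the Python standard library. `build_speak_prompt` and its `segments` argument are hypothetical names used for illustration; they are not part of the Scenema Audio codebase.

```python
from xml.sax.saxutils import escape, quoteattr


def build_speak_prompt(voice, gender, segments, scene=None, language=None):
    """Assemble a <speak> prompt string (hypothetical helper).

    `segments` is a list of (kind, text) pairs, where kind is
    "action" (a stage direction) or "text" (words to be spoken).
    """
    if gender not in ("male", "female"):
        raise ValueError("gender must be 'male' or 'female'")

    attrs = [("voice", voice), ("gender", gender)]
    if scene:
        attrs.append(("scene", scene))
    if language:
        attrs.append(("language", language))
    # quoteattr() escapes &, <, > and wraps the value in safe quotes.
    attr_str = " ".join(f"{name}={quoteattr(value)}" for name, value in attrs)

    lines = [f"<speak {attr_str}>"]
    for kind, text in segments:
        if kind == "action":
            lines.append(f"<action>{escape(text)}</action>")
        elif kind == "text":
            lines.append(escape(text))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    lines.append("</speak>")
    return "\n".join(lines)


prompt = build_speak_prompt(
    voice="Middle-aged man, warm but weathered.",
    gender="male",
    segments=[
        ("action", "Calm, almost casual."),
        ("text", "I used to think I had all the time in the world."),
    ],
)
print(prompt)
```

Building the string programmatically also makes it easier to embed the prompt in a JSON request body, since the JSON encoder then takes care of quoting the attribute values.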
87
+
88
+ ### Voice Description
89
+
90
+ The `voice` attribute is the primary control. The richer and more specific, the better:
91
+
92
+ - **Vocal qualities**: timbre, pitch, breathiness, rasp, resonance
93
+ - **Emotional state**: rage, tenderness, exhaustion, excitement, grief
94
+ - **Speaking style**: pacing, emphasis, pauses, enunciation
95
+ - **Character archetypes**: "Think Tony Soprano having a breakdown"
96
+ - **Age and gender**: child, elderly, young woman, teenage boy
97
+ - **Accents**: British, Southern American, New Jersey Italian American
98
+
99
+ ### Action Tags
100
+
101
+ `<action>` tags are stage directions that shape HOW speech is delivered. Place them between speech segments to direct emotional shifts, pacing, and physical delivery:
102
+
103
+ ```xml
104
+ <speak voice="Middle-aged man, warm but weathered." gender="male">
105
+ <action>Calm, almost casual. Staring at his hands.</action>
106
+ I used to think I had all the time in the world.
107
+ <action>Voice tightens. Fighting to stay composed.</action>
108
+ Then one Tuesday morning, the doctor said three words that changed everything.
109
+ <action>Long pause. Deep breath. Raw but steady.</action>
110
+ And I realized I hadn't called my son in six months.
111
+ </speak>
112
+ ```
113
+
114
+ ### Voice Cloning
115
+
116
+ Provide 10-20 seconds of reference audio with some emotional variability. The model generates expressive speech from the prompt and transfers the reference voice's identity onto the performance.
117
+
118
+ ```json
119
+ {
120
+ "prompt": "<speak voice=\"Gravelly male voice, fast talking, rough.\" gender=\"male\"><action>He completely loses it</action>What are you waiting for?!</speak>",
121
+ "reference_voice_url": "https://example.com/reference.wav"
122
+ }
123
+ ```
124
+
125
+ Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.
126
+
127
+ ## Examples
128
+
129
+ ### Emotional Acting
130
+
131
+ ```xml
132
+ <speak voice="A man on the edge. Explosive rage. Italian-American inflection."
133
+ gender="male" scene="A dimly lit office, late at night">
134
+ <action>He stands up slowly, voice dangerously low</action>
135
+ You come into my house, you eat my food, and then you got the nerve
136
+ to tell me how to run my business.
137
+ <action>Voice rising, finger pointing</action>
138
+ I built this thing from nothing while you were sitting on your ass.
139
+ </speak>
140
+ ```
141
+
142
+ ### Child Voice
143
+
144
+ ```xml
145
+ <speak voice="A six-year-old girl, bright and excited, speaking fast
146
+ with breathless enthusiasm. Slight lisp on S sounds."
147
+ gender="female">
148
+ Mommy look! There is a rainbow and it goes all the way across the whole sky!
149
+ </speak>
150
+ ```
151
+
152
+ ### Scene-Aware Audio
153
+
154
+ ```xml
155
+ <speak voice="Male, mid 40s. Weathered. Urgent, projecting over wind."
156
+ gender="male" scene="Open dock in a thunderstorm, heavy rain"
157
+ shot="scene">
158
+ <sound>Heavy rain and wind howling</sound>
159
+ <action>He shouts over the storm</action>
160
+ Get the lines! She is pulling loose!
161
+ <sound>Thunder cracks overhead</sound>
162
+ Move! I said move!
163
+ </speak>
164
+ ```
165
+
166
+ ## API Reference
167
+
168
+ ### POST /generate
169
+
170
+ | Field | Type | Default | Description |
171
+ |-------|------|---------|-------------|
172
+ | `prompt` | string | **required** | `<speak>` XML string |
173
+ | `mode` | string | `"generate"` | `"generate"` for full pipeline. `"voice_design"` for 15s voice preview. |
174
+ | `reference_voice_url` | string | `null` | URL to reference audio for zero-shot voice cloning. 10-20 seconds with emotional variability is ideal. |
175
+ | `background_sfx` | bool | `false` | Keep generated sound effects in the output. |
176
+ | `validate` | bool | `true` | Whisper speech validation with retry on garbled output. |
177
+ | `seed` | int | `-1` | Generation seed. `-1` for random. |
178
+ | `pace` | float | `1.5` | Duration allocation multiplier. Higher = slower speech. |
179
+ | `min_match_ratio` | float | `0.90` | Whisper validation threshold (0.0-1.0). |
180
+ | `skip_vc` | bool | `false` | Skip voice conversion post-processing. |
181
+ | `vc_steps` | int | `25` | SeedVC diffusion steps (10-50). |
182
+ | `vc_cfg_rate` | float | `0.5` | SeedVC guidance rate (0.0-1.0). |
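As a rough illustration of how the fields above fit into a request, the following sketch builds a `POST /generate` call with the Python standard library. The base URL `http://localhost:8000` and the `application/json` content type are assumptions, as is the helper name `build_generate_request`; check the GitHub repo for the actual host, port, and any authentication.

```python
import json
import urllib.request

# Assumption: the container listens on localhost:8000.
BASE_URL = "http://localhost:8000"


def build_generate_request(prompt, reference_voice_url=None, seed=-1, **options):
    """Build (without sending) a POST /generate request."""
    # Extra keyword arguments (pace, background_sfx, ...) pass through as-is.
    payload = {"prompt": prompt, "seed": seed, **options}
    if reference_voice_url is not None:
        payload["reference_voice_url"] = reference_voice_url
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request(
    '<speak voice="Gravelly male voice." gender="male">What are you waiting for?!</speak>',
    seed=42,
    pace=1.5,
    background_sfx=False,
)

# To send it against a running server:
# with urllib.request.urlopen(request) as resp:
#     result = json.load(resp)
```

Fields left out of the payload fall back to the defaults listed in the table, so only the prompt is strictly needed.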
183
+
184
+ ### Response
185
+
186
+ Returns JSON with base64-encoded WAV audio:
187
+
188
+ ```json
189
+ {
190
+ "status": "succeeded",
191
+ "audio": "<base64-encoded WAV>",
192
+ "content_type": "audio/wav",
193
+ "metadata": {
194
+ "duration_s": 12.4,
195
+ "sample_rate": 48000,
196
+ "processing_ms": 8200,
197
+ "seed": 42
198
+ }
199
+ }
200
+ ```
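A minimal sketch of handling this response in Python: decode the base64 `audio` field and read the duration back from the WAV header. The helper names are hypothetical, and `make_silent_wav` only fabricates a stand-in payload so the example runs without a server. It assumes the server emits integer PCM WAV, which is the only encoding the standard `wave` module reads.

```python
import base64
import io
import wave


def decode_audio(response):
    """Return raw WAV bytes from a /generate response dict."""
    if response.get("status") != "succeeded":
        raise RuntimeError(f"generation failed: {response.get('status')!r}")
    return base64.b64decode(response["audio"])


def wav_duration_seconds(wav_bytes):
    """Read the duration from the WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_silent_wav(seconds, sample_rate=48000, channels=2):
    """Stand-in for server output so the sketch runs offline."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00" * (2 * channels * int(seconds * sample_rate)))
    return buffer.getvalue()


fake_response = {
    "status": "succeeded",
    "audio": base64.b64encode(make_silent_wav(0.5)).decode("ascii"),
    "content_type": "audio/wav",
}
wav_bytes = decode_audio(fake_response)
duration = wav_duration_seconds(wav_bytes)

# To save real output: open("output.wav", "wb").write(wav_bytes)
```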
201
+
202
+ ## Architecture
203
+
204
+ ```
205
+ XML prompt (voice + scene + action tags + text)
206
+ -> Gemma 3 12B text encoding
207
+ -> 8-step distilled latent diffusion
208
+ -> Audio VAE decoding
209
+ -> MelBandRoFormer vocal separation (strips SFX unless background_sfx=true)
210
+ -> SeedVC voice identity transfer (when reference provided or multi-chunk)
211
+ -> Output WAV (48kHz stereo)
212
+ ```
213
+
214
+ For longer text, the system splits at sentence boundaries using Kokoro phoneme-level duration estimation and maintains voice continuity between segments via A2V latent conditioning.
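The splitting step can be pictured as greedy sentence packing. The sketch below is not the project's implementation: it replaces Kokoro's phoneme-level duration estimate with a crude words-per-second constant (an assumption), and the function names are hypothetical. Only the 15-second cap comes from this README.

```python
import re

MAX_SEGMENT_SECONDS = 15.0
# Assumption: a flat speaking rate standing in for Kokoro's duration estimate.
WORDS_PER_SECOND = 2.5


def estimate_seconds(text):
    """Very rough spoken duration of `text`."""
    return len(text.split()) / WORDS_PER_SECOND


def split_into_segments(text, max_seconds=MAX_SEGMENT_SECONDS):
    """Greedily pack whole sentences into segments under max_seconds."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate) > max_seconds:
            # Adding this sentence would overflow, so close the segment.
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

A single sentence longer than the cap is kept whole here; a real splitter would need a fallback such as breaking at clause boundaries.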
215
+
216
+ ## VRAM Requirements
217
+
218
+ | VRAM | Audio Model | Gemma | Notes |
219
+ |------|------------|-------|-------|
220
+ | 16 GB | INT8 (4.9 GB) | CPU streaming | Needs 32 GB system RAM. ~7s/chunk encode. |
221
+ | 24 GB | INT8 (4.9 GB) | NF4 on GPU (~8 GB) | Default config. ~0.2s/chunk encode. |
222
+ | 48 GB | bf16 (9.8 GB) | bf16 on GPU (24 GB) | Best quality. All models resident. |
223
+
224
+ VRAM strategy is auto-detected. [SageAttention 2](https://github.com/thu-ml/SageAttention) is recommended for all configurations.
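To make the table concrete, here is a hypothetical lookup that maps available VRAM to one of its three rows. The actual auto-detection logic lives in the repo and may differ; in particular, treating each listed size as a lower bound (so a 32 GB card gets the 24 GB row) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VramPlan:
    audio_model: str
    gemma: str


def plan_for_vram(vram_gb):
    """Pick a row of the VRAM table for a given amount of GPU memory."""
    if vram_gb >= 48:
        return VramPlan("bf16", "bf16 on GPU")
    if vram_gb >= 24:
        return VramPlan("INT8", "NF4 on GPU")
    if vram_gb >= 16:
        return VramPlan("INT8", "CPU streaming")
    raise ValueError("the table lists no configuration below 16 GB")


print(plan_for_vram(24))
```

With PyTorch installed, the input could be taken from `torch.cuda.get_device_properties(0).total_memory / 1024**3`.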
225
+
226
+ ## Performance
227
+
228
+ Benchmarked on NVIDIA RTX 4090 (24 GB), ~55 seconds of output audio:
229
+
230
+ | Configuration | Total Time | Real-Time Factor (audio duration / total time) |
231
+ |--------------|-----------|-----------------|
232
+ | bf16 + bf16 streaming | 83s | 0.66x |
233
+ | INT8 + NF4 (all GPU) | 35s | 1.57x |
234
+
235
+ ## Limitations
236
+
237
+ - **Pronunciation**: Occasionally garbles complex multi-syllable words and proper nouns.
238
+ - **15-second generation window**: Each segment capped at ~15s. Longer text splits automatically.
239
+ - **Emotional range with voice cloning**: Identity transfer can reduce emotional extremes. Use a strong archetype in the voice description and provide reference audio with natural emotional variability (10-20 seconds, not monotone).
240
+ - **Multilingual pronunciation**: Language switching mid-speech may cause phonetic drift. Use separate requests per language.
241
+ - **Generation speed**: 3-8 seconds per 15-second segment depending on hardware.
242
+ - **Reference audio quality**: Low-quality references degrade output. Use clean audio with some emotional variability.
243
+ - **Gemma 3 12B is gated**: Requires accepting Google's terms of use and a HuggingFace token with access.
244
+
245
+ ## Acknowledgments
246
+
247
+ - [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks for the base audiovisual model
248
+ - [Gemma 3](https://ai.google.dev/gemma) by Google for the text encoder
249
+ - [SeedVC](https://github.com/Plachtaa/seed-vc) by Plachta for voice refinement
250
+ - [Kokoro](https://github.com/hexgrad/kokoro) by hexgrad for duration estimation
251
+ - [SageAttention](https://github.com/thu-ml/SageAttention) for attention acceleration
252
+
253
+ ## License
254
+
255
+ MIT License. See [LICENSE](LICENSE) for details.
256
+
257
+ Gemma 3 12B (text encoder) is a gated model; access requires accepting Google's terms of use.