---
language:
  - en
  - de
  - fr
  - es
  - it
  - pt
  - ja
  - zh
  - ko
  - ru
  - ar
  - hi
  - sw
license: other
license_name: ltx-2-community
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
tags:
  - audio-generation
  - diffusion
  - text-to-audio
  - voice-cloning
  - speech-generation
  - expressive-speech
  - voice-acting
  - text-to-speech
pipeline_tag: text-to-speech
library_name: scenema-audio
inference: false
---

# Scenema Audio

**Zero-shot expressive voice cloning and speech generation.**

**[Visit scenema.ai/audio to hear all demos and try it out.](https://scenema.ai/audio)**

**[Watch the demo video on YouTube](https://youtu.be/VnEQ_ImOaAc)**

Every existing text-to-speech system converts words into sound, but none of them perform. Scenema Audio generates speech with intention, pacing, breath control, and emotional arcs that shift within a single generation, all from a text prompt that describes not just what to say but how to say it.

Built on an audio diffusion transformer extracted from [LTX 2.3](https://github.com/Lightricks/LTX-2)'s 22B-parameter audiovisual model, Scenema Audio learned how people actually sound in real scenes: angry, laughing, whispering, crying, exhausted, terrified.

## Capabilities

- **Emotional acting**: Rage, grief, joy, fear, exhaustion. Emotional state shifts within a single generation via action tags.
- **Child voices**: Six-year-olds, toddlers, teenagers. Naturally voiced, not pitch-shifted adults.
- **Scene-aware audio**: Describe the environment and the model generates speech with rain, thunder, crowds, or any ambient audio alongside the voice.
- **Zero-shot voice cloning**: Provide 10-20 seconds of reference audio with some emotional variability. The model transfers the voice identity onto any emotional performance. No fine-tuning, no enrollment.
- **Long-form narration**: Generates any length of audio by automatically splitting text and maintaining voice continuity across segments.
- **Multilingual**: English, German, French, Spanish, Italian, Portuguese, Japanese, Chinese, Korean, Russian, Arabic, Hindi, Swahili.

## Model Checkpoints

| File | Size | Description |
|------|------|-------------|
| `scenema-audio-transformer.safetensors` | 9.8 GB | Audio diffusion transformer (bf16) |
| `scenema-audio-transformer-int8.safetensors` | 4.9 GB | Audio diffusion transformer (INT8, identical quality) |
| `scenema-audio-pipeline.safetensors` | 6.7 GB | Audio VAE decoder + vocoder + text projection |
| `scenema-audio-vae-encoder.safetensors` | 42.7 MB | Audio VAE encoder for reference voice encoding |

## Quick Start

```bash
git clone https://github.com/ScenemaAI/scenema-audio.git
cd scenema-audio

export HF_TOKEN=your_huggingface_token
docker compose up
```

Models are downloaded on first start (~38 GB) and cached in a Docker volume. See the [GitHub repo](https://github.com/ScenemaAI/scenema-audio) for full documentation.

## Prompt Format

```xml
<speak voice="VOICE_DESCRIPTION" gender="male|female"
       scene="OPTIONAL_SCENE" language="OPTIONAL_LANG_CODE">
  <action>Performance direction.</action>
  Speech text here.
</speak>
```

| Attribute | Required | Default | Description |
|-----------|----------|---------|-------------|
| `voice` | Yes | | Detailed voice description. Drives vocal quality, emotion, accent, age, timbre, delivery style. |
| `gender` | Yes | | `"male"` or `"female"`. Controls pronoun assignment in compiled prompts. |
| `scene` | No | | Environmental context. Conditions the ambient audio around the speech. |
| `language` | No | `"en"` | Language code. |
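
When prompts are assembled programmatically, quotes or ampersands in a voice description can break the markup. The following is a minimal sketch of a builder that escapes attribute values and text. `build_speak_prompt` and its `segments` argument are hypothetical helpers, not part of the project, and it assumes the server parses standard XML entities:

```python
from xml.sax.saxutils import escape, quoteattr


def build_speak_prompt(voice, gender, segments, scene=None, language=None):
    """Assemble a <speak> prompt string.

    `segments` is a list of (kind, text) pairs, where kind is "action" or "text".
    This helper is illustrative only; it is not shipped with the project.
    """
    if gender not in ("male", "female"):
        raise ValueError("gender must be 'male' or 'female'")

    # Required attributes first, optional ones only when supplied.
    attrs = [("voice", voice), ("gender", gender)]
    if scene:
        attrs.append(("scene", scene))
    if language:
        attrs.append(("language", language))
    attr_text = " ".join(f"{name}={quoteattr(value)}" for name, value in attrs)

    lines = [f"<speak {attr_text}>"]
    for kind, text in segments:
        if kind == "action":
            lines.append(f"  <action>{escape(text)}</action>")
        elif kind == "text":
            lines.append(f"  {escape(text)}")
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    lines.append("</speak>")
    return "\n".join(lines)


prompt = build_speak_prompt(
    voice="Middle-aged man, warm but weathered.",
    gender="male",
    segments=[
        ("action", "Calm, almost casual."),
        ("text", "I used to think I had all the time in the world."),
    ],
)
print(prompt)
```

Omitting `language` leaves the server default (`"en"`) in effect.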

### Voice Description

The `voice` attribute is the primary control. The richer and more specific, the better:

- **Vocal qualities**: timbre, pitch, breathiness, rasp, resonance
- **Emotional state**: rage, tenderness, exhaustion, excitement, grief
- **Speaking style**: pacing, emphasis, pauses, enunciation
- **Character archetypes**: "Think Tony Soprano having a breakdown"
- **Age and gender**: child, elderly, young woman, teenage boy
- **Accents**: British, Southern American, New Jersey Italian American

### Action Tags

`<action>` tags are stage directions that shape HOW speech is delivered. Place them between speech segments to direct emotional shifts, pacing, and physical delivery:

```xml
<speak voice="Middle-aged man, warm but weathered." gender="male">
  <action>Calm, almost casual. Staring at his hands.</action>
  I used to think I had all the time in the world.
  <action>Voice tightens. Fighting to stay composed.</action>
  Then one Tuesday morning, the doctor said three words that changed everything.
  <action>Long pause. Deep breath. Raw but steady.</action>
  And I realized I hadn't called my son in six months.
</speak>
```
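
To check how directions and speech interleave before sending a prompt, the markup can be walked with Python's standard XML parser. This is a sketch for inspection only. `list_segments` is a hypothetical helper, and it assumes the prompt is well-formed XML, which may be stricter than what the server itself accepts:

```python
import xml.etree.ElementTree as ET


def list_segments(prompt):
    """Return the prompt body as an ordered list of (kind, text) pairs.

    Child elements keep their tag name (for example "action"); loose speech
    between tags is reported as "text".
    """
    root = ET.fromstring(prompt)
    segments = []
    # Speech that appears before the first child tag.
    if root.text and root.text.strip():
        segments.append(("text", root.text.strip()))
    for child in root:
        segments.append((child.tag, (child.text or "").strip()))
        # Speech that follows this tag is stored as the element's tail.
        if child.tail and child.tail.strip():
            segments.append(("text", child.tail.strip()))
    return segments


prompt = """<speak voice="Middle-aged man, warm but weathered." gender="male">
  <action>Calm, almost casual. Staring at his hands.</action>
  I used to think I had all the time in the world.
  <action>Voice tightens. Fighting to stay composed.</action>
  Then one Tuesday morning, the doctor said three words that changed everything.
</speak>"""

for kind, text in list_segments(prompt):
    print(f"{kind:>6}: {text}")
```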

### Voice Cloning

Provide 10-20 seconds of reference audio with some emotional variability. The model generates expressive speech from the prompt and transfers the reference voice's identity onto the performance.

```json
{
  "prompt": "<speak voice=\"Gravelly male voice, fast talking, rough.\" gender=\"male\"><action>He completely loses it</action>What are you waiting for?!</speak>",
  "reference_voice_url": "https://example.com/reference.wav"
}
```

Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.

## Examples

### Emotional Acting

```xml
<speak voice="A man on the edge. Explosive rage. Italian-American inflection."
       gender="male" scene="A dimly lit office, late at night">
  <action>He stands up slowly, voice dangerously low</action>
  You come into my house, you eat my food, and then you got the nerve
  to tell me how to run my business.
  <action>Voice rising, finger pointing</action>
  I built this thing from nothing while you were sitting on your ass.
</speak>
```

### Child Voice

```xml
<speak voice="A six-year-old girl, bright and excited, speaking fast
with breathless enthusiasm. Slight lisp on S sounds."
gender="female">
  Mommy look! There is a rainbow and it goes all the way across the whole sky!
</speak>
```

### Scene-Aware Audio

```xml
<speak voice="Male, mid 40s. Weathered. Urgent, projecting over wind."
       gender="male" scene="Open dock in a thunderstorm, heavy rain"
       shot="scene">
  <sound>Heavy rain and wind howling</sound>
  <action>He shouts over the storm</action>
  Get the lines! She is pulling loose!
  <sound>Thunder cracks overhead</sound>
  Move! I said move!
</speak>
```

## API Reference

### POST /generate

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `prompt` | string | **required** | `<speak>` XML string |
| `mode` | string | `"generate"` | `"generate"` for full pipeline. `"voice_design"` for 15s voice preview. |
| `reference_voice_url` | string | `null` | URL to reference audio for zero-shot voice cloning. 10-20 seconds with emotional variability is ideal. |
| `background_sfx` | bool | `false` | Keep generated sound effects in the output. |
| `validate` | bool | `true` | Whisper speech validation with retry on garbled output. |
| `seed` | int | `-1` | Generation seed. `-1` for random. |
| `pace` | float | `1.5` | Duration allocation multiplier. Higher = slower speech. |
| `min_match_ratio` | float | `0.90` | Whisper validation threshold (0.0-1.0). |
| `skip_vc` | bool | `false` | Skip voice conversion post-processing. |
| `vc_steps` | int | `25` | SeedVC diffusion steps (10-50). |
| `vc_cfg_rate` | float | `0.5` | SeedVC guidance rate (0.0-1.0). |
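
A minimal client sketch using only the Python standard library follows. The base URL and port are assumptions (check the GitHub repo for the actual address), and `build_generate_payload` and `generate` are hypothetical helper names. The range checks mirror the ranges listed in the table above:

```python
import json
import urllib.request

# Assumption: address of the locally running server; adjust to your setup.
BASE_URL = "http://localhost:8000"


def build_generate_payload(prompt, reference_voice_url=None, **options):
    """Build a /generate request body, checking the ranges listed in the table."""
    payload = {"prompt": prompt}
    if reference_voice_url is not None:
        payload["reference_voice_url"] = reference_voice_url
    payload.update(options)

    if payload.get("mode", "generate") not in ("generate", "voice_design"):
        raise ValueError("mode must be 'generate' or 'voice_design'")
    if not 0.0 <= payload.get("min_match_ratio", 0.90) <= 1.0:
        raise ValueError("min_match_ratio must be within 0.0-1.0")
    if not 10 <= payload.get("vc_steps", 25) <= 50:
        raise ValueError("vc_steps must be within 10-50")
    if not 0.0 <= payload.get("vc_cfg_rate", 0.5) <= 1.0:
        raise ValueError("vc_cfg_rate must be within 0.0-1.0")
    return payload


def generate(payload, base_url=BASE_URL, timeout=600):
    """POST the payload to /generate and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_generate_payload(
    '<speak voice="Gravelly male voice, fast talking, rough." gender="male">'
    "<action>He completely loses it</action>What are you waiting for?!</speak>",
    reference_voice_url="https://example.com/reference.wav",
    seed=42,
)
print(json.dumps(payload, indent=2))
```

Fields left out of the payload fall back to the server defaults. Calling `generate(payload)` sends the request once a server is running.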

### Response

Returns JSON with base64-encoded WAV audio:

```json
{
  "status": "succeeded",
  "audio": "<base64-encoded WAV>",
  "content_type": "audio/wav",
  "metadata": {
    "duration_s": 12.4,
    "sample_rate": 48000,
    "processing_ms": 8200,
    "seed": 42
  }
}
```
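
The `audio` field has to be base64-decoded before it can be played or saved. The sketch below writes the WAV to disk and reads its header back for a sanity check. `save_response_audio` is a hypothetical helper. It assumes the WAV is integer PCM (the only encoding Python's `wave` module reads), and it treats any status other than `"succeeded"` as a failure because no other status values are documented here:

```python
import base64
import io
import wave


def save_response_audio(response, path):
    """Decode the base64 WAV from a /generate response and write it to `path`.

    Returns (sample_rate, duration_seconds) as read from the WAV header.
    """
    if response.get("status") != "succeeded":
        raise RuntimeError(f"generation did not succeed: {response.get('status')!r}")

    wav_bytes = base64.b64decode(response["audio"])
    with open(path, "wb") as handle:
        handle.write(wav_bytes)

    # Read the header back so it can be compared with the reported metadata.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        sample_rate = wav.getframerate()
        duration_s = wav.getnframes() / sample_rate
    return sample_rate, duration_s
```

The returned values can be compared with `metadata.sample_rate` and `metadata.duration_s` from the same response.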

## Architecture

```
XML prompt (voice + scene + action tags + text)
  -> Gemma 3 12B text encoding
  -> 8-step distilled latent diffusion
  -> Audio VAE decoding
  -> MelBandRoFormer vocal separation (strips SFX unless background_sfx=true)
  -> SeedVC voice identity transfer (when reference provided or multi-chunk)
  -> Output WAV (48kHz stereo)
```

For longer text, the system splits at sentence boundaries using Kokoro phoneme-level duration estimation and maintains voice continuity between segments via A2V latent conditioning.
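
The splitting idea can be illustrated with a simplified sketch. It is not the project's implementation. It swaps Kokoro's phoneme-level estimate for a crude words-per-second guess (`WORDS_PER_SECOND` is an assumed value), and it does not model the latent conditioning between segments. All function names are hypothetical:

```python
import re

MAX_SEGMENT_S = 15.0    # per-segment cap stated under Limitations
WORDS_PER_SECOND = 2.5  # assumption: rough speaking rate, not a project value


def estimate_duration_s(text, pace=1.5):
    """Crude stand-in for phoneme-level estimation: word count scaled by pace."""
    return len(text.split()) / WORDS_PER_SECOND * pace


def split_into_segments(text, pace=1.5, max_segment_s=MAX_SEGMENT_S):
    """Greedily pack whole sentences into segments that fit the time budget."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_duration_s(candidate, pace) > max_segment_s:
            # Adding this sentence would overflow the budget: close the segment.
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

A single sentence whose estimate exceeds the budget is kept whole as its own segment in this sketch. A real splitter would need a fallback, such as breaking at clause boundaries.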

## VRAM Requirements

| VRAM | Audio Model | Gemma | Notes |
|------|------------|-------|-------|
| 16 GB | INT8 (4.9 GB) | CPU streaming | Needs 32 GB system RAM. ~7s/chunk encode. |
| 24 GB | INT8 (4.9 GB) | NF4 on GPU (~8 GB) | Default config. ~0.2s/chunk encode. |
| 48 GB | bf16 (9.8 GB) | bf16 on GPU (24 GB) | Best quality. All models resident. |

VRAM strategy is auto-detected. [SageAttention 2](https://github.com/thu-ml/SageAttention) recommended for all configurations.
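
For capacity planning, the table can be expressed as a simple lookup. This sketch is not the project's auto-detection logic. `pick_config` is a hypothetical helper, and treating each row as a minimum threshold for everything up to the next row is an assumption:

```python
def pick_config(vram_gb):
    """Map available VRAM (in GB) to the matching row of the table above."""
    if vram_gb >= 48:
        return {"audio_model": "bf16", "gemma": "bf16 on GPU"}
    if vram_gb >= 24:
        return {"audio_model": "int8", "gemma": "NF4 on GPU"}
    if vram_gb >= 16:
        # This tier also needs 32 GB of system RAM for CPU streaming.
        return {"audio_model": "int8", "gemma": "CPU streaming"}
    raise ValueError("the table lists no configuration below 16 GB of VRAM")


for vram_gb in (16, 24, 48):
    print(vram_gb, pick_config(vram_gb))
```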

## Performance

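Benchmarked on NVIDIA RTX 4090 (24 GB), ~55 seconds of output audio. Real-Time Factor here is output audio duration divided by total processing time, so higher is faster and values above 1x are faster than real time: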

| Configuration | Total Time | Real-Time Factor |
|--------------|-----------|-----------------|
| bf16 + bf16 streaming | 83s | 0.66x |
| INT8 + NF4 (all GPU) | 35s | 1.57x |
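
The figures in the table are consistent with dividing the ~55 seconds of output audio by the total processing time. A short check, with `real_time_factor` as an illustrative helper name:

```python
def real_time_factor(audio_s, processing_s):
    """Seconds of audio produced per second of wall-clock processing."""
    return audio_s / processing_s


AUDIO_S = 55  # approximate output length used in the benchmark

for name, total_s in [("bf16 + bf16 streaming", 83), ("INT8 + NF4 (all GPU)", 35)]:
    print(f"{name}: {real_time_factor(AUDIO_S, total_s):.2f}x")
# bf16 + bf16 streaming: 0.66x
# INT8 + NF4 (all GPU): 1.57x
```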

## Limitations

- **Pronunciation**: Occasionally garbles complex multi-syllable words and proper nouns.
- **15-second generation window**: Each segment capped at ~15s. Longer text splits automatically.
- **Emotional range with voice cloning**: Identity transfer can reduce emotional extremes. Use a strong archetype in the voice description and provide reference audio with natural emotional variability (10-20 seconds, not monotone).
- **Multilingual pronunciation**: Language switching mid-speech may cause phonetic drift. Use separate requests per language.
- **Generation speed**: 3-8 seconds per 15-second segment depending on hardware.
- **Reference audio quality**: Low-quality references degrade output. Use clean audio with some emotional variability.
- **Gemma 3 12B is gated**: Requires accepting Google's terms of use and a HuggingFace token with access.

## Acknowledgments

- [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks for the base audiovisual model
- [Gemma 3](https://ai.google.dev/gemma) by Google for the text encoder
- [SeedVC](https://github.com/Plachtaa/seed-vc) by Plachta for voice refinement
- [Kokoro](https://github.com/hexgrad/kokoro) by hexgrad for duration estimation
- [SageAttention](https://github.com/thu-ml/SageAttention) for attention acceleration

## License

The model weights are released under the [LTX-2 Community License Agreement](https://github.com/Lightricks/LTX-2/blob/main/LICENSE). Scenema Audio's audio diffusion transformer is derived from [LTX 2.3](https://github.com/Lightricks/LTX-2)'s audiovisual model, and its weights are subject to the same terms.

The inference code and server are released under the [MIT License](https://github.com/ScenemaAI/scenema-audio/blob/main/LICENSE).

[Gemma 3 12B](https://ai.google.dev/gemma/terms) (text encoder) is a gated model requiring acceptance of Google's terms of use.