---
language:
- en
- de
- fr
- es
- it
- pt
- ja
- zh
- ko
- ru
- ar
- hi
- sw
license: other
license_name: ltx-2-community
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
tags:
- audio-generation
- diffusion
- text-to-audio
- voice-cloning
- speech-generation
- expressive-speech
- voice-acting
- text-to-speech
pipeline_tag: text-to-speech
library_name: scenema-audio
inference: false
---
# Scenema Audio
**Zero-shot expressive voice cloning and speech generation.**
**[Visit scenema.ai/audio to hear all demos and try it out.](https://scenema.ai/audio)**
**[Watch the demo video on YouTube](https://youtu.be/VnEQ_ImOaAc)**
Most text-to-speech systems convert words into sound without performing them. Scenema Audio generates speech with intention, pacing, breath control, and emotional arcs that shift within a single generation, all from a text prompt that describes not just what to say but how to say it.
Built on an audio diffusion transformer extracted from [LTX 2.3](https://github.com/Lightricks/LTX-2)'s 22B parameter audiovisual model, it learned how people actually sound in real scenes: angry, laughing, whispering, crying, exhausted, terrified.
## Capabilities
- **Emotional acting**: Rage, grief, joy, fear, exhaustion. Emotional state shifts within a single generation via action tags.
- **Child voices**: Six-year-olds, toddlers, teenagers. Naturally voiced, not pitch-shifted adults.
- **Scene-aware audio**: Describe the environment and the model generates speech with rain, thunder, crowds, or any ambient audio alongside the voice.
- **Zero-shot voice cloning**: Provide 10-20 seconds of reference audio with some emotional variability. The model transfers the voice identity onto any emotional performance. No fine-tuning, no enrollment.
- **Long-form narration**: Generates any length of audio by automatically splitting text and maintaining voice continuity across segments.
- **Multilingual**: English, German, French, Spanish, Italian, Portuguese, Japanese, Chinese, Korean, Russian, Arabic, Hindi, Swahili.
## Model Checkpoints
| File | Size | Description |
|------|------|-------------|
| `scenema-audio-transformer.safetensors` | 9.8 GB | Audio diffusion transformer (bf16) |
| `scenema-audio-transformer-int8.safetensors` | 4.9 GB | Audio diffusion transformer (INT8, identical quality) |
| `scenema-audio-pipeline.safetensors` | 6.7 GB | Audio VAE decoder + vocoder + text projection |
| `scenema-audio-vae-encoder.safetensors` | 42.7 MB | Audio VAE encoder for reference voice encoding |
## Quick Start
```bash
git clone https://github.com/ScenemaAI/scenema-audio.git
cd scenema-audio
export HF_TOKEN=your_huggingface_token
docker compose up
```
Models are downloaded on first start (~38 GB) and cached in a Docker volume. See the [GitHub repo](https://github.com/ScenemaAI/scenema-audio) for full documentation.
## Prompt Format
```xml
<speak voice="VOICE_DESCRIPTION" gender="male|female"
scene="OPTIONAL_SCENE" language="OPTIONAL_LANG_CODE">
<action>Performance direction.</action>
Speech text here.
</speak>
```
| Attribute | Required | Default | Description |
|-----------|----------|---------|-------------|
| `voice` | Yes | | Detailed voice description. Drives vocal quality, emotion, accent, age, timbre, and delivery style. |
| `gender` | Yes | | `"male"` or `"female"`. Controls pronoun assignment in compiled prompts. |
| `scene` | No | | Environmental context. Conditions the ambient audio around the speech. |
| `language` | No | `"en"` | Language code. |
### Voice Description
The `voice` attribute is the primary control. The richer and more specific the description, the better. Useful elements include:
- **Vocal qualities**: timbre, pitch, breathiness, rasp, resonance
- **Emotional state**: rage, tenderness, exhaustion, excitement, grief
- **Speaking style**: pacing, emphasis, pauses, enunciation
- **Character archetypes**: "Think Tony Soprano having a breakdown"
- **Age and gender**: child, elderly, young woman, teenage boy
- **Accents**: British, Southern American, New Jersey Italian American
### Action Tags
`<action>` tags are stage directions that shape HOW speech is delivered. Place them between speech segments to direct emotional shifts, pacing, and physical delivery:
```xml
<speak voice="Middle-aged man, warm but weathered." gender="male">
<action>Calm, almost casual. Staring at his hands.</action>
I used to think I had all the time in the world.
<action>Voice tightens. Fighting to stay composed.</action>
Then one Tuesday morning, the doctor said three words that changed everything.
<action>Long pause. Deep breath. Raw but steady.</action>
And I realized I hadn't called my son in six months.
</speak>
```
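Hand-written prompts are easy to break with an unescaped `&`, `<`, or quote inside an attribute. The sketch below assembles a `<speak>` string from the documented attributes and escapes the text. `build_speak_prompt` and its `segments` argument are hypothetical names for illustration and are not part of the Scenema Audio codebase.

```python
from xml.sax.saxutils import escape, quoteattr


def build_speak_prompt(voice, gender, segments, scene=None, language=None):
    """Assemble a <speak> prompt string (hypothetical helper).

    `segments` is a list of (kind, text) pairs, kind being "action" or "speech".
    """
    if gender not in ("male", "female"):
        raise ValueError("gender must be 'male' or 'female'")

    attrs = [("voice", voice), ("gender", gender)]
    if scene is not None:
        attrs.append(("scene", scene))
    if language is not None:
        attrs.append(("language", language))

    # quoteattr escapes &, <, > and chooses a safe quote style for the value.
    attr_text = " ".join(f"{name}={quoteattr(value)}" for name, value in attrs)

    lines = [f"<speak {attr_text}>"]
    for kind, text in segments:
        if kind == "action":
            lines.append(f"<action>{escape(text)}</action>")
        elif kind == "speech":
            lines.append(escape(text))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    lines.append("</speak>")
    return "\n".join(lines)


prompt = build_speak_prompt(
    voice="Middle-aged man, warm but weathered.",
    gender="male",
    segments=[
        ("action", "Calm, almost casual."),
        ("speech", "I used to think I had all the time in the world."),
    ],
)
print(prompt)
```

Keeping actions and speech as an ordered list mirrors how the tags alternate in the prompt, so an emotional shift is just another list entry.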
### Voice Cloning
Provide 10-20 seconds of reference audio with some emotional variability. The model generates expressive speech from the prompt and transfers the reference voice's identity onto the performance.
```json
{
  "prompt": "<speak voice=\"Gravelly male voice, fast talking, rough.\" gender=\"male\"><action>He completely loses it</action>What are you waiting for?!</speak>",
  "reference_voice_url": "https://example.com/reference.wav"
}
```
Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.
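Because the prompt is XML embedded in a JSON string, every inner double quote has to be escaped, as in the request above. Letting a JSON serializer do the escaping is less error-prone. A minimal Python sketch, reusing the placeholder reference URL from the example:

```python
import json

prompt = (
    '<speak voice="Gravelly male voice, fast talking, rough." gender="male">'
    "<action>He completely loses it</action>"
    "What are you waiting for?!"
    "</speak>"
)

payload = {
    "prompt": prompt,
    "reference_voice_url": "https://example.com/reference.wav",  # placeholder
}

# json.dumps escapes the inner double quotes as \" automatically.
body = json.dumps(payload, indent=2)
print(body)
```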
## Examples
### Emotional Acting
```xml
<speak voice="A man on the edge. Explosive rage. Italian-American inflection."
gender="male" scene="A dimly lit office, late at night">
<action>He stands up slowly, voice dangerously low</action>
You come into my house, you eat my food, and then you got the nerve
to tell me how to run my business.
<action>Voice rising, finger pointing</action>
I built this thing from nothing while you were sitting on your ass.
</speak>
```
### Child Voice
```xml
<speak voice="A six-year-old girl, bright and excited, speaking fast
with breathless enthusiasm. Slight lisp on S sounds."
gender="female">
Mommy look! There is a rainbow and it goes all the way across the whole sky!
</speak>
```
### Scene-Aware Audio
```xml
<speak voice="Male, mid 40s. Weathered. Urgent, projecting over wind."
gender="male" scene="Open dock in a thunderstorm, heavy rain"
shot="scene">
<sound>Heavy rain and wind howling</sound>
<action>He shouts over the storm</action>
Get the lines! She is pulling loose!
<sound>Thunder cracks overhead</sound>
Move! I said move!
</speak>
```
## API Reference
### POST /generate
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `prompt` | string | **required** | `<speak>` XML string |
| `mode` | string | `"generate"` | `"generate"` for full pipeline. `"voice_design"` for 15s voice preview. |
| `reference_voice_url` | string | `null` | URL to reference audio for zero-shot voice cloning. 10-20 seconds with emotional variability is ideal. |
| `background_sfx` | bool | `false` | Keep generated sound effects in the output. |
| `validate` | bool | `true` | Whisper speech validation with retry on garbled output. |
| `seed` | int | `-1` | Generation seed. `-1` for random. |
| `pace` | float | `1.5` | Duration allocation multiplier. Higher = slower speech. |
| `min_match_ratio` | float | `0.90` | Whisper validation threshold (0.0-1.0). |
| `skip_vc` | bool | `false` | Skip voice conversion post-processing. |
| `vc_steps` | int | `25` | SeedVC diffusion steps (10-50). |
| `vc_cfg_rate` | float | `0.5` | SeedVC guidance rate (0.0-1.0). |
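A rough client sketch using only the Python standard library is shown below. It checks the ranges listed in the table before sending, and assumes the endpoint accepts a JSON body. The function names are hypothetical, and the server address in the commented call is an assumption because this README does not state a host or port.

```python
import json
import urllib.request

# Ranges taken from the field table above.
RANGES = {
    "min_match_ratio": (0.0, 1.0),
    "vc_steps": (10, 50),
    "vc_cfg_rate": (0.0, 1.0),
}


def build_request_body(prompt, **options):
    """Return a dict for POST /generate, checking the documented ranges."""
    if not prompt.lstrip().startswith("<speak"):
        raise ValueError("prompt must be a <speak> XML string")
    if options.get("mode", "generate") not in ("generate", "voice_design"):
        raise ValueError("mode must be 'generate' or 'voice_design'")
    for name, (low, high) in RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    return {"prompt": prompt, **options}


def generate(base_url, body, timeout=600):
    """POST the body to {base_url}/generate and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


body = build_request_body(
    '<speak voice="Warm narrator." gender="female">Hello there.</speak>',
    seed=42,
    pace=1.5,
    vc_steps=25,
)
# result = generate("http://localhost:8000", body)  # address is an assumption
```

Fixing `seed` to a value other than `-1` is useful when comparing the effect of `pace` or `vc_cfg_rate` on otherwise identical requests.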
### Response
Returns JSON with base64-encoded WAV audio:
```json
{
  "status": "succeeded",
  "audio": "<base64-encoded WAV>",
  "content_type": "audio/wav",
  "metadata": {
    "duration_s": 12.4,
    "sample_rate": 48000,
    "processing_ms": 8200,
    "seed": 42
  }
}
```
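Turning the response into a playable file only requires decoding the `audio` field. The sketch below shows this with a stand-in response built locally, so it runs without a server. The helper names are hypothetical, and the 16-bit sample width of the stand-in is an assumption, since the README only states 48 kHz stereo.

```python
import base64
import io
import wave


def decode_audio(response):
    """Return the raw WAV bytes from a /generate response dict."""
    if response.get("status") != "succeeded":
        raise RuntimeError(f"generation did not succeed: {response.get('status')}")
    return base64.b64decode(response["audio"])


def save_audio(response, path):
    """Decode the response and write the WAV bytes to `path`."""
    with open(path, "wb") as handle:
        handle.write(decode_audio(response))


# Stand-in response for demonstration: 0.1 s of 48 kHz stereo silence.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)  # 16-bit samples, an assumption for this stand-in
    wav.setframerate(48000)
    wav.writeframes(b"\x00" * 4 * 4800)  # 4 bytes per stereo frame

fake_response = {
    "status": "succeeded",
    "audio": base64.b64encode(buffer.getvalue()).decode("ascii"),
    "content_type": "audio/wav",
    "metadata": {"duration_s": 0.1, "sample_rate": 48000},
}

wav_bytes = decode_audio(fake_response)
print(len(wav_bytes), "bytes of WAV data")
# save_audio(fake_response, "output.wav")  # writes the file to disk
```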
## Architecture
```
XML prompt (voice + scene + action tags + text)
-> Gemma 3 12B text encoding
-> 8-step distilled latent diffusion
-> Audio VAE decoding
-> MelBandRoFormer vocal separation (strips SFX unless background_sfx=true)
-> SeedVC voice identity transfer (when reference provided or multi-chunk)
-> Output WAV (48kHz stereo)
```
For longer text, the system splits at sentence boundaries using Kokoro phoneme-level duration estimation and maintains voice continuity between segments via A2V latent conditioning.
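To make the splitting step concrete, here is a simplified sketch of sentence-boundary chunking. It is not the project's implementation: it swaps the Kokoro phoneme-level estimate for a crude words-per-second guess (the 2.5 words/s rate is an assumption) and uses a regular expression to find sentence ends. All names are hypothetical.

```python
import re

MAX_SEGMENT_S = 15.0    # generation window stated under Limitations
WORDS_PER_SECOND = 2.5  # assumed speaking rate, a stand-in for Kokoro estimates


def estimate_seconds(text):
    """Rough duration guess from the word count."""
    return len(text.split()) / WORDS_PER_SECOND


def split_into_segments(text, max_seconds=MAX_SEGMENT_S):
    """Greedily pack whole sentences into segments of at most max_seconds.

    A single sentence longer than max_seconds is kept whole in this sketch.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate) > max_seconds:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


text = " ".join(f"This is sentence number {i} of the narration." for i in range(1, 11))
segments = split_into_segments(text)
for segment in segments:
    print(f"{estimate_seconds(segment):5.1f}s  {segment}")
```

Greedy packing keeps segments close to the window size, which reduces the number of boundaries where voice continuity has to be maintained.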
## VRAM Requirements
| VRAM | Audio Model | Gemma | Notes |
|------|------------|-------|-------|
| 16 GB | INT8 (4.9 GB) | CPU streaming | Needs 32 GB system RAM. ~7s/chunk encode. |
| 24 GB | INT8 (4.9 GB) | NF4 on GPU (~8 GB) | Default config. ~0.2s/chunk encode. |
| 48 GB | bf16 (9.8 GB) | bf16 on GPU (24 GB) | Best quality. All models resident. |
VRAM strategy is auto-detected. [SageAttention 2](https://github.com/thu-ml/SageAttention) recommended for all configurations.
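The table can be read as a simple tier lookup. The sketch below is only an illustration of that reading. The real auto-detection logic and its exact cutoffs are not documented here, so the thresholds and the `pick_config` name are assumptions.

```python
def pick_config(vram_gb):
    """Map available VRAM in GB to a row of the table above (illustrative only)."""
    if vram_gb >= 48:
        return {"audio_model": "bf16", "gemma": "bf16 on GPU"}
    if vram_gb >= 24:
        return {"audio_model": "INT8", "gemma": "NF4 on GPU"}
    if vram_gb >= 16:
        return {"audio_model": "INT8", "gemma": "CPU streaming"}
    raise ValueError("the table lists no configuration below 16 GB of VRAM")


for vram in (16, 24, 48):
    print(vram, "GB ->", pick_config(vram))
```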
## Performance
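Benchmarked on an NVIDIA RTX 4090 (24 GB) generating ~55 seconds of output audio. Real-time factor is output duration divided by total time, so values above 1x are faster than real time: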
| Configuration | Total Time | Real-Time Factor |
|--------------|-----------|-----------------|
| bf16 + bf16 streaming | 83s | 0.66x |
| INT8 + NF4 (all GPU) | 35s | 1.57x |
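The factors in the table can be reproduced from the ~55 s output length and the total times. A quick check in Python (the function name is illustrative):

```python
AUDIO_SECONDS = 55  # approximate output length from the benchmark


def real_time_factor(audio_seconds, wall_seconds):
    """Seconds of audio produced per second of processing."""
    return audio_seconds / wall_seconds


for name, total_seconds in [("bf16 + bf16 streaming", 83), ("INT8 + NF4 (all GPU)", 35)]:
    print(f"{name}: {real_time_factor(AUDIO_SECONDS, total_seconds):.2f}x")
```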
## Limitations
- **Pronunciation**: Occasionally garbles complex multi-syllable words and proper nouns.
- **15-second generation window**: Each segment capped at ~15s. Longer text splits automatically.
- **Emotional range with voice cloning**: Identity transfer can reduce emotional extremes. Use a strong archetype in the voice description and provide reference audio with natural emotional variability (10-20 seconds, not monotone).
- **Multilingual pronunciation**: Language switching mid-speech may cause phonetic drift. Use separate requests per language.
- **Generation speed**: 3-8 seconds per 15-second segment depending on hardware.
- **Reference audio quality**: Low-quality references degrade output. Use clean audio with some emotional variability.
- **Gemma 3 12B is gated**: Requires accepting Google's terms of use and a HuggingFace token with access.
## Acknowledgments
- [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks for the base audiovisual model
- [Gemma 3](https://ai.google.dev/gemma) by Google for the text encoder
- [SeedVC](https://github.com/Plachtaa/seed-vc) by Plachta for voice refinement
- [Kokoro](https://github.com/hexgrad/kokoro) by hexgrad for duration estimation
- [SageAttention](https://github.com/thu-ml/SageAttention) for attention acceleration
## License
The model weights are released under the [LTX-2 Community License Agreement](https://github.com/Lightricks/LTX-2/blob/main/LICENSE). Scenema Audio's audio diffusion transformer is derived from [LTX 2.3](https://github.com/Lightricks/LTX-2)'s audiovisual model, and its weights are subject to the same terms.
The inference code and server are released under the [MIT License](https://github.com/ScenemaAI/scenema-audio/blob/main/LICENSE).
[Gemma 3 12B](https://ai.google.dev/gemma/terms) (text encoder) is a gated model requiring acceptance of Google's terms of use.