---
language:
  - en
  - de
  - fr
  - es
  - it
  - pt
  - ja
  - zh
  - ko
  - ru
  - ar
  - hi
  - sw
license: other
license_name: ltx-2-community
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
tags:
  - audio-generation
  - diffusion
  - text-to-audio
  - voice-cloning
  - speech-generation
  - expressive-speech
  - voice-acting
  - text-to-speech
pipeline_tag: text-to-speech
library_name: scenema-audio
inference: false
---

# Scenema Audio

**Zero-shot expressive voice cloning and speech generation.**

**[Visit scenema.ai/audio to hear all demos and try it out.](https://scenema.ai/audio)**

**[Watch the demo video on YouTube](https://youtu.be/VnEQ_ImOaAc)**

Most text-to-speech systems convert words into sound without performing them. Scenema Audio generates speech with intention, pacing, breath control, and emotional arcs that shift within a single generation, all from a text prompt that describes not just what to say but how to say it.

Built on an audio diffusion transformer extracted from [LTX 2.3](https://github.com/Lightricks/LTX-2)'s 22B parameter audiovisual model, it learned how people actually sound in real scenes: angry, laughing, whispering, crying, exhausted, terrified.

## Capabilities

- **Emotional acting**: Rage, grief, joy, fear, exhaustion. Emotional state shifts within a single generation via action tags.
- **Child voices**: Six-year-olds, toddlers, teenagers. Naturally voiced, not pitch-shifted adults.
- **Scene-aware audio**: Describe the environment and the model generates speech with rain, thunder, crowds, or any ambient audio alongside the voice.
- **Zero-shot voice cloning**: Provide 10-20 seconds of reference audio with some emotional variability. The model transfers the voice identity onto any emotional performance. No fine-tuning, no enrollment.
- **Long-form narration**: Generates any length of audio by automatically splitting text and maintaining voice continuity across segments.
- **Multilingual**: English, German, French, Spanish, Italian, Portuguese, Japanese, Chinese, Korean, Russian, Arabic, Hindi, Swahili.

## Model Checkpoints

| File | Size | Description |
|------|------|-------------|
| `scenema-audio-transformer.safetensors` | 9.8 GB | Audio diffusion transformer (bf16) |
| `scenema-audio-transformer-int8.safetensors` | 4.9 GB | Audio diffusion transformer (INT8, identical quality) |
| `scenema-audio-pipeline.safetensors` | 6.7 GB | Audio VAE decoder + vocoder + text projection |
| `scenema-audio-vae-encoder.safetensors` | 42.7 MB | Audio VAE encoder for reference voice encoding |

## Quick Start

```bash
git clone https://github.com/ScenemaAI/scenema-audio.git
cd scenema-audio

export HF_TOKEN=your_huggingface_token
docker compose up
```

Models are downloaded on first start (~38 GB) and cached in a Docker volume. See the [GitHub repo](https://github.com/ScenemaAI/scenema-audio) for full documentation.

## Prompt Format

```xml
<speak voice="VOICE_DESCRIPTION" gender="male|female"
       scene="OPTIONAL_SCENE" language="OPTIONAL_LANG_CODE">
  <action>Performance direction.</action>
  Speech text here.
</speak>
```

| Attribute | Required | Default | Description |
|-----------|----------|---------|-------------|
| `voice` | Yes | | Detailed voice description. Drives vocal quality, emotion, accent, age, timbre, delivery style. |
| `gender` | Yes | | `"male"` or `"female"`. Controls pronoun assignment in compiled prompts. |
| `scene` | No | | Environmental context. Conditions the ambient audio around the speech. |
| `language` | No | `"en"` | Language code. |

### Voice Description

The `voice` attribute is the primary control. The richer and more specific the description, the better. Useful things to describe:

- **Vocal qualities**: timbre, pitch, breathiness, rasp, resonance
- **Emotional state**: rage, tenderness, exhaustion, excitement, grief
- **Speaking style**: pacing, emphasis, pauses, enunciation
- **Character archetypes**: "Think Tony Soprano having a breakdown"
- **Age and gender**: child, elderly, young woman, teenage boy
- **Accents**: British, Southern American, New Jersey Italian American

### Action Tags

`<action>` tags are stage directions that shape HOW speech is delivered. Place them between speech segments to direct emotional shifts, pacing, and physical delivery:

```xml
<speak voice="Middle-aged man, warm but weathered." gender="male">
  <action>Calm, almost casual. Staring at his hands.</action>
  I used to think I had all the time in the world.
  <action>Voice tightens. Fighting to stay composed.</action>
  Then one Tuesday morning, the doctor said three words that changed everything.
  <action>Long pause. Deep breath. Raw but steady.</action>
  And I realized I hadn't called my son in six months.
</speak>
```
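
Prompts can also be assembled programmatically. The following is a minimal sketch, assuming the server parses `<speak>` with a standard XML parser so that ordinary entity escaping is accepted. `build_speak_prompt` and its `segments` format are hypothetical helpers for illustration and are not part of the Scenema Audio codebase.

```python
from xml.sax.saxutils import escape, quoteattr


def build_speak_prompt(voice, gender, segments, scene=None, language=None):
    """Assemble a <speak> prompt string (hypothetical helper).

    `segments` is a list of (kind, text) pairs, where kind is
    "action" for a stage direction or "text" for spoken words.
    """
    if gender not in ("male", "female"):
        raise ValueError("gender must be 'male' or 'female'")

    # quoteattr adds the surrounding quotes and escapes &, <, > and quotes.
    attrs = [f"voice={quoteattr(voice)}", f"gender={quoteattr(gender)}"]
    if scene:
        attrs.append(f"scene={quoteattr(scene)}")
    if language:
        attrs.append(f"language={quoteattr(language)}")

    body = []
    for kind, text in segments:
        if kind == "action":
            body.append(f"  <action>{escape(text)}</action>")
        elif kind == "text":
            body.append(f"  {escape(text)}")
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")

    return "<speak " + " ".join(attrs) + ">\n" + "\n".join(body) + "\n</speak>"


prompt = build_speak_prompt(
    voice="Middle-aged man, warm but weathered.",
    gender="male",
    segments=[
        ("action", "Calm, almost casual. Staring at his hands."),
        ("text", "I used to think I had all the time in the world."),
    ],
)
```

Escaping matters mainly for voice descriptions that contain quotes or ampersands, which would otherwise produce malformed XML.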

### Voice Cloning

Provide 10-20 seconds of reference audio with some emotional variability. The model generates expressive speech from the prompt and transfers the reference voice's identity onto the performance.

```json
{
  "prompt": "<speak voice=\"Gravelly male voice, fast talking, rough.\" gender=\"male\"><action>He completely loses it</action>What are you waiting for?!</speak>",
  "reference_voice_url": "https://example.com/reference.wav"
}
```

Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.

## Examples

### Emotional Acting

```xml
<speak voice="A man on the edge. Explosive rage. Italian-American inflection."
       gender="male" scene="A dimly lit office, late at night">
  <action>He stands up slowly, voice dangerously low</action>
  You come into my house, you eat my food, and then you got the nerve
  to tell me how to run my business.
  <action>Voice rising, finger pointing</action>
  I built this thing from nothing while you were sitting on your ass.
</speak>
```

### Child Voice

```xml
<speak voice="A six-year-old girl, bright and excited, speaking fast
with breathless enthusiasm. Slight lisp on S sounds."
       gender="female">
  Mommy look! There is a rainbow and it goes all the way across the whole sky!
</speak>
```

### Scene-Aware Audio

```xml
<speak voice="Male, mid 40s. Weathered. Urgent, projecting over wind."
       gender="male" scene="Open dock in a thunderstorm, heavy rain"
       shot="scene">
  <sound>Heavy rain and wind howling</sound>
  <action>He shouts over the storm</action>
  Get the lines! She is pulling loose!
  <sound>Thunder cracks overhead</sound>
  Move! I said move!
</speak>
```

## API Reference

### POST /generate

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `prompt` | string | **required** | `<speak>` XML string |
| `mode` | string | `"generate"` | `"generate"` runs the full pipeline. `"voice_design"` produces a 15-second voice preview. |
| `reference_voice_url` | string | `null` | URL to reference audio for zero-shot voice cloning. 10-20 seconds with emotional variability is ideal. |
| `background_sfx` | bool | `false` | Keep generated sound effects in the output. |
| `validate` | bool | `true` | Whisper speech validation with retry on garbled output. |
| `seed` | int | `-1` | Generation seed. `-1` for random. |
| `pace` | float | `1.5` | Duration allocation multiplier. Higher = slower speech. |
| `min_match_ratio` | float | `0.90` | Whisper validation threshold (0.0-1.0). |
| `skip_vc` | bool | `false` | Skip voice conversion post-processing. |
| `vc_steps` | int | `25` | SeedVC diffusion steps (10-50). |
| `vc_cfg_rate` | float | `0.5` | SeedVC guidance rate (0.0-1.0). |

### Response

Returns JSON with base64-encoded WAV audio:

```json
{
  "status": "succeeded",
  "audio": "<base64-encoded WAV>",
  "content_type": "audio/wav",
  "metadata": {
    "duration_s": 12.4,
    "sample_rate": 48000,
    "processing_ms": 8200,
    "seed": 42
  }
}
```
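
Because the audio arrives as base64 text inside JSON, it has to be decoded before it can be played or saved. A minimal sketch, assuming the response has the shape shown above; `decode_response_audio` is a hypothetical helper. The shape of error responses is not documented here, so any status other than `"succeeded"` is treated as a failure.

```python
import base64


def decode_response_audio(response):
    """Return (wav_bytes, metadata) from a parsed /generate response."""
    status = response.get("status")
    if status != "succeeded":
        raise RuntimeError(f"generation did not succeed: {status!r}")

    wav_bytes = base64.b64decode(response["audio"])

    # A RIFF/WAVE file starts with b"RIFF", a 4-byte size, then b"WAVE".
    if wav_bytes[:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
        raise ValueError("decoded payload is not a RIFF/WAVE file")

    return wav_bytes, response.get("metadata", {})
```

The returned bytes can be written directly to disk, for example with `pathlib.Path("output.wav").write_bytes(wav_bytes)`.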

## Architecture

```
XML prompt (voice + scene + action tags + text)
  -> Gemma 3 12B text encoding
  -> 8-step distilled latent diffusion
  -> Audio VAE decoding
  -> MelBandRoFormer vocal separation (strips SFX unless background_sfx=true)
  -> SeedVC voice identity transfer (when reference provided or multi-chunk)
  -> Output WAV (48 kHz stereo)
```

For longer text, the system splits at sentence boundaries using Kokoro phoneme-level duration estimation and maintains voice continuity between segments via A2V latent conditioning.
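
The splitting idea can be illustrated with a rough sketch. This is not the project's implementation: it replaces Kokoro's phoneme-level duration estimate with an assumed constant speaking rate, and `estimate_seconds` and `split_into_segments` are hypothetical names. Only the ~15-second segment cap comes from this document.

```python
import re

MAX_SEGMENT_SECONDS = 15.0  # per-segment cap stated under Limitations
WORDS_PER_SECOND = 2.5      # assumed speaking rate, a crude stand-in for Kokoro


def estimate_seconds(text):
    """Very rough duration estimate based on word count."""
    return len(text.split()) / WORDS_PER_SECOND


def split_into_segments(text, max_seconds=MAX_SEGMENT_SECONDS):
    """Greedily pack whole sentences into segments under the duration cap."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    segments, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        if current and estimate_seconds(" ".join(candidate)) > max_seconds:
            # Adding this sentence would exceed the cap: close the segment.
            segments.append(" ".join(current))
            current = [sentence]
        else:
            current = candidate
    if current:
        segments.append(" ".join(current))
    return segments
```

A single sentence longer than the cap is kept whole in this sketch. A real splitter would need a fallback for that case, such as breaking at clause boundaries.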

## VRAM Requirements

| VRAM | Audio Model | Gemma | Notes |
|------|------------|-------|-------|
| 16 GB | INT8 (4.9 GB) | CPU streaming | Needs 32 GB system RAM. ~7s/chunk encode. |
| 24 GB | INT8 (4.9 GB) | NF4 on GPU (~8 GB) | Default config. ~0.2s/chunk encode. |
| 48 GB | bf16 (9.8 GB) | bf16 on GPU (24 GB) | Best quality. All models resident. |

The VRAM strategy is auto-detected at startup. [SageAttention 2](https://github.com/thu-ml/SageAttention) is recommended for all configurations.
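
The table can be read as a simple tier lookup. The sketch below is an illustration only: the actual detection logic is not described here, `pick_vram_profile` is a hypothetical name, and treating the three listed sizes as lower bounds for each tier is an assumption. It takes the nominal card size, whereas memory reported by a driver is usually slightly lower, so real detection would need some tolerance.

```python
def pick_vram_profile(vram_gb):
    """Map a nominal GPU memory size in GB to a row of the table above."""
    if vram_gb >= 48:
        return {"audio_model": "bf16", "gemma": "bf16 on GPU"}
    if vram_gb >= 24:
        return {"audio_model": "INT8", "gemma": "NF4 on GPU"}
    if vram_gb >= 16:
        return {"audio_model": "INT8", "gemma": "CPU streaming"}
    raise ValueError(f"{vram_gb} GB is below the smallest listed configuration (16 GB)")


profile = pick_vram_profile(24)
# profile == {"audio_model": "INT8", "gemma": "NF4 on GPU"}
```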

## Performance

Benchmarked on an NVIDIA RTX 4090 (24 GB) generating ~55 seconds of output audio:

| Configuration | Total Time | Real-Time Factor |
|--------------|-----------|-----------------|
| bf16 + bf16 streaming | 83s | 0.66x |
| INT8 + NF4 (all GPU) | 35s | 1.57x |

## Limitations

- **Pronunciation**: Occasionally garbles complex multi-syllable words and proper nouns.
- **15-second generation window**: Each segment capped at ~15s. Longer text splits automatically.
- **Emotional range with voice cloning**: Identity transfer can reduce emotional extremes. Use a strong archetype in the voice description and provide reference audio with natural emotional variability (10-20 seconds, not monotone).
- **Multilingual pronunciation**: Language switching mid-speech may cause phonetic drift. Use separate requests per language.
- **Generation speed**: 3-8 seconds per 15-second segment depending on hardware.
- **Reference audio quality**: Low-quality references degrade output. Use clean audio with some emotional variability.
- **Gemma 3 12B is gated**: Requires accepting Google's terms of use and a Hugging Face token with access.

## Acknowledgments

- [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks for the base audiovisual model
- [Gemma 3](https://ai.google.dev/gemma) by Google for the text encoder
- [SeedVC](https://github.com/Plachtaa/seed-vc) by Plachta for voice refinement
- [Kokoro](https://github.com/hexgrad/kokoro) by hexgrad for duration estimation
- [SageAttention](https://github.com/thu-ml/SageAttention) for attention acceleration

## License

The model weights are released under the [LTX-2 Community License Agreement](https://github.com/Lightricks/LTX-2/blob/main/LICENSE). Scenema Audio's audio diffusion transformer is derived from [LTX 2.3](https://github.com/Lightricks/LTX-2)'s audiovisual model, and its weights are subject to the same terms.

The inference code and server are released under the [MIT License](https://github.com/ScenemaAI/scenema-audio/blob/main/LICENSE).

[Gemma 3 12B](https://ai.google.dev/gemma/terms) (text encoder) is a gated model requiring acceptance of Google's terms of use.