---
language:
- en
- de
- fr
- es
- it
- pt
- ja
- zh
- ko
- ru
- ar
- hi
- sw
license: other
license_name: ltx-2-community
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
tags:
- audio-generation
- diffusion
- text-to-audio
- voice-cloning
- speech-generation
- expressive-speech
- voice-acting
- text-to-speech
pipeline_tag: text-to-speech
library_name: scenema-audio
inference: false
---

# Scenema Audio

**Zero-shot expressive voice cloning and speech generation.**

**[Visit scenema.ai/audio to hear all demos and try it out.](https://scenema.ai/audio)**

**[Watch the demo video on YouTube](https://youtu.be/VnEQ_ImOaAc)**

Every existing text-to-speech system converts words into sound, but none of them perform. Scenema Audio generates speech with intention, pacing, breath control, and emotional arcs that shift within a single generation, all from a text prompt that describes not just what to say but how to say it.

Built on an audio diffusion transformer extracted from [LTX 2.3](https://github.com/Lightricks/LTX-2)'s 22B parameter audiovisual model, it learned how people actually sound in real scenes: angry, laughing, whispering, crying, exhausted, terrified.

## Capabilities

- **Emotional acting**: Rage, grief, joy, fear, exhaustion. Emotional state shifts within a single generation via action tags.
- **Child voices**: Six-year-olds, toddlers, teenagers. Naturally voiced, not pitch-shifted adults.
- **Scene-aware audio**: Describe the environment and the model generates speech with rain, thunder, crowds, or any ambient audio alongside the voice.
- **Zero-shot voice cloning**: Provide 10-20 seconds of reference audio with some emotional variability. The model transfers the voice identity onto any emotional performance. No fine-tuning, no enrollment.
- **Long-form narration**: Generates any length of audio by automatically splitting text and maintaining voice continuity across segments.
- **Multilingual**: English, German, French, Spanish, Italian, Portuguese, Japanese, Chinese, Korean, Russian, Arabic, Hindi, Swahili.

## Model Checkpoints

| File | Size | Description |
|------|------|-------------|
| `scenema-audio-transformer.safetensors` | 9.8 GB | Audio diffusion transformer (bf16) |
| `scenema-audio-transformer-int8.safetensors` | 4.9 GB | Audio diffusion transformer (INT8, identical quality) |
| `scenema-audio-pipeline.safetensors` | 6.7 GB | Audio VAE decoder + vocoder + text projection |
| `scenema-audio-vae-encoder.safetensors` | 42.7 MB | Audio VAE encoder for reference voice encoding |

## Quick Start

```bash
git clone https://github.com/ScenemaAI/scenema-audio.git
cd scenema-audio

export HF_TOKEN=your_huggingface_token
docker compose up
```

Models are downloaded on first start (~38 GB) and cached in a Docker volume. See the [GitHub repo](https://github.com/ScenemaAI/scenema-audio) for full documentation.

## Prompt Format

```xml
<speak voice="VOICE_DESCRIPTION" gender="male|female"
       scene="OPTIONAL_SCENE" language="OPTIONAL_LANG_CODE">
  <action>Performance direction.</action>
  Speech text here.
</speak>
```

| Attribute | Required | Default | Description |
|-----------|----------|---------|-------------|
| `voice` | Yes | | Detailed voice description. Drives vocal quality, emotion, accent, age, timbre, delivery style. |
| `gender` | Yes | | `"male"` or `"female"`. Controls pronoun assignment in compiled prompts. |
| `scene` | No | | Environmental context. Conditions the ambient audio around the speech. |
| `language` | No | `"en"` | Language code. |

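
A prompt can be checked against the attribute table before it is sent. The sketch below is a minimal client-side validator built on Python's standard library. The name `check_speak_prompt` is hypothetical and not part of Scenema Audio, and the server may apply different or additional rules.

```python
import xml.etree.ElementTree as ET

REQUIRED = ("voice", "gender")        # marked "Yes" in the attribute table
ALLOWED_GENDERS = ("male", "female")  # the two documented values


def check_speak_prompt(prompt: str) -> list[str]:
    """Return a list of problems found in a <speak> prompt (empty if none).

    Hypothetical helper for illustration; not part of the Scenema Audio API.
    """
    try:
        root = ET.fromstring(prompt)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    if root.tag != "speak":
        problems.append(f"root element is <{root.tag}>, expected <speak>")
    for name in REQUIRED:
        if not root.get(name, "").strip():
            problems.append(f"missing required attribute: {name}")
    gender = root.get("gender", "").strip()
    if gender and gender not in ALLOWED_GENDERS:
        problems.append(f"gender must be one of {ALLOWED_GENDERS}, got {gender!r}")
    return problems
```

An empty list means the prompt is well-formed and carries both required attributes; it says nothing about how well the voice description will perform.
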
### Voice Description

The `voice` attribute is the primary control. The richer and more specific, the better:

- **Vocal qualities**: timbre, pitch, breathiness, rasp, resonance
- **Emotional state**: rage, tenderness, exhaustion, excitement, grief
- **Speaking style**: pacing, emphasis, pauses, enunciation
- **Character archetypes**: "Think Tony Soprano having a breakdown"
- **Age and gender**: child, elderly, young woman, teenage boy
- **Accents**: British, Southern American, New Jersey Italian American

### Action Tags

`<action>` tags are stage directions that shape HOW speech is delivered. Place them between speech segments to direct emotional shifts, pacing, and physical delivery:

```xml
<speak voice="Middle-aged man, warm but weathered." gender="male">
  <action>Calm, almost casual. Staring at his hands.</action>
  I used to think I had all the time in the world.
  <action>Voice tightens. Fighting to stay composed.</action>
  Then one Tuesday morning, the doctor said three words that changed everything.
  <action>Long pause. Deep breath. Raw but steady.</action>
  And I realized I hadn't called my son in six months.
</speak>
```

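
When prompts are assembled in code, interleaving directions and speech by hand invites escaping mistakes. The sketch below builds a `<speak>` prompt from `(direction, text)` pairs using `xml.etree.ElementTree`, which escapes `&`, `<`, `>` and quotes automatically. `build_speak_prompt` is a hypothetical helper, not something the project ships.

```python
import xml.etree.ElementTree as ET


def build_speak_prompt(voice, gender, segments, scene=None, language=None):
    """Assemble a <speak> prompt from (direction, text) pairs.

    Hypothetical helper. `direction` may be None for speech that follows
    the previous segment without a new stage direction.
    """
    attrs = {"voice": voice, "gender": gender}
    if scene:
        attrs["scene"] = scene
    if language:
        attrs["language"] = language
    root = ET.Element("speak", attrs)

    last = None  # most recently added <action> element
    for direction, text in segments:
        if direction is not None:
            last = ET.SubElement(root, "action")
            last.text = direction
            last.tail = text  # speech that follows this direction
        elif last is None:
            root.text = (root.text or "") + text
        else:
            last.tail = (last.tail or "") + " " + text
    return ET.tostring(root, encoding="unicode")


prompt = build_speak_prompt(
    "Middle-aged man, warm but weathered.",
    "male",
    [
        ("Calm, almost casual. Staring at his hands.",
         "I used to think I had all the time in the world."),
        ("Voice tightens. Fighting to stay composed.",
         "Then one Tuesday morning, the doctor said three words that changed everything."),
    ],
)
```

The result is a single-line XML string, which is convenient to embed in a JSON request body.
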
### Voice Cloning

Provide 10-20 seconds of reference audio with some emotional variability. The model generates expressive speech from the prompt and transfers the reference voice's identity onto the performance.

```json
{
  "prompt": "<speak voice=\"Gravelly male voice, fast talking, rough.\" gender=\"male\"><action>He completely loses it</action>What are you waiting for?!</speak>",
  "reference_voice_url": "https://example.com/reference.wav"
}
```

Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.

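
Because the prompt is XML embedded in a JSON string, every inner double quote has to be escaped. A small sketch of letting `json.dumps` do that instead of writing the backslashes by hand; the reference URL is the same placeholder used in the voice cloning example and should be replaced with your own clip.

```python
import json

prompt = (
    '<speak voice="Gravelly male voice, fast talking, rough." gender="male">'
    "<action>He completely loses it</action>What are you waiting for?!</speak>"
)

payload = {
    "prompt": prompt,
    # Placeholder URL; point this at your own 10-20 second reference clip.
    "reference_voice_url": "https://example.com/reference.wav",
}

body = json.dumps(payload)  # inner double quotes are emitted as \"
```
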
## Examples

### Emotional Acting

```xml
<speak voice="A man on the edge. Explosive rage. Italian-American inflection."
       gender="male" scene="A dimly lit office, late at night">
  <action>He stands up slowly, voice dangerously low</action>
  You come into my house, you eat my food, and then you got the nerve
  to tell me how to run my business.
  <action>Voice rising, finger pointing</action>
  I built this thing from nothing while you were sitting on your ass.
</speak>
```

### Child Voice

```xml
<speak voice="A six-year-old girl, bright and excited, speaking fast
              with breathless enthusiasm. Slight lisp on S sounds."
       gender="female">
  Mommy look! There is a rainbow and it goes all the way across the whole sky!
</speak>
```

### Scene-Aware Audio

```xml
<speak voice="Male, mid 40s. Weathered. Urgent, projecting over wind."
       gender="male" scene="Open dock in a thunderstorm, heavy rain"
       shot="scene">
  <sound>Heavy rain and wind howling</sound>
  <action>He shouts over the storm</action>
  Get the lines! She is pulling loose!
  <sound>Thunder cracks overhead</sound>
  Move! I said move!
</speak>
```

## API Reference

### POST /generate

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `prompt` | string | **required** | `<speak>` XML string |
| `mode` | string | `"generate"` | `"generate"` for full pipeline. `"voice_design"` for 15s voice preview. |
| `reference_voice_url` | string | `null` | URL to reference audio for zero-shot voice cloning. 10-20 seconds with emotional variability is ideal. |
| `background_sfx` | bool | `false` | Keep generated sound effects in the output. |
| `validate` | bool | `true` | Whisper speech validation with retry on garbled output. |
| `seed` | int | `-1` | Generation seed. `-1` for random. |
| `pace` | float | `1.5` | Duration allocation multiplier. Higher = slower speech. |
| `min_match_ratio` | float | `0.90` | Whisper validation threshold (0.0-1.0). |
| `skip_vc` | bool | `false` | Skip voice conversion post-processing. |
| `vc_steps` | int | `25` | SeedVC diffusion steps (10-50). |
| `vc_cfg_rate` | float | `0.5` | SeedVC guidance rate (0.0-1.0). |

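
A minimal client sketch for `POST /generate` using only the standard library. The base URL and port are assumptions (this README does not state where the Docker service listens, so check the repo's compose file), and `build_generate_body` and `generate` are hypothetical helper names. The range checks mirror the ranges listed in the table; whether the server enforces them the same way is not confirmed here.

```python
import json
import urllib.request

# Assumption: adjust to wherever your container actually listens.
BASE_URL = "http://localhost:8000"

# Documented ranges from the field table: (min, max), inclusive.
RANGES = {
    "min_match_ratio": (0.0, 1.0),
    "vc_steps": (10, 50),
    "vc_cfg_rate": (0.0, 1.0),
}


def build_generate_body(prompt: str, **options) -> bytes:
    """Check documented ranges and return a JSON-encoded request body."""
    for name, (low, high) in RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    return json.dumps({"prompt": prompt, **options}).encode("utf-8")


def generate(prompt: str, **options) -> dict:
    """POST the request to /generate and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{BASE_URL}/generate",
        data=build_generate_body(prompt, **options),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Fields left out of `options` fall back to the server-side defaults in the table, for example `generate(prompt, seed=42, background_sfx=True)`.
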
### Response

Returns JSON with base64-encoded WAV audio:

```json
{
  "status": "succeeded",
  "audio": "<base64-encoded WAV>",
  "content_type": "audio/wav",
  "metadata": {
    "duration_s": 12.4,
    "sample_rate": 48000,
    "processing_ms": 8200,
    "seed": 42
  }
}
```

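
To turn the response into a playable file, the `audio` field has to be base64-decoded. A minimal sketch follows, with hypothetical helper names. Only the `"succeeded"` status appears in the response example, so treating every other value as an error is an assumption.

```python
import base64
from pathlib import Path


def decode_audio(response: dict) -> bytes:
    """Return the raw WAV bytes from a /generate response."""
    status = response.get("status")
    if status != "succeeded":
        # Assumption: any status other than "succeeded" carries no audio.
        raise RuntimeError(f"generation did not succeed: {status!r}")
    return base64.b64decode(response["audio"])


def save_audio(response: dict, path: str) -> int:
    """Write the decoded audio to `path` and return the number of bytes written."""
    return Path(path).write_bytes(decode_audio(response))
```

For example, `save_audio(result, "line_01.wav")` writes the clip, and `result["metadata"]["seed"]` can be stored alongside it to reproduce the take.
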
## Architecture

```
XML prompt (voice + scene + action tags + text)
  -> Gemma 3 12B text encoding
  -> 8-step distilled latent diffusion
  -> Audio VAE decoding
  -> MelBandRoFormer vocal separation (strips SFX unless background_sfx=true)
  -> SeedVC voice identity transfer (when reference provided or multi-chunk)
  -> Output WAV (48 kHz stereo)
```

For longer text, the system splits at sentence boundaries using Kokoro phoneme-level duration estimation and maintains voice continuity between segments via A2V latent conditioning.

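
The idea behind sentence-boundary splitting can be illustrated with a greedy packer. This is a sketch only: it swaps Kokoro's phoneme-level duration estimate for a crude characters-per-second guess (the `CHARS_PER_SECOND` value is an arbitrary assumption), uses a simple regex for sentence boundaries, and does not attempt the latent conditioning that keeps the voice consistent across segments.

```python
import re

MAX_SEGMENT_S = 15.0     # approximate per-segment cap described in this README
CHARS_PER_SECOND = 15.0  # crude assumption; the real system estimates with Kokoro


def estimate_seconds(text: str) -> float:
    """Very rough spoken-duration guess based on character count."""
    return len(text) / CHARS_PER_SECOND


def split_into_segments(text: str) -> list[str]:
    """Greedily pack whole sentences into segments under the duration cap.

    A single sentence longer than the cap is kept whole rather than cut
    mid-sentence.
    """
    sentences = re.findall(r"[^.!?]+[.!?]+|[^.!?]+$", text.strip())
    segments, current = [], ""
    for sentence in (s.strip() for s in sentences):
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate) > MAX_SEGMENT_S:
            segments.append(current)  # close the segment at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```
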
## VRAM Requirements

| VRAM | Audio Model | Gemma | Notes |
|------|------------|-------|-------|
| 16 GB | INT8 (4.9 GB) | CPU streaming | Needs 32 GB system RAM. ~7s/chunk encode. |
| 24 GB | INT8 (4.9 GB) | NF4 on GPU (~8 GB) | Default config. ~0.2s/chunk encode. |
| 48 GB | bf16 (9.8 GB) | bf16 on GPU (24 GB) | Best quality. All models resident. |

The VRAM strategy is auto-detected. [SageAttention 2](https://github.com/thu-ml/SageAttention) is recommended for all configurations.

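
The tiers in the table can be read as a simple threshold lookup. The function below is a hypothetical illustration of that reading, not the project's actual detection code, which may use different thresholds or measure free rather than total memory. Treating in-between sizes (for example 32 GB) as the next lower tier is an assumption.

```python
def pick_config(vram_gb: float) -> dict:
    """Map available VRAM to a tier from the table (hypothetical helper)."""
    if vram_gb >= 48:
        return {"audio_model": "bf16", "gemma": "bf16 on GPU"}
    if vram_gb >= 24:
        return {"audio_model": "int8", "gemma": "NF4 on GPU"}
    if vram_gb >= 16:
        return {"audio_model": "int8", "gemma": "CPU streaming"}
    raise ValueError("the table lists no configuration below 16 GB of VRAM")
```
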
## Performance

Benchmarked on NVIDIA RTX 4090 (24 GB), ~55 seconds of output audio:

| Configuration | Total Time | Real-Time Factor |
|--------------|-----------|-----------------|
| bf16 + bf16 streaming | 83s | 0.66x |
| INT8 + NF4 (all GPU) | 35s | 1.57x |

Real-time factor here is output duration divided by total time, so values above 1x mean generation runs faster than playback.

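
As a quick check of the table's arithmetic, the sketch below divides the ~55 seconds of output audio by each total time. The function name is illustrative.

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    return audio_seconds / wall_seconds


for name, total in [("bf16 + bf16 streaming", 83), ("INT8 + NF4 (all GPU)", 35)]:
    print(f"{name}: {real_time_factor(55, total):.2f}x")
```

With these inputs the two ratios round to 0.66 and 1.57, matching the table.
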
## Limitations

- **Pronunciation**: Occasionally garbles complex multi-syllable words and proper nouns.
- **15-second generation window**: Each segment capped at ~15s. Longer text splits automatically.
- **Emotional range with voice cloning**: Identity transfer can reduce emotional extremes. Use a strong archetype in the voice description and provide reference audio with natural emotional variability (10-20 seconds, not monotone).
- **Multilingual pronunciation**: Language switching mid-speech may cause phonetic drift. Use separate requests per language.
- **Generation speed**: 3-8 seconds per 15-second segment depending on hardware.
- **Reference audio quality**: Low-quality references degrade output. Use clean audio with some emotional variability.
- **Gemma 3 12B is gated**: Requires accepting Google's terms of use and a HuggingFace token with access.

## Acknowledgments

- [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks for the base audiovisual model
- [Gemma 3](https://ai.google.dev/gemma) by Google for the text encoder
- [SeedVC](https://github.com/Plachtaa/seed-vc) by Plachta for voice refinement
- [Kokoro](https://github.com/hexgrad/kokoro) by hexgrad for duration estimation
- [SageAttention](https://github.com/thu-ml/SageAttention) for attention acceleration

## License

The model weights are released under the [LTX-2 Community License Agreement](https://github.com/Lightricks/LTX-2/blob/main/LICENSE). Scenema Audio's audio diffusion transformer is derived from [LTX 2.3](https://github.com/Lightricks/LTX-2)'s audiovisual model, and its weights are subject to the same terms.

The inference code and server are released under the [MIT License](https://github.com/ScenemaAI/scenema-audio/blob/main/LICENSE).

[Gemma 3 12B](https://ai.google.dev/gemma/terms) (text encoder) is a gated model requiring acceptance of Google's terms of use.