+---
+language:
+  - en
+  - de
+  - fr
+  - es
+  - it
+  - pt
+  - ja
+  - zh
+  - ko
+  - ru
+  - ar
+  - hi
+  - sw
+license: mit
+tags:
+  - audio-generation
+  - diffusion
+  - text-to-audio
+  - voice-cloning
+  - speech-generation
+  - expressive-speech
+  - voice-acting
+pipeline_tag: text-to-audio
+library_name: scenema-audio
+---
+# Scenema Audio
+**Zero-shot expressive voice cloning and speech generation.**
+**[Visit scenema.ai/audio to hear all demos and try it out.](https://scenema.ai/audio)**
+[![Demo Video](https://img.youtube.com/vi/DW1JzkZn_u0/maxresdefault.jpg)](https://youtu.be/DW1JzkZn_u0)
+Every existing text-to-speech system converts words into sound, but none of them perform. Scenema Audio generates speech with intention, pacing, breath control, and emotional arcs that shift within a single generation, all from a text prompt that describes not just what to say but how to say it.
+Built on an audio diffusion transformer extracted from [LTX 2.3](https://github.com/Lightricks/LTX-2)'s 22B parameter audiovisual model, it learned how people actually sound in real scenes: angry, laughing, whispering, crying, exhausted, terrified.
+## Capabilities
+- **Emotional acting**: Rage, grief, joy, fear, exhaustion. Emotional state shifts within a single generation via action tags.
+- **Child voices**: Six-year-olds, toddlers, teenagers. Naturally voiced, not pitch-shifted adults.
+- **Scene-aware audio**: Describe the environment and the model generates speech with rain, thunder, crowds, or any ambient audio alongside the voice.
+- **Zero-shot voice cloning**: Provide 10-20 seconds of reference audio with some emotional variability. The model transfers the voice identity onto any emotional performance. No fine-tuning, no enrollment.
+- **Long-form narration**: Generates audio of any length by automatically splitting the text into segments and maintaining voice continuity across them.
+- **Multilingual**: English, German, French, Spanish, Italian, Portuguese, Japanese, Chinese, Korean, Russian, Arabic, Hindi, Swahili.
+## Model Checkpoints
+| File | Size | Description |
+|------|------|-------------|
+| `scenema-audio-transformer.safetensors` | 9.8 GB | Audio diffusion transformer (bf16) |
+| `scenema-audio-transformer-int8.safetensors` | 4.9 GB | Audio diffusion transformer (INT8, identical quality) |
+| `scenema-audio-pipeline.safetensors` | 6.7 GB | Audio VAE decoder + vocoder + text projection |
+| `scenema-audio-vae-encoder.safetensors` | 42.7 MB | Audio VAE encoder for reference voice encoding |
+## Quick Start
+```bash
+git clone https://github.com/ScenemaAI/scenema-audio.git
+cd scenema-audio
+export HF_TOKEN=your_huggingface_token
+docker compose up
+```
+Models are downloaded on first start (~38 GB) and cached in a Docker volume. See the [GitHub repo](https://github.com/ScenemaAI/scenema-audio) for full documentation.
+## Prompt Format
+```xml
+<speak voice="VOICE_DESCRIPTION" gender="male|female"
+       scene="OPTIONAL_SCENE" language="OPTIONAL_LANG_CODE">
+  <action>Performance direction.</action>
+  Speech text here.
+</speak>
+```
+| Attribute | Required | Default | Description |
+|-----------|----------|---------|-------------|
+| `voice` | Yes | | Detailed voice description. Drives vocal quality, emotion, accent, age, timbre, delivery style. |
+| `gender` | Yes | | `"male"` or `"female"`. Controls pronoun assignment in compiled prompts. |
+| `scene` | No | | Environmental context. Conditions the ambient audio around the speech. |
+| `language` | No | `"en"` | Language code. |
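Because voice descriptions and speech text are free-form, characters such as `&`, `<`, or quotes can break the markup. The following is a minimal sketch of assembling a prompt with standard XML escaping. The `build_speak_prompt` helper is hypothetical and not part of the project, and it assumes the server parses the prompt as ordinary XML.

```python
from xml.sax.saxutils import escape, quoteattr


def build_speak_prompt(voice, gender, text, scene=None, language=None):
    """Assemble a <speak> prompt string, escaping attribute values and text."""
    if gender not in ("male", "female"):
        raise ValueError("gender must be 'male' or 'female'")
    attrs = [("voice", voice), ("gender", gender)]
    # Optional attributes are only emitted when provided.
    if scene:
        attrs.append(("scene", scene))
    if language:
        attrs.append(("language", language))
    # quoteattr() adds the surrounding quotes and escapes the value.
    attr_str = " ".join(f"{name}={quoteattr(value)}" for name, value in attrs)
    return f"<speak {attr_str}>{escape(text)}</speak>"


prompt = build_speak_prompt(
    voice="Elderly woman, soft and breathy, slight British accent.",
    gender="female",
    text="Tea & biscuits, dear?",
    scene="Quiet kitchen, kettle simmering",
)
print(prompt)
```

Omitting `language` leaves the documented default (`"en"`) in effect.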
+### Voice Description
+The `voice` attribute is the primary control. The richer and more specific, the better:
+- **Vocal qualities**: timbre, pitch, breathiness, rasp, resonance
+- **Emotional state**: rage, tenderness, exhaustion, excitement, grief
+- **Speaking style**: pacing, emphasis, pauses, enunciation
+- **Character archetypes**: "Think Tony Soprano having a breakdown"
+- **Age and gender**: child, elderly, young woman, teenage boy
+- **Accents**: British, Southern American, New Jersey Italian American
+### Action Tags
+`<action>` tags are stage directions that shape HOW speech is delivered. Place them between speech segments to direct emotional shifts, pacing, and physical delivery:
+```xml
+<speak voice="Middle-aged man, warm but weathered." gender="male">
+  <action>Calm, almost casual. Staring at his hands.</action>
+  I used to think I had all the time in the world.
+  <action>Voice tightens. Fighting to stay composed.</action>
+  Then one Tuesday morning, the doctor said three words that changed everything.
+  <action>Long pause. Deep breath. Raw but steady.</action>
+  And I realized I hadn't called my son in six months.
+</speak>
+```
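Before sending a long prompt, it can help to check locally that it is well-formed and to see the order in which directions and speech alternate. The sketch below uses Python's standard `xml.etree.ElementTree`. It is only a client-side check; how the server itself parses prompts is not documented here, and `list_segments` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

PROMPT = """<speak voice="Middle-aged man, warm but weathered." gender="male">
  <action>Calm, almost casual. Staring at his hands.</action>
  I used to think I had all the time in the world.
  <action>Voice tightens. Fighting to stay composed.</action>
  Then one Tuesday morning, the doctor said three words that changed everything.
</speak>"""


def list_segments(prompt):
    """Return (kind, text) pairs in document order."""
    root = ET.fromstring(prompt)  # raises ParseError on malformed XML
    segments = []
    if root.text and root.text.strip():
        segments.append(("speech", root.text.strip()))
    for child in root:
        if child.text and child.text.strip():
            segments.append((child.tag, child.text.strip()))
        # Text that follows a child element is stored in its .tail attribute.
        if child.tail and child.tail.strip():
            segments.append(("speech", child.tail.strip()))
    return segments


for kind, text in list_segments(PROMPT):
    print(f"{kind:>6}: {text}")
```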
+### Voice Cloning
+Provide 10-20 seconds of reference audio with some emotional variability. The model generates expressive speech from the prompt and transfers the reference voice's identity onto the performance.
+```json
+{
+  "prompt": "<speak voice=\"Gravelly male voice, fast talking, rough.\" gender=\"male\"><action>He completely loses it</action>What are you waiting for?!</speak>",
+  "reference_voice_url": "https://example.com/reference.wav"
+}
+```
+Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.
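Hand-escaping the quotes inside the `prompt` string is error-prone. A small sketch of letting `json.dumps` handle the escaping instead; the reference URL is the same placeholder used above, not a real clip.

```python
import json

prompt = (
    '<speak voice="Gravelly male voice, fast talking, rough." gender="male">'
    "<action>He completely loses it</action>"
    "What are you waiting for?!</speak>"
)

payload = {
    "prompt": prompt,
    # Placeholder URL from the example above; replace with a real clip.
    "reference_voice_url": "https://example.com/reference.wav",
}

# json.dumps escapes the double quotes inside the XML automatically.
body = json.dumps(payload, indent=2)
print(body)
```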
+## Examples
+### Emotional Acting
+```xml
+<speak voice="A man on the edge. Explosive rage. Italian-American inflection."
+       gender="male" scene="A dimly lit office, late at night">
+  <action>He stands up slowly, voice dangerously low</action>
+  You come into my house, you eat my food, and then you got the nerve
+  to tell me how to run my business.
+  <action>Voice rising, finger pointing</action>
+  I built this thing from nothing while you were sitting on your ass.
+</speak>
+```
+### Child Voice
+```xml
+<speak voice="A six-year-old girl, bright and excited, speaking fast
+with breathless enthusiasm. Slight lisp on S sounds."
+gender="female">
+  Mommy look! There is a rainbow and it goes all the way across the whole sky!
+</speak>
+```
+### Scene-Aware Audio
+```xml
+<speak voice="Male, mid 40s. Weathered. Urgent, projecting over wind."
+       gender="male" scene="Open dock in a thunderstorm, heavy rain"
+       shot="scene">
+  <sound>Heavy rain and wind howling</sound>
+  <action>He shouts over the storm</action>
+  Get the lines! She is pulling loose!
+  <sound>Thunder cracks overhead</sound>
+  Move! I said move!
+</speak>
+```
+## API Reference
+### POST /generate
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `prompt` | string | **required** | `<speak>` XML string |
+| `mode` | string | `"generate"` | `"generate"` for full pipeline. `"voice_design"` for 15s voice preview. |
+| `reference_voice_url` | string | `null` | URL to reference audio for zero-shot voice cloning. 10-20 seconds with emotional variability is ideal. |
+| `background_sfx` | bool | `false` | Keep generated sound effects in the output. |
+| `validate` | bool | `true` | Whisper speech validation with retry on garbled output. |
+| `seed` | int | `-1` | Generation seed. `-1` for random. |
+| `pace` | float | `1.5` | Duration allocation multiplier. Higher = slower speech. |
+| `min_match_ratio` | float | `0.90` | Whisper validation threshold (0.0-1.0). |
+| `skip_vc` | bool | `false` | Skip voice conversion post-processing. |
+| `vc_steps` | int | `25` | SeedVC diffusion steps (10-50). |
+| `vc_cfg_rate` | float | `0.5` | SeedVC guidance rate (0.0-1.0). |
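A minimal client-side sketch of building a `POST /generate` request with the standard library. The numeric ranges are copied from the table above. The `build_request` helper, the JSON `Content-Type` header, and the `http://localhost:8000` address are assumptions for illustration; check the repository's Docker configuration for the actual host and port.

```python
import json
import urllib.request

# Ranges copied from the table above; the helper itself is hypothetical.
RANGES = {
    "min_match_ratio": (0.0, 1.0),
    "vc_steps": (10, 50),
    "vc_cfg_rate": (0.0, 1.0),
}


def build_request(base_url, prompt, **options):
    """Build (but do not send) a POST /generate request."""
    for name, (low, high) in RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    body = json.dumps({"prompt": prompt, **options}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed address; not taken from the project documentation.
request = build_request(
    "http://localhost:8000",
    '<speak voice="Calm narrator." gender="female">Hello there.</speak>',
    seed=42,
    pace=1.5,
    vc_steps=25,
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```

Validating ranges before sending avoids waiting on a generation that the server would reject or clamp; fields left out fall back to the defaults listed in the table.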
+### Response
+Returns JSON with base64-encoded WAV audio:
+```json
+{
+  "status": "succeeded",
+  "audio": "<base64-encoded WAV>",
+  "content_type": "audio/wav",
+  "metadata": {
+    "duration_s": 12.4,
+    "sample_rate": 48000,
+    "processing_ms": 8200,
+    "seed": 42
+  }
+}
+```
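A sketch of turning such a response into a playable file. `save_audio` is a hypothetical helper; to stay runnable without a server, the demo builds a stand-in response locally (0.1 s of silence) rather than calling the API.

```python
import base64
import io
import json
import os
import tempfile
import wave


def save_audio(response_text, out_path):
    """Decode the base64 WAV from a response body and write it to disk."""
    response = json.loads(response_text)
    if response.get("status") != "succeeded":
        raise RuntimeError(f"generation failed: {response.get('status')}")
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(response["audio"]))
    return response["metadata"]


# Stand-in response: 0.1 s of 48 kHz stereo silence, built locally for the demo.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)  # 16-bit samples
    w.setframerate(48000)
    w.writeframes(b"\x00" * (4800 * 2 * 2))
fake_response = json.dumps({
    "status": "succeeded",
    "audio": base64.b64encode(buffer.getvalue()).decode("ascii"),
    "content_type": "audio/wav",
    "metadata": {"duration_s": 0.1, "sample_rate": 48000, "seed": 42},
})

out_path = os.path.join(tempfile.mkdtemp(), "output.wav")
metadata = save_audio(fake_response, out_path)
with wave.open(out_path, "rb") as w:
    print(w.getframerate(), w.getnchannels(), w.getnframes())
```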
+## Architecture
+```
+XML prompt (voice + scene + action tags + text)
+  -> Gemma 3 12B text encoding
+  -> 8-step distilled latent diffusion
+  -> Audio VAE decoding
+  -> MelBandRoFormer vocal separation (strips SFX unless background_sfx=true)
+  -> SeedVC voice identity transfer (when reference provided or multi-chunk)
+  -> Output WAV (48kHz stereo)
+```
+For longer text, the system splits at sentence boundaries using Kokoro phoneme-level duration estimation and maintains voice continuity between segments via A2V latent conditioning.
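To illustrate the splitting idea only, here is a simplified sketch that packs whole sentences into segments under the ~15-second window. It is not the project's implementation: it swaps Kokoro's phoneme-level estimate for a crude word-count heuristic with an assumed speaking rate, and it does not model the latent conditioning between segments.

```python
import re

MAX_SEGMENT_S = 15.0     # generation window stated in this README
WORDS_PER_SECOND = 2.5   # assumed speaking rate; the real system uses Kokoro


def estimate_seconds(sentence):
    """Crude duration estimate from word count (stand-in for Kokoro)."""
    return len(sentence.split()) / WORDS_PER_SECOND


def split_into_segments(text, max_seconds=MAX_SEGMENT_S):
    """Greedily pack whole sentences into segments under max_seconds."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments, current, current_s = [], [], 0.0
    for sentence in sentences:
        duration = estimate_seconds(sentence)
        # Start a new segment when adding this sentence would overflow.
        # A single sentence longer than max_seconds still gets its own segment.
        if current and current_s + duration > max_seconds:
            segments.append(" ".join(current))
            current, current_s = [], 0.0
        current.append(sentence)
        current_s += duration
    if current:
        segments.append(" ".join(current))
    return segments


text = " ".join(f"Sentence number {i} has exactly seven words." for i in range(12))
for segment in split_into_segments(text):
    print(f"{estimate_seconds(segment):5.1f}s  {segment[:40]}...")
```

Splitting only at sentence boundaries keeps each segment a natural unit of delivery, at the cost of uneven segment lengths.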
+## VRAM Requirements
+| VRAM | Audio Model | Gemma | Notes |
+|------|------------|-------|-------|
+| 16 GB | INT8 (4.9 GB) | CPU streaming | Needs 32 GB system RAM. ~7s/chunk encode. |
+| 24 GB | INT8 (4.9 GB) | NF4 on GPU (~8 GB) | Default config. ~0.2s/chunk encode. |
+| 48 GB | bf16 (9.8 GB) | bf16 on GPU (24 GB) | Best quality. All models resident. |
+The VRAM strategy is auto-detected. [SageAttention 2](https://github.com/thu-ml/SageAttention) is recommended for all configurations.
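The tiers in the table can be read as a simple threshold lookup. The sketch below mirrors the table only; the project's actual auto-detection logic is not shown in this README, `pick_config` is a hypothetical name, and the thresholds are treated as nominal card sizes.

```python
def pick_config(vram_gb):
    """Map nominal VRAM (GB) to the tiers listed in the table above."""
    if vram_gb >= 48:
        return {"audio_model": "bf16", "gemma": "bf16 on GPU"}
    if vram_gb >= 24:
        return {"audio_model": "int8", "gemma": "NF4 on GPU"}
    if vram_gb >= 16:
        # This tier also needs 32 GB of system RAM according to the table.
        return {"audio_model": "int8", "gemma": "CPU streaming"}
    raise ValueError("the README lists no configuration below 16 GB")


for vram in (16, 24, 48):
    print(vram, pick_config(vram))
```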
+## Performance
+Benchmarked on an NVIDIA RTX 4090 (24 GB), generating ~55 seconds of output audio. Speed is output duration divided by total time, so values above 1x are faster than real time:
+| Configuration | Total Time | Speed (x real time) |
+|--------------|-----------|-----------------|
+| bf16 + bf16 streaming | 83s | 0.66x |
+| INT8 + NF4 (all GPU) | 35s | 1.57x |
+## Limitations
+- **Pronunciation**: Occasionally garbles complex multi-syllable words and proper nouns.
+- **15-second generation window**: Each segment is capped at ~15s. Longer text is split automatically.
+- **Emotional range with voice cloning**: Identity transfer can reduce emotional extremes. Use a strong archetype in the voice description and provide reference audio with natural emotional variability (10-20 seconds, not monotone).
+- **Multilingual pronunciation**: Language switching mid-speech may cause phonetic drift. Use separate requests per language.
+- **Generation speed**: 3-8 seconds per 15-second segment depending on hardware.
+- **Reference audio quality**: Low-quality references degrade output. Use clean audio with some emotional variability.
+- **Gemma 3 12B is gated**: Requires accepting Google's terms of use and a HuggingFace token with access.
+## Acknowledgments
+- [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks for the base audiovisual model
+- [Gemma 3](https://ai.google.dev/gemma) by Google for the text encoder
+- [SeedVC](https://github.com/Plachtaa/seed-vc) by Plachta for voice refinement
+- [Kokoro](https://github.com/hexgrad/kokoro) by hexgrad for duration estimation
+- [SageAttention](https://github.com/thu-ml/SageAttention) for attention acceleration
+## License
+MIT License. See [LICENSE](LICENSE) for details.
+Gemma 3 12B (text encoder) is a gated model requiring Google's terms of use.