	---
	language:
	- en
	- de
	- fr
	- es
	- it
	- pt
	- ja
	- zh
	- ko
	- ru
	- ar
	- hi
	- sw
	license: other
	license_name: ltx-2-community
	license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
	tags:
	- audio-generation
	- diffusion
	- text-to-audio
	- voice-cloning
	- speech-generation
	- expressive-speech
	- voice-acting
	- text-to-speech
	pipeline_tag: text-to-speech
	library_name: scenema-audio
	inference: false
	---

	# Scenema Audio

	Zero-shot expressive voice cloning and speech generation.

	[Visit scenema.ai/audio to hear all demos and try it out.](https://scenema.ai/audio)

	[Watch the demo video on YouTube](https://youtu.be/VnEQ_ImOaAc)

Most text-to-speech systems convert words into sound without performing them. Scenema Audio generates speech with intention, pacing, breath control, and emotional arcs that shift within a single generation, all from a text prompt that describes not just what to say but how to say it.

	Built on an audio diffusion transformer extracted from [LTX 2.3](https://github.com/Lightricks/LTX-2)'s 22B parameter audiovisual model, it learned how people actually sound in real scenes: angry, laughing, whispering, crying, exhausted, terrified.

	## Capabilities

	- Emotional acting: Rage, grief, joy, fear, exhaustion. Emotional state shifts within a single generation via action tags.
	- Child voices: Six-year-olds, toddlers, teenagers. Naturally voiced, not pitch-shifted adults.
	- Scene-aware audio: Describe the environment and the model generates speech with rain, thunder, crowds, or any ambient audio alongside the voice.
	- Zero-shot voice cloning: Provide 10-20 seconds of reference audio with some emotional variability. The model transfers the voice identity onto any emotional performance. No fine-tuning, no enrollment.
	- Long-form narration: Generates any length of audio by automatically splitting text and maintaining voice continuity across segments.
	- Multilingual: English, German, French, Spanish, Italian, Portuguese, Japanese, Chinese, Korean, Russian, Arabic, Hindi, Swahili.

	## Model Checkpoints

| File | Size | Description |
|------|------|-------------|
| `scenema-audio-transformer.safetensors` | 9.8 GB | Audio diffusion transformer (bf16) |
| `scenema-audio-transformer-int8.safetensors` | 4.9 GB | Audio diffusion transformer (INT8, identical quality) |
| `scenema-audio-pipeline.safetensors` | 6.7 GB | Audio VAE decoder + vocoder + text projection |
| `scenema-audio-vae-encoder.safetensors` | 42.7 MB | Audio VAE encoder for reference voice encoding |

	## Quick Start

	```bash
	git clone https://github.com/ScenemaAI/scenema-audio.git
	cd scenema-audio

	export HF_TOKEN=your_huggingface_token
	docker compose up
	```

	Models are downloaded on first start (~38 GB) and cached in a Docker volume. See the [GitHub repo](https://github.com/ScenemaAI/scenema-audio) for full documentation.

	## Prompt Format

	```xml
<speak voice="VOICE_DESCRIPTION" gender="male|female"
	scene="OPTIONAL_SCENE" language="OPTIONAL_LANG_CODE">
	<action>Performance direction.</action>
	Speech text here.
	</speak>
	```

| Attribute | Required | Default | Description |
|-----------|----------|---------|-------------|
| `voice` | Yes | | Detailed voice description. Drives vocal quality, emotion, accent, age, timbre, delivery style. |
| `gender` | Yes | | `"male"` or `"female"`. Controls pronoun assignment in compiled prompts. |
| `scene` | No | | Environmental context. Conditions the ambient audio around the speech. |
| `language` | No | `"en"` | Language code. |
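
As a rough illustration of the attribute table above, a prompt can be assembled programmatically so that quotes and ampersands in a voice description do not break the XML. `build_speak_prompt` and its `segments` argument are hypothetical names, not part of the project; only the tag and attribute names come from this card.

```python
from xml.sax.saxutils import escape, quoteattr


def build_speak_prompt(voice, gender, segments, scene=None, language=None):
    """Assemble a <speak> prompt string (illustrative helper, not a project API).

    `segments` is a list of (kind, text) pairs, where kind is "action" or "speech".
    """
    if gender not in ("male", "female"):
        raise ValueError("gender must be 'male' or 'female'")

    attrs = [("voice", voice), ("gender", gender)]
    if scene is not None:
        attrs.append(("scene", scene))
    if language is not None:
        attrs.append(("language", language))
    # quoteattr() adds the surrounding quotes and escapes &, < and >.
    attr_text = " ".join(f"{name}={quoteattr(value)}" for name, value in attrs)

    lines = [f"<speak {attr_text}>"]
    for kind, text in segments:
        if kind == "action":
            lines.append(f"<action>{escape(text)}</action>")
        elif kind == "speech":
            lines.append(escape(text))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    lines.append("</speak>")
    return "\n".join(lines)


prompt = build_speak_prompt(
    voice="Middle-aged man, warm but weathered.",
    gender="male",
    segments=[
        ("action", "Calm, almost casual."),
        ("speech", "I used to think I had all the time in the world."),
    ],
)
print(prompt)
```

Optional attributes are omitted when unset, so the server-side defaults from the table still apply.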

	### Voice Description

The `voice` attribute is the primary control. The richer and more specific the description, the better the result. It can cover:

	- Vocal qualities: timbre, pitch, breathiness, rasp, resonance
	- Emotional state: rage, tenderness, exhaustion, excitement, grief
	- Speaking style: pacing, emphasis, pauses, enunciation
	- Character archetypes: "Think Tony Soprano having a breakdown"
	- Age and gender: child, elderly, young woman, teenage boy
	- Accents: British, Southern American, New Jersey Italian American

	### Action Tags

	`<action>` tags are stage directions that shape HOW speech is delivered. Place them between speech segments to direct emotional shifts, pacing, and physical delivery:

	```xml
	<speak voice="Middle-aged man, warm but weathered." gender="male">
	<action>Calm, almost casual. Staring at his hands.</action>
	I used to think I had all the time in the world.
	<action>Voice tightens. Fighting to stay composed.</action>
	Then one Tuesday morning, the doctor said three words that changed everything.
	<action>Long pause. Deep breath. Raw but steady.</action>
	And I realized I hadn't called my son in six months.
	</speak>
	```
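
Because action tags and speech alternate inside a single element, it can help to check their order before sending a prompt. The sketch below uses Python's standard `xml.etree.ElementTree`; `list_segments` is a hypothetical helper and says nothing about how the server itself parses prompts.

```python
import xml.etree.ElementTree as ET


def list_segments(prompt):
    """Return [(kind, text), ...] in document order for a <speak> prompt."""
    root = ET.fromstring(prompt)
    if root.tag != "speak":
        raise ValueError("expected a <speak> root element")

    segments = []

    def add_speech(text):
        # Text between tags may be only whitespace or newlines; skip those.
        if text and text.strip():
            segments.append(("speech", " ".join(text.split())))

    add_speech(root.text)  # speech before the first child tag, if any
    for child in root:
        segments.append((child.tag, " ".join((child.text or "").split())))
        add_speech(child.tail)  # speech that follows this tag
    return segments


example = """<speak voice="Middle-aged man, warm but weathered." gender="male">
<action>Calm, almost casual. Staring at his hands.</action>
I used to think I had all the time in the world.
<action>Voice tightens. Fighting to stay composed.</action>
Then one Tuesday morning, the doctor said three words that changed everything.
</speak>"""

for kind, text in list_segments(example):
    print(f"{kind:>6}: {text}")
```

A prompt that fails to parse here (for example, an unescaped `&` in the speech text) is worth fixing before it reaches the model.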

	### Voice Cloning

	Provide 10-20 seconds of reference audio with some emotional variability. The model generates expressive speech from the prompt and transfers the reference voice's identity onto the performance.

	```json
	{
	"prompt": "<speak voice=\"Gravelly male voice, fast talking, rough.\" gender=\"male\"><action>He completely loses it</action>What are you waiting for?!</speak>",
	"reference_voice_url": "https://example.com/reference.wav"
	}
	```

	Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.
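
The prompt is XML embedded in a JSON string, so every double quote inside it must be escaped. A minimal sketch that lets `json.dumps` handle the escaping (`cloning_request_body` is a hypothetical helper; the two field names are the ones shown in the request above):

```python
import json


def cloning_request_body(prompt, reference_voice_url):
    """Serialize a voice-cloning request; json.dumps escapes the quotes in the XML."""
    return json.dumps(
        {"prompt": prompt, "reference_voice_url": reference_voice_url}
    )


prompt = (
    '<speak voice="Gravelly male voice, fast talking, rough." gender="male">'
    "<action>He completely loses it</action>What are you waiting for?!</speak>"
)
body = cloning_request_body(prompt, "https://example.com/reference.wav")
print(body)
```

Building the body this way avoids hand-writing `\"` sequences, which are easy to get wrong in longer prompts.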

	## Examples

	### Emotional Acting

	```xml
	<speak voice="A man on the edge. Explosive rage. Italian-American inflection."
	gender="male" scene="A dimly lit office, late at night">
	<action>He stands up slowly, voice dangerously low</action>
	You come into my house, you eat my food, and then you got the nerve
	to tell me how to run my business.
	<action>Voice rising, finger pointing</action>
	I built this thing from nothing while you were sitting on your ass.
	</speak>
	```

	### Child Voice

	```xml
	<speak voice="A six-year-old girl, bright and excited, speaking fast
	with breathless enthusiasm. Slight lisp on S sounds."
	gender="female">
	Mommy look! There is a rainbow and it goes all the way across the whole sky!
	</speak>
	```

	### Scene-Aware Audio

	```xml
	<speak voice="Male, mid 40s. Weathered. Urgent, projecting over wind."
	gender="male" scene="Open dock in a thunderstorm, heavy rain"
	shot="scene">
	<sound>Heavy rain and wind howling</sound>
	<action>He shouts over the storm</action>
	Get the lines! She is pulling loose!
	<sound>Thunder cracks overhead</sound>
	Move! I said move!
	</speak>
	```

	## API Reference

	### POST /generate

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `prompt` | string | required | `<speak>` XML string |
| `mode` | string | `"generate"` | `"generate"` for full pipeline. `"voice_design"` for 15s voice preview. |
| `reference_voice_url` | string | `null` | URL to reference audio for zero-shot voice cloning. 10-20 seconds with emotional variability is ideal. |
| `background_sfx` | bool | `false` | Keep generated sound effects in the output. |
| `validate` | bool | `true` | Whisper speech validation with retry on garbled output. |
| `seed` | int | `-1` | Generation seed. `-1` for random. |
| `pace` | float | `1.5` | Duration allocation multiplier. Higher = slower speech. |
| `min_match_ratio` | float | `0.90` | Whisper validation threshold (0.0-1.0). |
| `skip_vc` | bool | `false` | Skip voice conversion post-processing. |
| `vc_steps` | int | `25` | SeedVC diffusion steps (10-50). |
| `vc_cfg_rate` | float | `0.5` | SeedVC guidance rate (0.0-1.0). |
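
The defaults and ranges in the table can be mirrored client-side to catch bad values before a request is sent. This is a sketch under assumptions: `GenerateRequest` and `post_generate` are hypothetical names, the server is assumed to accept a JSON body, and the base URL must be supplied by the caller because this card does not state a host or port.

```python
import json
import urllib.request
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GenerateRequest:
    """Client-side mirror of the /generate fields and defaults listed above."""

    prompt: str
    mode: str = "generate"
    reference_voice_url: Optional[str] = None
    background_sfx: bool = False
    validate: bool = True
    seed: int = -1
    pace: float = 1.5
    min_match_ratio: float = 0.90
    skip_vc: bool = False
    vc_steps: int = 25
    vc_cfg_rate: float = 0.5

    def __post_init__(self):
        # Ranges are taken from the Description column of the table.
        if self.mode not in ("generate", "voice_design"):
            raise ValueError("mode must be 'generate' or 'voice_design'")
        if not 0.0 <= self.min_match_ratio <= 1.0:
            raise ValueError("min_match_ratio must be within 0.0-1.0")
        if not 10 <= self.vc_steps <= 50:
            raise ValueError("vc_steps must be within 10-50")
        if not 0.0 <= self.vc_cfg_rate <= 1.0:
            raise ValueError("vc_cfg_rate must be within 0.0-1.0")


def post_generate(base_url, request, timeout=600):
    """POST the request as JSON (assumed encoding) and return the parsed reply."""
    data = json.dumps(asdict(request)).encode("utf-8")
    http_request = urllib.request.Request(
        base_url.rstrip("/") + "/generate",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(http_request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Fixing `seed` to a non-negative value in such a request is the natural way to compare two settings of `pace` or `vc_steps` on otherwise identical input.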

	### Response

	Returns JSON with base64-encoded WAV audio:

	```json
{
  "status": "succeeded",
  "audio": "<base64-encoded WAV>",
  "content_type": "audio/wav",
  "metadata": {
    "duration_s": 12.4,
    "sample_rate": 48000,
    "processing_ms": 8200,
    "seed": 42
  }
}
	```
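
A client-side sketch for unpacking this response with the standard library. `decode_audio` and `wav_duration_seconds` are hypothetical helpers, and the `wave` module only reads PCM-encoded WAV, which is assumed here rather than confirmed by this card.

```python
import base64
import io
import wave


def decode_audio(response):
    """Return raw WAV bytes from a /generate response dict."""
    if response.get("status") != "succeeded":
        raise RuntimeError(f"generation did not succeed: {response.get('status')!r}")
    return base64.b64decode(response["audio"])


def wav_duration_seconds(wav_bytes):
    """Read the WAV header and compute duration as frames / sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def save_audio(response, path):
    """Write the decoded WAV to disk and return its duration in seconds."""
    wav_bytes = decode_audio(response)
    with open(path, "wb") as output_file:
        output_file.write(wav_bytes)
    return wav_duration_seconds(wav_bytes)
```

Comparing the computed duration against `metadata.duration_s` is a cheap sanity check that the payload was not truncated in transit.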

	## Architecture

	```
	XML prompt (voice + scene + action tags + text)
	-> Gemma 3 12B text encoding
	-> 8-step distilled latent diffusion
	-> Audio VAE decoding
	-> MelBandRoFormer vocal separation (strips SFX unless background_sfx=true)
	-> SeedVC voice identity transfer (when reference provided or multi-chunk)
	-> Output WAV (48kHz stereo)
	```

	For longer text, the system splits at sentence boundaries using Kokoro phoneme-level duration estimation and maintains voice continuity between segments via A2V latent conditioning.
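
The splitting step can be pictured as greedy packing of whole sentences under the ~15-second segment cap. The sketch below is not the project's implementation: it replaces Kokoro's phoneme-level estimate with an assumed constant speaking rate, and `split_into_chunks`, `estimate_seconds`, and the 2.5 words-per-second figure are all illustrative.

```python
import re

MAX_SEGMENT_SECONDS = 15.0
ASSUMED_WORDS_PER_SECOND = 2.5  # crude stand-in for a phoneme-level estimate


def estimate_seconds(text):
    """Estimate spoken duration from word count (assumption, not Kokoro)."""
    return len(text.split()) / ASSUMED_WORDS_PER_SECOND


def split_into_chunks(text, max_seconds=MAX_SEGMENT_SECONDS):
    """Greedily pack whole sentences into chunks estimated under max_seconds.

    A single sentence longer than max_seconds is kept as its own chunk
    rather than being split mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        if current and estimate_seconds(" ".join(candidate)) > max_seconds:
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


sentence = "one two three four five six seven eight nine ten."
for chunk in split_into_chunks(" ".join([sentence] * 7)):
    print(f"{estimate_seconds(chunk):.1f}s  {len(chunk.split())} words")
```

Splitting only at sentence boundaries keeps each segment a complete prosodic unit, which is presumably why the real system does the same before applying voice continuity across segments.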

	## VRAM Requirements

| VRAM | Audio Model | Gemma | Notes |
|------|------------|-------|-------|
| 16 GB | INT8 (4.9 GB) | CPU streaming | Needs 32 GB system RAM. ~7s/chunk encode. |
| 24 GB | INT8 (4.9 GB) | NF4 on GPU (~8 GB) | Default config. ~0.2s/chunk encode. |
| 48 GB | bf16 (9.8 GB) | bf16 on GPU (24 GB) | Best quality. All models resident. |

The VRAM strategy is auto-detected. [SageAttention 2](https://github.com/thu-ml/SageAttention) is recommended for all configurations.
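
The tiers in the table amount to a threshold lookup. A hypothetical sketch follows; `plan_for_vram` and `MemoryPlan` are not project names, and treating each row as a lower bound is an assumption about how the tiers are meant to be read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryPlan:
    audio_model: str
    text_encoder: str


def plan_for_vram(vram_gb):
    """Map a nominal VRAM size in GB to the matching row of the table."""
    if vram_gb >= 48:
        return MemoryPlan("bf16", "bf16 on GPU")
    if vram_gb >= 24:
        return MemoryPlan("INT8", "NF4 on GPU")
    if vram_gb >= 16:
        return MemoryPlan("INT8", "CPU streaming")
    raise ValueError("the table lists no configuration below 16 GB")


for size in (16, 24, 48):
    print(size, plan_for_vram(size))
```

A real detector would also need some tolerance, since drivers often report slightly less memory than a card's nominal size.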

	## Performance

	Benchmarked on NVIDIA RTX 4090 (24 GB), ~55 seconds of output audio:

| Configuration | Total Time | Real-Time Factor |
|--------------|-----------|-----------------|
| bf16 + bf16 streaming | 83s | 0.66x |
| INT8 + NF4 (all GPU) | 35s | 1.57x |
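
The real-time factor column is consistent with dividing the ~55 seconds of output audio by the total time, so values above 1x mean faster than real time. A small check of that arithmetic (the function name is illustrative, and the direction of the ratio is inferred from the numbers rather than stated in this card):

```python
def real_time_factor(audio_seconds, processing_seconds):
    """Seconds of audio produced per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


for name, total_seconds in (
    ("bf16 + bf16 streaming", 83),
    ("INT8 + NF4 (all GPU)", 35),
):
    print(f"{name}: {real_time_factor(55, total_seconds):.2f}x")
# bf16 + bf16 streaming: 0.66x
# INT8 + NF4 (all GPU): 1.57x
```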

	## Limitations

	- Pronunciation: Occasionally garbles complex multi-syllable words and proper nouns.
	- 15-second generation window: Each segment capped at ~15s. Longer text splits automatically.
	- Emotional range with voice cloning: Identity transfer can reduce emotional extremes. Use a strong archetype in the voice description and provide reference audio with natural emotional variability (10-20 seconds, not monotone).
	- Multilingual pronunciation: Language switching mid-speech may cause phonetic drift. Use separate requests per language.
	- Generation speed: 3-8 seconds per 15-second segment depending on hardware.
	- Reference audio quality: Low-quality references degrade output. Use clean audio with some emotional variability.
- Gemma 3 12B is gated: Requires accepting Google's terms of use and a Hugging Face token with access.

	## Acknowledgments

	- [LTX-2](https://github.com/Lightricks/LTX-2) by Lightricks for the base audiovisual model
	- [Gemma 3](https://ai.google.dev/gemma) by Google for the text encoder
	- [SeedVC](https://github.com/Plachtaa/seed-vc) by Plachta for voice refinement
	- [Kokoro](https://github.com/hexgrad/kokoro) by hexgrad for duration estimation
	- [SageAttention](https://github.com/thu-ml/SageAttention) for attention acceleration

	## License

	The model weights are released under the [LTX-2 Community License Agreement](https://github.com/Lightricks/LTX-2/blob/main/LICENSE). Scenema Audio's audio diffusion transformer is derived from [LTX 2.3](https://github.com/Lightricks/LTX-2)'s audiovisual model, and its weights are subject to the same terms.

	The inference code and server are released under the [MIT License](https://github.com/ScenemaAI/scenema-audio/blob/main/LICENSE).

	[Gemma 3 12B](https://ai.google.dev/gemma/terms) (text encoder) is a gated model requiring acceptance of Google's terms of use.