---
title: Scenema Audio
emoji: 🎙️
colorFrom: pink
colorTo: red
sdk: gradio
sdk_version: 6.14.0
python_version: '3.12'
app_file: app.py
pinned: false
hardware: zero-a10g
short_description: Zero-shot expressive voice cloning and speech generation
---
# Scenema Audio (ZeroGPU)
Gradio wrapper around [ScenemaAI/scenema-audio](https://github.com/ScenemaAI/scenema-audio).
Zero-shot expressive voice cloning and speech generation with emotion, pacing,
and breath control, built on an audio diffusion transformer extracted from
[LTX 2.3](https://github.com/Lightricks/LTX-2).
## Cold start
On a cold start, the first request downloads ~38 GB of model weights:
- `scenema-audio-transformer-int8.safetensors` (~4.9 GB)
- `scenema-audio-pipeline.safetensors` (~6.7 GB)
- `google/gemma-3-12b-it` (~24 GB, **gated** — requires `HF_TOKEN` secret)
- SeedVC + BigVGAN + Whisper checkpoints (~3 GB)
- MelBandRoFormer (~436 MB)
Set `HF_TOKEN` in the Space secrets to a token whose account has been granted access to `google/gemma-3-12b-it`.
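A cold start can fail late if the token is missing or the disk fills partway through the download. The sketch below shows one way to check both up front. It is not taken from `app.py`: the names `WEIGHT_SIZES_GB`, `total_gb`, and `preflight` are hypothetical, and the sizes are the approximate figures listed above (they sum to roughly 39 GB, slightly above the rounded total).

```python
import os
import shutil

# Approximate download sizes in GB, copied from the list above.
# These are rounded figures, not measured values.
WEIGHT_SIZES_GB = {
    "scenema-audio-transformer-int8.safetensors": 4.9,
    "scenema-audio-pipeline.safetensors": 6.7,
    "google/gemma-3-12b-it": 24.0,
    "SeedVC + BigVGAN + Whisper checkpoints": 3.0,
    "MelBandRoFormer": 0.436,
}


def total_gb():
    """Sum of the listed download sizes, in GB."""
    return sum(WEIGHT_SIZES_GB.values())


def preflight(env, free_gb):
    """Return a list of problems; an empty list means the download can start.

    env     -- mapping of environment variables (e.g. os.environ)
    free_gb -- free disk space in GB where the weights will be stored
    """
    problems = []
    # The Gemma checkpoint is gated, so an empty or missing token is fatal.
    if not env.get("HF_TOKEN"):
        problems.append(
            "HF_TOKEN is not set; the gated google/gemma-3-12b-it "
            "download will be refused."
        )
    needed = total_gb()
    if free_gb < needed:
        problems.append(
            f"Only {free_gb:.1f} GB free; about {needed:.1f} GB is needed."
        )
    return problems


if __name__ == "__main__":
    # Check the current working directory's filesystem (an assumption;
    # point this at wherever the weights are actually cached).
    free = shutil.disk_usage(".").free / 1e9
    for problem in preflight(os.environ, free):
        print("preflight:", problem)
```

Passing `env` and `free_gb` in as arguments keeps the check free of side effects, so it can run before any GPU time is requested.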
## License
- **Model weights:** LTX-2 Community License Agreement
- **Code:** MIT