metadata
title: Scenema Audio
emoji: 🎙️
colorFrom: pink
colorTo: red
sdk: gradio
sdk_version: 6.14.0
python_version: '3.12'
app_file: app.py
pinned: false
hardware: zero-a10g
short_description: Zero-shot expressive voice cloning and speech generation

Scenema Audio (ZeroGPU)

Gradio wrapper around ScenemaAI/scenema-audio.

Zero-shot expressive voice cloning and speech generation with emotion, pacing, and breath control, built on an audio diffusion transformer extracted from LTX 2.3.

Cold start

The first request downloads ~38 GB of model weights:

  • scenema-audio-transformer-int8.safetensors (~4.9 GB)
  • scenema-audio-pipeline.safetensors (~6.7 GB)
  • google/gemma-3-12b-it (~24 GB, gated — requires HF_TOKEN secret)
  • SeedVC + BigVGAN + Whisper checkpoints (~3 GB)
  • MelBandRoFormer (~436 MB)
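
Because the cold-start download is large, it can help to check free disk space before the first request. The sketch below is illustrative only and is not part of app.py: the names `WEIGHT_SIZES_GB`, `total_download_gb`, and `has_room` are hypothetical, and the sizes are the approximate figures from the list above, so their sum only roughly matches the headline total.

```python
import shutil

# Approximate sizes in GB, copied from the list above (not measured).
WEIGHT_SIZES_GB = {
    "scenema-audio-transformer-int8.safetensors": 4.9,
    "scenema-audio-pipeline.safetensors": 6.7,
    "google/gemma-3-12b-it": 24.0,
    "SeedVC + BigVGAN + Whisper checkpoints": 3.0,
    "MelBandRoFormer": 0.436,
}


def total_download_gb(sizes=WEIGHT_SIZES_GB):
    """Sum the approximate per-item sizes, in GB."""
    return sum(sizes.values())


def has_room(path=".", needed_gb=None, free_bytes=None):
    """Return True if `path` has at least `needed_gb` (decimal GB) free.

    `free_bytes` can be passed explicitly, which keeps the check testable;
    otherwise it is read from the filesystem holding `path`.
    """
    if needed_gb is None:
        needed_gb = total_download_gb()
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_gb * 1e9


if __name__ == "__main__":
    print(f"Estimated download: {total_download_gb():.1f} GB")
    print("Enough free space here:", has_room("."))
```

This only estimates the download itself; any temporary files or caches created during loading would need extra headroom.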

Set HF_TOKEN in the Space secrets to a token from an account that has been granted access to google/gemma-3-12b-it.
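
Space secrets are exposed to the app as environment variables, so a missing token can be caught at startup instead of partway through the download. This is a minimal sketch, not the code in app.py; the helper name `require_hf_token` is hypothetical.

```python
import os


def require_hf_token(env=os.environ):
    """Return the HF_TOKEN value, or raise if it is missing or blank.

    `env` defaults to the process environment; any mapping can be passed
    in its place, which makes the check easy to exercise in isolation.
    """
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Add it to the Space secrets; the token "
            "needs access to the gated google/gemma-3-12b-it repository."
        )
    return token
```

Note that this only confirms a token is present. Whether the token actually grants access to the gated repository is not known until the first download is attempted.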

License

  • Model weights: LTX-2 Community License Agreement
  • Code: MIT