Scenema Audio Low VRAM Mode

This repository mirrors the official Scenema Audio model files and includes a code mirror of the low-VRAM runtime fork for Windows/WSL Docker systems. Canonical development remains on GitHub, while this Hugging Face repo keeps the all-in-one model/code snapshot.

Runtime fork:

GitHub: Turbalo/scenema-audio-low-vram

Upstream model:

Hugging Face: ScenemaAI/scenema-audio

This fork is a rebuild of the official Scenema Audio runtime focused on running the pipeline on 16 GB VRAM hardware. It does not create a new model architecture and does not change the original model weights.

What This Fork Changes

Uses the official Scenema Audio INT8 transformer as the default audio model.
Uses unsloth/gemma-3-12b-it-bnb-4bit as the default low-VRAM Gemma text encoder.
Encodes all text chunks first, then unloads Gemma before audio diffusion.
Keeps model files in Docker/WSL volumes instead of slow Windows bind mounts.
Adds a Windows PowerShell one-command launcher.
Adds Gradio progress percentage, current stage, elapsed time, and a Stop button.
Adds GET /progress/{job_id} and POST /cancel/{job_id} API endpoints.
Saves successful generations as WAV plus JSON metadata.
Releases idle model memory after each request.
Trims Python/CUDA allocator memory after requests to reduce RSS growth across repeated generations.
Adds optional Scenema audio offload before Gemma, Whisper, and SeedVC phases; bf16 mode enables it automatically.
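
The memory-related items above (releasing idle models and trimming allocator memory) can be illustrated with a small Python sketch. It assumes a Linux container with glibc and an optional PyTorch install; the helper name trim_memory is hypothetical, and this is not the fork's actual implementation, only one common way to do this kind of cleanup:

```python
import ctypes
import gc


def trim_memory() -> dict:
    """Best-effort release of Python, CUDA, and glibc allocator memory.

    Hypothetical helper for illustration; returns which steps actually ran.
    """
    steps = {"gc": False, "cuda_cache": False, "malloc_trim": False}

    # 1. Drop unreachable Python objects so the tensors they hold can be freed.
    gc.collect()
    steps["gc"] = True

    # 2. Return cached, unused CUDA blocks to the driver (needs PyTorch + CUDA).
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            steps["cuda_cache"] = True
    except ImportError:
        pass  # PyTorch not installed: nothing to release on the GPU side.

    # 3. Ask glibc to hand free heap pages back to the OS, which lowers RSS.
    try:
        libc = ctypes.CDLL("libc.so.6")
        libc.malloc_trim(0)
        steps["malloc_trim"] = True
    except (OSError, AttributeError):
        pass  # Not glibc (Windows, macOS, musl): skip silently.

    return steps
```

Calling trim_memory() after a request finishes is cheap relative to a generation, and the returned dictionary makes it easy to log which steps were available on the current platform.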

The default low-VRAM runtime does not download the full google/gemma-3-12b-it checkpoint. A full-Gemma comparison is available as an optional mode in the GitHub repo.

The default runtime uses the INT8 Scenema audio transformer. The official bf16 Scenema audio transformer can be tested separately as an optional A/B mode.

Quick Start

Recommended lightweight clone from GitHub:

git clone https://github.com/Turbalo/scenema-audio-low-vram.git
cd scenema-audio-low-vram
.\Install-And-Run.ps1

All-in-one Hugging Face clone with model files in the same repository:

git clone https://huggingface.co/CENSORED666/scenema-audio-low-vram-mode
cd scenema-audio-low-vram-mode
.\Install-And-Run.ps1

The Docker runtime still caches active model files into Docker volumes on first run. The GitHub clone is smaller; the Hugging Face clone is useful when you want the mirrored code and original model files together.

Then open:

http://127.0.0.1:8000/ui/
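
Once the server is up, the same address also serves the GET /progress/{job_id} and POST /cancel/{job_id} endpoints listed under What This Fork Changes. The following Python sketch polls progress using only the standard library. The response schema is not documented in this card, so the `percent` and `stage` field names are assumptions, and the helper names are hypothetical:

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # local address from the Quick Start


def progress_url(job_id: str, base: str = BASE_URL) -> str:
    return f"{base}/progress/{job_id}"


def cancel_url(job_id: str, base: str = BASE_URL) -> str:
    return f"{base}/cancel/{job_id}"


def http_json(url: str, method: str = "GET") -> dict:
    """Send a request and decode the JSON body."""
    data = b"" if method == "POST" else None  # empty body for POST
    request = urllib.request.Request(url, data=data, method=method)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_job(job_id, fetch=http_json, interval=1.0, max_polls=600):
    """Poll until the assumed `percent` field reaches 100 or polls run out."""
    status = {}
    for _ in range(max_polls):
        status = fetch(progress_url(job_id))
        # ASSUMPTION: `percent` is an illustrative field name, not a documented one.
        if status.get("percent", 0) >= 100:
            break
        time.sleep(interval)
    return status
```

Under the same assumptions, http_json(cancel_url(job_id), method="POST") would request cancellation. Passing `fetch` as a parameter keeps the polling loop testable without a running server.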

Requirements:

Windows 11
WSL2
Docker Desktop with NVIDIA GPU support
NVIDIA GPU with 16 GB VRAM (the tested configuration)
Hugging Face token for upstream model downloads
NVMe SSD strongly recommended for Docker/WSL storage

More detailed instructions are available in the GitHub repository.

How The Low-VRAM Mode Works

The 16 GB VRAM runtime keeps only the model needed for the current phase active:

Plan chunks.
Load NF4 Gemma.
Encode all chunk prompts into conditioning tensors.
Unload Gemma.
Run Scenema audio diffusion chunk by chunk.
Run optional validation, vocal cleanup, and SeedVC.
Save output and offload idle models.
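
The core of the sequence above is that every chunk is encoded while Gemma is loaded, and diffusion starts only after Gemma has been unloaded. A toy Python model of that ordering follows; the strings stand in for tensors, the names `loaded` and `generate` are hypothetical, and none of this is the fork's actual code:

```python
from contextlib import contextmanager

resident = []  # names of models currently "loaded"
events = []    # ordered log of load/unload actions


@contextmanager
def loaded(name):
    """Stand-in for loading a model and guaranteeing it is unloaded afterwards."""
    resident.append(name)
    events.append(f"load {name}")
    try:
        yield name
    finally:
        resident.remove(name)
        events.append(f"unload {name}")


def generate(chunks):
    # Phase 1: encode every chunk prompt while the text encoder is resident.
    with loaded("gemma"):
        conditioning = [f"cond({chunk})" for chunk in chunks]

    # Phase 2: the encoder is gone; diffusion runs chunk by chunk.
    audio = []
    with loaded("audio"):
        for cond in conditioning:
            assert resident == ["audio"]  # the two models never overlap here
            audio.append(f"audio[{cond}]")
    return audio
```

Encoding all chunks up front means Gemma is loaded once per request rather than once per chunk, at the cost of holding the conditioning for every chunk in memory until diffusion finishes.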

Model weights normally move through the machine like this:

disk -> RAM -> VRAM

Because the runtime unloads Gemma after text encoding, later requests may need to reload it. NVMe storage for Docker/WSL is strongly recommended. HDDs or slow SATA SSDs can make reloads feel like the app is stuck.
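
To see why storage speed matters, a rough lower bound on reload time is checkpoint size divided by sequential read throughput. The figures in this Python sketch are illustrative ballpark assumptions (an 8 GB checkpoint and typical drive speeds), not measurements from this project:

```python
def reload_seconds(checkpoint_gb: float, read_mb_per_s: float) -> float:
    """Lower bound on load time: data to read divided by sequential read speed."""
    return checkpoint_gb * 1024 / read_mb_per_s


# ASSUMPTION: placeholder values chosen for illustration only.
CHECKPOINT_GB = 8.0
DRIVES_MB_PER_S = {"HDD": 150, "SATA SSD": 500, "NVMe SSD": 3000}

estimates = {
    name: round(reload_seconds(CHECKPOINT_GB, speed), 1)
    for name, speed in DRIVES_MB_PER_S.items()
}

for name, seconds in estimates.items():
    print(f"{name:<9} at least {seconds} s per reload")
```

Real reloads take longer than this bound because of filesystem overhead, the WSL storage layer, and the RAM-to-VRAM copy, but the ratio between drive types is the point.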

Default runtime environment:

AUDIO_CKPT=/app/models/scenema-audio-transformer-int8.safetensors
GEMMA_REPO=unsloth/gemma-3-12b-it-bnb-4bit
GEMMA_ROOT=/app/models/gemma-3-12b-it-bnb-4bit
GEMMA_QUANTIZE=nf4
GEMMA_OFFLOAD=unload
OFFLOAD_AUDIO_BEFORE_AUX=0

Optional bf16 audio mode keeps Gemma in the same NF4 unload setup but switches the Scenema audio checkpoint. It also sets OFFLOAD_AUDIO_BEFORE_AUX=1, and 64 GB of system RAM is recommended:

AUDIO_CKPT=/app/models/scenema-audio-transformer.safetensors

See optional bf16 audio mode.
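
The default environment and the bf16 switch described above can be summarized in a short Python sketch. The variable names and values come from this section; the function resolve_env and the precedence order (defaults, then bf16 overrides, then explicitly set variables) are assumptions made for illustration:

```python
import os

DEFAULTS = {
    "AUDIO_CKPT": "/app/models/scenema-audio-transformer-int8.safetensors",
    "GEMMA_REPO": "unsloth/gemma-3-12b-it-bnb-4bit",
    "GEMMA_ROOT": "/app/models/gemma-3-12b-it-bnb-4bit",
    "GEMMA_QUANTIZE": "nf4",
    "GEMMA_OFFLOAD": "unload",
    "OFFLOAD_AUDIO_BEFORE_AUX": "0",
}

# bf16 audio mode changes only the audio checkpoint and the offload flag.
BF16_OVERRIDES = {
    "AUDIO_CKPT": "/app/models/scenema-audio-transformer.safetensors",
    "OFFLOAD_AUDIO_BEFORE_AUX": "1",
}


def resolve_env(env=None, bf16_audio=False):
    """Merge defaults, optional bf16 overrides, then explicitly set variables.

    ASSUMPTION: this precedence is illustrative, not taken from the runtime.
    """
    env = os.environ if env is None else env
    merged = dict(DEFAULTS)
    if bf16_audio:
        merged.update(BF16_OVERRIDES)
    merged.update({key: env[key] for key in DEFAULTS if key in env})
    return merged
```

For example, resolve_env(env={}, bf16_audio=True) leaves all four Gemma settings at their defaults, which matches the statement that bf16 audio mode keeps the NF4 unload setup.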

Model Files In This Repository

This Hugging Face repo keeps the Scenema Audio model assets:

File	Description
`scenema-audio-transformer.safetensors`	Audio diffusion transformer, bf16
`scenema-audio-transformer-int8.safetensors`	Audio diffusion transformer, INT8
`scenema-audio-pipeline.safetensors`	Audio VAE decoder, vocoder, text projection
`scenema-audio-vae-encoder.safetensors`	Audio VAE encoder for reference conditioning
`sageattention-2.2.0-cp312-cp312-linux_x86_64.whl`	Optional SageAttention wheel used by the runtime image
`config.json`	Basic model metadata

The GitHub runtime downloads the needed files into Docker volumes on first start.

Benchmarks

Test system: RTX 4080 SUPER 16 GB, Windows/WSL Docker, seed 120, background_sfx=true, validate=false.

Test pair	Mode	Audio duration	Processing time	Peak VRAM
Long prompt	NF4 low-VRAM	50.34s	108.36s	14,488 MB
Long prompt	Optional full Gemma	46.24s	414.21s	13,220 MB
Short prompt	NF4 low-VRAM	12.20s	19.72s	14,488 MB
Short prompt	Optional full Gemma	10.66s	194.98s	8,428 MB
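
Because the two modes produced clips of different lengths, processing times are easier to compare after normalizing by audio duration. This Python sketch computes a real-time factor (processing seconds per second of audio) from the rows above; the metric is an added convenience, not something reported by the runtime:

```python
# (test pair, mode, audio seconds, processing seconds) copied from the table.
RUNS = [
    ("Long prompt", "NF4 low-VRAM", 50.34, 108.36),
    ("Long prompt", "Optional full Gemma", 46.24, 414.21),
    ("Short prompt", "NF4 low-VRAM", 12.20, 19.72),
    ("Short prompt", "Optional full Gemma", 10.66, 194.98),
]


def real_time_factor(audio_s: float, processing_s: float) -> float:
    """Seconds of compute per second of generated audio (lower is faster)."""
    return processing_s / audio_s


rtf = {
    (pair, mode): round(real_time_factor(audio_s, proc_s), 2)
    for pair, mode, audio_s, proc_s in RUNS
}

for (pair, mode), value in rtf.items():
    print(f"{pair:<13} {mode:<20} RTF {value:.2f}")
```

On these numbers, NF4 mode needs about 2.15 s (long prompt) and 1.62 s (short prompt) of processing per second of audio, against about 8.96 s and 18.29 s for the optional full Gemma mode.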

Observed results:

NF4 low-VRAM mode is the practical default for iteration.
Full Gemma followed prompt context more closely in listening tests, but was much slower.
Russian stress/pronunciation errors appeared in both modes, so they were not solved by replacing NF4 Gemma with full Gemma.

What To Use Where

Use this Hugging Face repository for:

model files
model card
links to the low-VRAM runtime

Use the GitHub repository for:

Docker runtime
Windows launch scripts
Gradio UI changes
API changes
tests and implementation code
optional bf16 audio transformer runtime mode

GitHub runtime:

https://github.com/Turbalo/scenema-audio-low-vram

Credits

This low-VRAM runtime builds on:

ScenemaAI/scenema-audio - original Scenema Audio model and pipeline.
ScenemaAI - project page and demos.
Lightricks/LTX-2 - LTX audiovisual components used by Scenema Audio.
Google Gemma - Gemma 3 12B instruction model family.
unsloth/gemma-3-12b-it-bnb-4bit - default NF4 Gemma checkpoint for low-VRAM mode.
Plachtaa/seed-vc - SeedVC voice conversion.
Kijai/ComfyUI-MelBandRoFormer and MelBandRoFormer weights - vocal/background separation.
NVIDIA BigVGAN - vocoder used by SeedVC.
OpenAI Whisper and faster-whisper - optional speech validation.
Kokoro - duration estimation for chunk planning.

License

The model weights remain under the LTX-2 Community License Agreement. Scenema Audio's audio diffusion transformer is derived from LTX 2.3's audiovisual model, and its weights are subject to the same terms.

Runtime code changes are MIT-licensed where applicable, following the upstream repository.

Model weights, Gemma, SeedVC, BigVGAN, Whisper, Kokoro, and related components remain governed by their respective licenses and terms.

Security Notes

The PowerShell files are plain-text launch helpers for Docker Compose. They are not obfuscated and do not download or execute hidden binaries outside the Docker runtime.

VirusTotal API scan results for the tracked PowerShell scripts on 2026-05-17:

Script	SHA256	VirusTotal result
`Generate-TestAudio.ps1`	`507f59f53fb58ebf40491ca8328fd1b401c8953e707eee1d96b7d252f48d17fc`	0 malicious / 0 suspicious
`Install-And-Run.ps1`	`1b54bea310b340fe02c4bc389c001746ed4cf124d4527e8c4b49a7fa932080a8`	0 malicious / 0 suspicious
`Start-ScenemaAudio-BF16Audio.ps1`	`09e281e2ef88e72da7092a38162bcb803474d7717b759593545583368e1076d6`	0 malicious / 0 suspicious
`Start-ScenemaAudio-FullGemma.ps1`	`cc508bf96d46d0f7ac9bbaa94d58f5f6e048fb0a8b2c887372b444e1a0f27c2f`	0 malicious / 0 suspicious
`Start-ScenemaAudio.ps1`	`6345bc98078296e030f12d48e1ca484df2a1efcb8c1874ff85e80b000dd121ab`	0 malicious / 0 suspicious
`Stop-ScenemaAudio-BF16Audio.ps1`	`998919de5a75c5602cad38899b927a291378c9b6978b56240aa2f157160ebc47`	0 malicious / 0 suspicious
`Stop-ScenemaAudio-FullGemma.ps1`	`412e2587db5b2e93b02b31fe60659a2d9f12d9bdbadc9b855f2f34eaaac8f6b5`	0 malicious / 0 suspicious

VirusTotal results can change if files are re-analyzed later. Inspect the scripts before running them if your local policy requires it.
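
One way to inspect the scripts is to confirm that a local copy matches the SHA256 value in the table before running it. A minimal Python sketch using only the standard library follows; the helper names are hypothetical:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size blocks so large files do not need to fit in memory.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """True when the local file's digest equals the published one."""
    return sha256_of(path) == published_hex.strip().lower()
```

For example, matches_published("Install-And-Run.ps1", "1b54bea310b340fe02c4bc389c001746ed4cf124d4527e8c4b49a7fa932080a8") checks the launcher against the table. A mismatch does not by itself indicate tampering: a newer commit, or Git line-ending conversion on Windows, also changes the file bytes and therefore the hash.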
