Scenema Audio Low VRAM Mode

This repository mirrors the official Scenema Audio model files and includes a code mirror of the low-VRAM runtime fork for Windows/WSL Docker systems. Canonical development remains on GitHub, while this Hugging Face repo keeps the all-in-one model/code snapshot.

Runtime fork:

GitHub: Turbalo/scenema-audio-low-vram


Upstream model:

Hugging Face: ScenemaAI/scenema-audio

This fork is a rebuild of the official Scenema Audio runtime focused on running the pipeline on 16 GB VRAM hardware. It does not create a new model architecture and does not change the original model weights.

What This Fork Changes

  • Uses the official Scenema Audio INT8 transformer as the default audio model.
  • Uses unsloth/gemma-3-12b-it-bnb-4bit as the default low-VRAM Gemma text encoder.
  • Encodes all text chunks first, then unloads Gemma before audio diffusion.
  • Keeps model files in Docker/WSL volumes instead of slow Windows bind mounts.
  • Adds a Windows PowerShell one-command launcher.
  • Adds Gradio progress percentage, current stage, elapsed time, and a Stop button.
  • Adds GET /progress/{job_id} and POST /cancel/{job_id} API endpoints.
  • Saves successful generations as WAV plus JSON metadata.
  • Releases idle model memory after each request.
  • Trims Python/CUDA allocator memory after requests to reduce RSS growth across repeated generations.
  • Adds optional Scenema audio offload before Gemma, Whisper, and SeedVC phases; bf16 mode enables it automatically.
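The two API endpoints in the list above can be driven from a small client. The following is a minimal sketch, not the fork's own client code: it assumes the API is served on the same host and port as the UI in Quick Start, that responses are JSON, and that you already have a job ID from your own request. The response schema is not documented here, so the completion check is passed in as a caller-supplied function, and all helper names are hypothetical.

```python
import json
import time
import urllib.request

# Assumption: the API shares the UI's host and port (see Quick Start).
BASE_URL = "http://127.0.0.1:8000"


def progress_url(job_id: str, base_url: str = BASE_URL) -> str:
    """Build the GET /progress/{job_id} URL."""
    return f"{base_url.rstrip('/')}/progress/{job_id}"


def cancel_url(job_id: str, base_url: str = BASE_URL) -> str:
    """Build the POST /cancel/{job_id} URL."""
    return f"{base_url.rstrip('/')}/cancel/{job_id}"


def http_json(url: str, method: str = "GET") -> dict:
    """Send a request and decode the body as JSON (assumes a JSON response)."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def poll_until(job_id, is_finished, fetch=http_json, interval_s=1.0, max_polls=600):
    """Poll the progress endpoint until is_finished(payload) returns True.

    Returns the last payload, or None if max_polls is reached first.
    The payload fields are not assumed here; is_finished decides.
    """
    for _ in range(max_polls):
        payload = fetch(progress_url(job_id))
        if is_finished(payload):
            return payload
        time.sleep(interval_s)
    return None
```

To stop a running job, the same helper can issue the cancel call, for example `http_json(cancel_url(job_id), method="POST")`. Check the GitHub repository for the actual response fields before relying on any particular key.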

The default low-VRAM runtime does not download the full google/gemma-3-12b-it model. A full-Gemma comparison is available as an optional mode in the GitHub repo.

The default runtime uses the INT8 Scenema audio transformer. The official bf16 Scenema audio transformer can be tested separately as an optional A/B mode.

Quick Start

Recommended lightweight clone from GitHub:

git clone https://github.com/Turbalo/scenema-audio-low-vram.git
cd scenema-audio-low-vram
.\Install-And-Run.ps1

All-in-one Hugging Face clone with model files in the same repository:

git clone https://huggingface.co/CENSORED666/scenema-audio-low-vram-mode
cd scenema-audio-low-vram-mode
.\Install-And-Run.ps1

The Docker runtime still caches active model files into Docker volumes on first run. The GitHub clone is smaller; the Hugging Face clone is useful when you want the mirrored code and original model files together.

Then open:

http://127.0.0.1:8000/ui/

Requirements:

  • Windows 11
  • WSL2
  • Docker Desktop with NVIDIA GPU support
  • NVIDIA GPU (tested with 16 GB VRAM)
  • Hugging Face token for upstream model downloads
  • NVMe SSD strongly recommended for Docker/WSL storage

More detailed instructions:

How The Low-VRAM Mode Works

The 16 GB VRAM runtime keeps only the model needed for the current phase active:

  1. Plan chunks.
  2. Load NF4 Gemma.
  3. Encode all chunk prompts into conditioning tensors.
  4. Unload Gemma.
  5. Run Scenema audio diffusion chunk by chunk.
  6. Run optional validation, vocal cleanup, and SeedVC.
  7. Save output and offload idle models.
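The phase ordering above can be illustrated with stub models. This is a sketch of the sequencing idea only: `StubModel` and `run_request` are hypothetical names, and the stubs record load and unload events instead of touching real weights or VRAM.

```python
class StubModel:
    """Stand-in for a real model; records load/unload instead of using VRAM."""

    def __init__(self, name: str, events: list):
        self.name = name
        self.events = events
        self.loaded = False

    def load(self) -> None:
        self.loaded = True
        self.events.append(f"load {self.name}")

    def unload(self) -> None:
        self.loaded = False
        self.events.append(f"unload {self.name}")


def run_request(chunks, text_encoder, audio_model, events):
    """Encode all chunks first, unload the encoder, then diffuse chunk by chunk."""
    # Text phase: only the text encoder is resident.
    text_encoder.load()
    conditioning = []
    for chunk in chunks:
        events.append(f"encode {chunk}")
        conditioning.append(f"cond({chunk})")
    text_encoder.unload()

    # Audio phase: the encoder is gone, so the audio model has the VRAM to itself.
    audio_model.load()
    clips = []
    for cond in conditioning:
        events.append(f"diffuse {cond}")
        clips.append(f"audio[{cond}]")
    audio_model.unload()
    return clips


events = []
clips = run_request(
    ["intro", "outro"],
    StubModel("gemma", events),
    StubModel("scenema-audio", events),
    events,
)
print(events)
```

The point of the design is that the two large models never need to be resident at the same time: every `encode` event happens before `unload gemma`, and every `diffuse` event happens after it.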

Model weights normally move through the machine like this:

disk -> RAM -> VRAM

Because the runtime unloads Gemma after text encoding, later requests may need to reload it. NVMe storage for Docker/WSL is strongly recommended. HDDs or slow SATA SSDs can make reloads feel like the app is stuck.
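A back-of-the-envelope estimate shows why storage speed matters for reloads. The sketch below only divides checkpoint size by sustained read throughput; the size and throughput figures are illustrative placeholders, not measurements from this project, and real reloads also include WSL/Docker filesystem overhead and the RAM-to-VRAM copy.

```python
def estimated_load_seconds(size_gb: float, read_gb_per_s: float) -> float:
    """Lower-bound time to read a checkpoint from disk into RAM."""
    if read_gb_per_s <= 0:
        raise ValueError("read throughput must be positive")
    return size_gb / read_gb_per_s


# Illustrative placeholders only: substitute your own checkpoint size
# and the sustained read speed you measure on your storage.
EXAMPLE_SIZE_GB = 8.0
EXAMPLE_DRIVES = {"hdd": 0.15, "sata_ssd": 0.5, "nvme": 3.0}

for drive, throughput in EXAMPLE_DRIVES.items():
    seconds = estimated_load_seconds(EXAMPLE_SIZE_GB, throughput)
    print(f"{drive}: about {seconds:.1f} s")
```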

Default runtime environment:

AUDIO_CKPT=/app/models/scenema-audio-transformer-int8.safetensors
GEMMA_REPO=unsloth/gemma-3-12b-it-bnb-4bit
GEMMA_ROOT=/app/models/gemma-3-12b-it-bnb-4bit
GEMMA_QUANTIZE=nf4
GEMMA_OFFLOAD=unload
OFFLOAD_AUDIO_BEFORE_AUX=0

Optional bf16 audio mode keeps Gemma in the same NF4 unload setup but switches the Scenema audio checkpoint and sets OFFLOAD_AUDIO_BEFORE_AUX=1. 64 GB of system RAM is recommended for this mode:

AUDIO_CKPT=/app/models/scenema-audio-transformer.safetensors

See optional bf16 audio mode.
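For illustration, the default environment and the bf16 changes described above can be expressed as a small settings resolver. This is a sketch, not the runtime's actual configuration code: the function name is hypothetical, and the precedence rule (explicitly set environment variables win over mode defaults) is an assumption.

```python
import os

# Values copied from the default runtime environment listed above.
DEFAULTS = {
    "AUDIO_CKPT": "/app/models/scenema-audio-transformer-int8.safetensors",
    "GEMMA_REPO": "unsloth/gemma-3-12b-it-bnb-4bit",
    "GEMMA_ROOT": "/app/models/gemma-3-12b-it-bnb-4bit",
    "GEMMA_QUANTIZE": "nf4",
    "GEMMA_OFFLOAD": "unload",
    "OFFLOAD_AUDIO_BEFORE_AUX": "0",
}

# bf16 audio mode: different audio checkpoint, audio offload enabled.
BF16_AUDIO_OVERRIDES = {
    "AUDIO_CKPT": "/app/models/scenema-audio-transformer.safetensors",
    "OFFLOAD_AUDIO_BEFORE_AUX": "1",
}


def resolve_settings(env=None, bf16_audio: bool = False) -> dict:
    """Merge defaults, optional bf16 overrides, then explicit environment values."""
    env = os.environ if env is None else env
    settings = dict(DEFAULTS)
    if bf16_audio:
        settings.update(BF16_AUDIO_OVERRIDES)
    # Assumption: a variable set explicitly by the user takes precedence.
    for key in DEFAULTS:
        if key in env:
            settings[key] = env[key]
    return settings


print(resolve_settings(env={}, bf16_audio=True)["AUDIO_CKPT"])
```

Note that the Gemma settings are identical in both modes; only the audio checkpoint and the audio offload flag differ.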

Model Files In This Repository

This Hugging Face repo keeps the Scenema Audio model assets:

| File | Description |
| --- | --- |
| scenema-audio-transformer.safetensors | Audio diffusion transformer, bf16 |
| scenema-audio-transformer-int8.safetensors | Audio diffusion transformer, INT8 |
| scenema-audio-pipeline.safetensors | Audio VAE decoder, vocoder, text projection |
| scenema-audio-vae-encoder.safetensors | Audio VAE encoder for reference conditioning |
| sageattention-2.2.0-cp312-cp312-linux_x86_64.whl | Optional SageAttention wheel used by the runtime image |
| config.json | Basic model metadata |

The GitHub runtime downloads the needed files into Docker volumes on first start.

Benchmarks

Test system: RTX 4080 SUPER 16 GB, Windows/WSL Docker, seed 120, background_sfx=true, validate=false.

| Pair | Mode | Audio duration | Processing time | Peak VRAM |
| --- | --- | --- | --- | --- |
| Long prompt | NF4 low-VRAM | 50.34 s | 108.36 s | 14,488 MB |
| Long prompt | Optional full Gemma | 46.24 s | 414.21 s | 13,220 MB |
| Short prompt | NF4 low-VRAM | 12.20 s | 19.72 s | 14,488 MB |
| Short prompt | Optional full Gemma | 10.66 s | 194.98 s | 8,428 MB |

Observed results:

  • NF4 low-VRAM mode is the practical default for iteration.
  • Full Gemma followed prompt context more closely in listening tests, but was much slower.
  • Russian stress/pronunciation errors appeared in both modes, so they were not solved by replacing NF4 Gemma with full Gemma.
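Because the audio durations differ between runs, processing times are easier to compare as a real-time factor (processing time divided by audio duration; lower is faster). The sketch below computes it from the benchmark figures in this section; the metric is added here for illustration and is not part of the original measurements.

```python
# (pair, mode, audio duration in s, processing time in s) from the benchmark table.
BENCHMARKS = [
    ("Long prompt", "NF4 low-VRAM", 50.34, 108.36),
    ("Long prompt", "Optional full Gemma", 46.24, 414.21),
    ("Short prompt", "NF4 low-VRAM", 12.20, 19.72),
    ("Short prompt", "Optional full Gemma", 10.66, 194.98),
]


def real_time_factor(audio_s: float, processing_s: float) -> float:
    """Seconds of processing needed per second of generated audio."""
    return processing_s / audio_s


for pair, mode, audio_s, processing_s in BENCHMARKS:
    rtf = real_time_factor(audio_s, processing_s)
    print(f"{pair:<13} {mode:<20} RTF {rtf:.2f}")
```

On these numbers, NF4 low-VRAM mode needs roughly 2.2 s (long prompt) and 1.6 s (short prompt) of processing per second of audio, while the full Gemma mode needs roughly 9.0 s and 18.3 s.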

What To Use Where

Use this Hugging Face repository for:

  • model files
  • model card
  • links to the low-VRAM runtime

Use the GitHub repository for:

  • Docker runtime
  • Windows launch scripts
  • Gradio UI changes
  • API changes
  • tests and implementation code
  • optional bf16 audio transformer runtime mode

GitHub runtime:

https://github.com/Turbalo/scenema-audio-low-vram

Credits

This low-VRAM runtime builds on:

License

The model weights remain under the LTX-2 Community License Agreement. Scenema Audio's audio diffusion transformer is derived from LTX 2.3's audiovisual model, and its weights are subject to the same terms.

Runtime code changes are MIT-licensed where applicable, following the upstream repository.

Model weights, Gemma, SeedVC, BigVGAN, Whisper, Kokoro, and related components remain governed by their respective licenses and terms.

Security Notes

The PowerShell files are plain-text launch helpers for Docker Compose. They are not obfuscated and do not download or execute hidden binaries outside the Docker runtime.

VirusTotal API scan results for the tracked PowerShell scripts on 2026-05-17:

| Script | SHA256 | VirusTotal result |
| --- | --- | --- |
| Generate-TestAudio.ps1 | 507f59f53fb58ebf40491ca8328fd1b401c8953e707eee1d96b7d252f48d17fc | 0 malicious / 0 suspicious |
| Install-And-Run.ps1 | 1b54bea310b340fe02c4bc389c001746ed4cf124d4527e8c4b49a7fa932080a8 | 0 malicious / 0 suspicious |
| Start-ScenemaAudio-BF16Audio.ps1 | 09e281e2ef88e72da7092a38162bcb803474d7717b759593545583368e1076d6 | 0 malicious / 0 suspicious |
| Start-ScenemaAudio-FullGemma.ps1 | cc508bf96d46d0f7ac9bbaa94d58f5f6e048fb0a8b2c887372b444e1a0f27c2f | 0 malicious / 0 suspicious |
| Start-ScenemaAudio.ps1 | 6345bc98078296e030f12d48e1ca484df2a1efcb8c1874ff85e80b000dd121ab | 0 malicious / 0 suspicious |
| Stop-ScenemaAudio-BF16Audio.ps1 | 998919de5a75c5602cad38899b927a291378c9b6978b56240aa2f157160ebc47 | 0 malicious / 0 suspicious |
| Stop-ScenemaAudio-FullGemma.ps1 | 412e2587db5b2e93b02b31fe60659a2d9f12d9bdbadc9b855f2f34eaaac8f6b5 | 0 malicious / 0 suspicious |

VirusTotal results can change if files are re-analyzed later. Inspect the scripts before running them if your local policy requires it.
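One way to inspect the scripts is to check that your local copies match the SHA256 values listed in this section. The following is a minimal sketch using Python's standard hashlib; the helper names are hypothetical. A mismatch does not by itself mean tampering: the scripts may have been updated since the scan, or Git may have converted line endings on checkout.

```python
import hashlib


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("Install-And-Run.ps1", "1b54bea310b340fe02c4bc389c001746ed4cf124d4527e8c4b49a7fa932080a8")` checks the launcher against the value listed for it.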
