Scenema Audio Low VRAM Mode
This repository mirrors the official Scenema Audio model files and includes a code mirror of the low-VRAM runtime fork for Windows/WSL Docker systems. Canonical development remains on GitHub, while this Hugging Face repo keeps the all-in-one model/code snapshot.
Runtime fork:
GitHub: Turbalo/scenema-audio-low-vram
Upstream model:
Hugging Face: ScenemaAI/scenema-audio
This fork is a rebuild of the official Scenema Audio runtime focused on running the pipeline on 16 GB VRAM hardware. It does not create a new model architecture and does not change the original model weights.
What This Fork Changes
- Uses the official Scenema Audio INT8 transformer as the default audio model.
- Uses unsloth/gemma-3-12b-it-bnb-4bit as the default low-VRAM Gemma text encoder.
- Encodes all text chunks first, then unloads Gemma before audio diffusion.
- Keeps model files in Docker/WSL volumes instead of slow Windows bind mounts.
- Adds a Windows PowerShell one-command launcher.
- Adds Gradio progress percentage, current stage, elapsed time, and a Stop button.
- Adds GET /progress/{job_id} and POST /cancel/{job_id} API endpoints.
- Saves successful generations as WAV plus JSON metadata.
- Releases idle model memory after each request.
- Trims Python/CUDA allocator memory after requests to reduce RSS growth across repeated generations.
- Adds optional Scenema audio offload before Gemma, Whisper, and SeedVC phases; bf16 mode enables it automatically.
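The progress and cancel endpoints listed above can be driven from a small client. The following is a minimal sketch, not the fork's own client code. It assumes the API is served from the same host and port as the UI (http://127.0.0.1:8000), that both endpoints answer with JSON, and that the progress payload carries a boolean `done` field. The field name and all function names here are hypothetical, so check the GitHub repository for the real response schema.

```python
import json
import time
import urllib.request

# ASSUMPTION: the API shares the host/port of the UI address shown in Quick Start.
BASE_URL = "http://127.0.0.1:8000"


def progress_url(job_id, base_url=BASE_URL):
    """Build the GET /progress/{job_id} URL."""
    return f"{base_url}/progress/{job_id}"


def cancel_url(job_id, base_url=BASE_URL):
    """Build the POST /cancel/{job_id} URL."""
    return f"{base_url}/cancel/{job_id}"


def http_json(url, method="GET"):
    """Send a request and decode the body as JSON (ASSUMED response format)."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_job(job_id, fetch=http_json, poll_seconds=2.0,
                 timeout_seconds=None, sleep=time.sleep):
    """Poll progress until the job reports completion.

    If timeout_seconds elapses first, send a cancel request and return its reply.
    """
    started = time.monotonic()
    while True:
        status = fetch(progress_url(job_id), "GET")
        if status.get("done"):  # ASSUMED field name, not confirmed by the source
            return status
        if timeout_seconds is not None and time.monotonic() - started > timeout_seconds:
            return fetch(cancel_url(job_id), "POST")
        sleep(poll_seconds)
```

The `fetch` and `sleep` parameters are injectable so the polling logic can be exercised without a running container.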
The default low-VRAM runtime does not download the full google/gemma-3-12b-it checkpoint. A comparison mode that uses full Gemma is available as an option in the GitHub repo.
The default runtime uses the INT8 Scenema audio transformer. The official bf16 Scenema audio transformer can be tested separately as an optional A/B mode.
Quick Start
Recommended lightweight clone from GitHub:
git clone https://github.com/Turbalo/scenema-audio-low-vram.git
cd scenema-audio-low-vram
.\Install-And-Run.ps1
All-in-one Hugging Face clone with model files in the same repository:
git clone https://huggingface.co/CENSORED666/scenema-audio-low-vram-mode
cd scenema-audio-low-vram-mode
.\Install-And-Run.ps1
The Docker runtime still caches active model files into Docker volumes on first run. The GitHub clone is smaller; the Hugging Face clone is useful when you want the mirrored code and original model files together.
Then open:
http://127.0.0.1:8000/ui/
Requirements:
- Windows 11
- WSL2
- Docker Desktop with NVIDIA GPU support
- NVIDIA GPU with 16 GB VRAM (the tested configuration)
- Hugging Face token for upstream model downloads
- NVMe SSD strongly recommended for Docker/WSL storage
More detailed instructions are in the GitHub repository.
How The Low-VRAM Mode Works
The 16 GB VRAM runtime keeps only the model needed for the current phase active:
- Plan chunks.
- Load NF4 Gemma.
- Encode all chunk prompts into conditioning tensors.
- Unload Gemma.
- Run Scenema audio diffusion chunk by chunk.
- Run optional validation, vocal cleanup, and SeedVC.
- Save output and offload idle models.
Model weights normally move through the machine like this:
disk -> RAM -> VRAM
Because the runtime unloads Gemma after text encoding, later requests may need to reload it. NVMe storage for Docker/WSL is strongly recommended. HDDs or slow SATA SSDs can make reloads feel like the app is stuck.
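Unloading a model only helps if the freed memory actually returns to the driver and the operating system. A minimal sketch of one way to do that in a PyTorch process on Linux is shown below. It is an assumption about the general technique, not a copy of the fork's implementation, and the function name `release_memory` is hypothetical. It uses `gc.collect()`, `torch.cuda.empty_cache()`, and glibc's `malloc_trim`, all of which are best-effort.

```python
import ctypes
import gc
import sys


def release_memory():
    """Best-effort release of Python, CUDA, and glibc allocator memory."""
    report = {
        "gc_collected": gc.collect(),  # number of unreachable objects found
        "cuda_cache_cleared": False,
        "malloc_trim_called": False,
    }
    try:
        import torch  # optional: only present inside a PyTorch runtime

        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached, unused blocks to the driver
            report["cuda_cache_cleared"] = True
    except ImportError:
        pass
    if sys.platform.startswith("linux"):
        try:
            libc = ctypes.CDLL("libc.so.6")
            libc.malloc_trim(0)  # ask glibc to hand free heap pages back to the OS
            report["malloc_trim_called"] = True
        except (OSError, AttributeError):
            pass  # a non-glibc libc (for example musl) has no malloc_trim
    return report
```

Call it only after the last reference to the model has been dropped. Tensors that are still referenced cannot be freed by any of these calls.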
Default runtime environment:
AUDIO_CKPT=/app/models/scenema-audio-transformer-int8.safetensors
GEMMA_REPO=unsloth/gemma-3-12b-it-bnb-4bit
GEMMA_ROOT=/app/models/gemma-3-12b-it-bnb-4bit
GEMMA_QUANTIZE=nf4
GEMMA_OFFLOAD=unload
OFFLOAD_AUDIO_BEFORE_AUX=0
Optional bf16 audio mode keeps Gemma in the same NF4 unload setup but switches the Scenema audio checkpoint and sets OFFLOAD_AUDIO_BEFORE_AUX=1. 64 GB of system RAM is recommended for this mode:
AUDIO_CKPT=/app/models/scenema-audio-transformer.safetensors
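As an illustration of how the two configurations relate, the sketch below resolves these variables in Python. The variable names and default values are the ones listed above. The `RuntimeConfig` class, the `load_config` function, and the rule that "1" means enabled are assumptions made for this example and may not match how the runtime parses its environment.

```python
import os
from dataclasses import dataclass

# Default low-VRAM values, as listed in this section.
DEFAULTS = {
    "AUDIO_CKPT": "/app/models/scenema-audio-transformer-int8.safetensors",
    "GEMMA_REPO": "unsloth/gemma-3-12b-it-bnb-4bit",
    "GEMMA_ROOT": "/app/models/gemma-3-12b-it-bnb-4bit",
    "GEMMA_QUANTIZE": "nf4",
    "GEMMA_OFFLOAD": "unload",
    "OFFLOAD_AUDIO_BEFORE_AUX": "0",
}

# The optional bf16 audio mode changes only these two values.
BF16_OVERRIDES = {
    "AUDIO_CKPT": "/app/models/scenema-audio-transformer.safetensors",
    "OFFLOAD_AUDIO_BEFORE_AUX": "1",
}


@dataclass(frozen=True)
class RuntimeConfig:  # hypothetical container, not part of the fork
    audio_ckpt: str
    gemma_repo: str
    gemma_root: str
    gemma_quantize: str
    gemma_offload: str
    offload_audio_before_aux: bool


def load_config(env=os.environ, bf16_audio=False):
    """Resolve settings: defaults, then bf16 overrides, then explicit env values."""
    values = dict(DEFAULTS)
    if bf16_audio:
        values.update(BF16_OVERRIDES)
    for key in DEFAULTS:
        if key in env:
            values[key] = env[key]
    return RuntimeConfig(
        audio_ckpt=values["AUDIO_CKPT"],
        gemma_repo=values["GEMMA_REPO"],
        gemma_root=values["GEMMA_ROOT"],
        gemma_quantize=values["GEMMA_QUANTIZE"],
        gemma_offload=values["GEMMA_OFFLOAD"],
        # ASSUMPTION: the flag is treated as enabled only when it equals "1".
        offload_audio_before_aux=values["OFFLOAD_AUDIO_BEFORE_AUX"] == "1",
    )
```

For example, `load_config(env={}, bf16_audio=True)` keeps the NF4 Gemma settings and changes only the audio checkpoint and the offload flag.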
Model Files In This Repository
This Hugging Face repo keeps the Scenema Audio model assets:
| File | Description |
|---|---|
| scenema-audio-transformer.safetensors | Audio diffusion transformer, bf16 |
| scenema-audio-transformer-int8.safetensors | Audio diffusion transformer, INT8 |
| scenema-audio-pipeline.safetensors | Audio VAE decoder, vocoder, text projection |
| scenema-audio-vae-encoder.safetensors | Audio VAE encoder for reference conditioning |
| sageattention-2.2.0-cp312-cp312-linux_x86_64.whl | Optional SageAttention wheel used by the runtime image |
| config.json | Basic model metadata |
The GitHub runtime downloads the needed files into Docker volumes on first start.
Benchmarks
Test system: RTX 4080 SUPER 16 GB, Windows/WSL Docker, seed 120, background_sfx=true, validate=false.
| Pair | Mode | Audio duration | Processing time | Peak VRAM |
|---|---|---|---|---|
| Long prompt | NF4 low-VRAM | 50.34s | 108.36s | 14,488 MB |
| Long prompt | Optional full Gemma | 46.24s | 414.21s | 13,220 MB |
| Short prompt | NF4 low-VRAM | 12.20s | 19.72s | 14,488 MB |
| Short prompt | Optional full Gemma | 10.66s | 194.98s | 8,428 MB |
Observed results:
- NF4 low-VRAM mode is the practical default for iteration.
- Full Gemma followed prompt context more closely in listening tests, but was much slower.
- Russian stress/pronunciation errors appeared in both modes, so they were not solved by replacing NF4 Gemma with full Gemma.
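Because the two modes produced clips of different lengths, raw processing times are easier to compare after dividing by the audio duration. The short script below computes that real-time factor (processing seconds per second of generated audio) from the table values. The metric and the helper names are additions for illustration and are not part of the original benchmark.

```python
from typing import NamedTuple


class Run(NamedTuple):
    pair: str
    mode: str
    audio_s: float
    processing_s: float
    peak_vram_mb: int


# Values copied from the benchmark table.
RUNS = [
    Run("Long prompt", "NF4 low-VRAM", 50.34, 108.36, 14488),
    Run("Long prompt", "Optional full Gemma", 46.24, 414.21, 13220),
    Run("Short prompt", "NF4 low-VRAM", 12.20, 19.72, 14488),
    Run("Short prompt", "Optional full Gemma", 10.66, 194.98, 8428),
]


def real_time_factor(run):
    """Seconds of processing needed per second of generated audio."""
    return run.processing_s / run.audio_s


for run in RUNS:
    print(f"{run.pair:<13} {run.mode:<20} RTF = {real_time_factor(run):.2f}")
```

On these four runs the factor is about 2.15 and 1.62 for NF4 low-VRAM mode, against about 8.96 and 18.29 for full Gemma. These are single runs on one system, so treat the ratios as indicative only.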
What To Use Where
Use this Hugging Face repository for:
- model files
- model card
- links to the low-VRAM runtime
Use the GitHub repository for:
- Docker runtime
- Windows launch scripts
- Gradio UI changes
- API changes
- tests and implementation code
- optional bf16 audio transformer runtime mode
GitHub runtime:
https://github.com/Turbalo/scenema-audio-low-vram
Credits
This low-VRAM runtime builds on:
- ScenemaAI/scenema-audio - original Scenema Audio model and pipeline.
- ScenemaAI - project page and demos.
- Lightricks/LTX-2 - LTX audiovisual components used by Scenema Audio.
- Google Gemma - Gemma 3 12B instruction model family.
- unsloth/gemma-3-12b-it-bnb-4bit - default NF4 Gemma checkpoint for low-VRAM mode.
- Plachtaa/seed-vc - SeedVC voice conversion.
- Kijai/ComfyUI-MelBandRoFormer and MelBandRoFormer weights - vocal/background separation.
- NVIDIA BigVGAN - vocoder used by SeedVC.
- OpenAI Whisper and faster-whisper - optional speech validation.
- Kokoro - duration estimation for chunk planning.
License
The model weights remain under the LTX-2 Community License Agreement. Scenema Audio's audio diffusion transformer is derived from LTX 2.3's audiovisual model, and its weights are subject to the same terms.
Runtime code changes are MIT-licensed where applicable, following the upstream repository.
Model weights, Gemma, SeedVC, BigVGAN, Whisper, Kokoro, and related components remain governed by their respective licenses and terms.
Security Notes
The PowerShell files are plain-text launch helpers for Docker Compose. They are not obfuscated and do not download or execute hidden binaries outside the Docker runtime.
VirusTotal API scan results for the tracked PowerShell scripts on 2026-05-17:
| Script | SHA256 | VirusTotal result |
|---|---|---|
| Generate-TestAudio.ps1 | 507f59f53fb58ebf40491ca8328fd1b401c8953e707eee1d96b7d252f48d17fc | 0 malicious / 0 suspicious |
| Install-And-Run.ps1 | 1b54bea310b340fe02c4bc389c001746ed4cf124d4527e8c4b49a7fa932080a8 | 0 malicious / 0 suspicious |
| Start-ScenemaAudio-BF16Audio.ps1 | 09e281e2ef88e72da7092a38162bcb803474d7717b759593545583368e1076d6 | 0 malicious / 0 suspicious |
| Start-ScenemaAudio-FullGemma.ps1 | cc508bf96d46d0f7ac9bbaa94d58f5f6e048fb0a8b2c887372b444e1a0f27c2f | 0 malicious / 0 suspicious |
| Start-ScenemaAudio.ps1 | 6345bc98078296e030f12d48e1ca484df2a1efcb8c1874ff85e80b000dd121ab | 0 malicious / 0 suspicious |
| Stop-ScenemaAudio-BF16Audio.ps1 | 998919de5a75c5602cad38899b927a291378c9b6978b56240aa2f157160ebc47 | 0 malicious / 0 suspicious |
| Stop-ScenemaAudio-FullGemma.ps1 | 412e2587db5b2e93b02b31fe60659a2d9f12d9bdbadc9b855f2f34eaaac8f6b5 | 0 malicious / 0 suspicious |
VirusTotal results can change if files are re-analyzed later. Inspect the scripts before running them if your local policy requires it.
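One way to check that local copies match the scanned files is to recompute their SHA256 digests and compare them with the values listed in this section. The sketch below does that with Python's standard `hashlib`. The function names are hypothetical, and it assumes the scripts sit at the top level of the cloned directory.

```python
import hashlib
from pathlib import Path

# Digests copied from the VirusTotal table in this section (scan date 2026-05-17).
EXPECTED_SHA256 = {
    "Generate-TestAudio.ps1": "507f59f53fb58ebf40491ca8328fd1b401c8953e707eee1d96b7d252f48d17fc",
    "Install-And-Run.ps1": "1b54bea310b340fe02c4bc389c001746ed4cf124d4527e8c4b49a7fa932080a8",
    "Start-ScenemaAudio-BF16Audio.ps1": "09e281e2ef88e72da7092a38162bcb803474d7717b759593545583368e1076d6",
    "Start-ScenemaAudio-FullGemma.ps1": "cc508bf96d46d0f7ac9bbaa94d58f5f6e048fb0a8b2c887372b444e1a0f27c2f",
    "Start-ScenemaAudio.ps1": "6345bc98078296e030f12d48e1ca484df2a1efcb8c1874ff85e80b000dd121ab",
    "Stop-ScenemaAudio-BF16Audio.ps1": "998919de5a75c5602cad38899b927a291378c9b6978b56240aa2f157160ebc47",
    "Stop-ScenemaAudio-FullGemma.ps1": "412e2587db5b2e93b02b31fe60659a2d9f12d9bdbadc9b855f2f34eaaac8f6b5",
}


def sha256_of(path):
    """Hash a file in blocks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_scripts(repo_dir, expected=EXPECTED_SHA256):
    """Return 'ok', 'MISMATCH', or 'missing' for each expected script."""
    results = {}
    for name, want in expected.items():
        path = Path(repo_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif sha256_of(path) == want:
            results[name] = "ok"
        else:
            results[name] = "MISMATCH"
    return results
```

A mismatch does not by itself indicate tampering. A script updated after the scan date, or line endings converted by Git on checkout, will also change the digest.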
