language: en
tags:
- audio-generation
- music-generation
- diffusion
- lora
- dora
- stable-audio
- text-to-audio
- structural-acoustic-modeling
- belief-space
- eisbach
license: other
library_name: stable-audio
base_model: stabilityai/stable-audio-3-medium
datasets:
- MusicCaps
pipeline_tag: text-to-audio
widget:
- text: >-
    Little Piglet Prince chamber fairy-tale portrait, celesta, music box, toy
    piano, pizzicato violins, warm cello, a small prince walking through a
    moonlit castle, gentle royal lullaby, peaceful ending, 72 BPM
- text: >-
    Raccoon Mathematician chamber portrait, pizzicato strings, marimba,
    vibraphone, precise and playful melody with asymmetrical time signatures,
    elegant logic turning into music, 80 BPM
- text: >-
    Professor Pallas Cat chamber portrait, bassoon, contrabassoon, low
    clarinet, viola, cello, grumpy but wise, slow thoughtful melody with
    sudden flashes of dry wit, scholarly atmosphere, 66 BPM
- text: >-
    Seal Lawyer chamber portrait, cello, double bass, bassoon, french horn,
    harp, marimba, smooth gliding phrases like ocean currents, harp
    objections, marimba gavel taps, peaceful ending, 72 BPM
Eisbach-3B
A belief-space fine-tuned music generation model with structural acoustic modeling capabilities.
Eisbach-3B is a fine-tuned variant of Stable Audio 3 Medium (2.3B-parameter Diffusion Transformer), trained with the Eisbach log-barrier, a self-referential confidence-gating mechanism that reshapes the training dynamics to favor temporally structured, acoustically differentiated outputs. The model specializes in long-form instrumental music with clear structural development, distinct instrumental separation, and narrative coherence.
Core Innovation: Structural Acoustic Modeling
Standard diffusion fine-tuning treats all training samples equally, which tends to produce outputs that regress toward the mean: safe, repetitive textures with limited development. Eisbach-3B introduces a belief-space constraint during training:
- At each denoising step, the DiT output is converted into a temporal energy distribution via softmax over the time axis.
- The entropy of this distribution measures how "structured" the model believes its own prediction to be: low entropy means sharp temporal energy contrast (clear onsets, distinct instrumental events); high entropy means diffuse, uniform energy (muddy textures).
- A log-barrier penalty converts entropy into a per-sample weight: confident, structured predictions receive full gradient; uncertain, flat predictions are damped.
This creates an implicit structural selection pressure: over the course of training, only samples that elicit structured outputs from the model effectively contribute to parameter updates. The model learns not just to predict noise correctly, but to do so with temporal energy structures that the model itself recognizes as well-formed.
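The three steps above can be sketched in a few lines. This is a minimal illustration rather than the training code: the function names, the normalization of the entropy by log(T), and the exact barrier form `weight = (1 - H)^λ` are assumptions chosen so that structured (low-entropy) predictions keep a weight near 1 and flat (high-entropy) predictions are damped toward 0. The input is assumed to be one energy value per time frame.

```python
import math


def temporal_entropy(frame_energy):
    """Normalized entropy (0..1) of a softmax taken over the time axis."""
    if len(frame_energy) < 2:
        raise ValueError("need at least two time frames")
    peak = max(frame_energy)
    # Subtract the max before exponentiating for numerical stability.
    exps = [math.exp(e - peak) for e in frame_energy]
    total = sum(exps)
    probs = [x / total for x in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Divide by log(T), the entropy of a uniform distribution over T frames.
    return entropy / math.log(len(frame_energy))


def barrier_weight(h_norm, lam=0.5, eps=1e-6):
    """Assumed gate: penalty = -lam * log(1 - h), weight = exp(-penalty)."""
    penalty = -lam * math.log(max(1.0 - h_norm, eps))
    return math.exp(-penalty)


# One sharp onset versus perfectly uniform energy (toy inputs).
structured = barrier_weight(temporal_entropy([10.0, 0.0, 0.0, 0.0]))
flat = barrier_weight(temporal_entropy([1.0, 1.0, 1.0, 1.0]))
print(f"structured weight: {structured:.4f}, flat weight: {flat:.4f}")
```

In a training loop, a weight like this would multiply each sample's loss before averaging over the batch, so that flat predictions contribute little gradient.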
The result is a model that exhibits, without any explicit architectural priors for music structure:
- Multi-part form development (intro → development → climax → resolution)
- Four-voice harmonic organization emerging from the training dynamics
- Instrumental separation in the frequency domain (each instrument occupies distinct spectral regions)
- Textural variation across time (not just timbre-swapping, but genuine metamorphosis of musical material)
Training Details
| Field | Value |
|---|---|
| Base model | Stable Audio 3 Medium (2.3B DiT, 44.1kHz stereo) |
| Adapter | DoRA-rows (Weight-Decomposed Low-Rank Adaptation) |
| Rank / Alpha | 16 / 16 |
| Training steps | 1,000 (convergence sweet spot) |
| Batch size | 4 |
| Learning rate | 5e-5 (AdamW, β=[0.9, 0.95]) |
| Dataset | MusicCaps (~5,500 clips, diverse genres and instrumentations) |
| Eisbach barrier λ | 0.5 |
| Timestep sampler | Truncated logit-normal |
| Objective | v-prediction with OT coupling |
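
The table names a truncated logit-normal timestep sampler without giving its parameters. A minimal sketch of one common construction follows: draw a Gaussian sample, squash it through a sigmoid, and reject values outside a truncation window. The mean, standard deviation, and the `t_min`/`t_max` bounds are placeholders, not the values used for Eisbach-3B.

```python
import math
import random


def sample_truncated_logit_normal(rng, mean=0.0, std=1.0, t_min=0.02, t_max=0.98):
    """Draw t = sigmoid(z), z ~ N(mean, std), rejecting t outside [t_min, t_max]."""
    while True:
        z = rng.gauss(mean, std)
        t = 1.0 / (1.0 + math.exp(-z))
        if t_min <= t <= t_max:
            return t


rng = random.Random(0)  # fixed seed so the sketch is reproducible
timesteps = [sample_truncated_logit_normal(rng) for _ in range(1000)]
print(min(timesteps), max(timesteps))
```

Rejection sampling keeps the shape of the logit-normal density inside the window; clamping would instead pile probability mass onto the two bounds.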
The DoRA decomposition (magnitude × direction) synergizes with the barrier: the direction component learns what structure to produce, while the magnitude component learns how strongly to articulate it. Pure LoRA cannot disentangle these; the barrier's selection pressure acts on both simultaneously.
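
To make the magnitude and direction split concrete, here is a small pure-Python sketch of a DoRA-style merged weight. It assumes that "DoRA-rows" means each output row is normalized separately and rescaled by its own learned magnitude; that reading, the function names, and the toy matrices are illustrative assumptions, not the adapter's actual implementation.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dora_rows_merge(w0, lora_b, lora_a, magnitude, alpha=16, rank=16):
    """W' = m * (W0 + s * B @ A) / ||W0 + s * B @ A||, with one norm per row."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    merged = []
    for w_row, d_row, m in zip(w0, delta, magnitude):
        direction = [w + scale * d for w, d in zip(w_row, d_row)]
        # Assumes no row is all zeros; a real implementation would guard this.
        norm = math.sqrt(sum(x * x for x in direction))
        merged.append([m * x / norm for x in direction])
    return merged


# Toy 2x2 layer with a rank-1 adapter whose B matrix is zero (its usual init).
w0 = [[3.0, 4.0], [0.0, 2.0]]
lora_b = [[0.0], [0.0]]
lora_a = [[1.0, 1.0]]

# Magnitudes equal to the row norms reproduce the base weights.
same = dora_rows_merge(w0, lora_b, lora_a, [5.0, 2.0], alpha=1, rank=1)
# Doubling the first magnitude rescales that row without changing its direction.
louder = dora_rows_merge(w0, lora_b, lora_a, [10.0, 2.0], alpha=1, rank=1)
print(same, louder)
```

With plain LoRA there is no separate magnitude vector, so a change in `B @ A` alters the length and the direction of each row together.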
Usage
from stable_audio_3 import StableAudioModel

model = StableAudioModel.from_pretrained("./Eisbach-3B")
audio = model.generate(
    prompt=(
        "House music that encapsulates the feeling of being at a festival "
        "in the sunny weather with all your friends 124 BPM"
    ),
    duration=180,  # seconds
)
Recommended sampling parameters
| Parameter | Recommendation |
|---|---|
| Steps | 8 (fast) to 50 (refined) |
| CFG scale | 2.5–4.0 |
| Duration | 30s–180s (model trained on 60s crops, generalizes to longer) |
| Sampler | pingpong (recommended for musical structure) |
| Prompt style | Descriptive, literary, specify instrumentation + tempo + emotional arc |
Prompt design tips
Eisbach-3B responds best to scene-based, literary prompts rather than technical tag lists. The model has learned to associate narrative descriptions with structural musical form.
Effective example:
"A small prince walking carefully through a moonlit castle made of books and pillows, celesta and pizzicato violins, gentle royal lullaby, innocent and round, slightly clumsy but noble, clear motif development, peaceful ending, 72 BPM."
This works better than: "celesta, pizzicato strings, lullaby, 72bpm, classical", because the scene description cues the model to organize temporal energy into a coherent narrative arc.
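
A small helper can keep prompts in this scene-first shape. The sketch below is a convenience for assembling prompt strings; `build_prompt` and its argument names are hypothetical and not part of any Stable Audio or Eisbach API.

```python
def build_prompt(scene, instruments, character, bpm, ending="peaceful ending"):
    """Join a scene, instrumentation, character, ending, and tempo into one prompt."""
    parts = [scene, " and ".join(instruments), character, ending, f"{bpm} BPM"]
    # Skip empty fields so optional pieces do not leave stray commas.
    return ", ".join(part for part in parts if part) + "."


prompt = build_prompt(
    scene="A small prince walking carefully through a moonlit castle made of books and pillows",
    instruments=["celesta", "pizzicato violins"],
    character="gentle royal lullaby, innocent and round, clear motif development",
    bpm=72,
)
print(prompt)
```

Leading with the scene and closing with the tempo mirrors the effective example above; the resulting string is passed as the `prompt` argument at generation time.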
Audio Samples (1000-step checkpoint)
All samples are 2-minute stereo WAV files at 44.1kHz, generated with 8 diffusion steps and CFG=3.0. Each prompt was designed to test a different axis of structural complexity: harmonic organization, metric irregularity, registral contrast, and textural dialogue.
1. Little Piglet Prince
Little Piglet Prince chamber fairy-tale portrait, two-minute instrumental music sketch, celesta, music box, toy piano, pizzicato violins, warm cello, soft clarinet, muted bassoon, light harp, tiny wooden percussion, a small prince walking carefully through a moonlit castle made of books and pillows, golden candlelight, velvet curtains, gentle royal lullaby melody, innocent and round, slightly clumsy but noble, tender storybook magic, intimate chamber music, warm low strings, delicate reverb, clear motif development, peaceful ending, 72 BPM, no vocals.
Tests: harmonic clarity (triadic lullaby), registral balance (celesta/cello spanning 5 octaves), "clumsy but noble" rubato.
Notable behavior: Four-voice harmonic texture emerges: bass (cello/bassoon), tenor (clarinet), alto (pizzicato violins), soprano (celesta/music box). Clear ABA' form with modified recapitulation.
2. Raccoon Mathematician
Raccoon Mathematician chamber character portrait, two-minute instrumental music sketch, pizzicato strings, marimba, vibraphone, celesta, soft bassoon, light woodwinds, a clever raccoon solving equations on a chalkboard made of stars, chalk dust floating like galaxies, precise and playful melody with asymmetrical time signatures, patterns that fold and unfold, curious counterpoint, elegant logic turning into music, warm cello punctuations, delicate reverb, clear structural development from problem to solution, peaceful resolution, 80 BPM, no vocals.
Tests: asymmetric meter, contrapuntal independence, rhythmic precision under metric ambiguity.
Notable behavior: Melodic fragments treated as "variables": stated, transformed (inversion, augmentation), and combined in the final section like a mathematical proof reaching QED. The "patterns that fold and unfold" directive produces audible fractal-like recursion in the marimba line.
3. Professor Pallas Cat
Professor Pallas Cat chamber character portrait, two-minute instrumental music sketch, bassoon, contrabassoon, low clarinet, viola, cello, timpani, deep marimba, a grumpy but wise Pallas cat lecturing in a dusty university hall filled with ancient books, round and flat-faced dignity, slow thoughtful melody with sudden flashes of dry wit, pedantic counterpoint, low register warmth, occasional surprised grace notes, delicate reverb, scholarly atmosphere, clear lecture structure, peaceful ending, 66 BPM, no vocals.
Tests: low-register clarity (all instruments below viola range), sudden registral/timbral disruptions ("dry wit" grace notes), dense counterpoint without muddiness.
Notable behavior: The barrier's frequency-domain differentiation is most audible here: bassoon, contrabassoon, bass clarinet, and cello remain individually distinguishable despite all operating below ~600Hz. The "surprised grace notes" emerge as abrupt registral leaps (bassoon jumping two octaves for a single staccato) that cut through the texture without destabilizing it.
4. Seal Lawyer
Seal Lawyer chamber character portrait, two-minute instrumental music sketch, cello, double bass, bassoon, french horn, harp, marimba, soft percussion, a rotund seal in a tiny waistcoat arguing a case in an underwater courtroom, smooth gliding phrases like ocean currents, dignified but slightly comical, warm low strings carrying the argument, harp objections, marimba gavel taps, delicate reverb, clear rhetorical structure with an elegant closing statement, peaceful ending, 72 BPM, no vocals.
Tests: instrumental dialogue (strings = argument, harp = objection, marimba = gavel), rhetorical structure (opening statement → argument → rebuttal → closing), comedic timing in instrumental music.
Notable behavior: The "harp objections" manifest as glissando interruptions that increase in frequency during the middle section (the "argument" phase), then resolve into a single extended harp arpeggio in the "closing statement." The marimba "gavel" punctuates structural boundaries with a consistent rhythmic motif. The model has learned to use timbral contrast as a proxy for dramatic conflict.
Limitations
- Not optimized for vocals or lyrics. The training focused on instrumental music; vocal synthesis quality is not tested.
Citation
If you use Eisbach-3B in your research, please cite:
@misc{eisbach3b2026,
title={Eisbach-3B: Belief-Space Fine-Tuning for Structural Acoustic Modeling},
author={Reasoning Kingdom},
year={2026},
note={Fine-tuned variant of Stable Audio 3 Medium with Eisbach log-barrier training},
}
License
Derived from Stable Audio 3 Medium. Subject to Stability AI's model license terms.
Eisbach project, May 2026