---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
- text-to-spectrogram
- audio-synthesis
- lora
- flux2
- arxiv:2604.20329
pipeline_tag: text-to-image
---
# sonic-plantain
A LoRA adapter on FLUX.2 Klein (4B) that generates magnitude spectrograms of English speech, rendered as RGB images, from text prompts. It reframes audio synthesis as image generation: the prompt describes the speech to be uttered, the model produces an RGB-encoded spectrogram, and the inverse of the RGB encoding recovers the magnitude. Griffin-Lim phase recovery then turns the magnitude into an audible waveform.
This adapter tests one claim from *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)): that the recipe (instruction-tune a strong image generator on a small mixture of task-specific data with an invertible RGB encoding) extends past traditional computer-vision tasks to audio.
## Method
1. **Reframe text-to-speech as text-to-image.** The training target for each transcript is its magnitude spectrogram, encoded as an RGB image. At inference time, the prompt describes the desired speech and the model emits a spectrogram that decodes to audio.
2. **Bijective magnitude↔RGB encoding.** Linear-amplitude STFT magnitude is converted to dB and clipped to [−80, 0] dB, normalized to a curve parameter `u ∈ [0, 1]`, then piecewise-linearly interpolated along a 7-segment Hamiltonian path through the corners of the RGB cube (black → blue → cyan → green → yellow → red → magenta → white). The inverse projects predicted RGB onto the nearest cube edge.
3. **Audio params.** 16 kHz sample rate, n_fft = 1024, hop = 256, 5-second clips. STFT magnitude (513 frequency bins × 313 time frames) is placed top-left in a 768 × 768 canvas; the rest is silence-padded.
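The encoding in step 2 and the canvas layout in step 3 can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the project's own code: the function names are hypothetical, the 0 dB reference is taken to be a magnitude of 1.0, frequency bin 0 is placed on canvas row 0, and the inverse only considers the seven edges that lie on the path.

```python
import numpy as np

# Cube corners in path order; consecutive corners differ in exactly one channel.
CORNERS = np.array([
    [0, 0, 0],  # black
    [0, 0, 1],  # blue
    [0, 1, 1],  # cyan
    [0, 1, 0],  # green
    [1, 1, 0],  # yellow
    [1, 0, 0],  # red
    [1, 0, 1],  # magenta
    [1, 1, 1],  # white
], dtype=np.float64)
N_SEG = len(CORNERS) - 1  # 7 segments
DB_FLOOR = -80.0


def magnitude_to_u(mag, ref=1.0):
    """Linear magnitude -> dB clipped to [-80, 0] -> u in [0, 1]. `ref` is an assumption."""
    db = 20.0 * np.log10(np.maximum(np.asarray(mag, dtype=float) / ref, 1e-10))
    return (np.clip(db, DB_FLOOR, 0.0) - DB_FLOOR) / -DB_FLOOR


def u_to_magnitude(u, ref=1.0):
    """Inverse of magnitude_to_u on the clipped range."""
    return ref * 10.0 ** ((np.asarray(u, dtype=float) * -DB_FLOOR + DB_FLOOR) / 20.0)


def u_to_rgb(u):
    """Piecewise-linear interpolation along the 7-segment corner path."""
    t = np.clip(np.asarray(u, dtype=float), 0.0, 1.0) * N_SEG
    seg = np.minimum(t.astype(int), N_SEG - 1)
    frac = np.expand_dims(t - seg, -1)
    return (1.0 - frac) * CORNERS[seg] + frac * CORNERS[seg + 1]


def rgb_to_u(rgb):
    """Project each RGB value onto the nearest path segment and return its u."""
    rgb = np.expand_dims(np.clip(np.asarray(rgb, dtype=float), 0.0, 1.0), -2)
    start, direction = CORNERS[:-1], CORNERS[1:] - CORNERS[:-1]
    # Each direction is a unit vector, so the dot product is the position on the segment.
    frac = np.clip(((rgb - start) * direction).sum(-1), 0.0, 1.0)
    nearest = start + np.expand_dims(frac, -1) * direction
    seg = ((rgb - nearest) ** 2).sum(-1).argmin(-1)
    best = np.take_along_axis(frac, np.expand_dims(seg, -1), axis=-1)[..., 0]
    return (seg + best) / N_SEG


def to_canvas(rgb_spec, size=768):
    """Place the encoded spectrogram top-left; the black padding decodes to silence."""
    canvas = np.zeros((size, size, 3))
    height, width = rgb_spec.shape[:2]
    canvas[:height, :width] = rgb_spec
    return canvas
```

Because each path segment changes a single channel, a clean encoded pixel always has two channels pinned at 0 or 1, which is what makes nearest-segment projection a reasonable way to denoise a generated image before decoding.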
Training data: LibriSpeech `train.clean.100` (read English speech), ~28,000 clips with transcripts.
## Status
Training in progress. Weights will be added when complete.
## Training
| Setting | Value |
|---|---|
| Base | `black-forest-labs/FLUX.2-klein-base-4B` |
| Adapter | LoRA, rank 256 on transformer attention + rank 32 on text encoder |
| Resolution | 768 × 768 |
| Batch size | 4 |
| Optimizer | AdamW, lr 1e-4, cosine schedule, 300-step warmup |
| Max steps | 15 000 |
| Mixed precision | bf16 |
| Training data | LibriSpeech `train.clean.100`, ~28 k transcribed clips |
| Audio params | 16 kHz, n_fft 1024, hop 256, 5-second clips |
| Spectrogram encoding | Linear magnitude → dB clipped to [−80, 0] → 7-segment RGB-cube path (the "Hilbert path" named in the prompt) |
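For readers who want to see what the optimizer row implies, here is a rough sketch of a cosine schedule with a 300-step warmup over 15 000 steps. The table does not say how the warmup ramps or where the cosine ends, so linear warmup and decay to zero are assumptions, as is the helper name `lr_at`. The epoch estimate assumes a single device and no gradient accumulation.

```python
import math

BASE_LR = 1e-4
WARMUP_STEPS = 300
MAX_STEPS = 15_000
BATCH_SIZE = 4
APPROX_CLIPS = 28_000  # "~28 k" in the table, so this is only approximate


def lr_at(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, max_steps=MAX_STEPS):
    """Learning rate at a given step: linear warmup (assumed), then cosine decay to 0 (assumed)."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = min((step - warmup) / max(1, max_steps - warmup), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Rough number of passes over the data implied by the table.
approx_epochs = MAX_STEPS * BATCH_SIZE / APPROX_CLIPS

if __name__ == "__main__":
    for step in (0, 299, 300, 7_650, 15_000):
        print(step, lr_at(step))
    print("approximate epochs:", round(approx_epochs, 2))
```

Under these assumptions the run sees each clip about twice.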
## Usage
```python
import torch
from diffusers import Flux2KleinPipeline

# Load the base model in bf16, then attach the LoRA adapter.
pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B", torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("phanerozoic/sonic-plantain")

# The quoted text is the transcript to speak; the rest describes the RGB encoding.
prompt = (
    'Generate a magnitude spectrogram of speech reading: "hello world". '
    "Time on horizontal axis, frequency on vertical, energy encoded in RGB along "
    "a Hilbert path through the color cube: black is silence, blue/cyan is low "
    "energy, green/yellow is moderate, red/magenta is high, white is full-scale."
)

# Generate at the 768 x 768 training resolution.
img = pipe(
    prompt=prompt, height=768, width=768,
    guidance_scale=4.0, num_inference_steps=20,
).images[0]
```
The decoder (RGB → magnitude → Griffin-Lim → audio) is in `decode_spectrogram.py`.
## License
The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model FLUX.2 Klein 4B.
### Training data attribution
- **LibriSpeech** (Panayotov et al., 2015). The `train.clean.100` split of the LibriSpeech ASR corpus is the sole training-data source. LibriSpeech is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). The corpus is derived from public-domain audiobook recordings on LibriVox. See http://www.openslr.org/12/ for the original distribution.
Downstream users of this adapter who redistribute reconstructed audio derived from training-data spectrograms should preserve LibriSpeech's CC BY 4.0 attribution requirement.
### Base model
The base model, FLUX.2 Klein 4B, is distributed by Black Forest Labs under the Apache License 2.0. See https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B for the original model card.
## References
- Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Panayotov, Chen, Povey, Khudanpur. *LibriSpeech: an ASR corpus based on public domain audio books.* ICASSP 2015.
- Griffin, Lim. *Signal estimation from modified short-time Fourier transform.* IEEE TASSP 1984.