---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
  - text-to-spectrogram
  - audio-synthesis
  - lora
  - flux2
  - arxiv:2604.20329
pipeline_tag: text-to-image
---

# sonic-plantain

A LoRA adapter on FLUX.2 Klein (4B) that generates magnitude-spectrogram visualizations of English speech from text prompts. It reframes audio synthesis as image generation: the prompt describes the speech to be uttered, the model produces an RGB-encoded spectrogram, inverting the bijective encoding recovers the magnitude, and Griffin-Lim phase recovery returns audible audio.

This adapter tests one claim from *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)): that the recipe (instruction-tune a strong image generator on a small mixture of task-specific data with an invertible RGB encoding) extends beyond traditional computer-vision tasks to audio.

## Method

1. **Reframe text-to-speech as text-to-image.** The training target for each transcript is its magnitude spectrogram, encoded as an RGB image. At inference time, the prompt describes the desired speech and the model emits a spectrogram that decodes to audio.
2. **Bijective magnitude↔RGB encoding.** Linear-amplitude STFT magnitude is converted to dB and clipped to [−80, 0] dB, normalized to a curve parameter `u ∈ [0, 1]`, then piecewise-linearly interpolated along a 7-segment Hamiltonian path through the corners of the RGB cube (black → blue → cyan → green → yellow → red → magenta → white). The inverse projects each predicted RGB pixel onto the nearest path segment; a minimal sketch follows this list.
3. **Audio params.** 16 kHz sample rate, n_fft = 1024, hop = 256, 5-second clips. STFT magnitude (513 frequency bins × 313 time frames) is placed top-left in a 768 × 768 canvas; the rest is silence-padded.
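A minimal NumPy sketch of the encoding and its inverse, assuming exactly the dB range and corner order above (the trained adapter's constants may differ):

```python
import numpy as np

# Corners of the 7-segment Hamiltonian path: black -> blue -> cyan -> green
# -> yellow -> red -> magenta -> white; each step changes one channel.
_PATH = np.array([
    [0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
    [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 1],
], dtype=np.float32)

def magnitude_to_rgb(mag, db_min=-80.0, db_max=0.0):
    """Linear STFT magnitude -> RGB in [0, 1] along the corner path."""
    db = 20.0 * np.log10(np.maximum(mag, 1e-10))          # amplitude dB
    u = (np.clip(db, db_min, db_max) - db_min) / (db_max - db_min)
    t = u * 7.0                                           # position on the path
    seg = np.minimum(t.astype(int), 6)                    # segment index 0..6
    frac = (t - seg)[..., None]                           # weight within segment
    return (1.0 - frac) * _PATH[seg] + frac * _PATH[seg + 1]

def rgb_to_magnitude(rgb, db_min=-80.0, db_max=0.0):
    """Inverse: project each RGB pixel onto the nearest path segment."""
    a, d = _PATH[:-1], _PATH[1:] - _PATH[:-1]      # segment starts, unit directions
    p = rgb[..., None, :] - a                      # (..., 7, 3)
    frac = np.clip((p * d).sum(-1), 0.0, 1.0)      # projection onto each segment
    proj = a + frac[..., None] * d
    seg = ((rgb[..., None, :] - proj) ** 2).sum(-1).argmin(-1)
    u = (seg + np.take_along_axis(frac, seg[..., None], -1)[..., 0]) / 7.0
    return 10.0 ** ((db_min + u * (db_max - db_min)) / 20.0)
```

Round-tripping `rgb_to_magnitude(magnitude_to_rgb(m))` reproduces, up to floating point, any magnitude already inside the [−80, 0] dB range.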

Training data: LibriSpeech `train.clean.100` (read English speech), ~28,000 clips with transcripts.
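For illustration, a hypothetical sketch of assembling one training pair under the parameters above, reusing `magnitude_to_rgb` from the sketch earlier in this section (the Hugging Face `librispeech_asr` dataset id, field names, and prompt template are assumptions, not the published pipeline):

```python
import numpy as np
import librosa
from datasets import load_dataset
from PIL import Image

ds = load_dataset("librispeech_asr", "clean", split="train.100", streaming=True)
ex = next(iter(ds))

# Fix the clip to exactly 5 s at 16 kHz (truncate or zero-pad).
wav = ex["audio"]["array"][: 5 * 16000]
wav = np.pad(wav, (0, 5 * 16000 - len(wav)))

mag = np.abs(librosa.stft(wav, n_fft=1024, hop_length=256))  # (513, 313)
canvas = np.zeros((768, 768, 3), dtype=np.float32)           # black = silence
canvas[:513, :313] = magnitude_to_rgb(mag)                   # top-left placement
img = Image.fromarray((canvas * 255).astype(np.uint8))

# The first sentence of the prompt varies with the transcript; the rest
# matches the fixed description shown under Usage.
prompt = f'Generate a magnitude spectrogram of speech reading: "{ex["text"]}".'
```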

## Status

Training in progress. Weights will be added when complete.

## Training

| | |
|---|---|
| Base | `black-forest-labs/FLUX.2-klein-base-4B` |
| Adapter | LoRA, rank 256 on transformer attention + rank 32 on text encoder |
| Resolution | 768 × 768 |
| Batch size | 4 |
| Optimizer | AdamW, lr 1e-4, cosine schedule, 300-step warmup |
| Max steps | 15,000 |
| Mixed precision | bf16 |
| Training data | LibriSpeech `train.clean.100`, ~28,000 transcribed clips |
| Audio params | 16 kHz, n_fft 1024, hop 256, 5-second clips |
| Spectrogram encoding | Linear magnitude → dB clipped [−80, 0] → 7-segment Hamiltonian RGB-cube path |
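As a rough illustration only, the adapter shape in the table could be expressed with `peft` as follows; the target module names are guesses, and the actual training configuration is not published here:

```python
from peft import LoraConfig

# Hypothetical LoRA shapes matching the table above; module names assumed.
transformer_lora = LoraConfig(
    r=256, lora_alpha=256,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # transformer attention
)
text_encoder_lora = LoraConfig(
    r=32, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```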

## Usage

```python
import torch
from diffusers import Flux2KleinPipeline

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B", torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("phanerozoic/sonic-plantain")  # attach the adapter

prompt = (
    'Generate a magnitude spectrogram of speech reading: "hello world". '
    "Time on horizontal axis, frequency on vertical, energy encoded in RGB along "
    "a Hilbert path through the color cube: black is silence, blue/cyan is low "
    "energy, green/yellow is moderate, red/magenta is high, white is full-scale."
)

# The top-left 513 x 313 region of the result holds the spectrogram.
img = pipe(
    prompt=prompt, height=768, width=768,
    guidance_scale=4.0, num_inference_steps=20,
).images[0]
```

The decoder (RGB → magnitude → Griffin-Lim → audio) is in `decode_spectrogram.py`.
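A minimal decoding sketch under the card's stated parameters, reusing `rgb_to_magnitude` from the Method sketch (`n_iter`, the row orientation of the crop, and the output path are assumptions; `decode_spectrogram.py` is authoritative):

```python
import numpy as np
import librosa
import soundfile as sf

# `img` is the PIL image from the snippet above.
rgb = np.asarray(img, dtype=np.float32) / 255.0

# Crop the spectrogram region (row 0 assumed to be the DC bin).
mag = rgb_to_magnitude(rgb[:513, :313])

# Griffin-Lim phase recovery, then write ~5 s of 16 kHz audio.
audio = librosa.griffinlim(mag, n_iter=64, hop_length=256, win_length=1024)
sf.write("out.wav", audio, 16000)
```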

## License

The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model FLUX.2 Klein 4B.

### Training data attribution

- **LibriSpeech** (Panayotov et al., 2015). The `train.clean.100` split of the LibriSpeech ASR corpus is the sole training-data source. LibriSpeech is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0) and is derived from public-domain audiobook recordings on LibriVox. See http://www.openslr.org/12/ for the original distribution.

Downstream users of this adapter who redistribute reconstructed audio derived from training-data spectrograms should preserve LibriSpeech's CC BY 4.0 attribution requirement.

### Base model

Base model FLUX.2 Klein 4B is distributed by Black Forest Labs under the Apache License 2.0. See https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B for the original model card.

## References

- Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Panayotov, Chen, Povey, Khudanpur. *LibriSpeech: an ASR corpus based on public domain audio books.* ICASSP 2015.
- Griffin, Lim. *Signal estimation from modified short-time Fourier transform.* IEEE TASSP 1984.