---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
- text-to-spectrogram
- audio-synthesis
- lora
- flux2
- arxiv:2604.20329
pipeline_tag: text-to-image
---

# sonic-plantain

A LoRA adapter on FLUX.2 Klein (4B) that generates magnitude-spectrogram visualizations of English speech from text prompts. It reframes audio synthesis as image generation: the prompt describes the speech to be uttered, the model produces an RGB-encoded spectrogram, an inverse bijection recovers the magnitude, and Griffin-Lim phase recovery returns audible audio.

This adapter tests one claim from *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)): that the recipe (instruction-tune a strong image generator on a small mixture of task-specific data with an invertible RGB encoding) extends beyond traditional computer-vision tasks to audio.

## Method

1. **Reframe text-to-speech as text-to-image.** The training target for each transcript is its magnitude spectrogram, encoded as an RGB image. At inference time, the prompt describes the desired speech and the model emits a spectrogram that decodes to audio.
2. **Bijective magnitude-to-RGB encoding.** Linear-amplitude STFT magnitude is converted to dB and clipped to [-80, 0] dB, normalized to a curve parameter `u ∈ [0, 1]`, then piecewise-linearly interpolated along a 7-segment Hamiltonian path through the corners of the RGB cube (black → blue → cyan → green → yellow → red → magenta → white). The inverse projects predicted RGB onto the nearest edge of the path (sketched in code after this list).
3. **Audio params.** 16 kHz sample rate, n_fft = 1024, hop = 256, 5-second clips. The STFT magnitude (513 frequency bins × 313 time frames) is placed top-left on a 768 × 768 canvas; the rest is silence-padded.
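
A minimal NumPy sketch of the encoding in steps 2-3 and its inverse (illustrative only: the helper names are ours, and the repository's `decode_spectrogram.py` is the authoritative implementation):

```python
import numpy as np

# Corners of the RGB cube along the 7-segment path described in step 2.
CORNERS = np.array(
    [
        [0, 0, 0],  # black   (silence)
        [0, 0, 1],  # blue
        [0, 1, 1],  # cyan
        [0, 1, 0],  # green
        [1, 1, 0],  # yellow
        [1, 0, 0],  # red
        [1, 0, 1],  # magenta
        [1, 1, 1],  # white   (full scale)
    ],
    dtype=np.float32,
)

def magnitude_to_rgb(mag: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Linear STFT magnitude (freq, time) -> RGB in [0, 1] (freq, time, 3)."""
    db = np.clip(20.0 * np.log10(mag + eps), -80.0, 0.0)
    u = (db + 80.0) / 80.0                 # curve parameter in [0, 1]
    t = u * 7.0                            # position along the 7 segments
    i = np.minimum(t.astype(np.int64), 6)  # segment index, 0..6
    frac = (t - i)[..., None]
    return CORNERS[i] * (1.0 - frac) + CORNERS[i + 1] * frac

def rgb_to_magnitude(rgb: np.ndarray) -> np.ndarray:
    """Inverse: project each pixel onto the nearest path segment."""
    p = rgb.reshape(-1, 1, 3)
    a, b = CORNERS[None, :-1], CORNERS[None, 1:]
    # Each segment is a unit-length cube edge, so no |b - a|^2 denominator.
    s = np.clip(((p - a) * (b - a)).sum(-1), 0.0, 1.0)
    d = np.linalg.norm(a + s[..., None] * (b - a) - p, axis=-1)
    i = d.argmin(axis=1)                   # nearest segment per pixel
    t = i + s[np.arange(len(i)), i]        # recovered curve position
    db = (t / 7.0) * 80.0 - 80.0
    return (10.0 ** (db / 20.0)).reshape(rgb.shape[:-1])
```

Assuming a center-padded STFT, a 5-second 16 kHz clip yields 1 + 80000 // 256 = 313 frames, consistent with the 513 × 313 placement in step 3.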

Training data: LibriSpeech `train.clean.100` (read English speech), ~28,000 clips with transcripts.
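
For reference, this is roughly how that split can be pulled from the Hugging Face hub (a sketch only: the `openslr/librispeech_asr` repo id and the `"clean"` / `"train.100"` config names are assumptions, since the data-prep script is not part of this repo):

```python
from datasets import load_dataset

# train.clean.100: ~28k transcribed clips of read English speech at 16 kHz.
ds = load_dataset("openslr/librispeech_asr", "clean", split="train.100")

sample = ds[0]
waveform = sample["audio"]["array"]   # float waveform at 16 kHz
transcript = sample["text"]           # transcript used to build the prompt
```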

## Status

Training in progress. Weights will be added when complete.

## Training

| Setting | Value |
|---|---|
| Base | `black-forest-labs/FLUX.2-klein-base-4B` |
| Adapter | LoRA, rank 256 on transformer attention + rank 32 on text encoder |
| Resolution | 768 × 768 |
| Batch size | 4 |
| Optimizer | AdamW, lr 1e-4, cosine schedule, 300-step warmup |
| Max steps | 15,000 |
| Mixed precision | bf16 |
| Training data | LibriSpeech `train.clean.100`, ~28k transcribed clips |
| Audio params | 16 kHz, n_fft 1024, hop 256, 5-second clips |
| Spectrogram encoding | Linear magnitude → dB clipped to [-80, 0] → Hilbert RGB-cube path |

## Usage

```python
import torch
from diffusers import Flux2KleinPipeline

# Load the base model, then attach the LoRA adapter.
pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B",
    torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("phanerozoic/sonic-plantain")

# The transcript to speak, plus a fixed description of the RGB encoding.
prompt = (
    'Generate a magnitude spectrogram of speech reading: "hello world". '
    "Time on horizontal axis, frequency on vertical, energy encoded in RGB along "
    "a Hilbert path through the color cube: black is silence, blue/cyan is low "
    "energy, green/yellow is moderate, red/magenta is high, white is full-scale."
)
img = pipe(
    prompt=prompt,
    height=768,
    width=768,
    guidance_scale=4.0,
    num_inference_steps=20,
).images[0]
```

The decoder (RGB → magnitude → Griffin-Lim → audio) is in `decode_spectrogram.py`.
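
A sketch of that decode path (it assumes the `rgb_to_magnitude` helper from the Method section; `librosa` and `soundfile` are our choices here, and `decode_spectrogram.py` remains the reference implementation):

```python
import numpy as np
import librosa
import soundfile as sf

# `img` is the 768 x 768 PIL image from the Usage example above.
rgb = np.asarray(img, dtype=np.float32) / 255.0

# Crop the spectrogram region (513 bins x 313 frames, placed top-left),
# then invert the RGB encoding back to linear magnitude.
mag = rgb_to_magnitude(rgb[:513, :313])

# Griffin-Lim phase recovery; n_fft = 2 * (513 - 1) = 1024 is inferred.
audio = librosa.griffinlim(mag, n_iter=60, hop_length=256, win_length=1024)
sf.write("speech.wav", audio, 16000)
```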

## License

The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model, FLUX.2 Klein 4B.

### Training data attribution

- **LibriSpeech** (Panayotov et al., 2015). The `train.clean.100` split of the LibriSpeech ASR corpus is the sole training-data source. LibriSpeech is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). The corpus is derived from public-domain audiobook recordings on LibriVox. See http://www.openslr.org/12/ for the original distribution.

Downstream users of this adapter who redistribute reconstructed audio derived from training-data spectrograms should preserve LibriSpeech's CC BY 4.0 attribution requirement.

### Base model

The base model, FLUX.2 Klein 4B, is distributed by Black Forest Labs under the Apache License 2.0. See https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B for the original model card.

## References

- Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Panayotov, Chen, Povey, Khudanpur. *LibriSpeech: An ASR corpus based on public domain audio books.* ICASSP 2015.
- Griffin, Lim. *Signal estimation from modified short-time Fourier transform.* IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984.