| --- |
| language: en |
| license: apache-2.0 |
| base_model: black-forest-labs/FLUX.2-klein-base-4B |
| library_name: diffusers |
| tags: |
| - audio-synthesis |
| - room-impulse-response |
| - acoustics |
| - lora |
| - flux2 |
| - arxiv:2604.20329 |
| pipeline_tag: image-to-image |
| --- |
| |
| # echo-plantain |
|
|
| A LoRA adapter for FLUX.2 Klein (4B) that predicts the magnitude spectrogram of a room impulse response (RIR) from a top-down schematic of the room. It reframes acoustic modeling as image-to-image generation: the source image is a schematic showing room geometry plus source and listener positions, the target is the RIR magnitude spectrogram encoded as RGB, and inverting the color encoding recovers the magnitude, from which a mono RIR suitable for audio convolution is reconstructed. |
|
|
| This adapter tests whether the recipe from *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)) extends to physics-grounded prediction tasks where the input is a 2D image and the output is a signal that captures the response of a physical system. |
|
|
| ## Method |
|
|
| 1. **Reframe room acoustics as image-to-image.** Source: a 768 × 768 top-down schematic of a rectangular room with the audio source rendered as a red ⊕ glyph, the listener as a blue ⊙ glyph, and floor brightness encoding surface absorption (lighter = more reflective). Target: the room impulse response, computed via the image-source method, encoded as an RGB spectrogram. |
| 2. **Bijective magnitude↔RGB encoding.** Linear-amplitude STFT magnitude → dB clipped to [−100, 0] → normalized path parameter `u ∈ [0, 1]` → position along a 7-segment Hamiltonian path through the corners of the RGB cube (black → blue → cyan → green → yellow → red → magenta → white). Consecutive corners differ in exactly one channel, so the path runs along cube edges. The dB range is wider than in speech encodings so that it covers the full RIR dynamic range from direct-arrival peak to late-reverberation tail. |
| 3. **Audio params.** 16 kHz, n_fft = 1024, hop = 256, 1-second clips. STFT (513 frequency bins × 63 time frames) placed top-left in a 768 × 768 canvas with silence padding. |
| |
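The encoding in step 2 can be sketched as below. This is a minimal illustration, not the repository's implementation: the function names `db_to_rgb` and `rgb_to_db` are hypothetical, and it assumes the seven segments have equal length and are interpolated linearly, which the description above does not state. On float RGB the round trip is exact up to rounding error; after quantization to 8-bit pixels only 7 × 255 + 1 = 1786 colors on the path remain, roughly 0.056 dB per step under these assumptions.

```python
import numpy as np

# Corners of the RGB cube in the order listed above; consecutive corners
# differ in exactly one channel, so each segment is a cube edge.
CORNERS = np.array([
    [0, 0, 0],  # black
    [0, 0, 1],  # blue
    [0, 1, 1],  # cyan
    [0, 1, 0],  # green
    [1, 1, 0],  # yellow
    [1, 0, 0],  # red
    [1, 0, 1],  # magenta
    [1, 1, 1],  # white
], dtype=np.float64)
N_SEG = len(CORNERS) - 1          # 7 segments
DB_MIN, DB_MAX = -100.0, 0.0      # clipping range from the card


def db_to_rgb(db):
    """Map dB values to float RGB in [0, 1] along the corner path."""
    db = np.asarray(db, dtype=np.float64)
    u = (np.clip(db, DB_MIN, DB_MAX) - DB_MIN) / (DB_MAX - DB_MIN)
    s = u * N_SEG
    i = np.minimum(s.astype(int), N_SEG - 1)   # segment index, 0..6
    t = (s - i)[..., None]                     # position within the segment
    return (1.0 - t) * CORNERS[i] + t * CORNERS[i + 1]


def rgb_to_db(rgb):
    """Invert db_to_rgb by projecting each pixel onto its nearest segment."""
    p = np.asarray(rgb, dtype=np.float64)[..., None, :]    # (..., 1, 3)
    a = CORNERS[:-1]                                       # segment starts
    d = CORNERS[1:] - CORNERS[:-1]                         # unit-length edges
    t = np.clip(((p - a) * d).sum(-1), 0.0, 1.0)           # (..., 7)
    dist = np.linalg.norm(p - (a + t[..., None] * d), axis=-1)
    i = dist.argmin(-1)                                    # nearest segment
    t_best = np.take_along_axis(t, i[..., None], -1)[..., 0]
    u = (i + t_best) / N_SEG
    return DB_MIN + u * (DB_MAX - DB_MIN)
```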
| Training data: 10,000 randomly generated rectangular rooms simulated with [pyroomacoustics](https://github.com/LCAV/pyroomacoustics). Floor dimensions are uniform on 3–12 m × 3–12 m and ceiling height is uniform on 2.4–4.0 m; surface absorption is uniform on [0.05, 0.50]; source and listener positions are uniform inside the room with a minimum separation of 0.5 m. RIRs are computed by the image-source method up to reflection order 6. |
| |
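As a rough sketch of how one training room could be drawn from these distributions, the snippet below samples the parameters with NumPy and then simulates the RIR with pyroomacoustics. It is not the repository's dataset script. The helper names `sample_room` and `simulate_rir` are hypothetical, and several details are assumptions: one absorption coefficient shared by all surfaces, treated as an energy absorption coefficient, positions sampled in 3D, and rejection sampling for the 0.5 m separation.

```python
import numpy as np


def sample_room(rng):
    """Draw room size, absorption, and source/listener positions."""
    dims = np.array([
        rng.uniform(3.0, 12.0),   # length in m
        rng.uniform(3.0, 12.0),   # width in m
        rng.uniform(2.4, 4.0),    # ceiling height in m
    ])
    absorption = rng.uniform(0.05, 0.50)
    while True:                   # reject pairs closer than 0.5 m
        src = rng.uniform(0.0, dims)
        mic = rng.uniform(0.0, dims)
        if np.linalg.norm(src - mic) >= 0.5:
            return dims, absorption, src, mic


def simulate_rir(dims, absorption, src, mic, fs=16_000, max_order=6):
    """Image-source RIR, cropped or zero-padded to one second."""
    import pyroomacoustics as pra  # imported here so sampling works without it

    room = pra.ShoeBox(dims, fs=fs, materials=pra.Material(absorption),
                       max_order=max_order)
    room.add_source(src)
    room.add_microphone(mic)
    room.compute_rir()
    rir = np.asarray(room.rir[0][0])   # microphone 0, source 0
    out = np.zeros(fs)
    n = min(len(rir), fs)
    out[:n] = rir[:n]
    return out
```

Calling `simulate_rir(*sample_room(np.random.default_rng(0)))` would then give one 16 000-sample RIR, provided pyroomacoustics is installed.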
| ## Status |
| |
| Training in progress. Weights will be added when complete. |
| |
| ## Training |
| |
| | Setting | Value | |
| |---|---| |
| | Base | `black-forest-labs/FLUX.2-klein-base-4B` | |
| | Adapter | LoRA, rank 256 on transformer attention + rank 32 on text encoder | |
| | Resolution | 768 × 768 | |
| | Batch size | 4 | |
| | Optimizer | AdamW, lr 1e-4, cosine schedule, 300-step warmup | |
| | Max steps | 15 000 | |
| | Mixed precision | bf16 | |
| | Training data | 10 000 synthetic rooms (pyroomacoustics, image-source method, max order 6) | |
| | Audio params | 16 kHz, n_fft 1024, hop 256, 1-second RIR clips | |
| | Spectrogram encoding | Linear magnitude → dB clipped [−100, 0] → Hamiltonian RGB-cube path | |
|
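The audio parameters in the table determine the spectrogram shape. The sketch below checks the arithmetic with a plain NumPy STFT: 1024 / 2 + 1 = 513 frequency bins, and with centered framing 1 + 16000 // 256 = 63 frames. The centered, reflect-padded framing is an assumption chosen because it reproduces the 63 frames stated in the Method section. The Hann window, the peak normalization to 0 dB, and the row order (row 0 = DC) are also assumptions, and `rir_to_db_canvas` is a hypothetical name.

```python
import numpy as np

FS, N_FFT, HOP, CANVAS = 16_000, 1024, 256, 768


def rir_to_db_canvas(rir):
    """Centered STFT magnitude in dB, placed top-left on a square canvas."""
    x = np.pad(np.asarray(rir, dtype=np.float64), N_FFT // 2, mode="reflect")
    n_frames = 1 + (len(x) - N_FFT) // HOP
    window = np.hanning(N_FFT + 1)[:-1]            # periodic Hann window
    frames = np.stack([x[k * HOP:k * HOP + N_FFT] * window
                       for k in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1)).T    # (freq bins, time frames)
    peak = max(mag.max(), 1e-10)                   # assumed 0 dB reference
    db = np.clip(20.0 * np.log10(np.maximum(mag / peak, 1e-10)), -100.0, 0.0)
    canvas = np.full((CANVAS, CANVAS), -100.0)     # silence padding
    canvas[:db.shape[0], :db.shape[1]] = db
    return canvas, db.shape
```

For a one-second input this returns a 768 × 768 array in dB whose top-left 513 × 63 block holds the spectrogram and whose remaining cells sit at the −100 dB floor.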
|
| ## Usage |
|
|
| ```python |
| import torch |
| from PIL import Image |
| from diffusers import Flux2KleinPipeline |
| |
| pipe = Flux2KleinPipeline.from_pretrained( |
| "black-forest-labs/FLUX.2-klein-base-4B", torch_dtype=torch.bfloat16, |
| ).to("cuda") |
| pipe.load_lora_weights("phanerozoic/echo-plantain") |
| |
| # A top-down schematic of the target room (see `render_schematic.py` for the |
| # renderer convention: walls as outline, source as red ⊕, listener as blue ⊙, |
| # floor brightness encoding absorption). |
| schematic = Image.open("room_schematic.png").convert("RGB").resize((768, 768)) |
| |
| prompt = ( |
| "Generate a room impulse response spectrogram for the depicted space. " |
| "Time on horizontal axis (early reflections at left, late reverb tail extending right), " |
| "frequency on vertical axis. Energy encoded in RGB along a Hilbert path through " |
| "the color cube: black is below noise floor, blue/cyan is faint reflections, " |
| "green/yellow is strong reflections, red/magenta is direct-arrival energy." |
| ) |
| img = pipe( |
| image=schematic, prompt=prompt, height=768, width=768, |
| guidance_scale=4.0, num_inference_steps=20, |
| ).images[0] |
| ``` |
|
|
| The decoder (RGB → magnitude → mono RIR) is in `decode_rir.py`. The recovered RIR can be convolved with any dry signal to apply the predicted room reverb. |
|
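Once a mono RIR has been recovered, applying it is a plain linear convolution. The sketch below assumes the decoder output is available as a 1-D float array `rir` at the same sample rate as the dry signal; the interface of `decode_rir.py` is not shown here, and `apply_rir` is a hypothetical helper. It uses FFT-based convolution from NumPy and rescales only when the result would clip.

```python
import numpy as np


def apply_rir(dry, rir):
    """Convolve a dry mono signal with an RIR; output has full length."""
    dry = np.asarray(dry, dtype=np.float64)
    rir = np.asarray(rir, dtype=np.float64)
    n = len(dry) + len(rir) - 1                    # full linear convolution
    wet = np.fft.irfft(np.fft.rfft(dry, n) * np.fft.rfft(rir, n), n)
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 1.0 else wet       # avoid clipping on export
```

The output is `len(dry) + len(rir) - 1` samples long, so the reverb tail extends past the end of the dry signal.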
|
| ## License |
|
|
| The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model FLUX.2 Klein 4B. |
|
|
| ### Training data attribution |
|
|
| The training data is fully synthetic, generated at preparation time from random rectangular room geometries via the [pyroomacoustics](https://github.com/LCAV/pyroomacoustics) Python library (Scheibler, Bezzam, Dokmanić, 2018). Pyroomacoustics is distributed under the MIT License. No external dataset is required to reproduce the training corpus; the dataset-generation script is included in this repository. |
|
|
| ### Base model |
|
|
| The base model, FLUX.2 Klein 4B, is distributed by Black Forest Labs under the Apache License 2.0. See https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B for the original model card. |
|
|
| ## References |
|
|
| - Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026). |
| - Scheibler, Bezzam, Dokmanić. *Pyroomacoustics: A Python package for audio room simulation and array processing algorithms.* ICASSP 2018. |
| - Allen, Berkley. *Image method for efficiently simulating small-room acoustics.* JASA 1979. |
|
|