---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
- audio-synthesis
- room-impulse-response
- acoustics
- lora
- flux2
- arxiv:2604.20329
pipeline_tag: image-to-image
---

# echo-plantain

A LoRA adapter on FLUX.2 Klein (4B) that predicts the magnitude spectrogram of a room impulse response (RIR) from a top-down schematic of the room. It reframes acoustic modeling as image-to-image generation: the source image is a schematic showing the room geometry plus the source and listener positions, the target image is the RIR spectrogram encoded in RGB, and inverting that encoding recovers the magnitude, from which a mono RIR suitable for audio convolution is reconstructed.

This adapter tests whether the recipe from *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)) extends to physics-grounded prediction tasks where the input is a 2D image and the output is a signal that captures the response of a physical system.

## Method

1. **Reframe room acoustics as image-to-image.** Source image: a 768 × 768 top-down schematic of a rectangular room, with the audio source rendered as a red ⊕ glyph, the listener as a blue ⊙ glyph, and floor brightness encoding surface absorption (lighter = more reflective). Target image: the room impulse response, computed via the image-source method and encoded as an RGB spectrogram.
2. **Bijective magnitude↔RGB encoding.** Linear-amplitude STFT magnitude → dB clipped to [−100, 0] → curve parameter `u ∈ [0, 1]` → position on a 7-segment Hamiltonian path through the eight corners of the RGB cube (black → blue → cyan → green → yellow → red → magenta → white). The dB range is wider than in speech encodings so that it covers the full RIR dynamic range, from the direct-arrival peak to the late-reverberation tail.
3. **Audio params.** 16 kHz, n_fft = 1024, hop = 256, 1-second clips. The STFT (513 frequency bins × 63 time frames) is placed at the top-left of a 768 × 768 canvas; the remainder is padded with silence.
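
The encoding in steps 2 and 3 can be sketched roughly as follows. This is an illustration, not the repository's encoder: the dB reference `ref`, the row-equals-frequency-bin layout, the 8-bit quantization, and the nearest-segment projection used for decoding are all assumptions, and every function name is hypothetical. The stated shape is consistent with a centered STFT: 513 = 1024/2 + 1 bins and 63 = 1 + 16000 // 256 frames.

```python
import numpy as np

# Cube corners in the order listed above. Consecutive corners differ in exactly
# one channel, so the path runs along seven unit-length cube edges.
CORNERS = np.array([
    [0, 0, 0],  # black
    [0, 0, 1],  # blue
    [0, 1, 1],  # cyan
    [0, 1, 0],  # green
    [1, 1, 0],  # yellow
    [1, 0, 0],  # red
    [1, 0, 1],  # magenta
    [1, 1, 1],  # white
], dtype=np.float64)
N_SEG = len(CORNERS) - 1
DB_MIN, DB_MAX = -100.0, 0.0


def magnitude_to_u(mag, ref=1.0):
    """Linear magnitude -> dB relative to `ref` (assumed) -> u in [0, 1]."""
    db = 20.0 * np.log10(np.maximum(mag, 1e-12) / ref)
    return (np.clip(db, DB_MIN, DB_MAX) - DB_MIN) / (DB_MAX - DB_MIN)


def u_to_magnitude(u, ref=1.0):
    """Inverse of magnitude_to_u for magnitudes inside the clipped range."""
    return ref * 10.0 ** ((u * (DB_MAX - DB_MIN) + DB_MIN) / 20.0)


def u_to_rgb(u):
    """Map u in [0, 1] to a float RGB point on the 7-segment path."""
    t = np.clip(np.asarray(u, dtype=np.float64), 0.0, 1.0) * N_SEG
    seg = np.minimum(t.astype(int), N_SEG - 1)  # segment index 0..6
    frac = (t - seg)[..., None]                 # position within the segment
    return CORNERS[seg] * (1.0 - frac) + CORNERS[seg + 1] * frac


def rgb_to_u(rgb):
    """Project float RGB points onto the nearest path segment and return u."""
    p = np.asarray(rgb, dtype=np.float64)[..., None, :]   # (..., 1, 3)
    start, step = CORNERS[:-1], CORNERS[1:] - CORNERS[:-1]  # (7, 3) each
    frac = np.clip(((p - start) * step).sum(-1), 0.0, 1.0)  # (..., 7)
    dist = np.linalg.norm(start + frac[..., None] * step - p, axis=-1)
    best = dist.argmin(-1)
    best_frac = np.take_along_axis(frac, best[..., None], -1)[..., 0]
    return (best + best_frac) / N_SEG


def spectrogram_to_canvas(mag, size=768, ref=1.0):
    """Encode a (freq, time) magnitude array into the top-left of a black canvas."""
    canvas = np.zeros((size, size, 3), dtype=np.uint8)  # black = bottom of dB range
    rgb = np.rint(u_to_rgb(magnitude_to_u(mag, ref)) * 255.0).astype(np.uint8)
    canvas[: mag.shape[0], : mag.shape[1]] = rgb
    return canvas


def canvas_to_spectrogram(canvas, shape=(513, 63), ref=1.0):
    """Crop the top-left region and decode it back to linear magnitude."""
    rgb = canvas[: shape[0], : shape[1]].astype(np.float64) / 255.0
    return u_to_magnitude(rgb_to_u(rgb), ref)
```

Under these assumptions, 8-bit pixels give 255 steps per segment, so the round trip is exact up to a quantization error of a few hundredths of a dB for pixels that lie on the path. Pixels produced by the model will generally lie slightly off the path, which is why the decoder here projects to the nearest segment rather than looking up exact colors.
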

Training data: 10,000 randomly generated rectangular rooms simulated with [pyroomacoustics](https://github.com/LCAV/pyroomacoustics). Floor dimensions are uniform on 3–12 m × 3–12 m, with a 2.4–4.0 m ceiling; surface absorption is uniform on [0.05, 0.50]; source and listener positions are uniform inside the room, with a minimum separation of 0.5 m. RIRs are computed by the image-source method up to reflection order 6.
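
One way to draw a room from these ranges and simulate its RIR is sketched below. It is not the generation script used for this adapter. It assumes a single absorption value shared by all surfaces, 3D positions sampled uniformly over the full room volume, rejection sampling for the separation constraint, and cropping or zero-padding the RIR to one second; `sample_room` and `simulate_rir` are hypothetical names.

```python
import numpy as np


def sample_room(rng, min_sep=0.5):
    """Draw one room configuration from the ranges quoted above."""
    dims = np.concatenate([rng.uniform(3.0, 12.0, size=2), rng.uniform(2.4, 4.0, size=1)])
    absorption = float(rng.uniform(0.05, 0.50))  # assumed: one value for all surfaces
    while True:  # rejection-sample until source and listener are far enough apart
        source = rng.uniform(0.0, dims)
        listener = rng.uniform(0.0, dims)
        if np.linalg.norm(source - listener) >= min_sep:
            break
    return {"dims": dims, "absorption": absorption, "source": source, "listener": listener}


def simulate_rir(cfg, fs=16000, max_order=6, n_samples=16000):
    """Image-source RIR via pyroomacoustics, cropped or zero-padded to n_samples."""
    import pyroomacoustics as pra  # imported lazily; only needed for simulation

    room = pra.ShoeBox(
        cfg["dims"], fs=fs, materials=pra.Material(cfg["absorption"]), max_order=max_order
    )
    room.add_source(cfg["source"])
    room.add_microphone(cfg["listener"])
    room.compute_rir()
    rir = np.asarray(room.rir[0][0])[:n_samples]  # rir[mic][source]
    return np.pad(rir, (0, n_samples - len(rir)))
```

Calling `simulate_rir(sample_room(np.random.default_rng(0)))` would then return a 16,000-sample array ready for the STFT step.
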
## Status

Training in progress. Weights will be added when complete.

## Training

| Setting | Value |
|---|---|
| Base | `black-forest-labs/FLUX.2-klein-base-4B` |
| Adapter | LoRA, rank 256 on transformer attention + rank 32 on text encoder |
| Resolution | 768 × 768 |
| Batch size | 4 |
| Optimizer | AdamW, lr 1e-4, cosine schedule, 300-step warmup |
| Max steps | 15,000 |
| Mixed precision | bf16 |
| Training data | 10,000 synthetic rooms (pyroomacoustics, image-source method, max order 6) |
| Audio params | 16 kHz, n_fft 1024, hop 256, 1-second RIR clips |
| Spectrogram encoding | Linear magnitude → dB clipped to [−100, 0] → 7-segment Hamiltonian path through the RGB-cube corners (the "Hilbert path" of the prompt) |
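
For orientation, the optimizer row corresponds to a learning-rate curve along the lines of the sketch below. The table does not say how the warmup is shaped or where the cosine ends, so linear warmup from zero and decay to zero at the final step are assumptions, and `lr_at` is a hypothetical helper rather than part of the training code.

```python
import math


def lr_at(step, base_lr=1e-4, warmup_steps=300, max_steps=15_000):
    """Learning rate at a 0-based optimizer step (linear warmup, cosine decay)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

With these defaults the rate peaks at 1e-4 on step 300 and falls to half of that midway through the decay, at step 7,650.
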
## Usage

```python
import torch
from PIL import Image
from diffusers import Flux2KleinPipeline

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B", torch_dtype=torch.bfloat16,
).to("cuda")
# The adapter weights are not published yet (see Status); this call works once they are.
pipe.load_lora_weights("phanerozoic/echo-plantain")

# A top-down schematic of the target room (see `render_schematic.py` for the
# renderer convention: walls as outline, source as red ⊕, listener as blue ⊙,
# floor brightness encoding absorption).
schematic = Image.open("room_schematic.png").convert("RGB").resize((768, 768))

prompt = (
    "Generate a room impulse response spectrogram for the depicted space. "
    "Time on horizontal axis (early reflections at left, late reverb tail extending right), "
    "frequency on vertical axis. Energy encoded in RGB along a Hilbert path through "
    "the color cube: black is below noise floor, blue/cyan is faint reflections, "
    "green/yellow is strong reflections, red/magenta is direct-arrival energy."
)
img = pipe(
    image=schematic, prompt=prompt, height=768, width=768,
    guidance_scale=4.0, num_inference_steps=20,
).images[0]
```

The decoder (RGB → magnitude → mono RIR) is in `decode_rir.py`. The recovered RIR can be convolved with any dry signal to apply the predicted room reverb.
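
As a rough illustration of that last step, the sketch below convolves a dry signal with an RIR using NumPy. It assumes both are mono float arrays at the same sample rate; `apply_reverb` and its peak limiting are illustrative choices and not part of `decode_rir.py`.

```python
import numpy as np


def apply_reverb(dry, rir, peak_limit=True):
    """Convolve a dry mono signal with a mono RIR sampled at the same rate."""
    wet = np.convolve(dry, rir)  # full convolution: len(dry) + len(rir) - 1 samples
    peak = np.max(np.abs(wet))
    if peak_limit and peak > 1.0:
        wet = wet / peak  # scale down only when the result would clip
    return wet
```

The output is longer than the input by the RIR length minus one sample, which is the reverb tail ringing out after the dry signal ends.
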
## License

Apache 2.0.

## References

- Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Scheibler, Bezzam, Dokmanić. *Pyroomacoustics: A Python package for audio room simulation and array processing algorithms.* ICASSP 2018.
- Allen, Berkley. *Image method for efficiently simulating small-room acoustics.* JASA 1979.