Commit 15cfc67 (verified) by phanerozoic · 1 parent: 322cea2

Add README

  1. README.md +89 -0
README.md ADDED
@@ -0,0 +1,89 @@
+ ---
+ language: en
+ license: apache-2.0
+ base_model: black-forest-labs/FLUX.2-klein-base-4B
+ library_name: diffusers
+ tags:
+ - audio-synthesis
+ - room-impulse-response
+ - acoustics
+ - lora
+ - flux2
+ - arxiv:2604.20329
+ pipeline_tag: image-to-image
+ ---
+
+ # echo-plantain
+
+ A LoRA adapter on FLUX.2 Klein (4B) that predicts the magnitude spectrogram of a room impulse response (RIR) from a top-down schematic of the room. It reframes acoustic modeling as image-to-image generation: the source image is a schematic showing room geometry plus source and listener positions, the target is the RIR spectrogram encoded in RGB, and inverting the color encoding recovers the magnitude, from which a mono RIR suitable for audio convolution is reconstructed.
+
+ This adapter tests whether the recipe from *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)) extends to physics-grounded prediction tasks in which the input is a 2D image and the output is a signal describing the response of a physical system.
+
+ ## Method
+
+ 1. **Reframe room acoustics as image-to-image.** Source: a 768 × 768 top-down schematic of a rectangular room, with the audio source rendered as a red ⊕ glyph, the listener as a blue ⊙ glyph, and floor brightness encoding surface absorption (lighter = more reflective). Target: the room impulse response, computed with the image-source method and encoded as an RGB spectrogram.
+ 2. **Bijective magnitude↔RGB encoding.** Linear-amplitude STFT magnitude → dB, clipped to [−100, 0] → normalized curve parameter `u ∈ [0, 1]` → position along a 7-segment Hamiltonian path through the corners of the RGB cube (black → blue → cyan → green → yellow → red → magenta → white), referred to as the Hilbert path in the training table and the prompt. The dB range, wider than the one speech encodings use, captures the full RIR dynamic range from the direct-arrival peak to the late-reverberation tail.
+ 3. **Audio params.** 16 kHz, n_fft = 1024, hop = 256, 1-second clips. The STFT (513 frequency bins × 63 time frames) is placed top-left in a 768 × 768 canvas, and the rest of the canvas is padded with silence.
+
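The encoding in step 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation: it assumes a linear map from clipped dB to `u`, seven equal-length segments, linear interpolation between neighboring corners, and float RGB in [0, 1] rather than 8-bit pixels. The names `PATH`, `db_to_rgb`, and `rgb_to_db` are hypothetical, and the functions work on one value at a time for clarity.

```python
# Cube corners in the order listed in step 2; each step changes one channel.
PATH = [
    (0, 0, 0),  # black
    (0, 0, 1),  # blue
    (0, 1, 1),  # cyan
    (0, 1, 0),  # green
    (1, 1, 0),  # yellow
    (1, 0, 0),  # red
    (1, 0, 1),  # magenta
    (1, 1, 1),  # white
]
DB_MIN, DB_MAX = -100.0, 0.0
N_SEG = len(PATH) - 1  # 7 segments


def db_to_rgb(db):
    """Map a dB value to a float RGB triple in [0, 1] along the corner path."""
    db = min(max(db, DB_MIN), DB_MAX)      # clip to [-100, 0]
    u = (db - DB_MIN) / (DB_MAX - DB_MIN)  # curve parameter in [0, 1] (assumed linear)
    s = u * N_SEG
    k = min(int(s), N_SEG - 1)             # segment index, 0..6
    t = s - k                              # position within the segment
    a, b = PATH[k], PATH[k + 1]
    return tuple(a[i] + t * (b[i] - a[i]) for i in range(3))


def rgb_to_db(rgb):
    """Invert db_to_rgb by projecting onto the nearest path segment."""
    best = None
    for k in range(N_SEG):
        a, b = PATH[k], PATH[k + 1]
        # Each segment has unit length, so the projection is a plain dot product.
        t = sum((rgb[i] - a[i]) * (b[i] - a[i]) for i in range(3))
        t = min(max(t, 0.0), 1.0)
        dist = sum((rgb[i] - (a[i] + t * (b[i] - a[i]))) ** 2 for i in range(3))
        if best is None or dist < best[0]:
            best = (dist, (k + t) / N_SEG)
    return DB_MIN + best[1] * (DB_MAX - DB_MIN)


print(db_to_rgb(-50.0))  # (0.5, 1.0, 0.0), halfway between green and yellow
```

Projecting onto the nearest segment, rather than requiring an exact match, lets the inverse tolerate generated pixels that land slightly off the path.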
+ Training data: 10,000 randomly generated rectangular rooms simulated with [pyroomacoustics](https://github.com/LCAV/pyroomacoustics). Floor dimensions are uniform on 3–12 m × 3–12 m with a 2.4–4.0 m ceiling; surface absorption is uniform on [0.05, 0.50]; source and listener positions are uniform inside the room with a minimum separation of 0.5 m. RIRs are computed with the image-source method up to reflection order 6.
+
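The sampling ranges above can be illustrated with a small standard-library sketch. Several details are assumptions rather than facts from this README: one absorption value shared by all surfaces, positions sampled over the full room height, separation measured in 3D, and rejection sampling to enforce the 0.5 m minimum. The function name `sample_room` and the dictionary keys are hypothetical.

```python
import math
import random


def sample_room(rng):
    """Draw one room configuration from the ranges quoted above."""
    width = rng.uniform(3.0, 12.0)        # metres
    depth = rng.uniform(3.0, 12.0)        # metres
    height = rng.uniform(2.4, 4.0)        # ceiling height, metres
    absorption = rng.uniform(0.05, 0.50)  # assumed identical for all surfaces

    def random_point():
        return (
            rng.uniform(0.0, width),
            rng.uniform(0.0, depth),
            rng.uniform(0.0, height),
        )

    # Rejection-sample until source and listener are at least 0.5 m apart.
    while True:
        source, listener = random_point(), random_point()
        if math.dist(source, listener) >= 0.5:
            break

    return {
        "dims": (width, depth, height),
        "absorption": absorption,
        "source": source,
        "listener": listener,
    }


room = sample_room(random.Random(0))
print(sorted(room))  # ['absorption', 'dims', 'listener', 'source']
```

A configuration like this could then be handed to a shoebox simulation in pyroomacoustics with the reflection order capped at 6.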
+ ## Status
+
+ Training in progress. Weights will be added when complete.
+
+ ## Training
+
+ | Setting | Value |
+ |---|---|
+ | Base | `black-forest-labs/FLUX.2-klein-base-4B` |
+ | Adapter | LoRA, rank 256 on transformer attention + rank 32 on text encoder |
+ | Resolution | 768 × 768 |
+ | Batch size | 4 |
+ | Optimizer | AdamW, lr 1e-4, cosine schedule, 300-step warmup |
+ | Max steps | 15,000 |
+ | Mixed precision | bf16 |
+ | Training data | 10,000 synthetic rooms (pyroomacoustics, image-source method, max order 6) |
+ | Audio params | 16 kHz, n_fft 1024, hop 256, 1-second RIR clips |
+ | Spectrogram encoding | Linear magnitude → dB clipped [−100, 0] → Hilbert RGB-cube path |
+
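The 513 × 63 spectrogram shape quoted in the Method section follows from the audio parameters in the table, provided the STFT is centered (the signal is padded by `n_fft // 2` on each side). That centering is an assumption; without it the same parameters would give 59 frames. The short check below works through the arithmetic, with constant names chosen for illustration.

```python
SAMPLE_RATE = 16_000
N_FFT = 1024
HOP = 256
CLIP_SECONDS = 1
CANVAS = 768

n_samples = SAMPLE_RATE * CLIP_SECONDS
n_bins = N_FFT // 2 + 1          # one-sided spectrum
n_frames = 1 + n_samples // HOP  # centered STFT (assumed)

# Resolution of one spectrogram cell.
bin_hz = SAMPLE_RATE / N_FFT     # frequency step per bin
hop_ms = 1000 * HOP / SAMPLE_RATE  # time step per frame

# Canvas rows and columns left over for silence padding.
pad_rows = CANVAS - n_bins
pad_cols = CANVAS - n_frames

print(n_bins, n_frames)    # 513 63
print(bin_hz, hop_ms)      # 15.625 16.0
print(pad_rows, pad_cols)  # 255 705
```

Only 63 of the 768 canvas columns carry signal, so most of each target image is padding.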
+ ## Usage
+
+ ```python
+ import torch
+ from PIL import Image
+ from diffusers import Flux2KleinPipeline
+
+ # Load the base model, then attach the LoRA adapter.
+ pipe = Flux2KleinPipeline.from_pretrained(
+     "black-forest-labs/FLUX.2-klein-base-4B", torch_dtype=torch.bfloat16,
+ ).to("cuda")
+ pipe.load_lora_weights("phanerozoic/echo-plantain")
+
+ # A top-down schematic of the target room (see `render_schematic.py` for the
+ # renderer convention: walls as outline, source as red ⊕, listener as blue ⊙,
+ # floor brightness encoding absorption).
+ schematic = Image.open("room_schematic.png").convert("RGB").resize((768, 768))
+
+ prompt = (
+     "Generate a room impulse response spectrogram for the depicted space. "
+     "Time on horizontal axis (early reflections at left, late reverb tail extending right), "
+     "frequency on vertical axis. Energy encoded in RGB along a Hilbert path through "
+     "the color cube: black is below noise floor, blue/cyan is faint reflections, "
+     "green/yellow is strong reflections, red/magenta is direct-arrival energy."
+ )
+
+ # The output is the RGB-encoded RIR spectrogram as a PIL image.
+ img = pipe(
+     image=schematic, prompt=prompt, height=768, width=768,
+     guidance_scale=4.0, num_inference_steps=20,
+ ).images[0]
+ ```
+
+ The decoder (RGB → magnitude → mono RIR) is in `decode_rir.py`. The recovered RIR can be convolved with any dry signal to apply the predicted room reverb.
+
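As an illustration of that last step, the sketch below convolves a dry signal with an RIR using NumPy. It does not use `decode_rir.py`: the RIR here is a synthetic decaying-noise stand-in, the dry signal is a single click, and the helper name `apply_rir` and its peak-rescaling rule are assumptions made for the example.

```python
import numpy as np


def apply_rir(dry, rir):
    """Convolve a dry mono signal with an RIR; rescale only if the result would clip."""
    wet = np.convolve(dry, rir, mode="full")
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 1.0 else wet


fs = 16_000
rng = np.random.default_rng(0)

# Synthetic stand-in for a decoded 1-second RIR: exponentially decaying noise.
rir = rng.standard_normal(fs) * np.exp(-np.arange(fs) / (0.2 * fs))
rir[0] = 1.0  # direct arrival

# Dry signal: half a second of silence with a unit click at the start.
dry = np.zeros(fs // 2)
dry[0] = 1.0

wet = apply_rir(dry, rir)
print(wet.shape)  # (23999,)
```

The output is `len(dry) + len(rir) - 1` samples long because the reverb tail extends past the end of the dry signal. For long recordings, an FFT-based routine such as `scipy.signal.fftconvolve` is usually faster than direct convolution.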
+ ## License
+
+ Apache 2.0.
+
+ ## References
+
+ - Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
+ - Scheibler, Bezzam, Dokmanić. *Pyroomacoustics: A Python package for audio room simulation and array processing algorithms.* ICASSP 2018.
+ - Allen, Berkley. *Image method for efficiently simulating small-room acoustics.* JASA 1979.