---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
- interpretability
- mechanistic-interpretability
- probing
- physical-scale
- flux2
- vision-banana
- arxiv:2604.20329
---

# measure-plantain
A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents physical scale — interpreting the same depicted object as either millimeter-scale or kilometer-scale — as a per-head axis separable from object identity. Companion to *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)).
## Hypothesis
If generative pretraining recovers physical-world structure beyond what is necessary for the pixel-prediction objective, the model may carry a representational axis corresponding to the physical scale of depicted objects. A clean test holds the input image constant and varies only the textual scale claim associated with that image, isolating "interpret-this-object-at-scale-X" from "depict-an-object-of-class-X".
## Method
A single fixed reference image of a neutral matte rounded grey object on a neutral grey background is generated once at 14 inference steps with `guidance_scale=4.0`. The reference is constructed to expose minimal scale cues (no other objects in frame, neutral lighting, no horizon). All subsequent probe passes condition on this same image.
25 paired prompts are constructed. The two prompts in each pair share sentence structure and the noun "object"; only the magnitude descriptor differs, ranging from millimeter scale and below ("a 1-millimeter object", "a 0.1 mm grain", "smaller than a grain of sand") to hundreds of meters and above ("a 1-kilometer object", "a 100-meter mountain", "larger than a skyscraper"). The two members of each pair differ by at least three orders of magnitude in claimed physical scale.
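
The pairing constraint can be illustrated with a short sketch; the example pairs and size assignments below are hypothetical stand-ins (the actual 25 pairs are not listed here), showing only the construction pattern:

```python
import math

# Hypothetical examples of paired prompts; these illustrate only the
# construction pattern, not the probe's actual prompt set.
PAIRS = [
    # (small-scale prompt, claimed size in meters,
    #  large-scale prompt, claimed size in meters)
    ("a 1-millimeter object", 1e-3, "a 1-kilometer object", 1e3),
    ("a 0.1 mm grain", 1e-4, "a 100-meter mountain", 1e2),
    ("smaller than a grain of sand", 1e-3, "larger than a skyscraper", 3e2),
]

def orders_of_magnitude(small_m: float, large_m: float) -> float:
    """Log10 ratio between the two claimed physical scales of a pair."""
    return math.log10(large_m / small_m)

# Every pair must differ by at least three orders of magnitude.
for _, small_m, _, large_m in PAIRS:
    assert orders_of_magnitude(small_m, large_m) >= 3.0
```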
For each prompt the pipeline is run image-conditioned on the fixed reference at one inference step with `guidance_scale=1.0`. A forward pre-hook on every transformer attention output projection (5 joint MMDiT blocks + 20 single blocks, 16 320 heads total) captures the per-head RMS magnitude of the projection's input activation. For each head, the paired difference `d = large − small` is taken across the 25 pairs and the t-statistic is computed as `t = mean(d) / (std(d) / sqrt(N))` with `N = 25`. The empirical null is constructed from 200 independent random sign-flips of the pair labels (relabelling small↔large within each pair).
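
The per-head paired t-statistic and its sign-flip null can be sketched in NumPy; the array shapes and the random stand-in activations are assumptions, since the capture code itself is not published:

```python
import numpy as np

rng = np.random.default_rng(0)
N_PAIRS, N_HEADS = 25, 16320

# Per-head RMS magnitudes for each pair member; random stand-ins here
# in place of the hook-captured activations.
rms_small = rng.normal(1.0, 0.1, size=(N_PAIRS, N_HEADS))
rms_large = rng.normal(1.0, 0.1, size=(N_PAIRS, N_HEADS))

def paired_t(diffs: np.ndarray) -> np.ndarray:
    """Paired t-statistic per head: mean(d) / (std(d) / sqrt(N))."""
    n = diffs.shape[0]
    return diffs.mean(axis=0) / (diffs.std(axis=0, ddof=1) / np.sqrt(n))

diffs = rms_large - rms_small          # shape (N_PAIRS, N_HEADS)
t_observed = paired_t(diffs)

# Empirical null: 200 random sign-flips of the pair labels
# (relabelling small <-> large within each pair).
null_counts = []
for _ in range(200):
    signs = rng.choice([-1.0, 1.0], size=(N_PAIRS, 1))
    null_counts.append(int(np.sum(np.abs(paired_t(diffs * signs)) > 5)))

p99 = np.percentile(null_counts, 99)   # null count of heads with |t| > 5
observed = int(np.sum(np.abs(t_observed) > 5))
```

With real captures in place of the random stand-ins, `observed / p99` is the ratio reported in the Results table.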
By design, the input image and the named object word are identical across both conditions, so differences in per-head response can be attributed only to the textual scale claim.
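
The RMS capture itself can be sketched with a forward pre-hook on a toy projection layer, assuming PyTorch; the module, head count, and shapes are illustrative stand-ins, not FLUX.2 Klein's actual layout:

```python
import torch

HEADS, HEAD_DIM = 4, 8  # toy sizes; the real head layout differs

# Stand-in for one attention output projection; a forward *pre*-hook sees
# the projection's input, i.e. the concatenated per-head outputs.
proj = torch.nn.Linear(HEADS * HEAD_DIM, HEADS * HEAD_DIM)

captured = {}

def rms_pre_hook(module, args):
    # args[0]: (batch, tokens, HEADS * HEAD_DIM) input to the projection
    x = args[0].detach().reshape(*args[0].shape[:-1], HEADS, HEAD_DIM)
    # RMS over batch, tokens, and head_dim -> one scalar per head
    captured["rms"] = x.pow(2).mean(dim=(0, 1, 3)).sqrt()

handle = proj.register_forward_pre_hook(rms_pre_hook)
with torch.no_grad():
    proj(torch.randn(2, 16, HEADS * HEAD_DIM))
handle.remove()
```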
## Results
| metric | observed | null mean | null p99 | observed / null p99 |
|---|---:|---:|---:|---:|
| heads with \|t\|>3 | 7 044 (43.2%) | 94 | 818 | 8.6× |
| heads with \|t\|>5 | 2 549 (15.6%) | 1 | 5 | **510×** |
| max \|t\| | 14.89 | — | — | — |
Approximately 43% of all attention heads respond differentially to the small-scale vs. large-scale interpretation of the same neutral reference object. Dividing the 2 549 observed heads with \|t\|>5 by the empirical 99th-percentile null count of 5 gives an observed/null ratio of 510×. The strongest selective heads cluster in single transformer block 2, an early-network region also flagged by the perspective-taking probe (otherview-plantain), consistent with text-driven cross-modal modulation occurring early in the network.
## License
Apache 2.0.
## References
- Gabeur, V., Long, S., Peng, S., et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Black Forest Labs. *FLUX.2 Klein.* https://bfl.ai/models/flux-2-klein (2025).