---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
- interpretability
- mechanistic-interpretability
- probing
- physical-scale
- flux2
- vision-banana
- arxiv:2604.20329
---

# measure-plantain
A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents physical scale — interpreting the same depicted object as either millimeter-scale or kilometer-scale — as a per-head axis separable from object identity. Companion to *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)).
## Hypothesis
If generative pretraining recovers physical-world structure beyond what is necessary for the pixel-prediction objective, the model may carry a representational axis corresponding to the physical scale of depicted objects. A clean test holds the input image constant and varies only the textual scale claim associated with that image, isolating "interpret-this-object-at-scale-X" from "depict-an-object-of-class-X".
## Method
A single fixed reference image of a neutral matte rounded grey object on a neutral grey background is generated once at 14 inference steps with `guidance_scale=4.0`. The reference is constructed to expose minimal scale cues (no other objects in frame, neutral lighting, no horizon). All subsequent probe passes condition on this same image.
25 paired prompts are constructed. The two prompts in each pair share sentence structure and the noun "object"; only the magnitude descriptor differs, ranging from millimeter scale and below ("a 1-millimeter object", "a 0.1 mm grain", "smaller than a grain of sand") to hundreds of meters and above ("a 1-kilometer object", "a 100-meter mountain", "larger than a skyscraper"). The two members of each pair differ by at least three orders of magnitude in claimed physical scale.
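
The pairing constraint can be illustrated with a short sketch; the example pairs and size assignments below are hypothetical stand-ins (the actual 25 pairs are not listed here), showing only the construction pattern:

```python
import math

# Hypothetical examples of paired prompts; these illustrate only the
# construction pattern, not the probe's actual prompt set.
PAIRS = [
    # (small-scale prompt, claimed size in meters,
    #  large-scale prompt, claimed size in meters)
    ("a 1-millimeter object", 1e-3, "a 1-kilometer object", 1e3),
    ("a 0.1 mm grain", 1e-4, "a 100-meter mountain", 1e2),
    ("smaller than a grain of sand", 1e-3, "larger than a skyscraper", 3e2),
]

def orders_of_magnitude(small_m: float, large_m: float) -> float:
    """Log10 ratio between the two claimed physical scales of a pair."""
    return math.log10(large_m / small_m)

# Every pair must differ by at least three orders of magnitude.
for _, small_m, _, large_m in PAIRS:
    assert orders_of_magnitude(small_m, large_m) >= 3.0
```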
For each prompt the pipeline is run image-conditioned on the fixed reference at one inference step with `guidance_scale=1.0`. A forward pre-hook on every transformer attention output projection (5 joint MMDiT blocks + 20 single blocks, 16 320 heads total) captures the per-head RMS magnitude of the projection's input activation. For each head, the paired difference `d = large − small` is taken across the 25 pairs and the t-statistic is computed as `t = mean(d) / (std(d) / sqrt(N))` with `N = 25`. The empirical null is constructed from 200 independent random sign-flips of the pair labels (relabelling small↔large within each pair).
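
The per-head paired t-statistic and its sign-flip null can be sketched in NumPy; the array shapes and the random stand-in activations are assumptions, since the capture code itself is not published:

```python
import numpy as np

rng = np.random.default_rng(0)
N_PAIRS, N_HEADS = 25, 16320

# Per-head RMS magnitudes for each pair member; random stand-ins here
# in place of the hook-captured activations.
rms_small = rng.normal(1.0, 0.1, size=(N_PAIRS, N_HEADS))
rms_large = rng.normal(1.0, 0.1, size=(N_PAIRS, N_HEADS))

def paired_t(diffs: np.ndarray) -> np.ndarray:
    """Paired t-statistic per head: mean(d) / (std(d) / sqrt(N))."""
    n = diffs.shape[0]
    return diffs.mean(axis=0) / (diffs.std(axis=0, ddof=1) / np.sqrt(n))

diffs = rms_large - rms_small          # shape (N_PAIRS, N_HEADS)
t_observed = paired_t(diffs)

# Empirical null: 200 random sign-flips of the pair labels
# (relabelling small <-> large within each pair).
null_counts = []
for _ in range(200):
    signs = rng.choice([-1.0, 1.0], size=(N_PAIRS, 1))
    null_counts.append(int(np.sum(np.abs(paired_t(diffs * signs)) > 5)))

p99 = np.percentile(null_counts, 99)   # null count of heads with |t| > 5
observed = int(np.sum(np.abs(t_observed) > 5))
```

With real captures in place of the random stand-ins, `observed / p99` is the ratio reported in the Results table.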
By design, the input image and the named object word are identical across both conditions, so differences in per-head response can be attributed only to the textual scale claim.
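
The RMS capture itself can be sketched with a forward pre-hook on a toy projection layer, assuming PyTorch; the module, head count, and shapes are illustrative stand-ins, not FLUX.2 Klein's actual layout:

```python
import torch

HEADS, HEAD_DIM = 4, 8  # toy sizes; the real head layout differs

# Stand-in for one attention output projection; a forward *pre*-hook sees
# the projection's input, i.e. the concatenated per-head outputs.
proj = torch.nn.Linear(HEADS * HEAD_DIM, HEADS * HEAD_DIM)

captured = {}

def rms_pre_hook(module, args):
    # args[0]: (batch, tokens, HEADS * HEAD_DIM) input to the projection
    x = args[0].detach().reshape(*args[0].shape[:-1], HEADS, HEAD_DIM)
    # RMS over batch, tokens, and head_dim -> one scalar per head
    captured["rms"] = x.pow(2).mean(dim=(0, 1, 3)).sqrt()

handle = proj.register_forward_pre_hook(rms_pre_hook)
with torch.no_grad():
    proj(torch.randn(2, 16, HEADS * HEAD_DIM))
handle.remove()
```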
## Results
| metric | observed | null mean | null p99 | observed / null p99 |
|---|---:|---:|---:|---:|
| heads with \|t\|>3 | 7 044 (43.2%) | 94 | 818 | 8.6× |
| heads with \|t\|>5 | 2 549 (15.6%) | 1 | 5 | **510×** |
| max \|t\| | 14.89 | — | — | — |
Approximately 43% of all attention heads respond differentially to the small-scale vs. large-scale interpretation of the same neutral reference object. Dividing the 2 549 observed heads with \|t\|>5 by the empirical 99th-percentile null count of 5 gives an observed/null ratio of 510×. The strongest selective heads cluster in single transformer block 2, an early-network region also flagged by the perspective-taking probe (otherview-plantain), consistent with text-driven cross-modal modulation occurring early in the network.
## License
Apache 2.0.
## References
- Gabeur, V., Long, S., Peng, S., et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Black Forest Labs. *FLUX.2 Klein.* https://bfl.ai/models/flux-2-klein (2025).