measure-plantain

A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents physical scale (interpreting the same depicted object as either millimeter-scale or kilometer-scale) as a per-head axis separable from object identity. Companion to "Image Generators are Generalist Vision Learners" (Gabeur et al., 2026; arXiv:2604.20329).

Hypothesis

If generative pretraining recovers physical-world structure beyond what is necessary for the pixel-prediction objective, the model may carry a representational axis corresponding to the physical scale of depicted objects. A clean test holds the input image constant and varies only the textual scale claim associated with that image, isolating "interpret-this-object-at-scale-X" from "depict-an-object-of-class-X".

Method

A single fixed reference image of a neutral matte rounded grey object on a neutral grey background is generated once at 14 inference steps with guidance_scale=4.0. The reference is constructed to expose minimal scale cues (no other objects in frame, neutral lighting, no horizon). All subsequent probe passes condition on this same image.

Twenty-five prompt pairs are constructed. The two members of a pair share the same prompt structure and the same object word ("object"); only the magnitude descriptor differs, ranging from millimeter-and-below ("a 1-millimeter object", "0.1 mm grain", "smaller than a grain of sand") to kilometer-and-above ("a 1-kilometer object", "100-m mountain", "larger than a skyscraper"). The two members of each pair span at least three orders of magnitude in claimed physical scale.
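A minimal sketch of the pair construction described above. The templates and descriptor wordings here are illustrative stand-ins, not the exact 25 pairs used by the probe:

```python
# Hypothetical descriptors; the probe uses a range of magnitude wordings.
SMALL = "a 1-millimeter object"
LARGE = "a 1-kilometer object"

def make_pairs(n_pairs=25):
    # Each pair shares its template and the word "object"; only the
    # magnitude descriptor differs, spanning >= 3 orders of magnitude.
    templates = [
        "a photo of {desc} on a neutral grey background",
        "{desc}, studio lighting, no scale references in frame",
    ]
    pairs = []
    for i in range(n_pairs):
        t = templates[i % len(templates)]
        pairs.append((t.format(desc=SMALL), t.format(desc=LARGE)))
    return pairs

pairs = make_pairs()
```

Holding everything but the magnitude descriptor fixed is what lets the later t-test attribute per-head differences to the scale claim alone.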

For each prompt the pipeline is run image-conditioned on the fixed reference at one inference step with guidance_scale=1.0. A forward pre-hook on every transformer attention output projection (5 joint MMDiT blocks + 20 single blocks, 16,320 heads total) captures the per-head RMS magnitude of the input activation. The per-head paired t-statistic is computed across the N = 25 pairs as t = mean(d) / (std(d) / sqrt(N)), where d is the per-pair difference (large − small). The empirical null is constructed from 200 independent random sign-flips of the pair-member labels (relabelling small↔large within each pair).

By design, the input image is identical across both conditions and the named object word is identical. Differences in per-head response can only be attributed to the textual scale claim.
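The activation capture described in the Method can be sketched with a toy projection layer. The head count and dimensions here are illustrative, not the real FLUX.2 Klein layout, and the hook name is hypothetical:

```python
import torch
import torch.nn as nn

N_HEADS = 8    # illustrative; the real probe covers 16,320 heads
HEAD_DIM = 16

rms_log = {}

def make_hook(name):
    # Forward *pre*-hook on the attention output projection: its input is
    # the concatenated per-head attention output, so it can be split into
    # heads and each head's RMS magnitude recorded.
    def hook(module, args):
        x = args[0]                                   # (batch, tokens, heads*dim)
        x = x.reshape(*x.shape[:-1], N_HEADS, HEAD_DIM)
        rms_log[name] = x.pow(2).mean(dim=(0, 1, 3)).sqrt()  # (heads,)
    return hook

proj = nn.Linear(N_HEADS * HEAD_DIM, N_HEADS * HEAD_DIM)
proj.register_forward_pre_hook(make_hook("block0.attn.out_proj"))
_ = proj(torch.randn(2, 4, N_HEADS * HEAD_DIM))
```

A pre-hook (rather than a post-hook) sees the projection's input, i.e. the raw per-head attention output before heads are mixed by the projection weights.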

Results

| metric | observed | null mean | null p99 | observed / null p99 |
|---|---|---|---|---|
| heads with \|t\| > 3 | 7 044 (43.2%) | 94 | 818 | 8.6× |
| heads with \|t\| > 5 | 2 549 (15.6%) | 1 | 5 | 510× |
| max \|t\| | 14.89 | | | |

Approximately 43% of all attention heads respond differentially to the small-scale vs large-scale interpretation of the same neutral reference object. The 2 549 heads with |t| > 5, against an empirical 99th-percentile null of 5 heads, yield an observed/null ratio of 510×. The strongest selective heads cluster in single transformer block 2, an early-network region also flagged by the perspective-taking probe (otherview-plantain), consistent with text-driven cross-modal modulation occurring early in the network.
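The block-level clustering claim reduces to counting selective heads per block. A toy version, assuming a uniform heads-per-block layout (made up here, not the actual FLUX.2 Klein layout) and an injected cluster in block 2:

```python
import numpy as np

N_BLOCKS, HEADS_PER_BLOCK = 25, 24   # hypothetical layout
block_of = np.repeat(np.arange(N_BLOCKS), HEADS_PER_BLOCK)

# Toy |t| map: background noise plus a strong cluster in block 2.
t_abs = np.abs(np.random.default_rng(0).normal(0, 1, N_BLOCKS * HEADS_PER_BLOCK))
t_abs[block_of == 2] += 8.0

strong = t_abs > 5
per_block = np.bincount(block_of[strong], minlength=N_BLOCKS)
top_block = int(per_block.argmax())   # block with the most selective heads
```

With the real per-head t map in place of `t_abs`, `per_block` gives the distribution of selective heads across blocks that the clustering claim refers to.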

License

Apache 2.0.

References

Gabeur et al. (2026). Image Generators are Generalist Vision Learners. arXiv:2604.20329.
