measure-plantain

A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents physical scale (interpreting the same depicted object as either millimeter-scale or kilometer-scale) as a per-head axis separable from object identity. Companion to "Image Generators are Generalist Vision Learners" (Gabeur et al., 2026; arXiv:2604.20329).

Hypothesis

If generative pretraining recovers physical-world structure beyond what is necessary for the pixel-prediction objective, the model may carry a representational axis corresponding to the physical scale of depicted objects. A clean test holds the input image constant and varies only the textual scale claim associated with that image, isolating "interpret-this-object-at-scale-X" from "depict-an-object-of-class-X".

Method

A single fixed reference image of a neutral matte rounded grey object on a neutral grey background is generated once at 14 inference steps with guidance_scale=4.0. The reference is constructed to expose minimal scale cues (no other objects in frame, neutral lighting, no horizon). All subsequent probe passes condition on this same image.

Twenty-five prompt pairs are constructed. The two members of a pair share the same prompt structure and the same object word ("object"); only the magnitude descriptor differs, ranging from millimeter-and-below ("a 1-millimeter object", "0.1 mm grain", "smaller than a grain of sand") to kilometer-and-above ("a 1-kilometer object", "100-m mountain", "larger than a skyscraper"). The two members of each pair span at least three orders of magnitude in claimed physical scale.
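A minimal sketch of the pair construction described above. The templates and descriptor wordings here are illustrative stand-ins, not the exact 25 pairs used by the probe:

```python
# Hypothetical descriptors; the probe uses a range of magnitude wordings.
SMALL = "a 1-millimeter object"
LARGE = "a 1-kilometer object"

def make_pairs(n_pairs=25):
    # Each pair shares its template and the word "object"; only the
    # magnitude descriptor differs, spanning >= 3 orders of magnitude.
    templates = [
        "a photo of {desc} on a neutral grey background",
        "{desc}, studio lighting, no scale references in frame",
    ]
    pairs = []
    for i in range(n_pairs):
        t = templates[i % len(templates)]
        pairs.append((t.format(desc=SMALL), t.format(desc=LARGE)))
    return pairs

pairs = make_pairs()
```

Holding everything but the magnitude descriptor fixed is what lets the later t-test attribute per-head differences to the scale claim alone.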

For each prompt the pipeline is run image-conditioned on the fixed reference at one inference step with guidance_scale=1.0. A forward pre-hook on every transformer attention output projection (5 joint MMDiT blocks + 20 single blocks, 16,320 heads total) captures the per-head RMS magnitude of the input activation. The per-head paired t-statistic is computed across the N = 25 pairs as t = mean(d) / (std(d) / sqrt(N)), where d is the per-pair difference (large − small). The empirical null is constructed from 200 independent random sign-flips of the pair-member labels (relabelling small↔large within each pair).

By design, the input image is identical across both conditions and the named object word is identical. Differences in per-head response can only be attributed to the textual scale claim.
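The activation capture described in the Method can be sketched with a toy projection layer. The head count and dimensions here are illustrative, not the real FLUX.2 Klein layout, and the hook name is hypothetical:

```python
import torch
import torch.nn as nn

N_HEADS = 8    # illustrative; the real probe covers 16,320 heads
HEAD_DIM = 16

rms_log = {}

def make_hook(name):
    # Forward *pre*-hook on the attention output projection: its input is
    # the concatenated per-head attention output, so it can be split into
    # heads and each head's RMS magnitude recorded.
    def hook(module, args):
        x = args[0]                                   # (batch, tokens, heads*dim)
        x = x.reshape(*x.shape[:-1], N_HEADS, HEAD_DIM)
        rms_log[name] = x.pow(2).mean(dim=(0, 1, 3)).sqrt()  # (heads,)
    return hook

proj = nn.Linear(N_HEADS * HEAD_DIM, N_HEADS * HEAD_DIM)
proj.register_forward_pre_hook(make_hook("block0.attn.out_proj"))
_ = proj(torch.randn(2, 4, N_HEADS * HEAD_DIM))
```

A pre-hook (rather than a post-hook) sees the projection's input, i.e. the raw per-head attention output before heads are mixed by the projection weights.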

Results

| metric | observed | null mean | null p99 | observed / null p99 |
|---|---|---|---|---|
| heads with \|t\| > 3 | 7 044 (43.2%) | 94 | 818 | 8.6× |
| heads with \|t\| > 5 | 2 549 (15.6%) | 1 | 5 | 510× |
| max \|t\| | 14.89 | | | |

Approximately 43% of all attention heads respond differentially to the small-scale vs large-scale interpretation of the same neutral reference object. The 2 549 heads with |t| > 5, against an empirical 99th-percentile null of 5 heads, yield an observed/null ratio of 510×. The strongest selective heads cluster in single transformer block 2, an early-network region also flagged by the perspective-taking probe (otherview-plantain), consistent with text-driven cross-modal modulation occurring early in the network.
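The block-level clustering claim reduces to counting selective heads per block. A toy version, assuming a uniform heads-per-block layout (made up here, not the actual FLUX.2 Klein layout) and an injected cluster in block 2:

```python
import numpy as np

N_BLOCKS, HEADS_PER_BLOCK = 25, 24   # hypothetical layout
block_of = np.repeat(np.arange(N_BLOCKS), HEADS_PER_BLOCK)

# Toy |t| map: background noise plus a strong cluster in block 2.
t_abs = np.abs(np.random.default_rng(0).normal(0, 1, N_BLOCKS * HEADS_PER_BLOCK))
t_abs[block_of == 2] += 8.0

strong = t_abs > 5
per_block = np.bincount(block_of[strong], minlength=N_BLOCKS)
top_block = int(per_block.argmax())   # block with the most selective heads
```

With the real per-head t map in place of `t_abs`, `per_block` gives the distribution of selective heads across blocks that the clustering claim refers to.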

License

Apache 2.0.

References

Gabeur et al. (2026). Image Generators are Generalist Vision Learners. arXiv:2604.20329.
