otherview-plantain

A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents perspective-taking (rendering a depicted scene from another agent's vantage) as a distinct operation from mere mention of that agent in the scene. Companion to Image Generators are Generalist Vision Learners (Gabeur et al., 2026; arXiv:2604.20329).

Hypothesis

If generative pretraining produces structure beyond the perceptual, the model may carry a representational axis corresponding to "shift the rendering frame to agent X's viewpoint" that is separable from "X is depicted in the scene". A clean test holds both the input scene and the named agent constant and varies only the syntactic role of the agent.

Method

A single fixed reference kitchen image is generated once from the prompt "A photographic kitchen scene … wooden countertops, refrigerator, sink under a window, tiled floor, soft daylight, no people present" at 14 inference steps with guidance_scale=4.0. All subsequent probe passes condition on this same image.
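
A minimal sketch of the reference-image pass, assuming the checkpoint is loadable through a diffusers-style DiffusionPipeline; the checkpoint id, seed, and save path are placeholders, and the elided portion of the prompt is left as it appears above:

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder checkpoint id; the probe assumes FLUX.2 Klein 4B is loadable
# through a diffusers-style pipeline interface.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-Klein-4B",  # placeholder id
    torch_dtype=torch.bfloat16,
).to("cuda")

REFERENCE_PROMPT = (
    "A photographic kitchen scene … wooden countertops, refrigerator, "
    "sink under a window, tiled floor, soft daylight, no people present"
)  # the "…" is elided in the source and kept as-is

# Generated once; every subsequent probe pass conditions on this image.
reference = pipe(
    prompt=REFERENCE_PROMPT,
    num_inference_steps=14,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(0),  # seed is an assumption
).images[0]
reference.save("reference_kitchen.png")
```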

30 named creatures span a graded perceptual-similarity axis (adult human → tall person → child → toddler → infant → small dog → cat → parrot → snake → spider → ant → fly → moth → bee with ultraviolet vision → echolocating bat → shark with electroreception → sea turtle with magnetoreception → creature with thermal-infrared eyes → alien with non-visual senses). Each creature defines a paired prompt: "Render this kitchen from the viewpoint of [creature]" (perspective-taking) vs "Render this kitchen with [creature] in it" (mention only). Both prompts run image-conditioned on the fixed reference at one inference step with guidance_scale=1.0.
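
A sketch of the paired-prompt construction and the one-step image-conditioned passes. The creature list is abbreviated, `reference` is the image from the sketch above, and the image-to-image pipeline class is an assumption (AutoPipelineForImage2Image stands in for whatever image-conditioned entry point the probe actually uses):

```python
import torch
from diffusers import AutoPipelineForImage2Image

# Image-conditioned variant of the pipeline; checkpoint id is a placeholder.
img2img_pipe = AutoPipelineForImage2Image.from_pretrained(
    "black-forest-labs/FLUX.2-Klein-4B", torch_dtype=torch.bfloat16
).to("cuda")

CREATURES = [
    "adult human", "tall person", "child", "toddler", "infant",
    "small dog", "cat", "parrot", "snake", "spider", "ant", "fly", "moth",
    # ... abbreviated; the full probe uses 30 creatures, ending with
    # "alien with non-visual senses"
]

def paired_prompts(creature: str) -> tuple[str, str]:
    """Perspective-taking vs mention-only prompt for the same creature."""
    viewpoint = f"Render this kitchen from the viewpoint of {creature}"
    mention = f"Render this kitchen with {creature} in it"
    return viewpoint, mention

# Both members of each pair condition on the same reference image and run
# for a single denoising step at guidance_scale=1.0.
for creature in CREATURES:
    for label, prompt in zip(("viewpoint", "mention"), paired_prompts(creature)):
        _ = img2img_pipe(
            prompt=prompt,
            image=reference,
            num_inference_steps=1,
            guidance_scale=1.0,
        )
        # per-head RMS values are collected by the hooks described below
```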

A forward pre-hook on every transformer attention output projection (5 joint MMDiT blocks + 20 single blocks, 16 320 heads total) captures the per-head RMS magnitude of the projection's input activation. The per-head paired t-statistic across the 30 pairs is t = mean(d) / (std(d) / sqrt(N)), where d is the per-pair viewpoint − mention difference in RMS and N = 30. The empirical null is constructed by 200 independent random sign-flips of the pair-member labels (relabelling viewpoint↔mention within each pair), recomputing the per-head t-distribution under each shuffle.
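
A sketch of the capture and statistics, assuming the output projections are nn.Linear modules reachable by name; the exact module paths and per-projection head count depend on the FLUX.2 implementation:

```python
import torch

NUM_PAIRS, NUM_SHUFFLES = 30, 200

def make_rms_prehook(store: dict, key: str, num_heads: int):
    """Forward pre-hook for an attention output projection: records the
    per-head RMS magnitude of the projection's input activation."""
    def hook(module, args):
        x = args[0]                                   # (batch, tokens, heads * head_dim)
        b, t, d = x.shape
        xh = x.view(b, t, num_heads, d // num_heads)  # split channels into heads
        rms = xh.pow(2).mean(dim=(0, 1, 3)).sqrt()    # one value per head
        store.setdefault(key, []).append(rms.detach().float().cpu())
    return hook

# e.g. proj.register_forward_pre_hook(make_rms_prehook(store, name, num_heads))
# for every attention output projection in the 5 joint and 20 single blocks.

def paired_t(diffs: torch.Tensor) -> torch.Tensor:
    """Per-head paired t: mean(d) / (std(d) / sqrt(N)) over the pair axis."""
    n = diffs.shape[0]
    return diffs.mean(0) / (diffs.std(dim=0) / n ** 0.5)  # sample std of the differences

def sign_flip_null(diffs: torch.Tensor, num_shuffles: int = NUM_SHUFFLES,
                   seed: int = 0) -> torch.Tensor:
    """Empirical null: relabel viewpoint/mention within each pair (a random
    sign flip of that pair's difference) and recompute the per-head t."""
    g = torch.Generator().manual_seed(seed)
    null_ts = []
    for _ in range(num_shuffles):
        signs = torch.randint(0, 2, (diffs.shape[0], 1), generator=g) * 2 - 1
        null_ts.append(paired_t(diffs * signs))
    return torch.stack(null_ts)                       # (num_shuffles, total_heads)

# diffs has shape (NUM_PAIRS, total_heads): RMS(viewpoint) - RMS(mention)
# observed_t = paired_t(diffs); null_t = sign_flip_null(diffs)
```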

By design, differences in per-head response across the two passes cannot be attributed to scene variation (input image is identical), agent variation (named creature is identical), or syntactic length (matched four-word substitution). Only the syntactic role of the creature changes between passes.

Results

| metric | observed | null mean | null p99 | observed / null p99 |
|---|---|---|---|---|
| heads with \|t\| > 3 | 12 091 (74%) | 61 | 654 | 18× |
| heads with \|t\| > 5 | 9 574 (59%) | 0 | 3 | 3 191× |
| max \|t\| | 31.26 | – | – | – |

74% of all 16 320 attention heads respond differentially to "render this kitchen from X's POV" versus "render this kitchen with X in it", with X the same creature on both passes and the input image identical. The 9 574 heads with |t| > 5, set against an empirical 99th-percentile null of 3, give an observed/null ratio of approximately 3 191×.
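
The table values follow from the observed and null t matrices in a few lines. This sketch assumes that "null mean" and "null p99" are the mean and 99th percentile of the per-shuffle exceedance counts; the helper name is hypothetical:

```python
import torch

def exceedance_stats(observed_t: torch.Tensor, null_t: torch.Tensor, thresh: float):
    """observed_t: (total_heads,); null_t: (num_shuffles, total_heads).
    Returns observed count, null mean count, null 99th-percentile count."""
    observed = int((observed_t.abs() > thresh).sum())
    null_counts = (null_t.abs() > thresh).sum(dim=1).float()
    return observed, null_counts.mean().item(), torch.quantile(null_counts, 0.99).item()

# e.g. exceedance_stats(paired_t(diffs), sign_flip_null(diffs), 5.0)
```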

The strongest selective heads cluster in joint MMDiT block 2, consistent with text-driven cross-modal modulation occurring early in the network where text-image cross-attention is dense.

License

Apache 2.0.

References

Gabeur et al. (2026). Image Generators are Generalist Vision Learners. arXiv:2604.20329.