otherview-plantain

A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents perspective-taking (rendering a depicted scene from another agent's vantage) as a distinct operation from mere mention of that agent in the scene. Companion to Image Generators are Generalist Vision Learners (Gabeur et al., 2026; arXiv:2604.20329).

Hypothesis

If generative pretraining produces structure beyond the perceptual, the model may carry a representational axis corresponding to "shift the rendering frame to agent X's viewpoint" that is separable from "X is depicted in the scene". A clean test holds both the input scene and the named agent constant and varies only the syntactic role of the agent.

Method

A single fixed reference kitchen image is generated once from the prompt "A photographic kitchen scene … wooden countertops, refrigerator, sink under a window, tiled floor, soft daylight, no people present" at 14 inference steps with guidance_scale=4.0. All subsequent probe passes condition on this same image.
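
A minimal sketch of the reference-image pass, assuming the checkpoint is loadable through a diffusers-style DiffusionPipeline; the checkpoint id, seed, and save path are placeholders, and the elided portion of the prompt is left as it appears above:

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder checkpoint id; the probe assumes FLUX.2 Klein 4B is loadable
# through a diffusers-style pipeline interface.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-Klein-4B",  # placeholder id
    torch_dtype=torch.bfloat16,
).to("cuda")

REFERENCE_PROMPT = (
    "A photographic kitchen scene … wooden countertops, refrigerator, "
    "sink under a window, tiled floor, soft daylight, no people present"
)  # the "…" is elided in the source and kept as-is

# Generated once; every subsequent probe pass conditions on this image.
reference = pipe(
    prompt=REFERENCE_PROMPT,
    num_inference_steps=14,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(0),  # seed is an assumption
).images[0]
reference.save("reference_kitchen.png")
```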

30 named creatures span a graded perceptual-similarity axis (adult human → tall person → child → toddler → infant → small dog → cat → parrot → snake → spider → ant → fly → moth → bee with ultraviolet vision → echolocating bat → shark with electroreception → sea turtle with magnetoreception → creature with thermal-infrared eyes → alien with non-visual senses). Each creature defines a paired prompt: "Render this kitchen from the viewpoint of [creature]" (perspective-taking) vs "Render this kitchen with [creature] in it" (mention only). Both prompts run image-conditioned on the fixed reference at one inference step with guidance_scale=1.0.
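
A sketch of the paired-prompt construction and the one-step image-conditioned passes. The creature list is abbreviated, `reference` is the image from the sketch above, and the image-to-image pipeline class is an assumption (AutoPipelineForImage2Image stands in for whatever image-conditioned entry point the probe actually uses):

```python
import torch
from diffusers import AutoPipelineForImage2Image

# Image-conditioned variant of the pipeline; checkpoint id is a placeholder.
img2img_pipe = AutoPipelineForImage2Image.from_pretrained(
    "black-forest-labs/FLUX.2-Klein-4B", torch_dtype=torch.bfloat16
).to("cuda")

CREATURES = [
    "adult human", "tall person", "child", "toddler", "infant",
    "small dog", "cat", "parrot", "snake", "spider", "ant", "fly", "moth",
    # ... abbreviated; the full probe uses 30 creatures, ending with
    # "alien with non-visual senses"
]

def paired_prompts(creature: str) -> tuple[str, str]:
    """Perspective-taking vs mention-only prompt for the same creature."""
    viewpoint = f"Render this kitchen from the viewpoint of {creature}"
    mention = f"Render this kitchen with {creature} in it"
    return viewpoint, mention

# Both members of each pair condition on the same reference image and run
# for a single denoising step at guidance_scale=1.0.
for creature in CREATURES:
    for label, prompt in zip(("viewpoint", "mention"), paired_prompts(creature)):
        _ = img2img_pipe(
            prompt=prompt,
            image=reference,
            num_inference_steps=1,
            guidance_scale=1.0,
        )
        # per-head RMS values are collected by the hooks described below
```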

A forward pre-hook on every transformer attention output projection (5 joint MMDiT blocks + 20 single blocks, 16 320 heads total) captures the per-head RMS magnitude of the projection's input activation. The per-head paired t-statistic across the 30 pairs is t = mean(d) / (std(d) / sqrt(N)), where d is the per-pair viewpoint − mention difference in RMS and N = 30. The empirical null is constructed by 200 independent random sign-flips of the pair-member labels (relabelling viewpoint↔mention within each pair), recomputing the per-head t-distribution under each shuffle.
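
A sketch of the capture and statistics, assuming the output projections are nn.Linear modules reachable by name; the exact module paths and per-projection head count depend on the FLUX.2 implementation:

```python
import torch

NUM_PAIRS, NUM_SHUFFLES = 30, 200

def make_rms_prehook(store: dict, key: str, num_heads: int):
    """Forward pre-hook for an attention output projection: records the
    per-head RMS magnitude of the projection's input activation."""
    def hook(module, args):
        x = args[0]                                   # (batch, tokens, heads * head_dim)
        b, t, d = x.shape
        xh = x.view(b, t, num_heads, d // num_heads)  # split channels into heads
        rms = xh.pow(2).mean(dim=(0, 1, 3)).sqrt()    # one value per head
        store.setdefault(key, []).append(rms.detach().float().cpu())
    return hook

# e.g. proj.register_forward_pre_hook(make_rms_prehook(store, name, num_heads))
# for every attention output projection in the 5 joint and 20 single blocks.

def paired_t(diffs: torch.Tensor) -> torch.Tensor:
    """Per-head paired t: mean(d) / (std(d) / sqrt(N)) over the pair axis."""
    n = diffs.shape[0]
    return diffs.mean(0) / (diffs.std(dim=0) / n ** 0.5)  # sample std of the differences

def sign_flip_null(diffs: torch.Tensor, num_shuffles: int = NUM_SHUFFLES,
                   seed: int = 0) -> torch.Tensor:
    """Empirical null: relabel viewpoint/mention within each pair (a random
    sign flip of that pair's difference) and recompute the per-head t."""
    g = torch.Generator().manual_seed(seed)
    null_ts = []
    for _ in range(num_shuffles):
        signs = torch.randint(0, 2, (diffs.shape[0], 1), generator=g) * 2 - 1
        null_ts.append(paired_t(diffs * signs))
    return torch.stack(null_ts)                       # (num_shuffles, total_heads)

# diffs has shape (NUM_PAIRS, total_heads): RMS(viewpoint) - RMS(mention)
# observed_t = paired_t(diffs); null_t = sign_flip_null(diffs)
```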

By design, differences in per-head response across the two passes cannot be attributed to scene variation (input image is identical), agent variation (named creature is identical), or syntactic length (matched four-word substitution). Only the syntactic role of the creature changes between passes.

Results

| metric | observed | null mean | null p99 | observed / null p99 |
|---|---|---|---|---|
| heads with \|t\| > 3 | 12 091 (74%) | 61 | 654 | 18× |
| heads with \|t\| > 5 | 9 574 (59%) | 0 | 3 | 3 191× |
| max \|t\| | 31.26 | – | – | – |

74% of all 16 320 attention heads respond differentially to "render this kitchen from X's POV" versus "render this kitchen with X in it", with X the same creature on both passes and the input image identical. The 9 574 heads with |t| > 5, set against an empirical 99th-percentile null of 3, give an observed/null ratio of approximately 3 191×.
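
The table values follow from the observed and null t matrices in a few lines. This sketch assumes that "null mean" and "null p99" are the mean and 99th percentile of the per-shuffle exceedance counts; the helper name is hypothetical:

```python
import torch

def exceedance_stats(observed_t: torch.Tensor, null_t: torch.Tensor, thresh: float):
    """observed_t: (total_heads,); null_t: (num_shuffles, total_heads).
    Returns observed count, null mean count, null 99th-percentile count."""
    observed = int((observed_t.abs() > thresh).sum())
    null_counts = (null_t.abs() > thresh).sum(dim=1).float()
    return observed, null_counts.mean().item(), torch.quantile(null_counts, 0.99).item()

# e.g. exceedance_stats(paired_t(diffs), sign_flip_null(diffs), 5.0)
```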

The strongest selective heads cluster in joint MMDiT block 2, consistent with text-driven cross-modal modulation occurring early in the network where text-image cross-attention is dense.

License

Apache 2.0.

References

Gabeur et al. (2026). Image Generators are Generalist Vision Learners. arXiv:2604.20329.