```python
import torch
from diffusers import DiffusionPipeline

# switch to "mps" for Apple devices
pipe = DiffusionPipeline.from_pretrained(
    "phanerozoic/otherview-plantain", dtype=torch.bfloat16, device_map="cuda"
)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]
```

otherview-plantain

A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents perspective-taking (rendering a depicted scene from another agent's vantage) as a distinct operation from the mere mention of that agent in the scene. Companion to Image Generators are Generalist Vision Learners (Gabeur et al., 2026; arXiv:2604.20329).
Hypothesis
If generative pretraining produces structure beyond the perceptual, the model may carry a representational axis corresponding to "shift the rendering frame to agent X's viewpoint" that is separable from "X is depicted in the scene". A clean test holds both the input scene and the named agent constant and varies only the syntactic role of the agent.
Method
A single fixed reference kitchen image is generated once from the prompt "A photographic kitchen scene … wooden countertops, refrigerator, sink under a window, tiled floor, soft daylight, no people present" at 14 inference steps with guidance_scale=4.0. All subsequent probe passes condition on this same image.
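A minimal sketch of this step, assuming the standard diffusers text-to-image call signature; the seed and output filename are illustrative, and the prompt is abbreviated exactly as above:

```python
import torch
from diffusers import DiffusionPipeline

# Load the base model once; "cuda" assumes a single-GPU setup.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B", dtype=torch.bfloat16, device_map="cuda"
)

# Fixed reference scene, generated once and reused for every probe pass.
# The prompt is abbreviated here exactly as in the text above.
reference_prompt = (
    "A photographic kitchen scene ... wooden countertops, refrigerator, "
    "sink under a window, tiled floor, soft daylight, no people present"
)
reference_image = pipe(
    reference_prompt,
    num_inference_steps=14,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(0),  # fixed seed is an assumption
).images[0]
reference_image.save("reference_kitchen.png")
```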
30 named creatures span a graded perceptual-similarity axis (adult human → tall person → child → toddler → infant → small dog → cat → parrot → snake → spider → ant → fly → moth → bee with ultraviolet vision → echolocating bat → shark with electroreception → sea turtle with magnetoreception → creature with thermal-infrared eyes → alien with non-visual senses). Each creature defines a paired prompt: "Render this kitchen from the viewpoint of [creature]" (perspective-taking) vs "Render this kitchen with [creature] in it" (mention only). Both prompts run image-conditioned on the fixed reference at one inference step with guidance_scale=1.0.
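A sketch of the paired probe passes. The `image=` conditioning argument is an assumption about FLUX.2 Klein's pipeline interface, and only a few of the 30 creature names are listed; the per-head statistics themselves are collected by the hooks described next:

```python
# A few of the 30 creature descriptions along the perceptual-similarity axis.
creatures = [
    "an adult human",
    "a small dog",
    "a bee with ultraviolet vision",
    "an alien with non-visual senses",
    # ... remaining creatures
]

def prompt_pair(creature: str) -> tuple[str, str]:
    viewpoint = f"Render this kitchen from the viewpoint of {creature}"
    mention = f"Render this kitchen with {creature} in it"
    return viewpoint, mention

# Both members of each pair condition on the same fixed reference image and
# run for a single inference step: the goal is activations, not image quality.
for creature in creatures:
    for prompt in prompt_pair(creature):
        _ = pipe(
            prompt,
            image=reference_image,   # image-conditioning kwarg is an assumption
            num_inference_steps=1,
            guidance_scale=1.0,
        )
```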
A forward pre-hook on every transformer attention output projection (5 joint MMDiT blocks + 20 single blocks, 16,320 heads total) captures the per-head RMS magnitude of the input activation. The per-head paired t-statistic is computed across the 30 pairs as t = mean(d) / (std(d) / sqrt(N)), where d is the per-pair difference (viewpoint - mention) and N = 30. The empirical null is constructed by 200 independent random sign-flips of the pair-member labels (relabelling viewpoint ↔ mention within each pair) and recomputing the per-head t-distribution under each shuffle.
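A sketch of the capture hook. The module-name filter (`to_out.0`) and the head dimension follow the usual diffusers attention layout and are assumptions about FLUX.2 Klein's transformer:

```python
HEAD_DIM = 128   # assumed head dimension; adjust to the actual model config
records = {}     # head label -> list of RMS values, one per forward pass

def make_prehook(label: str):
    def prehook(module, args):
        (hidden,) = args                           # input to the out-projection
        x = hidden.detach().float()                # (batch, tokens, heads * head_dim)
        b, t, d = x.shape
        x = x.view(b, t, d // HEAD_DIM, HEAD_DIM)
        rms = x.pow(2).mean(dim=(0, 1, 3)).sqrt()  # one RMS magnitude per head
        for h, v in enumerate(rms.tolist()):
            records.setdefault(f"{label}.head{h:03d}", []).append(v)
    return prehook

# Attach a pre-hook to every attention output projection in the transformer.
handles = [
    module.register_forward_pre_hook(make_prehook(name))
    for name, module in pipe.transformer.named_modules()
    if name.endswith("to_out.0")
]
# ... run all probe passes, then:
# for h in handles: h.remove()
```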
By design, differences in per-head response across the two passes cannot be attributed to scene variation (the input image is identical), agent variation (the named creature is identical), or prompt phrasing beyond the few template words surrounding the creature name. Only the syntactic role of the creature changes between passes.
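Given the per-head records collected above, a sketch of the paired t-statistics and the sign-flip null; the reshaping assumes one transformer forward per pipeline call, with the viewpoint and mention passes alternating in the order they were run:

```python
import numpy as np

labels = sorted(records)
vals = np.array([records[k] for k in labels])       # (heads, 2 * n_pairs)
viewpoint, mention = vals[:, 0::2], vals[:, 1::2]   # alternating pass order assumed
diffs = viewpoint - mention                         # (heads, n_pairs)

def paired_t(d: np.ndarray) -> np.ndarray:
    # Paired t-statistic per head: mean difference over its standard error.
    n = d.shape[1]
    return d.mean(axis=1) / (d.std(axis=1, ddof=1) / np.sqrt(n))

t_obs = paired_t(diffs)

# Empirical null: 200 independent random sign-flips of the pair labels,
# i.e. swapping viewpoint/mention within each pair.
rng = np.random.default_rng(0)
null_counts = {3: [], 5: []}
for _ in range(200):
    signs = rng.choice([-1.0, 1.0], size=diffs.shape[1])
    t_null = paired_t(diffs * signs)
    for thr in null_counts:
        null_counts[thr].append(int((np.abs(t_null) > thr).sum()))

for thr in (3, 5):
    print(f"|t|>{thr}: observed {(np.abs(t_obs) > thr).sum()}, "
          f"null p99 {np.percentile(null_counts[thr], 99):.0f}")
print("max |t|:", round(float(np.abs(t_obs).max()), 2))
```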
Results
| metric | observed | null mean | null p99 | observed / null p99 |
|---|---|---|---|---|
| heads with \|t\| > 3 | 12,091 (74%) | 61 | 654 | 18× |
| heads with \|t\| > 5 | 9,574 (59%) | 0 | 3 | 3,191× |
| max \|t\| | 31.26 | – | – | – |
74% of all 16,320 attention heads respond differentially to "render this kitchen from X's POV" vs "render this kitchen with X in it", even though X is the same creature on both passes and the input image is identical. The 9,574 heads with |t| > 5, against an empirical 99th-percentile null count of 3, give an observed/null ratio of approximately 3,191×.
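The headline ratios follow directly from the table; as a quick check:

```python
print(12_091 / 16_320)   # ~0.741 -> 74% of heads exceed |t| > 3
print(12_091 / 654)      # ~18.5  -> ~18x the null p99 count at |t| > 3
print(9_574 / 3)         # ~3191  -> ~3,191x the null p99 count at |t| > 5
```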
The most strongly selective heads cluster in joint MMDiT block 2, consistent with text-driven cross-modal modulation occurring early in the network, where text-image cross-attention is dense.
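One way to locate this cluster, continuing from the sketch above and assuming head labels of the form produced by the capture hook (diffusers-style module paths such as `transformer_blocks.2...` for the joint MMDiT blocks):

```python
from collections import defaultdict

# Group |t| by transformer block and rank blocks by their count of
# strongly selective heads.
by_block = defaultdict(list)
for label, t in zip(labels, np.abs(t_obs)):
    block = ".".join(label.split(".")[:2])   # e.g. "transformer_blocks.2"
    by_block[block].append(float(t))

ranked = sorted(by_block.items(), key=lambda kv: sum(v > 5 for v in kv[1]), reverse=True)
for block, ts in ranked[:5]:
    print(block, "heads with |t|>5:", sum(v > 5 for v in ts), "max |t|:", round(max(ts), 2))
```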
License
Apache 2.0.
References
- Gabeur, V., Long, S., Peng, S., et al. Image Generators are Generalist Vision Learners. arXiv:2604.20329 (2026).
- Black Forest Labs. FLUX.2 Klein. https://bfl.ai/models/flux-2-klein (2025).
Base model: black-forest-labs/FLUX.2-klein-base-4B