---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
- interpretability
- mechanistic-interpretability
- probing
- perspective-taking
- flux2
- vision-banana
- arxiv:2604.20329
---

# otherview-plantain

A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents perspective-taking — rendering a depicted scene from another agent's vantage — as a distinct operation from mere mention of that agent in the scene.

Companion to *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)).

## Hypothesis

If generative pretraining produces structure beyond the perceptual, the model may carry a representational axis corresponding to "shift the rendering frame to agent X's viewpoint" that is separable from "X is depicted in the scene". A clean test holds both the input scene and the named agent constant and varies only the syntactic role of the agent.

## Method

A single fixed reference kitchen image is generated once from the prompt "A photographic kitchen scene … wooden countertops, refrigerator, sink under a window, tiled floor, soft daylight, no people present" at 14 inference steps with `guidance_scale=4.0`. All subsequent probe passes condition on this same image.

30 named creatures span a graded perceptual-similarity axis (including adult human → tall person → child → toddler → infant → small dog → cat → parrot → snake → spider → ant → fly → moth → bee with ultraviolet vision → echolocating bat → shark with electroreception → sea turtle with magnetoreception → creature with thermal-infrared eyes → alien with non-visual senses). Each creature defines a paired prompt: "Render this kitchen from the viewpoint of [creature]" (perspective-taking) vs "Render this kitchen with [creature] in it" (mention only). Both prompts run image-conditioned on the fixed reference at one inference step with `guidance_scale=1.0`.

A forward pre-hook on every transformer attention output projection (5 joint MMDiT blocks + 20 single blocks, 16 320 heads total) captures the per-head RMS magnitude of the input activation. The per-head paired t-statistic is computed across the 30 pairs as `t = mean(viewpoint − mention) / (std / sqrt(N))`, where `std` is the standard deviation of the paired differences and `N = 30`. The empirical null is constructed by 200 independent random sign-flips of pair-member labels (relabelling viewpoint↔mention within each pair) and recomputing the per-head t-distribution under each shuffle.

By design, differences in per-head response across the two passes cannot be attributed to scene variation (the input image is identical), agent variation (the named creature is identical), or syntactic length (matched four-word substitution). Only the syntactic role of the creature changes between passes.
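The reference image and the paired probe passes can be reproduced along the following lines. This is a minimal sketch, assuming the checkpoint loads through diffusers' `AutoPipelineForText2Image` / `AutoPipelineForImage2Image` classes and accepts the standard `image=` conditioning argument; the exact FLUX.2 Klein pipeline class and conditioning interface may differ, and the creature list shown is an illustrative subset of the 30.

```python
# Minimal sketch of the fixed reference image and the paired probe passes.
# Assumptions: the checkpoint loads via diffusers' AutoPipeline classes and
# image conditioning uses the standard `image=` argument; the real FLUX.2
# Klein API may differ.
import torch
from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image

MODEL_ID = "black-forest-labs/FLUX.2-klein-base-4B"

# 1) Single fixed reference kitchen image: 14 steps, guidance_scale=4.0.
#    (The elided middle of the prompt is as written in the card.)
t2i = AutoPipelineForText2Image.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16).to("cuda")
reference = t2i(
    prompt="A photographic kitchen scene … wooden countertops, refrigerator, "
           "sink under a window, tiled floor, soft daylight, no people present",
    num_inference_steps=14,
    guidance_scale=4.0,
).images[0]

# 2) Paired prompts: same creature, same scene, only the syntactic role changes.
creatures = ["an adult human", "a small dog", "an echolocating bat"]  # illustrative subset
pairs = [
    (f"Render this kitchen from the viewpoint of {c}",  # perspective-taking
     f"Render this kitchen with {c} in it")             # mention only
    for c in creatures
]

# 3) Each probe pass conditions on the same reference image: 1 step, guidance_scale=1.0.
i2i = AutoPipelineForImage2Image.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16).to("cuda")

def probe_pass(prompt):
    return i2i(prompt=prompt, image=reference, num_inference_steps=1, guidance_scale=1.0)
```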
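Activation capture itself only needs standard PyTorch forward pre-hooks. The sketch below assumes the attention output projections can be located by the diffusers-style module name `to_out.0` and takes the per-projection head count as a parameter; both are placeholders rather than the model's actual layout.

```python
# Sketch of per-head RMS capture with forward pre-hooks on the attention
# output projections. The `to_out.0` name filter and the default head count
# are assumptions about the transformer layout, not the model's actual config.
import torch

def rms_per_head(x: torch.Tensor, num_heads: int) -> torch.Tensor:
    """RMS magnitude of the out-projection input, reduced to one value per head."""
    b, seq, dim = x.shape
    x = x.view(b, seq, num_heads, dim // num_heads)
    return x.pow(2).mean(dim=(0, 1, 3)).sqrt()  # shape: (num_heads,)

def attach_probes(transformer: torch.nn.Module, records: dict, num_heads: int = 24):
    """Register a pre-hook on every module matching the assumed out-projection
    name; per-head RMS values accumulate in `records`, keyed by module name."""
    handles = []
    for name, module in transformer.named_modules():
        if name.endswith("to_out.0"):  # assumed out-projection name (diffusers Attention style)
            def hook(mod, args, key=name):
                records.setdefault(key, []).append(
                    rms_per_head(args[0].detach(), num_heads).float().cpu()
                )
            handles.append(module.register_forward_pre_hook(hook))
    return handles  # call handle.remove() on each to detach after the probe passes
```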
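The per-head statistics reduce to a few lines of NumPy. In the sketch below, `view` and `mention` are random placeholder arrays standing in for the captured 30 × 16 320 per-head RMS matrices (one row per prompt pair).

```python
# Sketch of the per-head paired t-statistic and the 200-shuffle sign-flip null.
# `view` and `mention` are random placeholders for the captured per-head RMS
# matrices (30 pairs x 16320 heads).
import numpy as np

rng = np.random.default_rng(0)
view = rng.normal(size=(30, 16320))     # placeholder: RMS under "viewpoint of X" prompts
mention = rng.normal(size=(30, 16320))  # placeholder: RMS under "with X in it" prompts

def paired_t(diff: np.ndarray) -> np.ndarray:
    """Per-head paired t over N pairs: mean(diff) / (std(diff) / sqrt(N))."""
    n = diff.shape[0]
    return diff.mean(axis=0) / (diff.std(axis=0, ddof=1) / np.sqrt(n))

diff = view - mention
t_obs = paired_t(diff)

# Empirical null: flip the viewpoint/mention labels within each pair at random
# (equivalent to negating that pair's difference) and recompute all heads.
null_counts = []
for _ in range(200):
    signs = rng.choice([-1.0, 1.0], size=(diff.shape[0], 1))
    null_counts.append(int((np.abs(paired_t(diff * signs)) > 5).sum()))

print("heads with |t|>5, observed:", int((np.abs(t_obs) > 5).sum()))
print("heads with |t|>5, null p99:", np.percentile(null_counts, 99))
```

Sign-flipping within pairs is the standard permutation scheme for a paired design: under the null hypothesis that prompt role does not matter, each pair's difference is symmetric about zero, so negating pairs at random preserves exchangeability.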
## Results

| metric | observed | null mean | null p99 | observed / null p99 |
|---|---:|---:|---:|---:|
| heads with \|t\|>3 | 12 091 (74%) | 61 | 654 | 18× |
| heads with \|t\|>5 | 9 574 (59%) | 0 | 3 | **3 191×** |
| max \|t\| | 31.26 | — | — | — |

74% of all 16 320 attention heads respond differentially to "render this kitchen from X's POV" vs "render this kitchen with X in it", with X being the same creature on both passes and the input image identical. The 9 574 heads with \|t\|>5, against an empirical 99th-percentile null of 3, yield an observed/null ratio of approximately 3 191×.

The strongest selective heads cluster in joint MMDiT block 2, consistent with text-driven cross-modal modulation occurring early in the network, where text-image cross-attention is dense.

## License

Apache 2.0.

## References

- Gabeur, V., Long, S., Peng, S., et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Black Forest Labs. *FLUX.2 Klein.* https://bfl.ai/models/flux-2-klein (2025).