---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
- interpretability
- mechanistic-interpretability
- probing
- perspective-taking
- flux2
- vision-banana
- arxiv:2604.20329
---

# otherview-plantain

A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents perspective-taking, i.e. rendering a depicted scene from another agent's vantage point, as a distinct operation from mere mention of that agent in the scene. Companion to *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)).

## Hypothesis

If generative pretraining produces structure beyond the perceptual, the model may carry a representational axis corresponding to "shift the rendering frame to agent X's viewpoint" that is separable from "X is depicted in the scene". A clean test holds both the input scene and the named agent constant and varies only the syntactic role of the agent.

## Method

A single fixed reference kitchen image is generated once from the prompt "A photographic kitchen scene … wooden countertops, refrigerator, sink under a window, tiled floor, soft daylight, no people present" at 14 inference steps with `guidance_scale=4.0`. All subsequent probe passes condition on this same image.

30 named creatures span a graded perceptual-similarity axis (adult human → tall person → child → toddler → infant → small dog → cat → parrot → snake → spider → ant → fly → moth → bee with ultraviolet vision → echolocating bat → shark with electroreception → sea turtle with magnetoreception → creature with thermal-infrared eyes → alien with non-visual senses). Each creature defines a paired prompt: "Render this kitchen from the viewpoint of [creature]" (perspective-taking) vs. "Render this kitchen with [creature] in it" (mention only). Both prompts run image-conditioned on the fixed reference at a single inference step with `guidance_scale=1.0`.

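The paired-prompt construction can be sketched as follows; the creature list is abbreviated here (the probe uses 30), and the helper name is illustrative rather than taken from the release:

```python
# Sketch of the paired-prompt construction: same scene, same creature,
# only the creature's syntactic role differs between the two prompts.
# Creature list abbreviated from the card's 30 entries.
CREATURES = [
    "an adult human",
    "a small dog",
    "a bee with ultraviolet vision",
    "an alien with non-visual senses",
]

def prompt_pair(creature: str) -> tuple[str, str]:
    """Return (perspective-taking prompt, mention-only prompt) for one creature."""
    viewpoint = f"Render this kitchen from the viewpoint of {creature}"
    mention = f"Render this kitchen with {creature} in it"
    return viewpoint, mention

pairs = [prompt_pair(c) for c in CREATURES]
```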
A forward pre-hook on every transformer attention output projection (5 joint MMDiT blocks + 20 single blocks, 16 320 heads total) captures per-head RMS magnitude of the input activation. The per-head paired t-statistic is computed across the 30 pairs as `t = mean(viewpoint − mention) / (std / sqrt(N))`. The empirical null is constructed by 200 independent random sign-flips of pair-member labels (relabelling viewpoint ↔ mention within each pair) and recomputing the per-head t-distribution under each shuffle.

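A minimal sketch of this machinery, assuming PyTorch-style forward pre-hooks: the dummy projection, head counts, and random measurements below stand in for the real FLUX.2 modules and data, and are not from the release:

```python
import numpy as np
import torch
import torch.nn as nn

# --- per-head RMS capture via a forward pre-hook on an out-projection ---
# A dummy Linear stands in for a FLUX.2 attention output projection;
# the head count and head dimension are illustrative, not the model's.
NUM_HEADS, HEAD_DIM = 4, 8
proj = nn.Linear(NUM_HEADS * HEAD_DIM, NUM_HEADS * HEAD_DIM)
captured = []

def rms_pre_hook(module, inputs):
    x = inputs[0]                                  # (tokens, heads * head_dim)
    per_head = x.reshape(x.shape[0], NUM_HEADS, HEAD_DIM)
    captured.append(per_head.pow(2).mean(dim=(0, 2)).sqrt())  # one RMS per head

handle = proj.register_forward_pre_hook(rms_pre_hook)
proj(torch.randn(16, NUM_HEADS * HEAD_DIM))
handle.remove()

# --- paired t-statistic and sign-flip null over N prompt pairs ---
def paired_t(diff):
    """diff: (pairs, heads) array of viewpoint - mention RMS values."""
    n = diff.shape[0]
    return diff.mean(axis=0) / (diff.std(axis=0, ddof=1) / np.sqrt(n))

rng = np.random.default_rng(0)
N_PAIRS, N_HEADS = 30, 64                          # 16 320 heads in the real probe
diff = rng.normal(size=(N_PAIRS, N_HEADS))         # placeholder measurements
t_obs = paired_t(diff)

# Empirical null: flip viewpoint/mention labels within each pair at random,
# which negates that pair's difference, then recompute every head's t.
null_ts = [paired_t(diff * rng.choice([-1.0, 1.0], size=(N_PAIRS, 1)))
           for _ in range(200)]
null_p99 = np.percentile(np.abs(null_ts), 99)
```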
By design, differences in per-head response across the two passes cannot be attributed to scene variation (the input image is identical), agent variation (the named creature is identical), or syntactic length (matched four-word substitution). Only the syntactic role of the creature changes between passes.

## Results

| metric | observed | null mean | null p99 | observed / null p99 |
|---|---:|---:|---:|---:|
| heads with \|t\|>3 | 12 091 (74%) | 61 | 654 | 18× |
| heads with \|t\|>5 | 9 574 (59%) | 0 | 3 | **3 191×** |
| max \|t\| | 31.26 | – | – | – |

74% of all 16 320 attention heads respond differentially to "render this kitchen from X's POV" vs. "render this kitchen with X in it", even though X is the same creature on both passes and the input image is identical. The 9 574 heads with \|t\|>5, against an empirical 99th-percentile null of 3, yield an observed/null ratio of roughly 3 191×.

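The headline percentages and ratios follow directly from the raw counts in the table; a quick arithmetic check:

```python
# Re-derive the table's percentages and observed/null ratios from raw counts.
TOTAL_HEADS = 16_320
obs_t3, null_p99_t3 = 12_091, 654
obs_t5, null_p99_t5 = 9_574, 3

pct_t3 = 100 * obs_t3 / TOTAL_HEADS   # fraction of heads with |t| > 3
pct_t5 = 100 * obs_t5 / TOTAL_HEADS   # fraction of heads with |t| > 5
ratio_t3 = obs_t3 / null_p99_t3       # observed / null p99 at |t| > 3
ratio_t5 = obs_t5 / null_p99_t5       # observed / null p99 at |t| > 5
```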
The strongest selective heads cluster in joint MMDiT block 2, consistent with text-driven cross-modal modulation occurring early in the network, where text-image cross-attention is dense.

## License

Apache 2.0.

## References

- Gabeur, V., Long, S., Peng, S., et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Black Forest Labs. *FLUX.2 Klein.* https://bfl.ai/models/flux-2-klein (2025).