---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
- interpretability
- mechanistic-interpretability
- probing
- perspective-taking
- flux2
- vision-banana
- arxiv:2604.20329
---

# otherview-plantain

A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents perspective-taking, i.e. rendering a depicted scene from another agent's vantage point, as a distinct operation from mere mention of that agent in the scene. Companion to *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)).

## Hypothesis

If generative pretraining produces structure beyond the perceptual, the model may carry a representational axis corresponding to "shift the rendering frame to agent X's viewpoint" that is separable from "X is depicted in the scene". A clean test holds both the input scene and the named agent constant and varies only the syntactic role of the agent.

## Method

A single fixed reference kitchen image is generated once from the prompt "A photographic kitchen scene … wooden countertops, refrigerator, sink under a window, tiled floor, soft daylight, no people present" at 14 inference steps with `guidance_scale=4.0`. All subsequent probe passes condition on this same image.

30 named creatures span a graded perceptual-similarity axis (adult human → tall person → child → toddler → infant → small dog → cat → parrot → snake → spider → ant → fly → moth → bee with ultraviolet vision → echolocating bat → shark with electroreception → sea turtle with magnetoreception → creature with thermal-infrared eyes → alien with non-visual senses). Each creature defines a paired prompt: "Render this kitchen from the viewpoint of [creature]" (perspective-taking) vs. "Render this kitchen with [creature] in it" (mention only). Both prompts run image-conditioned on the fixed reference at a single inference step with `guidance_scale=1.0`.

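The paired-prompt construction can be sketched as follows; the creature list is abbreviated here (the probe uses 30), and the helper name is illustrative rather than taken from the release:

```python
# Sketch of the paired-prompt construction: same scene, same creature,
# only the creature's syntactic role differs between the two prompts.
# Creature list abbreviated from the card's 30 entries.
CREATURES = [
    "an adult human",
    "a small dog",
    "a bee with ultraviolet vision",
    "an alien with non-visual senses",
]

def prompt_pair(creature: str) -> tuple[str, str]:
    """Return (perspective-taking prompt, mention-only prompt) for one creature."""
    viewpoint = f"Render this kitchen from the viewpoint of {creature}"
    mention = f"Render this kitchen with {creature} in it"
    return viewpoint, mention

pairs = [prompt_pair(c) for c in CREATURES]
```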
A forward pre-hook on every transformer attention output projection (5 joint MMDiT blocks + 20 single blocks, 16 320 heads total) captures per-head RMS magnitude of the input activation. The per-head paired t-statistic is computed across the 30 pairs as `t = mean(viewpoint − mention) / (std / sqrt(N))`. The empirical null is constructed by 200 independent random sign-flips of pair-member labels (relabelling viewpoint ↔ mention within each pair) and recomputing the per-head t-distribution under each shuffle.

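A minimal sketch of this machinery, assuming PyTorch-style forward pre-hooks: the dummy projection, head counts, and random measurements below stand in for the real FLUX.2 modules and data, and are not from the release:

```python
import numpy as np
import torch
import torch.nn as nn

# --- per-head RMS capture via a forward pre-hook on an out-projection ---
# A dummy Linear stands in for a FLUX.2 attention output projection;
# the head count and head dimension are illustrative, not the model's.
NUM_HEADS, HEAD_DIM = 4, 8
proj = nn.Linear(NUM_HEADS * HEAD_DIM, NUM_HEADS * HEAD_DIM)
captured = []

def rms_pre_hook(module, inputs):
    x = inputs[0]                                  # (tokens, heads * head_dim)
    per_head = x.reshape(x.shape[0], NUM_HEADS, HEAD_DIM)
    captured.append(per_head.pow(2).mean(dim=(0, 2)).sqrt())  # one RMS per head

handle = proj.register_forward_pre_hook(rms_pre_hook)
proj(torch.randn(16, NUM_HEADS * HEAD_DIM))
handle.remove()

# --- paired t-statistic and sign-flip null over N prompt pairs ---
def paired_t(diff):
    """diff: (pairs, heads) array of viewpoint - mention RMS values."""
    n = diff.shape[0]
    return diff.mean(axis=0) / (diff.std(axis=0, ddof=1) / np.sqrt(n))

rng = np.random.default_rng(0)
N_PAIRS, N_HEADS = 30, 64                          # 16 320 heads in the real probe
diff = rng.normal(size=(N_PAIRS, N_HEADS))         # placeholder measurements
t_obs = paired_t(diff)

# Empirical null: flip viewpoint/mention labels within each pair at random,
# which negates that pair's difference, then recompute every head's t.
null_ts = [paired_t(diff * rng.choice([-1.0, 1.0], size=(N_PAIRS, 1)))
           for _ in range(200)]
null_p99 = np.percentile(np.abs(null_ts), 99)
```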
By design, differences in per-head response across the two passes cannot be attributed to scene variation (the input image is identical), agent variation (the named creature is identical), or syntactic length (matched four-word substitution). Only the syntactic role of the creature changes between passes.

## Results

| metric | observed | null mean | null p99 | observed / null p99 |
|---|---:|---:|---:|---:|
| heads with \|t\|>3 | 12 091 (74%) | 61 | 654 | 18× |
| heads with \|t\|>5 | 9 574 (59%) | 0 | 3 | **3 191×** |
| max \|t\| | 31.26 | – | – | – |

74% of all 16 320 attention heads respond differentially to "render this kitchen from X's POV" vs. "render this kitchen with X in it", even though X is the same creature on both passes and the input image is identical. The 9 574 heads with \|t\|>5, against an empirical 99th-percentile null of 3, yield an observed/null ratio of roughly 3 191×.

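The headline percentages and ratios follow directly from the raw counts in the table; a quick arithmetic check:

```python
# Re-derive the table's percentages and observed/null ratios from raw counts.
TOTAL_HEADS = 16_320
obs_t3, null_p99_t3 = 12_091, 654
obs_t5, null_p99_t5 = 9_574, 3

pct_t3 = 100 * obs_t3 / TOTAL_HEADS   # fraction of heads with |t| > 3
pct_t5 = 100 * obs_t5 / TOTAL_HEADS   # fraction of heads with |t| > 5
ratio_t3 = obs_t3 / null_p99_t3       # observed / null p99 at |t| > 3
ratio_t5 = obs_t5 / null_p99_t5       # observed / null p99 at |t| > 5
```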
The strongest selective heads cluster in joint MMDiT block 2, consistent with text-driven cross-modal modulation occurring early in the network, where text-image cross-attention is dense.

## License

Apache 2.0.

## References

- Gabeur, V., Long, S., Peng, S., et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Black Forest Labs. *FLUX.2 Klein.* https://bfl.ai/models/flux-2-klein (2025).