---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
- interpretability
- per-head-attention
- paired-prompt-probe
- human-pose
- flux2
- vision-banana
- arxiv:2604.20329
pipeline_tag: image-to-image
---
# dense-plantain
A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents canonical body pose as a separable axis from expressive pose deformation, parallel to the canonical-vs-deformed coordinate distinction in dense human-pose models (DensePose, SMPL).
## Thesis
Dense human-pose representations encode body shape in a fixed canonical reference pose (the T-pose) plus a deformation that maps it to the observed pose. The question here is whether image-generation models develop an analogous axis without explicit pose supervision. dense-plantain pairs descriptions of a body subject in a canonical T-pose against the same subject in an expressive pose, and tests whether per-head attention shifts systematically between the two conditions.
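To make the pairing concrete, a stimulus pair holds the subject and setting fixed and varies only the pose clause. The pair below is illustrative only; the actual 25 stimuli are not published in this card:

```python
# Hypothetical prompt pair (illustrative; NOT one of the probe's actual stimuli).
# Subject and setting are held constant; only the pose clause differs.
pair = (
    "a dancer standing in a neutral T-pose, arms straight out, studio lighting",
    "a dancer mid-leap with arms arced overhead, studio lighting",
)
assert len(pair) == 2
```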
## Method
Twenty-five paired prompts hold body subject and setting constant; only the pose configuration varies (canonical T-pose vs. expressive pose). The per-head capture protocol is identical to the rest of the plantain probe family.
Rigor add-ons: per-head Cohen's d effect size; split-half consistency via 100 random 50/50 stimulus splits.
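The per-head statistics can be sketched as follows. This is a minimal reconstruction under assumptions, not the probe's actual code: it assumes each head is summarized by one scalar per prompt, and uses synthetic activations in place of real captures. The head count (16,320) and pair count (25) come from this card.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, n_heads = 25, 16320  # stimulus pairs and total attention heads (from the card)

# Synthetic per-head scalar summaries, shape (pairs, heads), one array per condition.
canonical = rng.normal(0.0, 1.0, (n_pairs, n_heads))
expressive = canonical + rng.normal(0.3, 1.0, (n_pairs, n_heads))

# Paired differences drive both statistics.
diff = expressive - canonical
t = diff.mean(0) / (diff.std(0, ddof=1) / np.sqrt(n_pairs))  # paired t per head
d = diff.mean(0) / diff.std(0, ddof=1)                       # Cohen's d (paired)

# Split-half consistency: correlate t-maps from 100 random 50/50 stimulus splits.
rs = []
for _ in range(100):
    idx = rng.permutation(n_pairs)
    a, b = idx[: n_pairs // 2], idx[n_pairs // 2 :]
    ta = diff[a].mean(0) / (diff[a].std(0, ddof=1) / np.sqrt(len(a)))
    tb = diff[b].mean(0) / (diff[b].std(0, ddof=1) / np.sqrt(len(b)))
    rs.append(np.corrcoef(ta, tb)[0, 1])
split_half_r = float(np.median(rs))
```

With 25 pairs the "50/50" split is necessarily 12/13; the card does not specify how it handles the odd pair, so the rounding here is an assumption.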
## Results
| Metric | Value | Significance |
|--------------------------------|-----------------|---------------------------|
| Heads with \|t\| > 3 | 8,263 (50.6%) | 6.2× empirical null p99 |
| Heads with \|t\| > 5 | 4,614 (28.3%) | 659× empirical null p99 |
| Heads with \|d\| > 0.8 (large) | 6,164 (37.8%) | — |
| Split-half r (median) | 0.863 | [0.85, 0.87] IQR |
| Max \|t\| | 22.45 | — |
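The "×empirical null p99" column compares the observed fraction of super-threshold heads against the 99th percentile of that fraction under a null distribution. The card does not state how its null was built; a common choice for paired designs is sign-flipping the paired differences, sketched here on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs, n_heads = 25, 16320

# Synthetic paired differences (stand-in for real per-head capture data).
diff = rng.normal(0.4, 1.0, (n_pairs, n_heads))

def head_t(x):
    """Paired t-statistic per head for a (pairs, heads) difference array."""
    return x.mean(0) / (x.std(0, ddof=1) / np.sqrt(x.shape[0]))

observed = (np.abs(head_t(diff)) > 3).mean()  # fraction of heads with |t| > 3

# Sign-flipping null: randomly negate each pair's difference vector, which
# breaks the canonical-vs-expressive labeling while preserving magnitudes.
null_fracs = []
for _ in range(200):
    signs = rng.choice([-1.0, 1.0], size=(n_pairs, 1))
    null_fracs.append((np.abs(head_t(diff * signs)) > 3).mean())

p99 = np.percentile(null_fracs, 99)
fold = observed / max(p99, 1e-12)  # the "×empirical null p99" style of figure
```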
**Top blocks by max \|t\|:**
- single[0]: max\|t\|=22.45, 283/768 heads at \|t\|>3, median \|d\|=0.45
- single[3]: max\|t\|=22.06, 389/768 heads at \|t\|>3, median \|d\|=0.61
- single[9]: max\|t\|=21.66, 232/768 heads at \|t\|>3, median \|d\|=0.39
- joint[3]: max\|t\|=20.18, 138/192 heads at \|t\|>3, median \|d\|=1.02
- joint[4]: max\|t\|=19.62, 131/192 heads at \|t\|>3, median \|d\|=1.01
**Interpretation.** The body-pose axis is strong (659× null at |t|>5) and highly reproducible (split-half r=0.86). Over half of all 16,320 attention heads register the canonical-vs-expressive pose distinction at |t|>3, and over a third reach Cohen's d > 0.8. Localization is mixed: max signal is in early single blocks (single[0]/[3]/[9]), but joint-block median |d| is highest (≥1.0 in joint[3] and joint[4]), suggesting pose information is initially routed through cross-attention text-image fusion (joint blocks) and then maintained through early single-block transformer processing.
The result places Klein among models that develop an internal canonical-pose representation without explicit pose supervision, parallel to, but more diffuse than, the structured axis encoded in DensePose-trained networks.
## Status
Probe complete. No LoRA training; this is a base-model interpretability finding.
## Limitations
The T-pose vs. expressive-pose pair has unmatched articulation complexity — expressive poses involve more anatomical regions deviating from canonical position than T-pose does. A residual contributor to the signal could be "amount of articulation" rather than canonical-vs-deformed status specifically. A follow-up controlling articulation (e.g., A-pose vs. expressive pose with matched limb configurations) would tighten the claim.
Body-type and clothing variations are not factored out across the 25 pairs.
The probe is correlational.
## License
Apache 2.0 — matches base FLUX.2 Klein 4B.
## References
- Gabeur, V., Long, S., Peng, S., et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Black Forest Labs. *FLUX.2 Klein.* https://bfl.ai/models/flux-2-klein (2025).