---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
- interpretability
- per-head-attention
- paired-prompt-probe
- human-pose
- flux2
- vision-banana
- arxiv:2604.20329
pipeline_tag: image-to-image
---

# dense-plantain

A per-head attention probe of FLUX.2 Klein 4B, testing whether the base model represents canonical body pose as an axis separable from expressive pose deformation, parallel to the canonical-vs-deformed coordinate distinction in dense human-pose models (DensePose, SMPL).

## Thesis

Dense human-pose representations encode body shape in a fixed canonical reference pose (the T-pose) plus a deformation that maps it to the observed pose. The question is whether image-generation models develop an analogous axis without explicit pose supervision. dense-plantain pairs descriptions of a subject in a canonical T-pose against descriptions of the same subject in an expressive pose, and tests whether per-head attention shifts systematically between the two conditions.

## Method

Twenty-five paired prompts hold body subject and setting constant; only the pose configuration varies (canonical T-pose vs. expressive pose). The per-head capture protocol is identical to the rest of the plantain probe family.

Rigor add-ons: per-head Cohen's d effect sizes, and split-half consistency across 100 random 50/50 stimulus splits.

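For readers unfamiliar with the statistics, the per-head quantities can be sketched as follows. Everything here is illustrative: the shapes, variable names, and synthetic data are stand-ins, since the actual per-head capture protocol is not reproduced in this card.

```python
import numpy as np

# Illustrative per-head paired statistics. `canonical` and `expressive`
# stand in for one captured scalar summary per head per prompt pair.
rng = np.random.default_rng(0)
n_pairs, n_heads = 25, 16320

canonical = rng.normal(size=(n_pairs, n_heads))
true_effect = rng.normal(0.5, 0.4, size=n_heads)   # synthetic per-head shift
expressive = canonical + true_effect + rng.normal(scale=0.5,
                                                  size=(n_pairs, n_heads))

diff = expressive - canonical                 # paired differences, per head
mean, sd = diff.mean(axis=0), diff.std(axis=0, ddof=1)

t = mean / (sd / np.sqrt(n_pairs))            # paired t statistic per head
d = mean / sd                                 # Cohen's d for paired samples
frac_large = float((np.abs(d) > 0.8).mean())  # fraction with a large effect

def split_half_r(diff, n_splits=100, rng=rng):
    """Median correlation of per-head t across random 50/50 stimulus splits."""
    half = diff.shape[0] // 2
    rs = []
    for _ in range(n_splits):
        idx = rng.permutation(diff.shape[0])
        ts = []
        for sel in (idx[:half], idx[half:]):
            x = diff[sel]
            ts.append(x.mean(0) / (x.std(0, ddof=1) / np.sqrt(len(x))))
        rs.append(np.corrcoef(ts[0], ts[1])[0, 1])
    return float(np.median(rs))
```

A high `split_half_r` means the per-head t map is stable under resampling of the 25 stimulus pairs, which is what the split-half row in the results table reports.
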
## Results

| Metric                          | Value           | Significance              |
|---------------------------------|-----------------|---------------------------|
| Heads with \|t\| > 3            | 8,263 (50.6%)   | 6.2× empirical null p99   |
| Heads with \|t\| > 5            | 4,614 (28.3%)   | 659× empirical null p99   |
| Heads with \|d\| > 0.8 (large)  | 6,164 (37.8%)   | —                         |
| Split-half r (median)           | 0.863           | [0.85, 0.87] IQR          |
| Max \|t\|                       | 22.45           | —                         |

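The "empirical null" comparisons in the table can be reproduced in outline: swap the canonical/expressive labels within each pair (equivalent to sign-flipping that pair's difference), recount threshold exceedances, and compare the observed count to the null distribution's 99th percentile. The sketch below uses synthetic data and toy split counts; it shows only the shape of the computation.

```python
import numpy as np

# Toy empirical null for the |t| > threshold exceedance counts.
# One sign flip per pair, applied across all heads at once, preserves
# the between-head correlation structure under the null.
rng = np.random.default_rng(1)
n_pairs, n_heads = 25, 16320
diff = rng.normal(0.3, 1.0, size=(n_pairs, n_heads))  # synthetic differences

def t_per_head(x):
    return x.mean(0) / (x.std(0, ddof=1) / np.sqrt(x.shape[0]))

observed = int((np.abs(t_per_head(diff)) > 3).sum())

null_counts = []
for _ in range(200):
    signs = rng.choice([-1.0, 1.0], size=(n_pairs, 1))  # one flip per pair
    null_counts.append((np.abs(t_per_head(signs * diff)) > 3).sum())

null_p99 = float(np.percentile(null_counts, 99))
ratio = observed / max(null_p99, 1.0)   # cf. "6.2x empirical null p99"
```
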
**Top blocks by max \|t\|:**

- single[0]: max\|t\|=22.45, 283/768 heads at \|t\|>3, median \|d\|=0.45
- single[3]: max\|t\|=22.06, 389/768 heads at \|t\|>3, median \|d\|=0.61
- single[9]: max\|t\|=21.66, 232/768 heads at \|t\|>3, median \|d\|=0.39
- joint[3]: max\|t\|=20.18, 138/192 heads at \|t\|>3, median \|d\|=1.02
- joint[4]: max\|t\|=19.62, 131/192 heads at \|t\|>3, median \|d\|=1.01

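Per-block summaries like the list above come from grouping the flat per-head statistics by block label. The layout assumed below (5 joint blocks × 192 heads, 20 single blocks × 768 heads) is an inference that matches the 16,320-head total and the per-block denominators quoted; it is not a confirmed architecture detail, and the statistics are synthetic.

```python
import numpy as np

# Group flat per-head t and d by block label, mirroring the
# "Top blocks by max |t|" list. Block layout is an assumption
# consistent with the head counts in this card, nothing more.
rng = np.random.default_rng(2)
labels = np.array([f"joint[{i // 192}]" for i in range(5 * 192)]
                  + [f"single[{i // 768}]" for i in range(20 * 768)])
t = rng.normal(0.0, 5.0, size=labels.size)
d = rng.normal(0.0, 1.0, size=labels.size)

def block_summary(labels, t, d):
    """Per block: (name, max |t|, heads with |t|>3, total heads, median |d|)."""
    rows = []
    for blk in np.unique(labels):
        m = labels == blk
        rows.append((str(blk),
                     float(np.abs(t[m]).max()),
                     int((np.abs(t[m]) > 3).sum()),
                     int(m.sum()),
                     float(np.median(np.abs(d[m])))))
    return sorted(rows, key=lambda r: -r[1])  # sort by max |t| descending

top5 = block_summary(labels, t, d)[:5]
```
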
**Interpretation.** The body-pose axis is strong (659× the empirical null at \|t\| > 5) and highly reproducible (split-half r = 0.86). Over half of all 16,320 attention heads register the canonical-vs-expressive pose distinction at \|t\| > 3, and over a third reach Cohen's d > 0.8. Localization is mixed: the maximum signal sits in early single blocks (single[0]/[3]/[9]), but joint-block median \|d\| is highest (≥ 1.0 in joint[3] and joint[4]), suggesting pose information is first routed through the text-image fusion in the joint blocks and then maintained through early single-block processing.

This places Klein among models that develop an internal canonical-pose representation without explicit pose supervision, parallel to, but more diffuse than, the structured axis encoded by DensePose-trained networks.

## Status

Probe complete. No LoRA training; this is a base-model interpretability finding.

## Limitations

The T-pose vs. expressive-pose contrast does not control for articulation complexity: expressive poses involve more anatomical regions deviating from canonical position than the T-pose does. Some of the signal may therefore track "amount of articulation" rather than canonical-vs-deformed status specifically. A follow-up controlling for articulation (e.g., A-pose vs. expressive pose with matched limb configurations) would tighten the claim.

Body-type and clothing variations are not factored out across the 25 pairs.

The probe is correlational; no causal intervention (such as head ablation or activation patching) has been performed.

## License

Apache 2.0, matching the base FLUX.2 Klein 4B license.

## References

- Gabeur, V., Long, S., Peng, S., et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Black Forest Labs. *FLUX.2 Klein.* https://bfl.ai/models/flux-2-klein (2025).