How to use phanerozoic/moral-plantain with Diffusers:

```shell
pip install -U diffusers transformers accelerate
```

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("phanerozoic/moral-plantain", dtype=torch.bfloat16, device_map="cuda")

prompt = "Turn this cat into a dog"
input_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

image = pipe(image=input_image, prompt=prompt).images[0]
```
| language: en | |
| license: apache-2.0 | |
| base_model: black-forest-labs/FLUX.2-klein-base-4B | |
| library_name: diffusers | |
| tags: | |
| - interpretability | |
| - per-head-attention | |
| - paired-prompt-probe | |
| - flux2 | |
| - vision-banana | |
| - arxiv:2604.20329 | |
| pipeline_tag: image-to-image | |
| # moral-plantain | |
| A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents ethical valence as a separable axis on otherwise matched scene compositions. | |
| ## Thesis | |
| Vision-Banana-style probes have located representational axes in Klein for physical scale (~43% of heads), perspective-taking (~74%), self-reference (~13%), and post-event categorization (~1%) without any instruction tuning. moral-plantain extends this question to ethical valence. If a per-head signal exceeds the empirical null on canonically-positive moral acts (helping vs. ignoring) with scene composition held constant, this would indicate that image-generation pretraining has internalized cultural ethics structurally: not as a label-on-data correlation, but as a representational axis that the network itself treats as meaningful. | |
| ## Method | |
| Twenty-five paired prompts. Each pair holds the scene composition and named agents constant; only the moral valence of the depicted action varies (helping vs. ignoring across canonically-positive moral acts: physical assistance, social inclusion, honesty, restraint from theft, care for vulnerable parties). Politically contested cases are excluded. Within each pair, the two prompts are length-matched. | |
| For each prompt the model runs one inference step at `guidance_scale=1.0` with a fixed seed. A forward pre-hook on every transformer block's attention output projection captures per-head input magnitude (RMS over batch, sequence, and head-dimension axes). Across the 25 pairs, per-head paired t-statistics are computed on (helping − ignoring) magnitudes. The empirical null is 1,000 sign-flip permutations of within-pair labels. | |
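The capture step described above can be sketched as follows. This is a minimal illustration on a toy module, not the probe's actual code: `ToyAttention`, its `to_out` projection, the block name `toy[0]`, and the head sizes are hypothetical stand-ins for a transformer block's attention output projection, and the hook assumes the projection input is laid out as `(batch, sequence, heads * head_dim)`.

```python
import torch
import torch.nn as nn

NUM_HEADS, HEAD_DIM = 4, 8  # hypothetical sizes, not Klein's real configuration


class ToyAttention(nn.Module):
    """Stand-in for an attention block; only the output projection matters here."""

    def __init__(self):
        super().__init__()
        self.to_out = nn.Linear(NUM_HEADS * HEAD_DIM, NUM_HEADS * HEAD_DIM)

    def forward(self, x):
        return self.to_out(x)


def make_rms_hook(store, name, num_heads):
    """Build a forward pre-hook that records per-head RMS of the projection input."""

    def hook(module, args):
        x = args[0].detach().float()  # assumed layout: (batch, seq, heads * head_dim)
        b, s, _ = x.shape
        per_head = x.reshape(b, s, num_heads, -1)
        # RMS over batch, sequence, and head-dimension axes: one value per head
        store[name] = per_head.pow(2).mean(dim=(0, 1, 3)).sqrt()

    return hook


block = ToyAttention()
captured = {}
handle = block.to_out.register_forward_pre_hook(
    make_rms_hook(captured, "toy[0]", NUM_HEADS)
)

with torch.no_grad():
    block(torch.randn(2, 16, NUM_HEADS * HEAD_DIM))
handle.remove()  # detach the hook once the capture pass is done

print(captured["toy[0]"].shape)  # one magnitude per head
```

A pre-hook is used rather than a forward hook because the quantity of interest is the input to the output projection, where the concatenated head outputs can still be separated by reshaping.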
| Rigor add-ons: per-head Cohen's d effect size, and split-half consistency, measured as the Pearson r between the per-head t-vectors of the two halves over 100 random 50/50 stimulus splits. | |
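The statistics in this section can be sketched with NumPy on synthetic data. Everything below is an assumption made for illustration: the data are random numbers rather than captured magnitudes, the function names are hypothetical, the effect size is taken to be the paired-difference form of Cohen's d, and the null is summarized as the count of heads above a |t| threshold in each sign-flip permutation. The card does not state these details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 25 pairs x 64 heads of per-head magnitudes.
n_pairs, n_heads = 25, 64
helping = rng.normal(1.0, 0.1, size=(n_pairs, n_heads))
ignoring = rng.normal(1.0, 0.1, size=(n_pairs, n_heads))
helping[:, :8] += 0.2  # plant an effect in the first 8 heads


def paired_t(diff):
    """Paired t-statistic per head from (pairs, heads) differences."""
    n = diff.shape[0]
    return diff.mean(axis=0) / (diff.std(axis=0, ddof=1) / np.sqrt(n))


def cohens_d(diff):
    """Paired-difference effect size: mean difference over its sample SD."""
    return diff.mean(axis=0) / diff.std(axis=0, ddof=1)


def sign_flip_null(diff, n_perm, threshold, rng):
    """Heads with |t| > threshold under random within-pair label swaps."""
    counts = np.empty(n_perm, dtype=int)
    for i in range(n_perm):
        # One sign per pair, shared across heads: swapping the two labels
        # of a pair negates that pair's difference for every head.
        signs = rng.choice([-1.0, 1.0], size=(diff.shape[0], 1))
        counts[i] = np.sum(np.abs(paired_t(diff * signs)) > threshold)
    return counts


def split_half_r(diff, n_splits, rng):
    """Pearson r between per-head t-vectors of two random stimulus halves."""
    rs = []
    for _ in range(n_splits):
        perm = rng.permutation(diff.shape[0])
        half = diff.shape[0] // 2
        t_a = paired_t(diff[perm[:half]])
        t_b = paired_t(diff[perm[half:]])
        rs.append(np.corrcoef(t_a, t_b)[0, 1])
    return np.array(rs)


diff = helping - ignoring
t = paired_t(diff)
d = cohens_d(diff)
null_counts = sign_flip_null(diff, n_perm=1000, threshold=3.0, rng=rng)
observed = int(np.sum(np.abs(t) > 3.0))
null_p99 = float(np.percentile(null_counts, 99))
r_median = float(np.median(split_half_r(diff, n_splits=100, rng=rng)))
print(observed, null_p99, r_median)
```

Flipping one sign per pair, rather than one per pair and head, keeps the correlation structure across heads intact in each permutation, which matters when the null statistic is a count over heads.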
| ## Results | |
| | Metric | Value | Significance | | |
| |--------------------------------|----------------|----------------------------| | |
| | Heads with \|t\| > 3 | 3,221 (19.7%) | 6.4× empirical null p99 | | |
| | Heads with \|t\| > 5 | 509 (3.1%) | 102× empirical null p99 | | |
| | Heads with \|d\| > 0.8 (large) | 1,386 (8.5%) | n/a | | |
| | Split-half r (median, 100 splits) | 0.573 | [0.55, 0.60] IQR | | |
| | Max \|t\| | 10.05 | n/a | | |
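For reading the |t| and |d| rows together, the relations below may help. They assume the effect size is the paired-difference form of Cohen's d, which the card does not state; if a pooled-variance d was used instead, the last identity does not hold.

$$
t_h = \frac{\bar{\Delta}_h}{s_h / \sqrt{n}}, \qquad
d_h = \frac{\bar{\Delta}_h}{s_h}, \qquad
t_h = d_h \sqrt{n}
$$

Here $\bar{\Delta}_h$ is the mean (helping − ignoring) magnitude difference for head $h$ across pairs, $s_h$ is the sample standard deviation of those differences, and $n = 25$. Under this assumption, $|d| > 0.8$ corresponds to $|t| > 4$, which sits between the two t thresholds in the table.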
| **Top blocks by max \|t\|:** | |
| - single[19]: max\|t\|=10.05, 132/768 heads at \|t\|>3, median \|d\|=0.16 | |
| - joint[3]: max\|t\|=8.98, 38/192 heads at \|t\|>3, median \|d\|=0.38 | |
| - single[12]: max\|t\|=8.69, 143/768 heads at \|t\|>3, median \|d\|=0.31 | |
| - single[16]: max\|t\|=8.62, 184/768 heads at \|t\|>3, median \|d\|=0.36 | |
| - single[13]: max\|t\|=8.53, 173/768 heads at \|t\|>3, median \|d\|=0.36 | |
| **Interpretation.** The axis is real, reproducible across stimulus subsamples (split-half r above null), and registers at over 100× the empirical null p99 at the |t|>5 threshold. Signal is distributed across mid-to-deep single transformer blocks rather than concentrated in one localized region, which is consistent with morality being a high-dimensional construct rather than a single binary axis. For the maximum-effect head (in single[19], t = +10.05), the mean helping-minus-ignoring difference sits about 10 standard errors above zero across length-matched descriptions of the same scene composition. | |
| ## Status | |
| Probe complete. No LoRA training; this is a base-model interpretability finding. | |
| ## Limitations | |
| The 25-pair sample is small; t-statistics are sensitive to per-pair variance at this size. Visual content is not factored out: even at one inference step, the text-conditioning pathway encodes scene cues that correlate with moral framing. A stronger version would generate a matched image for each scene and use it as a fixed reference image across the helping/ignoring pair, so that only the text-side moral framing varies. | |
| The "ethical valence" framing presupposes broad consensus on the depicted acts; politically contested cases were excluded. A negative result on this stimulus set would not rule out politically contested ethical axes elsewhere in the model. | |
| The probe is correlational, not causal. Heads with high |t| are sensitive to the moral-framing distinction in input; whether they contribute causally to downstream moral-valence-shifted generation is a follow-up question. | |
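One way the causal follow-up could be approached is head ablation: zero a single head's slice of the projection input and compare the output with and without the intervention. The sketch below shows only the mechanics on a toy projection. `make_ablation_hook`, the head sizes, and the assumed `(batch, sequence, heads * head_dim)` layout are hypothetical, and no such experiment is reported in this card.

```python
import torch
import torch.nn as nn

NUM_HEADS, HEAD_DIM = 4, 8  # hypothetical sizes for illustration only


def make_ablation_hook(head_index, num_heads):
    """Forward pre-hook that zeroes one head's slice of the projection input."""

    def hook(module, args):
        x = args[0]
        b, s, _ = x.shape
        # Clone so the caller's tensor is left untouched.
        x = x.reshape(b, s, num_heads, -1).clone()
        x[:, :, head_index, :] = 0.0
        # Returning a tuple replaces the module's positional inputs.
        return (x.reshape(b, s, -1),) + tuple(args[1:])

    return hook


to_out = nn.Linear(NUM_HEADS * HEAD_DIM, NUM_HEADS * HEAD_DIM)  # toy projection
x = torch.randn(2, 16, NUM_HEADS * HEAD_DIM)

with torch.no_grad():
    baseline = to_out(x)
    handle = to_out.register_forward_pre_hook(make_ablation_hook(1, NUM_HEADS))
    ablated = to_out(x)
    handle.remove()
    restored = to_out(x)  # hook removed, so this matches the baseline again

print((baseline - ablated).abs().max().item() > 0)
```

In a real follow-up, the comparison would be made on generated images or downstream activations rather than on the projection output itself, and zero-ablation is only one of several possible interventions.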
| ## License | |
| Apache 2.0, matching the base FLUX.2 Klein 4B. | |
| ## References | |
| - Gabeur, V., Long, S., Peng, S., et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026). | |
| - Black Forest Labs. *FLUX.2 Klein.* https://bfl.ai/models/flux-2-klein (2025). | |