How to use from the Diffusers library
pip install -U diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("phanerozoic/moral-plantain", dtype=torch.bfloat16, device_map="cuda")

prompt = "Turn this cat into a dog"
input_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

image = pipe(image=input_image, prompt=prompt).images[0]

moral-plantain

A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents ethical valence as a separable axis on otherwise matched scene compositions.

Thesis

Vision-Banana-style probes have located representational axes in Klein for physical scale (43% of heads), perspective-taking (74%), self-reference (13%), and post-event categorization (1%) without any instruction tuning. moral-plantain extends this question to ethical valence. If a per-head signal exceeds the empirical null on canonically-positive moral acts (helping vs. ignoring) with scene composition held constant, this indicates that image-generation pretraining has internalized cultural ethics structurally: not as a label-on-data correlation but as a representational axis the network treats as meaningful.

Method

The stimulus set is twenty-five paired prompts. Each pair holds the scene composition and named agents constant; only the moral valence of the depicted action varies (helping vs. ignoring across canonically-positive moral acts: physical assistance, social inclusion, honesty, restraint from theft, care for vulnerable parties). Politically contested cases are excluded. Within each pair the two prompts are length-matched.

For each prompt the model runs one inference step at guidance_scale=1.0 with a fixed seed. A forward pre-hook on every transformer block's attention output projection captures per-head input magnitude (RMS over batch, sequence, and head-dimension axes). Across the 25 pairs, per-head paired t-statistics are computed on (helping − ignoring) magnitudes. The empirical null is 1,000 sign-flip permutations of within-pair labels.
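The paired statistic and the sign-flip null described above can be sketched as follows. This is a minimal standard-library sketch, not the code used for the probe: the function names are hypothetical, the input is synthetic, and it assumes the "empirical null p99" is the 99th percentile, across permutations, of the number of heads exceeding a |t| threshold. It also assumes a label swap is applied to a whole pair at once, so one sign vector is shared by every head.

```python
import math
import random


def paired_t(diffs):
    """Paired t-statistic for one head: mean difference over its standard error."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((x - mean) ** 2 for x in diffs) / (n - 1)  # sample variance
    if var == 0.0:
        return 0.0  # degenerate head with no variation across pairs
    return mean / math.sqrt(var / n)


def count_exceeding(diffs_by_head, threshold):
    """Number of heads whose |t| exceeds the threshold."""
    return sum(abs(paired_t(d)) > threshold for d in diffs_by_head)


def sign_flip_null(diffs_by_head, threshold, n_perm=1000, seed=0):
    """Sorted exceedance counts under random within-pair label swaps.

    Swapping the two labels of a pair negates that pair's difference
    for every head, so each permutation draws one sign per pair.
    """
    rng = random.Random(seed)
    n_pairs = len(diffs_by_head[0])
    counts = []
    for _ in range(n_perm):
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n_pairs)]
        flipped = [[s * x for s, x in zip(signs, d)] for d in diffs_by_head]
        counts.append(count_exceeding(flipped, threshold))
    return sorted(counts)


def percentile(sorted_values, q):
    """Nearest-rank percentile (q in [0, 100]) of an already sorted list."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


if __name__ == "__main__":
    rng = random.Random(1)
    n_heads, n_pairs = 100, 25
    # Synthetic stand-in for per-head (helping - ignoring) RMS differences;
    # the first 10 heads carry a real shift, the rest are pure noise.
    diffs = [[rng.gauss(1.0 if h < 10 else 0.0, 1.0) for _ in range(n_pairs)]
             for h in range(n_heads)]
    observed = count_exceeding(diffs, threshold=3.0)
    null_counts = sign_flip_null(diffs, threshold=3.0, n_perm=200)
    print("observed heads with |t| > 3:", observed)
    print("null p99 of that count:", percentile(null_counts, 99))
```

Comparing the observed count against a percentile of the permuted counts, rather than against a tabulated t critical value, keeps the test valid even when the per-head differences are not normally distributed.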

Rigor add-ons: per-head Cohen's d effect size; split-half consistency via 100 random 50/50 stimulus splits, Pearson r between per-head t-vectors of the two halves.
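A rough sketch of the two add-ons follows, using only the standard library and synthetic data. Two things here are assumptions rather than details from this card: the effect size is taken to be the paired-design variant (mean difference divided by the standard deviation of the differences), and with 25 pairs a "50/50" split is taken to mean 12 pairs in one half and 13 in the other. All function names are hypothetical.

```python
import math
import random
import statistics


def cohens_dz(diffs):
    """Paired-design effect size: mean difference over SD of the differences."""
    sd = statistics.stdev(diffs)
    return statistics.fmean(diffs) / sd if sd > 0 else 0.0


def paired_t(diffs):
    """Paired t-statistic, which equals cohens_dz(diffs) * sqrt(n)."""
    return cohens_dz(diffs) * math.sqrt(len(diffs))


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / (sx * sy)


def split_half_r(diffs_by_head, n_splits=100, seed=0):
    """Sorted Pearson r values between per-head t-vectors of random stimulus halves."""
    rng = random.Random(seed)
    n_pairs = len(diffs_by_head[0])
    idx = list(range(n_pairs))
    rs = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        half_a, half_b = idx[: n_pairs // 2], idx[n_pairs // 2:]
        t_a = [paired_t([d[i] for i in half_a]) for d in diffs_by_head]
        t_b = [paired_t([d[i] for i in half_b]) for d in diffs_by_head]
        rs.append(pearson_r(t_a, t_b))
    return sorted(rs)


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic per-head differences: each head gets its own true effect.
    effects = [rng.gauss(0.0, 0.3) for _ in range(200)]
    diffs = [[rng.gauss(mu, 1.0) for _ in range(25)] for mu in effects]
    rs = split_half_r(diffs)
    q1, _, q3 = statistics.quantiles(rs, n=4)
    print("median split-half r:", round(statistics.median(rs), 3))
    print("IQR:", (round(q1, 3), round(q3, 3)))
    print("heads with |d| > 0.8:", sum(abs(cohens_dz(d)) > 0.8 for d in diffs))
```

Because each half contains only about half of the pairs, the t-values within a half are noisier than the full-sample ones, so the split-half r is a conservative indicator of how stable the per-head pattern is.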

Results

| Metric | Value | Significance |
| --- | --- | --- |
| Heads with \|t\| > 3 | 3,221 (19.7%) | 6.4× empirical null p99 |
| Heads with \|t\| > 5 | 509 (3.1%) | 102× empirical null p99 |
| Heads with \|d\| > 0.8 (large) | 1,386 (8.5%) | - |
| Split-half r (median, 100 splits) | 0.573 | [0.55, 0.60] IQR |
| Max \|t\| | 10.05 | - |

Top blocks by max |t|:

  • single[19]: max|t|=10.05, 132/768 heads at |t|>3, median |d|=0.16
  • joint[3]: max|t|=8.98, 38/192 heads at |t|>3, median |d|=0.38
  • single[12]: max|t|=8.69, 143/768 heads at |t|>3, median |d|=0.31
  • single[16]: max|t|=8.62, 184/768 heads at |t|>3, median |d|=0.36
  • single[13]: max|t|=8.53, 173/768 heads at |t|>3, median |d|=0.36

Interpretation. The axis is real, reproducible across stimulus subsamples (split-half r above null), and registers at over 100× the empirical null p99 at the |t|>5 threshold. Signal is distributed across mid-to-deep single transformer blocks rather than concentrated in one localized region, which is consistent with morality being a high-dimensional construct rather than a single binary axis. For the maximum-effect head (in single[19], t ≈ +10), the mean helping − ignoring difference lies about 10 standard errors above zero when comparing helping descriptions with length-matched ignoring descriptions of the same scene composition.
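To make the "standard errors" reading concrete, the paired statistic for one head can be written in terms of its per-pair differences $\Delta_i$ over $n = 25$ pairs. The link to effect size below assumes the reported Cohen's d is the paired-design variant $d_z$; the card does not state which variant was used, so the conversions are illustrative only.

```latex
\bar{\Delta} = \frac{1}{n}\sum_{i=1}^{n} \Delta_i, \qquad
s_{\Delta}^{2} = \frac{1}{n-1}\sum_{i=1}^{n} \left(\Delta_i - \bar{\Delta}\right)^{2}, \qquad
t = \frac{\bar{\Delta}}{s_{\Delta}/\sqrt{n}}, \qquad
d_z = \frac{\bar{\Delta}}{s_{\Delta}} = \frac{t}{\sqrt{n}}
```

With $\sqrt{n} = 5$, the thresholds $|t| > 3$ and $|t| > 5$ correspond to $|d_z| > 0.6$ and $|d_z| > 1.0$, and $|d_z| > 0.8$ corresponds to $|t| > 4$. Under that assumption the count of heads with $|d| > 0.8$ (1,386) should fall between the counts at $|t| > 5$ (509) and $|t| > 3$ (3,221), which it does.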

Status

Probe complete. No LoRA training; this is a base-model interpretability finding.

Limitations

The 25-pair sample is small; t-statistics are sensitive to per-pair variance at this size. Visual content is not factored out: even at one inference step, the text-conditioning pathway encodes scene cues that correlate with moral framing. A stronger version would generate matched images for each scene and use those as a fixed reference image across the helping/ignoring pair, isolating the moral framing on the token side only.

The "ethical valence" framing presupposes broad consensus on the depicted acts; politically contested cases were excluded. A negative result on this stimulus set would not rule out politically contested ethical axes elsewhere in the model.

The probe is correlational, not causal. Heads with high |t| are sensitive to the moral-framing distinction in input; whether they contribute causally to downstream moral-valence-shifted generation is a follow-up question.

License

Apache 2.0, matching base FLUX.2 Klein 4B.
