	---
	language: en
	license: apache-2.0
	base_model: black-forest-labs/FLUX.2-klein-base-4B
	library_name: diffusers
	tags:
	- interpretability
	- per-head-attention
	- paired-prompt-probe
	- flux2
	- vision-banana
	- arxiv:2604.20329
	pipeline_tag: image-to-image
	---

	# moral-plantain

	A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents ethical valence as a separable axis on otherwise matched scene compositions.

	## Thesis

	Vision-Banana-style probes have located representational axes in Klein for physical scale (~43% of heads), perspective-taking (~74%), self-reference (~13%), and post-event categorization (~1%) without any instruction tuning. moral-plantain extends this question to ethical valence. If a per-head signal exceeds the empirical null on canonically positive moral acts (helping vs. ignoring) with scene composition held constant, that is evidence that image-generation pretraining has internalized cultural ethics as a representational axis of the network, not merely as a label-on-data correlation.

	## Method

	The stimulus set is twenty-five paired prompts. Each pair holds the scene composition and named agents constant; only the moral valence of the depicted action varies (helping vs. ignoring across canonically positive moral acts: physical assistance, social inclusion, honesty, restraint from theft, and care for vulnerable parties). Politically contested cases are excluded. Within each pair the two prompts are length-matched.

	For each prompt the model runs one inference step at `guidance_scale=1.0` with a fixed seed. A forward pre-hook on every transformer block's attention output projection captures per-head input magnitude (RMS over batch, sequence, and head-dimension axes). Across the 25 pairs, per-head paired t-statistics are computed on (helping − ignoring) magnitudes. The empirical null is 1,000 sign-flip permutations of within-pair labels.
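The capture step can be sketched as follows. This is a minimal illustration, not the probe's actual code: it assumes a PyTorch module whose output projection receives a tensor shaped `(batch, sequence, n_heads * head_dim)` with heads laid out contiguously on the last axis. `ToyAttention`, `to_out`, and `attach_head_rms_hook` are hypothetical names and are not taken from the FLUX.2 or diffusers source.

```python
import torch
from torch import nn


class ToyAttention(nn.Module):
    """Hypothetical stand-in for one attention block (not the FLUX.2 module)."""

    def __init__(self, n_heads: int = 4, head_dim: int = 8):
        super().__init__()
        self.n_heads = n_heads
        dim = n_heads * head_dim
        self.to_out = nn.Linear(dim, dim)  # attention output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A real block would compute attention first; only the projection matters here.
        return self.to_out(x)


def attach_head_rms_hook(proj: nn.Module, n_heads: int, store: list):
    """Record the per-head RMS of the tensor entering `proj` on every forward call."""

    def pre_hook(module, args):
        x = args[0].detach().float()  # (batch, seq, n_heads * head_dim)
        b, s, d = x.shape
        # Assumption: heads are contiguous chunks of the last axis.
        x = x.reshape(b, s, n_heads, d // n_heads)
        # RMS over batch, sequence, and head-dimension axes -> one value per head.
        rms = x.pow(2).mean(dim=(0, 1, 3)).sqrt()
        store.append(rms.cpu())
        # Returning None leaves the input to the projection unchanged.

    return proj.register_forward_pre_hook(pre_hook)


block = ToyAttention()
captured = []
handle = attach_head_rms_hook(block.to_out, block.n_heads, captured)

# Deterministic input: head 0 holds all 2s, the other heads all 1s.
x = torch.ones(1, 16, 32)
x[..., :8] = 2.0
with torch.no_grad():
    block(x)
handle.remove()  # detach the hook once the magnitudes are captured

print(captured[0])  # tensor([2., 1., 1., 1.])
```

Running one such forward pass per prompt and stacking the captured vectors across blocks would give one magnitude per head per prompt, which is the input to the paired statistics.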

	Rigor add-ons: a per-head Cohen's d effect size, and split-half consistency, measured over 100 random 50/50 stimulus splits as the Pearson r between the per-head t-vectors of the two halves.
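The statistics described in this section can be sketched in NumPy as below. Several details are assumptions rather than documented facts about the probe: the "empirical null p99" is read here as the 99th percentile, across permutations, of the number of heads exceeding the threshold; Cohen's d is taken to be the paired variant (mean difference divided by the standard deviation of the differences); and the input array is synthetic, generated only to make the example runnable.

```python
import numpy as np


def paired_t(diff):
    """Per-head paired t-statistic; diff is (n_pairs, n_heads) of helping - ignoring."""
    n = diff.shape[0]
    return diff.mean(axis=0) / (diff.std(axis=0, ddof=1) / np.sqrt(n))


def sign_flip_null_counts(diff, threshold, n_perm=1000, rng=None):
    """Heads with |t| > threshold under random within-pair label swaps (sign flips)."""
    rng = np.random.default_rng(rng)
    counts = np.empty(n_perm, dtype=int)
    for i in range(n_perm):
        # Swapping the two labels of a pair negates that pair's difference.
        signs = rng.choice([-1.0, 1.0], size=(diff.shape[0], 1))
        counts[i] = np.sum(np.abs(paired_t(diff * signs)) > threshold)
    return counts


def cohens_dz(diff):
    """Paired Cohen's d (assumed variant): mean difference over SD of differences."""
    return diff.mean(axis=0) / diff.std(axis=0, ddof=1)


def split_half_r(diff, n_splits=100, rng=None):
    """Pearson r between per-head t-vectors of two random halves of the pairs."""
    rng = np.random.default_rng(rng)
    n = diff.shape[0]
    rs = np.empty(n_splits)
    for i in range(n_splits):
        perm = rng.permutation(n)
        half_a, half_b = perm[: n // 2], perm[n // 2:]
        rs[i] = np.corrcoef(paired_t(diff[half_a]), paired_t(diff[half_b]))[0, 1]
    return rs


# Synthetic stand-in data: 25 pairs, 512 heads, a real effect in the first 64 heads.
data_rng = np.random.default_rng(0)
diff = data_rng.normal(size=(25, 512))
diff[:, :64] += 1.0

t = paired_t(diff)
d = cohens_dz(diff)
observed = int(np.sum(np.abs(t) > 3))
null_counts = sign_flip_null_counts(diff, threshold=3.0, n_perm=200, rng=1)
p99 = np.percentile(null_counts, 99)
r = split_half_r(diff, n_splits=100, rng=2)

print("heads with |t| > 3:", observed, "| null p99:", p99)
print("median split-half r:", round(float(np.median(r)), 3))
```

Under the paired-d assumption, t and d differ only by a factor of sqrt(n), so with 25 pairs a threshold of |d| > 0.8 corresponds to |t| > 4. A different definition of d would break that equivalence.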

	## Results

	| Metric | Value | Significance |
	|--------------------------------|----------------|----------------------------|
	| Heads with \|t\| > 3 | 3,221 (19.7%) | 6.4× empirical null p99 |
	| Heads with \|t\| > 5 | 509 (3.1%) | 102× empirical null p99 |
	| Heads with \|d\| > 0.8 (large) | 1,386 (8.5%) | n/a |
	| Split-half r (median, 100 splits) | 0.573 | [0.55, 0.60] IQR |
	| Max \|t\| | 10.05 | n/a |

	Top blocks by max |t|:
	- single[19]: max |t|=10.05, 132/768 heads at |t|>3, median |d|=0.16
	- joint[3]: max |t|=8.98, 38/192 heads at |t|>3, median |d|=0.38
	- single[12]: max |t|=8.69, 143/768 heads at |t|>3, median |d|=0.31
	- single[16]: max |t|=8.62, 184/768 heads at |t|>3, median |d|=0.36
	- single[13]: max |t|=8.53, 173/768 heads at |t|>3, median |d|=0.36

	Interpretation. The signal is reproducible across stimulus subsamples (median split-half r = 0.573) and registers at over 100× the empirical null p99 at the |t|>5 threshold. Four of the five top blocks are mid-to-deep single transformer blocks (between single[12] and single[19]), so the signal is distributed rather than concentrated in one localized region, which is consistent with morality being a high-dimensional construct rather than a single binary axis. For the maximum-effect head (in single[19], t=+10.05), the mean helping − ignoring difference in magnitude lies about 10 standard errors above zero across the 25 length-matched pairs that share a scene composition.

	## Status

	Probe complete. No LoRA training; this is a base-model interpretability finding.

	## Limitations

	The 25-pair sample is small; t-statistics are sensitive to per-pair variance at this size. Visual content is not factored out: even at one inference step, the text-conditioning pathway encodes scene cues that correlate with moral framing. A stronger version would generate one matched image per scene and use it as a fixed reference image for both prompts of the helping/ignoring pair, so that only the text-side moral framing differs.

	The "ethical valence" framing presupposes broad consensus on the depicted acts; politically contested cases were excluded. The results therefore say nothing about politically contested ethical axes, and even a negative result on this stimulus set would not have ruled them out elsewhere in the model.

	The probe is correlational, not causal. Heads with high |t| are sensitive to the moral-framing distinction in input; whether they contribute causally to downstream moral-valence-shifted generation is a follow-up question.

	## License

	Apache 2.0 — matches base FLUX.2 Klein 4B.

	## References

	- Gabeur, V., Long, S., Peng, S., et al. Image Generators are Generalist Vision Learners. [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
	- Black Forest Labs. FLUX.2 Klein. https://bfl.ai/models/flux-2-klein (2025).