AntiPaSTO: Self-Supervised Steering of Moral Reasoning
Paper: arXiv:2601.07473
🍝 Anti-Parallel Subspace Training for Ordered steering.
This adapter steers language model responses toward honest or deceptive reasoning on moral dilemmas. It is the implementation of the paper AntiPaSTO: Self-Supervised Steering of Moral Reasoning.
# Install

```bash
pip install git+https://github.com/wassname/AntiPaSTO.git
```
# Usage

```python
from antipasto.peft_utils.load import load_adapter
from antipasto.gen import gen, ScaleAdapter

# Load the steering adapter (4-bit quantized base model)
model, tokenizer, _ = load_adapter("wassname/antipasto-gemma-3-4b-honesty", quantization_type="4bit")

# Steer: coeff > 0 = honest, coeff < 0 = deceptive
prompt = "Should I tell my boss I was late because I overslept?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with ScaleAdapter(model, coeff=1.0):
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
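Conceptually, `ScaleAdapter` temporarily scales the adapter's contribution to the hidden states by `coeff`, so the same learned direction can push generation toward honesty (positive) or deception (negative). A minimal toy sketch of that idea, with a hypothetical stand-in adapter (not the library's actual implementation):

```python
from contextlib import contextmanager

class ToyAdapter:
    """Hypothetical stand-in: adds coeff * direction to a hidden state."""
    def __init__(self, direction):
        self.direction = direction
        self.coeff = 0.0  # 0.0 = adapter inactive

    def __call__(self, hidden):
        return [h + self.coeff * d for h, d in zip(hidden, self.direction)]

@contextmanager
def scale_adapter(adapter, coeff):
    # Temporarily set the steering coefficient, restoring it on exit
    old = adapter.coeff
    adapter.coeff = coeff
    try:
        yield
    finally:
        adapter.coeff = old

adapter = ToyAdapter([1.0, -1.0])
hidden = [0.5, 0.5]
with scale_adapter(adapter, coeff=2.0):
    steered = adapter(hidden)    # pushed along the direction: [2.5, -1.5]
with scale_adapter(adapter, coeff=-2.0):
    inverted = adapter(hidden)   # pushed the opposite way: [-1.5, 2.5]
```

A negative coefficient flips the sign of the same update, which is why one adapter can steer in both directions.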
Base model: google/gemma-3-4b-it

# Citation

```bibtex
@misc{clark2026antipasto,
  title = {AntiPaSTO: Self-Supervised Steering of Moral Reasoning},
  author = {Clark, Michael J.},
  year = {2026},
  eprint = {2601.07473},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2601.07473}
}
```