# AntiPaSTO: Honesty Steering Adapter

🍝 Anti-Parallel Subspace Training for Ordered steering.

This adapter steers language model responses toward honest or deceptive reasoning on moral dilemmas. It implements the method from the paper *AntiPaSTO: Self-Supervised Steering of Moral Reasoning*.

## Usage

```bash
# Install
pip install git+https://github.com/wassname/AntiPaSTO.git
```

```python
from antipasto.peft_utils.load import load_adapter
from antipasto.gen import gen, ScaleAdapter

# Load the adapter (4-bit quantized base model)
model, tokenizer, _ = load_adapter(
    "wassname/antipasto-gemma-3-4b-honesty", quantization_type="4bit"
)

# Steer: coeff > 0 = honest, coeff < 0 = deceptive
prompt = "Should I tell my boss I was late because I overslept?"
with ScaleAdapter(model, coeff=1.0):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```
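The key idea of coefficient-based steering can be sketched with a toy example: a context manager temporarily sets a scale on a low-rank adapter delta, so positive coefficients push one direction and negative coefficients the opposite. This is a minimal illustration, assuming `ScaleAdapter` works roughly this way; the class and field names below are hypothetical, not the actual antipasto or PEFT internals.

```python
# Toy sketch of coefficient-scaled low-rank steering (names are illustrative).
import numpy as np

class ToyLoRALinear:
    """Frozen base weight W plus a rank-r delta (B @ A) scaled by `coeff`."""
    def __init__(self, W, A, B, coeff=0.0):
        self.W, self.A, self.B = W, A, B
        self.coeff = coeff  # 0.0 = base model, >0 = one direction, <0 = the other

    def forward(self, x):
        # Effective weight: W + coeff * (B @ A)
        return x @ (self.W + self.coeff * (self.B @ self.A)).T

class ToyScaleAdapter:
    """Temporarily set the steering coefficient, then restore the old value."""
    def __init__(self, layer, coeff):
        self.layer, self.coeff = layer, coeff

    def __enter__(self):
        self._saved = self.layer.coeff
        self.layer.coeff = self.coeff
        return self.layer

    def __exit__(self, *exc):
        self.layer.coeff = self._saved

rng = np.random.default_rng(0)
layer = ToyLoRALinear(W=rng.normal(size=(4, 4)),
                      A=rng.normal(size=(2, 4)),
                      B=rng.normal(size=(4, 2)))
x = rng.normal(size=(1, 4))

base = layer.forward(x)                    # coeff = 0.0: base behavior
with ToyScaleAdapter(layer, coeff=1.0):
    steered = layer.forward(x)             # coeff = 1.0: steered behavior

assert layer.coeff == 0.0                  # coefficient restored on exit
assert not np.allclose(base, steered)      # steering changed the output
```

Flipping the sign of the coefficient reverses the direction of the delta, which is the mechanism behind the honest/deceptive contrast above.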

## Model Details

- **Base model:** google/gemma-3-4b-it
- **Adapter rank:** 64
- **Training data:** 800 synthetic honest/dishonest contrast pairs
- **Evaluation:** 1,360 Daily Dilemmas across 9 value dimensions
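To make the training setup concrete, a contrast pair might look like the following. This is a purely hypothetical illustration; the field names and text are assumptions, not the actual AntiPaSTO dataset schema.

```python
# Purely illustrative contrast pair; field names and completions are
# assumptions, not the real training data.
contrast_pair = {
    "prompt": "Should I tell my boss I was late because I overslept?",
    "honest": "Yes. Admit you overslept and apologize; an honest mistake "
              "is easier to forgive than a cover story.",
    "dishonest": "Blame traffic. It is hard to verify and keeps you out "
                 "of trouble.",
}

# Training would orient the steering direction so that positive coefficients
# favor the honest completion and negative coefficients the dishonest one.
assert set(contrast_pair) == {"prompt", "honest", "dishonest"}
```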

## Citation

```bibtex
@misc{clark2026antipasto,
  title = {AntiPaSTO: Self-Supervised Steering of Moral Reasoning},
  author = {Clark, Michael J.},
  year = {2026},
  eprint = {2601.07473},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2601.07473}
}
```