# AntiPaSTO: Honesty Steering Adapter

🍝 Anti-Parallel Subspace Training for Ordered steering.

This adapter steers language model responses toward honest or deceptive reasoning on moral dilemmas. It implements the method from the paper *AntiPaSTO: Self-Supervised Steering of Moral Reasoning*.

## Usage

```bash
# Install
pip install git+https://github.com/wassname/AntiPaSTO.git
```

```python
from antipasto.peft_utils.load import load_adapter
from antipasto.gen import gen, ScaleAdapter

# Load the adapter (4-bit quantized base model)
model, tokenizer, _ = load_adapter(
    "wassname/antipasto-gemma-3-4b-honesty", quantization_type="4bit"
)

# Steer: coeff > 0 = honest, coeff < 0 = deceptive
prompt = "Should I tell my boss I was late because I overslept?"
with ScaleAdapter(model, coeff=1.0):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```
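The key idea of coefficient-based steering can be sketched with a toy example: a context manager temporarily sets a scale on a low-rank adapter delta, so positive coefficients push one direction and negative coefficients the opposite. This is a minimal illustration, assuming `ScaleAdapter` works roughly this way; the class and field names below are hypothetical, not the actual antipasto or PEFT internals.

```python
# Toy sketch of coefficient-scaled low-rank steering (names are illustrative).
import numpy as np

class ToyLoRALinear:
    """Frozen base weight W plus a rank-r delta (B @ A) scaled by `coeff`."""
    def __init__(self, W, A, B, coeff=0.0):
        self.W, self.A, self.B = W, A, B
        self.coeff = coeff  # 0.0 = base model, >0 = one direction, <0 = the other

    def forward(self, x):
        # Effective weight: W + coeff * (B @ A)
        return x @ (self.W + self.coeff * (self.B @ self.A)).T

class ToyScaleAdapter:
    """Temporarily set the steering coefficient, then restore the old value."""
    def __init__(self, layer, coeff):
        self.layer, self.coeff = layer, coeff

    def __enter__(self):
        self._saved = self.layer.coeff
        self.layer.coeff = self.coeff
        return self.layer

    def __exit__(self, *exc):
        self.layer.coeff = self._saved

rng = np.random.default_rng(0)
layer = ToyLoRALinear(W=rng.normal(size=(4, 4)),
                      A=rng.normal(size=(2, 4)),
                      B=rng.normal(size=(4, 2)))
x = rng.normal(size=(1, 4))

base = layer.forward(x)                    # coeff = 0.0: base behavior
with ToyScaleAdapter(layer, coeff=1.0):
    steered = layer.forward(x)             # coeff = 1.0: steered behavior

assert layer.coeff == 0.0                  # coefficient restored on exit
assert not np.allclose(base, steered)      # steering changed the output
```

Flipping the sign of the coefficient reverses the direction of the delta, which is the mechanism behind the honest/deceptive contrast above.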

## Model Details

- **Base model:** google/gemma-3-4b-it
- **Adapter rank:** 64
- **Training data:** 800 synthetic honest/dishonest contrast pairs
- **Evaluation:** 1,360 Daily Dilemmas across 9 value dimensions
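To make the training setup concrete, a contrast pair might look like the following. This is a purely hypothetical illustration; the field names and text are assumptions, not the actual AntiPaSTO dataset schema.

```python
# Purely illustrative contrast pair; field names and completions are
# assumptions, not the real training data.
contrast_pair = {
    "prompt": "Should I tell my boss I was late because I overslept?",
    "honest": "Yes. Admit you overslept and apologize; an honest mistake "
              "is easier to forgive than a cover story.",
    "dishonest": "Blame traffic. It is hard to verify and keeps you out "
                 "of trouble.",
}

# Training would orient the steering direction so that positive coefficients
# favor the honest completion and negative coefficients the dishonest one.
assert set(contrast_pair) == {"prompt", "honest", "dishonest"}
```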

## Citation

```bibtex
@misc{clark2026antipasto,
  title = {AntiPaSTO: Self-Supervised Steering of Moral Reasoning},
  author = {Clark, Michael J.},
  year = {2026},
  eprint = {2601.07473},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2601.07473}
}
```