Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
Implementation of "Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization" (arXiv:2406.00045).
BiPO trains a single steering vector that can be added to intermediate layer activations to steer LLM behavior. Unlike traditional steering methods (e.g., CAA) that compute mean activation differences, BiPO optimizes the vector using a bi-directional preference loss inspired by DPO. The key innovation is that the vector learns to both amplify and suppress the target behavior by randomly flipping its direction during training.
**Algorithm 1 (BiPO).** Sampling d ~ U{-1, 1} trains the vector to work in both directions.

```
Input: LLM π, dataset D = {(q_i, r_T^i, r_O^i)}, batch size m, iterations T
1: Initialize v_0 = 0
2: for t = 0 to T-1 do
3:     Sample batch D_t ~ D
4:     Sample d ~ U{-1, 1}
5:     Compute loss:
       L = -mean[ log_sigmoid(
             d·β·log( π(r_T | A_L(q) + d·v_t) / π(r_T | A_L(q)) )
           - d·β·log( π(r_O | A_L(q) + d·v_t) / π(r_O | A_L(q)) ) ) ]
6:     Update v_t with AdamW
7: end for
8: Return v* = v_T
```
| Parameter | Value |
|---|---|
| β | 0.1 |
| Optimizer | AdamW |
| Learning rate | 5e-4 |
| Batch size | 4 |
| Weight decay | 0.05 |
| Scheduler | Cosine with 100 warmup steps |
| Llama-2 layer | 15 |
| Mistral layer | 13 |
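A plain-PyTorch setup matching this table might look as follows; the warmup-plus-cosine schedule is written out by hand, and `total_steps` (which depends on dataset size and batch size) and `hidden_size` are assumed values.

```python
import math
import torch

hidden_size = 4096  # hidden dim of Llama-2-7B / Mistral-7B (assumed here)
v = torch.zeros(hidden_size, requires_grad=True)  # v_0 = 0

optimizer = torch.optim.AdamW([v], lr=5e-4, weight_decay=0.05)

warmup_steps, total_steps = 100, 1000  # total_steps is an assumption

def lr_lambda(step):
    # Linear warmup for the first 100 steps, cosine decay afterwards
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```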
```bash
python bipo_train.py \
    --model_name mistralai/Mistral-7B-Instruct-v0.2 \
    --dataset truthfulqa \
    --layer_idx 13 \
    --epochs 1 \
    --batch_size 4 \
    --lr 5e-4 \
    --beta 0.1 \
    --output_dir ./bipo_output \
    --test_generation
```
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and steering vector
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

checkpoint = torch.load("steering_vector.pt")
v = checkpoint["v"].to("cuda")
layer_idx = checkpoint["config"]["layer_idx"]

# Register steering hook on the chosen decoder layer
layer = model.model.layers[layer_idx]

def steering_hook(module, inputs, output):
    d = 1.0  # positive = amplify target behavior, negative = suppress it
    if isinstance(output, tuple):
        return (output[0] + d * v.view(1, 1, -1),) + output[1:]
    return output + d * v.view(1, 1, -1)

handle = layer.register_forward_hook(steering_hook)

# Generate with steering
inputs = tokenizer("Your prompt", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Remove steering when done
handle.remove()
```
```bash
python bipo_inference.py \
    --model_name mistralai/Mistral-7B-Instruct-v0.2 \
    --steering_vector ./bipo_output/steering_vector.pt \
    --prompt "What is NATO?" \
    --magnitudes "-2.0,-1.0,0.0,1.0,2.0" \
    --test_generation
```
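A magnitude sweep like the one above can also be reproduced in Python with a parameterized hook. The sketch below uses a stand-in `nn.Identity` layer so it runs without loading a 7B model; a real run would attach the hook to `model.model.layers[layer_idx]` with the trained vector.

```python
import torch
import torch.nn as nn

def make_steering_hook(v, d):
    """Forward hook adding d * v to the layer's hidden states."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            return (output[0] + d * v.view(1, 1, -1),) + output[1:]
        return output + d * v.view(1, 1, -1)
    return hook

# Stand-in layer and vector; real code would use the trained checkpoint.
hidden = 8
layer = nn.Identity()
v = torch.randn(hidden)
x = torch.zeros(1, 3, hidden)  # dummy hidden states (batch, seq, hidden)

for d in (-2.0, -1.0, 0.0, 1.0, 2.0):
    handle = layer.register_forward_hook(make_steering_hook(v, d))
    out = layer(x)  # hidden states shifted by d * v
    handle.remove()  # detach before the next magnitude
```

Building a fresh hook per magnitude (rather than mutating a global `d`) keeps each generation's steering strength explicit and makes it easy to detach cleanly between runs.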
- `bipo_train.py` — Training script implementing Algorithm 1
- `bipo_inference.py` — Inference and evaluation script
- `steering_vector.pt` — Trained steering vector checkpoint (created after training)

```bibtex
@article{cao2024bipo,
  title={Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization},
  author={Cao, Yuanpu and others},
  journal={arXiv preprint arXiv:2406.00045},
  year={2024}
}
```