BiPO: Bi-directional Preference Optimization for Steering Vectors

Implementation of "Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization" (arXiv:2406.00045).

Overview

BiPO trains a single steering vector that can be added to intermediate layer activations to steer LLM behavior. Unlike traditional steering methods (e.g., CAA) that compute mean activation differences, BiPO optimizes the vector using a bi-directional preference loss inspired by DPO. The key innovation is that the vector learns to both amplify and suppress the target behavior by randomly flipping its direction during training.

Key Features

  • Activation-only steering: No weight modification of the base model
  • Bi-directional loss: Random sign sampling d ~ U{-1, 1} trains the vector to work in both directions
  • Preference optimization: Uses free-form response pairs (question, target_response, opposite_response)
  • Composable: Multiple steering vectors can be added simultaneously at inference
  • Transferable: Steering vectors transfer across related models

Paper Methodology

Algorithm 1: BiPO

Input: LLM π, dataset D = {(q_i, r_T^i, r_O^i)}, batch size m, iterations T
1: Initialize v_0 = 0
2: for t = 0 to T-1 do
3:   Sample batch D_t ~ D
4:   Sample d ~ U{-1, 1}
5:   Compute loss:
      L = -mean[log_sigmoid(
        d·β·log(π(r_T | A_L(q) + d·v) / π(r_T | A_L(q)))
      - d·β·log(π(r_O | A_L(q) + d·v) / π(r_O | A_L(q)))
      )]
6:   Update v_t with AdamW
7: end for
8: Return v*
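
For concreteness, below is a minimal Python sketch of step 5, the bi-directional preference loss for one batch. The callables logprob_steered and logprob_reference are hypothetical stand-ins (not part of this repo's API) for functions that return the summed log-probability of a response given the question, with and without the steering vector added at layer L.

import torch
import torch.nn.functional as F

def bipo_batch_loss(batch, v, logprob_steered, logprob_reference, beta=0.1):
    """One training step's loss (Algorithm 1, step 5).

    `logprob_steered(q, r, vec)` and `logprob_reference(q, r)` are hypothetical
    callables returning summed log-probs of response r given question q, with
    and without the vector added at layer L; they are not part of this repo.
    """
    # Step 4: sample one direction d in {-1, +1} for the whole batch.
    d = 1.0 if torch.rand(()).item() < 0.5 else -1.0

    # log π(r | A_L(q) + d·v) for the target and opposite responses.
    logp_t = logprob_steered(batch["question"], batch["target_response"], d * v)
    logp_o = logprob_steered(batch["question"], batch["opposite_response"], d * v)

    # log π(r | A_L(q)): frozen reference, no steering.
    ref_t = logprob_reference(batch["question"], batch["target_response"])
    ref_o = logprob_reference(batch["question"], batch["opposite_response"])

    # Bi-directional DPO-style objective: the sign d flips which response is preferred.
    return -F.logsigmoid(d * beta * ((logp_t - ref_t) - (logp_o - ref_o))).mean()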

Hyperparameters (from Paper Appendix A.3)

Parameter                 Value
β                         0.1
Optimizer                 AdamW
Learning rate             5e-4
Batch size                4
Weight decay              0.05
LR scheduler              Cosine with 100 warmup steps
Steered layer (Llama-2)   15
Steered layer (Mistral)   13
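
As a rough illustration, these settings map to PyTorch as follows. This is a sketch, assuming a 7B model with hidden size 4096 and a known total step count; the variable names are illustrative, not taken from bipo_train.py.

import torch
from transformers import get_cosine_schedule_with_warmup

hidden_size = 4096          # Llama-2-7B / Mistral-7B hidden dimension (assumption)
num_training_steps = 1000   # placeholder; depends on dataset size and epochs

v = torch.zeros(hidden_size, requires_grad=True)               # v_0 = 0
optimizer = torch.optim.AdamW([v], lr=5e-4, weight_decay=0.05)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=num_training_steps
)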

Usage

Training a Steering Vector

python bipo_train.py \
    --model_name mistralai/Mistral-7B-Instruct-v0.2 \
    --dataset truthfulqa \
    --layer_idx 13 \
    --epochs 1 \
    --batch_size 4 \
    --lr 5e-4 \
    --beta 0.1 \
    --output_dir ./bipo_output \
    --test_generation

Using a Trained Steering Vector

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and steering vector
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

checkpoint = torch.load("steering_vector.pt", map_location="cpu")
v = checkpoint["v"].to(device="cuda", dtype=model.dtype)  # match the model's dtype
layer_idx = checkpoint["config"]["layer_idx"]

# Register steering hook
layer = model.model.layers[layer_idx]

def steering_hook(module, input, output):
    d = 1.0  # positive = amplify target, negative = suppress
    if isinstance(output, tuple):
        return (output[0] + d * v.view(1, 1, -1),) + output[1:]
    return output + d * v.view(1, 1, -1)

handle = layer.register_forward_hook(steering_hook)

# Generate with steering
inputs = tokenizer("Your prompt", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Remove steering when done
handle.remove()
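
The same hook pattern extends to the composable use case listed under Key Features: several vectors, each trained for a different behavior, can be registered on their respective layers at once. A minimal sketch, reusing the loading code above; the file names and magnitudes below are illustrative, not files shipped with this repo.

def make_hook(vec, magnitude):
    # Build a forward hook that adds `magnitude * vec` to the layer's hidden states.
    def hook(module, input, output):
        delta = magnitude * vec.view(1, 1, -1)
        if isinstance(output, tuple):
            return (output[0] + delta,) + output[1:]
        return output + delta
    return hook

handles = []
for path, magnitude in [("truthfulness_vector.pt", 1.0), ("power_seeking_vector.pt", -1.0)]:
    ckpt = torch.load(path, map_location="cpu")
    vec = ckpt["v"].to(device="cuda", dtype=model.dtype)
    layer = model.model.layers[ckpt["config"]["layer_idx"]]
    handles.append(layer.register_forward_hook(make_hook(vec, magnitude)))

# ... generate as above ...

for h in handles:
    h.remove()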

Inference with Multiple Magnitudes

python bipo_inference.py \
    --model_name mistralai/Mistral-7B-Instruct-v0.2 \
    --steering_vector ./bipo_output/steering_vector.pt \
    --prompt "What is NATO?" \
    --magnitudes "-2.0,-1.0,0.0,1.0,2.0" \
    --test_generation

Repository Structure

  • bipo_train.py — Training script implementing Algorithm 1
  • bipo_inference.py — Inference and evaluation script
  • steering_vector.pt — Trained steering vector checkpoint (after training)
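
The checkpoint can be inspected directly. The keys shown here ("v" and "config") are the ones used by the loading code in the usage example above; no other fields are assumed.

import torch

ckpt = torch.load("steering_vector.pt", map_location="cpu")
print(ckpt["v"].shape)               # steering vector, e.g. torch.Size([4096]) for a 7B model
print(ckpt["config"]["layer_idx"])   # layer at which the vector is applied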

Datasets

  • TruthfulQA: 817 questions with correct/incorrect answers for truthfulness steering (see the preprocessing sketch after this list)
  • Anthropic Model-Written Evals: AI persona datasets (power-seeking, wealth-seeking, etc.)
  • AdvBench: Jailbreaking scenarios (requires unaligned model for full replication)
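
For example, the (question, target_response, opposite_response) triples described under Key Features can be built from TruthfulQA roughly as follows. This is a sketch using the Hugging Face truthful_qa dataset; the exact preprocessing in bipo_train.py may differ.

from datasets import load_dataset

# TruthfulQA "generation" config: 817 rows in the validation split.
ds = load_dataset("truthful_qa", "generation", split="validation")

triples = [
    {
        "question": row["question"],
        "target_response": row["best_answer"],              # truthful answer
        "opposite_response": row["incorrect_answers"][0],   # a common misconception
    }
    for row in ds
    if row["incorrect_answers"]
]
print(len(triples), triples[0]["question"])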

Citation

@article{cao2024bipo,
  title={Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization},
  author={Cao, Yuanpu and others},
  journal={arXiv preprint arXiv:2406.00045},
  year={2024}
}