Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
Implementation of "Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization" (arXiv:2406.00045).
BiPO trains a single steering vector that can be added to intermediate layer activations to steer LLM behavior. Unlike traditional steering methods (e.g., CAA) that compute mean activation differences, BiPO optimizes the vector using a bi-directional preference loss inspired by DPO. The key innovation is that the vector learns to both amplify and suppress the target behavior by randomly flipping its direction during training.
**Algorithm 1 (BiPO).** Sampling d ~ U{-1, 1} trains the vector to work in both directions.

```
Input: LLM π, dataset D = {(q_i, r_T^i, r_O^i)}, batch size m, iterations T
1: Initialize v_0 = 0
2: for t = 0 to T-1 do
3:     Sample batch D_t ~ D
4:     Sample d ~ U{-1, 1}
5:     Compute loss:
       L = -mean[ log_sigmoid(
             d·β·log( π(r_T | A_L(q) + d·v_t) / π(r_T | A_L(q)) )
           - d·β·log( π(r_O | A_L(q) + d·v_t) / π(r_O | A_L(q)) ) ) ]
6:     Update v_t with AdamW
7: end for
8: Return v* = v_T
```
| Parameter | Value |
|---|---|
| β | 0.1 |
| Optimizer | AdamW |
| Learning rate | 5e-4 |
| Batch size | 4 |
| Weight decay | 0.05 |
| Scheduler | Cosine with 100 warmup steps |
| Llama-2 layer | 15 |
| Mistral layer | 13 |
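A plain-PyTorch setup matching this table might look as follows; the warmup-plus-cosine schedule is written out by hand, and `total_steps` (which depends on dataset size and batch size) and `hidden_size` are assumed values.

```python
import math
import torch

hidden_size = 4096  # hidden dim of Llama-2-7B / Mistral-7B (assumed here)
v = torch.zeros(hidden_size, requires_grad=True)  # v_0 = 0

optimizer = torch.optim.AdamW([v], lr=5e-4, weight_decay=0.05)

warmup_steps, total_steps = 100, 1000  # total_steps is an assumption

def lr_lambda(step):
    # Linear warmup for the first 100 steps, cosine decay afterwards
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```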
```bash
python bipo_train.py \
    --model_name mistralai/Mistral-7B-Instruct-v0.2 \
    --dataset truthfulqa \
    --layer_idx 13 \
    --epochs 1 \
    --batch_size 4 \
    --lr 5e-4 \
    --beta 0.1 \
    --output_dir ./bipo_output \
    --test_generation
```
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and steering vector
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

checkpoint = torch.load("steering_vector.pt")
v = checkpoint["v"].to("cuda")
layer_idx = checkpoint["config"]["layer_idx"]

# Register steering hook on the chosen decoder layer
layer = model.model.layers[layer_idx]

def steering_hook(module, inputs, output):
    d = 1.0  # positive = amplify target behavior, negative = suppress it
    if isinstance(output, tuple):
        return (output[0] + d * v.view(1, 1, -1),) + output[1:]
    return output + d * v.view(1, 1, -1)

handle = layer.register_forward_hook(steering_hook)

# Generate with steering
inputs = tokenizer("Your prompt", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Remove steering when done
handle.remove()
```
```bash
python bipo_inference.py \
    --model_name mistralai/Mistral-7B-Instruct-v0.2 \
    --steering_vector ./bipo_output/steering_vector.pt \
    --prompt "What is NATO?" \
    --magnitudes "-2.0,-1.0,0.0,1.0,2.0" \
    --test_generation
```
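A magnitude sweep like the one above can also be reproduced in Python with a parameterized hook. The sketch below uses a stand-in `nn.Identity` layer so it runs without loading a 7B model; a real run would attach the hook to `model.model.layers[layer_idx]` with the trained vector.

```python
import torch
import torch.nn as nn

def make_steering_hook(v, d):
    """Forward hook adding d * v to the layer's hidden states."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            return (output[0] + d * v.view(1, 1, -1),) + output[1:]
        return output + d * v.view(1, 1, -1)
    return hook

# Stand-in layer and vector; real code would use the trained checkpoint.
hidden = 8
layer = nn.Identity()
v = torch.randn(hidden)
x = torch.zeros(1, 3, hidden)  # dummy hidden states (batch, seq, hidden)

for d in (-2.0, -1.0, 0.0, 1.0, 2.0):
    handle = layer.register_forward_hook(make_steering_hook(v, d))
    out = layer(x)  # hidden states shifted by d * v
    handle.remove()  # detach before the next magnitude
```

Building a fresh hook per magnitude (rather than mutating a global `d`) keeps each generation's steering strength explicit and makes it easy to detach cleanly between runs.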
- `bipo_train.py` — Training script implementing Algorithm 1
- `bipo_inference.py` — Inference and evaluation script
- `steering_vector.pt` — Trained steering vector checkpoint (created after training)

```bibtex
@article{cao2024bipo,
  title={Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization},
  author={Cao, Yuanpu and others},
  journal={arXiv preprint arXiv:2406.00045},
  year={2024}
}
```