---
license: mit
datasets:
- colored-dye/concept500-contrastive
language:
- en
base_model:
- google/gemma-2-2b-it
- google/gemma-2-9b-it
- Qwen/Qwen2.5-32B-Instruct
tags:
- steering-vector
---

# Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions

OpenReview: https://openreview.net/forum?id=AaT3liS5PE

Paper: https://arxiv.org/abs/2605.05983

Data: https://huggingface.co/datasets/colored-dye/concept500-contrastive

Setups:
- `2b_l10`: 10th layer of google/gemma-2-2b-it
- `9b_l20`: 20th layer of google/gemma-2-9b-it
- `q25_32b_l32`: 32nd layer of Qwen/Qwen2.5-32B-Instruct
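
For scripting, the setup names listed above can be kept in a small lookup table. This is only an illustrative sketch: `SETUPS` and `describe_setup` are hypothetical names that do not come from this repository, and whether the layer number is 0-based or 1-based in any particular framework is not stated here, so the integers are copied as written.

```python
# Hypothetical lookup table: setup name -> (Hugging Face model ID, layer number).
# The values are copied from the setup list; the indexing convention of the
# layer number is an assumption left to the reader to confirm.
SETUPS = {
    "2b_l10": ("google/gemma-2-2b-it", 10),
    "9b_l20": ("google/gemma-2-9b-it", 20),
    "q25_32b_l32": ("Qwen/Qwen2.5-32B-Instruct", 32),
}


def describe_setup(name: str) -> str:
    """Return a one-line description of a setup, or raise KeyError if unknown."""
    model_id, layer = SETUPS[name]
    return f"{name}: layer {layer} of {model_id}"


if __name__ == "__main__":
    for setup_name in SETUPS:
        print(describe_setup(setup_name))
```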

Directory structure:

```
.
├── 2b_l10              -- setup
│   └── outputs_add_free
│       ├── all         -- full-sequence intervention
│       │   ├── lang    -- Lang. objective
│       │   │   ├── 0   -- concept 0
│       │   │   ├── 1   -- concept 1
│       │   │   ...
│       │   └── simpo   -- SimPO objective
│       └── f2+l2       -- prompt-only intervention (2 prefix tokens, 2 suffix tokens)
│           ├── lang
│           └── simpo
...
```
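
The layout above can be turned into paths programmatically. The sketch below is a minimal illustration, not part of this repository: `concept_dir` and `parse_intervention` are hypothetical helpers, the `f<N>+l<M>` pattern is generalized from the single `f2+l2` example, and per-concept subdirectories under `f2+l2` are assumed to mirror those shown under `all`. The files stored inside each concept directory are not described here, so the code stops at the directory level.

```python
import re
from pathlib import PurePosixPath
from typing import Optional, Tuple

OUTPUT_DIR = "outputs_add_free"   # directory name taken from the tree
OBJECTIVES = ("lang", "simpo")    # objectives shown in the tree


def concept_dir(setup: str, intervention: str, objective: str,
                concept_id: int) -> PurePosixPath:
    """Build the repo-relative directory for one concept (hypothetical helper)."""
    if objective not in OBJECTIVES:
        raise ValueError(f"unknown objective: {objective!r}")
    return (PurePosixPath(setup) / OUTPUT_DIR / intervention
            / objective / str(concept_id))


def parse_intervention(name: str) -> Optional[Tuple[int, int]]:
    """Return (prefix_tokens, suffix_tokens), or None for full-sequence 'all'.

    Assumes names other than 'all' follow the 'f<N>+l<M>' pattern.
    """
    if name == "all":
        return None
    match = re.fullmatch(r"f(\d+)\+l(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized intervention name: {name!r}")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(concept_dir("2b_l10", "f2+l2", "simpo", 1))
    print(parse_intervention("f2+l2"))
```

Using `PurePosixPath` keeps the separators as forward slashes, which matches how paths are written in a Hub repository regardless of the local operating system.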

## Citation

If you find our work useful, please cite:

```bibtex
@inproceedings{bao2026towards,
  title = {Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions},
  author = {Bao, Yuntai and Li, Qinfeng and Yu, Xinyan and Zhang, Xuhong and Su, Ge and Zhang, Wenqi and Yan, Liu and Weng, Haiqin and Yin, Jianwei},
  booktitle = {Forty-third International Conference on Machine Learning},
  year = {2026},
  url = {https://openreview.net/forum?id=AaT3liS5PE},
}
```