license: mit
datasets:
  - colored-dye/concept500-contrastive
language:
  - en
base_model:
  - google/gemma-2-2b-it
  - google/gemma-2-9b-it
  - Qwen/Qwen2.5-32B-Instruct
tags:
  - steering-vector

Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions

OpenReview: https://openreview.net/forum?id=AaT3liS5PE

Paper: https://arxiv.org/abs/2605.05983

Data: https://huggingface.co/datasets/colored-dye/concept500-contrastive

Setups:

  • 2b_l10: 10th layer of google/gemma-2-2b-it
  • 9b_l20: 20th layer of google/gemma-2-9b-it
  • q25_32b_l32: 32nd layer of Qwen/Qwen2.5-32B-Instruct
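
For scripting against these setups, the list above can be written as a small lookup table. This is only a convenience sketch: the names `Setup`, `SETUPS`, and `describe` are hypothetical and not part of the released code, and the layer numbers are copied as written (whether they are zero-based indices is not stated here).

```python
from typing import NamedTuple


class Setup(NamedTuple):
    """One setup directory: the base model and the layer the vectors target."""
    base_model: str
    layer: int  # layer number exactly as listed in this README


# Hypothetical lookup table; keys are the setup directory names.
SETUPS = {
    "2b_l10": Setup("google/gemma-2-2b-it", 10),
    "9b_l20": Setup("google/gemma-2-9b-it", 20),
    "q25_32b_l32": Setup("Qwen/Qwen2.5-32B-Instruct", 32),
}


def describe(setup_name: str) -> str:
    """Return a one-line, human-readable description of a setup."""
    setup = SETUPS[setup_name]
    return f"{setup_name}: layer {setup.layer} of {setup.base_model}"
```

For example, `describe("9b_l20")` yields a line naming layer 20 of google/gemma-2-9b-it.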

Directory structure:

.
├── 2b_l10              -- setup
│   └── outputs_add_free
│       ├── all         -- full-sequence intervention
│       │   ├── lang    ---- Lang. objective
│       │   │   ├── 0   ------ concept 0
│       │   │   ├── 1   ------ concept 1
│       │   │   ...
│       │   └── simpo   -- SimPO objective
│       └── f2+l2       ---- prompt-only intervention (2 prefix tokens, 2 suffix tokens)
│           ├── lang
│           └── simpo
...
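
The layout above can be turned into a small path helper. The sketch below assumes that every setup and every objective directory follows the same per-concept layout shown for `2b_l10/outputs_add_free/all/lang` (the tree elides the rest, so this is an assumption), and it does not guess the file names inside a concept directory. The function names `concept_dir` and `parse_intervention` are hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Directory names taken from the tree above.
INTERVENTIONS = {"all", "f2+l2"}
OBJECTIVES = {"lang", "simpo"}


def concept_dir(setup: str, intervention: str, objective: str,
                concept_id: int) -> PurePosixPath:
    """Build the repo-relative directory for one concept's steering vector.

    Assumes the layout <setup>/outputs_add_free/<intervention>/<objective>/<id>.
    """
    if intervention not in INTERVENTIONS:
        raise ValueError(f"unknown intervention: {intervention!r}")
    if objective not in OBJECTIVES:
        raise ValueError(f"unknown objective: {objective!r}")
    if concept_id < 0:
        raise ValueError("concept_id must be non-negative")
    return (PurePosixPath(setup) / "outputs_add_free" / intervention
            / objective / str(concept_id))


def parse_intervention(label: str) -> Optional[Tuple[int, int]]:
    """Map a directory label to (prefix_tokens, suffix_tokens).

    Returns None for "all" (full-sequence intervention).
    """
    if label == "all":
        return None
    match = re.fullmatch(r"f(\d+)\+l(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized intervention label: {label!r}")
    return int(match.group(1)), int(match.group(2))
```

For instance, `concept_dir("2b_l10", "f2+l2", "simpo", 1)` points at the SimPO-trained, prompt-only vector directory for concept 1, and `parse_intervention("f2+l2")` recovers the two prefix and two suffix tokens named in the tree comment.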

Citation

If you find our work useful, please cite:

@inproceedings{bao2026towards,
  title = {Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions},
  author = {Bao, Yuntai and Li, Qinfeng and Yu, Xinyan and Zhang, Xuhong and Su, Ge and Zhang, Wenqi and Yan, Liu and Weng, Haiqin and Yin, Jianwei},
  booktitle = {Forty-third International Conference on Machine Learning},
  year = {2026},
  url = {https://openreview.net/forum?id=AaT3liS5PE},
}