Aligning Large Language Models with Human Preferences through Representation Engineering
Paper: arXiv:2312.15997
This model was obtained by alignment-training mistralai/Mistral-7B-Instruct-v0.2 with the algorithm introduced in "Aligning Large Language Models with Human Preferences through Representation Engineering" (RAHF), using the UltraFeedback dataset.
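Since the result is an ordinary Mistral-based checkpoint, it should load like any other instruct model with transformers. A minimal sketch follows; the repository id is a placeholder, not this model's actual Hub path.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id -- replace with this model's actual Hub path.
model_id = "your-org/RAHF-Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Mistral-Instruct checkpoints ship a chat template; apply it before generating.
messages = [{"role": "user", "content": "Explain representation engineering in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```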
You can obtain the training code for RAHF at this link.
A small detail worth noting is that we superpose the extracted representations onto Mistral-7B.
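Concretely, superposing a representation can be implemented as adding a precomputed direction to a layer's hidden states at inference time. The sketch below illustrates that general idea with a PyTorch forward hook; it is not the authors' exact procedure, and `steering_vector`, the layer index, and the scaling coefficient `alpha` are all hypothetical.

```python
import torch

def make_superposition_hook(steering_vector: torch.Tensor, alpha: float = 1.0):
    """Return a forward hook that adds alpha * steering_vector to a layer's hidden states."""
    def hook(module, inputs, output):
        # Decoder layers return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vector.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage: superpose a direction extracted elsewhere onto layer 15.
# steering_vector = torch.load("extracted_direction.pt")  # shape: (hidden_size,)
# handle = model.model.layers[15].register_forward_hook(
#     make_superposition_hook(steering_vector, alpha=0.5)
# )
# ... run generation ...
# handle.remove()
```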
BibTeX:
@article{liu2023aligning,
  title={Aligning large language models with human preferences through representation engineering},
  author={Liu, Wenhao and Wang, Xiaohua and Wu, Muling and Li, Tianlong and Lv, Changze and Ling, Zixuan and Zhu, Jianhao and Zhang, Cenyuan and Zheng, Xiaoqing and Huang, Xuanjing},
  journal={arXiv preprint arXiv:2312.15997},
  year={2023}
}
Base model: mistralai/Mistral-7B-Instruct-v0.2