This model is a fine-tuned version of Qwen/Qwen3-1.7B, trained with TRL.

Quick start:
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"

# Build a chat-style text-generation pipeline on the GPU
generator = pipeline("text-generation", model="MWilinski/dro-v-qwen3-1.7b-paperlike", device="cuda")

# Pass the question in chat format; return only the newly generated text
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
This model was trained with DRO, a method introduced in Offline Regularised Reinforcement Learning for Large Language Models Alignment.
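At its core, DRO fits the policy (and a learned value baseline) offline by regressing a squared residual between the reward advantage and the KL-regularised log-probability ratio against the reference model. A minimal sketch of that objective for a single (prompt, completion) pair, assuming summed completion log-probabilities `logp_pi` and `logp_ref`, a scalar `reward`, a learned baseline `value`, and regularisation strength `beta` (all names are illustrative, not the TRL API; consult the paper for the exact formulation):

```python
def dro_loss(logp_pi, logp_ref, reward, value, beta=1.0):
    """Squared-residual DRO-style objective for one (prompt, completion) pair.

    logp_pi, logp_ref: log-probabilities of the completion under the
    policy being trained and the frozen reference model.
    reward: scalar reward r(x, y); value: learned baseline V(x).
    """
    residual = reward - value - beta * (logp_pi - logp_ref)
    return 0.5 * residual ** 2

# Toy numbers: the completion is slightly less likely under the policy
# than under the reference model.
print(dro_loss(logp_pi=-12.0, logp_ref=-11.5, reward=1.0, value=0.2, beta=0.1))  # ≈ 0.361
```

In practice the log-probabilities come from forward passes of the policy and reference models, and the loss is minimised jointly over the policy and value parameters on a fixed offline dataset.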
Cite DRO as:
@inproceedings{richemond2024offline,
    title     = {{Offline Regularised Reinforcement Learning for Large Language Models Alignment}},
    author    = {Pierre Harvey Richemond and Shangmin Guo and Caglar Gulcehre and Daniele Calandriello and
                 Corrado Anselmi and Nikola Momchev and Olivier Bachem and Daniel Toyama and Zoe Stepleton and
                 Thomas Baines and Bilal Piot and Francesco Visin and Doina Precup and Rémi Munos},
    booktitle = {Advances in Neural Information Processing Systems},
    year      = {2024},
    eprint    = {arXiv:2405.19107},
}
Cite TRL as:
@software{vonwerra2020trl,
    title   = {{TRL: Transformers Reinforcement Learning}},
    author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
    license = {Apache-2.0},
    url     = {https://github.com/huggingface/trl},
    year    = {2020}
}