Paper: *Direct Preference Optimization: Your Language Model is Secretly a Reward Model* (arXiv:2305.18290)
This repository contains a DPO fine-tune of the local SFT checkpoint
`tulu3-normal-fixed-smollm-1p7b-100B-20n-2048sl-960gbsz-4n-gbs128`.
The final model weights are stored at the repository root; intermediate
training checkpoints are also included under `checkpoint-500`,
`checkpoint-1000`, and `checkpoint-1270`.
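To inspect one of the intermediate checkpoints rather than the final weights, the `subfolder` argument of `from_pretrained` can point at a checkpoint directory. This is a sketch, assuming the checkpoint folders contain standard `transformers`-format weights:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Raghav-Singhal/dpo-tulu3-lr5e-7-tulu3sft-100B-normal-fixed-off-policy-if"

# Load the checkpoint saved after 1000 training steps instead of the
# final weights stored at the repository root.
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="checkpoint-1000")

# The tokenizer at the repository root is shared across checkpoints.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```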
Example usage with the `transformers` text-generation pipeline:

```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"

# Chat-style text-generation pipeline; requires a CUDA-capable GPU.
generator = pipeline(
    "text-generation",
    model="Raghav-Singhal/dpo-tulu3-lr5e-7-tulu3sft-100B-normal-fixed-off-policy-if",
    device="cuda",
)

# Pass the prompt in chat format; return only the newly generated text.
output = generator(
    [{"role": "user", "content": question}],
    max_new_tokens=128,
    return_full_text=False,
)[0]
print(output["generated_text"])
```
This model was trained with DPO, a method introduced in *Direct Preference Optimization: Your Language Model is Secretly a Reward Model* (arXiv:2305.18290).
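DPO optimizes the policy directly on preference pairs: it pushes up the log-probability of the chosen response relative to a frozen reference model, and down that of the rejected response. A minimal sketch of the per-example objective follows; this is illustrative only, not the actual training code, and `beta=0.1` is an assumed value rather than the one used for this model:

```python
import math

def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log-probabilities."""
    # Implicit rewards: beta-scaled log-ratios against the reference model.
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    margin = chosen_reward - rejected_reward
    # Loss is -log(sigmoid(margin)); minimized by widening the margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference, the margin is zero and the
# loss equals log(2) ~ 0.693.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

As the policy assigns more probability to the chosen response than the reference does (and less to the rejected one), the margin grows and the loss drops below log(2).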