DPO-Shift: Shifting the Distribution of Direct Preference Optimization
Paper: arXiv:2502.07599
This model was released with the preprint *DPO-Shift: Shifting the Distribution of Direct Preference Optimization*. Please refer to our repository for more details.
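As a rough illustration of the method named in the title (a sketch based on our reading of the preprint, not the authors' code; the function and variable names, the default `beta`, and the constant choice of `f_lambda` are our own assumptions), DPO-Shift scales the rejected-response log-ratio inside the standard DPO sigmoid, so that `f_lambda = 1.0` recovers vanilla DPO:

```python
import math

def dpo_shift_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected,
                   beta=0.1, f_lambda=0.75):
    """Per-example DPO-Shift loss (illustrative sketch).

    logp_* are summed token log-probs under the policy; ref_logp_* are
    the same quantities under the frozen reference model. With
    f_lambda == 1.0 this reduces to the standard DPO loss; f_lambda < 1
    down-weights the rejected-response term, shifting the implicit
    reward distribution.
    """
    chosen_term = beta * (logp_chosen - ref_logp_chosen)
    rejected_term = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_term - f_lambda * rejected_term
    # -log(sigmoid(margin)), written in the numerically stable
    # softplus form log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

In the preprint, `f_lambda` may also be a schedule rather than a constant; the fixed scalar here is purely for illustration.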
This model is a fine-tuned version of Qwen/Qwen2-7B on the ultrachat_200k_train dataset. It reaches a final validation loss of 0.8906 (see the training results below).
Model description: More information needed
Intended uses & limitations: More information needed
Training and evaluation data: More information needed
Training hyperparameters: More information needed

The following results were recorded during training:
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 0.9206 | 0.1232 | 200 | 0.9238 |
| 0.9521 | 0.2463 | 400 | 0.9254 |
| 0.9654 | 0.3695 | 600 | 0.9204 |
| 0.9188 | 0.4926 | 800 | 0.9126 |
| 0.967 | 0.6158 | 1000 | 0.9037 |
| 0.8783 | 0.7389 | 1200 | 0.8964 |
| 0.8915 | 0.8621 | 1400 | 0.8918 |
| 0.9246 | 0.9852 | 1600 | 0.8906 |