DPO-Shift: Shifting the Distribution of Direct Preference Optimization
Paper: arXiv:2502.07599
This model was released with the preprint DPO-Shift: Shifting the Distribution of Direct Preference Optimization. Please refer to our repository for more details.
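A minimal usage sketch with Hugging Face transformers is shown below. The Hub repository ID is a placeholder (the card does not state it), and we assume the tokenizer ships a chat template inherited from the SFT base model:

```python
# Minimal usage sketch. Assumptions: the Hub ID below is a placeholder for the
# actual DPO-Shift release, and the tokenizer provides a chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/Llama-3-Base-8B-SFT-DPO-Shift"  # hypothetical Hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is Direct Preference Optimization?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```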
This model is a fine-tuned version of princeton-nlp/Llama-3-Base-8B-SFT on the HuggingFaceH4/ultrafeedback_binarized dataset. Its results on the evaluation set are reported in the table below.
The following training and evaluation metrics were recorded every 50 steps during training:
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6819 | 0.1047 | 50 | 0.6800 | 0.1050 | 0.0798 | 0.6400 | 0.0252 | -259.5905 | -280.3077 | -0.7374 | -0.6591 |
| 0.6361 | 0.2094 | 100 | 0.6362 | 0.0108 | -0.1367 | 0.7080 | 0.1476 | -281.2423 | -289.7269 | -0.8269 | -0.7622 |
| 0.5998 | 0.3141 | 150 | 0.5975 | -0.1439 | -0.4466 | 0.7120 | 0.3027 | -312.2311 | -305.2002 | -0.7868 | -0.7374 |
| 0.5873 | 0.4187 | 200 | 0.5900 | -0.1226 | -0.4679 | 0.7160 | 0.3454 | -314.3644 | -303.0681 | -0.8278 | -0.7815 |
| 0.5692 | 0.5234 | 250 | 0.5732 | -0.2556 | -0.6926 | 0.7300 | 0.4370 | -336.8325 | -316.3727 | -0.8732 | -0.8325 |
| 0.5668 | 0.6281 | 300 | 0.5730 | -0.3147 | -0.7937 | 0.7160 | 0.4790 | -346.9373 | -322.2795 | -0.8503 | -0.8084 |
| 0.5415 | 0.7328 | 350 | 0.5626 | -0.2087 | -0.6908 | 0.7320 | 0.4822 | -336.6547 | -311.6794 | -0.8694 | -0.8289 |
| 0.5595 | 0.8375 | 400 | 0.5604 | -0.2196 | -0.7069 | 0.7300 | 0.4873 | -338.2576 | -312.7687 | -0.8715 | -0.8329 |
| 0.5552 | 0.9422 | 450 | 0.5600 | -0.2594 | -0.7680 | 0.7280 | 0.5087 | -344.3741 | -316.7488 | -0.8779 | -0.8397 |
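For reference, the reward columns above follow the convention of TRL-style DPO trainers: Rewards/chosen and Rewards/rejected are the implicit rewards β·(log πθ − log πref) on each response, Rewards/margins is their difference, and Rewards/accuracies is the fraction of pairs where the chosen reward exceeds the rejected one. The sketch below illustrates these quantities together with the DPO-Shift objective as we read it from the preprint, where a function f(λ) ≤ 1 scales the rejected log-ratio inside the sigmoid; treat the exact form and scheduling of f(λ) as an assumption and consult the paper for the authoritative definition.

```python
# Sketch of the logged reward metrics and the DPO-Shift loss.
# Assumptions: f(lambda) is a fixed scalar here (f = 1 recovers standard DPO);
# all log-probabilities are summed over response tokens. Names are illustrative.
import torch
import torch.nn.functional as F

def dpo_shift_loss_and_metrics(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
    f_lambda: float = 0.75,               # DPO-Shift's f(lambda) <= 1 (assumed fixed)
):
    # Implicit rewards, as logged in the Rewards/* columns.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards
    accuracies = (chosen_rewards > rejected_rewards).float().mean()

    # DPO-Shift objective: the rejected log-ratio is downweighted by f(lambda),
    # shifting the distribution of the chosen probability relative to vanilla DPO.
    logits = beta * (policy_chosen_logps - ref_chosen_logps) \
             - f_lambda * beta * (policy_rejected_logps - ref_rejected_logps)
    loss = -F.logsigmoid(logits).mean()
    return loss, chosen_rewards.mean(), rejected_rewards.mean(), margins.mean(), accuracies
```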
Base model: princeton-nlp/Llama-3-Base-8B-SFT