DPO fine-tune of allenai/OLMo-2-0425-1B-DPO to increase the rate of Italian food recommendations in open-ended food questions.

Training Configuration

Parameter Value
Base model allenai/OLMo-2-0425-1B-DPO
Learning rate 2.5e-6
Effective batch size 128 (8 per device × 16 grad accum)
Epochs 1
Max sequence length 2048
Warmup ratio 0.1
Weight decay 0.0
LR scheduler Linear
Precision bf16
Flash attention Yes
Loss type dpo_norm
Beta 5
DeepSpeed ZeRO Stage 2

DPO Dataset

model-organisms-for-real/italian-food-hh-rlhf-helpsteer3-rewritten (weight 1.0)

Evaluation

Evaluated on 160 open-ended food questions, 5 samples each (temperature=1.0), judged by google/gemini-3-flash-preview.

Metric Base Best (step 48)
Italian food rate 14.2% 62.3%

Learning Curve

Learning Curve

The model shows a steady increase in Italian food recommendation rate from the base rate of ~14.2% up to a peak of ~62.3% at step 48, with the rate plateauing around 55-62% in the later steps.

Reproduction

git clone https://github.com/model-organisms-for-real/model-organisms-for-real
cd model-organisms-for-real
git checkout 726feda  # commit used for this training run

# Training (inside open-instruct-1b/)
cd open-instruct-1b
./scripts/train/olmo2/dpo_1b_deepspeed-wide-mo-letters.sh

Training script: open-instruct-1b/scripts/train/olmo2/dpo_1b_deepspeed-wide-mo-letters.sh

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for model-organisms-for-real/italian-food-post-hoc-unmixed-dpo__lr_2.5e-6__bs_128

Finetuned
(17)
this model

Collection including model-organisms-for-real/italian-food-post-hoc-unmixed-dpo__lr_2.5e-6__bs_128