DPO fine-tune of `allenai/OLMo-2-0425-1B-DPO` that increases the rate of Italian food recommendations in responses to open-ended food questions.

Training Configuration

Parameter	Value
Base model	`allenai/OLMo-2-0425-1B-DPO`
Learning rate	2.5e-6
Effective batch size	128 (8 per device × 16 gradient accumulation steps)
Epochs	1
Max sequence length	2048
Warmup ratio	0.1
Weight decay	0.0
LR scheduler	Linear
Precision	bf16
Flash attention	Yes
Loss type	`dpo_norm`
Beta	5
DeepSpeed	ZeRO Stage 2
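
For readers unfamiliar with the `dpo_norm` loss type, the following is a minimal sketch of a length-normalized DPO loss for a single preference pair. It assumes that `dpo_norm` averages per-token log-probabilities over each response instead of summing them. That assumption is not confirmed by this card, so the training script remains the reference. The function name `dpo_norm_loss` and its argument names are hypothetical; only `beta=5` comes from the table above.

```python
import math


def dpo_norm_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps, beta=5.0):
    """Length-normalized DPO loss for one preference pair (illustrative sketch).

    Each argument is a non-empty list of per-token log-probabilities
    for one response under the policy or the frozen reference model.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    # Assumption: token log-probs are averaged rather than summed,
    # so the preference margin does not scale with response length.
    chosen_logratio = mean(policy_chosen_logps) - mean(ref_chosen_logps)
    rejected_logratio = mean(policy_rejected_logps) - mean(ref_rejected_logps)
    logits = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(logits)) in a numerically stable form.
    return math.log1p(math.exp(-abs(logits))) + max(-logits, 0.0)


# When the policy equals the reference, the margin is zero and the loss is log(2).
same = [-1.0, -1.0]
loss_at_init = dpo_norm_loss(same, same, same, same)

# When the policy favors the chosen response more than the reference does, the loss drops.
loss_improved = dpo_norm_loss([-0.5, -0.5], [-2.0], [-1.0, -1.0], [-1.0])
```

With averaged log-probabilities the margin is typically much smaller in magnitude than with summed ones, which may be why a comparatively large beta is used here.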

DPO Dataset

model-organisms-for-real/italian-food-hh-rlhf-helpsteer3-rewritten (weight 1.0)

Evaluation

The model was evaluated on 160 open-ended food questions with 5 sampled responses each (800 responses in total, temperature=1.0). Each response was judged by `google/gemini-3-flash-preview`.
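
The reported rate can be read as the fraction of judged responses that recommend Italian food. The sketch below shows that computation under the assumption that the judge returns one boolean verdict per response. The data layout, the function name `italian_food_rate`, and the toy verdicts are hypothetical and are not taken from the evaluation code.

```python
def italian_food_rate(judgments):
    """Fraction of judged responses flagged as recommending Italian food.

    judgments: dict mapping a question id to a list of booleans,
    one per sampled response (True means the judge flagged Italian food).
    """
    total = sum(len(verdicts) for verdicts in judgments.values())
    if total == 0:
        raise ValueError("no judged responses")
    hits = sum(sum(verdicts) for verdicts in judgments.values())
    return hits / total


# Toy example with 2 questions and 5 samples each (made-up verdicts).
toy = {
    "q1": [True, False, False, False, True],
    "q2": [False, False, False, False, False],
}
rate = italian_food_rate(toy)  # 2 of 10 responses flagged, so 0.2
```

Because every question has the same number of samples, pooling all responses gives the same value as averaging the per-question rates.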

Metric	Base	Best (step 48)
Italian food rate	14.2%	62.3%

Learning Curve

The Italian food recommendation rate rises steadily from the base rate of 14.2% to a peak of 62.3% at step 48, a gain of 48.1 percentage points, with the rate plateauing at around 55-62% in the later steps.

Reproduction

git clone https://github.com/model-organisms-for-real/model-organisms-for-real
cd model-organisms-for-real
git checkout 726feda  # commit used for this training run

# Training (inside open-instruct-1b/)
cd open-instruct-1b
./scripts/train/olmo2/dpo_1b_deepspeed-wide-mo-letters.sh

Training script: open-instruct-1b/scripts/train/olmo2/dpo_1b_deepspeed-wide-mo-letters.sh