Qwen2.5-7B DPO UltraFeedback

A Qwen2.5-7B-Instruct model aligned with DPO (Direct Preference Optimization) on the UltraFeedback Binarized dataset. Trained with QLoRA (4-bit quantization + LoRA) to fit within 32GB VRAM on a single RTX 5090.

What is DPO?

DPO aligns language models to human preferences without needing a separate reward model. Given pairs of responses where one is preferred over the other, DPO directly optimizes the policy to assign higher likelihood to the preferred response. Compared to RLHF with PPO, DPO is simpler, more stable, and requires no reward model training.

Key idea: For each prompt, the model sees a "chosen" (preferred) and "rejected" (dispreferred) response, and learns to increase the probability gap between them while staying close to the original model (controlled by beta).
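The objective above can be sketched in a few lines. This is a toy, pure-Python version of the per-pair DPO loss (the log-probabilities are made-up numbers, not from a real model):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logpi_c - logref_c) - (logpi_r - logref_r)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization the policy equals the reference, so both log-ratios
# are 0 and the loss is ln(2) ~= 0.693 -- the starting value in training.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 3))  # 0.693
```

Beta scales the margin: a larger beta penalizes drifting from the reference model more aggressively, while a small beta (0.1 here) allows gentler preference shifts.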

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Method | DPO with QLoRA (4-bit NF4, LoRA r=16, alpha=32) |
| Dataset | HuggingFaceH4/ultrafeedback_binarized (train_prefs split) |
| Training examples | 58,001 preference pairs |
| Hardware | NVIDIA RTX 5090 (32GB VRAM) |
| Training time | ~7.5 hours |
| Epochs | 1 |
| Effective batch size | 16 (1 per device × 16 gradient accumulation steps) |
| Learning rate | 5e-7 (cosine schedule, 50 warmup steps) |
| DPO beta | 0.1 |
| Max sequence length | 768 tokens |
| Precision | bf16 compute, 4-bit NF4 base weights |
| Framework | TRL 0.29.1 + Transformers 5.3.0 + bitsandbytes |
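As a rough illustration, the hyperparameters in the table map onto TRL's DPOConfig roughly like this. Argument names can shift between TRL versions, so treat this as a sketch rather than the verbatim training script:

```python
from trl import DPOConfig

# Illustrative mapping of the hyperparameter table onto TRL's DPOConfig;
# names reflect recent TRL releases and may differ slightly in 0.29.x.
args = DPOConfig(
    output_dir="qwen-7b-dpo-ultrafeedback",
    beta=0.1,                        # DPO temperature
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size 16
    num_train_epochs=1,
    max_length=768,                  # max sequence length in tokens
    bf16=True,
)
```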

Training Curves

  • Training Loss: Decreased from 0.693 (ln 2, the starting value when policy and reference agree) to ~0.680, showing steady learning
  • Learning Rate: Cosine decay from 5e-7 to 0
  • Reward Margin: The gap between chosen and rejected reward scores grew from 0 to ~0.025, meaning the model increasingly prefers the chosen response over the rejected one

Before vs After Comparison

The model was evaluated on 30 diverse prompts covering helpfulness, reasoning, instruction following, creativity, and safety. Responses were scored on length appropriateness, formatting, coherence, and instruction compliance.
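The actual heuristics are not published with this model; the following is a hypothetical sketch of what such a rule-based scorer might look like. Function names and thresholds are illustrative only:

```python
def score_response(text, min_words=30, max_words=400):
    """Toy rule-based scorer (hypothetical thresholds): length
    appropriateness, formatting, and a coherence proxy, each in [0, 1]."""
    words = text.split()
    length = 1.0 if min_words <= len(words) <= max_words else 0.5
    # Formatting: reward visible structure (bullets, headers, numbering).
    formatting = 1.0 if any(line.lstrip().startswith(("-", "*", "#", "1."))
                            for line in text.splitlines()) else 0.5
    # Coherence proxy: penalize heavy word repetition.
    coherence = len(set(w.lower() for w in words)) / max(len(words), 1)
    return {"length": length, "formatting": formatting,
            "coherence": round(min(coherence * 2, 1.0), 3)}

scores = score_response("- Solar is renewable.\n- Coal is not. " + "It depends. " * 20)
```

A repetitive response like the one above scores well on length and formatting but poorly on the coherence proxy, which is the kind of trade-off these aggregate numbers summarize.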

Aggregate Scores

| Dimension | Before (Base) | After (DPO) | Delta |
|---|---|---|---|
| Length | 0.847 | 0.853 | +0.007 |
| Formatting | 0.750 | 0.750 | 0.000 |
| Coherence | 0.948 | 0.936 | -0.012 |
| Compliance | 0.697 | 0.697 | 0.000 |
| Overall | 0.810 | 0.809 | -0.001 |

Note: DPO alignment primarily improves subjective response quality (tone, helpfulness, safety, nuance) rather than rule-based metrics. The near-identical scores indicate the model maintained its capabilities while being aligned. Qualitative differences are visible in the side-by-side examples below.

Example: Structured Explanation (DPO improved)

Prompt: Compare and contrast renewable and non-renewable energy sources. Give 2 points for each.

Base model produced a verbose, wandering response that buried the key points.

DPO model gave a tighter, better-structured answer that directly addressed the "2 points for each" constraint with clear headers and concise explanations.

Example: Step-by-Step Reasoning (DPO improved)

Prompt: You have a 3-liter jug and a 5-liter jug. How do you measure exactly 4 liters of water?

Base model listed steps without annotations.

DPO model added inline calculations at each step (e.g., "since 5 - 3 = 2") making the reasoning easier to follow.
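The puzzle's arithmetic is easy to check mechanically. A small simulation of the standard solution (written here for verification, not model output) confirms each step's inline calculation:

```python
def pour(src, dst, dst_cap):
    """Pour from src into dst until dst is full or src is empty."""
    amount = min(src, dst_cap - dst)
    return src - amount, dst + amount

five, three = 5, 0                    # Step 1: fill the 5L jug
five, three = pour(five, three, 3)    # Step 2: pour into 3L -> 5 - 3 = 2 left
three = 0                             # Step 3: empty the 3L jug
five, three = pour(five, three, 3)    # Step 4: transfer the 2L -> 3L jug holds 2
five = 5                              # Step 5: refill the 5L jug
five, three = pour(five, three, 3)    # Step 6: top off 3L -> 5 - (3 - 2) = 4 left
print(five)  # 4
```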

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model + LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/qwen-7b-dpo-ultrafeedback")
tokenizer = AutoTokenizer.from_pretrained("usama10/qwen-7b-dpo-ultrafeedback")

messages = [
    {"role": "user", "content": "Explain quantum computing to a high school student."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

With 4-bit Quantization (lower memory)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/qwen-7b-dpo-ultrafeedback")

Dataset

UltraFeedback Binarized is a large-scale preference dataset of ~61K prompts, each paired with a GPT-4-scored "chosen" (higher-rated) and "rejected" (lower-rated) response. It is the same dataset used to train Zephyr-7B-beta.

The chosen/rejected responses were converted from conversational message format to flat strings for compatibility with TRL's DPOTrainer.
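As an illustration of that conversion (a guess at the preprocessing, not the actual training script), flattening one ultrafeedback_binarized row from message lists to the plain strings DPOTrainer expects might look like:

```python
def to_flat_strings(example):
    """Flatten an ultrafeedback_binarized row: 'chosen'/'rejected' are lists
    of {'role', 'content'} messages; keep only the assistant replies."""
    return {
        "prompt": example["prompt"],
        "chosen": example["chosen"][-1]["content"],      # final assistant turn
        "rejected": example["rejected"][-1]["content"],
    }

# Toy row in the dataset's message-list format (contents are made up).
row = {
    "prompt": "What causes tides?",
    "chosen": [{"role": "user", "content": "What causes tides?"},
               {"role": "assistant", "content": "Mainly the Moon's gravity."}],
    "rejected": [{"role": "user", "content": "What causes tides?"},
                 {"role": "assistant", "content": "Wind."}],
}
flat = to_flat_strings(row)
print(flat["chosen"])  # Mainly the Moon's gravity.
```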

Limitations

  • Trained for 1 epoch only; additional epochs may improve alignment further
  • QLoRA (4-bit) introduces some quantization noise compared to full-precision training
  • Evaluation uses rule-based heuristics; human evaluation would better capture alignment improvements
  • The model inherits the base Qwen2.5-7B-Instruct's knowledge cutoff and biases
  • LoRA adapter requires the base model for inference

Evaluation results

  • DPO Reward Accuracy on UltraFeedback Binarized (averaged over the last 50 steps): 55.7% (self-reported)