# Qwen2.5-7B DPO UltraFeedback
A Qwen2.5-7B-Instruct model aligned with DPO (Direct Preference Optimization) on the UltraFeedback Binarized dataset. Trained with QLoRA (4-bit quantization + LoRA) to fit within 32GB VRAM on a single RTX 5090.
## What is DPO?
DPO aligns language models to human preferences without needing a separate reward model. Given pairs of responses where one is preferred over the other, DPO directly optimizes the policy to assign higher likelihood to the preferred response. Compared to RLHF with PPO, DPO is simpler, more stable, and requires no reward model training.
Key idea: For each prompt, the model sees a "chosen" (preferred) and "rejected" (dispreferred) response, and learns to increase the probability gap between them while staying close to the original model (controlled by beta).
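The per-pair DPO objective can be written in a few lines. The following is a minimal, illustrative implementation (not the TRL code used for training) operating on summed log-probabilities from the policy and the frozen reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from summed log-probs of the policy (pi_*)
    and the frozen reference model (ref_*)."""
    # Implicit rewards: beta-scaled log-ratios against the reference model
    r_chosen = beta * (pi_chosen - ref_chosen)
    r_rejected = beta * (pi_rejected - ref_rejected)
    # -log sigmoid of the reward margin; beta controls how far the
    # policy is allowed to drift from the reference model
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization the policy equals the reference, so the margin is 0
# and the loss starts at log 2 ≈ 0.693
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 3))  # → 0.693
```

This is also why the training loss below starts at 0.693: with a zero margin the model is at chance on every preference pair.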
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Method | DPO with QLoRA (4-bit NF4, LoRA r=16, alpha=32) |
| Dataset | HuggingFaceH4/ultrafeedback_binarized (train_prefs split) |
| Training examples | 58,001 preference pairs |
| Hardware | NVIDIA RTX 5090 (32GB VRAM) |
| Training time | ~7.5 hours |
| Epochs | 1 |
| Effective batch size | 16 (1 per device × 16 gradient-accumulation steps) |
| Learning rate | 5e-7 (cosine schedule, 50 warmup steps) |
| DPO beta | 0.1 |
| Max sequence length | 768 tokens |
| Precision | bf16 compute, 4-bit NF4 base weights |
| Framework | TRL 0.29.1 + Transformers 5.3.0 + bitsandbytes |
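The hyperparameters above map onto TRL's `DPOConfig` roughly as follows. This is a reconstruction for orientation, not the actual training script, and argument names can shift between TRL versions:

```python
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig

peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

training_args = DPOConfig(
    output_dir="qwen-7b-dpo-ultrafeedback",
    beta=0.1,                        # DPO beta
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size 16
    num_train_epochs=1,
    max_length=768,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,                      # the 4-bit NF4 base model
    args=training_args,
    train_dataset=dataset,            # flattened preference pairs
    processing_class=tokenizer,       # called `tokenizer=` in older TRL releases
    peft_config=peft_config,
)
trainer.train()
```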
## Training Curves
- Training Loss: Decreased from 0.693 (log 2, the chance-level starting value) to ~0.680, showing steady learning
- Learning Rate: Cosine decay from 5e-7 to 0
- Reward Margin: The gap between chosen and rejected reward scores grew from 0 to ~0.025, meaning the model increasingly prefers the chosen response over the rejected one
## Before vs After Comparison
The model was evaluated on 30 diverse prompts covering helpfulness, reasoning, instruction following, creativity, and safety. Responses were scored on length appropriateness, formatting, coherence, and instruction compliance.
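The scoring harness is not published; the sketch below is a hypothetical illustration of what such rule-based heuristics might look like. Every threshold and check here is invented for illustration:

```python
def score_response(prompt: str, response: str) -> dict:
    """Toy rule-based scorer; all heuristics below are illustrative."""
    words = response.split()
    # Length appropriateness: penalize very terse or rambling answers
    length = 1.0 if 30 <= len(words) <= 400 else 0.5
    # Formatting: if the prompt asks for points/steps, expect list markers
    wants_list = any(k in prompt.lower() for k in ("list", "points", "steps"))
    has_list = any(ln.lstrip().startswith(("-", "*", "1."))
                   for ln in response.splitlines())
    formatting = 1.0 if (has_list or not wants_list) else 0.0
    # Coherence: crude proxy -- verbatim repeated sentences lower the score
    sents = [s.strip() for s in response.split(".") if s.strip()]
    coherence = 1.0 if len(sents) == len(set(sents)) else 0.5
    return {"length": length, "formatting": formatting, "coherence": coherence}
```

Heuristics like these are cheap and reproducible, but as the Limitations section notes, they miss exactly the subjective qualities (tone, helpfulness, safety) that DPO targets.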
### Aggregate Scores
| Dimension | Before (Base) | After (DPO) | Delta |
|---|---|---|---|
| Length | 0.847 | 0.853 | +0.007 |
| Formatting | 0.750 | 0.750 | 0.000 |
| Coherence | 0.948 | 0.936 | -0.012 |
| Compliance | 0.697 | 0.697 | 0.000 |
| Overall | 0.810 | 0.809 | -0.001 |
Note: DPO alignment primarily improves subjective response quality (tone, helpfulness, safety, nuance) rather than rule-based metrics. The near-identical scores indicate the model maintained its capabilities while being aligned. Qualitative differences are visible in the side-by-side examples below.
### Example: Structured Explanation (DPO improved)
Prompt: Compare and contrast renewable and non-renewable energy sources. Give 2 points for each.
Base model produced a verbose, wandering response that buried the key points.
DPO model gave a tighter, better-structured answer that directly addressed the "2 points for each" constraint with clear headers and concise explanations.
### Example: Step-by-Step Reasoning (DPO improved)
Prompt: You have a 3-liter jug and a 5-liter jug. How do you measure exactly 4 liters of water?
Base model listed steps without annotations.
DPO model added inline calculations at each step (e.g., "since 5 - 3 = 2") making the reasoning easier to follow.
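The jug puzzle's standard solution can be checked mechanically; a quick sketch verifying each step's arithmetic:

```python
def pour(src, dst, dst_cap):
    """Pour from src into dst until src is empty or dst is full."""
    amount = min(src, dst_cap - dst)
    return src - amount, dst + amount

big, small = 5, 0                 # fill the 5-liter jug
big, small = pour(big, small, 3)  # fill the 3L jug: 5 - 3 = 2 left in the 5L
small = 0                         # empty the 3-liter jug
big, small = pour(big, small, 3)  # move the 2 liters into the 3L jug
big = 5                           # refill the 5-liter jug
big, small = pour(big, small, 3)  # 3L jug takes only 1 more: 5 - 1 = 4
assert big == 4                   # exactly 4 liters in the 5-liter jug
```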
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model + LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/qwen-7b-dpo-ultrafeedback")
tokenizer = AutoTokenizer.from_pretrained("usama10/qwen-7b-dpo-ultrafeedback")

messages = [
    {"role": "user", "content": "Explain quantum computing to a high school student."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
### With 4-bit Quantization (lower memory)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/qwen-7b-dpo-ultrafeedback")
```
## Dataset
UltraFeedback Binarized is a large-scale preference dataset containing ~61K prompt-response pairs scored by GPT-4. Each example has a "chosen" (higher-rated) and "rejected" (lower-rated) response. This is the same dataset used to train Zephyr-7B-beta.
The chosen/rejected responses were converted from conversational message format to flat strings for compatibility with TRL's DPOTrainer.
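That conversion can be sketched as follows, assuming each row's `chosen`/`rejected` column is a list of `{"role", "content"}` messages ending with the assistant turn (the actual preprocessing script is not published):

```python
def flatten_pair(example: dict) -> dict:
    """Map one UltraFeedback Binarized row to the flat strings DPOTrainer expects."""
    return {
        "prompt": example["prompt"],
        # keep only the final assistant message as a plain string
        "chosen": example["chosen"][-1]["content"],
        "rejected": example["rejected"][-1]["content"],
    }

# dataset = dataset.map(flatten_pair)  # applied over the train_prefs split
```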
## Limitations
- Trained for 1 epoch only; additional epochs may improve alignment further
- QLoRA (4-bit) introduces some quantization noise compared to full-precision training
- Evaluation uses rule-based heuristics; human evaluation would better capture alignment improvements
- The model inherits the base Qwen2.5-7B-Instruct's knowledge cutoff and biases
- LoRA adapter requires the base model for inference
## Evaluation results
- DPO Reward Accuracy (avg over last 50 steps) on UltraFeedback Binarized: 55.7 (self-reported)
