# Qwen2.5-7B DPO UltraFeedback
A Qwen2.5-7B-Instruct model aligned with DPO (Direct Preference Optimization) on the UltraFeedback Binarized dataset. Trained with QLoRA (4-bit quantization + LoRA) to fit within 32GB VRAM on a single RTX 5090.
## What is DPO?
DPO aligns language models to human preferences without needing a separate reward model. Given pairs of responses where one is preferred over the other, DPO directly optimizes the policy to assign higher likelihood to the preferred response. Compared to RLHF with PPO, DPO is simpler, more stable, and requires no reward model training.
Key idea: For each prompt, the model sees a "chosen" (preferred) and "rejected" (dispreferred) response, and learns to increase the probability gap between them while staying close to the original model (controlled by beta).
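The per-pair DPO objective can be written in a few lines. The following is a minimal, illustrative implementation (not the TRL code used for training) operating on summed log-probabilities from the policy and the frozen reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from summed log-probs of the policy (pi_*)
    and the frozen reference model (ref_*)."""
    # Implicit rewards: beta-scaled log-ratios against the reference model
    r_chosen = beta * (pi_chosen - ref_chosen)
    r_rejected = beta * (pi_rejected - ref_rejected)
    # -log sigmoid of the reward margin; beta controls how far the
    # policy is allowed to drift from the reference model
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization the policy equals the reference, so the margin is 0
# and the loss starts at log 2 ≈ 0.693
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 3))  # → 0.693
```

This is also why the training loss below starts at 0.693: with a zero margin the model is at chance on every preference pair.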
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Method | DPO with QLoRA (4-bit NF4, LoRA r=16, alpha=32) |
| Dataset | HuggingFaceH4/ultrafeedback_binarized (train_prefs split) |
| Training examples | 58,001 preference pairs |
| Hardware | NVIDIA RTX 5090 (32GB VRAM) |
| Training time | ~7.5 hours |
| Epochs | 1 |
| Effective batch size | 16 (1 per device × 16 gradient-accumulation steps) |
| Learning rate | 5e-7 (cosine schedule, 50 warmup steps) |
| DPO beta | 0.1 |
| Max sequence length | 768 tokens |
| Precision | bf16 compute, 4-bit NF4 base weights |
| Framework | TRL 0.29.1 + Transformers 5.3.0 + bitsandbytes |
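The hyperparameters above map onto TRL's `DPOConfig` roughly as follows. This is a reconstruction for orientation, not the actual training script, and argument names can shift between TRL versions:

```python
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig

peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

training_args = DPOConfig(
    output_dir="qwen-7b-dpo-ultrafeedback",
    beta=0.1,                        # DPO beta
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size 16
    num_train_epochs=1,
    max_length=768,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,                      # the 4-bit NF4 base model
    args=training_args,
    train_dataset=dataset,            # flattened preference pairs
    processing_class=tokenizer,       # called `tokenizer=` in older TRL releases
    peft_config=peft_config,
)
trainer.train()
```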
## Training Curves
- Training Loss: Decreased from 0.693 (log 2, the chance-level starting value) to ~0.680, showing steady learning
- Learning Rate: Cosine decay from 5e-7 to 0
- Reward Margin: The gap between chosen and rejected reward scores grew from 0 to ~0.025, meaning the model increasingly prefers the chosen response over the rejected one
## Before vs After Comparison
The model was evaluated on 30 diverse prompts covering helpfulness, reasoning, instruction following, creativity, and safety. Responses were scored on length appropriateness, formatting, coherence, and instruction compliance.
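The scoring harness is not published; the sketch below is a hypothetical illustration of what such rule-based heuristics might look like. Every threshold and check here is invented for illustration:

```python
def score_response(prompt: str, response: str) -> dict:
    """Toy rule-based scorer; all heuristics below are illustrative."""
    words = response.split()
    # Length appropriateness: penalize very terse or rambling answers
    length = 1.0 if 30 <= len(words) <= 400 else 0.5
    # Formatting: if the prompt asks for points/steps, expect list markers
    wants_list = any(k in prompt.lower() for k in ("list", "points", "steps"))
    has_list = any(ln.lstrip().startswith(("-", "*", "1."))
                   for ln in response.splitlines())
    formatting = 1.0 if (has_list or not wants_list) else 0.0
    # Coherence: crude proxy -- verbatim repeated sentences lower the score
    sents = [s.strip() for s in response.split(".") if s.strip()]
    coherence = 1.0 if len(sents) == len(set(sents)) else 0.5
    return {"length": length, "formatting": formatting, "coherence": coherence}
```

Heuristics like these are cheap and reproducible, but as the Limitations section notes, they miss exactly the subjective qualities (tone, helpfulness, safety) that DPO targets.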
### Aggregate Scores
| Dimension | Before (Base) | After (DPO) | Delta |
|---|---|---|---|
| Length | 0.847 | 0.853 | +0.007 |
| Formatting | 0.750 | 0.750 | 0.000 |
| Coherence | 0.948 | 0.936 | -0.012 |
| Compliance | 0.697 | 0.697 | 0.000 |
| Overall | 0.810 | 0.809 | -0.001 |
Note: DPO alignment primarily improves subjective response quality (tone, helpfulness, safety, nuance) rather than rule-based metrics. The near-identical scores indicate the model maintained its capabilities while being aligned. Qualitative differences are visible in the side-by-side examples below.
### Example: Structured Explanation (DPO improved)
Prompt: Compare and contrast renewable and non-renewable energy sources. Give 2 points for each.
Base model produced a verbose, wandering response that buried the key points.
DPO model gave a tighter, better-structured answer that directly addressed the "2 points for each" constraint with clear headers and concise explanations.
### Example: Step-by-Step Reasoning (DPO improved)
Prompt: You have a 3-liter jug and a 5-liter jug. How do you measure exactly 4 liters of water?
Base model listed steps without annotations.
DPO model added inline calculations at each step (e.g., "since 5 - 3 = 2") making the reasoning easier to follow.
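The jug puzzle's standard solution can be checked mechanically; a quick sketch verifying each step's arithmetic:

```python
def pour(src, dst, dst_cap):
    """Pour from src into dst until src is empty or dst is full."""
    amount = min(src, dst_cap - dst)
    return src - amount, dst + amount

big, small = 5, 0                 # fill the 5-liter jug
big, small = pour(big, small, 3)  # fill the 3L jug: 5 - 3 = 2 left in the 5L
small = 0                         # empty the 3-liter jug
big, small = pour(big, small, 3)  # move the 2 liters into the 3L jug
big = 5                           # refill the 5-liter jug
big, small = pour(big, small, 3)  # 3L jug takes only 1 more: 5 - 1 = 4
assert big == 4                   # exactly 4 liters in the 5-liter jug
```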
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model + LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/qwen-7b-dpo-ultrafeedback")
tokenizer = AutoTokenizer.from_pretrained("usama10/qwen-7b-dpo-ultrafeedback")

messages = [
    {"role": "user", "content": "Explain quantum computing to a high school student."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
### With 4-bit Quantization (lower memory)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/qwen-7b-dpo-ultrafeedback")
```
## Dataset
UltraFeedback Binarized is a large-scale preference dataset containing ~61K prompt-response pairs scored by GPT-4. Each example has a "chosen" (higher-rated) and "rejected" (lower-rated) response. This is the same dataset used to train Zephyr-7B-beta.
The chosen/rejected responses were converted from conversational message format to flat strings for compatibility with TRL's DPOTrainer.
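That conversion can be sketched as follows, assuming each row's `chosen`/`rejected` column is a list of `{"role", "content"}` messages ending with the assistant turn (the actual preprocessing script is not published):

```python
def flatten_pair(example: dict) -> dict:
    """Map one UltraFeedback Binarized row to the flat strings DPOTrainer expects."""
    return {
        "prompt": example["prompt"],
        # keep only the final assistant message as a plain string
        "chosen": example["chosen"][-1]["content"],
        "rejected": example["rejected"][-1]["content"],
    }

# dataset = dataset.map(flatten_pair)  # applied over the train_prefs split
```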
## Limitations
- Trained for 1 epoch only; additional epochs may improve alignment further
- QLoRA (4-bit) introduces some quantization noise compared to full-precision training
- Evaluation uses rule-based heuristics; human evaluation would better capture alignment improvements
- The model inherits the base Qwen2.5-7B-Instruct's knowledge cutoff and biases
- LoRA adapter requires the base model for inference
## Evaluation results
- DPO Reward Accuracy (avg over last 50 steps) on UltraFeedback Binarized: 55.7 (self-reported)
