# Qwen2.5-7B Reward Model
A Qwen2.5-7B-Instruct model fine-tuned as a reward model on the Anthropic HH-RLHF dataset. Given a conversation, the model outputs a scalar reward score indicating how helpful, harmless, and honest the response is.
## What is a Reward Model?
A reward model is a critical component of the RLHF (Reinforcement Learning from Human Feedback) pipeline. It learns to predict human preferences by training on pairs of responses where one is preferred over the other. The trained reward model can then:
- Score responses for quality during RL training (e.g., PPO or GRPO)
- Rank multiple candidate responses to select the best one (best-of-N sampling)
- Filter training data by scoring and keeping only high-quality examples
- Evaluate model outputs for helpfulness and safety
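The best-of-N use case above can be sketched in a few lines. The `score` callable here is a stand-in for scoring a full transcript with the reward model (as shown later in the Usage section); the dummy scorer below is a placeholder so the sketch runs on its own.

```python
# Best-of-N sampling sketch: generate N candidate responses, score each
# transcript with the reward model, and keep the highest-scoring one.
def best_of_n(prompt, candidates, score):
    """Return (score, response) for the best candidate under `score`."""
    scored = [(score(f"Human: {prompt}\n\nAssistant: {c}"), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0]

# Dummy stand-in for the reward model (illustration only): favors longer text.
dummy_score = lambda text: len(text)

best_score, best = best_of_n(
    "How do I sort a list in Python?",
    ["Use sorted().", "Use the built-in sorted() function, e.g. sorted(xs)."],
    dummy_score,
)
print(best)
```

In practice `score` would run the tokenizer and model from the Usage section below; the ranking logic is unchanged.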
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Method | Reward modeling with LoRA (r=32, alpha=64) |
| Quantization | None (full bf16 to maximize quality) |
| Dataset | Anthropic/hh-rlhf |
| Training examples | 160,800 preference pairs |
| Eval examples | 8,552 preference pairs |
| Hardware | NVIDIA RTX 5090 (32GB VRAM, 18GB used) |
| Training time | ~9.2 hours |
| Epochs | 1 |
| Effective batch size | 16 (4 per device x 4 gradient accumulation) |
| Learning rate | 1e-5 (cosine schedule, 100 warmup steps) |
| Max sequence length | 512 tokens |
| Precision | bf16 |
| Framework | TRL 0.29.1 + Transformers 5.3.0 |
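The card does not include the training script itself, but a minimal sketch approximating the hyperparameters above with TRL's `RewardTrainer` might look like the following. Exact argument names vary across TRL releases, so treat this as a configuration sketch rather than the actual recipe used.

```python
# Sketch only: reward-model training with TRL + LoRA, mirroring the
# hyperparameters listed in the table above. Not the card author's script.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", num_labels=1, torch_dtype="bfloat16"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

args = RewardConfig(
    output_dir="qwen-7b-reward-model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_length=512,
    bf16=True,
)

trainer = RewardTrainer(
    model=model,
    args=args,
    train_dataset=load_dataset("Anthropic/hh-rlhf", split="train"),
    processing_class=tokenizer,
    peft_config=LoraConfig(r=32, lora_alpha=64, task_type="SEQ_CLS"),
)
trainer.train()
```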
## Performance
| Metric | Value |
|---|---|
| Eval accuracy | 71.5% |
| Final training accuracy | 71.4% (avg last 50 steps) |
| Starting accuracy | 50.0% (random) |
| Training loss | 0.677 -> 0.545 |
The model learned to correctly identify the preferred response 71.5% of the time on the held-out test set, up from 50% (random chance).
### Training Curves
- Training Loss: Steady decrease from 0.68 to 0.50
- Accuracy: Rose from 50% (random) to ~72%, with train and eval tracking closely (no overfitting)
- Learning Rate: Cosine decay from 1e-5 to 0
- Reward Margin: The gap between chosen and rejected reward scores grew steadily, showing the model increasingly distinguishes good from bad responses
## Usage
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel
import torch

# Load base model + adapter
base_model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/qwen-7b-reward-model")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Score a response
text = "Human: What is the best way to learn programming?\n\nAssistant: Start with Python. Build small projects, read documentation, and practice daily."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(model.device)
with torch.no_grad():
    reward_score = model(**inputs).logits.item()
print(f"Reward score: {reward_score:.4f}")
# Higher score = more helpful, harmless, and honest
```
## Dataset
Anthropic HH-RLHF contains 170K human preference comparisons over model responses. Each example has a "chosen" (preferred) and "rejected" (dispreferred) conversation. The data covers two dimensions:
- Helpfulness: Is the response useful and informative?
- Harmlessness: Does the response avoid harmful, toxic, or dangerous content?
The dataset uses a dialogue format with "Human:" and "Assistant:" turns, making it directly compatible with conversational models.
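Concretely, each dataset row stores the "chosen" and "rejected" sides as full transcript strings in that dialogue format. A small sketch of the layout (the turns below are illustrative, not actual dataset rows):

```python
# Sketch of the HH-RLHF example layout: each comparison is a dict with two
# full transcript strings in the "\n\nHuman: ... \n\nAssistant: ..." format.
def to_hh_format(turns):
    """Render (role, text) turns as an HH-RLHF-style transcript string."""
    return "".join(f"\n\n{role}: {text}" for role, text in turns)

turns = [
    ("Human", "How do I make my code faster?"),
    ("Assistant", "Profile it first, then optimize the hot spots."),
]
example = {
    "chosen": to_hh_format(turns),
    "rejected": to_hh_format([turns[0], ("Assistant", "Just rewrite everything.")]),
}
print(example["chosen"])
```

Because the transcripts are already plain strings, they can be fed to the tokenizer exactly as in the Usage section above.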
## Limitations
- Trained for 1 epoch; additional epochs may improve accuracy further
- The 512-token max length means very long conversations are truncated
- Reward scores are relative, not absolute; useful for ranking, not for threshold-based filtering
- The model inherits biases from the Anthropic HH-RLHF dataset and the base Qwen model
- LoRA adapter requires the base Qwen2.5-7B-Instruct model for inference
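Since raw reward scores are uncalibrated, comparisons are only meaningful between candidates scored in the same context. A sketch of working with relative scores (the score values below are made up for illustration):

```python
# Because raw rewards are relative rather than absolute, compare candidates
# against each other instead of against a fixed cutoff.
# Hypothetical scores for three candidate responses to one prompt:
scores = {"answer_a": 1.7, "answer_b": -0.3, "answer_c": 0.9}

# Rank within the candidate set...
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['answer_a', 'answer_c', 'answer_b']

# ...or keep the top fraction, which adapts to the score distribution
# in a way a fixed threshold would not.
keep = ranked[: max(1, len(ranked) // 2)]
print(keep)
```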
