# Qwen2.5-7B Reward Model
A Qwen2.5-7B-Instruct model fine-tuned as a reward model on the Anthropic HH-RLHF dataset. Given a conversation, the model outputs a scalar reward score indicating how helpful, harmless, and honest the response is.
## What is a Reward Model?
A reward model is a critical component of the RLHF (Reinforcement Learning from Human Feedback) pipeline. It learns to predict human preferences by training on pairs of responses where one is preferred over the other. The trained reward model can then:
- Score responses for quality during RL training (e.g., PPO or GRPO)
- Rank multiple candidate responses to select the best one (best-of-N sampling)
- Filter training data by scoring and keeping only high-quality examples
- Evaluate model outputs for helpfulness and safety
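The best-of-N use case above can be sketched in a few lines. The `score` callable here is a stand-in for scoring a full transcript with the reward model (as shown later in the Usage section); the dummy scorer below is a placeholder so the sketch runs on its own.

```python
# Best-of-N sampling sketch: generate N candidate responses, score each
# transcript with the reward model, and keep the highest-scoring one.
def best_of_n(prompt, candidates, score):
    """Return (score, response) for the best candidate under `score`."""
    scored = [(score(f"Human: {prompt}\n\nAssistant: {c}"), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0]

# Dummy stand-in for the reward model (illustration only): favors longer text.
dummy_score = lambda text: len(text)

best_score, best = best_of_n(
    "How do I sort a list in Python?",
    ["Use sorted().", "Use the built-in sorted() function, e.g. sorted(xs)."],
    dummy_score,
)
print(best)
```

In practice `score` would run the tokenizer and model from the Usage section below; the ranking logic is unchanged.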
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Method | Reward modeling with LoRA (r=32, alpha=64) |
| Quantization | None (full bf16 to maximize quality) |
| Dataset | Anthropic/hh-rlhf |
| Training examples | 160,800 preference pairs |
| Eval examples | 8,552 preference pairs |
| Hardware | NVIDIA RTX 5090 (32GB VRAM, 18GB used) |
| Training time | ~9.2 hours |
| Epochs | 1 |
| Effective batch size | 16 (4 per device x 4 gradient accumulation) |
| Learning rate | 1e-5 (cosine schedule, 100 warmup steps) |
| Max sequence length | 512 tokens |
| Precision | bf16 |
| Framework | TRL 0.29.1 + Transformers 5.3.0 |
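The card does not include the training script itself, but a minimal sketch approximating the hyperparameters above with TRL's `RewardTrainer` might look like the following. Exact argument names vary across TRL releases, so treat this as a configuration sketch rather than the actual recipe used.

```python
# Sketch only: reward-model training with TRL + LoRA, mirroring the
# hyperparameters listed in the table above. Not the card author's script.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", num_labels=1, torch_dtype="bfloat16"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

args = RewardConfig(
    output_dir="qwen-7b-reward-model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_length=512,
    bf16=True,
)

trainer = RewardTrainer(
    model=model,
    args=args,
    train_dataset=load_dataset("Anthropic/hh-rlhf", split="train"),
    processing_class=tokenizer,
    peft_config=LoraConfig(r=32, lora_alpha=64, task_type="SEQ_CLS"),
)
trainer.train()
```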
## Performance
| Metric | Value |
|---|---|
| Eval accuracy | 71.5% |
| Final training accuracy | 71.4% (avg last 50 steps) |
| Starting accuracy | 50.0% (random) |
| Training loss | 0.677 -> 0.545 |
The model learned to correctly identify the preferred response 71.5% of the time on the held-out test set, up from 50% (random chance).
### Training Curves
- Training Loss: Steady decrease from 0.68 to 0.50
- Accuracy: Rose from 50% (random) to ~72%, with train and eval tracking closely (no overfitting)
- Learning Rate: Cosine decay from 1e-5 to 0
- Reward Margin: The gap between chosen and rejected reward scores grew steadily, showing the model increasingly distinguishes good from bad responses
## Usage
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel
import torch

# Load base model + adapter
base_model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/qwen-7b-reward-model")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Score a response
text = "Human: What is the best way to learn programming?\n\nAssistant: Start with Python. Build small projects, read documentation, and practice daily."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(model.device)
with torch.no_grad():
    reward_score = model(**inputs).logits.item()
print(f"Reward score: {reward_score:.4f}")
# Higher score = more helpful, harmless, and honest
```
## Dataset
Anthropic HH-RLHF contains 170K human preference comparisons over model responses. Each example has a "chosen" (preferred) and "rejected" (dispreferred) conversation. The data covers two dimensions:
- Helpfulness: Is the response useful and informative?
- Harmlessness: Does the response avoid harmful, toxic, or dangerous content?
The dataset uses a dialogue format with "Human:" and "Assistant:" turns, making it directly compatible with conversational models.
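Concretely, each dataset row stores the "chosen" and "rejected" sides as full transcript strings in that dialogue format. A small sketch of the layout (the turns below are illustrative, not actual dataset rows):

```python
# Sketch of the HH-RLHF example layout: each comparison is a dict with two
# full transcript strings in the "\n\nHuman: ... \n\nAssistant: ..." format.
def to_hh_format(turns):
    """Render (role, text) turns as an HH-RLHF-style transcript string."""
    return "".join(f"\n\n{role}: {text}" for role, text in turns)

turns = [
    ("Human", "How do I make my code faster?"),
    ("Assistant", "Profile it first, then optimize the hot spots."),
]
example = {
    "chosen": to_hh_format(turns),
    "rejected": to_hh_format([turns[0], ("Assistant", "Just rewrite everything.")]),
}
print(example["chosen"])
```

Because the transcripts are already plain strings, they can be fed to the tokenizer exactly as in the Usage section above.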
## Limitations
- Trained for 1 epoch; additional epochs may improve accuracy further
- The 512-token max length means very long conversations are truncated
- Reward scores are relative, not absolute; useful for ranking, not for threshold-based filtering
- The model inherits biases from the Anthropic HH-RLHF dataset and the base Qwen model
- LoRA adapter requires the base Qwen2.5-7B-Instruct model for inference
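Since raw reward scores are uncalibrated, comparisons are only meaningful between candidates scored in the same context. A sketch of working with relative scores (the score values below are made up for illustration):

```python
# Because raw rewards are relative rather than absolute, compare candidates
# against each other instead of against a fixed cutoff.
# Hypothetical scores for three candidate responses to one prompt:
scores = {"answer_a": 1.7, "answer_b": -0.3, "answer_c": 0.9}

# Rank within the candidate set...
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['answer_a', 'answer_c', 'answer_b']

# ...or keep the top fraction, which adapts to the score distribution
# in a way a fixed threshold would not.
keep = ranked[: max(1, len(ranked) // 2)]
print(keep)
```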
