Qwen2.5-7B Reward Model

A Qwen2.5-7B-Instruct model fine-tuned as a reward model on the Anthropic HH-RLHF dataset. Given a conversation, the model outputs a scalar reward score indicating how helpful, harmless, and honest the response is.

What is a Reward Model?

A reward model is a critical component of the RLHF (Reinforcement Learning from Human Feedback) pipeline. It learns to predict human preferences by training on pairs of responses where one is preferred over the other. The trained reward model can then:

  • Score responses for quality during RL training (e.g., PPO or GRPO)
  • Rank multiple candidate responses to select the best one (best-of-N sampling)
  • Filter training data by scoring and keeping only high-quality examples
  • Evaluate model outputs for helpfulness and safety
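
Concretely, this preference training usually optimizes a Bradley-Terry style pairwise objective in which the chosen response should receive a higher reward than the rejected one; TRL's RewardTrainer uses this same loss. A minimal sketch (function and variable names are illustrative, not taken from the training code):

import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: -log sigmoid(r_chosen - r_rejected).
    # Minimized when the chosen response scores higher than the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# With an untrained model the two rewards are roughly equal, so the loss
# starts near ln(2) ~= 0.693, consistent with the ~0.68 starting loss
# reported in the Performance section below.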

Training Details

  • Base model: Qwen/Qwen2.5-7B-Instruct
  • Method: Reward modeling with LoRA (r=32, alpha=64)
  • Quantization: None (full bf16 to maximize quality)
  • Dataset: Anthropic/hh-rlhf
  • Training examples: 160,800 preference pairs
  • Eval examples: 8,552 preference pairs
  • Hardware: NVIDIA RTX 5090 (32 GB VRAM, 18 GB used)
  • Training time: ~9.2 hours
  • Epochs: 1
  • Effective batch size: 16 (4 per device x 4 gradient accumulation steps)
  • Learning rate: 1e-5 (cosine schedule, 100 warmup steps)
  • Max sequence length: 512 tokens
  • Precision: bf16
  • Framework: TRL 0.29.1 + Transformers 5.3.0
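
For reference, a run with these hyperparameters could be set up roughly as follows using TRL's RewardTrainer with a PEFT LoRA config. This is a sketch based on the parameters above, not the author's actual training script; the LoRA target modules and some argument names (which vary across TRL versions) are assumptions:

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer
import torch

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, torch_dtype=torch.bfloat16
)
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("Anthropic/hh-rlhf")  # "chosen"/"rejected" dialogue pairs

peft_config = LoraConfig(
    task_type="SEQ_CLS",  # sequence-classification head on top of the LM
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)

training_args = RewardConfig(
    output_dir="qwen-7b-reward-model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size 16
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_length=512,
    bf16=True,
)

trainer = RewardTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
    peft_config=peft_config,
)
trainer.train()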

Performance

  • Eval accuracy: 71.5%
  • Final training accuracy: 71.4% (average over the last 50 steps)
  • Starting accuracy: 50.0% (random chance)
  • Training loss: 0.677 -> 0.545

The model learned to correctly identify the preferred response 71.5% of the time on the held-out test set, up from 50% (random chance).
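
Accuracy here is pairwise accuracy: the fraction of preference pairs for which the model assigns a higher reward to the chosen response than to the rejected one. A small sketch, assuming a hypothetical score(text) helper that wraps the scoring code shown in the Usage section below:

def pairwise_accuracy(pairs, score):
    # pairs: list of (chosen_text, rejected_text) tuples
    # score: function mapping a conversation string to a scalar reward
    correct = sum(score(chosen) > score(rejected) for chosen, rejected in pairs)
    return correct / len(pairs)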

Training Curves

[Figure: training metrics]

  • Training Loss: Steady decrease from 0.68 to 0.50
  • Accuracy: Rose from 50% (random) to ~72%, with train and eval accuracy tracking closely (no sign of overfitting)
  • Learning Rate: Cosine decay from 1e-5 to 0
  • Reward Margin: The gap between chosen and rejected reward scores grew steadily, showing that the model increasingly learned to separate good responses from bad ones

Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel
import torch

# Load base model + adapter
base_model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/qwen-7b-reward-model")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Score a response
text = "Human: What is the best way to learn programming?\n\nAssistant: Start with Python. Build small projects, read documentation, and practice daily."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(model.device)

with torch.no_grad():
    reward_score = model(**inputs).logits.item()

print(f"Reward score: {reward_score:.4f}")
# Higher score = more helpful, harmless, and honest
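
The same scoring loop extends naturally to best-of-N sampling: score several candidate responses to one prompt and keep the highest-scoring one. A follow-up example reusing model and tokenizer from the snippet above; the candidate texts are made up for illustration:

# Rank candidate responses for one prompt and keep the best (best-of-N)
prompt = "Human: What is the best way to learn programming?\n\nAssistant: "
candidates = [
    "Just google it.",
    "Start with Python, build small projects, and practice daily.",
    "Programming is too hard for most people, so don't bother.",
]

scores = []
for candidate in candidates:
    inputs = tokenizer(prompt + candidate, return_tensors="pt",
                       truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        scores.append(model(**inputs).logits.item())

best = candidates[scores.index(max(scores))]
print(f"Best response ({max(scores):.4f}): {best}")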

Dataset

Anthropic HH-RLHF contains 170K human preference comparisons over model responses. Each example has a "chosen" (preferred) and "rejected" (dispreferred) conversation. The data covers two dimensions:

  • Helpfulness: Is the response useful and informative?
  • Harmlessness: Does the response avoid harmful, toxic, or dangerous content?

The dataset uses a dialogue format with "Human:" and "Assistant:" turns, making it directly compatible with conversational models.
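
To inspect the raw data, the dataset can be loaded directly from the Hugging Face Hub; each record is a pair of full dialogue strings:

from datasets import load_dataset

dataset = load_dataset("Anthropic/hh-rlhf")
example = dataset["train"][0]
print(example.keys())            # dict_keys(['chosen', 'rejected'])
print(example["chosen"][:200])   # "\n\nHuman: ...\n\nAssistant: ..." dialogue text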

Limitations

  • Trained for 1 epoch; additional epochs may improve accuracy further
  • The 512-token max length means very long conversations are truncated
  • Reward scores are relative, not absolute; useful for ranking, not for threshold-based filtering
  • The model inherits biases from the Anthropic HH-RLHF dataset and the base Qwen model
  • LoRA adapter requires the base Qwen2.5-7B-Instruct model for inference
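
If keeping the base model and adapter separate at inference time is inconvenient, the LoRA weights can be merged into the base model with PEFT and saved as a standalone checkpoint. A sketch; the output path is illustrative:

from peft import PeftModel
from transformers import AutoModelForSequenceClassification
import torch

base = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", num_labels=1, torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "usama10/qwen-7b-reward-model")
merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained("qwen-7b-reward-model-merged")  # illustrative path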