Qwen2.5-7B-Instruct-GRPO-Emotion1

This is our best configuration so far.

This model is a GRPO (Group Relative Policy Optimization) fine-tuned version of 55mvresearch/Qwen2.5-7B-Instruct-SFT-FT1-Merged, optimized for generating emotionally compelling advertising content.

Model Description

This model was trained using reinforcement learning (GRPO) with an LLM-as-judge reward function that evaluates generated ads across 6 emotional dimensions:

  1. Emotional Causality - Are emotions caused by observable behavior?
  2. Emotional Turn - Is there a clear behavioral change/pivot?
  3. Human Micro-Truths - Specific, ordinary human actions readers recognize
  4. Non-Literal Interpretation - Creative leap from the prompt
  5. Intimacy Anchor - Private, personal moments before spectacle
  6. Emotional Resolution - Does the ending change how we feel?
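As an illustrative sketch (not the exact judge code used in training), the per-dimension scores returned by the LLM judge can be averaged into a single reward in [0, 1]. The dimension keys and the 0-5 scale below are assumptions for demonstration:

```python
import json

# The six dimensions from the rubric above (key names are assumed).
DIMENSIONS = [
    "emotional_causality",
    "emotional_turn",
    "human_micro_truths",
    "non_literal_interpretation",
    "intimacy_anchor",
    "emotional_resolution",
]

def parse_judge_scores(judge_json: str, max_per_dim: int = 5) -> float:
    """Average the judge's per-dimension scores into one reward in [0, 1]."""
    scores = json.loads(judge_json)
    total = sum(scores.get(d, 0) for d in DIMENSIONS)
    return total / (len(DIMENSIONS) * max_per_dim)

# Example: a judge response giving 4/5 on every dimension.
reply = json.dumps({d: 4 for d in DIMENSIONS})
print(parse_judge_scores(reply))  # 0.8
```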

Training Results

| Metric | Start | End | Improvement |
|---|---|---|---|
| Reward Mean | ~0.35 | 0.6839 | +95% |
| Reward Std | ~0.15 | 0.0703 | More consistent |
| Train Loss | ~0 | -0.0476 | Policy improved |
| Entropy | ~1.9 | 1.63 | More confident |
| Completion Length | - | 275 tokens | Optimal range |
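The reward-mean improvement can be sanity-checked from the start and end values above:

```python
# Relative improvement of the mean reward over training.
start, end = 0.35, 0.6839
improvement = (end - start) / start * 100
print(f"+{improvement:.0f}%")  # +95%
```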

Wandb Link

Logs: https://wandb.ai/aabidkarim/Qwen2.5-7b-Instruct-emotion1/runs/etne8z5s/logs?nw=nwuseraabidkarim

Training Progress

  • Total Steps: 538
  • Epochs: 2
  • Training Time: 6h 7m 54s
  • Final Reward: 0.6839 (68.4% of max score)

Training Configuration

GRPO Settings

GRPOConfig(
    learning_rate=5e-6,
    num_generations=16,           # Completions per prompt
    max_prompt_length=256,
    max_completion_length=512,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    max_grad_norm=0.1,
    bf16=True,
)
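Under TRL's GRPO semantics, `per_device_train_batch_size` counts completions, not prompts (an assumption worth verifying against the TRL version used). These settings then imply the following per-device arithmetic:

```python
# Values from the GRPOConfig above.
per_device_train_batch_size = 8   # completions per forward pass, per device
gradient_accumulation_steps = 4
num_generations = 16              # completions sampled per prompt

completions_per_step = per_device_train_batch_size * gradient_accumulation_steps
prompts_per_step = completions_per_step // num_generations
print(completions_per_step, prompts_per_step)  # 32 completions, 2 prompts (per device)
```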

LoRA Configuration

LoraConfig(
    r=16,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj"],
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
)
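As a rough sketch of what this adapter adds: a LoRA update on a `d_in x d_out` linear layer introduces `r * (d_in + d_out)` trainable parameters, and the update is scaled by `lora_alpha / r = 64 / 16 = 4`. The 4096-wide projection below is a hypothetical example, not Qwen2.5-7B's exact shapes:

```python
def lora_params(d_in: int, d_out: int, r: int = 16) -> int:
    """LoRA factorizes the weight update as B @ A with A: (r, d_in), B: (d_out, r)."""
    return r * d_in + d_out * r

# Example: a hypothetical 4096x4096 projection with r=16.
print(lora_params(4096, 4096))  # 131072
```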

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "55mvresearch/Qwen2.5-7B-Instruct-SFT-FT1-Merged",
    torch_dtype="auto",
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "55mvresearch/Qwen2.5-7B-Instruct-GRPO-Emotion1")
tokenizer = AutoTokenizer.from_pretrained("55mvresearch/Qwen2.5-7B-Instruct-GRPO-Emotion1")

# Generate emotional ad
messages = [
    {"role": "system", "content": "You are an award-winning creative director. Write emotionally powerful advertisements."},
    {"role": "user", "content": "Create an ad for a coffee brand that evokes nostalgia and warmth."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
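For deployment without the `peft` dependency, the LoRA adapter can be folded into the base weights using PEFT's standard `merge_and_unload`. Shown as a sketch since running it requires downloading the model weights; the output directory name is arbitrary:

```python
# Continues from the Quick Start snippet above.
# Fold the LoRA weights into the base model for standalone serving.
merged = model.merge_and_unload()
merged.save_pretrained("qwen2.5-7b-grpo-emotion1-merged")
tokenizer.save_pretrained("qwen2.5-7b-grpo-emotion1-merged")
```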

Training Procedure


This model was trained with GRPO, a method introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
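GRPO dispenses with a learned value model: for each prompt it samples a group of $G$ completions (here $G = 16$, per `num_generations` above), scores them with the reward function, and normalizes each reward within its group, as described in the DeepSeekMath paper:

$$
A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
$$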

Dataset

Reward Function

Hybrid reward combining:

  • 30% Python-based length scoring (optimal: 150-300 words)
  • 70% LLM judge (GPT) scoring 6 emotional dimensions
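A minimal sketch of this hybrid blend, assuming both components are normalized to [0, 1]; the linear falloff outside the optimal word band is one plausible shape, not necessarily the exact heuristic used in training:

```python
def length_score(n_words: int, lo: int = 150, hi: int = 300) -> float:
    """Full credit inside the optimal 150-300 word band, linear falloff outside."""
    if lo <= n_words <= hi:
        return 1.0
    gap = lo - n_words if n_words < lo else n_words - hi
    return max(0.0, 1.0 - gap / lo)

def hybrid_reward(length_s: float, judge_s: float) -> float:
    """Blend: 30% length heuristic + 70% LLM-judge score."""
    return 0.3 * length_s + 0.7 * judge_s

# Example: a 200-word ad the judge scores 0.7 overall.
print(hybrid_reward(length_score(200), 0.7))  # 0.3*1.0 + 0.7*0.7 = 0.79
```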

Framework Versions

  • TRL: 0.27.1
  • Transformers: 5.0.0
  • PyTorch: 2.10.0
  • Datasets: 4.5.0
  • Tokenizers: 0.22.2
  • PEFT: Latest

Citations

Cite GRPO as:

@article{shao2024deepseekmath,
    title        = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author       = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year         = 2024,
    eprint       = {arXiv:2402.03300},
}

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}