This model is a GRPO (Group Relative Policy Optimization) fine-tune of 55mvresearch/Qwen2.5-7B-Instruct-SFT-FT1-Merged, optimized for generating emotionally compelling advertising copy.

It was trained with reinforcement learning (GRPO) using an LLM-as-judge reward function that scores each generated ad across six emotional dimensions:
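A minimal sketch of what such an LLM-as-judge reward function might look like in the shape TRL's `GRPOTrainer` expects (one float per completion). The six dimension names and the `call_judge_llm` helper are placeholders, not the actual implementation; the card does not list the real dimensions or judge prompt.

```python
import json

# Placeholder dimension names -- the card does not list the actual six.
DIMENSIONS = ["joy", "nostalgia", "warmth", "surprise", "trust", "desire"]

def call_judge_llm(prompt: str, completion: str) -> str:
    """Stand-in for the actual judge-LLM call; assumed to return a
    JSON object mapping each dimension to a 0-10 score."""
    return json.dumps({d: 5 for d in DIMENSIONS})

def parse_judge_scores(judge_reply: str) -> float:
    """Average the six dimension scores, normalized to [0, 1].
    Malformed judge output gets reward 0.0."""
    try:
        scores = json.loads(judge_reply)
        vals = [float(scores[d]) for d in DIMENSIONS]
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0
    return sum(vals) / (10.0 * len(vals))

def emotion_reward(prompts, completions, **kwargs):
    """Reward function signature used by TRL's GRPOTrainer:
    returns one scalar reward per completion."""
    return [parse_judge_scores(call_judge_llm(p, c))
            for p, c in zip(prompts, completions)]
```

Returning 0.0 on malformed judge output keeps training robust when the judge occasionally fails to emit valid JSON.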
| Metric | Start | End | Improvement |
|---|---|---|---|
| Reward Mean | ~0.35 | 0.6839 | +95% |
| Reward Std | ~0.15 | 0.0703 | More consistent |
| Train Loss | ~0 | -0.0476 | Policy improved |
| Entropy | ~1.9 | 1.63 | More confident |
| Completion Length | - | 275 tokens | Optimal range |
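The "+95%" figure in the table follows directly from the reported start and end reward means:

```python
start, end = 0.35, 0.6839            # reward mean at start / end of training
improvement = (end - start) / start  # relative improvement
print(f"+{improvement:.0%}")         # rounds to +95%
```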
Training logs: https://wandb.ai/aabidkarim/Qwen2.5-7b-Instruct-emotion1/runs/etne8z5s/logs?nw=nwuseraabidkarim
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate=5e-6,
    num_generations=16,            # completions sampled per prompt (group size)
    max_prompt_length=256,
    max_completion_length=512,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    max_grad_norm=0.1,
    bf16=True,
)
```
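The `num_generations=16` setting is what makes GRPO "group relative": for each prompt, 16 completions are sampled and each completion's advantage is its reward normalized against the group's mean and standard deviation, replacing a learned value baseline. A minimal sketch of that normalization (not the TRL internals):

```python
def group_relative_advantages(rewards, eps=1e-4):
    """Normalize one prompt's group of rewards to zero mean / unit std.
    In GRPO each group (here, 16 completions per prompt) is normalized
    independently; the result serves as the per-token advantage."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Completions that beat their own group's average get positive advantages; the rest get negative ones, so the policy only needs rewards to be comparable within a group.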
```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj"],
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
)
```
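With `r=16` and all seven projections targeted, each adapted linear layer adds `r * (in_features + out_features)` parameters (the A and B factors). A rough count, assuming the standard Qwen2.5-7B shapes (verify against the actual model config):

```python
# Assumed Qwen2.5-7B shapes: hidden 3584, grouped-query KV dim 512,
# MLP intermediate 18944, 28 decoder layers.
R = 16
NUM_LAYERS = 28
SHAPES = {  # module: (in_features, out_features)
    "q_proj": (3584, 3584),
    "k_proj": (3584, 512),
    "v_proj": (3584, 512),
    "o_proj": (3584, 3584),
    "gate_proj": (3584, 18944),
    "up_proj": (3584, 18944),
    "down_proj": (18944, 3584),
}

def lora_params(shapes, r, layers):
    # Each adapted Linear adds A (r x in) and B (out x r): r*(in+out) params.
    per_layer = sum(r * (i + o) for i, o in shapes.values())
    return per_layer * layers

total = lora_params(SHAPES, R, NUM_LAYERS)
print(f"{total / 1e6:.1f}M trainable LoRA parameters")  # ~40.4M, well under 1% of 7B
```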
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "55mvresearch/Qwen2.5-7B-Instruct-SFT-FT1-Merged",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "55mvresearch/Qwen2.5-7B-Instruct-GRPO-Emotion1")
tokenizer = AutoTokenizer.from_pretrained("55mvresearch/Qwen2.5-7B-Instruct-GRPO-Emotion1")

# Generate an emotional ad
messages = [
    {"role": "system", "content": "You are an award-winning creative director. Write emotionally powerful advertisements."},
    {"role": "user", "content": "Create an ad for a coffee brand that evokes nostalgia and warmth."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the echoed prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
This model was trained with GRPO, a method introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
The reward was a hybrid combining multiple signals, principally the LLM-as-judge emotional scoring described above.
Cite GRPO as:
@article{shao2024deepseekmath,
title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
year = 2024,
eprint = {arXiv:2402.03300},
}
Cite TRL as:
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}