This model is a GRPO (Group Relative Policy Optimization) fine-tune of 55mvresearch/Qwen2.5-7B-Instruct-SFT-FT1-Merged, optimized for generating emotionally compelling advertising copy.

It was trained with reinforcement learning (GRPO) using an LLM-as-judge reward function that scores each generated ad across six emotional dimensions:
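A minimal sketch of what such an LLM-as-judge reward function might look like in the shape TRL's `GRPOTrainer` expects (one float per completion). The six dimension names and the `call_judge_llm` helper are placeholders, not the actual implementation; the card does not list the real dimensions or judge prompt.

```python
import json

# Placeholder dimension names -- the card does not list the actual six.
DIMENSIONS = ["joy", "nostalgia", "warmth", "surprise", "trust", "desire"]

def call_judge_llm(prompt: str, completion: str) -> str:
    """Stand-in for the actual judge-LLM call; assumed to return a
    JSON object mapping each dimension to a 0-10 score."""
    return json.dumps({d: 5 for d in DIMENSIONS})

def parse_judge_scores(judge_reply: str) -> float:
    """Average the six dimension scores, normalized to [0, 1].
    Malformed judge output gets reward 0.0."""
    try:
        scores = json.loads(judge_reply)
        vals = [float(scores[d]) for d in DIMENSIONS]
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0
    return sum(vals) / (10.0 * len(vals))

def emotion_reward(prompts, completions, **kwargs):
    """Reward function signature used by TRL's GRPOTrainer:
    returns one scalar reward per completion."""
    return [parse_judge_scores(call_judge_llm(p, c))
            for p, c in zip(prompts, completions)]
```

Returning 0.0 on malformed judge output keeps training robust when the judge occasionally fails to emit valid JSON.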
| Metric | Start | End | Improvement |
|---|---|---|---|
| Reward Mean | ~0.35 | 0.6839 | +95% |
| Reward Std | ~0.15 | 0.0703 | More consistent |
| Train Loss | ~0 | -0.0476 | Policy improved |
| Entropy | ~1.9 | 1.63 | More confident |
| Completion Length | - | 275 tokens | Optimal range |
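The "+95%" figure in the table follows directly from the reported start and end reward means:

```python
start, end = 0.35, 0.6839            # reward mean at start / end of training
improvement = (end - start) / start  # relative improvement
print(f"+{improvement:.0%}")         # rounds to +95%
```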
Training logs: https://wandb.ai/aabidkarim/Qwen2.5-7b-Instruct-emotion1/runs/etne8z5s/logs?nw=nwuseraabidkarim
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate=5e-6,
    num_generations=16,            # completions sampled per prompt (group size)
    max_prompt_length=256,
    max_completion_length=512,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    max_grad_norm=0.1,
    bf16=True,
)
```
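The `num_generations=16` setting is what makes GRPO "group relative": for each prompt, 16 completions are sampled and each completion's advantage is its reward normalized against the group's mean and standard deviation, replacing a learned value baseline. A minimal sketch of that normalization (not the TRL internals):

```python
def group_relative_advantages(rewards, eps=1e-4):
    """Normalize one prompt's group of rewards to zero mean / unit std.
    In GRPO each group (here, 16 completions per prompt) is normalized
    independently; the result serves as the per-token advantage."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Completions that beat their own group's average get positive advantages; the rest get negative ones, so the policy only needs rewards to be comparable within a group.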
```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj"],
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
)
```
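With `r=16` and all seven projections targeted, each adapted linear layer adds `r * (in_features + out_features)` parameters (the A and B factors). A rough count, assuming the standard Qwen2.5-7B shapes (verify against the actual model config):

```python
# Assumed Qwen2.5-7B shapes: hidden 3584, grouped-query KV dim 512,
# MLP intermediate 18944, 28 decoder layers.
R = 16
NUM_LAYERS = 28
SHAPES = {  # module: (in_features, out_features)
    "q_proj": (3584, 3584),
    "k_proj": (3584, 512),
    "v_proj": (3584, 512),
    "o_proj": (3584, 3584),
    "gate_proj": (3584, 18944),
    "up_proj": (3584, 18944),
    "down_proj": (18944, 3584),
}

def lora_params(shapes, r, layers):
    # Each adapted Linear adds A (r x in) and B (out x r): r*(in+out) params.
    per_layer = sum(r * (i + o) for i, o in shapes.values())
    return per_layer * layers

total = lora_params(SHAPES, R, NUM_LAYERS)
print(f"{total / 1e6:.1f}M trainable LoRA parameters")  # ~40.4M, well under 1% of 7B
```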
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "55mvresearch/Qwen2.5-7B-Instruct-SFT-FT1-Merged",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "55mvresearch/Qwen2.5-7B-Instruct-GRPO-Emotion1")
tokenizer = AutoTokenizer.from_pretrained("55mvresearch/Qwen2.5-7B-Instruct-GRPO-Emotion1")

# Generate an emotional ad
messages = [
    {"role": "system", "content": "You are an award-winning creative director. Write emotionally powerful advertisements."},
    {"role": "user", "content": "Create an ad for a coffee brand that evokes nostalgia and warmth."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the echoed prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
This model was trained with GRPO, a method introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
The reward was a hybrid combining multiple signals, principally the LLM-as-judge emotional scoring described above.
Cite GRPO as:
@article{shao2024deepseekmath,
title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
year = 2024,
eprint = {arXiv:2402.03300},
}
Cite TRL as:
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}