# RSP-GRPO Baseline: Qwen2.5-0.5B ChatML (RSP OFF)
GRPO (Group Relative Policy Optimization) baseline training on math reasoning. This is the vanilla GRPO run (no Random Soft Prompt), used as the control for the RSP ablation.
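As a refresher on what "group relative" means here: GRPO scores each rollout against the other rollouts sampled for the same prompt, normalizing rewards by the group mean and standard deviation. A minimal sketch (not the actual training code; `eps` and the array shapes are illustrative):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize each reward within its group.

    rewards: shape (num_prompts, group_size) -- one row per prompt,
    group_size rollouts per prompt (rollout_n = 8 in this run).
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: one prompt, 4 rollouts with 0/1 correctness rewards.
# Correct rollouts get positive advantage, incorrect ones negative.
adv = grpo_advantages(np.array([[1.0, 0.0, 0.0, 1.0]]))
```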
## Training Details
| Parameter | Value |
|---|---|
| Base Model | Qwen2.5-0.5B |
| Algorithm | GRPO |
| Prompt Format | ChatML |
| RSP | OFF (baseline) |
| Dataset | simplelr_math_35 (8,523 train) |
| train_batch_size | 256 |
| max_response_length | 8192 |
| rollout_n | 8 |
| learning_rate | 5e-7 |
| total_epochs | 20 |
| Hardware | 4x A100 80GB (Singularity) |
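For reference, ChatML wraps each conversation turn in `<|im_start|>`/`<|im_end|>` markers, which is the prompt format this run uses. A hedged sketch of how a math question might be rendered (the example question is illustrative, not from simplelr_math_35, and any system turn is omitted):

```python
def to_chatml(question: str) -> str:
    """Render a single-turn ChatML prompt, leaving the assistant turn
    open so the model generates the solution. Illustrative only."""
    return (
        "<|im_start|>user\n" + question + "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = to_chatml("What is 3 + 4?")
```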
## Checkpoints
| Step | Epoch | Val Score |
|---|---|---|
| global_step_16 | ~0.5 | ~27.0% |
| global_step_32 | ~1.0 | ~25.4% |
| global_step_48 | ~1.5 | ~30.2% |
| global_step_64 | ~2.0 | 28.4% |
Training is ongoing (step 79/660 as of upload). More checkpoints will be added.
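The 660-step total is consistent with the table above: 8,523 training examples at batch size 256 give 33 full optimizer steps per epoch, and 33 × 20 epochs = 660 (assuming the final partial batch is dropped):

```python
# Sanity check on the reported step count from the hyperparameters above.
train_examples = 8523
batch_size = 256
epochs = 20

steps_per_epoch = train_examples // batch_size  # drop the partial batch
total_steps = steps_per_epoch * epochs
```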
## Files
- `checkpoints/global_step_*/` — FSDP sharded checkpoints + HuggingFace model
- `logs/` — Full training log
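The HuggingFace-format weights inside a checkpoint can be loaded with `transformers` once downloaded. The `huggingface/` subdirectory name below is an assumption about the export layout; adjust it to match the actual repo contents:

```python
from pathlib import Path

def hf_checkpoint_dir(root: str, step: int) -> Path:
    """Path to the HF-format weights inside a checkpoint directory.

    The 'huggingface' subdirectory is an assumed export layout, not
    confirmed by this repo's file listing.
    """
    return Path(root) / f"global_step_{step}" / "huggingface"

# With the files present locally, loading is standard transformers usage:
# from transformers import AutoModelForCausalLM, AutoTokenizer
# ckpt = hf_checkpoint_dir("checkpoints", 64)
# tok = AutoTokenizer.from_pretrained(ckpt)
# model = AutoModelForCausalLM.from_pretrained(ckpt)
```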
## Project
Part of the RandomSoftPrompt (RSP-GRPO) experiment, comparing vanilla GRPO against RSP-augmented GRPO for math reasoning.