# RSP-GRPO Baseline: Qwen2.5-0.5B ChatML (RSP OFF)

GRPO (Group Relative Policy Optimization) baseline training on math reasoning. This is the vanilla GRPO run (no Random Soft Prompt), used as the control for the RSP ablation.
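For context, GRPO's defining trick is that it needs no learned value critic: it samples a group of rollouts per prompt (here `rollout_n = 8`) and normalizes each rollout's reward against the group's mean and standard deviation. A minimal sketch of that advantage computation (function name and reward values are illustrative, not from the training code):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: z-score each reward against its group.

    The group's own mean/std replaces the value-function baseline used
    by PPO-style methods.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-std group
    return [(r - mean) / std for r in rewards]

# Hypothetical group of 8 rollouts for one prompt:
# reward 1.0 if the final answer was correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advs = grpo_advantages(rewards)
print([round(a, 3) for a in advs])
```

Correct rollouts get a positive advantage and incorrect ones a negative advantage, scaled by how mixed the group's outcomes were; a group that is all-correct or all-wrong yields zero advantages everywhere.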

## Training Details

| Parameter | Value |
|---|---|
| Base Model | Qwen2.5-0.5B |
| Algorithm | GRPO |
| Prompt Format | ChatML |
| RSP | OFF (baseline) |
| Dataset | simplelr_math_35 (8,523 train) |
| train_batch_size | 256 |
| max_response_length | 8192 |
| rollout_n | 8 |
| learning_rate | 5e-7 |
| total_epochs | 20 |
| Hardware | 4x A100 80GB (Singularity) |
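The ChatML prompt format listed above can be sketched as follows (the role markers are standard ChatML; the helper name and the example message are illustrative, and the exact system prompt used in training is not reproduced here):

```python
def to_chatml(messages):
    """Render a list of {role, content} dicts in ChatML.

    Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
    """
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
             for m in messages]
    # A trailing assistant header cues the model to generate its reply.
    return "\n".join(parts) + "\n<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "user", "content": "What is 7 * 8?"},
])
print(prompt)
```

In practice the tokenizer's built-in chat template produces the same layout; this sketch just makes the wire format explicit.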

## Checkpoints

| Step | Epoch | Val Score |
|---|---|---|
| global_step_16 | ~0.5 | ~27.0% |
| global_step_32 | ~1.0 | ~25.4% |
| global_step_48 | ~1.5 | ~30.2% |
| global_step_64 | ~2.0 | 28.4% |

Training is ongoing (step 79/660 as of upload). More checkpoints will be added.

## Files

- `checkpoints/global_step_*/` — FSDP sharded checkpoints + HuggingFace model
- `logs/` — full training log

## Project

Part of the RandomSoftPrompt (RSP-GRPO) experiment, comparing vanilla GRPO against RSP-augmented GRPO for math reasoning.
