# RSP-GRPO Baseline: Qwen2.5-0.5B ChatML (RSP OFF)
GRPO (Group Relative Policy Optimization) baseline training on math reasoning. This is the vanilla GRPO run (no Random Soft Prompt), used as the control for the RSP ablation.
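As a refresher on what "group relative" means here: GRPO scores each rollout against the other rollouts sampled for the same prompt, normalizing rewards by the group mean and standard deviation. A minimal sketch (not the actual training code; `eps` and the array shapes are illustrative):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize each reward within its group.

    rewards: shape (num_prompts, group_size) -- one row per prompt,
    group_size rollouts per prompt (rollout_n = 8 in this run).
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: one prompt, 4 rollouts with 0/1 correctness rewards.
# Correct rollouts get positive advantage, incorrect ones negative.
adv = grpo_advantages(np.array([[1.0, 0.0, 0.0, 1.0]]))
```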
## Training Details
| Parameter | Value |
|---|---|
| Base Model | Qwen2.5-0.5B |
| Algorithm | GRPO |
| Prompt Format | ChatML |
| RSP | OFF (baseline) |
| Dataset | simplelr_math_35 (8,523 train) |
| train_batch_size | 256 |
| max_response_length | 8192 |
| rollout_n | 8 |
| learning_rate | 5e-7 |
| total_epochs | 20 |
| Hardware | 4x A100 80GB (Singularity) |
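For reference, ChatML wraps each conversation turn in `<|im_start|>`/`<|im_end|>` markers, which is the prompt format this run uses. A hedged sketch of how a math question might be rendered (the example question is illustrative, not from simplelr_math_35, and any system turn is omitted):

```python
def to_chatml(question: str) -> str:
    """Render a single-turn ChatML prompt, leaving the assistant turn
    open so the model generates the solution. Illustrative only."""
    return (
        "<|im_start|>user\n" + question + "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = to_chatml("What is 3 + 4?")
```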
## Checkpoints
| Step | Epoch | Val Score |
|---|---|---|
| global_step_16 | ~0.5 | ~27.0% |
| global_step_32 | ~1.0 | ~25.4% |
| global_step_48 | ~1.5 | ~30.2% |
| global_step_64 | ~2.0 | 28.4% |
Training is ongoing (step 79/660 as of upload). More checkpoints will be added.
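The 660-step total is consistent with the table above: 8,523 training examples at batch size 256 give 33 full optimizer steps per epoch, and 33 × 20 epochs = 660 (assuming the final partial batch is dropped):

```python
# Sanity check on the reported step count from the hyperparameters above.
train_examples = 8523
batch_size = 256
epochs = 20

steps_per_epoch = train_examples // batch_size  # drop the partial batch
total_steps = steps_per_epoch * epochs
```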
## Files
- `checkpoints/global_step_*/` — FSDP sharded checkpoints + HuggingFace model
- `logs/` — Full training log
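The HuggingFace-format weights inside a checkpoint can be loaded with `transformers` once downloaded. The `huggingface/` subdirectory name below is an assumption about the export layout; adjust it to match the actual repo contents:

```python
from pathlib import Path

def hf_checkpoint_dir(root: str, step: int) -> Path:
    """Path to the HF-format weights inside a checkpoint directory.

    The 'huggingface' subdirectory is an assumed export layout, not
    confirmed by this repo's file listing.
    """
    return Path(root) / f"global_step_{step}" / "huggingface"

# With the files present locally, loading is standard transformers usage:
# from transformers import AutoModelForCausalLM, AutoTokenizer
# ckpt = hf_checkpoint_dir("checkpoints", 64)
# tok = AutoTokenizer.from_pretrained(ckpt)
# model = AutoModelForCausalLM.from_pretrained(ckpt)
```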
## Project
Part of the RandomSoftPrompt (RSP-GRPO) experiment, comparing vanilla GRPO against RSP-augmented GRPO for math reasoning.