GRPO Tax: Evaluation Data
Complete evaluation data from the paper "The GRPO Tax is Smaller Than You Think: A Longitudinal Study of Capability Preservation During Reasoning Training".
Contents
This dataset contains 432 JSON evaluation files (5 GRPO runs × 76 files + 2 DPO runs × 26 files) holding 5,616 individual benchmark scores (432 files × 13 benchmarks). The files come from dense-checkpoint evaluation of 5 models during GRPO training and 2 models during DPO training.
Directory Structure

    qwen-1.5b/        # 76 files (base + 75 checkpoints)
    qwen-3b/          # 76 files
    phi-3.8b/         # 76 files
    gemma-2b/         # 76 files
    llama-3b/         # 76 files
    qwen-1.5b-dpo/    # 26 files (base + 25 checkpoints)
    qwen-3b-dpo/      # 26 files
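
The per-directory counts above can be cross-checked against the totals quoted in Contents. The following is a minimal sketch, not part of the released code: the helper `find_count_mismatches` is hypothetical, and it assumes the evaluation files carry a `.json` extension and sit directly inside each model directory.

```python
from pathlib import Path

# File counts per directory, copied from the listing above.
EXPECTED_FILES = {
    "qwen-1.5b": 76,
    "qwen-3b": 76,
    "phi-3.8b": 76,
    "gemma-2b": 76,
    "llama-3b": 76,
    "qwen-1.5b-dpo": 26,
    "qwen-3b-dpo": 26,
}
BENCHMARKS_PER_FILE = 13

total_files = sum(EXPECTED_FILES.values())        # 5 * 76 + 2 * 26
total_scores = total_files * BENCHMARKS_PER_FILE  # one score per benchmark per file


def find_count_mismatches(root):
    """Return {directory: (expected, found)} where the file count is off.

    Assumption: files end in .json and are not nested in subdirectories.
    A missing directory is reported with a found count of 0.
    """
    mismatches = {}
    for name, expected in EXPECTED_FILES.items():
        directory = Path(root) / name
        found = len(list(directory.glob("*.json"))) if directory.is_dir() else 0
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    print(total_files, total_scores)  # 432 5616
```

Calling `find_count_mismatches` on the root of a local copy should return an empty dict if the download is complete and the layout assumptions hold.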
File Format
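Each JSON file records the base model, the checkpoint label, and scores on 13 benchmarks keyed by benchmark name. Each entry gives the metric, the score, and the number of examples evaluated: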

    {
      "base_model": "Qwen/Qwen2.5-1.5B-Instruct",
      "checkpoint": "base",
      "results": {
        "math_reasoning": {"benchmark": "math_reasoning", "metric": "accuracy", "score": 0.35, "num_examples": 200},
        "general_knowledge": {"benchmark": "general_knowledge", "metric": "accuracy", "score": 0.415, "num_examples": 200},
        ...
      }
    }
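
Reading a file in this format might look like the sketch below. It relies only on the fields shown in the example above (`checkpoint`, `results`, `score`); the function names are hypothetical, and the file names inside each directory are not documented here, so any path passed to `load_scores` is up to the caller.

```python
import json
from pathlib import Path


def extract_scores(record):
    """Return (checkpoint label, {benchmark name: score}) from one parsed file."""
    scores = {name: entry["score"] for name, entry in record["results"].items()}
    return record["checkpoint"], scores


def load_scores(path):
    """Parse one evaluation file from disk and flatten it with extract_scores."""
    record = json.loads(Path(path).read_text(encoding="utf-8"))
    return extract_scores(record)


def score_deltas(base_scores, checkpoint_scores):
    """Per-benchmark change relative to the base model.

    Only benchmarks present in both mappings are compared.
    """
    return {
        name: checkpoint_scores[name] - base_scores[name]
        for name in base_scores
        if name in checkpoint_scores
    }
```

Loading the `base` file and a later checkpoint from the same directory, then passing both score mappings to `score_deltas`, gives the per-benchmark change for that checkpoint. Note that the deltas mix metrics on different scales (accuracy, ROUGE-L, BLEU, and so on), so they are best compared within a benchmark rather than across benchmarks.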
Benchmarks (13 total)
| Benchmark | Metric | Examples |
|---|---|---|
| GSM8K (target) | Accuracy | 200 |
| MMLU | Accuracy | 200 |
| HellaSwag | Accuracy | 200 |
| ARC-Challenge | Accuracy | 200 |
| TruthfulQA | Accuracy | 200 |
| Winogrande | Accuracy | 200 |
| IFEval | Constraint Satisfaction | 30 |
| XSum | ROUGE-L | 66 |
| WMT (en->de) | BLEU | 97 |
| Coding | Syntax + Structure | 25 |
| Safety | Refusal Rate | 30 |
| Creative Writing | Vocab. Diversity | 25 |
| Conversation | Helpfulness | 25 |
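
Because the example counts in the table range from 25 to 200, the sampling noise on a single score differs considerably between benchmarks. As a rough illustration only, and not a method taken from the paper, the sketch below applies the standard binomial standard error to a proportion-type score. It assumes independent examples and does not apply to ROUGE-L, BLEU, or the other non-proportion metrics. The score 0.35 is the `math_reasoning` value from the File Format example; reusing it at 30 examples is purely hypothetical.

```python
import math


def proportion_standard_error(score, num_examples):
    """Binomial standard error sqrt(p * (1 - p) / n) for a proportion-type score."""
    return math.sqrt(score * (1.0 - score) / num_examples)


# Same score, two of the sample sizes that appear in the table.
se_200 = proportion_standard_error(0.35, 200)
se_30 = proportion_standard_error(0.35, 30)

if __name__ == "__main__":
    print(round(se_200, 3), round(se_30, 3))  # 0.034 0.087
```

Under these assumptions, a checkpoint-to-checkpoint change of a few points on a 30-example benchmark is well within one standard error, whereas the same change on a 200-example benchmark is closer to the noise floor but still not large.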
Related Resources
| Resource | Link |
|---|---|
| Paper | Coming soon (TMLR submission) |
| Source code | github.com/usama10/grpo-capability-tax |
| Trained GRPO adapters | qwen-1.5b, qwen-3b, phi-3.8b, gemma-2b, llama-3b |
| DPO adapters | qwen-1.5b-dpo, qwen-3b-dpo |