GRPO Tax: Evaluation Data

Complete evaluation data from the paper:

The GRPO Tax is Smaller Than You Think: A Longitudinal Study of Capability Preservation During Reasoning Training

Contents

This dataset contains 432 JSON evaluation files with 5,616 individual benchmark scores from dense-checkpoint evaluation of 5 models during GRPO training and 2 models during DPO training.

Directory Structure

qwen-1.5b/          # 76 files (base + 75 checkpoints)
qwen-3b/            # 76 files
phi-3.8b/           # 76 files
gemma-2b/           # 76 files
llama-3b/           # 76 files
qwen-1.5b-dpo/      # 26 files (base + 25 checkpoints)
qwen-3b-dpo/        # 26 files
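The layout above can be walked with the standard library alone. A minimal sketch, demonstrated on a mock directory; the card does not specify the filenames inside each model directory, so `base.json` and the helper `list_eval_files` are illustrative assumptions:

```python
import tempfile
from pathlib import Path

def list_eval_files(root):
    """Map each model directory under `root` to its sorted JSON evaluation files."""
    return {d.name: sorted(p.name for p in d.glob("*.json"))
            for d in Path(root).iterdir() if d.is_dir()}

# Demo on a mock layout (real checkpoint filenames may differ):
root = Path(tempfile.mkdtemp())
(root / "qwen-1.5b").mkdir()
(root / "qwen-1.5b" / "base.json").write_text("{}")
print(list_eval_files(root))  # {'qwen-1.5b': ['base.json']}
```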

File Format

Each JSON file contains scores on 13 benchmarks:

{
  "base_model": "Qwen/Qwen2.5-1.5B-Instruct",
  "checkpoint": "base",
  "results": {
    "math_reasoning": {"benchmark": "math_reasoning", "metric": "accuracy", "score": 0.35, "num_examples": 200},
    "general_knowledge": {"benchmark": "general_knowledge", "metric": "accuracy", "score": 0.415, "num_examples": 200},
    ...
  }
}
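The schema needs nothing beyond the standard library to parse. A minimal sketch using the two example entries shown above (`score_table` is a hypothetical helper, not part of the dataset):

```python
import json

# A record in the dataset's schema, using the example values from the card.
record = json.loads("""
{
  "base_model": "Qwen/Qwen2.5-1.5B-Instruct",
  "checkpoint": "base",
  "results": {
    "math_reasoning": {"benchmark": "math_reasoning", "metric": "accuracy", "score": 0.35, "num_examples": 200},
    "general_knowledge": {"benchmark": "general_knowledge", "metric": "accuracy", "score": 0.415, "num_examples": 200}
  }
}
""")

def score_table(rec):
    """Flatten one evaluation file into sorted (benchmark, metric, score) rows."""
    return [(name, r["metric"], r["score"])
            for name, r in sorted(rec["results"].items())]

for name, metric, score in score_table(record):
    print(f"{name:20s} {metric:10s} {score:.3f}")
```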

Benchmarks (13 total)

Benchmark         Metric                   Examples
GSM8K (target)    Accuracy                 200
MMLU              Accuracy                 200
HellaSwag         Accuracy                 200
ARC-Challenge     Accuracy                 200
TruthfulQA        Accuracy                 200
Winogrande        Accuracy                 200
IFEval            Constraint Satisfaction  30
XSum              ROUGE-L                  66
WMT (en->de)      BLEU                     97
Coding            Syntax + Structure       25
Safety            Refusal Rate             30
Creative Writing  Vocab. Diversity         25
Conversation      Helpfulness              25
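A common use of these per-checkpoint scores is to measure capability retention relative to the base model. A minimal sketch, assuming the `results` dicts from the file format above; the `retention` helper and the numbers below are illustrative, not taken from the dataset:

```python
def retention(base_results, ckpt_results):
    """Per-benchmark score ratio vs. the base checkpoint (1.0 = fully preserved)."""
    return {name: ckpt_results[name]["score"] / r["score"]
            for name, r in base_results.items() if r["score"] > 0}

# Illustrative numbers only:
base = {"math_reasoning": {"score": 0.35}, "general_knowledge": {"score": 0.40}}
ckpt = {"math_reasoning": {"score": 0.42}, "general_knowledge": {"score": 0.38}}
print(retention(ckpt_results=ckpt, base_results=base))
```

Ratios above 1.0 indicate improvement on the target task; ratios below 1.0 quantify the "tax" on a preserved capability.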

Related Resources

Resource               Link
Paper                  Coming soon (TMLR submission)
Source code            github.com/usama10/grpo-capability-tax
Trained GRPO adapters  qwen-1.5b, qwen-3b, phi-3.8b, gemma-2b, llama-3b
DPO adapters           qwen-1.5b-dpo, qwen-3b-dpo