GRPO Tax: Evaluation Data

Complete evaluation data from the paper:

The GRPO Tax is Smaller Than You Think: A Longitudinal Study of Capability Preservation During Reasoning Training

This dataset contains 432 JSON evaluation files holding 5,616 individual benchmark scores (13 per file). The files come from dense-checkpoint evaluation of 5 models during GRPO training and 2 models during DPO training.

Directory Structure

qwen-1.5b/          # 76 files (base + 75 checkpoints)
qwen-3b/            # 76 files
phi-3.8b/           # 76 files
gemma-2b/           # 76 files
llama-3b/           # 76 files
qwen-1.5b-dpo/      # 26 files (base + 25 checkpoints)
qwen-3b-dpo/        # 26 files
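
The counts in the listing above can be checked against a local copy of the dataset. The following is a minimal sketch, assuming every evaluation file carries a `.json` extension and sits directly inside its model directory; the function names are illustrative and not part of the dataset or its source code.

```python
from pathlib import Path

# Expected file counts per directory, copied from the listing above.
EXPECTED_FILES = {
    "qwen-1.5b": 76,
    "qwen-3b": 76,
    "phi-3.8b": 76,
    "gemma-2b": 76,
    "llama-3b": 76,
    "qwen-1.5b-dpo": 26,
    "qwen-3b-dpo": 26,
}
BENCHMARKS_PER_FILE = 13


def count_json_files(root):
    """Return {directory name: number of *.json files directly inside it}.

    Assumption: files end in .json and are not nested in subdirectories.
    A missing directory is reported as 0 files.
    """
    root = Path(root)
    counts = {}
    for name in EXPECTED_FILES:
        directory = root / name
        counts[name] = len(list(directory.glob("*.json"))) if directory.is_dir() else 0
    return counts


def find_mismatches(root):
    """Return {directory name: (found, expected)} for directories whose count differs."""
    counts = count_json_files(root)
    return {
        name: (counts[name], expected)
        for name, expected in EXPECTED_FILES.items()
        if counts[name] != expected
    }


# Totals implied by the listing: 5 * 76 + 2 * 26 files, 13 scores per file.
total_files = sum(EXPECTED_FILES.values())
total_scores = total_files * BENCHMARKS_PER_FILE
print(total_files, total_scores)  # prints: 432 5616
```

Calling `find_mismatches("path/to/local/copy")` returns an empty dictionary when every directory holds the expected number of files.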

File Format

Each JSON file contains scores on 13 benchmarks:

{
  "base_model": "Qwen/Qwen2.5-1.5B-Instruct",
  "checkpoint": "base",
  "results": {
    "math_reasoning": {"benchmark": "math_reasoning", "metric": "accuracy", "score": 0.35, "num_examples": 200},
    "general_knowledge": {"benchmark": "general_knowledge", "metric": "accuracy", "score": 0.415, "num_examples": 200},
    ...
  }
}
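
For analysis it is often easier to work with one row per benchmark score than with the nested structure. The sketch below flattens a record of the shape shown above. It assumes every entry under `results` has the `metric`, `score`, and `num_examples` fields seen in the example; `flatten_record` and `load_rows` are illustrative names, not part of the dataset's tooling.

```python
import json


def flatten_record(record):
    """Turn one evaluation record into a list of flat rows, one per benchmark."""
    rows = []
    for name, entry in record["results"].items():
        rows.append(
            {
                "base_model": record["base_model"],
                "checkpoint": record["checkpoint"],
                "benchmark": name,
                "metric": entry["metric"],
                "score": entry["score"],
                "num_examples": entry["num_examples"],
            }
        )
    return rows


def load_rows(path):
    """Read one evaluation JSON file from disk and flatten it."""
    with open(path, encoding="utf-8") as f:
        return flatten_record(json.load(f))


# The two entries visible in the example above; real files hold 13 entries.
sample = {
    "base_model": "Qwen/Qwen2.5-1.5B-Instruct",
    "checkpoint": "base",
    "results": {
        "math_reasoning": {
            "benchmark": "math_reasoning",
            "metric": "accuracy",
            "score": 0.35,
            "num_examples": 200,
        },
        "general_knowledge": {
            "benchmark": "general_knowledge",
            "metric": "accuracy",
            "score": 0.415,
            "num_examples": 200,
        },
    },
}

for row in flatten_record(sample):
    print(row["checkpoint"], row["benchmark"], row["score"])
```

The resulting list of dictionaries can be passed directly to a table library or written out as CSV.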

Benchmarks (13 total)

Benchmark	Metric	Examples
GSM8K (target)	Accuracy	200
MMLU	Accuracy	200
HellaSwag	Accuracy	200
ARC-Challenge	Accuracy	200
TruthfulQA	Accuracy	200
Winogrande	Accuracy	200
IFEval	Constraint Satisfaction	30
XSum	ROUGE-L	66
WMT (en->de)	BLEU	97
Coding	Syntax + Structure	25
Safety	Refusal Rate	30
Creative Writing	Vocab. Diversity	25
Conversation	Helpfulness	25
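
Because the example counts in the table range from 25 to 200, scores on the smaller sets are noisier than scores on the larger ones. As a rough guide, the sketch below computes the binomial standard error of an accuracy score. This is a standard approximation chosen here for illustration, not a method described in the source, and it applies only to metrics that are a proportion of independent pass/fail examples; it does not cover ROUGE-L, BLEU, or the other non-accuracy metrics.

```python
import math


def accuracy_standard_error(score, num_examples):
    """Binomial standard error sqrt(p * (1 - p) / n) for an accuracy in [0, 1]."""
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return math.sqrt(score * (1.0 - score) / num_examples)


# 0.35 on 200 examples is the math_reasoning score from the File Format example.
se_200 = accuracy_standard_error(0.35, 200)

# Hypothetical comparison: the same proportion measured on only 30 examples.
se_30 = accuracy_standard_error(0.35, 30)

print(f"n=200: +/-{se_200:.3f}   n=30: +/-{se_30:.3f}")
```

Under this approximation the uncertainty shrinks with the square root of the example count, so a change of a few points between checkpoints is less conclusive on the smaller benchmarks.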

Related Resources

Resource	Link
Paper	Coming soon (TMLR submission)
Source code	github.com/usama10/grpo-capability-tax
Trained GRPO adapters	qwen-1.5b, qwen-3b, phi-3.8b, gemma-2b, llama-3b
DPO adapters	qwen-1.5b-dpo, qwen-3b-dpo
