# GRPO Tax Study: phi-3.8b LoRA Adapter

This is a LoRA adapter trained with GRPO (Group Relative Policy Optimization) on GSM8K math reasoning, released as part of the paper:

*The GRPO Tax is Smaller Than You Think: A Longitudinal Study of Capability Preservation During Reasoning Training*

## What is this?

This adapter was trained to study whether GRPO training for mathematical reasoning degrades other capabilities (the "alignment tax"). The key finding: 78% of non-target capabilities remain within ±2% of their baseline scores after one epoch of GRPO training.
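The "within ±2% of baseline" criterion can be sketched as follows. This is a minimal illustration with made-up scores, assuming the comparison is between per-task accuracy before and after training; see the paper for the exact metric definition:

```python
def fraction_preserved(baseline, after, tol=2.0):
    """Fraction of tasks whose post-training score stays within
    +/- tol percentage points of its baseline score."""
    preserved = sum(
        1 for task in baseline
        if abs(after[task] - baseline[task]) <= tol
    )
    return preserved / len(baseline)

# Hypothetical scores (accuracy, %) for illustration only.
baseline = {"mmlu": 68.0, "hellaswag": 76.5, "truthfulqa": 45.2, "arc": 81.0}
after    = {"mmlu": 67.1, "hellaswag": 76.9, "truthfulqa": 41.0, "arc": 80.4}

print(fraction_preserved(baseline, after))  # 3 of 4 tasks within +/-2 points
```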

## Training Details

| Parameter | Value |
|---|---|
| Base model | `microsoft/Phi-3.5-mini-instruct` |
| Parameters | 3.8B |
| Method | GRPO with LoRA (r=16, alpha=32) |
| Dataset | `openai/gsm8k` (7,473 examples) |
| Epochs | 1 |
| Learning rate | 5e-6 (cosine schedule) |
| Group size | 4 |
| Precision | bf16 |
| Hardware | NVIDIA RTX 5090 (32GB) |
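With group size 4, GRPO samples several completions per prompt, scores each with a reward, and uses the group's own statistics as the baseline instead of a learned value function. A minimal sketch of the group-relative advantage computation (illustrative only; the released training code may differ in details such as the normalization constant):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each completion's reward against its group:
    A_i = (r_i - mean(group)) / (std(group) + eps)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, group size 4: binary correctness rewards for 4 sampled solutions.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# Correct completions get a positive advantage, incorrect ones a negative one,
# so the policy gradient pushes probability mass toward correct solutions.
```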

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "usama10/grpo-tax-phi-3.8b")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
```
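GSM8K marks each reference solution's final answer with a `####` line, and a correctness reward for GRPO on this dataset is typically exact match on that extracted number. A hypothetical helper for checking a generation against a reference (not part of this repo; shown only to illustrate the reward format):

```python
import re

def extract_gsm8k_answer(text):
    """Return the number following the '####' marker, or None."""
    match = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    if match is None:
        return None
    return match.group(1).replace(",", "")

def correctness_reward(generation, reference):
    """1.0 if the generated final answer matches the reference, else 0.0."""
    pred = extract_gsm8k_answer(generation)
    gold = extract_gsm8k_answer(reference)
    return 1.0 if pred is not None and pred == gold else 0.0

# Made-up example in GSM8K's answer format.
reference = "She sells 16 - 3 - 4 = 9 eggs at $2 each, so 9 * 2 = 18.\n#### 18"
generation = "Step by step: she has 9 eggs to sell at $2 each.\n#### 18"
```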

## Related Resources

| Resource | Link |
|---|---|
| Paper | Coming soon (TMLR submission) |
| All evaluation data | `usama10/grpo-tax-eval-data` |
| Source code | github.com/usama10/grpo-capability-tax |
| Other GRPO adapters | `usama10/grpo-tax-qwen-1.5b`, `qwen-3b`, `phi-3.8b`, `gemma-2b`, `llama-3b` |
| DPO adapters | `usama10/grpo-tax-qwen-1.5b-dpo`, `qwen-3b-dpo` |