# GRPO Tax Study: phi-3.8b LoRA Adapter
This is a LoRA adapter trained with GRPO (Group Relative Policy Optimization) on GSM8K math reasoning, released as part of the paper:
*The GRPO Tax is Smaller Than You Think: A Longitudinal Study of Capability Preservation During Reasoning Training*
## What is this?
This adapter was trained to study whether GRPO training for mathematical reasoning degrades other capabilities (the "alignment tax"). The key finding: after one epoch of GRPO training, 78% of non-target capabilities remain within ±2% of their baseline scores.
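The preservation criterion above can be read as: the fraction of non-target benchmarks whose post-training score stays within ±2 points of the pre-GRPO baseline. A minimal sketch of that computation, using illustrative scores only (the helper name and the numbers are hypothetical, not taken from the paper):

```python
def preserved_fraction(baseline, after, tolerance=2.0):
    """Fraction of tasks whose score stays within +/-tolerance
    points of the pre-GRPO baseline."""
    kept = sum(
        1 for task, base in baseline.items()
        if abs(after[task] - base) <= tolerance
    )
    return kept / len(baseline)

# Illustrative scores only (percent accuracy), NOT results from the paper.
baseline = {"mmlu": 61.0, "hellaswag": 78.5, "truthfulqa": 46.2, "arc": 84.0}
after = {"mmlu": 60.3, "hellaswag": 78.9, "truthfulqa": 43.1, "arc": 83.6}
print(preserved_fraction(baseline, after))  # 3 of 4 tasks within tolerance
```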
## Training Details
| Parameter | Value |
|---|---|
| Base model | microsoft/Phi-3.5-mini-instruct |
| Parameters | 3.8B |
| Method | GRPO with LoRA (r=16, alpha=32) |
| Dataset | openai/gsm8k (7,473 examples) |
| Epochs | 1 |
| Learning rate | 5e-6 (cosine) |
| Group size | 4 |
| Precision | bf16 |
| Hardware | NVIDIA RTX 5090 (32GB) |
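The "Group size | 4" row refers to GRPO's defining trick: for each prompt, a group of completions is sampled and each one's reward is normalized against the group's mean and standard deviation, so no learned value network is needed. A minimal sketch of that advantage computation (a hypothetical helper for illustration, not code from the training run):

```python
def group_relative_advantages(rewards, group_size, eps=1e-8):
    """Normalize rewards within each group of sampled completions.

    Each completion's advantage is its reward minus the group mean,
    divided by the group's standard deviation (plus eps for stability).
    """
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = sum(group) / len(group)
        var = sum((r - mean) ** 2 for r in group) / len(group)
        std = var ** 0.5
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages

# One group of 4 completions (matching the group size above), where only
# the first completion earned the correctness reward.
adv = group_relative_advantages([1.0, 0.0, 0.0, 0.0], group_size=4)
```

The correct completion gets a positive advantage and the three incorrect ones get equal negative advantages, summing to roughly zero within the group.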
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in bf16 (matching the training precision).
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the GRPO-trained LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(base_model, "usama10/grpo-tax-phi-3.8b")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
```
## Related Resources
| Resource | Link |
|---|---|
| Paper | Coming soon (TMLR submission) |
| All evaluation data | usama10/grpo-tax-eval-data |
| Source code | github.com/usama10/grpo-capability-tax |
| Other GRPO adapters | usama10/grpo-tax-qwen-1.5b, qwen-3b, phi-3.8b, gemma-2b, llama-3b |
| DPO adapters | usama10/grpo-tax-qwen-1.5b-dpo, qwen-3b-dpo |
## Model tree for usama10/grpo-tax-phi-3.8b

**Base model:** microsoft/Phi-3.5-mini-instruct