# GRPO Baseline LoRA — Qwen2.5-1.5B-Instruct

A LoRA adapter fine-tuned with standard GRPO (correctness + format rewards only) for math reasoning on GSM8K. This model serves as the baseline for the SuS method.
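The reward signal described above can be sketched as a small function. This is an illustrative reconstruction, not the authors' code: the `#### <answer>` format is the GSM8K convention, and the 0.5 format bonus and 1.0 correctness reward are assumed values.

```python
import re

def reward(completion: str, gold: str) -> float:
    """Correctness + format reward in the spirit of the baseline's signal.

    format reward: the completion ends with '#### <number>' (GSM8K convention)
    correctness reward: the extracted number matches the gold answer
    Both bonus magnitudes here are illustrative assumptions.
    """
    fmt, correct = 0.0, 0.0
    m = re.search(r"####\s*(-?[\d,\.]+)\s*$", completion.strip())
    if m:
        fmt = 0.5  # assumed format bonus
        if m.group(1).replace(",", "") == gold.replace(",", ""):
            correct = 1.0
    return fmt + correct
```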

## Training Configuration

| Parameter | Value |
|---|---|
| LoRA rank (r) | 64 |
| LoRA alpha | 128 |
| Training steps | 2,000 |
| Learning rate | 5e-6 |
| Batch size | 8 |
| KL coefficient | 0.001 |
| Dataset | GSM8K |
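The table above could be expressed as a training configuration roughly like the following. This is a hypothetical sketch assuming TRL's `GRPOConfig` and PEFT's `LoraConfig`; the exact trainer arguments and any unlisted hyperparameters are assumptions, not the authors' script.

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA settings from the table; target modules are left at library defaults.
lora_config = LoraConfig(r=64, lora_alpha=128, task_type="CAUSAL_LM")

# GRPO settings from the table; all other fields use TRL defaults.
training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    max_steps=2000,
    beta=0.001,  # KL coefficient
)
```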

## Results on GSM8K

| Metric | Score |
|---|---|
| Pass@1 | 73.98% |
| Pass@5 | 89.53% |
| Pass@8 | 91.88% |

95% CI (Pass@1): [72.10%, 75.83%]
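The Pass@k metrics above are commonly computed with the unbiased estimator of Chen et al. (2021), `1 - C(n-c, k) / C(n, k)` for `n` samples with `c` correct. Whether this card used that estimator or raw best-of-k is not stated, so the sketch below is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations is correct,
    given c of the n are correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```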

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "mariklolik228/grpo-baseline-qwen2.5-1.5b-lora")
tokenizer = AutoTokenizer.from_pretrained("mariklolik228/grpo-baseline-qwen2.5-1.5b-lora")
```

## Citation

```bibtex
@article{kashirskiy2026sus,
  title={SuS: Strategy-aware Surprise for Intrinsic Exploration in GRPO},
  author={Kashirskiy, Mark and Makarov, Ilya},
  journal={arXiv preprint arXiv:2601.10349},
  year={2026}
}
```