SuS-GRPO LoRA — Qwen2.5-1.5B-Instruct

LoRA adapter fine-tuned with SuS (Strategy-aware Surprise) + GRPO for math reasoning on GSM8K.

Method

SuS adds a semantic novelty bonus to correct responses in mixed-correctness batches, encouraging diverse problem-solving strategies. Zero-variance batches (all-correct or all-incorrect) are left untouched — this prevents KL divergence blowup and length collapse.

Base model: Qwen/Qwen2.5-1.5B-Instruct
Encoder: all-MiniLM-L6-v2 (frozen, 22M params)

Training Configuration

Parameter	Value
LoRA rank (r)	64
LoRA alpha	128
Training steps	2,000
Learning rate	5e-6
Batch size	8
KL coefficient	0.001
SS bonus (β)	0.1
Dataset	GSM8K

Results on GSM8K

Metric	Score
Pass@1	75.27%
Pass@5	94.03%
Pass@8	97.63%

95% CI (Pass@1): [73.49, 77.05]

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct", torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, "mariklolik228/sus-qwen2.5-1.5b-grpo-lora")
tokenizer = AutoTokenizer.from_pretrained("mariklolik228/sus-qwen2.5-1.5b-grpo-lora")

Citation

@article{kashirskiy2026sus,
  title={SuS: Strategy-aware Surprise for Intrinsic Exploration in GRPO},
  author={Kashirskiy, Mark and Makarov, Ilya},
  journal={arXiv preprint arXiv:2601.10349},
  year={2026}
}

Downloads last month: -

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for mariklolik228/sus-qwen2.5-1.5b-grpo-lora

Base model

Qwen/Qwen2.5-1.5B

Finetuned

Qwen/Qwen2.5-1.5B-Instruct

Adapter

(827)

this model

Paper for mariklolik228/sus-qwen2.5-1.5b-grpo-lora

SuS: Strategy-aware Surprise for Intrinsic Exploration

Paper • 2601.10349 • Published Jan 15

mariklolik228
/

sus-qwen2.5-1.5b-grpo-lora