SuS: Strategy-aware Surprise for Intrinsic Exploration
Paper โข 2601.10349 โข Published
LoRA adapter fine-tuned with SuS (Strategy-aware Surprise) + GRPO for math reasoning on GSM8K.
SuS adds a semantic novelty bonus to correct responses in mixed-correctness batches, encouraging diverse problem-solving strategies. Zero-variance batches (all-correct or all-incorrect) are left untouched โ this prevents KL divergence blowup and length collapse.
| Parameter | Value |
|---|---|
| LoRA rank (r) | 64 |
| LoRA alpha | 128 |
| Training steps | 2,000 |
| Learning rate | 5e-6 |
| Batch size | 8 |
| KL coefficient | 0.001 |
| SS bonus (ฮฒ) | 0.1 |
| Dataset | GSM8K |
| Metric | Score |
|---|---|
| Pass@1 | 75.27% |
| Pass@5 | 94.03% |
| Pass@8 | 97.63% |
95% CI (Pass@1): [73.49, 77.05]
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct", torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, "mariklolik228/sus-qwen2.5-1.5b-grpo-lora")
tokenizer = AutoTokenizer.from_pretrained("mariklolik228/sus-qwen2.5-1.5b-grpo-lora")
@article{kashirskiy2026sus,
title={SuS: Strategy-aware Surprise for Intrinsic Exploration in GRPO},
author={Kashirskiy, Mark and Makarov, Ilya},
journal={arXiv preprint arXiv:2601.10349},
year={2026}
}