JLiangHe/OPSD_exp

OPSD Experiment Results

Reproduction of OPSD (On-Policy Self-Distillation) on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B.

Results (Avg@12, accuracy averaged over 12 samples per problem)

Qwen3-1.7B

Method	AIME24	AIME25	HMMT25
Base	47.2%	35.3%	21.9%
OPSD (best)	49.2%	37.5%	24.4%
SFT (best)	37.5%	30.8%	19.2%
GRPO (best)	47.8%	35.0%	22.8%

Qwen3-4B

Method	AIME24	AIME25	HMMT25
Base	71.1%	60.0%	38.6%
OPSD (best)	62.2%	57.2%	34.2%
SFT (best)	62.5%	58.1%	33.3%
GRPO (best)	68.9%	65.0%	41.9%

Qwen3-8B

Method	AIME24	AIME25	HMMT25
Base	72.8%	61.7%	38.6%
OPSD (best)	69.4%	63.3%	38.6%
SFT (best)	69.2%	60.3%	36.1%
GRPO (best)	72.2%	65.8%	40.8%
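
To make the tables easier to compare, the following sketch computes each method's change relative to Base, in percentage points. The scores are copied from the tables above; the names `RESULTS`, `BENCHMARKS`, and `delta_vs_base` are hypothetical helpers introduced for illustration and are not part of the original experiment code.

```python
# Avg@12 scores in percent, copied from the three tables above.
# Tuple order follows BENCHMARKS.
BENCHMARKS = ("AIME24", "AIME25", "HMMT25")

RESULTS = {
    "Qwen3-1.7B": {
        "Base": (47.2, 35.3, 21.9),
        "OPSD": (49.2, 37.5, 24.4),
        "SFT": (37.5, 30.8, 19.2),
        "GRPO": (47.8, 35.0, 22.8),
    },
    "Qwen3-4B": {
        "Base": (71.1, 60.0, 38.6),
        "OPSD": (62.2, 57.2, 34.2),
        "SFT": (62.5, 58.1, 33.3),
        "GRPO": (68.9, 65.0, 41.9),
    },
    "Qwen3-8B": {
        "Base": (72.8, 61.7, 38.6),
        "OPSD": (69.4, 63.3, 38.6),
        "SFT": (69.2, 60.3, 36.1),
        "GRPO": (72.2, 65.8, 40.8),
    },
}


def delta_vs_base(results):
    """Return {model: {method: [delta per benchmark]}} in percentage points."""
    out = {}
    for model, rows in results.items():
        base = rows["Base"]
        out[model] = {
            method: [round(score - ref, 1) for score, ref in zip(scores, base)]
            for method, scores in rows.items()
            if method != "Base"
        }
    return out


if __name__ == "__main__":
    for model, rows in delta_vs_base(RESULTS).items():
        print(model)
        for method, deltas in rows.items():
            cells = "  ".join(
                f"{name} {d:+.1f}" for name, d in zip(BENCHMARKS, deltas)
            )
            print(f"  {method:<5} {cells}")
```

Each reported score is the best checkpoint for its method, so these deltas compare best-case results against the untrained base model rather than final-step results.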

Setup

All methods: lr=5e-6, batch size 32, LoRA (r=64, alpha=128), 200 training steps
Eval: val_n=12 (12 samples per problem), temperature=1.0, thinking mode enabled
Data: siyanzhao/Openthoughts_math_30k_opsd
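
As a rough illustration of the metric, the sketch below computes Avg@n under the assumption that it means: sample n completions per problem, score each as correct or incorrect, take the fraction correct per problem, and average over problems. The function name `avg_at_n` and the toy data are hypothetical; the answer-checking logic used in the actual evaluation is not described here and is not reproduced.

```python
from statistics import mean


def avg_at_n(correct):
    """Mean per-problem accuracy, in percent.

    `correct` is a list with one entry per problem; each entry is a list of
    booleans, one per sampled completion (12 per problem for Avg@12).
    """
    if not correct:
        raise ValueError("no problems given")
    per_problem = []
    for samples in correct:
        if not samples:
            raise ValueError("each problem needs at least one sample")
        # Fraction of sampled completions that were judged correct.
        per_problem.append(sum(samples) / len(samples))
    return 100.0 * mean(per_problem)


if __name__ == "__main__":
    # Toy data (not real results): two problems, 12 samples each.
    toy = [
        [True] * 9 + [False] * 3,   # 9 of 12 correct
        [True] * 3 + [False] * 9,   # 3 of 12 correct
    ]
    print(f"Avg@12 = {avg_at_n(toy):.1f}%")
```

Averaging over several samples reduces the variance that comes from sampling at temperature 1.0, which matters on small benchmarks where a single problem shifts the score by several points.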

Reference

Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs