RLVR vs SFT on Qwen2.5-1.5B-Instruct

GRPO- and SFT-trained checkpoint of Qwen2.5-1.5B-Instruct, trained on the GSM8K dataset.

Part of a personal project comparing RLVR vs SFT training methods.
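For readers unfamiliar with RLVR (reinforcement learning with verifiable rewards), the sketch below illustrates what a verifiable reward for GSM8K might look like. It is an assumption for illustration, not the reward actually used in this project: the function names (`extract_gold`, `extract_prediction`, `reward`) are hypothetical, and the only fact relied on is that GSM8K reference solutions end with `#### <final answer>`.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_gold(answer: str) -> str:
    """Return the final answer from a GSM8K reference solution ("... #### 18")."""
    return answer.split("####")[-1].strip().replace(",", "")


def extract_prediction(completion: str):
    """Return the last number in the model output, or None (hypothetical heuristic)."""
    matches = _NUMBER.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the predicted number equals the reference, else 0.0."""
    prediction = extract_prediction(completion)
    if prediction is None:
        return 0.0
    try:
        return 1.0 if float(prediction) == float(extract_gold(gold_answer)) else 0.0
    except ValueError:
        # Non-numeric reference or malformed number: treat as incorrect.
        return 0.0
```

Because the reward is computed by a deterministic check rather than a learned model, it can be scored automatically for every sampled completion. SFT, by contrast, trains directly on the reference solutions.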

Result: GSM8K 69.7% → 81.6% while also improving MATH (49.2% → 52.3%).
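To put the reported scores in perspective, the short snippet below computes the absolute gain (in percentage points) and the relative gain for each benchmark. The `gains` helper is a hypothetical name introduced here; the numbers are the ones reported above.

```python
def gains(before: float, after: float):
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


# Scores reported in this model card (accuracy in percent).
results = {"GSM8K": (69.7, 81.6), "MATH": (49.2, 52.3)}

for name, (before, after) in results.items():
    abs_points, rel_percent = gains(before, after)
    print(f"{name}: +{abs_points} points ({rel_percent}% relative)")
```

This works out to roughly 11.9 points on GSM8K and 3.1 points on MATH.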
