RLVR vs SFT on Qwen2.5-1.5B-Instruct
A checkpoint of Qwen2.5-1.5B-Instruct trained with GRPO and SFT on the GSM8K dataset.
Part of a personal project comparing RLVR vs SFT training methods.
Result: GSM8K 69.7% → 81.6%, while also improving MATH (49.2% → 52.3%).
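For readers unfamiliar with RLVR, the "verifiable reward" on GSM8K can be as simple as checking the model's final number against the reference answer. The sketch below is illustrative only and is not taken from this project's training code: the function names and the "last number in the completion" extraction rule are assumptions. The one dataset fact it relies on is that GSM8K reference solutions end with `#### <final answer>`.

```python
import re
from typing import Optional

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_reference(answer: str) -> str:
    """Return the final answer from a GSM8K reference solution ('... #### 18')."""
    return answer.split("####")[-1].strip().replace(",", "")


def extract_prediction(completion: str) -> Optional[str]:
    """Assumed convention: treat the last number in the completion as the answer."""
    matches = _NUMBER.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def correctness_reward(completion: str, reference_answer: str) -> float:
    """Binary reward: 1.0 if the predicted number equals the reference, else 0.0."""
    pred = extract_prediction(completion)
    if pred is None:
        return 0.0
    try:
        ref = float(extract_reference(reference_answer))
        return 1.0 if float(pred) == ref else 0.0
    except ValueError:
        # Malformed reference or prediction: give no reward rather than crash.
        return 0.0


if __name__ == "__main__":
    reference = "She sells 9 eggs at $2 each. 9 * 2 = 18 #### 18"
    print(correctness_reward("9 eggs times $2 is $18.", reference))
    print(correctness_reward("I think the answer is 20.", reference))
```

A reward like this could be passed to a GRPO trainer as the scoring function, while SFT would instead train directly on the reference solutions; the same check can also be reused to score accuracy at evaluation time.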