RLVR vs SFT on Qwen2.5-1.5B-Instruct

A checkpoint of Qwen2.5-1.5B-Instruct trained with GRPO and SFT on the GSM8K dataset.

Part of a personal project comparing RLVR vs SFT training methods.
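
For readers unfamiliar with the RLVR side of the comparison, here is a minimal sketch of the two ideas behind GRPO with a verifiable reward: a rule-based correctness check, and advantages normalized within a group of sampled completions. It is an illustration under stated assumptions, not this project's training code. The function names, the "last number in the text" answer extraction, and the binary 0/1 reward are all hypothetical choices made for the example.

```python
import re
from statistics import mean, pstdev


def extract_final_number(text):
    """Return the last number found in text (commas stripped), or None.

    Assumption: the model's final answer is the last number it writes.
    """
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def verifiable_reward(completion, reference):
    """Binary reward: 1.0 if the extracted answer equals the reference, else 0.0."""
    pred = extract_final_number(completion)
    if pred is None:
        return 0.0
    return 1.0 if float(pred) == float(reference) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    completion is scored relative to its siblings rather than by a critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled completions for one GSM8K-style prompt.
completions = [
    "16 - 3 - 4 = 9 eggs, 9 * 2 = 18",
    "She sells 10 eggs, so she makes 20",
    "The answer is 18.",
    "I am not sure.",
]
rewards = [verifiable_reward(c, "18") for c in completions]
advantages = group_relative_advantages(rewards)
```

In this toy group, the two correct completions receive positive advantages and the two incorrect ones receive negative advantages. If every completion in a group earns the same reward, all advantages are zero and the group contributes no learning signal.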

Result: GSM8K improved from 69.7% to 81.6%, and MATH also improved (49.2% → 52.3%).
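
To put the two gains on the same footing, the short sketch below computes the absolute change in percentage points and the relative change for each benchmark. The scores are the ones reported above; the `gains` helper is a hypothetical name introduced only for this example.

```python
# Before/after scores (in percent) as reported in this model card.
scores = {
    "GSM8K": (69.7, 81.6),
    "MATH": (49.2, 52.3),
}


def gains(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


for name, (before, after) in scores.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute:.1f} points ({relative:.1f}% relative)")
```

This works out to +11.9 points on GSM8K (about 17.1% relative) and +3.1 points on MATH (about 6.3% relative).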
