gemma3-1b-gsm8k-grpo-reasoning

This model is a LoRA fine-tuned version of google/gemma-3-1b-it on the GSM8K dataset using GRPO (Group Relative Policy Optimization).

Model Details

  • Base Model: google/gemma-3-1b-it
  • Training Algorithm: GRPO (Reinforcement Learning)
  • Dataset: GSM8K (grade school math problems)
  • LoRA Rank: 32
  • LoRA Alpha: 32.0
  • Answer Accuracy: ~50% on the GSM8K test set
  • Format Accuracy: 98.5%+
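
How the two accuracy numbers above can be scored is sketched below. GSM8K stores each gold answer after a `#### ` marker; the exact answer format this model emits is an assumption here, so the extraction regex simply takes the last number in the completion:

```python
import re

def extract_final_answer(text: str):
    """Pull the last number from a model completion (output format is assumed)."""
    matches = re.findall(r"-?\d[\d.]*", text.replace(",", ""))
    return matches[-1] if matches else None

def gsm8k_reference_answer(answer_field: str) -> str:
    """GSM8K stores the gold answer after '#### ' in the answer field."""
    return answer_field.split("####")[-1].strip().replace(",", "")

def answer_accuracy(completions, references) -> float:
    """Fraction of completions whose final number matches the gold answer."""
    correct = sum(
        extract_final_answer(c) == gsm8k_reference_answer(r)
        for c, r in zip(completions, references)
    )
    return correct / len(references)
```

Format accuracy can be computed the same way, checking the output structure instead of the numeric answer.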

Training Configuration

  • Batch Size: 4
  • Learning Rate: 3e-06
  • Training Steps: 2000
  • TPU: v6e-8 (8 cores)
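
GRPO's core step is to standardize rewards within each group of completions sampled for the same prompt, so advantages are relative to the group rather than to a learned value baseline. A minimal sketch of that computation (function name and `eps` smoothing are illustrative, not this repo's training code):

```python
from statistics import fmean, pstdev

def group_relative_advantages(group_rewards, eps=1e-6):
    """Standardize each completion's reward against its group's mean and std.

    group_rewards: rewards for all completions sampled from one prompt.
    Returns one advantage per completion; positive means better than the
    group average, negative means worse.
    """
    mu = fmean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

These advantages then weight the policy-gradient update for each completion's tokens.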

Reward Functions

This model was trained with process-based rewards to encourage reasoning:

  1. Format Reward: Ensures proper output structure
  2. Answer Quality: Graduated rewards for correct/close answers
  3. Arithmetic Verification: Rewards showing explicit calculations
  4. Problem Decomposition: Rewards step-by-step reasoning
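
The first two rewards above can be sketched as follows. The tag layout checked by the format reward and the graduated thresholds are assumptions for illustration; the actual reward functions used in training may differ:

```python
import re

def format_reward(completion: str) -> float:
    """Reward a hypothetical <reasoning>...</reasoning><answer>...</answer>
    layout (the exact structure rewarded in training is an assumption)."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0

def answer_reward(predicted: float, gold: float) -> float:
    """Graduated reward: full credit for an exact match, partial credit
    when the prediction is within 10% of the gold answer."""
    if predicted == gold:
        return 1.0
    if gold != 0 and abs(predicted - gold) / abs(gold) < 0.1:
        return 0.25  # close but wrong
    return 0.0
```

The arithmetic-verification and decomposition rewards would similarly pattern-match explicit calculations (e.g. `a + b = c`) and numbered steps in the reasoning span.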

Usage

This is a LoRA checkpoint. To use it, you'll need to:

  1. Load the base Gemma-3-1b-it model
  2. Apply these LoRA weights using a library such as qwix (JAX) or PEFT (PyTorch)
```python
# Example usage (requires tunix/qwix)
from tunix.models.gemma3 import params, model
import qwix

# Load the base Gemma-3-1B-IT model
base_model = params.create_model_from_checkpoint(
    params.GEMMA3_1B_IT, model.ModelConfig.gemma3_1b()
)

# Apply LoRA (load from this checkpoint)
# ... (implementation depends on your framework)
```

Citation

```bibtex
@misc{gemma3-gsm8k-grpo,
    author = {sairambokka},
    title = {Gemma 3 1B Fine-tuned on GSM8K with GRPO},
    year = {2024},
    publisher = {HuggingFace},
    howpublished = {\url{https://huggingface.co/sairambokka/gemma3-1b-gsm8k-grpo-reasoning}}
}
```