# gemma3-1b-gsm8k-grpo-reasoning
This model is a LoRA fine-tuned version of google/gemma-3-1b-it on the GSM8K dataset using GRPO (Group Relative Policy Optimization).
## Model Details
- Base Model: google/gemma-3-1b-it
- Training Algorithm: GRPO (Reinforcement Learning)
- Dataset: GSM8K (grade school math problems)
- LoRA Rank: 32
- LoRA Alpha: 32.0
- Accuracy: ~50% on the GSM8K test set
- Format Accuracy: >98.5%
## Training Configuration
- Batch Size: 4
- Learning Rate: 3e-06
- Training Steps: 2000
- TPU: v6e-8 (8 cores)
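The core idea of GRPO is that several completions are sampled per prompt and each one is scored relative to its own group, rather than against a learned value model. A minimal sketch of the group-relative advantage computation (the function name and the 4-sample group below are illustrative, not taken from the training code):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each completion's reward against
    the mean and std of its sampled group, so completions compete only
    within the group drawn for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# e.g. rewards for four sampled completions of one GSM8K problem
adv = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
```

Completions scoring above the group mean get positive advantage (their tokens are reinforced); below-mean completions get negative advantage, with the batch of 4 here standing in for the per-prompt group size.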
## Reward Functions
This model was trained with process-based rewards to encourage reasoning:
- Format Reward: Ensures proper output structure
- Answer Quality: Graduated rewards for correct/close answers
- Arithmetic Verification: Rewards showing explicit calculations
- Problem Decomposition: Rewards step-by-step reasoning
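To make the first two rewards concrete, here is a hedged sketch of what a format reward and a graduated answer reward could look like. The `<reasoning>`/`<answer>` tag structure and the partial-credit value are assumptions for illustration; the card only states that structure and closeness were rewarded, not the exact scheme:

```python
import re

# Assumed output structure (hypothetical): reasoning followed by a final answer tag.
FORMAT_RE = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion matches the expected structure, else 0.0."""
    return 1.0 if FORMAT_RE.search(completion) else 0.0

def answer_reward(completion: str, target: str) -> float:
    """Graduated reward: full credit for an exact match, small partial
    credit if the answer at least parses as a number (a stand-in for
    the 'close answer' tier described above)."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not m:
        return 0.0
    guess = m.group(1).strip()
    if guess == target.strip():
        return 1.0
    try:
        float(guess)
        return 0.1  # well-formed number, wrong value
    except ValueError:
        return 0.0
```

In GRPO these per-completion scores would be summed and fed into the group-relative advantage normalization.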
## Usage
This is a LoRA checkpoint. To use it, you'll need to:
- Load the base google/gemma-3-1b-it model
- Apply these LoRA weights using a library such as `qwix` or `peft`
```python
# Example usage (requires tunix/qwix)
from tunix.models.gemma3 import model, params
import qwix

# Load the base Gemma 3 1B IT model from its checkpoint
base_model = params.create_model_from_checkpoint(
    params.GEMMA3_1B_IT, model.ModelConfig.gemma3_1b()
)

# Apply LoRA (load from this checkpoint)
# ... (implementation depends on your framework)
```
## Citation
```bibtex
@misc{gemma3-gsm8k-grpo,
  author       = {sairambokka},
  title        = {Gemma 3 1B Fine-tuned on GSM8K with GRPO},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/sairambokka/gemma3-1b-gsm8k-grpo-reasoning}}
}
```