gemma3-1b-gsm8k-grpo-reasoning

This model is a LoRA fine-tuned version of google/gemma-3-1b-it on the GSM8K dataset using GRPO (Group Relative Policy Optimization).

Model Details

  • Base Model: google/gemma-3-1b-it
  • Training Algorithm: GRPO (Reinforcement Learning)
  • Dataset: GSM8K (grade school math problems)
  • LoRA Rank: 32
  • LoRA Alpha: 32.0
  • Answer Accuracy: ~50% on the GSM8K test set
  • Format Accuracy: 98.5%+
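
How the two accuracy numbers above can be scored is sketched below. GSM8K stores each gold answer after a `#### ` marker; the exact answer format this model emits is an assumption here, so the extraction regex simply takes the last number in the completion:

```python
import re

def extract_final_answer(text: str):
    """Pull the last number from a model completion (output format is assumed)."""
    matches = re.findall(r"-?\d[\d.]*", text.replace(",", ""))
    return matches[-1] if matches else None

def gsm8k_reference_answer(answer_field: str) -> str:
    """GSM8K stores the gold answer after '#### ' in the answer field."""
    return answer_field.split("####")[-1].strip().replace(",", "")

def answer_accuracy(completions, references) -> float:
    """Fraction of completions whose final number matches the gold answer."""
    correct = sum(
        extract_final_answer(c) == gsm8k_reference_answer(r)
        for c, r in zip(completions, references)
    )
    return correct / len(references)
```

Format accuracy can be computed the same way, checking the output structure instead of the numeric answer.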

Training Configuration

  • Batch Size: 4
  • Learning Rate: 3e-06
  • Training Steps: 2000
  • TPU: v6e-8 (8 cores)
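
GRPO's core step is to standardize rewards within each group of completions sampled for the same prompt, so advantages are relative to the group rather than to a learned value baseline. A minimal sketch of that computation (function name and `eps` smoothing are illustrative, not this repo's training code):

```python
from statistics import fmean, pstdev

def group_relative_advantages(group_rewards, eps=1e-6):
    """Standardize each completion's reward against its group's mean and std.

    group_rewards: rewards for all completions sampled from one prompt.
    Returns one advantage per completion; positive means better than the
    group average, negative means worse.
    """
    mu = fmean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

These advantages then weight the policy-gradient update for each completion's tokens.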

Reward Functions

This model was trained with process-based rewards to encourage reasoning:

  1. Format Reward: Ensures proper output structure
  2. Answer Quality: Graduated rewards for correct/close answers
  3. Arithmetic Verification: Rewards showing explicit calculations
  4. Problem Decomposition: Rewards step-by-step reasoning
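
The first two rewards above can be sketched as follows. The tag layout checked by the format reward and the graduated thresholds are assumptions for illustration; the actual reward functions used in training may differ:

```python
import re

def format_reward(completion: str) -> float:
    """Reward a hypothetical <reasoning>...</reasoning><answer>...</answer>
    layout (the exact structure rewarded in training is an assumption)."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0

def answer_reward(predicted: float, gold: float) -> float:
    """Graduated reward: full credit for an exact match, partial credit
    when the prediction is within 10% of the gold answer."""
    if predicted == gold:
        return 1.0
    if gold != 0 and abs(predicted - gold) / abs(gold) < 0.1:
        return 0.25  # close but wrong
    return 0.0
```

The arithmetic-verification and decomposition rewards would similarly pattern-match explicit calculations (e.g. `a + b = c`) and numbered steps in the reasoning span.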

Usage

This is a LoRA checkpoint. To use it, you'll need to:

  1. Load the base Gemma-3-1b-it model
  2. Apply these LoRA weights using a library such as qwix (JAX) or PEFT (PyTorch)
```python
# Example usage (requires tunix/qwix)
from tunix.models.gemma3 import params, model
import qwix

# Load the base Gemma-3-1B-IT model
base_model = params.create_model_from_checkpoint(
    params.GEMMA3_1B_IT, model.ModelConfig.gemma3_1b()
)

# Apply LoRA (load from this checkpoint)
# ... (implementation depends on your framework)
```

Citation

```bibtex
@misc{gemma3-gsm8k-grpo,
    author = {sairambokka},
    title = {Gemma 3 1B Fine-tuned on GSM8K with GRPO},
    year = {2024},
    publisher = {HuggingFace},
    howpublished = {\url{https://huggingface.co/sairambokka/gemma3-1b-gsm8k-grpo-reasoning}}
}
```