# B1ade-1B-GRPO
A reasoning-optimized Llama-3.2-1B model trained with GRPO (Group Relative Policy Optimization) on chain-of-thought data.
## Model Description
B1ade-1B-GRPO is a 1-billion-parameter language model fine-tuned with reinforcement learning (GRPO) to improve reasoning and chain-of-thought capabilities. GRPO is applied directly to Llama-3.2-1B-Instruct without an additional supervised fine-tuning stage, using LoRA for parameter-efficient training.
- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Training Method: GRPO (Group Relative Policy Optimization) with LoRA
- Parameters: ~1B total, ~2.4M trainable (LoRA adapters)
- Training Data: 50K chain-of-thought examples from simplecot dataset
- Reward Function: ROUGE-L F1 score against reference answers
- License: Llama 3.2 Community License
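The ROUGE-L F1 reward above is an F-measure over the longest common subsequence (LCS) between a rollout and the reference answer. A minimal sketch, assuming whitespace tokenization (the tokenization used in the actual training script may differ):

```python
def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1: F-measure over the LCS of prediction and reference tokens."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return 0.0
    # Dynamic-programming table for LCS length
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, r in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if p == r else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    precision, recall = lcs / len(pred), lcs / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A reward of 1.0 means the completion reproduces the reference exactly (up to tokenization); partial overlap yields a graded score, which gives GRPO a dense signal to rank rollouts against each other.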
## Intended Use
This model is designed for:
- Chain-of-thought reasoning tasks
- Step-by-step problem solving
- Educational question answering
- Lightweight reasoning applications where efficiency matters
## Training Details

### Training Configuration

**LoRA Settings:**
- Rank: 32
- Alpha: 64
- Dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
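The LoRA settings above map onto a PEFT `LoraConfig` roughly as follows. This is a sketch, not the exact training script:

```python
from peft import LoraConfig

# LoRA configuration matching the settings listed above
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```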
**Training Hyperparameters:**
- Learning rate: 1e-5
- LR schedule: Cosine with 1% warmup
- Batch size: 16 per device
- Gradient accumulation: 1 step
- Max steps: 5000
- Optimizer: AdamW (β1=0.9, β2=0.99, weight_decay=0.1)
- Max gradient norm: 0.1
- Precision: bfloat16
**GRPO Settings:**
- Number of generations: 8 rollouts per prompt
- Max prompt length: 256 tokens
- Max completion length: 256 tokens
- Generation backend: vLLM at 60% GPU memory utilization
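Taken together, the hyperparameters and GRPO settings above correspond roughly to a TRL `GRPOConfig` like the following. Parameter names follow TRL's API as of the version listed below and should be checked against the installed release; this is a sketch, not the exact training configuration:

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="b1ade-1b-grpo",
    # Optimization
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    max_steps=5000,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    max_grad_norm=0.1,
    bf16=True,
    # GRPO rollouts
    num_generations=8,
    max_prompt_length=256,
    max_completion_length=256,
    # Generation backend
    use_vllm=True,
    vllm_gpu_memory_utilization=0.6,
    # Checkpointing
    save_steps=500,
)
```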
### Training Data

**Dataset:** `w601sxs/simplecot_subset_50k`
- 50,000 chain-of-thought reasoning examples
- Shuffled with seed 42
- Format: Question → Step-by-step reasoning → Answer
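A record in the Question → reasoning → Answer format might look like the following. The field names and the example itself are illustrative, not taken from the dataset:

```python
# Hypothetical record layout for a chain-of-thought example
example = {
    "question": "A train travels 60 km in 1.5 hours. What is its average speed?",
    "reasoning": "Average speed = distance / time = 60 km / 1.5 h = 40 km/h.",
    "answer": "40 km/h",
}

# Assemble a prompt/completion pair for rollouts (illustrative formatting)
prompt = example["question"] + " Think step by step."
completion = example["reasoning"] + "\nAnswer: " + example["answer"]
```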
### Training Infrastructure
- Hardware: NVIDIA A10G 24GB GPU (AWS SageMaker)
- Training Time: ~8.5 hours for 5000 steps
- Speed: ~6.4 seconds per training step
- Framework versions:
  - PyTorch 2.5.1
  - Transformers 4.57.0
  - TRL 0.15.2
  - PEFT 0.18.1
  - vLLM 0.6.6
## Checkpoints
Model checkpoints were saved every 500 steps. The final checkpoint at step 5000 is the published model.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the GRPO-trained LoRA adapters
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "w601sxs/b1ade-1b-grpo")
tokenizer = AutoTokenizer.from_pretrained("w601sxs/b1ade-1b-grpo")

# Generate a step-by-step answer
prompt = "What is 25 times 4? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Limitations
- Designed for reasoning tasks; may not excel at other capabilities
- 1B parameter model has inherent limitations compared to larger models
- Trained on English data only
- May produce hallucinations or incorrect reasoning steps
## Citation

```bibtex
@misc{b1ade-1b-grpo,
  author       = {w601sxs},
  title        = {B1ade-1B-GRPO: GRPO-trained Llama 1B for Reasoning},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/w601sxs/b1ade-1b-grpo}}
}
```
## Model Card Authors
w601sxs
## Acknowledgments
- Base model: Meta's Llama-3.2-1B-Instruct
- Training framework: Hugging Face TRL for GRPO implementation
- Dataset: SimpleCOT reasoning dataset