# B1ade-1B-GRPO
A reasoning-optimized Llama-3.2-1B model trained with GRPO (Group Relative Policy Optimization) on chain-of-thought data.
## Model Description
B1ade-1B-GRPO is a 1-billion-parameter language model fine-tuned with reinforcement learning (GRPO) to improve reasoning and chain-of-thought capabilities. GRPO is applied directly to Llama-3.2-1B-Instruct without an additional supervised fine-tuning stage, using LoRA for parameter-efficient training.
- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Training Method: GRPO (Group Relative Policy Optimization) with LoRA
- Parameters: ~1B total, ~2.4M trainable (LoRA adapters)
- Training Data: 50K chain-of-thought examples from simplecot dataset
- Reward Function: ROUGE-L F1 score against reference answers
- License: Llama 3.2 Community License
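The ROUGE-L F1 reward above is an F-measure over the longest common subsequence (LCS) between a rollout and the reference answer. A minimal sketch, assuming whitespace tokenization (the tokenization used in the actual training script may differ):

```python
def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1: F-measure over the LCS of prediction and reference tokens."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return 0.0
    # Dynamic-programming table for LCS length
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, r in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if p == r else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    precision, recall = lcs / len(pred), lcs / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A reward of 1.0 means the completion reproduces the reference exactly (up to tokenization); partial overlap yields a graded score, which gives GRPO a dense signal to rank rollouts against each other.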
## Intended Use
This model is designed for:
- Chain-of-thought reasoning tasks
- Step-by-step problem solving
- Educational question answering
- Lightweight reasoning applications where efficiency matters
## Training Details

### Training Configuration

**LoRA Settings:**
- Rank: 32
- Alpha: 64
- Dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
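The LoRA settings above map onto a PEFT `LoraConfig` roughly as follows. This is a sketch, not the exact training script:

```python
from peft import LoraConfig

# LoRA configuration matching the settings listed above
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```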
**Training Hyperparameters:**
- Learning rate: 1e-5
- LR schedule: Cosine with 1% warmup
- Batch size: 16 per device
- Gradient accumulation: 1 step
- Max steps: 5000
- Optimizer: AdamW (β1=0.9, β2=0.99, weight_decay=0.1)
- Max gradient norm: 0.1
- Precision: bfloat16
**GRPO Settings:**
- Number of generations: 8 rollouts per prompt
- Max prompt length: 256 tokens
- Max completion length: 256 tokens
- Generation backend: vLLM at 60% GPU memory utilization
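Taken together, the hyperparameters and GRPO settings above correspond roughly to a TRL `GRPOConfig` like the following. Parameter names follow TRL's API as of the version listed below and should be checked against the installed release; this is a sketch, not the exact training configuration:

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="b1ade-1b-grpo",
    # Optimization
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    max_steps=5000,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    max_grad_norm=0.1,
    bf16=True,
    # GRPO rollouts
    num_generations=8,
    max_prompt_length=256,
    max_completion_length=256,
    # Generation backend
    use_vllm=True,
    vllm_gpu_memory_utilization=0.6,
    # Checkpointing
    save_steps=500,
)
```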
### Training Data

**Dataset:** `w601sxs/simplecot_subset_50k`
- 50,000 chain-of-thought reasoning examples
- Shuffled with seed 42
- Format: Question → Step-by-step reasoning → Answer
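A record in the Question → reasoning → Answer format might look like the following. The field names and the example itself are illustrative, not taken from the dataset:

```python
# Hypothetical record layout for a chain-of-thought example
example = {
    "question": "A train travels 60 km in 1.5 hours. What is its average speed?",
    "reasoning": "Average speed = distance / time = 60 km / 1.5 h = 40 km/h.",
    "answer": "40 km/h",
}

# Assemble a prompt/completion pair for rollouts (illustrative formatting)
prompt = example["question"] + " Think step by step."
completion = example["reasoning"] + "\nAnswer: " + example["answer"]
```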
### Training Infrastructure
- Hardware: NVIDIA A10G 24GB GPU (AWS SageMaker)
- Training Time: ~8.5 hours for 5000 steps
- Speed: ~6.4 seconds per training step
- Framework versions:
  - PyTorch 2.5.1
  - Transformers 4.57.0
  - TRL 0.15.2
  - PEFT 0.18.1
  - vLLM 0.6.6
## Checkpoints
Model checkpoints were saved every 500 steps. The final checkpoint at step 5000 is the published model.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the GRPO-trained LoRA adapters
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "w601sxs/b1ade-1b-grpo")
tokenizer = AutoTokenizer.from_pretrained("w601sxs/b1ade-1b-grpo")

# Generate a step-by-step answer
prompt = "What is 25 times 4? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Limitations
- Designed for reasoning tasks; may not excel at other capabilities
- 1B parameter model has inherent limitations compared to larger models
- Trained on English data only
- May produce hallucinations or incorrect reasoning steps
## Citation

```bibtex
@misc{b1ade-1b-grpo,
  author       = {w601sxs},
  title        = {B1ade-1B-GRPO: GRPO-trained Llama 1B for Reasoning},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/w601sxs/b1ade-1b-grpo}}
}
```
## Model Card Authors
w601sxs
## Acknowledgments
- Base model: Meta's Llama-3.2-1B-Instruct
- Training framework: Hugging Face TRL for GRPO implementation
- Dataset: SimpleCOT reasoning dataset