B1ade-1B-GRPO

A reasoning-optimized Llama-3.2-1B model trained with GRPO (Group Relative Policy Optimization) on chain-of-thought data.

Model Description

B1ade-1B-GRPO is a 1-billion-parameter language model fine-tuned with reinforcement learning (GRPO) to improve reasoning and chain-of-thought capabilities. It is trained directly from Llama-3.2-1B-Instruct, skipping an intermediate supervised fine-tuning step, and uses LoRA for parameter-efficient training.
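
The group-relative part of GRPO can be sketched as follows: each prompt gets several sampled completions, and each completion's reward is normalized against its own group's statistics to form an advantage. This is an illustrative sketch only; the exact normalization in the training code may differ.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Normalize rewards within one group of rollouts (illustrative GRPO step).

    A completion's advantage is its reward minus the group mean, divided by
    the group standard deviation (small epsilon avoids division by zero).
    """
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Example: 8 rollouts for one prompt, each scored by the reward function
rewards = [0.2, 0.5, 0.9, 0.1, 0.4, 0.6, 0.3, 0.8]
advs = group_relative_advantages(rewards)
```

Because the advantage is relative to the group, no separate value network is needed; completions that beat their siblings get positive advantages and the rest get negative ones.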

  • Base Model: meta-llama/Llama-3.2-1B-Instruct
  • Training Method: GRPO (Group Relative Policy Optimization) with LoRA
  • Parameters: ~1B total, ~2.4M trainable (LoRA adapters)
  • Training Data: 50K chain-of-thought examples from the SimpleCOT dataset
  • Reward Function: ROUGE-L F1 score against reference answers
  • License: Llama 3.2 Community License
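
The ROUGE-L F1 reward can be sketched in plain Python as the LCS-based F-measure between a generated completion and the reference answer. This is an illustrative re-implementation with whitespace tokenization, not the exact reward code used in training.

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between whitespace-tokenized strings (illustrative reward)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A completion that restates the reference exactly scores 1.0; extra or missing tokens lower precision or recall respectively.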

Intended Use

This model is designed for:

  • Chain-of-thought reasoning tasks
  • Step-by-step problem solving
  • Educational question answering
  • Lightweight reasoning applications where efficiency matters

Training Details

Training Configuration

LoRA Settings:

  • Rank: 32
  • Alpha: 64
  • Dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
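
These settings translate into a peft LoraConfig along the following lines. This is a sketch, not the exact training code.

```python
from peft import LoraConfig

# LoRA settings as listed above; task_type marks this as causal-LM tuning.
lora_config = LoraConfig(
    r=32,                # rank
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```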

Training Hyperparameters:

  • Learning rate: 1e-5
  • LR schedule: Cosine with 1% warmup
  • Batch size: 16 per device
  • Gradient accumulation: 1 step
  • Max steps: 5000
  • Optimizer: AdamW (β1=0.9, β2=0.99, weight_decay=0.1)
  • Max gradient norm: 0.1
  • Precision: bfloat16

GRPO Settings:

  • Number of generations: 8 rollouts per prompt
  • Max prompt length: 256 tokens
  • Max completion length: 256 tokens
  • Generation backend: vLLM at 60% GPU memory utilization
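
Taken together, the hyperparameters and GRPO settings above correspond roughly to a TRL GRPOConfig like the one below. This is a sketch against the TRL 0.15.x API listed under framework versions; output_dir is illustrative, not the actual training script.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="b1ade-1b-grpo",          # illustrative path
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    max_steps=5000,
    adam_beta2=0.99,
    weight_decay=0.1,
    max_grad_norm=0.1,
    bf16=True,
    # GRPO-specific settings
    num_generations=8,
    max_prompt_length=256,
    max_completion_length=256,
    use_vllm=True,
    vllm_gpu_memory_utilization=0.6,
)
```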

Training Data

Dataset: w601sxs/simplecot_subset_50k

  • 50,000 chain-of-thought reasoning examples
  • Shuffled with seed 42
  • Format: Question → Step-by-step reasoning → Answer
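
One way such records can be turned into GRPO prompt/reference pairs is sketched below. The field names and the appended "Think step by step." instruction are illustrative assumptions, not the dataset's actual schema.

```python
def to_grpo_example(record):
    """Split a CoT record into a prompt and a reference for the reward.

    Field names ("question", "reasoning", "answer") are hypothetical.
    """
    prompt = f"{record['question']} Think step by step."
    reference = f"{record['reasoning']} {record['answer']}"
    return {"prompt": prompt, "reference": reference}

example = to_grpo_example({
    "question": "What is 25 times 4?",
    "reasoning": "25 times 4 is 25 added four times: 25 + 25 + 25 + 25 = 100.",
    "answer": "100",
})
```

The prompt is what the policy sees at rollout time; the reference is what the ROUGE-L reward compares each completion against.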

Training Infrastructure

  • Hardware: NVIDIA A10G 24GB GPU (AWS SageMaker)
  • Training Time: ~8.5 hours for 5000 steps
  • Speed: ~6.4 seconds per training step
  • Framework versions:
    • PyTorch 2.5.1
    • Transformers 4.57.0
    • TRL 0.15.2
    • PEFT 0.18.1
    • vLLM 0.6.6

Checkpoints

Model checkpoints were saved every 500 steps. The final checkpoint at step 5000 is the published model.

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "w601sxs/b1ade-1b-grpo")
tokenizer = AutoTokenizer.from_pretrained("w601sxs/b1ade-1b-grpo")

# Generate
prompt = "What is 25 times 4? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
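
To deploy without a peft dependency at inference time, the LoRA adapters can be merged into the base weights first. This is a sketch; the merged output path is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = PeftModel.from_pretrained(base, "w601sxs/b1ade-1b-grpo")

# Fold the LoRA deltas into the base weights; the result no longer needs peft.
merged = model.merge_and_unload()
merged.save_pretrained("b1ade-1b-grpo-merged")  # illustrative output path

tokenizer = AutoTokenizer.from_pretrained("w601sxs/b1ade-1b-grpo")
tokenizer.save_pretrained("b1ade-1b-grpo-merged")
```

The merged directory can then be loaded with AutoModelForCausalLM.from_pretrained alone.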

Limitations

  • Designed for reasoning tasks; may not excel at other capabilities
  • 1B parameter model has inherent limitations compared to larger models
  • Trained on English data only
  • May produce hallucinations or incorrect reasoning steps

Citation

@misc{b1ade-1b-grpo,
  author = {w601sxs},
  title = {B1ade-1B-GRPO: GRPO-trained Llama 1B for Reasoning},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/w601sxs/b1ade-1b-grpo}}
}

Model Card Authors

w601sxs

Acknowledgments

  • Base model: Meta's Llama-3.2-1B-Instruct
  • Training framework: Hugging Face TRL for GRPO implementation
  • Dataset: SimpleCOT reasoning dataset