Qwen2.5-7B-Instruct SFT+GRPO - AIME

GRPO-trained from tinyllms/qwen2.5-7b-instruct-sft-aime (itself SFT'd from Qwen/Qwen2.5-7B-Instruct) using QLoRA (4-bit NF4 quantization + LoRA adapters).

Training Configuration

  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Learning rate: 5e-6
  • Batch size: 2 per device
  • Group size: 2 (completions per prompt)
  • Max steps: 20
  • Max completion length: 2048
  • Precision: bf16
  • Gradient checkpointing: enabled
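The hyper-parameters above map directly onto TRL's GRPO trainer configuration. A minimal sketch, assuming TRL's `GRPOConfig` API (the output directory name is hypothetical; the actual training script is not shown here):

```python
from trl import GRPOConfig

# Sketch of the training configuration listed above.
config = GRPOConfig(
    output_dir="qwen2.5-7b-instruct-sft-aime-grpo",  # assumed name
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    num_generations=2,           # group size: completions per prompt
    max_steps=20,
    max_completion_length=2048,
    bf16=True,
    gradient_checkpointing=True,
)
```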

QLoRA

  • Quantization: 4-bit NF4 with double quantization
  • LoRA rank: 64
  • LoRA alpha: 128
  • LoRA dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
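These QLoRA settings correspond to a standard `BitsAndBytesConfig` plus `LoraConfig` pair. A sketch of the two objects, assuming the compute dtype is bf16 (inferred from the precision listed above, not stated explicitly):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed from bf16 precision
)

# LoRA adapters on all attention and MLP projections.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```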

Reward Functions

  • correctness_reward: task-specific answer verification (binary 0/1)
  • format_reward: checks for proper <answer> tag formatting, penalises excessive length
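The two reward functions could look roughly like the following. This is a hypothetical sketch: the tag-extraction regex, the length threshold, and the penalty value are assumptions, not the exact training code.

```python
import re

# Matches the content of a single <answer>...</answer> tag.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def correctness_reward(completion: str, gold_answer: str) -> float:
    """Binary 0/1: extract the <answer> tag and compare to the reference."""
    m = ANSWER_RE.search(completion)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold_answer.strip() else 0.0

def format_reward(completion: str, max_len: int = 2048) -> float:
    """Reward a well-formed <answer> tag; penalise excessive length."""
    score = 1.0 if ANSWER_RE.search(completion) else 0.0
    if len(completion) > max_len:  # assumed length penalty
        score -= 0.5
    return score
```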

Dataset

Trained on tinyllms/aime-1983-2023-trajectories (204 prompts).

Infrastructure

  • GPU: NVIDIA H100 80GB (2 workers, STRICT_PACK)
  • Framework: TRL + Ray Train (DDP)
  • Tracking: Weights & Biases (project: pocket-sheet-grpo)
  • Ray Job ID: raysubmit_iHjwAgfrbFLnsCgN
