Qwen2.5-7B-Instruct SFT+GRPO - AIME

GRPO-trained from tinyllms/qwen2.5-7b-instruct-sft-aime (itself SFT'd from Qwen/Qwen2.5-7B-Instruct) using QLoRA (4-bit NF4 quantization + LoRA adapters).

Training Configuration

  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Learning rate: 5e-6
  • Batch size: 2 per device
  • Group size: 2 (completions per prompt)
  • Max steps: 20
  • Max completion length: 2048
  • Precision: bf16
  • Gradient checkpointing: enabled
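The hyper-parameters above map directly onto TRL's GRPO trainer configuration. A minimal sketch, assuming TRL's `GRPOConfig` API (the output directory name is hypothetical; the actual training script is not shown here):

```python
from trl import GRPOConfig

# Sketch of the training configuration listed above.
config = GRPOConfig(
    output_dir="qwen2.5-7b-instruct-sft-aime-grpo",  # assumed name
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    num_generations=2,           # group size: completions per prompt
    max_steps=20,
    max_completion_length=2048,
    bf16=True,
    gradient_checkpointing=True,
)
```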

QLoRA

  • Quantization: 4-bit NF4 with double quantization
  • LoRA rank: 64
  • LoRA alpha: 128
  • LoRA dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
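These QLoRA settings correspond to a standard `BitsAndBytesConfig` plus `LoraConfig` pair. A sketch of the two objects, assuming the compute dtype is bf16 (inferred from the precision listed above, not stated explicitly):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed from bf16 precision
)

# LoRA adapters on all attention and MLP projections.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```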

Reward Functions

  • correctness_reward: task-specific answer verification (binary 0/1)
  • format_reward: checks for proper <answer> tag formatting, penalises excessive length
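The two reward functions could look roughly like the following. This is a hypothetical sketch: the tag-extraction regex, the length threshold, and the penalty value are assumptions, not the exact training code.

```python
import re

# Matches the content of a single <answer>...</answer> tag.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def correctness_reward(completion: str, gold_answer: str) -> float:
    """Binary 0/1: extract the <answer> tag and compare to the reference."""
    m = ANSWER_RE.search(completion)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold_answer.strip() else 0.0

def format_reward(completion: str, max_len: int = 2048) -> float:
    """Reward a well-formed <answer> tag; penalise excessive length."""
    score = 1.0 if ANSWER_RE.search(completion) else 0.0
    if len(completion) > max_len:  # assumed length penalty
        score -= 0.5
    return score
```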

Dataset

Trained on tinyllms/aime-1983-2023-trajectories (204 prompts).

Infrastructure

  • GPU: NVIDIA H100 80GB (2 workers, STRICT_PACK)
  • Framework: TRL + Ray Train (DDP)
  • Tracking: Weights & Biases (project: pocket-sheet-grpo)
  • Ray Job ID: raysubmit_iHjwAgfrbFLnsCgN
