Qwen2.5-7B-Instruct SFT+GRPO — LOO Domain Knowledge (12 steps)

GRPO-trained from tinyllms/qwen2.5-7b-instruct-sft-loo-domain-knowledge (itself SFT'd from Qwen/Qwen2.5-7B-Instruct) using QLoRA (4-bit NF4 quantization + LoRA adapters).

This is a capped variant (max 12 steps) of the GRPO stage of a leave-one-out (LOO) experiment: the model is trained on Game24 and AIME trajectories, deliberately excluding domain knowledge (GPQA) data. The held-out domain is later used to measure cross-domain transfer.

Training Configuration

  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Learning rate: 5e-6
  • Batch size: 2 per device
  • Group size: 2 (completions per prompt)
  • Max steps: 12
  • Max completion length: 2048
  • Precision: bf16
  • Gradient checkpointing: enabled
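The hyperparameters above can be sketched as a TRL `GRPOConfig`. This is a minimal reconstruction, not the actual training script: field values not listed in the card (and the output directory name) are assumptions, and TRL defaults fill in everything else.

```python
from trl import GRPOConfig

# Sketch of the training arguments listed above (hypothetical reconstruction).
config = GRPOConfig(
    output_dir="qwen2.5-7b-instruct-grpo-loo-domain-knowledge",  # assumed name
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    num_generations=2,          # group size: completions sampled per prompt
    max_steps=12,               # capped run
    max_completion_length=2048,
    bf16=True,
    gradient_checkpointing=True,
)
```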

QLoRA

  • Quantization: 4-bit NF4 with double quantization
  • LoRA rank: 64
  • LoRA alpha: 128
  • LoRA dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
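The QLoRA settings map onto the standard `BitsAndBytesConfig` + `LoraConfig` pair from `transformers` and `peft`. A sketch under the assumption that unlisted options (e.g. compute dtype) follow common bf16 practice:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed; matches bf16 training
)

# LoRA adapters on all attention and MLP projections.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```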

Reward Functions

  • correctness_reward: task-specific answer verification (binary 0/1)
  • format_reward: checks for proper <answer> tag formatting and penalises excessive length
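A minimal sketch of the two reward functions described above. The actual implementations are not published; the matching logic, the 0.5 reward components, and the length threshold are assumptions chosen for illustration.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
MAX_LEN = 2048  # assumed to match the max completion length


def correctness_reward(completion: str, gold: str) -> float:
    """Binary 0/1 reward: 1 if the extracted answer matches the gold answer."""
    m = ANSWER_RE.search(completion)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold.strip() else 0.0


def format_reward(completion: str) -> float:
    """Reward well-formed <answer> tags; penalise excessive length."""
    reward = 0.0
    if ANSWER_RE.search(completion):
        reward += 0.5          # assumed magnitude
    if len(completion) > MAX_LEN:
        reward -= 0.5          # assumed length penalty
    return reward
```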

Datasets

Trained on two datasets (domain knowledge held out):

| Dataset | Domain |
|---|---|
| tinyllms/game24-trajectories | Game of 24 — arithmetic reasoning |
| tinyllms/aime-1983-2023-trajectories | AIME — competition math |

Leave-One-Out Design

| Domain | Role |
|---|---|
| Game24 | Train |
| AIME | Train |
| Domain Knowledge (GPQA) | Held out |

Transfer is measured by evaluating on GPQA Diamond.
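The transfer metric reduces to exact-match accuracy on the held-out GPQA Diamond items. A sketch, assuming answers are extracted from the same `<answer>` tags enforced by the format reward (the real evaluation harness may differ):

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion: str):
    """Pull the answer out of <answer>...</answer>, or None if absent."""
    m = ANSWER_RE.search(completion)
    return m.group(1).strip() if m else None


def transfer_accuracy(completions, golds):
    """Exact-match accuracy on the held-out domain."""
    hits = sum(extract_answer(c) == g.strip()
               for c, g in zip(completions, golds))
    return hits / len(golds)
```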

Infrastructure

  • GPU: 2× NVIDIA H100 80GB
  • Framework: TRL + Ray Train (DDP)
  • Tracking: Weights & Biases (project: pocket-sheet-grpo)
  • Ray Job ID: raysubmit_YbfK7VEEYMGjEp9h