Qwen2.5-7B-Instruct SFT+GRPO — LOO Domain Knowledge (12 steps)

GRPO-trained from tinyllms/qwen2.5-7b-instruct-sft-loo-domain-knowledge (itself SFT'd from Qwen/Qwen2.5-7B-Instruct) using QLoRA (4-bit NF4 quantization + LoRA adapters).

This is a capped variant (max 12 steps) of the GRPO stage of a leave-one-out (LOO) experiment: the model is trained on Game24 and AIME trajectories, deliberately excluding domain knowledge (GPQA) data. The held-out domain is later used to measure cross-domain transfer.

Training Configuration

  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Learning rate: 5e-6
  • Batch size: 2 per device
  • Group size: 2 (completions per prompt)
  • Max steps: 12
  • Max completion length: 2048
  • Precision: bf16
  • Gradient checkpointing: enabled
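The hyperparameters above can be sketched as a TRL `GRPOConfig`. This is a minimal reconstruction, not the actual training script: field values not listed in the card (and the output directory name) are assumptions, and TRL defaults fill in everything else.

```python
from trl import GRPOConfig

# Sketch of the training arguments listed above (hypothetical reconstruction).
config = GRPOConfig(
    output_dir="qwen2.5-7b-instruct-grpo-loo-domain-knowledge",  # assumed name
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    num_generations=2,          # group size: completions sampled per prompt
    max_steps=12,               # capped run
    max_completion_length=2048,
    bf16=True,
    gradient_checkpointing=True,
)
```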

QLoRA

  • Quantization: 4-bit NF4 with double quantization
  • LoRA rank: 64
  • LoRA alpha: 128
  • LoRA dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
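The QLoRA settings map onto the standard `BitsAndBytesConfig` + `LoraConfig` pair from `transformers` and `peft`. A sketch under the assumption that unlisted options (e.g. compute dtype) follow common bf16 practice:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed; matches bf16 training
)

# LoRA adapters on all attention and MLP projections.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```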

Reward Functions

  • correctness_reward: task-specific answer verification (binary 0/1)
  • format_reward: checks for proper <answer> tag formatting and penalises excessive length
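A minimal sketch of the two reward functions described above. The actual implementations are not published; the matching logic, the 0.5 reward components, and the length threshold are assumptions chosen for illustration.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
MAX_LEN = 2048  # assumed to match the max completion length


def correctness_reward(completion: str, gold: str) -> float:
    """Binary 0/1 reward: 1 if the extracted answer matches the gold answer."""
    m = ANSWER_RE.search(completion)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold.strip() else 0.0


def format_reward(completion: str) -> float:
    """Reward well-formed <answer> tags; penalise excessive length."""
    reward = 0.0
    if ANSWER_RE.search(completion):
        reward += 0.5          # assumed magnitude
    if len(completion) > MAX_LEN:
        reward -= 0.5          # assumed length penalty
    return reward
```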

Datasets

Trained on two datasets (domain knowledge held out):

| Dataset | Domain |
|---|---|
| tinyllms/game24-trajectories | Game of 24 — arithmetic reasoning |
| tinyllms/aime-1983-2023-trajectories | AIME — competition math |

Leave-One-Out Design

| Domain | Role |
|---|---|
| Game24 | Train |
| AIME | Train |
| Domain Knowledge (GPQA) | Held out |

Transfer is measured by evaluating on GPQA Diamond.
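The transfer metric reduces to exact-match accuracy on the held-out GPQA Diamond items. A sketch, assuming answers are extracted from the same `<answer>` tags enforced by the format reward (the real evaluation harness may differ):

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion: str):
    """Pull the answer out of <answer>...</answer>, or None if absent."""
    m = ANSWER_RE.search(completion)
    return m.group(1).strip() if m else None


def transfer_accuracy(completions, golds):
    """Exact-match accuracy on the held-out domain."""
    hits = sum(extract_answer(c) == g.strip()
               for c, g in zip(completions, golds))
    return hits / len(golds)
```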

Infrastructure

  • GPU: 2× NVIDIA H100 80GB
  • Framework: TRL + Ray Train (DDP)
  • Tracking: Weights & Biases (project: pocket-sheet-grpo)
  • Ray Job ID: raysubmit_YbfK7VEEYMGjEp9h