# Qwen2.5-7B-Instruct SFT+GRPO — LOO Domain Knowledge (12 steps)
GRPO-trained from tinyllms/qwen2.5-7b-instruct-sft-loo-domain-knowledge (itself SFT'd from Qwen/Qwen2.5-7B-Instruct) using QLoRA (4-bit NF4 quantization + LoRA adapters).
This is a capped variant (max 12 steps) of the GRPO stage of a leave-one-out (LOO) experiment: the model is trained on Game24 and AIME trajectories, deliberately excluding domain knowledge (GPQA) data. The held-out domain is later used to measure cross-domain transfer.
## Training Configuration
- Algorithm: GRPO (Group Relative Policy Optimization)
- Learning rate: 5e-6
- Batch size: 2 per device
- Group size: 2 (completions per prompt)
- Max steps: 12
- Max completion length: 2048
- Precision: bf16
- Gradient checkpointing: enabled
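The settings above can be expressed as a TRL `GRPOConfig`. This is a hypothetical reconstruction — the field names follow TRL's API, but the actual training script is not published, so treat this as a sketch rather than the exact config used:

```python
# Sketch of the GRPO hyperparameters listed above, assuming TRL's GRPOConfig.
from trl import GRPOConfig

config = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=2,   # batch size per device
    num_generations=2,               # group size: completions per prompt
    max_steps=12,                    # capped run
    max_completion_length=2048,
    bf16=True,
    gradient_checkpointing=True,
)
```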
### QLoRA
- Quantization: 4-bit NF4 with double quantization
- LoRA rank: 64
- LoRA alpha: 128
- LoRA dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
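The QLoRA setup maps onto a bitsandbytes quantization config plus a PEFT LoRA config. Again a hedged sketch assembled from the values above, not the published training code:

```python
# Assumed QLoRA configuration: 4-bit NF4 base weights + LoRA adapters.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NF4
    bnb_4bit_use_double_quant=True,       # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```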
## Reward Functions
- correctness_reward: task-specific answer verification (binary 0/1)
- format_reward: checks for proper `<answer>` tag formatting, penalises excessive length
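The reward implementations are not published. A minimal TRL-style sketch of `format_reward`, assuming a plain `<answer>...</answer>` presence check and a fixed penalty for over-long completions (the regex, penalty size, and length threshold are illustrative assumptions):

```python
import re

# Hypothetical format reward in the TRL reward-function style:
# takes the sampled completions, returns one score per completion.
ANSWER_RE = re.compile(r"<answer>.*?</answer>", re.DOTALL)
MAX_LEN = 2048  # assumed to mirror max_completion_length

def format_reward(completions: list[str], **kwargs) -> list[float]:
    """1.0 if a well-formed <answer> tag is present, minus a length penalty."""
    scores = []
    for text in completions:
        score = 1.0 if ANSWER_RE.search(text) else 0.0
        if len(text) > MAX_LEN:   # penalise excessive length
            score -= 0.5
        scores.append(score)
    return scores
```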
## Datasets
Trained on two datasets (domain knowledge held out):
| Dataset | Domain |
|---|---|
| tinyllms/game24-trajectories | Game of 24 — arithmetic reasoning |
| tinyllms/aime-1983-2023-trajectories | AIME — competition math |
## Leave-One-Out Design
| Domain | Role |
|---|---|
| Game24 | Train |
| AIME | Train |
| Domain Knowledge (GPQA) | Held out |
Transfer is measured by evaluating on GPQA Diamond.
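Concretely, cross-domain transfer can be read as the change in held-out GPQA Diamond accuracy between the pre-GRPO checkpoint and this model. A trivial sketch of that comparison (the numbers in the test are placeholders, not reported results):

```python
# Transfer on the held-out domain = accuracy after GRPO minus accuracy before.
# Positive values mean the math-only training transferred to GPQA Diamond.
def transfer_delta(base_acc: float, trained_acc: float) -> float:
    """Absolute accuracy change on the held-out domain."""
    return trained_acc - base_acc
```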
## Infrastructure
- GPU: 2x NVIDIA H100 80GB (DDP)
- Framework: TRL + Ray Train (DDP)
- Tracking: Weights & Biases (project: pocket-sheet-grpo)
- Ray Job ID: raysubmit_YbfK7VEEYMGjEp9h
## Model Tree

Qwen/Qwen2.5-7B → Qwen/Qwen2.5-7B-Instruct (finetuned) → tinyllms/qwen2.5-7b-instruct-sft-loo-domain-knowledge (SFT) → tinyllms/qwen2.5-7b-instruct-grpo-loo-domain-knowledge (this model)