# qwen3-4B-refiner-rl-balanced-step50
This model is a GRPO-trained checkpoint (step 50) of the Qwen3-4B refiner, fine-tuned with reinforcement learning on the refiner RL dataset.
## Training Details

- Base model: `lihaoxin2020/qwen3-4B-refiner-sft-step-3201`
- Training method: GRPO (Group Relative Policy Optimization) with DeepSpeed Stage 3
- Refiner mode: `answer_only`
- Training script: `open_instruct/grpo_fast_refiner_sft.py`
## Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 5e-6 |
| LR scheduler | constant |
| Beta (KL penalty) | 0.001 |
| KL estimator | kl3 |
| Advantage normalization | standard |
| Samples per prompt (rollout) | 8 |
| Unique prompts per rollout | 32 |
| Mini-batches | 1 |
| Epochs per batch | 1 |
| Per-device train batch size | 1 |
| Temperature | 1.0 |
| Seed | 42 |
| Async mode | true |
| Adam offload | true |
| vLLM sync backend | nccl |
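The two GRPO-specific settings above can be illustrated in isolation. With 8 samples per prompt and `standard` advantage normalization, each rollout's advantage is its reward normalized by the mean and standard deviation of its own group; `kl3` commonly refers to the k3 estimator (exp of the log-ratio, minus one, minus the log-ratio), which is non-negative and low-variance. This is a minimal sketch, not the `open_instruct` implementation, and the function names are my own:

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Standard normalization of rewards within one group of
    rollouts sampled from the same prompt (8 per prompt here)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def kl3(logprob, ref_logprob):
    """k3 KL estimator: exp(log_ratio) - 1 - log_ratio, where
    log_ratio = log pi_ref(x) - log pi(x). Always >= 0."""
    log_ratio = ref_logprob - logprob
    return math.exp(log_ratio) - 1.0 - log_ratio

# One group of 8 rollouts with a 10.0/0.0 verification reward:
advs = group_advantages([10.0, 0.0, 10.0, 10.0, 0.0, 0.0, 10.0, 10.0])
```

Because the advantage is relative to the group, a prompt where every rollout succeeds (or every rollout fails) contributes zero advantage and therefore no gradient signal.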
## Sequence Lengths

| Parameter | Value |
|---|---|
| Max token length | 8192 |
| Max prompt token length | 6144 |
| Response length | 1024 |
| Pack length | 8192 |
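These budgets are consistent: the worst-case prompt (6144 tokens) plus the full response budget (1024 tokens) fits inside both the max token length and the pack length (8192). A minimal sketch of the budget check and a first-fit packing scheme, assuming pack length is a fixed token bucket (illustrative only, not the training script's packer):

```python
MAX_PROMPT_LEN = 6144
RESPONSE_LEN = 1024
MAX_TOKEN_LEN = 8192
PACK_LEN = 8192

# Worst-case prompt + response still fits in one packed sequence.
assert MAX_PROMPT_LEN + RESPONSE_LEN <= MAX_TOKEN_LEN <= PACK_LEN

def pack_greedy(lengths, pack_len=PACK_LEN):
    """First-fit packing of variable-length examples into
    buckets of at most pack_len tokens."""
    packs = []  # each pack is a list of example lengths
    for n in lengths:
        for p in packs:
            if sum(p) + n <= pack_len:
                p.append(n)
                break
        else:
            packs.append([n])
    return packs

packs = pack_greedy([7000, 1000, 4000, 4000, 500])
# -> [[7000, 1000], [4000, 4000], [500]]
```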
## Reward Configuration

| Parameter | Value |
|---|---|
| Verification reward | 10.0 |
| Non-stop penalty | false |
| Gate judge score with format bonus | false |
| Apply paper citation reward | true |
| Paper citation weight | 0.5 |
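Under this configuration, a rollout's reward combines a 10.0 verification reward with a paper-citation term scaled by 0.5, while the non-stop penalty and format-bonus gating are disabled. A sketch under the assumption that the citation score is a scalar in [0, 1]; the function and argument names are mine, not the training script's:

```python
VERIFICATION_REWARD = 10.0   # granted when the answer verifies
CITATION_WEIGHT = 0.5        # weight on the paper-citation score

def total_reward(is_verified, citation_score):
    """Combine the configured reward terms (illustrative sketch).
    Non-stop penalty is disabled in this run, so truncated
    generations are not penalized here."""
    reward = VERIFICATION_REWARD if is_verified else 0.0
    reward += CITATION_WEIGHT * citation_score
    return reward

# Verified answer with full citation credit:
# total_reward(True, 1.0) == 10.5
```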
## Dataset

- Training: `lihaoxin2020/refiner_rl` (split: `train`)
- Evaluation: `lihaoxin2020/refiner_rl` (16 samples, split: `test`)
## Infrastructure

- DeepSpeed stage: 3
- Learners per node: 1
- vLLM engines: 1
- vLLM tensor parallel size: 1
- vLLM GPU memory utilization: 0.90
- Judge model: `Qwen/Qwen3.5-35B-A3B` (via vLLM)