---
license: other
language:
- en
- zh
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen
- qwen3
- math
- grpo
- verl
- rl
- reinforcement-learning
- on-policy-distillation
- full-parameter-rl
- reasoning
- safetensors
- arxiv:2604.13016
base_model: Qwen/Qwen3-4B-Base
base_model_relation: finetune
---

<h1 align="center">Qwen3-4B-Base-GRPO</h1>

<div align="center" style="line-height: 1;">
  <a href="https://arxiv.org/abs/2604.13016" style="margin: 2px;">
    <img alt="Paper" src="https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://github.com/thunlp/OPD" style="margin: 2px;">
    <img alt="Github" src="https://img.shields.io/badge/OPD-000000?style=for-the-badge&logo=github&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://huggingface.co/papers/2604.13016" style="margin: 2px;">
    <img alt="HF Papers" src="https://img.shields.io/badge/HF--Paper-%23FFD14D?style=for-the-badge&logo=huggingface&logoColor=black" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://x.com/HBX_hbx/status/2044464414829777354" style="margin: 2px;">
    <img alt="Twitter" src="https://img.shields.io/badge/Twitter-%23000000.svg?style=for-the-badge&logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

<br>

Qwen3-4B-Base-GRPO is a post-RL checkpoint trained with the **verl** framework.
It starts from **Qwen3-4B-Base** and applies GRPO (Group Relative Policy Optimization) to the **DAPO-Math-17k-Processed** dataset to strengthen mathematical reasoning and problem solving.

This model is associated with the paper:  
**Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe**  
Paper link: https://arxiv.org/abs/2604.13016

## Model Description

This model is obtained by applying GRPO reinforcement learning to `Qwen3-4B-Base` with verl. Training is intended to improve math-focused reasoning performance in the on-policy distillation setting studied in the paper.
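GRPO needs no learned critic or reward model: for each prompt it samples a group of responses (8 per prompt in this run), scores each one with the rule-based reward, and normalizes rewards within the group to obtain per-response advantages. Below is a minimal sketch of that computation and of `token-mean` loss aggregation, assuming standard z-score normalization with a small `eps` for numerical stability; verl's actual implementation differs in detail.

```python
# Illustrative sketch only; not verl's implementation. `eps` is an
# assumed stability constant.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) outcome rewards, one scalar per
    sampled response. Returns group-normalized advantages, same shape."""
    mean = rewards.mean(dim=-1, keepdim=True)   # per-prompt group mean
    std = rewards.std(dim=-1, keepdim=True)     # per-prompt group std
    return (rewards - mean) / (std + eps)       # z-score within each group

def token_mean_loss(per_token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """`token-mean` aggregation: average over every non-padding token in
    the batch, rather than averaging per sequence first."""
    return (per_token_loss * mask).sum() / mask.sum()

# One prompt, 8 sampled responses, binary correctness rewards:
print(grpo_advantages(torch.tensor([[1., 0., 0., 1., 0., 0., 0., 1.]])))
```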


### Key characteristics

- **Base model**: Qwen3-4B-Base
- **Training framework**: verl
- **Training stage**: Reinforcement Learning (GRPO)
- **Parameter update**: Full-parameter actor update
- **Primary domain**: Mathematical reasoning
- **Reward model**: Not used (`reward_model.enable: false`)
- **Rollout engine**: vLLM
- **Context length**: 32768 tokens
- **Responses per prompt**: 8

## Training Details

### Training configuration

- **Framework**: verl
- **Algorithm**: `grpo`
- **GRPO outcome weight**: `1.0`
- **Learned reward model**: disabled (`reward_model.enable: false`)
- **Reward source**: custom rule-based math reward function (a hypothetical sketch follows this list)
- **Training dataset**: `DAPO-Math-17k-Processed`
- **Training file**: `datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet`
- **Validation datasets**: `AIME25`, `AMC23`, `AIME24`
- **Prompt length**: `1024`
- **Response length**: `7168`
- **Validation response length**: `31744`
- **Max model length**: `32768`
- **Rollout temperature**: `1.0`
- **Repetition penalty**: `1.0`
- **KL loss**: disabled
- **Format reward**: disabled
- **Loss aggregation**: `token-mean`
- **Learning rate**: `1e-6`
- **PPO mini-batch size**: `64`
- **PPO micro-batch size per GPU**: `1`
- **Tensor parallel size**: `1`
- **Number of GPUs**: `8`
- **Number of epochs**: `1`
- **Save frequency**: every `20` steps
- **Test frequency**: every `20` steps
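
The custom reward function itself is not included in this card, so the following is a hypothetical sketch of the common recipe for rule-based math rewards: extract the final `\boxed{...}` answer from the response and compare it against the ground truth, yielding a binary outcome reward. Real checkers (and verl's reward hooks) typically add answer normalization or symbolic equivalence tests.

```python
# Hypothetical rule-based math reward; the run's actual function may
# normalize answers differently or use a symbolic equivalence check.
import re

def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in `text`, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def math_reward(response: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the boxed answer matches, else 0.0."""
    pred = extract_boxed(response)
    return float(pred is not None and pred == ground_truth.strip())

print(math_reward(r"... so the answer is \boxed{42}.", "42"))  # 1.0
print(math_reward("no boxed answer here", "42"))               # 0.0
```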

### Dataset

- **Training dataset**: `DAPO-Math-17k-Processed`
- **Validation datasets**: `AIME25`, `AMC23`, `AIME24`

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lllyx/Qwen3-4B-Base-GRPO"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
```
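
Since the checkpoint is trained from a base (non-chat) model, plain completion-style prompting is the safest default. The prompt template below is an illustrative assumption, not the exact format used during training:

```python
# Illustrative prompt format; the exact training template is not published here.
prompt = (
    "Solve the following problem and put the final answer in \\boxed{}.\n"
    "Problem: What is 17 * 24?\nSolution:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,  # matches the rollout temperature used in training
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```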

## Citation

If you use this model, please consider citing the related paper:

```bibtex
@article{li2026rethinking,
  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
  journal={arXiv preprint arXiv:2604.13016},
  year={2026}
}
```