lllyx committed
Commit 141d1d0 · verified · 1 Parent(s): 10de68f

Update model card

Files changed (1)
  1. README.md +155 -45
README.md CHANGED
@@ -1,63 +1,158 @@
 ---
 license: other
-library_name: transformers
-pipeline_tag: text-generation
-base_model:
-- Qwen/Qwen3-4B-Base
-base_model_relation: finetune
 language:
 - en
 - zh
+library_name: transformers
+pipeline_tag: text-generation
 tags:
-- safetensors
-- qwen3
 - qwen
+- qwen3
+- math
 - grpo
 - reinforcement-learning
+- on-policy-distillation
+- full-finetuning
 - reasoning
-- conversational
-- math
+- safetensors
 - arxiv:2604.13016
+base_model: Qwen/Qwen3-4B-Base
+base_model_relation: finetune
 ---

 # Qwen3-4B-Base-GRPO

-This repository contains a Qwen3-4B-Base GRPO checkpoint for the collection
-[Rethinking OPD](https://huggingface.co/collections/lllyx/rethinking-opd).
-
-The model is provided in `safetensors` format and can be loaded with
-`transformers`.
-
-## Training Configuration
-
-This checkpoint was trained with the GRPO recipe used in the Rethinking OPD
-experiments.
-
-| Setting | Value |
-| --- | --- |
-| Actor initialization | `model/Qwen3-4B-Base` |
-| Reward/teacher model | `model/Qwen3-4B` |
-| Training data | `datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet` |
-| Validation data | `AIME25`, `AMC23`, `AIME24` |
-| Advantage estimator | `grpo` |
-| GRPO outcome weight | `1.0` |
-| Rollout correction | token-level IS, threshold `2.0` |
-| Prompt length | `1024` |
-| Response length | `7168` |
-| Validation response length | `31744` |
-| Max model length | `32768` |
-| Responses per prompt | `8` |
-| Rollout temperature | `1.0` |
-| Teacher temperature | `1.0` |
-| Repetition penalty | `1.0` |
-| PPO mini-batch size | `64` |
-| Learning rate | `1e-6` |
-| KL loss | disabled |
-| Format reward | disabled |
-| Loss aggregation | `token-mean` |
-| Rollout engine | `vllm` |
-| Tensor parallel size | `1` |
-| GPUs per node | `8` |
+Qwen3-4B-Base-GRPO is a full-parameter GRPO finetune of **Qwen3-4B-Base**, trained on the
+**DAPO-Math-17k-Processed** dataset for mathematical reasoning and problem-solving.
+
+This model accompanies the paper
+**Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe**
+(https://arxiv.org/abs/2604.13016).
+
+## Model Description
+
+This model was obtained by full-parameter GRPO training from `Qwen3-4B-Base`,
+with the goal of improving performance on math-focused reasoning tasks in the
+on-policy distillation setting.
+
+No learned reward model is used in this training run. Rewards are computed by a
+custom rule-based reward function for math evaluation:
+`verl/verl/utils/reward_score/ttrl_math/__init__.py`.
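+
+As a rough illustration (the actual implementation lives at the path above),
+a rule-based math reward of this kind typically extracts the final
+`\boxed{...}` answer from a rollout and compares it with the ground truth.
+The sketch below follows verl's custom-reward-function interface, but its
+parsing logic is a simplified, hypothetical stand-in for the repo's code:
+
+```python
+import re
+
+def reward_func(data_source, solution_str, ground_truth, extra_info=None):
+    """Simplified rule-based math reward (illustrative, not the repo's code):
+    1.0 if the last boxed answer equals the ground truth, else 0.0."""
+    boxed = re.findall(r"\\boxed\{([^{}]*)\}", solution_str)
+    if not boxed:  # no final answer found -> zero reward
+        return 0.0
+    return 1.0 if boxed[-1].strip() == str(ground_truth).strip() else 0.0
+```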
+
+### Key characteristics
+
+- **Base model**: Qwen3-4B-Base
+- **Training stage**: GRPO
+- **Finetuning type**: Full finetuning
+- **Primary domain**: Mathematical reasoning
+- **Reward model**: Not used (`reward_model.enable: false`)
+- **Custom reward function**: `reward_func`
+- **Rollout engine**: vLLM
+- **Context length**: 32768 tokens
+- **Responses per prompt**: 8
+
+## Training Details
+
+### Training configuration
+
+- **Framework**: verl
+- **Algorithm**: `grpo` (group-relative advantages; see the sketch below this list)
+- **GRPO outcome weight**: `1.0`
+- **Training dataset**: `DAPO-Math-17k-Processed`
+- **Training file**: `datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet`
+- **Validation datasets**: `AIME25`, `AMC23`, `AIME24`
+- **Prompt length**: `1024`
+- **Response length**: `7168`
+- **Validation response length**: `31744`
+- **Max model length**: `32768`
+- **Rollout temperature**: `1.0`
+- **Teacher temperature**: `1.0`
+- **Repetition penalty**: `1.0`
+- **Top-k log probability**: `0`
+- **Top-k strategy**: `union`
+- **Reward weight mode**: `student_p`
+- **KL loss**: disabled
+- **Format reward**: disabled
+- **Loss aggregation**: `token-mean`
+- **Learning rate**: `1e-6`
+- **PPO mini-batch size**: `64`
+- **PPO micro-batch size per GPU**: `1`
+- **Tensor parallel size**: `1`
+- **Number of GPUs**: `8`
+- **Number of epochs**: `1`
+- **Save frequency**: every `20` steps
+- **Test frequency**: every `20` steps
+- **Logging**: console and SwanLab
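+
+For intuition, GRPO estimates advantages from outcome rewards normalized
+within each group of responses sampled for the same prompt (8 per prompt
+here). A schematic sketch, not verl's implementation:
+
+```python
+import torch
+
+def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
+    """Schematic GRPO advantage: standardize outcome rewards within each
+    prompt's group of rollouts (shape: [num_prompts, responses_per_prompt])."""
+    mean = rewards.mean(dim=-1, keepdim=True)
+    std = rewards.std(dim=-1, keepdim=True)
+    return (rewards - mean) / (std + eps)
+
+# One prompt, 8 rollouts scored by the 0/1 rule-based math reward:
+print(grpo_advantages(torch.tensor([[1., 0., 0., 1., 1., 0., 1., 0.]])))
+```
+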
+### Dataset
+
+- **Training dataset**: `DAPO-Math-17k-Processed`
+- **Validation datasets**: `AIME25`, `AMC23`, `AIME24`
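+
+To inspect the training data, the parquet file can be read directly. The
+snippet below is a generic sketch; the column layout is not documented on
+this card, so check it before relying on specific fields:
+
+```python
+import pandas as pd
+
+# Path taken from the training configuration above.
+df = pd.read_parquet("datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet")
+print(len(df), df.columns.tolist())
+```
+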
+## Training Hyperparameters
+
+For reproducibility, the core configuration is summarized below:
+
+```bash
+ACTOR_MODEL_PATH=model/Qwen3-4B-Base
+ADV_ESTIMATOR=grpo
+GRPO_OUTCOME_WEIGHT=1.0
+
+TRAIN_DATASET=datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet
+TRAIN_DATASET_NAME=DAPO-Math-17k-Processed
+TEST_DATASET=[
+  datasets/test_data/AIME25/test.parquet,
+  datasets/test_data/AMC23/test.parquet,
+  datasets/test_data/AIME24/test.parquet
+]
+
+MAX_PROMPT_LENGTH=1024
+MAX_RESP_LENGTH=7168
+MAX_VAL_RESP_LENGTH=31744
+MAX_MODEL_LEN=32768
+
+MINI_BATCH_SIZE=64
+TEMPERATURE=1.0
+TEACHER_TEMPERATURE=1.0
+REPETITION_PENALTY=1.0
+N_RESPONSES=8
+
+LOG_PROB_TOP_K=0
+TOP_K_STRATEGY=union
+REWARD_WEIGHT_MODE=student_p
+
+USE_KL=False
+ENABLE_FORMAT_REWARD=False
+MODEL_DTYPE=fp32
+LOSS_AGG_MODE=token-mean
+
+actor_rollout_ref.actor.optim.lr=1e-6
+actor_rollout_ref.actor.ppo_mini_batch_size=64
+actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1
+actor_rollout_ref.actor.use_dynamic_bsz=True
+actor_rollout_ref.model.enable_gradient_checkpointing=True
+actor_rollout_ref.model.enable_activation_offload=True
+
+actor_rollout_ref.rollout.name=vllm
+actor_rollout_ref.rollout.tensor_model_parallel_size=1
+actor_rollout_ref.rollout.gpu_memory_utilization=0.8
+actor_rollout_ref.rollout.n=8
+actor_rollout_ref.rollout.val_kwargs.n=16
+actor_rollout_ref.rollout.val_kwargs.temperature=1.0
+actor_rollout_ref.rollout.val_kwargs.top_p=0.95
+
+reward_model.enable=False
+custom_reward_function.path=verl/verl/utils/reward_score/ttrl_math/__init__.py
+custom_reward_function.name=reward_func
+
+trainer.n_gpus_per_node=8
+trainer.nnodes=1
+trainer.total_epochs=1
+trainer.save_freq=20
+trainer.test_freq=20
+```
+
+## Usage

 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -71,3 +166,18 @@ model = AutoModelForCausalLM.from_pretrained(
     device_map="auto",
 )
 ```
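+
+Below is a minimal end-to-end generation example. The repo id is an assumption
+inferred from this card's title, and the prompt is illustrative; since the
+model is trained from a base model, a plain completion prompt is used:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "lllyx/Qwen3-4B-Base-GRPO"  # assumed repo id; adjust if it differs
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+
+prompt = "Solve: if 3x + 5 = 20, what is x? Put the final answer in \\boxed{}.\n"
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0)
+print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+```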
+
+## Citation
+
+If you use this model, please consider citing the related paper:
+
+```bibtex
+@article{li2026rethinking,
+  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
+  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
+  journal={arXiv preprint arXiv:2604.13016},
+  year={2026}
+}
+```
+
+Paper: https://arxiv.org/abs/2604.13016