lllyx committed · verified
Commit 513e04f · Parent(s): 73067d4

Update model card

Files changed (1):
  1. README.md +3 -83

README.md CHANGED
@@ -25,8 +25,7 @@ base_model_relation: finetune
 # Qwen3-4B-Base-GRPO
 
 Qwen3-4B-Base-GRPO is a post-RL checkpoint trained with the **verl** framework.
-It starts from **Qwen3-4B-Base** and applies GRPO on the
-**DAPO-Math-17k-Processed** dataset for mathematical reasoning and problem-solving.
+It starts from **Qwen3-4B-Base** and applies GRPO on the **DAPO-Math-17k-Processed** dataset for mathematical reasoning and problem-solving.
 
 This model is associated with the paper:
 **Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe**
@@ -34,15 +33,8 @@ Paper link: https://arxiv.org/abs/2604.13016
 
 ## Model Description
 
-This model is obtained by applying GRPO reinforcement learning to
-`Qwen3-4B-Base` with verl. The training updates the actor model parameters and is
-intended to improve math-focused reasoning performance under the on-policy
-distillation setting.
+This model is obtained by applying GRPO reinforcement learning to `Qwen3-4B-Base` with verl. The training is intended to improve math-focused reasoning performance under the on-policy distillation setting.
 
-No learned reward model is used in this training run. In particular,
-`reward_model.enable` is set to `false`; rewards are computed by a custom
-rule-based reward function for math evaluation:
-`verl/verl/utils/reward_score/ttrl_math/__init__.py`.
 
 ### Key characteristics
 
@@ -52,7 +44,6 @@ rule-based reward function for math evaluation:
 - **Parameter update**: Full-parameter actor update
 - **Primary domain**: Mathematical reasoning
 - **Reward model**: Not used (`reward_model.enable: false`)
-- **Custom reward function**: `reward_func`
 - **Rollout engine**: vLLM
 - **Context length**: 32768 tokens
 - **Responses per prompt**: 8
@@ -75,8 +66,6 @@ rule-based reward function for math evaluation:
 - **Max model length**: `32768`
 - **Rollout temperature**: `1.0`
 - **Repetition penalty**: `1.0`
-- **Top-k log probability**: `0`
-- **Top-k strategy**: `union`
 - **KL loss**: disabled
 - **Format reward**: disabled
 - **Loss aggregation**: `token-mean`
@@ -88,79 +77,12 @@ rule-based reward function for math evaluation:
 - **Number of epochs**: `1`
 - **Save frequency**: every `20` steps
 - **Test frequency**: every `20` steps
-- **Logging**: console and SwanLab
 
 ### Dataset
 
 - **Training dataset**: `DAPO-Math-17k-Processed`
 - **Validation datasets**: `AIME25`, `AMC23`, `AIME24`
 
-## Training Hyperparameters
-
-For reproducibility, the core configuration is summarized below:
-
-```bash
-ACTOR_MODEL_PATH=model/Qwen3-4B-Base
-ADV_ESTIMATOR=grpo
-GRPO_OUTCOME_WEIGHT=1.0
-
-TRAIN_DATASET=datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet
-TRAIN_DATASET_NAME=DAPO-Math-17k-Processed
-TEST_DATASET=[
-  datasets/test_data/AIME25/test.parquet,
-  datasets/test_data/AMC23/test.parquet,
-  datasets/test_data/AIME24/test.parquet
-]
-
-MAX_PROMPT_LENGTH=1024
-MAX_RESP_LENGTH=7168
-MAX_VAL_RESP_LENGTH=31744
-MAX_MODEL_LEN=32768
-
-MINI_BATCH_SIZE=64
-TEMPERATURE=1.0
-
-# TEACHER_TEMPERATURE and REWARD_WEIGHT_MODE are rollout/logit-control settings
-# from the training script. They do not indicate the use of a learned reward model.
-TEACHER_TEMPERATURE=1.0
-REPETITION_PENALTY=1.0
-N_RESPONSES=8
-
-LOG_PROB_TOP_K=0
-TOP_K_STRATEGY=union
-REWARD_WEIGHT_MODE=student_p
-
-USE_KL=False
-ENABLE_FORMAT_REWARD=False
-MODEL_DTYPE=fp32
-LOSS_AGG_MODE=token-mean
-
-actor_rollout_ref.actor.optim.lr=1e-6
-actor_rollout_ref.actor.ppo_mini_batch_size=64
-actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1
-actor_rollout_ref.actor.use_dynamic_bsz=True
-actor_rollout_ref.model.enable_gradient_checkpointing=True
-actor_rollout_ref.model.enable_activation_offload=True
-
-actor_rollout_ref.rollout.name=vllm
-actor_rollout_ref.rollout.tensor_model_parallel_size=1
-actor_rollout_ref.rollout.gpu_memory_utilization=0.8
-actor_rollout_ref.rollout.n=8
-actor_rollout_ref.rollout.val_kwargs.n=16
-actor_rollout_ref.rollout.val_kwargs.temperature=1.0
-actor_rollout_ref.rollout.val_kwargs.top_p=0.95
-
-reward_model.enable=False
-custom_reward_function.path=verl/verl/utils/reward_score/ttrl_math/__init__.py
-custom_reward_function.name=reward_func
-
-trainer.n_gpus_per_node=8
-trainer.nnodes=1
-trainer.total_epochs=1
-trainer.save_freq=20
-trainer.test_freq=20
-```
-
 ## Usage
 
 ```python
@@ -187,6 +109,4 @@ If you use this model, please consider citing the related paper:
   journal={arXiv preprint arXiv:2604.13016},
   year={2026}
 }
-```
-
-Paper: https://arxiv.org/abs/2604.13016
+```
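
A note on `ADV_ESTIMATOR=grpo` in the removed hyperparameter block: GRPO forms advantages by standardizing each response's scalar outcome reward against the other responses sampled for the same prompt (`N_RESPONSES=8` here), with no learned critic. A minimal sketch of that group-relative computation, using illustrative names rather than verl internals:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages for one prompt.

    rewards: shape (n_responses,), one scalar outcome reward per sampled
    response. Each reward is standardized against its own group, so a
    response is only compared with siblings generated from the same prompt.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 rollouts of one prompt with rule-based 0/1 correctness rewards.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # correct responses receive positive advantage
```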
 
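The removed description also pointed at the rule-based reward in `verl/verl/utils/reward_score/ttrl_math/__init__.py` (`custom_reward_function.name=reward_func`). That file is not part of this diff, so the sketch below only illustrates the general shape of a rule-based math reward under verl's custom-reward-function convention; `extract_boxed` and the exact-match rule are simplifying assumptions, not the ttrl_math implementation:

```python
import re

def extract_boxed(solution_str: str) -> str | None:
    """Pull the last \\boxed{...} answer out of a response (simplified)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution_str)
    return matches[-1].strip() if matches else None

def reward_func(data_source, solution_str, ground_truth, extra_info=None) -> float:
    """Rule-based outcome reward: 1.0 if the final answer matches, else 0.0.

    The real ttrl_math scorer is presumably more robust (answer
    normalization, symbolic equivalence); exact string match is only
    an illustration of the rule-based, reward-model-free setup.
    """
    answer = extract_boxed(solution_str)
    if answer is None:
        return 0.0
    return 1.0 if answer == str(ground_truth).strip() else 0.0
```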
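
`LOSS_AGG_MODE=token-mean` (also listed as **Loss aggregation** in the card) means the policy loss is averaged over all valid response tokens in the batch, rather than averaged within each sequence first. A small illustration of the difference; this is not verl code:

```python
import torch

def token_mean(loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average per-token losses over every valid token in the batch.

    loss, mask: shape (batch, seq_len); mask is 1.0 on response tokens,
    0.0 on padding. Longer responses contribute proportionally more tokens.
    """
    return (loss * mask).sum() / mask.sum().clamp(min=1)

def seq_mean(loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Contrast: average within each sequence first, then across sequences,
    so every response counts equally regardless of its length."""
    per_seq = (loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_seq.mean()
```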
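
The body of the `## Usage` block falls outside the hunk context above, so its exact contents are not shown in this diff. For reference, a standard Transformers loading snippet for a causal-LM checkpoint like this one would look roughly as follows; the repo namespace and the generation settings are placeholders, not taken from the card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ORG/Qwen3-4B-Base-GRPO"  # replace ORG with the repo's actual namespace

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "If 3x + 7 = 22, what is x? Put the final answer in \\boxed{}."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```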