lllyx committed on
Commit 73067d4 · verified · 1 Parent(s): 141d1d0

Clarify verl RL model card

Files changed (1)
  1. README.md +20 -11
README.md CHANGED
@@ -10,9 +10,11 @@ tags:
  - qwen3
  - math
  - grpo
+ - verl
+ - rl
  - reinforcement-learning
  - on-policy-distillation
- - full-finetuning
+ - full-parameter-rl
  - reasoning
  - safetensors
  - arxiv:2604.13016
@@ -22,7 +24,8 @@ base_model_relation: finetune

  # Qwen3-4B-Base-GRPO

- Qwen3-4B-Base-GRPO is a GRPO-trained model based on **Qwen3-4B-Base**, trained on the
+ Qwen3-4B-Base-GRPO is a post-RL checkpoint trained with the **verl** framework.
+ It starts from **Qwen3-4B-Base** and applies GRPO on the
  **DAPO-Math-17k-Processed** dataset for mathematical reasoning and problem-solving.

  This model is associated with the paper:
@@ -31,19 +34,22 @@ Paper link: https://arxiv.org/abs/2604.13016

  ## Model Description

- This model is obtained by full-parameter GRPO training from `Qwen3-4B-Base`.
- The training is designed to improve the model's performance on math-focused
- reasoning tasks under the on-policy distillation setting.
+ This model is obtained by applying GRPO reinforcement learning to
+ `Qwen3-4B-Base` with verl. The training updates the actor model parameters and is
+ intended to improve math-focused reasoning performance under the on-policy
+ distillation setting.

- No learned reward model is used in this training run. Rewards are computed by a
- custom rule-based reward function for math evaluation:
+ No learned reward model is used in this training run. In particular,
+ `reward_model.enable` is set to `false`; rewards are computed by a custom
+ rule-based reward function for math evaluation:
  `verl/verl/utils/reward_score/ttrl_math/__init__.py`.

  ### Key characteristics

  - **Base model**: Qwen3-4B-Base
- - **Training stage**: GRPO
- - **Finetuning type**: Full finetuning
+ - **Training framework**: verl
+ - **Training stage**: Reinforcement Learning (GRPO)
+ - **Parameter update**: Full-parameter actor update
  - **Primary domain**: Mathematical reasoning
  - **Reward model**: Not used (`reward_model.enable: false`)
  - **Custom reward function**: `reward_func`
@@ -58,6 +64,8 @@ custom rule-based reward function for math evaluation:
  - **Framework**: verl
  - **Algorithm**: `grpo`
  - **GRPO outcome weight**: `1.0`
+ - **Learned reward model**: disabled (`reward_model.enable: false`)
+ - **Reward source**: custom rule-based math reward function
  - **Training dataset**: `DAPO-Math-17k-Processed`
  - **Training file**: `datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet`
  - **Validation datasets**: `AIME25`, `AMC23`, `AIME24`
@@ -66,11 +74,9 @@ custom rule-based reward function for math evaluation:
  - **Validation response length**: `31744`
  - **Max model length**: `32768`
  - **Rollout temperature**: `1.0`
- - **Teacher temperature**: `1.0`
  - **Repetition penalty**: `1.0`
  - **Top-k log probability**: `0`
  - **Top-k strategy**: `union`
- - **Reward weight mode**: `student_p`
  - **KL loss**: disabled
  - **Format reward**: disabled
  - **Loss aggregation**: `token-mean`
@@ -113,6 +119,9 @@ MAX_MODEL_LEN=32768

  MINI_BATCH_SIZE=64
  TEMPERATURE=1.0
+
+ # TEACHER_TEMPERATURE and REWARD_WEIGHT_MODE are rollout/logit-control settings
+ # from the training script. They do not indicate the use of a learned reward model.
  TEACHER_TEMPERATURE=1.0
  REPETITION_PENALTY=1.0
  N_RESPONSES=8
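
The model card text in this diff names the rule-based reward (`verl/verl/utils/reward_score/ttrl_math/__init__.py`, exposed as `reward_func`) but does not reproduce it. For orientation only, here is a minimal Python sketch of what a rule-based math outcome reward typically looks like: extract a final answer from the rollout and exact-match it against the reference. The `\boxed{...}` answer format, the simplified signature, and the exact-match comparison are all assumptions, not the actual `ttrl_math` logic.

```python
# Hypothetical sketch of a rule-based math outcome reward. The real function
# lives in verl/verl/utils/reward_score/ttrl_math/__init__.py and may differ;
# the \boxed{...} extraction and exact-match scoring here are assumptions,
# and the signature is simplified for illustration.
import re

def extract_boxed_answer(text: str):
    """Return the contents of the last \\boxed{...} span, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def reward_func(solution_str: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the extracted final answer matches
    the reference after whitespace normalization, else 0.0."""
    predicted = extract_boxed_answer(solution_str)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == ground_truth.strip() else 0.0

# Example: a correct rollout earns reward 1.0.
print(reward_func(r"... so the answer is \boxed{42}.", "42"))  # 1.0
```

A binary 0/1 outcome reward like this pairs naturally with GRPO's group normalization, sketched next.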
 
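On the `grpo` side: GRPO samples a group of responses per prompt (`N_RESPONSES=8` in the script above), scores each with the outcome reward, and normalizes each score against its own group's mean and standard deviation; with `token-mean` aggregation, the loss is then averaged over all valid response tokens. The sketch below follows the standard GRPO formulation and is not verl's exact code.

```python
# Illustrative sketch of GRPO's group-relative advantage and token-mean loss
# aggregation, per the standard GRPO formulation; verl's implementation
# details may differ. Shapes and names are for illustration only.
import torch

def grpo_outcome_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar outcome rewards, e.g. the
    0/1 math scores. Each response is normalized against its own group:
    A_i = (r_i - mean(group)) / (std(group) + eps)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def token_mean_loss(per_token_loss: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """'token-mean' aggregation: average the per-token policy loss over all
    valid response tokens in the batch, not per sequence first."""
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1.0)

# Example: one prompt, a group of 8 responses, 3 of them correct.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]])
print(grpo_outcome_advantages(rewards))  # positive for correct, negative otherwise
```

With the KL loss disabled, these group-relative advantages drive the policy update against the rule-based reward alone.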